Extended rough set model based on modified data-driven valued tolerance relation

Abstract

Classical rough set theory is based on the conventional indiscernibility relation and is therefore not suitable for analyzing incomplete information. Several successful extended rough set models based on different non-equivalence relations have been proposed, and the data-driven valued tolerance relation is one such non-equivalence relation. However, when predicting the unknown attribute value of an object, it approximates the probability of a value by its frequency, without considering the effects of the object's other known attribute values on the prediction. In this paper, a modified data-driven valued tolerance relation (MDVT) is defined, which considers both the frequency of the known attribute values and their influence weight when predicting unknown attribute values. On this basis, an extended rough set model based on the MDVT is proposed, and some properties of the new model are analyzed. Experimental results show that the MDVT achieves better classification results than other generalized indiscernibility relations.

Keywords

rough set; incomplete information system; valued tolerance relation; influence weight

1 Introduction

Classical rough set theory, developed by Pawlak in 1982 [1], is a powerful soft computing tool for handling imprecise, uncertain, and vague information, and has been extensively and successfully applied in machine learning, data mining and decision analysis [2–8]. Because it is based on the conventional indiscernibility relation, classical rough set theory is not suitable for analyzing incomplete information systems (IIS), in which some attribute values are unknown or missing. In practice, however, IIS often occur in knowledge acquisition because of data measurement errors, limitations in data acquisition, human factors, and so on. The semantics of missing attribute values in IIS has been analyzed by many researchers. Generally, there are four kinds of missing attribute values: "lost values" (values that were recorded but are currently unavailable), "do not care" conditions (the original values were irrelevant), restricted "do not care" conditions (similar to ordinary "do not care" conditions but interpreted differently; these may occur when the same data set contains both "lost values" and "do not care" conditions), and "attribute-concept values" (missing attribute values that may be replaced by any attribute value limited to the same concept) [9–15]. Many researchers have therefore paid attention to IIS in recent years and sought solutions. Rough set theory usually deals with IIS in one of two ways: data reparation [16–20] and model extension [21–27]. Unlike data reparation, model extension does not change the original information of the IIS.
At present, various extensions of the classical rough set model have been proposed, in which the indiscernibility relation is extended to non-equivalence relations (also called generalized indiscernibility relations) such as the tolerance relation [9], the non-symmetric similarity relation [10], the limited tolerance relation [11], the valued tolerance relation [12] and the characteristic relation [13]. The generalized indiscernibility relation is a core concept of rough sets in incomplete information systems. Attribute reduction and rule extraction are both based on it, so its classification performance affects the results of attribute reduction and rule extraction. However, the existing generalized indiscernibility relations have their own limitations, so it is important to propose a new generalized indiscernibility relation that achieves better classification results.

The tolerance relation is too relaxed: two objects that share no known attribute value may be considered indiscernible and placed in the same class. The non-symmetric similarity relation is too strict: two objects that share many known attribute values may be separated and placed in different classes. In the limited tolerance relation, two objects may be considered similar when they share only one known attribute value. This condition is too lenient for an IIS with a large number of attributes. The characteristic relation, which generalizes both the tolerance relation and the non-symmetric similarity relation, cannot avoid their limitations. The valued tolerance relation requires the prior probability distribution of the IIS to be known in advance. Unfortunately, this is very difficult for a new IIS with missing attribute values, so a uniform distribution among the attribute values is usually assumed. This assumption is subjective and often unreasonable. In addition, the threshold of the tolerance degree is usually selected on the basis of prior domain knowledge, and it is difficult to select a suitable threshold for different IIS.

Based on the idea of data-driven data mining [28], the data-driven valued tolerance relation was proposed by Wang [29]. The frequency of an attribute value is approximately regarded as the probability of appearance of this value, from which an approximate probability distribution among the attribute values can be derived. This method of calculating the tolerance degree does not require any prior knowledge except the data set. An auto-selection algorithm for the tolerance degree threshold is also proposed, which likewise requires no prior domain knowledge except the data set. The data-driven valued tolerance relation can thus resolve the problems of the valued tolerance relation. However, when predicting the probability of an unknown attribute value of an object, it approximates the probability of a value by its frequency, without considering the effects of the object's other known attribute values on the prediction. To remedy this defect, this paper defines a modified data-driven valued tolerance relation (MDVT), which considers both the frequency of the known attribute values and their influence weight when predicting unknown attribute values, and which can make full use of the prior knowledge of the information system. On this basis, an extended rough set model based on the MDVT is proposed, and some properties of the new model are analyzed. Experimental results show that the MDVT achieves better classification results than other generalized indiscernibility relations.

The rest of the paper is organized as follows. In Section 2, we review several typical generalized indiscernibility relations as the foundation for the subsequent discussion. In Section 3, the modified data-driven valued tolerance relation is defined. In Section 4, the extended rough set model based on the modified data-driven valued tolerance relation is proposed and some of its properties are analyzed. Simulation experiments are discussed in Section 5. Finally, Section 6 concludes the paper.

2 Typical generalized indiscernibility relations

In this section, we review relevant definitions of tolerance relation, non-symmetric similarity relation, limited tolerance relation, and data-driven valued tolerance relation as our foundations for the subsequent discussion.

Formally, an information system (IS) is a quadruple (U, AT, V, f), where U is a non-empty finite set of objects, called the universe; AT is a non-empty finite set of attributes; V is the union of attribute domains, $V = \bigcup_{a \in AT} V_{a}$, where V_a is the value set of attribute a; and f : U × AT → V is an information function that assigns values from the attribute domains to objects, such that ∀a ∈ AT, x ∈ U, f (x, a) ∈ V_a, where f (x, a) denotes the value of attribute a for object x. If some attribute values are missing, the IS is an incomplete information system (IIS); otherwise it is a complete information system (CIS). In an IIS, a missing attribute value is denoted by "*".

2.1 Tolerance relation

Definition 1. For IIS (U, AT, V, f) where all missing attribute values are "do not care" conditions, the tolerance relation T_B introduced by Kryszkiewicz [9] is defined as: $T_{B} = \{(x, y) \in U \times U \mid \forall b \in B,\ b(x) = b(y) \lor b(x) = * \lor b(y) = *\}$ (1)

where B is a subset of attribute set AT. Clearly T_B is a reflexive and symmetric relation, but not necessarily transitive. The tolerance class for any object x ∈ U is defined as: $T_{B}(x) = \{y \in U \mid (x, y) \in T_{B}\}$ (2)
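As an illustration, Eqs. (1) and (2) can be sketched in Python. This is a minimal sketch, not the implementation used in the paper: the table `toy`, the marker `MISSING`, the function names, and the encoding of attributes as column indices are all assumptions made for the example.

```python
MISSING = "*"  # marker for a missing ("do not care") value

def tolerant(x, y, attrs):
    """Eq. (1): rows x and y agree on every attribute, with '*' matching anything."""
    return all(x[b] == y[b] or x[b] == MISSING or y[b] == MISSING for b in attrs)

def tolerance_class(table, name, attrs):
    """Eq. (2): all objects that are tolerant with the object called `name`."""
    return {other for other, row in table.items() if tolerant(table[name], row, attrs)}

# Hypothetical three-object table with three attributes (columns 0, 1, 2).
toy = {
    "x1": (1, 2, MISSING),
    "x2": (MISSING, 2, 3),
    "x3": (2, 2, 3),
}
B = (0, 1, 2)

classes = {name: tolerance_class(toy, name, B) for name in toy}
for name in sorted(classes):
    print(name, sorted(classes[name]))
```

In this toy table x1 and x2 are tolerant and x2 and x3 are tolerant, but x1 and x3 are not, which shows concretely why T_B need not be transitive.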

2.2 Non-symmetric similarity relation

Definition 2. For IIS (U, AT, V, f) where all missing attribute values are "lost values", the non-symmetric similarity relation S_B proposed by Stefanowski and Tsoukiàs [10] is defined as: $S_{B} = \{(x, y) \in U \times U \mid \forall b \in B,\ b(x) = b(y) \lor b(x) = *\}$ (3)

where B is a subset of attribute set AT. Clearly S_B is a reflexive and transitive relation, but not necessarily symmetric. Two similarity classes for any object x ∈ U are defined as: $S_{B}(x) = \{y \in U \mid (y, x) \in S_{B}\}$ (4) $S_{B}^{-1}(x) = \{y \in U \mid (x, y) \in S_{B}\}$ (5)

Obviously predecessor set S_B (x) of x is the set of objects similar to x, and successor set $S_{B}^{- 1} (x)$ of x is the set of objects to which x is similar. So, they are two different sets.
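The difference between the two similarity classes can be seen in a small Python sketch of Eqs. (3) to (5). The table `toy` and all function names are hypothetical and chosen only for illustration; attributes are again encoded as column indices.

```python
MISSING = "*"  # marker for a "lost" value

def similar(x, y, attrs):
    """Eq. (3): (x, y) is in S_B if every attribute of x is missing or equals y's value."""
    return all(x[b] == MISSING or x[b] == y[b] for b in attrs)

def predecessors(table, name, attrs):
    """Eq. (4): S_B(x) = {y | (y, x) in S_B}."""
    return {other for other, row in table.items() if similar(row, table[name], attrs)}

def successors(table, name, attrs):
    """Eq. (5): S_B^{-1}(x) = {y | (x, y) in S_B}."""
    return {other for other, row in table.items() if similar(table[name], row, attrs)}

# Hypothetical three-object table with three attributes (columns 0, 1, 2).
toy = {
    "x1": (1, 2, MISSING),
    "x2": (MISSING, 2, 3),
    "x3": (2, 2, 3),
}
B = (0, 1, 2)

print(sorted(predecessors(toy, "x3", B)))  # ['x2', 'x3']
print(sorted(successors(toy, "x3", B)))    # ['x3']
```

Here (x2, x3) belongs to S_B but (x3, x2) does not, so the two sets differ for x3.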

2.3 Limited tolerance relation

Definition 3. For IIS (U, AT, V, f) and attribute set B ⊆ AT, limited tolerance relation L_B proposed by Wang [11] is defined as:

$L_{B} = \{(x, y) \in U \times U \mid \forall b \in B\ (b(x) = b(y) = *) \lor ((Q_{B}(x) \cap Q_{B}(y) \neq \emptyset) \land \forall b \in B\ ((b(x) \neq * \land b(y) \neq *) \to b(x) = b(y)))\}$ (6)

where Q_B (x) ={ b ∈ B|b (x) ≠ * }.

Obviously, L_B is a reflexive and symmetric relation, but not necessarily transitive. The limited tolerance class for any object x ∈ U is defined as: $L_{B}(x) = \{y \in U \mid (x, y) \in L_{B}\}$ (7)
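A minimal Python sketch of Eqs. (6) and (7) follows. The helper names and the table `toy` are hypothetical; the sketch only illustrates how the requirement Q_B(x) ∩ Q_B(y) ≠ ∅ changes the outcome compared with the plain tolerance relation.

```python
MISSING = "*"

def known(x, attrs):
    """Q_B(x): the attributes in attrs on which row x has a known value."""
    return {b for b in attrs if x[b] != MISSING}

def limited_tolerant(x, y, attrs):
    """Eq. (6) for two rows x and y."""
    if all(x[b] == MISSING and y[b] == MISSING for b in attrs):
        return True
    shared = known(x, attrs) & known(y, attrs)
    # At least one commonly known attribute, and agreement on all of them.
    return bool(shared) and all(x[b] == y[b] for b in shared)

def limited_tolerance_class(table, name, attrs):
    """Eq. (7)."""
    return {o for o, row in table.items() if limited_tolerant(table[name], row, attrs)}

# Hypothetical table: x1 and x2 have no commonly known attribute.
toy = {
    "x1": (1, MISSING, MISSING),
    "x2": (MISSING, 2, MISSING),
    "x3": (1, 2, MISSING),
}
B = (0, 1, 2)

for name in sorted(toy):
    print(name, sorted(limited_tolerance_class(toy, name, B)))
```

Under the tolerance relation of Definition 1, x1 and x2 would be indiscernible because every attribute is missing in at least one of them; under L_B they are separated because they share no known attribute.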

2.4 Data-driven valued tolerance relation

In [28], Wang indicated that the frequency of an attribute value can be approximately regarded as the probability of appearance of this value, and then, an approximate probability distribution among the attribute values can be derived. This calculation method does not require any prior knowledge except the data set, and is objective.

Definition 4. Consider attribute b of IIS (U, AT, V, f) and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, and $k_{b}^{i}$ denotes the cardinality of set {x ∈ U | b (x) = b_i}. For any x ∈ U, the probability of b (x) = b_i is $k_{b}^{i} / (k_{b}^{1} + k_{b}^{2} + \dots + k_{b}^{m})$. Then, for IIS (U, AT, V, f) and ∀x, y ∈ U, the probability P_b (x, y) that x is similar to y with respect to attribute b is calculated as: $P_{b}(x, y) = \begin{cases} 1, & b(x) = b(y) \land b(x) \neq * \land b(y) \neq *; \\ 0, & b(x) \neq b(y) \land b(x) \neq * \land b(y) \neq *; \\ k_{b}^{i} / \sum_{j = 1}^{m} k_{b}^{j}, & (b(x) = b_{i} \land b(y) = *) \lor (b(x) = * \land b(y) = b_{i}); \\ \sum_{i = 1}^{m} \left(k_{b}^{i} / \sum_{j = 1}^{m} k_{b}^{j}\right)^{2}, & b(x) = * \land b(y) = *. \end{cases}$ (8)

On this basis, the probability that two objects are similar with respect to attribute set B is calculated as: $P_{B}(x, y) = \prod_{b \in B} P_{b}(x, y) \times N_{B}(x, y)$ (9) where N_B (x, y) is the weight factor of objects x and y based on the known attribute values they share with respect to B, i.e., $N_{B}(x, y) = \frac{|\{b \in B \mid b(x) = b(y) \neq *\}|}{|B|}$ (10) where |X| denotes the cardinality of set X.

Definition 5. For IIS (U, AT, V, f) and attribute set B ⊆ AT, the data-driven valued tolerance relation ${DVT}_{B}^{λ}$ proposed by Wang [28] is defined as: ${DVT}_{B}^{λ} = \{(x, y) \in U \times U \mid P_{B}(x, y) \geq λ\} \cup I_{U}$ (11)

where I_U ={ (x, x) |x ∈ U } is the identity relation on U, tolerance degree P_B (x, y) is defined by Eq. (9), λ is the threshold of tolerance degree.

The data-driven valued tolerance class for any object x ∈ U is defined as: ${DVT}_{B}^{λ}(x) = \{y \in U \mid P_{B}(x, y) \geq λ\} \cup \{x\}$ (12)

It is clear that ${DVT}_{B}^{λ}$ is a reflexive and symmetric relation, but not necessarily transitive. Although the data-driven valued tolerance relation can resolve the problems of the valued tolerance relation, its method of calculating the tolerance degree still has limitations. When predicting the probability of an unknown attribute value of an object, it approximates the probability of a value by its frequency, without considering the effects of the object's other known attribute values on the prediction. The data-driven valued tolerance relation therefore does not make full use of the prior knowledge of the information system, which may cause inaccurate classification. The following example illustrates this deficiency.

Example 1. Suppose an IIS is given in Table 1, where x₁, x₂, x₃, x₄, x₅, x₆ are the available objects, a₁, a₂, a₃, a₄ are four attributes.

Table 1

An IIS (U, AT, V, f)

U	a₁	a₂	a₃	a₄
x₁	1	1	1	1
x₂	1	1	1	1
x₃	2	2	2	2
x₄	2	2	2	2
x₅	1	1	1	*
x₆	2	2	2	*

According to Definition 4, the probability of b (x) = b_i is $k_{b}^{i} / \sum_{j = 1}^{m} k_{b}^{j}$, so the probabilities that a₄ (x₅) and a₄ (x₆) equal 1 or 2 are calculated as follows: $\begin{matrix} P (a_{4} (x_{5}) = 1) = \frac{2}{2 + 2} = \frac{1}{2} \\ P (a_{4} (x_{5}) = 2) = \frac{2}{2 + 2} = \frac{1}{2} \\ P (a_{4} (x_{6}) = 1) = \frac{2}{2 + 2} = \frac{1}{2} \\ P (a_{4} (x_{6}) = 2) = \frac{2}{2 + 2} = \frac{1}{2} \end{matrix}$

The above calculation concerns only the known values of attribute a₄ and ignores the effects of the other known attribute values of each object. Although the frequencies of 1 and 2 on attribute a₄ are both equal to 1/2, we have $\begin{matrix} [a_{1} (x_{1}), a_{2} (x_{1}), a_{3} (x_{1})] \\ = [a_{1} (x_{2}), a_{2} (x_{2}), a_{3} (x_{2})] = [1, 1, 1] \\ [a_{1} (x_{3}), a_{2} (x_{3}), a_{3} (x_{3})] \\ = [a_{1} (x_{4}), a_{2} (x_{4}), a_{3} (x_{4})] = [2, 2, 2] \end{matrix}$

For object x₅, [a₁ (x₅) , a₂ (x₅) , a₃ (x₅)] = [1, 1, 1], which is the same as for objects x₁ and x₂. Therefore, when predicting the unknown attribute value a₄ (x₅) from the perspective of similarity on the known attribute values, a₄ (x₁) and a₄ (x₂) should have a greater influence on a₄ (x₅) than a₄ (x₃) and a₄ (x₄). That is to say, the influence weight of a₄ (x₁) and a₄ (x₂) is greater than that of a₄ (x₃) and a₄ (x₄). Since a₄ (x₁) = a₄ (x₂) = 1 and a₄ (x₃) = a₄ (x₄) = 2, the probability of a₄ (x₅) = 1 should be greater than that of a₄ (x₅) = 2. Similarly, the probability of a₄ (x₆) = 2 should be greater than that of a₄ (x₆) = 1.
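The frequency-only estimate of Definition 4, as applied to Table 1, can be sketched in Python as follows. This is an illustrative sketch rather than the original implementation; the function names and the encoding of a₁ to a₄ as column indices 0 to 3 are assumptions. Exact fractions are used so that the results can be compared with the hand calculation.

```python
from fractions import Fraction
from math import prod

MISSING = "*"

# Table 1; columns 0..3 stand for a1..a4.
table1 = {
    "x1": (1, 1, 1, 1),
    "x2": (1, 1, 1, 1),
    "x3": (2, 2, 2, 2),
    "x4": (2, 2, 2, 2),
    "x5": (1, 1, 1, MISSING),
    "x6": (2, 2, 2, MISSING),
}

def frequencies(table, b):
    """Relative frequency k_b^i / sum_j k_b^j of each known value of attribute b."""
    values = [row[b] for row in table.values() if row[b] != MISSING]
    return {v: Fraction(values.count(v), len(values)) for v in set(values)}

def p_attr(table, x, y, b):
    """Eq. (8): probability that rows x and y agree on attribute b."""
    freq = frequencies(table, b)
    if x[b] != MISSING and y[b] != MISSING:
        return Fraction(int(x[b] == y[b]))
    if x[b] == MISSING and y[b] == MISSING:
        return sum(f * f for f in freq.values())
    known_value = y[b] if x[b] == MISSING else x[b]
    return freq[known_value]

def p_set(table, x, y, attrs):
    """Eqs. (9)-(10): product of per-attribute probabilities times the weight N_B."""
    same_known = sum(1 for b in attrs if x[b] == y[b] != MISSING)
    n_b = Fraction(same_known, len(attrs))
    return prod(p_attr(table, x, y, b) for b in attrs) * n_b

AT = (0, 1, 2, 3)
print(frequencies(table1, 3)[1], frequencies(table1, 3)[2])  # 1/2 1/2
print(p_set(table1, table1["x5"], table1["x1"], AT))         # 3/8
```

The estimate for a₄ is 1/2 for both values regardless of which object is being completed, so x₅ and x₆ receive identical predictions even though their known attribute values point in opposite directions.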

3 Modified data-driven valued tolerance relation

In this section, to overcome the limitations of the data-driven valued tolerance relation, a modified data-driven valued tolerance relation is defined, which considers both the frequency of the known attribute values and their influence weight when predicting unknown attribute values.

Definition 6. Consider attribute b of IIS (U, AT, V, f) and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, and $k_{b}^{i}$ denotes the cardinality of set {x ∈ U | b (x) = b_i}. The frequency of appearance of b_i on the attribute b is defined as: $f_{b_{i}} = \frac{k_{b}^{i}}{\sum_{j = 1}^{m} k_{b}^{j}}$ (13)

Definition 7. Consider attribute b of IIS (U, AT, V, f) and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, and $k_{b}^{i}$ denotes the cardinality of set {x ∈ U | b (x) = b_i}. Let Y_i = {y ∈ U | b (y) = b_i} (1 ≤ i ≤ m). For any x ∈ U with b (x) = *, the influence weight of b_i on b (x) is defined as: $w_{b}(b_{i}, x) = \frac{\sum_{y \in Y_{i}} I_{b}(x, y)}{\sum_{j = 1}^{m} \sum_{y \in Y_{j}} I_{b}(x, y)}$ (14)

where $I_{b}(x, y) = \frac{n + 1}{|AT|}$, n = |{a ∈ AT | a (x) = a (y) ≠ *}|, and |AT| denotes the cardinality of set AT. Obviously, 0 < I_b (x, y) ≤ 1. If a (x) = a (y) ≠ * for all a ∈ AT − {b}, then I_b (x, y) = 1; if no attribute a ∈ AT − {b} satisfies a (x) = a (y) ≠ *, then $I_{b}(x, y) = \frac{1}{|AT|}$.

Property 1. For IIS (U, AT, V, f), b ∈ AT, and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, we have 0 < w_b (b_i, x) ≤ 1 for ∀x ∈ U.

Proof. From Definition 7, we have $0 < \sum_{y \in Y_{i}} I_{b}(x, y) \leq \sum_{j = 1}^{m} \sum_{y \in Y_{j}} I_{b}(x, y)$. Hence, $0 < \frac{\sum_{y \in Y_{i}} I_{b}(x, y)}{\sum_{j = 1}^{m} \sum_{y \in Y_{j}} I_{b}(x, y)} \leq 1$. Therefore, 0 < w_b (b_i, x) ≤ 1.

Property 2. For IIS (U, AT, V, f), b ∈ AT, and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, we have $\sum_{i = 1}^{m} w_{b}(b_{i}, x) = 1$ for ∀x ∈ U.

Proof. From Definition 7, we have $w_{b}(b_{i}, x) = \frac{\sum_{y \in Y_{i}} I_{b}(x, y)}{\sum_{j = 1}^{m} \sum_{y \in Y_{j}} I_{b}(x, y)}$. Hence, $\sum_{i = 1}^{m} w_{b}(b_{i}, x) = \sum_{i = 1}^{m} \frac{\sum_{y \in Y_{i}} I_{b}(x, y)}{\sum_{j = 1}^{m} \sum_{y \in Y_{j}} I_{b}(x, y)} = 1$.

Definition 8. Consider attribute b of IIS (U, AT, V, f) and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, and $k_{b}^{i}$ denotes the cardinality of set {x ∈ U | b (x) = b_i}. Let the frequency of appearance of b_i on the attribute b be $f_{b_{i}}$ and, for any x ∈ U with b (x) = *, let the influence weight of b_i on b (x) be w_b (b_i, x). Then the probability of b (x) = b_i is defined as: $P(b(x) = b_{i}) = \frac{f_{b_{i}} \times w_{b}(b_{i}, x)}{\sum_{j = 1}^{m} (f_{b_{j}} \times w_{b}(b_{j}, x))}$ (15)

Definition 9. Consider attribute b of IIS (U, AT, V, f) and set $V_{b}^{'} = \{(k_{b}^{1}, b_{1}), (k_{b}^{2}, b_{2}), \dots, (k_{b}^{m}, b_{m})\}$, where b₁, b₂, ⋯ , b_m are all known attribute values of b, and $k_{b}^{i}$ denotes the cardinality of set {x ∈ U | b (x) = b_i}. For any x ∈ U with b (x) = *, the probability of b (x) = b_i is P (b (x) = b_i). Then, for IIS (U, AT, V, f), B ⊆ AT, and ∀x, y ∈ U, the probability P_b (x, y) that x is similar to y with respect to attribute b ∈ B is calculated as: $P_{b}(x, y) = \begin{cases} 1, & b(x) = b(y) \land b(x) \neq * \land b(y) \neq *; \\ 0, & b(x) \neq b(y) \land b(x) \neq * \land b(y) \neq *; \\ P(b(x) = b_{i}), & b(x) = * \land b(y) = b_{i}; \\ P(b(y) = b_{i}), & b(x) = b_{i} \land b(y) = *; \\ \sum_{i = 1}^{m} \left(P(b(x) = b_{i}) \times P(b(y) = b_{i})\right), & b(x) = * \land b(y) = *. \end{cases}$ (16)

Then the probability that two objects are possibly the same with respect to attribute set B is calculated as: $P_{B}(x, y) = \prod_{b \in B} P_{b}(x, y) \times N_{B}(x, y)$ (17) where N_B (x, y) is the weight factor of objects x and y based on the known attribute values they share with respect to B, i.e., $N_{B}(x, y) = \frac{|\{b \in B \mid b(x) = b(y) \neq *\}|}{|B|}$ (18) where |X| denotes the cardinality of set X.

Definition 10. For IIS (U, AT, V, f) and attribute set B ⊆ AT, the modified data-driven valued tolerance relation ${MDVT}_{B}^{λ}$ is defined as: ${MDVT}_{B}^{λ} = \{(x, y) \in U \times U \mid P_{B}(x, y) \geq λ\} \cup I_{U}$ (19)

where I_U ={ (x, x) |x ∈ U } is the identity relation on U, λ is the threshold of tolerance degree. In this paper, λ > 0 is calculated as follows.

Consider IIS (U, AT, V, f) and tolerance relation T_B with B ⊆ AT. For any x ∈ U, tolerance class T_B (x) can be calculated. Based on modified data-driven valued tolerance relation, according to the calculation method of tolerance degree, for any y in T_B (x), tolerance degree P_B (x, y) can be calculated. Therefore, the threshold in the modified data-driven valued tolerance relation can be calculated as [28]:

$λ = \min \left\{ \min_{x \in U} \max_{y \in T_{B}(x)} P_{B}(x, y),\ \max_{x \in U} \min_{y \in T_{B}(x)} P_{B}(x, y) \right\}$ (20)

Eq. (20) gives a method to calculate the tolerance degree threshold automatically, which does not require any prior domain knowledge except the data set. The purpose of this calculation method is to select a suitable threshold, neither too large nor too small, so that the classification results based on it are reasonable.
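The threshold rule of Eq. (20) can be sketched in Python as below. The tolerance degrees and tolerance classes are hypothetical numbers chosen for illustration, not values from the paper, and `auto_threshold` is an assumed name.

```python
def auto_threshold(degree, tolerance_classes):
    """Eq. (20): the smaller of min-over-x of max-over-y and max-over-x of min-over-y."""
    best = [max(degree[x, y] for y in ys) for x, ys in tolerance_classes.items()]
    worst = [min(degree[x, y] for y in ys) for x, ys in tolerance_classes.items()]
    return min(min(best), max(worst))

# Hypothetical symmetric tolerance degrees P_B(x, y).
pairs = {
    ("x1", "x1"): 1.0, ("x1", "x2"): 0.4,
    ("x2", "x2"): 0.6, ("x2", "x3"): 0.3,
    ("x3", "x3"): 1.0,
}
degree = {**pairs, **{(y, x): p for (x, y), p in pairs.items()}}

# Hypothetical tolerance classes T_B(x) from the plain tolerance relation.
tolerance_classes = {
    "x1": ["x1", "x2"],
    "x2": ["x1", "x2", "x3"],
    "x3": ["x2", "x3"],
}

lam = auto_threshold(degree, tolerance_classes)
print(lam)  # 0.4
```

In this toy case the per-object maxima are 1.0, 0.6 and 1.0 and the per-object minima are 0.4, 0.3 and 0.3, so the threshold is min(0.6, 0.4) = 0.4.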

Obviously, ${MDVT}_{B}^{λ}$ is a reflexive and symmetric relation, but not necessarily transitive.

The modified data-driven valued tolerance class for any object x ∈ U is defined as: ${MDVT}_{B}^{λ}(x) = \{y \in U \mid P_{B}(x, y) \geq λ\} \cup \{x\}$ (21)

Example 2. Let us consider the incomplete information system given in Table 1.

From Table 1 we can see that a₄ (x₅) = * and a₄ (x₆) = *, and that $V_{a_{4}}^{'} = \{(2, 1), (2, 2)\}$, where 1 and 2 are all known attribute values of a₄, Y₁ = {y ∈ U | a₄ (y) = 1} = {x₁, x₂}, and Y₂ = {y ∈ U | a₄ (y) = 2} = {x₃, x₄}. We use a₄ (x₅) and a₄ (x₆) to illustrate how the probabilities of a₄ (x₅) = 1, a₄ (x₅) = 2, a₄ (x₆) = 1 and a₄ (x₆) = 2 are calculated.

First, we calculate the frequency of appearance of 1 and 2 on the attribute a₄. According to Definition 6, we have $\begin{matrix} f_{a_{4} = 1} & = & 0.5 \\ f_{a_{4} = 2} & = & 0.5 \end{matrix}$

Second, we calculate the influence weight of 1 and 2 on a₄ (x₅), a₄ (x₆). According to Definition 7, we have $\begin{matrix} I_{a_{4}} (x_{5}, x_{1}) & = & I_{a_{4}} (x_{5}, x_{2}) = 1 \\ I_{a_{4}} (x_{5}, x_{3}) & = & I_{a_{4}} (x_{5}, x_{4}) = 0.25 \\ I_{a_{4}} (x_{6}, x_{1}) & = & I_{a_{4}} (x_{6}, x_{2}) = 0.25 \\ I_{a_{4}} (x_{6}, x_{3}) & = & I_{a_{4}} (x_{6}, x_{4}) = 1 \end{matrix}$ Then we have $\begin{matrix} w_{a_{4}} (1, x_{5}) & = & \frac{1 + 1}{1 + 1 + 0.25 + 0.25} = 0.8 \\ w_{a_{4}} (2, x_{5}) & = & \frac{0.25 + 0.25}{1 + 1 + 0.25 + 0.25} = 0.2 \\ w_{a_{4}} (1, x_{6}) & = & \frac{0.25 + 0.25}{0.25 + 0.25 + 1 + 1} = 0.2 \\ w_{a_{4}} (2, x_{6}) & = & \frac{1 + 1}{0.25 + 0.25 + 1 + 1} = 0.8 \end{matrix}$

Finally, we calculate the probabilities of a₄ (x₅) = 1, a₄ (x₅) = 2, a₄ (x₆) = 1 and a₄ (x₆) = 2. According to Definition 8, we have $\begin{matrix} P (a_{4} (x_{5}) = 1) & = & \frac{0.5 \times 0.8}{0.5 \times 0.8 + 0.5 \times 0.2} = 0.8 \\ P (a_{4} (x_{5}) = 2) & = & \frac{0.5 \times 0.2}{0.5 \times 0.8 + 0.5 \times 0.2} = 0.2 \\ P (a_{4} (x_{6}) = 1) & = & \frac{0.5 \times 0.2}{0.5 \times 0.2 + 0.5 \times 0.8} = 0.2 \\ P (a_{4} (x_{6}) = 2) & = & \frac{0.5 \times 0.8}{0.5 \times 0.2 + 0.5 \times 0.8} = 0.8 \end{matrix}$

According to the analysis of Table 1 in Example 1, the probability of a₄ (x₅) = 1 should be greater than that of a₄ (x₅) = 2, whereas the probability of a₄ (x₆) = 2 should be greater than that of a₄ (x₆) = 1. Our calculation gives P (a₄ (x₅) = 1) > P (a₄ (x₅) = 2) and P (a₄ (x₆) = 1) < P (a₄ (x₆) = 2), which is consistent with this analysis.
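The three steps of Example 2 (frequency, influence weight, probability) can be checked with a short Python sketch of Definitions 6 to 8. The function names and the column-index encoding of a₁ to a₄ are assumptions made for this illustration; exact fractions are used, so 0.8 and 0.2 appear as 4/5 and 1/5.

```python
from fractions import Fraction

MISSING = "*"

# Table 1; columns 0..3 stand for a1..a4.
table1 = {
    "x1": (1, 1, 1, 1),
    "x2": (1, 1, 1, 1),
    "x3": (2, 2, 2, 2),
    "x4": (2, 2, 2, 2),
    "x5": (1, 1, 1, MISSING),
    "x6": (2, 2, 2, MISSING),
}
AT = (0, 1, 2, 3)

def frequency(table, b):
    """Eq. (13): relative frequency of each known value of attribute b."""
    values = [row[b] for row in table.values() if row[b] != MISSING]
    return {v: Fraction(values.count(v), len(values)) for v in set(values)}

def influence(x, y, attrs):
    """I_b(x, y) = (n + 1) / |AT|, with n the number of equal known values."""
    n = sum(1 for a in attrs if x[a] == y[a] != MISSING)
    return Fraction(n + 1, len(attrs))

def influence_weights(table, x, b, attrs):
    """Eq. (14): normalised influence of each known value of b on the missing b(x)."""
    totals = {}
    for row in table.values():
        if row[b] != MISSING:
            totals[row[b]] = totals.get(row[b], 0) + influence(x, row, attrs)
    grand_total = sum(totals.values())
    return {v: t / grand_total for v, t in totals.items()}

def probability(table, x, b, attrs):
    """Eq. (15): P(b(x) = b_i) for every known value b_i of b."""
    f = frequency(table, b)
    w = influence_weights(table, x, b, attrs)
    norm = sum(f[v] * w[v] for v in f)
    return {v: f[v] * w[v] / norm for v in f}

p5 = probability(table1, table1["x5"], 3, AT)
p6 = probability(table1, table1["x6"], 3, AT)
print(p5[1], p5[2])  # 4/5 1/5
print(p6[1], p6[2])  # 1/5 4/5
```

Because the two frequencies are equal in this table, the final probabilities coincide with the influence weights; on data with unequal frequencies the two factors would both shape the result.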

4 Extended rough set model based on modified data-driven valued tolerance relation

Definition 11. For IIS (U, AT, V, f) and attribute set B ⊆ AT, the lower approximation ${\underline{MDVT}}_{B}^{λ} (X)$ and upper approximation ${\bar{MDVT}}_{B}^{λ} (X)$ of object set X based on the modified data-driven valued tolerance relation are defined as: ${\underline{MDVT}}_{B}^{λ} (X) = \{x \in U \mid {MDVT}_{B}^{λ} (x) \subseteq X\}$ (22) ${\bar{MDVT}}_{B}^{λ} (X) = \{x \in U \mid {MDVT}_{B}^{λ} (x) \cap X \neq \emptyset\}$ (23)

Definition 12. Considering IIS (U, AT, V, f) and attribute set B ⊆ AT, the positive region ${POS}_{B}^{λ} (X)$, negative region ${NEG}_{B}^{λ} (X)$ and boundary region ${BND}_{B}^{λ} (X)$ of object set X with reference to B are further defined as: ${POS}_{B}^{λ} (X) = {\underline{MDVT}}_{B}^{λ} (X)$ (24) ${NEG}_{B}^{λ} (X) = U - {\bar{MDVT}}_{B}^{λ} (X)$ (25) ${BND}_{B}^{λ} (X) = {\bar{MDVT}}_{B}^{λ} (X) - {\underline{MDVT}}_{B}^{λ} (X)$ (26)
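Given the tolerance classes, Definitions 11 and 12 reduce to simple set operations, as the following Python sketch shows. The classes and the target set are hypothetical, and the function names are assumptions; in practice the classes would come from Eq. (21).

```python
def approximations(classes, target):
    """Eqs. (22)-(23): lower and upper approximation of `target`."""
    lower = {x for x, cls in classes.items() if cls <= target}
    upper = {x for x, cls in classes.items() if cls & target}
    return lower, upper

def regions(classes, target):
    """Eqs. (24)-(26): positive, negative and boundary regions."""
    lower, upper = approximations(classes, target)
    return {"POS": lower, "NEG": set(classes) - upper, "BND": upper - lower}

# Hypothetical tolerance classes on U = {x1, x2, x3, x4}.
classes = {
    "x1": {"x1", "x2"},
    "x2": {"x1", "x2", "x3"},
    "x3": {"x2", "x3"},
    "x4": {"x4"},
}
X = {"x1", "x2"}

lower, upper = approximations(classes, X)
print(sorted(lower), sorted(upper))  # ['x1'] ['x1', 'x2', 'x3']
print({k: sorted(v) for k, v in regions(classes, X).items()})
```

The output satisfies the inclusion lower ⊆ X ⊆ upper, which holds whenever every class contains its own object.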

By comparison with the tolerance relation, non-symmetric similarity relation, limited tolerance relation, and data-driven valued tolerance relation, we can obtain the following properties.

Property 3. Considering IIS (U, AT, V, f), ∀X ⊆ U and ∀B ⊆ AT, 0 ≤ λ ≤ 1, then ${\underline{MDVT}}_{B}^{λ} (X) \subseteq X \subseteq {\bar{MDVT}}_{B}^{λ} (X)$ .

Proof. The proof can easily be completed according to Definition 11.

Property 4. Considering IIS (U, AT, V, f), ∀X ⊆ Y ⊆ U and ∀B ⊆ AT, 0 ≤ λ ≤ 1, then ${\underline{MDVT}}_{B}^{λ} (X) \subseteq {\underline{MDVT}}_{B}^{λ} (Y)$ and ${\bar{MDVT}}_{B}^{λ} (X) \subseteq {\bar{MDVT}}_{B}^{λ} (Y)$ .

Proof. The proof can easily be completed according to Definition 11.

Property 5. Considering IIS (U, AT, V, f), ∀X ⊆ U and ∀B ⊆ AT, 0 ≤ λ₁ ≤ λ₂ ≤ 1, then ${\underline{MDVT}}_{B}^{λ_{1}} (X) \subseteq {\underline{MDVT}}_{B}^{λ_{2}} (X)$ and ${\bar{MDVT}}_{B}^{λ_{2}} (X) \subseteq {\bar{MDVT}}_{B}^{λ_{1}} (X)$ .

Proof. The proof can easily be completed according to Definition 11.

Property 6. Considering IIS (U, AT, V, f), ∀x ∈ U and ∀A ⊆ B ⊆ AT, the inclusions ${MDVT}_{B}^{λ} \subseteq {MDVT}_{A}^{λ}$ and ${MDVT}_{B}^{λ} (x) \subseteq {MDVT}_{A}^{λ} (x)$ do not necessarily hold.

The proof of Property 6 is shown in the following Example 3.

Example 3. Suppose an incomplete information system is given in Table 2, where U = {x₁, x₂, x₃, x₄} is the object set and AT = {a₁, a₂, a₃, a₄} is the attribute set. Let A = {a₁, a₂, a₃}, B = {a₂, a₃, a₄} and C = {a₁, a₂, a₃, a₄}; then A ⊆ C and B ⊆ C.

Table 2

An IIS (U, AT, V, f)

U	a₁	a₂	a₃	a₄
x ₁	1	2	2	*
x ₂	1	2	*	2
x ₃	2	1	1	1
x ₄	1	2	2	2

According to Eq. (17), we have $\begin{matrix} P_{A} (x_{1}, x_{2}) = 1 \times 1 \times \frac{12}{13} \times \frac{2}{3} = \frac{24}{39} = 0.615 \\ P_{B} (x_{1}, x_{2}) = 1 \times \frac{10}{11} \times \frac{10}{11} \times \frac{1}{3} = \frac{100}{363} = 0.275 \\ P_{C} (x_{1}, x_{2}) = 1 \times 1 \times \frac{14}{15} \times \frac{14}{15} \times \frac{2}{4} = \frac{98}{225} = 0.436 \end{matrix}$

Hence P_B (x₁, x₂) < P_C (x₁, x₂) < P_A (x₁, x₂). Therefore, for (x, y) ∈ U × U and A ⊆ B, the relationship between P_A (x, y) and P_B (x, y) under MDVT is not determined by the inclusion of the attribute sets. By the definitions of ${MDVT}_{B}^{λ}$ and ${MDVT}_{B}^{λ} (x)$, it follows that ${MDVT}_{B}^{λ} \subseteq {MDVT}_{A}^{λ}$ and ${MDVT}_{B}^{λ} (x) \subseteq {MDVT}_{A}^{λ} (x)$ do not necessarily hold.
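The three tolerance degrees of Example 3 can be reproduced with the following Python sketch of Eqs. (15) to (18). One point is an assumption inferred from the arithmetic rather than stated in the text: the printed factors 12/13 and 10/11 are obtained when the count n and the divisor in I_b are taken over the attribute subset under consideration (the `scope` argument below) instead of over the whole of AT as written in Definition 7. All function names are hypothetical.

```python
from fractions import Fraction
from math import prod

MISSING = "*"

# Table 2; columns 0..3 stand for a1..a4.
table2 = {
    "x1": (1, 2, 2, MISSING),
    "x2": (1, 2, MISSING, 2),
    "x3": (2, 1, 1, 1),
    "x4": (1, 2, 2, 2),
}

def probability(table, x, b, scope):
    """Eq. (15): P(b(x) = v) for each known value v of b.
    `scope` is the attribute set used for I_b (an assumption, see above)."""
    known = [row for row in table.values() if row[b] != MISSING]
    score = {}
    for row in known:
        n = sum(1 for a in scope if x[a] == row[a] != MISSING)
        score[row[b]] = score.get(row[b], 0) + Fraction(n + 1, len(scope))
    # count * score is f_v * w_v up to constant factors that cancel below.
    unnorm = {v: sum(r[b] == v for r in known) * s for v, s in score.items()}
    total = sum(unnorm.values())
    return {v: u / total for v, u in unnorm.items()}

def p_attr(table, x, y, b, scope):
    """Eq. (16): probability that rows x and y agree on attribute b."""
    if x[b] != MISSING and y[b] != MISSING:
        return Fraction(int(x[b] == y[b]))
    if x[b] == MISSING and y[b] == MISSING:
        px, py = probability(table, x, b, scope), probability(table, y, b, scope)
        return sum(px[v] * py[v] for v in px)
    missing_row, value = (x, y[b]) if x[b] == MISSING else (y, x[b])
    return probability(table, missing_row, b, scope)[value]

def p_set(table, x, y, attrs, scope=None):
    """Eqs. (17)-(18); `scope` defaults to `attrs` itself."""
    scope = attrs if scope is None else scope
    n_b = Fraction(sum(1 for b in attrs if x[b] == y[b] != MISSING), len(attrs))
    return prod(p_attr(table, x, y, b, scope) for b in attrs) * n_b

x1, x2 = table2["x1"], table2["x2"]
A, B, C = (0, 1, 2), (1, 2, 3), (0, 1, 2, 3)
print(p_set(table2, x1, x2, A))  # 8/13   (= 24/39, about 0.615)
print(p_set(table2, x1, x2, B))  # 100/363 (about 0.275)
print(p_set(table2, x1, x2, C))  # 98/225  (about 0.436)
```

If `scope` is instead fixed to all four attributes, the a₃ factor becomes 14/15 for every subset, and `p_set(table2, x1, x2, A, scope=C)` evaluates to 28/45; the ordering P_B < P_C < P_A would have to be rechecked under that reading.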

Property 6 shows that the modified data-driven valued tolerance relation and its tolerance class are not necessarily monotonically decreasing with the growth of attribute set. Furthermore, it is easy to get the following Property 7 from Property 6.

Property 7. Considering IIS (U, AT, V, f), ∀X ⊆ U and ∀A ⊆ B ⊆ AT, the inclusions ${\underline{MDVT}}_{A}^{λ} (X) \subseteq {\underline{MDVT}}_{B}^{λ} (X)$ and ${\bar{MDVT}}_{B}^{λ} (X) \subseteq {\bar{MDVT}}_{A}^{λ} (X)$ are not always true.

Property 7 shows that in the modified data-driven valued tolerance relation, with the increase of attribute set, the lower approximation is not necessarily monotonically increasing, and the upper approximation is not necessarily monotonically decreasing. These properties are not consistent with the corresponding properties based on the indiscernibility relation in CIS.

5 Experimental results

As MDVT is a reflexive and symmetric generalized indiscernibility relation, we demonstrate its effectiveness by comparing its classification accuracy with that of three other typical reflexive and symmetric generalized indiscernibility relations: the tolerance relation, the limited tolerance relation and the data-driven valued tolerance relation. These relations are introduced in Section 2. As the non-symmetric similarity relation is reflexive and transitive but not necessarily symmetric, we do not compare the classification accuracy of MDVT with that of the non-symmetric similarity relation.

The classification accuracy of a generalized indiscernibility relation is defined as follows [29, 30]. Firstly, for a given CIS S₁ = (U, AT, V, f), let E = {(x, y) ∈ U × U | ∀b ∈ AT, b (x) = b (y)} be an equivalence relation on U, and let [x]_E be the equivalence class of x in U; then the classification U/E = {[x]_E | x ∈ U} is obtained. Secondly, CIS S₁ is modified by introducing some percentage of randomly chosen missing attribute values, from which IIS S₂ = (U°, AT, V, f) is derived. Let R be a generalized indiscernibility relation on U° and R (x) be the generalized indiscernibility class of x; then the classification U/R = {R (x) | x ∈ U} is derived. Finally, for symmetric R, the classification accuracy is defined as: $u_{R} = \frac{1}{|U|} \sum_{x \in U} \frac{|[x]_{E} \cap R(x)|}{|[x]_{E} \cup R(x)|}$ (27) where |X| denotes the cardinality of set X. Obviously, u_R ∈ [0, 1], and the greater u_R is, the better the classification given by R.
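Eq. (27) averages, over all objects, the ratio of intersection to union (the Jaccard index) between the true equivalence class and the class produced by R. A minimal Python sketch follows; the two families of classes are hypothetical and the function name is an assumption.

```python
from fractions import Fraction

def classification_accuracy(equivalence_classes, relation_classes):
    """Eq. (27): mean of |[x]_E & R(x)| / |[x]_E | R(x)| over all objects x."""
    total = sum(
        Fraction(len(equivalence_classes[x] & relation_classes[x]),
                 len(equivalence_classes[x] | relation_classes[x]))
        for x in equivalence_classes
    )
    return total / len(equivalence_classes)

# Hypothetical classes: E from the complete data, R from the incomplete data.
E = {"x1": {"x1", "x2"}, "x2": {"x1", "x2"}, "x3": {"x3"}}
R = {"x1": {"x1", "x2", "x3"}, "x2": {"x1", "x2"}, "x3": {"x1", "x3"}}

u = classification_accuracy(E, R)
print(u, float(u))  # 13/18 0.7222...
```

The value is 1 only when R recovers every equivalence class exactly, and it drops both when R merges objects that should be separate and when it splits objects that belong together.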

In our experiments, six complete data sets from the UCI Machine Learning Repository [31] are used, as shown in Table 3: Zoo, Monk, Spect, Hayes, TicTacToe and Chess. In Table 3, the first column gives the name of the data set, the second the number of objects, the third the number of condition attributes, and the fourth the number of decision attributes. Each complete data set is modified by introducing 10%, 30% and 50% randomly chosen missing attribute values, which yields 18 incomplete data sets (X-a% denotes data set X with a% of its data missing). For each incomplete data set, we run the experiment 100 times and use the average result as the final classification accuracy. The experimental results are shown in Table 4, where T, L, DVT and MDVT denote the tolerance relation, limited tolerance relation, data-driven valued tolerance relation and modified data-driven valued tolerance relation, respectively.
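The step of introducing a given percentage of missing values can be sketched in Python as below. The paper does not specify the sampling procedure, so this sketch makes assumptions: cells are drawn uniformly without replacement from all attribute cells, the number of masked cells is the rounded product of the fraction and the cell count, and the function name and `seed` parameter are hypothetical.

```python
import random

MISSING = "*"

def introduce_missing(rows, fraction, seed=None):
    """Return a copy of `rows` with round(fraction * cells) cells replaced by '*'."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    chosen = set(rng.sample(cells, round(fraction * len(cells))))
    return [
        tuple(MISSING if (i, j) in chosen else v for j, v in enumerate(row))
        for i, row in enumerate(rows)
    ]

# Hypothetical complete data: 5 objects with 4 attributes each (20 cells).
complete = [(1, 2, 3, 4)] * 5
incomplete = introduce_missing(complete, 0.30, seed=0)

n_missing = sum(v == MISSING for row in incomplete for v in row)
print(n_missing)  # 6
```

Repeating the call with different seeds and averaging the resulting accuracies corresponds to the repeated runs described above.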

Figure 1 plots the classification accuracy from Table 4. In these plots, the X-axis represents the percentage of randomly chosen missing attribute values, and the Y-axis represents the classification accuracy. From Table 4 and Fig. 1, we can see that the classification accuracy of MDVT is higher than that of T and L on all test data sets. The classification accuracy of MDVT is also higher than that of DVT on all test data sets except Monk-50% and Hayes-50%, where the two are equal. Note that as the proportion of missing attribute values increases, the classification accuracy deteriorates for all test data sets. However, the classification accuracies of T and L decrease quickly as the data sets become more incomplete, whereas those of DVT and MDVT are relatively stable. The experimental results show that the modified data-driven valued tolerance relation achieves better classification accuracy than the other generalized indiscernibility relations compared.

Fig.1

Classification accuracies for the data sets used in the experiment.

Table 3

Six complete data sets in UCI

Data sets	No. of objects	No. of Condition attributes	No. of Decision attributes
Zoo	101	16	1
Monk	432	6	1
Spect	267	22	1
Hayes	160	5	1
TicTacToe	958	9	1
Chess	3196	36	1

Table 4

Classification accuracy of T, L, DVT and MDVT on the incomplete data sets

Data sets	T	L	DVT	MDVT
Zoo - 10%	0.77171	0.77171	0.77503	0.77542
Zoo - 30%	0.38383	0.38390	0.51347	0.56735
Zoo - 50%	0.13818	0.14153	0.29807	0.36917
Monk - 10%	0.33570	0.33570	0.34955	0.35190
Monk - 30%	0.03397	0.03837	0.04246	0.04583
Monk - 50%	0.00816	0.01569	0.03189	0.03189
Spect - 10%	0.88367	0.88367	0.89934	0.92118
Spect - 30%	0.46641	0.46641	0.65987	0.75006
Spect - 50%	0.09576	0.09638	0.39403	0.57115
Hayes - 10%	0.47649	0.47890	0.55590	0.59388
Hayes - 30%	0.11636	0.14199	0.16638	0.19427
Hayes - 50%	0.04206	0.08432	0.10725	0.10725
TicTacToe - 10%	0.65306	0.65306	0.75630	0.76263
TicTacToe - 30%	0.27795	0.27995	0.43152	0.46306
TicTacToe - 50%	0.07561	0.09678	0.21965	0.26658
Chess - 10%	0.56165	0.56457	0.76931	0.77126
Chess - 30%	0.27616	0.27925	0.51346	0.53265
Chess - 50%	0.13656	0.13761	0.36356	0.39164
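The comparisons drawn from Table 4 can be checked mechanically. The short sketch below copies the accuracy values from the table and lists the data sets on which MDVT only ties DVT, and those on which it is beaten by any other relation. The names `TABLE4` and `compare` are illustrative and not part of the original experiments.

```python
# Accuracies copied from Table 4: (T, L, DVT, MDVT) per incomplete data set.
TABLE4 = {
    "Zoo-10%": (0.77171, 0.77171, 0.77503, 0.77542),
    "Zoo-30%": (0.38383, 0.38390, 0.51347, 0.56735),
    "Zoo-50%": (0.13818, 0.14153, 0.29807, 0.36917),
    "Monk-10%": (0.33570, 0.33570, 0.34955, 0.35190),
    "Monk-30%": (0.03397, 0.03837, 0.04246, 0.04583),
    "Monk-50%": (0.00816, 0.01569, 0.03189, 0.03189),
    "Spect-10%": (0.88367, 0.88367, 0.89934, 0.92118),
    "Spect-30%": (0.46641, 0.46641, 0.65987, 0.75006),
    "Spect-50%": (0.09576, 0.09638, 0.39403, 0.57115),
    "Hayes-10%": (0.47649, 0.47890, 0.55590, 0.59388),
    "Hayes-30%": (0.11636, 0.14199, 0.16638, 0.19427),
    "Hayes-50%": (0.04206, 0.08432, 0.10725, 0.10725),
    "TicTacToe-10%": (0.65306, 0.65306, 0.75630, 0.76263),
    "TicTacToe-30%": (0.27795, 0.27995, 0.43152, 0.46306),
    "TicTacToe-50%": (0.07561, 0.09678, 0.21965, 0.26658),
    "Chess-10%": (0.56165, 0.56457, 0.76931, 0.77126),
    "Chess-30%": (0.27616, 0.27925, 0.51346, 0.53265),
    "Chess-50%": (0.13656, 0.13761, 0.36356, 0.39164),
}


def compare(table):
    """Return (data sets where MDVT equals DVT, data sets where MDVT is beaten)."""
    ties = [name for name, (t, l, dvt, mdvt) in table.items() if mdvt == dvt]
    worse = [name for name, (t, l, dvt, mdvt) in table.items()
             if mdvt < max(t, l, dvt)]
    return ties, worse


ties, worse = compare(TABLE4)
print(ties)   # ['Monk-50%', 'Hayes-50%']
print(worse)  # []
```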

6 Conclusions

Incomplete information is one of the most difficult problems in data mining. To deal with incomplete information, various extended models of classical rough set theory have been proposed. The data-driven valued tolerance relation requires no prior knowledge beyond the data set itself, which resolves the problems of the valued tolerance relation. However, when predicting the probability of an unknown attribute value of an object, it approximates the probability that a value appears by the frequency of that value, without considering the effect of the object's other known attribute values on the prediction. In this paper, to overcome this limitation of the data-driven valued tolerance relation, the modified data-driven valued tolerance relation is defined, which considers not only the frequency of an attribute value but also the effect of the other known attribute values of an object on predicting the probability of its unknown attribute value. On this basis, an extended rough set model based on the modified data-driven valued tolerance relation is proposed, and some properties of the new model are analyzed. Experimental results show that the modified data-driven valued tolerance relation achieves better classification accuracy than the other generalized indiscernibility relations. In the future, we will investigate attribute reduction and rule extraction based on the new generalized indiscernibility relation, and further examine its effectiveness in dealing with incomplete information systems.

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions to improve the paper. This work was supported by the National Natural Science Foundation of China (No. 61402005), the Natural Science Foundation of Anhui Province, China (No. 1308085QF114), the Higher Education Natural Science Foundation of Anhui Province, China (No. KJ2013A015), and the Open Foundation of Key Laboratory of Intelligent Computing and Signal Processing at Anhui University of Ministry of Education of China (2014).

References

1. Pawlak, Rough sets[J], International Journal of Computer & Information Sciences 11 (1982), 341–356.

2. […], Lu and Zhi, Variable precision rough set based decision tree classifier[J], Journal of Intelligent & Fuzzy Systems 23(2,3) (2012), 61–70.

3. Vluymans, D’eer, Saeys and Cornelis, Applications of fuzzy rough set theory in machine learning: A survey[J], Fundamenta Informaticae 142 (2015), 53–86.

4. Formica, Semantic web search based on rough sets and Fuzzy formal concept analysis[J], Knowledge-Based Systems 26 (2012), 40–47.

5. Chen, Li, Luo and Horng, A decision-theoretic rough set approach for dynamic data mining, IEEE Transactions on Fuzzy Systems 23 (2015), 1–1.

6. Zhang, Li, Da and Liu, Neighborhood rough sets for dynamic data mining[J], International Journal of Intelligent Systems 27 (2012), 317–342.

7. Sun, Ma and Zhao, An approach to emergency decision making based on decision-theoretic rough set over two universes[J], Soft Computing 20 (2016), 3617–3628.

8. Fan T.F., Liu D.R. and Tzeng G.H., Rough set-based logics for multicriteria decision analysis[J], European Journal of Operational Research 182 (2007), 340–355.

9. Kryszkiewicz, Rough set approach to incomplete information systems[J], Information Sciences 112 (1998), 39–49.

10. Slowinski and Vanderpooten, A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge & Data Engineering 12 (2000), 331–336.

11. Wang, Extension of rough set under incomplete information systems, 2002, pp. 1098–1103.

12. Stefanowski and Tsoukiás, On the Extension of Rough Sets under Incomplete Information, New Directions in Rough Sets, Data Mining, and Granular-Soft Computing, International Workshop, RSFDGrC ’99, Yamaguchi, Japan, 1999, Proceedings, 1999, pp. 73–81.

13. Grzymala-Busse J.W., Characteristic relations for incomplete data: A generalization of the indiscernibility relation, Lecture Notes in Computer Science 3700 (2005), 58–68.

14. Stefanowski and Tsoukiás, Incomplete information tables and rough classification[J], Computational Intelligence 17 (2001), 545–566.

15. […] and Li, Extended rough set model based on known same probability dominant valued tolerance relation[J], International Journal of Approximate Reasoning 74 (2016), 108–119.

16. Grzymala-Busse J.W. and Hu, A Comparison of Several Approaches to Missing Attribute Values in Data Mining, Rough Sets and Current Trends in Computing, Second International Conference, RSCTC 2000 Banff, Canada, Revised Papers, 2000, pp. 378–385.

17. Liu, Pan, Dezert, et al., Adaptive imputation of missing values for incomplete pattern classification[J], Pattern Recognition 52(C) (2016), 85–95.

18. Amiri and Jensen, Missing data imputation using fuzzy-rough methods[J], Neurocomputing 205 (2016), 152–164.

19. Turrado C.C., Sánchez L.F., Calvorollé J.L., et al., A hybrid algorithm for missing data imputation and its application to electrical data loggers[J], Sensors 16(9) (2016).

20. […] and Kim, Multiple imputation for nonignorable missing data[J], Journal of the Korean Statistical Society 46 (2017), 583–592.

21. Attia A.H., Sherif A.S. and El-Tawel G.S., Maximal limited similarity-based rough set model[J], Soft Computing 20(8) (2016), 1–9.

22. Meng and Shi, Extended rough set-based attribute reduction in inconsistent incomplete decision systems[J], Information Sciences 204(20) (2012), 44–69.

23. Saedudin R.R., Mahdin, Kasim, et al., A Relative Tolerance Relation of Rough Set for Incomplete Information Systems[C], International Conference on Soft Computing and Data Mining, Springer, Cham, 2018, pp. 72–81.

24. […] and Guo, An extension model of rough set in incomplete information system[C], International Conference on Future Computer and Communication, IEEE, 2010, pp. V2–434–V2–438.

25. Chen and Xia, An extended rough set model based on a new characteristic relation[C], IEEE International Conference on Granular Computing, IEEE, 2011, pp. 100–105.

26. Luo, Li and Yao, Dynamic probabilistic rough sets with incomplete data[J], Information Sciences (2017).

27. Kang and Miao, A variable precision rough set model based on the granularity of tolerance relation[J], Knowledge-Based Systems 102 (2016), 103–115.

28. Wang and Wang, 3DM: Domain-oriented Data-driven Data Mining[J], Fundamenta Informaticae 90 (2009), 395–426.

29. Wang, Guan, Wu and Hu, Data-driven valued tolerance relation based on the extended rough set[J], Fundamenta Informaticae 132 (2014), 349–363.

30. Ping, Qiu, Xiong and Yang, An incomplete data filling approach based on a new valued tolerance relation[J], Open Automation & Control Systems Journal 6 (2014), 1456–1462.

31. Bache and Lichman, UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.