Abstract
In this paper, information entropy about logarithmic form is used to measure uncertainty of knowledge, and the correlation connection between information entropy, conditional information entropy, joint information entropy and mutual information entropy are analyzed. Correlation coefficient ρ (B, C) between two attribute subsets in random information systems and ρ (C/-B) for a conditional attributes set B relative to another condition attributes set C w.r.t. the decision attributes set D in random decision information systems are introduced. The properties of correlation coefficients are discussed, and some concepts about random information systems and random decision information systems, such as indispensable attributes, consistent attributes and core attributes, etc, are described by using correlation coefficients. Some new methods of attribute reduction are proposed by using correlation coefficients, the corresponding algorithms based on attribute correlation coefficient are given in random information systems and in random decision information systems, the listed examples and the experimental result show that the algorithms are effective.
Keywords
Introduction
Feature selection (or attribute reduction) is a key step of knowledge discovery or data mining. An attribute subset selection involves determining a relevant subset of attributes for using knowledge discovery model. There are three categories of the attribute extraction algorithms, namely, filters, wrappers, and embedded methods [1, 2]. Attribute reduction is an aspect of knowledge discovery, which is also a feature selection method in essence. The rough set theory, introduced by Pawlak, has been conceived as a widely used tool to reduct attributes [3], and hence attribute reduction is one of the most important work in rough set theory. As we all known that attribute reduction is inefficient and NP-Hard problem [4]. And hence many scholars from different perspective, applying various research strategies and research methods, makes the attribute reduction algorithm based on rough set theory appearing constantly [5–8].
A decision rule in a decision information system is a combination of conditional attribute values and decision attribute values connected by some logical connectives. In order to get some simple decision rules in a decision information system the conditional attributes must be reduced, the essence of conditional attribute reduction of decision information system is to find a minimum set of attributes in a conditional attribute set [9–14]. In all attribute reduction methods, using the information entropy to attribute reduction is an important one. Miao et al. introduced Shannon information Entropy in information system and proposed attribute reduction algorithm based on mutual information [7, 9]. Wang et al. proposed attribute reduction algorithm based on conditional information entropy [10, 11]. Liang et al. defined the concept of information measure in information system, and proposed an attribute reduction algorithm based on information measure [13]. Qian et al. proposed a positive approximation which is used to accelerate algorithms of heuristic attribute reduction [14]. Zhang et al. studied the uncertainty measure of attribute and attribute set of random decision information system and random fuzzy information system, and used the uncertainty measure to carry out attribute reduction [15]. Liu et al. proposed the corresponding attribute reduction algorithm based on the conditional entropy theory [16]. Liu et al. proposed the corresponding attribute reduction algorithm based on the conditional entropy theory [17]. Batle et al. discuss both quantum entanglement and nonlocality (implied by the violation of Bell inequalities) constitute quantum correlations that cannot be arbitrarily shared among subsystems [18–24].
With the emergence of big data, in order to meet the demand of processing performance and real-time response requirements is a key issue that can enhance efficiency of data preprocessing, and hence many updating algorithms have been successfully applied to massive data processing and interactive applications. In processing the attribute reduction of information system in which the real data vary dynamically with time, Lang et al. investigate the mechanisms of updating the attribute reductions by the utilization of the previously results in dynamic covering decision information systems and design incremental algorithms for attribute reduction of dynamic covering decision information systems [25]. Considering the large-scale and streaming arrivals of data characteristics in the era of big data, Chen et al. proposed a hybrid parallel solution approach on the Apache Spark Platform combining data-parallel and task-parallel optimization, benefiting from the data-parallel and task-parallel optimization, the data transmission cost are effectively reduced and the performance of the algorithms are obviously improved [26].
The present paper focus on discussing two new correlation coefficients and applying them to the attributes reduction in random information systems and in random decision information systems. Firstly, probability distribution, joint probability distribution and conditional probability distribution based on random information systems and random decision information systems are given. Information entropy about logarithmic form is introduced to measure uncertainty of knowledge, and the correlation connection between information entropy, conditional information entropy, joint information entropy and mutual information entropy are analyzed. Secondly, correlation coefficient ρ (B, C) between two attribute subsets in random information systems and ρ (C/B) for a conditional attributes set B relative to another condition attributes set C with respect to (w.r.t., for short) the decision attributes set D in random decision information systems are introduced. The properties of correlation coefficients are discussed, and some concepts about random information systems and random decision information systems, such as indispensable attributes, consistent attributes and core attributes, etc, are described by using correlation coefficient. Thirdly, some new methods of attribute reduction are proposed by using correlation coefficients, the corresponding algorithms based on attribute correlation coefficient are given in random information systems and in random decision information systems, the listed examples and the experimental result show that the algorithms are effective.
Information entropy of attributes in random information systems
An information system (or database system) is a quarter-nary form (U, A, V, F), where U = {x1, x2, ⋯ , x n } is object set, every x k is called an object; A = {a1, a2, ⋯ , a m } is attribute set, every a k is called an attribute; V = ⋃ a k ∈AV(a k ) is attribute value set, V(a k ) is the value domain of attribute a k , F = {f a k : U → V(a k ), k ⩽ m} is relation set from U to A. If the attributes set of an information system is partitioned into A and D then it is called decision information system, A is referred to as condition attributes set and D is referred to as decision attributes set. An information system is referred to as random information system if there is a normal probability measure P on U, where a normal probability measure P means that ∀x ∈ U, P ({x}) >0 and ∑x∈UP ({x}) =1.
In an information system, the relation set F shows the connection between object set and attribute set and gives the information source of knowledge discovery. For example, f a k (x i ) = v means that the attribute value of attribute a k of object x i is v.
In a random information system ((U, P) , A, V, F), for a ∈ A,
Probability distribution of attribute a
Probability distribution of attribute a
Where
Analogously, we can give the joint probability distribution P (ab) and the conditional probability distribution P (b/a) as Tables 2 and 3 respectively.
Joint probability distribution of a and b
Conditional probability distribution of b given a
Where
Where
Analogously, we can obtain joint probability distribution P (B ∪ C) of two attribute subsets B and C and the conditional probability distribution P (C/B) of C given B. The above probability distribution have the following properties: (1) If a ∈ B then the joint probability distribution of {a} and B is same as the joint probability distribution of {a} and B - {a}, and it equal to the probability distribution of B. (2) The conditional probability distribution P (a/B) of {a} given B is same as the conditional probability distribution P (a/(B - {a})) of {a} given B - {a}. (3) If the probability distribution of B is same as the probability distribution of C, then the joint probability distribution P (B ∪ C) is same as the probability distribution P (B) (or P (C)).
If the attribute values in random information system are numerical, then the probability distribution of attributes subset can be regarded as the probability distribution of random variables. Hence an attribute a determines a random variable and an attributes subset B = {a1, a2, ⋯ , a q } determines a q-dimensional random vector, these random variables or random vectors are denoted as a and B. But in general, the values of attribute a may not be numerical values, and then the probability distribution defined as above may not be the probability distribution of random variables. Hence the probability analysis method is not suitable for analyzing the random information system. But since the symbols used to describe the information sources is not necessarily the numerical values in Shannon information entropy theory, so we can use the Shannon information entropy theory to analyze the random information system.
The joint information entropy of attribute a and b is defined by
The conditional information entropy of attribute b given a is defined by
The average mutual information of attribute a and b is defined by
Analogously, the above definition can be generalized to the case of subset of attributes. The following properties hold for information entropy.
H (B/C) ⩽ H (B), and if A and B are independent then the equality holds; H (CB) = H (C) + H (B/C) = H (C) + H (B) - I (C, B); I (C, B) = H (B) - H (B/C) ⩾0; I (C, B) = H (C) + H (B) - H (CB).
where
A car information system
Using the probability distribution P on U, we can compute the probability distribution of attributes. For example, the probability distribution P (a2) of attribute a2 as shown in Table 5.
Probability distribution of attribute a2
The joint probability distribution P (a2a3) of attributes a2 and a3 as shown in Table 6.
Joint probability distribution of attribute a2 and a3
The conditional probability distribution p (a3/a2) of attribute a3 given a2 as shown in Table 7.
Conditional probability distribution of attribute a3 given a2
Correlation coefficient of attributes subset in random information system
0 ⩽ ρ (B, C) ⩽1. If random vector B and C are independent, then ρ (B, C) =0. ρ (B, C) =1 if and only if the probability distribution of B is same as the probability distribution of C.
a
is indispensable w.r.t.
A
if and only if ρ (A, A - {a}) =1.
Core (A) = {a ∈ A|ρ (A - {a} , A) <1}.
By Proposition 2.2, we have that
Hence ρ (B, C) ⩾0. Since
we know that If random vectors B and C are independent, then H (B/C) = H (B), H (C/B) = H (C). It follows that I (B, C) =0. Hence ρ (B, C) =0. Suppose that the probability distribution of B is same as the probability distribution of C, then the joint probability distribution of B and C is also same as the probability distribution of B (or C). Hence the conditional probability distribution of B given C and the conditional probability distribution of C given B are taking value 0 or 1. Hence the conditional information entropy H (B/C) = H (C/B) =0. It follows that Conversely, suppose that ρ (B, C) =1. Then H (B/C) = H (C/B) =0. By the definition of conditional information entropy, we know that p (b
i
, c
j
) =0 or p (b
i
, c
j
) = p (b
i
) = p (c
j
) for any b
i
∈ B and c
j
∈ C. This shows that the probability distribution of B is same as the probability distribution of C. a is indispensable w.r.t. A if and only if the probability distribution of A - {a} is same as the probability distribution of A, and if and only if ρ (A, A - {a}) =1 by (3). By the definition of core and (4), we concludes Core (A) = {a ∈ A|ρ (A - {a} , A) <1}.
0 ⩽ η
C
(a) ⩽1; Core (A) = {a ∈ A|ηA-{a} (a) <1}; If C ⊆ B, then η
C
(a) ⩽ η
B
(a).
By the properties of information entropy, we have that if C ⊆ B then H (C) ⩽ H (B), H (a/B) ⩽ H (a/C). Hence
ρ (A - {a1} , A) = ρ (A - {a3} , A) =1, ρ (A - {a2} , A) =0.943. Let C = {a2, a3} ,
Correlation coefficient of attribute set w.r.t. the decision attribute in random decision information system
Define a correlation ρ (C/B) of two attribute subsets B and C w.r.t. the decision attribute D as follows:
then we have
Hence
In the following we denote simply
Then
where
Then
Hence
Thus when
0 ⩽ ρ (C/B) ⩽1; If C ⊆ B then ρ (C/B) =0; ρ (C/B) =1 if and only if the probability distribution of B ∪ C is the same as the probability distribution of D.
By Lemma 3.7 we have H (D/(B∪ C)) ⩽ H (D/B) then I (D, C/B) ⩾0 and I (D, C/B) = H (D/B) - H (D/(C ∪ B)) ⩽ H (D/B). Hence we obtain 0 ⩽ ρ (C/B) ⩽1. If C ⊆ B then H (D/(B ∪ C)) = H (D/B). Thus I (D, C/B) =0, so ρ (C/B) =0. If the probability distribution of B ∪ C is the same as the probability distribution of D, then the joint probability distribution P (B ∪ C ∪ D) is the same as the probability distribution of P (D). Hence the conditional probability distribution P (D/(B ∪ C)) taking value 0 or 1. This shows that H (D/(B ∪ C)) =0, so I (D, C/B) = H (D/B). Therefore ρ (C/B) =1.
Conversely, if ρ (C/B) =1 then H (D/(B ∪ C)) =0. By the definition of conditional entropy we have that P ({a i } , {d j } d j ∈D) =0 or P ({a i } , {d j } d j ∈D) = P ({d j } d j ∈D) for each attribute value a i of B ∪ C and each values set {d j } d j ∈D of D. This shows that the probability distribution of B ∪ C is the same as the probability distribution of D.
Attribute reduction methods based on correlation coefficients
Attribute reduction based on correlation coefficients in a random information system
Let Q = {a1, a2, ⋯ , a q } and a ∈ Q. If the probability distribution of Q - {a} is same as the probability distribution of Q, then a is called dispensable w.r.t. Q else a is called indispensable. If every attribute in Q is indispensable then Q is called an indispensable attribute subset. The set of allA-indispensable attributes of A is called a core of A, denote it by Core (A).
Obviously, an attribute set A may have more than one reduction, the intersection of all reduction is just the core of A.
By the Proposition 3.2 and 3.4, we have the following theorem.
H (C) = H (A), ∀a ∈ C, η
C
(a) <1,
then C is a reduction of A.
Using Theorem 4.2, we can give the following algorithm for attribute reduction.
Input: A random information system ((U, P) , A, V, F).
Output: An attribute reduction Q of A.
Set C : = Core (A).
While H (C) ≠ H (A) Do ∀a ∈ A - C, computing η
C
(a), take attribute a satisfying η
C
(a) = min{η
C
(a′) |a′ ∈ A - C}, and let C : = C ∪ {a}, compute H (C).
Endwhile
Set C′ = C - Core (A) , |C′| = N,
For i = 1 to N Do delete the ith attributes a
i
from C′, compute H (C′ ∪ Core (A)), if H (C′ ∪ Core (A)) ≠ H (A) then C′ : = C′ ∪ {a
i
}.
Endfor
In the following we discuss the time complexity of the algorithm CCAR. The time complexity of a traditional heuristic reduction algorithm based on information entropy given in [14] is O (|A|2|U| + |A||U|2), which include the computational time of partition and entropy.
At Step 1, the time complexity of computing the probability distribution of A is O (|A| (∑a k ∈Av k )), where v k is the number of attribute value of attribute a k . In a large decision information system, the values of attribute a k are usually not particularly scattered, hence we obtain ∑a k ∈Av k < |U| [29]. Thus the time complexity of computing the probability distribution of A is usually less than O (|A||U|).
At Step 2, the time complexity of computing H (A) is O (∑a k ∈Av k ), which is less than O (|U|).
At Step 3, all the ηA-{a} (a) need to be computed we know the time complexity of computing Core (A) is O (|A||U|).
At Step 4, the time complexity of computing all η
C
(a)(a ∈ A, C ⊆ A) is (|A| + (|A|-1) + ⋯ +1) × |U| = O (|A|2|U|). The time complexity to choose minimum relatively correlation attribute is
At Step 5, the time complexity of computing all H (C′ ∪ Core (A)) is less than O (|A||U|).
Hence the time complexity of algorithm CCAR is O (|A|2|U|).
Since H (C) ≠ H (A) we compute η C (a1) = ρ ({a1, a2, a3} , {a2, a3}) = 0.9203, η C (a4) =0.9203. The smallest one in {η C (a1) , η C (a4)} is η C (a4) (or η C (a1)), and hence we take a′ = a4, C : = C ∪ {a4} = {a2, a3, a4}. By computing we have that H (C) =2.25. Since H (C) = H (A), go to Step 5A.
Firstly, we take a i = a3, C′ : = C′ - {a3} = {a4}.By computing we have that H (C′ ∪ Core (A))= H ({a2, a4}) =1.75. Since H (C′∪ Core (A)) ≠H (A), we get C′ : = C′ ∪ {a3} = {a3, a4}.
Second, we take a i = a4, C′ : = C′ - {a4} = {a3}. By computing we have that = H ({a2, a3}) =1.906. Since H (C′ ∪ Core (A))≠H (A), we get C′ : = C′ ∪ {a3} = {a3, a4}.
The process of attribute reduction are simply listed in Table 8.
Attribute reduction process of the car information system
Attribute reduction process of the car information system
Let ((U, P) , A ∪ D, V, F) be a random decision information system. For C ⊆ A, attribute subset C is said consistent with the decision attribute subset D, if each probability term of the probability distribution P (D) of D is same as the probability term in P (C), or it can be expressed as the sum of some probability terms in P (C). If a conditional attribute set A is consistent w.r.t. decision attribute D, we call it as consistent random decision information system. For C ⊆ A and a ∈ C, if the conditional probability distribution P (D/(C - {a})) and P (D/C) are the same, then a is called dispensable w.r.t. the decision attribute subset D, otherwise it is called indispensable. If each attribute in C is indispensable, then C is called an indispensable attribute set. The set of all the indispensable attributes of A is called the core of A and is denoted by Core (A).
a ∈ C is dispensable if and only if ρ ({a}/ - (C - {a})) =0. C is consistent w.r.t. decision attribute D, if and only if for any a ∈ C, ρ ({a}/ - (C - {a})) =1. Core (A) = {a ∈ A|ρ ({a}/ - (A - {a})) >0}.
If a is dispensable, then the conditional probability distributions P (D/C - {a}) are the sames as P (D/A). Then H (D/C - {a}) = H (D/C). Hence I (D, {a}/ - C - {a}) =0. Thus ρ ({a}/ - (C - {a})) =0. Conversely, let ρ ({a}/ - C - {a}) =0. ThenI (D, {a}/ - C - {a}) =0. This means that H (D/C - {a}) = H (D/C). By Proposition 3.6 we have P (D/ - C) = P (D/ - C ∪ {a}). Thus a is dispensable. Because C is consistent w.r.t. decision attribute D, a term in P (D) equals to a term in P (C) or equals to the sum of some terms in P (C). Hence P (C) = P (C ∪ D). Then
Hence
Conversely, let ρ ({a}/ - C - {a}) =1. Then
(3) By the mean of Core (A) and (1), we can obtain it.
By Theorem 4.6, the below conclusion is obtained.
H (D/ - A) = H (D/ - C), ∀a ∈ A, ρ ({a}/ - (C - {a})) >0,
then C is a reduction of A.
Based on the Theorem 4.7, we give an algorithm for attribute reduction in random decision information systems.
Input: A random decision information system ((U, P) , A ∪ D, V, F).
Output: Reduction Q of A.
If H (D/Core (A)) = H (D/A), then the algorithm terminates (Core (A) is the reduction).
While H (D/C) ≠ H (D/A) Do For each a ∈ A - C, compute ρ ({a}/ -(C - {a})), Select the attribute a such that ρ ({a}/ -(C - {a})) = min {ρ ({b}/ - (C - {b})) |b ∈ A - C} and C : = C ∪ {a}, Compute H (D/C).
Endwhile
For i = 1 to N Do Delete the attribute a
i
from C′, Compute H (D/C′ ∪ Core (A)), If H (D/C′ ∪ Core (A)) ≠ H (D/A) then C′: = C′ ∪ {a
i
}.
End for
In the following we discuss the time complexity of the algorithm CCDAR.
At Step 1, the time complexity of computing the probability distribution of conditional attribute set A is O (|A| (∑a k ∈Av k )), where v k is the number of attribute value of attribute a k . In a large decision information system, the values of attribute a k are usually not particularly scattered, hence we obtain ∑a k ∈Av k < |U| [29]. Thus the time complexity of computing the probability distribution of conditional attribute set A less than O (|A||U|).
At Step 2, the time complexity of computing H (D/A) is O (∑a k ∈A∪Dv k ), followed the analysis in the above step we obtain ∑a k ∈A∪Dv k < |U|. Hence the time complexity of computing H (D/A) less than O (|U|).
At Step 3, need to compute ρ ({a}/ - (A - {a})) for all a ∈ A in computing Core (A), the time complexity of computing Core (A) is O (|U||A|).
At Step 4, the time complexity of computing all ρ ({a}/ - (C - {a}))(a ∈ C, C ⊆ A) is O ((|A| + (|A|-1) + ⋯ +1) |U|) = O (|A|2|U|).
At Step 5, the time complexity of computing all H (D/C′ ∪ Core (A)) is less than O (|A||U|).
Hence the time complexity of algorithm CCDAR is O (|A|2|U|).
In the following we give an example to illustrate the application of algorithm.
A car decision information system
A car decision information system
We find the attribute reduction of the conditional attributes set A by using CCDAR
The process of attribute reduction are simply listed in Table 10.
Attribute reduction process of the car decision information system
To show the utility of the algorithm CCDER (Algorithm 4.8) we implement this algorithm by using MTLAB programming language. The computing environment is a PC (Pentium T3400 2.16G CPU, 2-GB memory, Windows XP operating system). It is hard to find an instances of large random information system, so the test data sets are selected from UCI machine learning repository (http://archive.ics.uci.edu/ml/), which are be viewed as random decision information systems with the uniform probability distribution, to compute the probability distribution. The description of data sets is shown in Table 11. In the data sets, Mushroom, Soybean and Dermatolog are three data sets with missing values, and for a uniform treatment of all data sets, we remove the objects with missing values. Connect-4 has a much large number of objects relative to the other five data sets, so we select the first 20% of them. In these experiments, we select two attribute reduction algorithms, one is the accelerated algorithm of attribute reduction which remains the conditional entropy of target decision unchanged (FSPA-SCE) [14], and the other the incremental attribute reduction based on relational matrix (IARRMKG) [28].
Data sets description
Data sets description
The test results are listed in the Tables 12 and 13. As we can see that the reduction effect of the algorithms CCDAR, FSPA-SCE and IARRMKG are almost same, but the computational time of the algorithm CCDER is better than the algorithms FSPA-SCE and IARRMKG.
Comparison of computational time (second) between different algorithms
Comparison of attribute reduction number between different algorithms
In this paper, we study two new correlation coefficients and apply them to the attributes reduction. The concepts of information entropy about logarithmic form, conditional information entropy, joint information entropy and mutual information entropy are introduced and the connection between them is analyzed. Two new correlation coefficients are introduced, one is ρ (B, C) to describe the correlation between two attribute subsets B and C in random information system, and the other is ρ (C/B) of a conditional attributes set B relative to another condition attributes set C w.r.t. the decision attributes set D in random information system. The properties of correlation coefficient are discussed, and the concepts of indispensable attributes, consistent attributes and core attributes in random information systems or random decision information systems are described by the correlation coefficient. Based on the correlation coefficient ρ (B, C) we propose a new attribute reduction algorithm CCAR in random information systems, and based on the correlation coefficient ρ (C/B) we propose a new attribute reduction algorithm CCDAR in random decision information systems. We analyze the time complexity of algorithm CCAR and algorithm CCDAR. The listed examples and the experimental results show that the algorithms are effective.
The present paper focuses only on studying the correlation coefficients w.r.t. Shannon’s information entropy, which are used to describe indispensable attributes, consistent attributes and core attributes, etc, and are applied to attribute reduction in random information systems or random decision information systems respectively. In order to respond to the demand of updating dynamically the attribute reduction of information systems or decision information systems in big data environment, our future work will discuss two correlation coefficients ρ (B, C) and ρ (C/B) change with objects, attributes and their values, and then propose the application of correlation coefficients to the intersection of high-dimension data sets, distributed computing, big data mining and deep learning, etc.
Footnotes
Acknowledgments
This work has been supported by the Industry-University-Research Innovation Fund of Ministry of Education (No. 2018A02014), the Natural Science Foundation of Hunan Province (No. 2018JJ2371, No. 2017JJ2241).
