Identification of informational and probabilistic independence by adaptive thresholding

Abstract

Independence assumptions help Bayesian network classifiers (BNCs), e.g., Naive Bayes (NB), reduce structure complexity and perform surprisingly well in many real-world applications. Semi-naive Bayesian techniques seek to improve the classification performance by relaxing the attribute independence assumption. However, the study of dependence rather than independence has received more attention during the past decade, and the validity of independence assumptions needs to be further explored. In this paper, a novel learning technique, called Adaptive Independence Thresholding (AIT), is proposed to automatically identify informational independence and probabilistic independence. AIT can tune the network topologies of the BNCs learned from training data and from each testing instance, respectively, under the framework of target learning. Zero-one loss, bias, variance and conditional log likelihood are used to compare the classification performance in the experimental study. Extensive experimental results on a collection of 36 benchmark datasets from the UCI machine learning repository show that AIT is more effective than other learning techniques (such as structure extension and attribute weighting) and helps the final BNCs achieve remarkable classification improvements.

Keywords

Bayesian network classifier; informational independence; probabilistic independence; adaptive independence thresholding

1. Introduction

Classification is regarded as one of the key issues in data mining and statistical learning, with the aim of predicting the class of an object with an unknown label [1]. It has been widely applied in computer science and engineering to tasks such as facial expression analysis [2], positive and unlabeled learning [3] and text classification [4]. A classification model needs to establish the mapping between the input, instance $\textbf{x}$, and the output, class label $y$. The basic form of classification can be expressed as the mapping function $y=f(\textbf{x})$ or the conditional probability distribution $P(y|\textbf{x})$. To find this mapping more accurately and mine the knowledge hidden in the data, many machine learning models [5, 6], such as decision trees [7], Bayesian network classifiers (BNCs) [8, 9], support vector machines [10] and neural networks [11], have been proposed. Among them, BNCs demonstrate superior performance in representing statistical knowledge and inferring under conditions of uncertainty [12]. Learning the network topology of a BNC from data by exhaustively considering all possible structures is an NP-hard problem [13, 14, 15].

The independence assumption is the most effective way to reduce structure complexity for BNC learning. However, most studies focus on conditional dependence rather than conditional independence. Some information-theoretic metrics, such as mutual information (MI) and conditional mutual information (CMI), are applied as benchmark criteria to measure direct dependence or conditional dependence [16, 17]. It should be noted that BNCs are probabilistic models rather than informational models, and conditional independence in probability theory (denoted as CIP for short) does not correspond to conditional independence in information theory (denoted as CII for short). Weak informational dependencies are commonly regarded as probabilistic independencies. For BNC learning, information-theoretic metrics cannot distinguish between probabilistic dependence and probabilistic independence. If probabilistic dependencies are introduced into the topology of a BNC as probabilistic independencies by mistake, the estimates of conditional probabilities may be biased and the classification performance will be degraded. Therefore, the reason why the independence assumptions work needs to be further explored, and researchers need a scalable technique for learning BNCs with complex dependencies that can capture the right conditional independencies and dependencies.

The main contributions of this paper are as follows:

• We propose a novel semi-naive Bayesian learning technique, called Adaptive Independence Thresholding (AIT), to distinguish between dependence and independence in one single pass. Different criteria are applied to identify informational independence and probabilistic independence, and the resulting highly scalable algorithm combines the high expressivity of generative learning with the low bias of discriminative learning.
• We compare the performance of our algorithm with other BNCs on 36 benchmark datasets, ranging in size from 57 to 164860 instances. We show that AIT helps base BNCs (such as TAN and KDB) achieve competitive classification performance in terms of zero-one loss, bias, variance and conditional log likelihood.

The paper is organized as follows. Section 2 reviews some state-of-the-art BNCs. Section 3 clarifies the difference between informational independence and probabilistic independence. Section 4 introduces the basic idea of AIT. Section 5 presents a set of comparisons of our proposed algorithm with out-of-core BNCs on 36 UCI datasets. Finally, the last section draws conclusions and outlines directions for further research.

2. Related works

A BNC consists of two parts: the network topology $\mathcal{G}$ in the form of a directed acyclic graph (DAG) and a set of conditional probabilities that quantify the dependencies within the DAG. Nodes in the DAG correspond to the predictive attributes $\{X_{1},\cdots,X_{n}\}$ or the class variable $Y$, and edges describe the dependency relationships among them. The absence of an edge between specific nodes represents an independency, which reduces the number of parameters required to describe the probability distribution. Suppose that the order of the predictive attributes is $\{X_{1},\cdots,X_{n}\}$. The full Bayesian network classifier (FBNC) considers all possible dependency relationships among attributes, and the joint probability distribution corresponding to instance $t=(\bm{x},y)=(x_{1},\cdots,x_{n},y)$ can be factorized as follows,

$\displaystyle P(\bm{x},y)=P(y)P(\bm{x}|y)=P(y)\prod_{i=1}^{n}P(x_{i}|x_{1},\cdots,x_{i-1},y)$ (1)

As shown in Eq. (1), for FBNC attribute $X_{i}$ takes $\Pi_{i}=\{X_{1},X_{2},\cdots,X_{i-1}\}$ as its parents, whereas restricted BNCs take a subset of $\Pi_{i}$ as the parents of $X_{i}$ , thus they implicitly assume that $X_{i}$ is conditionally independent of the rest of the attributes in $\Pi_{i}$ . With the increase of the number of attributes the structure complexity of FBNC will increase exponentially and the confidence level of the probability estimation will decrease significantly. Researchers have proposed numerous approaches to learn BNC, and these works can be roughly divided into two categories in terms of independence analysis: explicit independence assumption without any priori domain knowledge [18, 19] and implicit independence assumption based on posteriori dependence identification [20, 21].

Naive Bayes (NB) explicitly assumes that attributes are conditionally independent given class $Y$. The total number of conditional independencies (TNCI) implied by NB is $\mathrm{TNCI}_{\mathrm{NB}}=(n-1)+(n-2)+\cdots+2+1+0=\frac{n(n-1)}{2}$. Superparent one-dependence estimator (SPODE) explicitly assumes that attributes are conditionally independent given super parents $\{X_{j},Y\}$. However, an inappropriate conditional independence assumption will lead to a biased estimate of the joint probability distribution and then negatively affect the decision boundary. Different BNCs encode different independence statements, where each attribute is independent of its non-descendants given the state of its parents. These independence relationships can be used to reduce the structure complexity and the number of parameters that characterize the probability distributions [22]. By adding a number of augmented edges to NB's topology, the refined BNC can achieve a trade-off between structure complexity and unbiased probability estimation. For example, tree augmented naive Bayes (TAN), proposed by Friedman et al. [23], allows each attribute to have at most one attribute as its parent. TAN first calculates the CMI between attribute $X_{i}$ and its candidate parent $X_{j}$ ($1\leqslant j\leqslant i-1$), and then finds the parent attribute of $X_{i}$ from the resulting maximum spanning tree. Each attribute in TAN, e.g., $X_{i}$, is assumed to be conditionally independent of its remaining $i-2$ candidate parents. Correspondingly, $\mathrm{TNCI}_{\mathrm{TAN}}=(n-2)+(n-3)+\cdots+2+1+0+0=\frac{(n-1)(n-2)}{2}$. Sahami et al. [24] proposed the $k$-dependence Bayesian classifier (KDB) to fit more complex datasets. KDB first compares $I(X_{i};Y)$ to determine the attribute order. Suppose that the attribute order is $\{X_{1},X_{2},\cdots,X_{n}\}$; attribute $X_{i}$ can then have $k$ parent attributes if $i>k$, or $i-1$ parent attributes otherwise. For simplicity, KDB assumes that $X_{i}$ is conditionally independent of its remaining $i-k-1$ candidate parent attributes. Hence, $\mathrm{TNCI}_{\mathrm{KDB}}=(n-k-1)+(n-k-2)+\cdots+2+1=\frac{(n-k-1)(n-k)}{2}$. As $k$ increases, KDB encodes fewer independence statements in its more complex topology, which may lead to overfitting the training data and underfitting the testing data.
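The three TNCI counts can be checked numerically. The following is a minimal sketch, assuming attributes are indexed $1,\ldots,n$ in their fixed order and that $X_{i}$ has $i-1$ candidate attribute parents; the function names are illustrative and do not come from the original work.

```python
def tnci_nb(n: int) -> int:
    """NB: X_i keeps no attribute parent, so all i-1 candidates are dropped."""
    return sum(i - 1 for i in range(1, n + 1))


def tnci_tan(n: int) -> int:
    """TAN: X_i keeps at most one attribute parent, dropping max(i-2, 0)."""
    return sum(max(i - 2, 0) for i in range(1, n + 1))


def tnci_kdb(n: int, k: int) -> int:
    """KDB: X_i keeps min(i-1, k) attribute parents, dropping max(i-1-k, 0)."""
    return sum(max(i - 1 - k, 0) for i in range(1, n + 1))


if __name__ == "__main__":
    n = 16  # e.g. a dataset with 16 attributes
    print(tnci_nb(n), tnci_tan(n), tnci_kdb(n, 2))
```

Under these assumptions, the sums agree with the closed forms $\frac{n(n-1)}{2}$, $\frac{(n-1)(n-2)}{2}$ and $\frac{(n-k-1)(n-k)}{2}$, and KDB with $k=1$ drops the same number of candidate parents as TAN.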

The information-theoretic metrics used for weighting are commonly similar to that for identifying significant conditional dependencies. Thus the network topology is further refined by weighting on the basis of structure extension, and weighting can indirectly weaken the independence assumptions and finely tune the estimates of conditional probabilities. For example, Jiang et al. [20] proposed to take each attribute as the root node of the network topology of TAN in turn. The mutual information between the root node and class variable is applied as the weighting metric. The final decision is made by aggregating the predictions of a restricted class of weighted TANs. Zaidi [21] proposed to apply conditional log likelihood and mean square error as the weighting metrics to alleviate NB’s independence assumption. Wu et al. [16] proposed to combine evolutionary computation and self-adaptive weighting. The objective function can automatically calculate the proper attribute weight value to ensure that the attribute weighting can adapt to different classification tasks. Jiang et al. [25] proposed to use mutual information as the weighting metric. A hidden parent node is introduced for each attribute to replace the dependency relationships between the attribute node and all other attributes. Lee et al. [26] argued that the importance of attribute $X_{i}$ should be measured by the amount of information provided by $X_{i}$ to the target attribute. The proposed attribute weighting naive Bayes (FWNB) algorithm takes the Kullback-Leibler measure as the weighting metric. Jiang et al. [27] proposed to weaken the influence of independent redundancy attribute by weighting. Higher weights are assigned to the attributes of NB with maximum mutual relevance and minimum average mutual redundancy.

The BNCs learned from training data cannot naturally represent asymmetric independence assertions. In other words, the conditional independence assertions or conditional dependencies learned from training data may not fit every testing instance. Some researchers propose to use variants of these information-theoretic metrics to identify conditional dependencies among attribute values in a testing instance. Wang et al. [28] proposed a novel learning framework, target learning, to mine the conditional independencies or dependencies implied in the training data and in the testing instance. Duan et al. [29] proposed to use the normalized CMI as the weighting metric for each SPODE in AODE, and the metric considers the conditional dependence (or independence) among superparent attributes, categorical variables and non-superparent attributes. Frank et al. [30] proposed to relax the independence assumption of NB by learning local models at prediction time. The algorithm weights the training instances and allocates lower weights to the instances that are far away from the test instance. This helps to mitigate the impact of attribute dependencies that may exist in the data as a whole. Chen et al. [31] proposed to assign distinct weights to different SPODEs in AODE for different testing instances. Pensar et al. [32] proposed to relax the independence assumption by assuming that, given $y$ and $x_{e}$, $X_{i}$ is independent of $X_{j}$ if and only if the criterion $P(x_{i},x_{j}|y,x_{e})=P(x_{i}|y,x_{e})P(x_{j}|y,x_{e})$ holds over all possible outcomes of $X_{i}$, $X_{j}$ and $Y$; that is, the introduced independence statements only hold for subspaces of the training data.

3. The independence analysis in terms of probability theory and information theory

As mentioned above, the independence assumption is one of the most effective methods to reduce structure complexity, although it is unrealistic in practice. If $X_{i}$ and $X_{j}$ are conditionally independent given class $Y$ from the perspective of probability theory, then over all possible outcomes of attributes $X_{i}$, $X_{j}$ and class $Y$,

$\displaystyle P(x_{i},x_{j}|y)=P(x_{i}|y)P(x_{j}|y)$ (2)

where $x_{i}\in X_{i}$ , $x_{j}\in X_{j}$ and $y\in Y$ . The probabilistic independence measured by Eq. (2) is too strict for classification and almost impossible in practice. To measure the conditional dependence between attributes $X_{i}$ and $X_{j}$ given $Y$ , researchers propose to use CMI $I(X_{i};X_{j}|Y)$ as the metric, which is defined as follows:

$\displaystyle I(X_{i};X_{j}|Y)=\sum_{X_{i}}\sum_{X_{j}}\sum_{Y}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}=\sum_{X_{i}}\sum_{X_{j}}\sum_{Y}I(x_{i};x_{j}|y)\geqslant 0,$ (3)

where $I(x_{i};x_{j}|y)$ denotes the CMI when $X_{i}$ , $X_{j}$ and $Y$ take specific values. The higher the value of $I(X_{i};X_{j}|Y)$ is, the stronger corresponding conditional dependence is. Suppose that $\delta$ is a pre-determined threshold, if $I(X_{i};X_{j}|Y)<\delta$ then the weak conditional dependence between $X_{i}$ and $X_{j}$ is assumed to be CII.
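Eq. (3) can be illustrated with a short computation. The sketch below estimates the per-cell terms $I(x_{i};x_{j}|y)$ and their sum from a list of observed triples; using plain relative frequencies without smoothing is an assumption made here for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter


def cmi_terms(samples):
    """Return {(xi, xj, y): I(xi; xj | y)} for every observed cell of Eq. (3).

    `samples` is an iterable of (xi, xj, y) tuples. Probabilities are
    relative frequencies (an assumption of this sketch; no smoothing).
    """
    samples = list(samples)
    n = len(samples)
    c_xxy = Counter(samples)
    c_iy = Counter((xi, y) for xi, _, y in samples)
    c_jy = Counter((xj, y) for _, xj, y in samples)
    c_y = Counter(y for _, _, y in samples)
    terms = {}
    for (xi, xj, y), c in c_xxy.items():
        # P(xi,xj|y) / (P(xi|y) P(xj|y)) simplifies to c * c_y / (c_iy * c_jy)
        ratio = c * c_y[y] / (c_iy[(xi, y)] * c_jy[(xj, y)])
        terms[(xi, xj, y)] = (c / n) * math.log(ratio)
    return terms


def cmi(samples):
    """Conditional mutual information I(Xi; Xj | Y) as the sum of all terms."""
    return sum(cmi_terms(samples).values())


if __name__ == "__main__":
    data = [(0, 0, 0)] * 4 + [(1, 1, 0)] * 4 + [(0, 1, 0), (1, 0, 0)]
    print(cmi_terms(data))
    print(cmi(data))
```

In this toy sample the cells $(0,1,0)$ and $(1,0,0)$ contribute negative terms even though the total is positive, which is the situation that the sign-based analysis of individual terms relies on.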

Obviously, the criterion for identifying CII relationship is different from that for identifying CIP relationship. The conditional independencies corresponding to weak conditional dependencies may result in biased estimate of conditional probability and negative effect on classification performance. The wide range of data quantity makes ever more urgent the need for highly scalable BNC learners that can identify CIP and CII relationships for refining the network topology.

The log likelihood function can measure the extent to which the learned BNC models the probability distribution of data $\mathcal{D}$ , and the CMI $I(X_{i};X_{j}|Y)$ can be represented in the form of log likelihood as follows,

$\displaystyle I(X_{i};X_{j}|Y)=\sum_{X_{i}}\sum_{X_{j}}\sum_{Y}P(x_{i},x_{j},y)\log P(x_{i},x_{j}|y)-\sum_{X_{i}}\sum_{X_{j}}\sum_{Y}P(x_{i},x_{j},y)\log\left[P(x_{i}|y)P(x_{j}|y)\right]=\sum_{X_{i}}\sum_{X_{j}}\sum_{Y}\left[LL(b_{de})-LL(b_{\textit{ind}})\right]$ (4)

where $LL(b_{de})$ and $LL(b_{\textit{ind}})$ respectively denote the log likelihood functions corresponding to conditional dependence or conditional independence when $X_{i}$ , $X_{j}$ and $Y$ take specific values. When $LL(b_{de})=LL(b_{\textit{ind}})$ always holds, $P(x_{i},x_{j}|y)=P(x_{i}|y)P(x_{j}|y)$ also holds and the CII assumption will certainly hold. In contrast, when $LL(b_{de})=LL(b_{\textit{ind}})$ only holds for some instances, the CII assumption may still hold. Probabilistic independence can lead to informational independence, but not vice versa. Thus we need further analysis to clarify independence or dependence more comprehensively from the perspective of probability theory.

In the following discussion, we will use different BNCs to clarify the difference between informational independence and probabilistic independence. Take TAN and dataset mfeat-mor (see Table 1 for details) as an example for experimental study. TAN compares the CMIs among all attribute pairs, and then selects significant conditional dependencies to build the maximum spanning tree. For dataset mfeat-mor, $I(X_{6};X_{4}|Y)$ and $I(X_{2};X_{1}|Y)$ respectively correspond to the maximum and the minimum among all CMI values. The distributions of the values of $I(x_{6};x_{4}|y)$ and $I(x_{2};x_{1}|y)$ are shown in Fig. 1a, where the values of $I(x_{i};x_{j}|y)$ are sorted in descending order. Figure 1a shows that there are many instances with $I(x_{i};x_{j}|y)<0$ for attribute pair $\{X_{6},X_{4}\}$ or $\{X_{2},X_{1}\}$, and that the conditional dependence in probability theory (denoted as CDP for short) between $\{X_{6},X_{4}\}$ is stronger than that between $\{X_{2},X_{1}\}$ in most cases. The bar charts in Fig. 1b and c respectively visualize the probability (i.e., $P(x_{i},x_{j},y)$) when $I(x_{i};x_{j}|y)<0$, $=0$ or $>0$. As shown in Fig. 1b, the probability of the CDP relationship between attribute values of $\{X_{6},X_{4}\}$ is over 83.38% among all possible outcomes of $X_{6}$, $X_{4}$ and $Y$, that of the CIP relationship is less than 16.62%, and the former is much greater than the latter. That is, the CDP relationship happens much more often than the CIP relationship, thus it is reasonable to add undirected edge $X_{6}$ – $X_{4}$ to the topology of TAN. In contrast, as shown in Fig. 1c, the probability of the CDP relationship between attribute values of $\{X_{2},X_{1}\}$ is over 53.31% among all possible outcomes of $X_{2}$, $X_{1}$ and $Y$, thus the probability of the CIP relationship is as high as 46.69%, and the difference between them is not significant.

Figure 1. Distributions of values of $I(x_{6};x_{4}|y)$ , $I(x_{2};x_{1}|y)$ and $I(x_{3};x_{1}|y)$ .

With the increase of the structure complexity, more weak dependencies will be encoded in the learned BNC [33], and the risk of occurrence of the CIP relationship will increase correspondingly. For example, KDB can represent arbitrary $k$-dependence relationships between attribute $X_{i}$ and its parents. For KDB with $k=2$, $I(X_{3};X_{1}|Y)$ achieves the minimum among all CMI values. As shown in Fig. 1d, the probability of the CDP relationship between attribute values of $\{X_{3},X_{1}\}$ is over 41.32% among all possible outcomes of $X_{3}$, $X_{1}$ and $Y$, whereas the probability of the CIP relationship is as high as 58.68%, and the latter is much greater than the former. That is, the CIP relationship happens much more often than the CDP relationship, thus it is unreasonable to add undirected edge $X_{3}$ – $X_{1}$ to the topology of KDB.

The non-negativity of CMI often misleads researchers into believing that a conditional dependence, weak or strong, always exists between an attribute pair, and that adding augmenting edges to the topology of a BNC can only improve the classification result. However, when attributes take different values, dependency relationships do not necessarily exist. Even for strong conditional dependencies measured by CMI, the CIP relationship happens for some instances, and for weak conditional dependencies it happens much more often. To avoid biased estimates of conditional probabilities, we need to identify these CIP relationships and remove them from the network topology. Otherwise, as more CIP relationships are processed as CDP relationships, the negative effect will accumulate and the classification performance will be degraded greatly. This may explain why KDB with $k\geqslant 2$ performs worse than TAN in some cases. Thus CMI $>0$ should not be used as the only criterion to identify conditional dependence, and the mapping between informational (in)dependence and probabilistic (in)dependence should be accurately described.
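The kind of summary reported in Fig. 1b to d, i.e., the probability mass $P(x_{i},x_{j},y)$ falling on cells with $I(x_{i};x_{j}|y)<0$, $=0$ or $>0$, can be sketched as follows. The joint distribution, the tolerance `tol` and the function name are illustrative assumptions; the numbers are not taken from mfeat-mor.

```python
import math
from collections import defaultdict


def sign_masses(joint, tol=1e-12):
    """Split probability mass by the sign of the local term I(xi; xj | y).

    `joint` maps (xi, xj, y) -> P(xi, xj, y). Returns the total probability
    of cells whose term is negative, zero (within `tol`), or positive.
    """
    p_iy, p_jy, p_y = defaultdict(float), defaultdict(float), defaultdict(float)
    for (xi, xj, y), p in joint.items():
        p_iy[(xi, y)] += p
        p_jy[(xj, y)] += p
        p_y[y] += p
    masses = {"negative": 0.0, "zero": 0.0, "positive": 0.0}
    for (xi, xj, y), p in joint.items():
        if p == 0.0:
            continue  # empty cells carry no probability mass
        # The sign of the term equals the sign of
        # log[ P(xi,xj|y) / (P(xi|y) P(xj|y)) ].
        log_ratio = math.log(p * p_y[y] / (p_iy[(xi, y)] * p_jy[(xj, y)]))
        if log_ratio > tol:
            masses["positive"] += p
        elif log_ratio < -tol:
            masses["negative"] += p
        else:
            masses["zero"] += p
    return masses


if __name__ == "__main__":
    toy = {(0, 0, 0): 0.4, (1, 1, 0): 0.4, (0, 1, 0): 0.1, (1, 0, 0): 0.1}
    print(sign_masses(toy))
```

For this toy distribution the CMI is strictly positive, yet 20% of the probability mass lies on cells whose local term is negative, which mirrors the observation that a positive CMI does not imply dependence for every combination of values.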

4. Adaptive independence thresholding for identifying conditional independence

As discussed in Section 3, the independence from the perspective of information theory is not equivalent to the independence from the perspective of probability theory. Thus in this paper, we propose to use adaptive independence thresholding (AIT) to discriminate conditional independence from weak conditional dependence. The two components of AIT, AIT-CII and AIT-CIP, can respectively identify CII and CIP relationships.

By comparing the CMI values with the threshold, the conditional independencies identified by AIT can help build a robust BNC with a simplified network topology and reduce the computational overhead needed to perform inference and classification. For AIT, the thresholds should be determined carefully rather than arbitrarily. A high threshold may result in unjustified edge removals and an overly simple network topology. A low threshold may yield redundant edges and thus an overly complex network topology that overfits the data and leads to high-order probability distributions, which decrease reliability and increase running time. Therefore, it is necessary to tune the threshold adaptively for different datasets to achieve a trade-off between bias and variance.

4.1 Threshold identification of conditional independence from the perspective of information theory

If $I(X_{i};X_{j}|Y)=0$ holds, then $X_{i}$ and $X_{j}$ can be assumed to be conditionally independent. However, this criterion is too strict for identifying the CII relationship in the topology of BNC. To alleviate this unrealistic assumption, we formalize the concepts of conditional independence and dependence by introducing threshold $T_{I}$ as follows,

Definition 1. Given training set $\mathcal{T}$ , the relationship between $X_{i}$ and $X_{j}$ given $Y$ is conditional independence in information theory (CII) when $0\leqslant I(X_{i};X_{j}|Y)\leqslant T_{I}$ or conditional dependence in information theory (CDI) when $I(X_{i};X_{j}|Y)>T_{I}$ .

The threshold $T_{I}$ is the baseline to discriminate between weak conditional dependence and strong conditional dependence, and the weak conditional dependencies are regarded as conditional independencies in the subsequent analysis. Since the number of CMI values increases with the number of attributes, a reasonable search interval needs to be determined first to reduce the computational overhead.

AIT-CII sorts all the $\frac{n(n-1)}{2}$ CMI values in ascending order and adds them to the list $L$. Then $L$ is divided into $m$ subintervals, i.e., $\{L_{1},\cdots,L_{m}\}$. The CMI values are evenly distributed into the subintervals, and each subinterval has at most 5 CMI values. To efficiently identify the threshold, we first need to determine the search interval, i.e., $L_{i}$ ($1\leqslant i\leqslant m$) rather than $L$. The average of the CMI values in $L_{i}$, called $\mathrm{ACMI}_{i}$, is computed for comparison. If $\mathrm{ACMI}_{i-1}>1.05\times\mathrm{ACMI}_{i}$ holds, then $\mathrm{ACMI}_{i-1}$ is assumed to be significantly higher than $\mathrm{ACMI}_{i}$, and $L_{i}$ is regarded as one of the candidate search intervals. Among these intervals, the one which corresponds to the minimum of ACMI is identified as the right interval (denoted as $\mathcal{L}$), within which the acute angle $\theta$ is introduced to search for the right threshold.
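The interval selection step can be sketched as follows. This is only an illustration under stated assumptions: the values are split into the smallest number of near-equal chunks of at most 5 values; the comparison direction follows the AIT-CII pseudocode ($\mathrm{ACMI}_{j}\geqslant 1.05\times\mathrm{ACMI}_{j-1}$ on an ascending list); and the loop stops at the second subinterval so that no undefined $\mathrm{ACMI}_{0}$ is needed. Function names are hypothetical.

```python
import math


def split_into_subintervals(values, max_size=5):
    """Sort ascending and split into near-equal chunks of at most `max_size`."""
    ordered = sorted(values)
    if not ordered:
        return []
    m = math.ceil(len(ordered) / max_size)
    base, extra = divmod(len(ordered), m)
    chunks, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


def select_search_interval(chunks, ratio=1.05):
    """Return (index of the chosen chunk, list of chunk averages).

    Scans from the last chunk down to the second one; whenever a chunk's
    average exceeds its predecessor's by the given ratio, the predecessor
    becomes the current choice, so the lowest such chunk wins.
    """
    acmi = [sum(c) / len(c) for c in chunks]
    chosen = len(chunks) - 1  # start from the last subinterval
    for j in range(len(chunks) - 1, 0, -1):
        if acmi[j] >= ratio * acmi[j - 1]:
            chosen = j - 1
    return chosen, acmi


if __name__ == "__main__":
    values = [1.0] * 5 + [1.01] * 5 + [2.0] * 5
    chunks = split_into_subintervals(values)
    print(select_search_interval(chunks))
```

In the usage example, the first two chunk averages differ by less than 5%, so the jump between the second and third chunks determines the choice and the second chunk (index 1) is returned.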

Definition 2. Given three points with coordinates in a rectangular coordinate system, i.e., $A(u_{1},v_{1})$, $B(u_{2},v_{2})$, $C(u_{3},v_{3})$ (as shown in Fig. 2), where $u_{1}<u_{2}<u_{3}$ and $v_{1}\geqslant v_{2}\geqslant v_{3}$, the cosine of the acute angle $\theta$ between vectors $\overrightarrow{\mathrm{BA}}$ and $\overrightarrow{\mathrm{BC}}$ at point $B$ is calculated as follows [34]:

$\displaystyle\cos\theta_{B}=\frac{\overrightarrow{\mathrm{BA}}\cdot\overrightarrow{\mathrm{BC}}}{|\overrightarrow{\mathrm{BA}}|\,|\overrightarrow{\mathrm{BC}}|}=\frac{|(u_{2}-u_{1})(u_{3}-u_{2})+(v_{2}-v_{1})(v_{3}-v_{2})|}{\sqrt{(u_{2}-u_{1})^{2}+(v_{2}-v_{1})^{2}}\sqrt{(u_{3}-u_{2})^{2}+(v_{3}-v_{2})^{2}}}$ (5)

Figure 2. An example of the knee point.

Thus, the acute angle $\theta$ can be computed as follows:

$\displaystyle\theta_{B}=\arccos\frac{|(u_{2}-u_{1})(u_{3}-u_{2})+(v_{2}-v_{1})(v_{3}-v_{2})|}{\sqrt{(u_{2}-u_{1})^{2}+(v_{2}-v_{1})^{2}}\sqrt{(u_{3}-u_{2})^{2}+(v_{3}-v_{2})^{2}}}$ (6)

Obviously, the larger the acute angle is, the more significant the difference between $A$ and $C$ is.

Definition 3. Given a set of discrete points in rectangular coordinate system and corresponding acute angles, the knee point, P, is the one that corresponds to the maximum of the acute angles.

In the following discussion, we map the CMI values into discrete points in a rectangular coordinate system. The $i$-th CMI value (i.e., $I_{i}$) in $\mathcal{L}$ corresponds to point $P_{i}(I_{i},i)$. From Eq. (6), the acute angle $\theta_{i}$ for point $P_{i}(I_{i},i)$ is computed as follows:

$\displaystyle\theta_{i}=\arccos\frac{|(I_{i}-I_{i-1})(I_{i+1}-I_{i})+1|}{\sqrt{(I_{i}-I_{i-1})^{2}+1}\sqrt{(I_{i+1}-I_{i})^{2}+1}}$ (7)

A heuristic search then compares the acute angles of the CMI values in list $\mathcal{L}$; the knee point is the one with the maximum acute angle, and its CMI value is identified as the threshold.
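The knee-point search based on Eq. (7) can be sketched in a few lines. The sketch assumes that the values in the search interval are already sorted and that only interior points (those with both a predecessor and a successor) are candidates; the function names and the handling of ties (first maximum wins) are illustrative choices rather than part of the original method.

```python
import math


def acute_angle(prev, cur, nxt):
    """Angle theta_i of Eq. (7) for three consecutive sorted values."""
    d1, d2 = cur - prev, nxt - cur
    cos_theta = abs(d1 * d2 + 1.0) / (
        math.sqrt(d1 * d1 + 1.0) * math.sqrt(d2 * d2 + 1.0)
    )
    # Clamp to guard against rounding slightly above 1 before acos.
    return math.acos(min(1.0, cos_theta))


def knee_threshold(values):
    """Return the interior value whose acute angle is the largest."""
    if len(values) < 3:
        raise ValueError("at least three values are needed to form an angle")
    best = max(
        range(1, len(values) - 1),
        key=lambda i: acute_angle(values[i - 1], values[i], values[i + 1]),
    )
    return values[best]


if __name__ == "__main__":
    interval = [0.0, 0.1, 0.2, 1.5, 3.5]
    print(knee_threshold(interval))
```

For equally spaced values the three points are collinear and the angle is zero, so the maximum is attained where the spacing changes most abruptly; in the usage example this happens at the value 0.2.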

The AIT-CII algorithm listed below details the learning procedure, including how to determine the search interval and identify the threshold. The CII relationships are then identified and removed from the network topology of $\mathrm{BNC}_{\mathcal{T}}$ according to the threshold $T_{I}$.

Algorithm AIT-CII
Input: Training set $\mathcal{T}$ with attributes $\{X_{1},\ldots,X_{n}\}$ and class $Y$.
Output: The threshold $T_{I}$.
Compute $I(X_{i};X_{j}|Y)$ for each pairwise combination of attributes ($i\neq j$)
Sort the values of $I(X_{i};X_{j}|Y)$ in ascending order and add them to list $L$
Divide $L$ into $m$ subintervals $\{L_{1},\ldots,L_{m}\}$
Compute $\mathrm{ACMI}_{i}$ for each $L_{i}$
Let $L_{T}=L_{m}$
For $j=m$ to 1:
    If $\mathrm{ACMI}_{j}\geqslant\mathrm{ACMI}_{j-1}\times 1.05$, then $L_{T}=L_{j-1}$
Compute the value of $\theta_{k}$ for each value of $I(X_{i};X_{j}|Y)$ in $L_{T}$ (see Eq. (5))
Select $\theta_{\max}$ by comparing the values of $\theta_{k}$
Select the $\hat{I}(X_{i};X_{j}|Y)$ corresponding to $\theta_{\max}$
$T_{I}=\hat{I}(X_{i};X_{j}|Y)$
Return $T_{I}$

4.2 Threshold identification of conditional independence from the perspective of probability theory

A complex network topology may make the learned BNC overfit the training data and underfit the testing data, which may result in high variance and inappropriate identification of conditional dependencies. To address this issue, Wang et al. [12] proposed the target learning (TL) framework to learn $\mathrm{BNC}_{\mathcal{T}}$ from training data $\mathcal{T}$ and a specific $\mathrm{BNC}_{\mathcal{P}}$ from each testing instance $\mathcal{P}$. $\mathrm{BNC}_{\mathcal{T}}$ and $\mathrm{BNC}_{\mathcal{P}}$ work jointly to make the final prediction. Given testing instance $\mathcal{P}=\{x_{1},\cdots,x_{n},\hat{y}\}=\{\textbf{x},\hat{y}\}$, where $\hat{y}$ is the pseudo class label for $\mathcal{P}$, the estimate of conditional local mutual information (CLMI) $I(x_{i};x_{j}|\hat{y})$ is defined as follows:

$\displaystyle I(x_{i};x_{j}|\hat{y})=P(x_{i},x_{j},\hat{y})\log\frac{P(x_{i},x_{j}|\hat{y})}{P(x_{i}|\hat{y})P(x_{j}|\hat{y})}$ (8)

To learn the network topology from the testing instance and remove redundant edges, AIT-CIP focuses on the conditional dependence between attribute values. The threshold $T_{P}$ is introduced to differentiate the conditional independencies from conditional dependencies. Therefore, the concept of conditional independence and dependence in terms of probability theory, i.e., CIP and CDP, can be defined as follows:

Definition 4. Given an instance $\textbf{x}=\{x_{1},\cdots,x_{n},y\}$ , the relationship between $x_{i}$ and $x_{j}$ is conditional independence in probability theory (CIP) when $I(x_{i};x_{j}|y)\leqslant T_{P}$ or conditional dependence in probability theory (CDP) when $I(x_{i};x_{j}|y)>T_{P}$ .
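Eq. (8) and Definition 4 can be combined into a small worked example. The sketch below takes the required probabilities as plain numbers; the probability values, the threshold `t_p` and the function names are hypothetical, and the convention that an empty cell contributes zero is an assumption of the sketch.

```python
import math


def clmi(p_joint, p_i_given_y, p_j_given_y, p_y):
    """Eq. (8): P(xi,xj,y) * log[ P(xi,xj|y) / (P(xi|y) P(xj|y)) ]."""
    if p_joint == 0.0:
        return 0.0  # convention 0 * log 0 = 0 (assumption of this sketch)
    p_pair_given_y = p_joint / p_y
    return p_joint * math.log(p_pair_given_y / (p_i_given_y * p_j_given_y))


def relationship(value, t_p):
    """Definition 4: CIP if the CLMI is at most T_P, otherwise CDP."""
    return "CIP" if value <= t_p else "CDP"


if __name__ == "__main__":
    t_p = 0.01  # hypothetical threshold
    strong = clmi(p_joint=0.20, p_i_given_y=0.5, p_j_given_y=0.5, p_y=0.5)
    weak = clmi(p_joint=0.05, p_i_given_y=0.5, p_j_given_y=0.5, p_y=0.5)
    print(strong, relationship(strong, t_p))
    print(weak, relationship(weak, t_p))
```

In the first case the pair of values co-occurs more often than independence would predict, so the CLMI is positive and exceeds the threshold; in the second case it co-occurs less often, the CLMI is negative, and the pair is treated as CIP.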

The threshold $T_{P}$ is the baseline to discriminate between probabilistic dependence and probabilistic independence. Similar to AIT-CII, AIT-CIP sorts the values of $I(x_{i};x_{j}|y)$ in ascending order and adds them to the list $L$, which is then divided into $m$ subintervals, i.e., $\{L_{1},\cdots,L_{m}\}$. These values are evenly distributed into the subintervals, and each subinterval has at most 5 values. The average of $I(x_{i};x_{j}|y)$ for each subinterval, i.e., $\mathrm{acmi}_{i}$, is computed for comparison. The search interval, i.e., $\mathcal{L}$, should correspond to the minimum of acmi and satisfy the criterion that $\mathrm{acmi}_{i-1}>1.05\times\mathrm{acmi}_{i}$. Within the search interval, AIT-CIP identifies the threshold $T_{P}$ according to Eq. (7).

When the class variable $Y$ takes different values, the CIP relationship may change greatly. AIT-CIP removes the CIP relationships according to the threshold $T_{P}$, which optimizes the network topology of $\mathrm{BNC}_{\hat{\textbf{x}}}$ learned from the testing instance $\hat{\textbf{x}}$ and enhances the estimate of $P(\hat{\textbf{x}},y)$.

The AIT-CIP algorithm listed below shows the details, including how to determine the search interval and identify the threshold.

Algorithm AIT-CIP
Input: Testing instance $\hat{\textbf{x}}=(x_{1},x_{2},\ldots,x_{n})$.
Output: The threshold $T_{P}$.
Calculate $I(x_{i};x_{j}|y)$ for each pairwise combination of attribute values ($i\neq j$) and each class label $y$
Sort the values of $I(x_{i};x_{j}|y)$ in ascending order and add them to list $L$
Divide $L$ into $m$ subintervals $\{L_{1},\ldots,L_{m}\}$
Compute $\mathrm{acmi}_{i}$ for each $L_{i}$
Let $L_{T}=L_{1}$
For $j=m$ to 1:
    If $\mathrm{acmi}_{j}\geqslant\mathrm{acmi}_{j-1}\times 1.05$, then $L_{T}=L_{j-1}$
Compute, according to Eq. (5), the value of $\theta_{k}$ for each value of $I(x_{i};x_{j}|y)$ in $L_{T}$
Select $\theta_{\max}$ by comparing the values of $\theta_{k}$
Select the $\hat{I}(x_{i};x_{j}|y)$ corresponding to $\theta_{\max}$
$T_{P}=\hat{I}(x_{i};x_{j}|y)$
Return $T_{P}$

4.3 Time complexity

Bayesian model averaging is theoretically the optimal method for combining learned models. After AIT-CII and AIT-CIP are applied to refine the network topologies of $\mathrm{BNC}_{\mathcal{T}}$ and $\mathrm{BNC}_{\mathcal{P}}$, which are learned from training data $\mathcal{T}$ and testing instance $\mathcal{P}$ respectively, the final BNCs (i.e., $\mathrm{BNC}^{\mathrm{AIT}}_{\mathcal{T}}$ and $\mathrm{BNC}^{\mathrm{AIT}}_{\mathcal{P}}$) make the final decision by simply averaging the class-membership probabilities that each of them produces. AIT-CII calculates the CMI values for each pair of attribute values and class label, which takes $\mathrm{O}(tm_{c}n^{2})$ time, where $t$ is the number of instances in the training set, $m_{c}$ is the number of classes, and $n$ is the number of attributes (dominated by calculating $I(X_{i};X_{j}|Y)$ in Eq. (4)). Next, AIT-CII sorts these CMI values, divides them into $m$ intervals, calculates and compares the mean values of the intervals, determines the final search interval and finds the threshold, which takes $\mathrm{O}(n^{\prime}mt\log{t})$ time, where $n^{\prime}$ is the number of instances in each sub-interval. Similar to AIT-CII, AIT-CIP needs to establish a pseudo training set for each test instance and calculate CLMI, which takes $\mathrm{O}(rm_{c}n^{2})$ time, where $r$ is the number of instances in the pseudo training set. The algorithm then calculates and compares the means to find the final threshold. Therefore, the computational complexity of AIT-CIP to get the appropriate threshold is $\mathrm{O}(n^{\prime\prime}mr\log{r})$, where $n^{\prime\prime}$ is the number of instances in each sub-interval. Finally, the computational complexity of AIT is $\mathrm{O}(tm_{c}n^{2}+n^{\prime}mt\log{t}+n^{\prime\prime}mr\log{r})$.
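The final decision rule, a simple average of the class-membership probabilities of the two refined classifiers, can be sketched as follows. The two posterior dictionaries stand in for the outputs of the classifiers learned from the training data and from the testing instance; their values and the function name are hypothetical.

```python
def average_posteriors(post_t, post_p):
    """Average two class-probability dicts; return (predicted label, averages).

    A class missing from one dict is treated as having probability 0 there
    (an assumption of this sketch).
    """
    labels = set(post_t) | set(post_p)
    avg = {c: 0.5 * (post_t.get(c, 0.0) + post_p.get(c, 0.0)) for c in labels}
    return max(avg, key=avg.get), avg


if __name__ == "__main__":
    from_training = {"a": 0.6, "b": 0.4}   # hypothetical output of the first BNC
    from_instance = {"a": 0.3, "b": 0.7}   # hypothetical output of the second BNC
    print(average_posteriors(from_training, from_instance))
```

In this example the first classifier favours class "a" and the second favours class "b"; after averaging, class "b" has the higher probability and is returned.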

5. Experiments and results

To assess the efficiency and effectiveness of our proposed algorithm, AIT, we evaluate its performance on 36 datasets from the UCI machine learning repository [35]. The detailed characteristics of the datasets are shown in Table 1, where the 36 datasets are sorted in ascending order of the number of instances. The number of instances ranges from a minimum of 57 to a maximum of 164860, which allows us to compare classifiers on datasets of various sizes. These datasets are divided into two groups, with the number of instances $<2000$ (small datasets) or $\geqslant 2000$ (large datasets). At the same time, the number of class labels ranges from 2 to 50, which allows us to analyze the performance of AIT on datasets with different numbers of labels. Missing values for qualitative attributes are replaced with modes and those for quantitative attributes are replaced with means from the training data. For each benchmark dataset, numeric attributes are discretized using Minimum Description Length (MDL) [25]. All experiments are assessed via 10 runs of 10-fold cross validation, and for each fold the training set and testing set respectively account for 90% and 10% of the dataset. All algorithms run on the same training data and are evaluated on the same testing data.

Table 1
Datasets

No.	Dataset	Instance	Att	Class	No.	Dataset	Instance	Att	Class
1	Labor-negotiations	57	16	2	19	Splice-c4.5	3177	60	3
2	Post-operative	90	8	3	20	Kr-vs-kp	3196	36	2
3	Zoo	101	16	7	21	Dis	3772	29	2
4	Wine	178	13	3	22	Abalone	4177	8	3
5	Sonar	208	60	2	23	Spambase	4601	57	2
6	Glass-id	214	9	3	24	Phoneme	5438	7	50
7	Hungarian	294	13	2	25	Page-blocks	5473	10	5
8	Heart-disease-c	303	13	2	26	Optdigits	5620	64	10
9	Primary-tumor	339	17	22	27	Mushrooms	8124	22	2
10	Ionosphere	351	34	2	28	Pendigits	10992	16	10
11	House-votes-84	435	16	2	29	Sign	12546	8	3
12	Soybean	683	35	19	30	Seermdl	18962	13	2
13	Credit-a	690	15	2	31	Magic	19020	10	2
14	Crx	690	15	2	32	Letter-recog	20000	16	26
15	Tic-tac-toe	958	9	2	33	Adult	48842	14	2
16	German	1000	20	2	34	Shuttle	58000	9	7
17	Yeast	1484	8	10	35	Waveform	100000	21	3
18	Mfeat-mor	2000	6	10	36	Localization	164860	5	11

For comparison purposes, we apply AIT to refine the network topologies of the state-of-the-art TAN and KDB. The final BNCs (i.e., TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ ) are compared with five other algorithms, which are described as follows:

• CFWNB [27]: Correlation-based feature weighting filter for NB, which assigns higher weights to the attributes with maximum mutual relevance and minimum average mutual redundancy (i.e., highly predictive attributes).

• FKDB [36]: Flexible KDB, which is based on KDB and uses conditional entropy to measure causality between attributes.

• SKDB [37]: Selective KDB, which extends KDB by selecting attribute subsets and values of $k$ .

• TAODE [31]: Targeted AODE, which uses log likelihood to assign distinct weights to different SPODEs in AODE.

• IWAODE [29]: Instance-based weighted AODE, which assigns distinct weights to different SPODEs in AODE for different test instances.

Tables A1–A4 in the Appendix respectively show the experimental results in terms of zero-one loss (ZOL), bias, variance and RMSE. We employ the Win/Draw/Loss (WDL) record to interpret the results. We set the significance level to 0.05, i.e., if the output of a one-tailed binomial sign test is less than 0.05 then we assume that there exists a significant difference between the experimental results. Tables 2–4 respectively show the corresponding WDL records in terms of ZOL, bias and variance. Each cell $[i,j]$ in these tables contains the numbers of wins, draws and losses when comparing the classifier in row $i$ to the classifier in column $j$ . Figures 4–6 describe the extent to which TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ perform better or worse than the other five BNCs on small and large datasets in terms of ZOL, bias and variance, and the number above each bar denotes the average number of wins or losses when comparing TAN ${}^{\emph{AIT}}$ or KDB ${}^{\emph{AIT}}$ with the other five BNCs.
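To make the significance check concrete, the following Python sketch computes a one-tailed binomial sign test from the wins and losses of a WDL record. It is a minimal illustration rather than the exact procedure used in the experiments: discarding draws and testing wins against a fair-coin null hypothesis are assumptions, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-tailed binomial sign test on a win/loss record.

    Returns P(X >= wins) for X ~ Binomial(wins + losses, 0.5).
    Draws are assumed to be discarded before calling this function.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n
```

Under these assumptions, a record of 20 wins and 5 losses gives a p-value of about 0.002, which is below 0.05, whereas 16 wins and 12 losses gives a p-value above 0.05.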

5.1 Zero-one loss

Zero-one loss is a standard loss function in classification [38], which can intuitively evaluate the extent to which an algorithm performs well or poorly. As shown in Table 2, after AIT is introduced to discriminate between conditional independencies and conditional dependencies, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ enjoy significant advantages over their base BNCs, i.e., TAN and KDB. For example, TAN ${}^{\emph{AIT}}$ performs much better than TAN (20 wins and 5 losses), and KDB ${}^{\emph{AIT}}$ possesses significant advantages over KDB (18 wins and 4 losses). In addition, as the structure complexity increases, more weak conditional dependencies will be introduced into the network topology, and the effectiveness of AIT will be more significant. The advantage of KDB over TAN is not significant (16 wins and 12 losses), whereas KDB ${}^{\emph{AIT}}$ performs much better than TAN (22 wins and 2 losses) and TAN ${}^{\emph{AIT}}$ (9 wins and 1 loss).

Table 2
WDL records for all BNCs in terms of zero-one loss

	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$
KDB	16-8-12	–	–	–	–	–	–	–
CFWNB	12-9-15	13-4-19	–	–	–	–	–	–
FKDB	16-13-7	9-22-5	18-7-11	–	–	–	–	–
SKDB	15-16-5	14-17-5	19-6-11	12-15-9	–	–	–	–
TAODE	16-3-7	19-7-10	19-8-9	16-9-11	13-14-9	–	–	–
IWAODE	16-14-6	17-10-9	21-8-7	19-7-10	14-13-9	9-22-5	–	–
TAN ${}^{\emph{AIT}}$	20-11-5	15-12-9	22-7-7	16-10-10	11-16-9	10-21-5	8-22-6	–
KDB ${}^{\emph{AIT}}$	22-12-2	18-14-4	23-6-7	17-12-7	14-16-6	13-18-5	13-17-6	9-26-1

In addition, after AIT is applied to refine TAN and KDB, the average zero-one loss of TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ also decreases significantly. The average zero-one loss of TAN ${}^{\emph{AIT}}$ decreases by 5% when compared with TAN, and that of KDB ${}^{\emph{AIT}}$ decreases by nearly 8% when compared with KDB. This indicates that applying AIT to a BNC can effectively help distinguish dependence from independence. By removing redundant edges, AIT can help refine the topology of high-dependence BNCs, make the joint probability fit the training data well and improve classification performance significantly. Among the other five BNCs, FKDB and SKDB outperform CFWNB, which suggests that structure extension is more effective in improving classification performance than attribute weighting. Due to the advantage of ODE aggregation, TAODE and IWAODE perform better than CFWNB, FKDB and SKDB.

During the past decade researchers began to study the possibility of scaling up existing learning algorithms as the data quantity increases. Brain and Webb [39] argued that accurate learners for large data will achieve lower bias than accurate learners for small data. To further illustrate the effectiveness of AIT on large data, Goal Difference (GD) [40] is introduced and defined as follows,

$\displaystyle\rm{GD}(A_{1};A_{2}|S_{t})=|\textit{win}|-|\textit{loss}|$ (9)

where $A_{1}$ and $A_{2}$ are two classification algorithms, $S_{t}=\{D_{1},\cdots,D_{t}|1\leqslant t\leqslant m\}$ is the subset of the datasets shown in Table 1 and $t$ is the corresponding index number, and $|\textit{win}|$ and $|\textit{loss}|$ are the numbers of datasets in $S_{t}$ on which $A_{1}$ performs better or worse than $A_{2}$ , respectively.
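As an illustration of Eq. (9), the following Python sketch computes GD from two lists of per-dataset zero-one losses ordered as in Table 1. It is a sketch under one assumption: a win is counted whenever the loss of $A_{1}$ is strictly lower than that of $A_{2}$, with no significance test applied. The function name `goal_difference` is hypothetical.

```python
def goal_difference(loss_a1, loss_a2, t):
    """GD(A1; A2 | S_t): wins minus losses of A1 against A2 on D_1..D_t.

    loss_a1, loss_a2: per-dataset zero-one losses, ordered as in Table 1.
    A win is assumed to mean a strictly lower loss; ties count as draws.
    """
    if not 1 <= t <= min(len(loss_a1), len(loss_a2)):
        raise ValueError("t must index into both result lists")
    pairs = list(zip(loss_a1[:t], loss_a2[:t]))
    wins = sum(a < b for a, b in pairs)
    losses = sum(a > b for a, b in pairs)
    return wins - losses


# Zero-one loss on the first five datasets of Table A1.
tan_ait = [0.0533, 0.3333, 0.0099, 0.0393, 0.2260]
tan = [0.1053, 0.3667, 0.0099, 0.0337, 0.2212]
gd_curve = [goal_difference(tan_ait, tan, t) for t in range(1, 6)]
```

Evaluating GD for every $t$ in this way yields the sequence of points to which a curve such as the one in Fig. 3 can be fitted.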

Figure 3.

The fitting curves of GD between TAN and TAN ${}^{\emph{AIT}}$ , KDB and KDB ${}^{\emph{AIT}}$ in terms of zero-one loss.

Figure 3 shows the fitting curves of GD in terms of zero-one loss. The X-axis represents the index number of the dataset. As can be seen from Fig. 3, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ both show an obvious performance improvement over their base classifiers. For small datasets ( $1\leqslant t<18$ ), the fitting curve of GD $(\rm{TAN}^{\emph{AIT}};\rm{TAN}|S_{t})$ rises in a wavelike form, which indicates an unstable advantage of TAN ${}^{\emph{AIT}}$ over TAN. For large datasets ( $t\geqslant 18$ ), the low amplitude of the fitting curve indicates that TAN ${}^{\emph{AIT}}$ performs better than TAN much more often than not. In contrast, the fitting curve of GD $(\rm{KDB}^{\emph{AIT}};\rm{KDB}|S_{t})$ indicates that the advantage of KDB ${}^{\emph{AIT}}$ over KDB is consistent on both small and large datasets.

As shown in Fig. 4a, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ enjoy an advantage over the other five BNCs on average when dealing with small datasets, and the advantage becomes more significant when dealing with large datasets.

5.2 Bias and variance

The bias-variance trade-off is one of the key issues for supervised learning [41]. A high-bias learner may underfit the training data and fail to capture important regularities. A high-variance learner may model the random noise and overfit unrepresentative training data rather than the intended outputs.

Table 3
The WDL records for all BNCs in terms of bias

	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$
KDB	15-17-4	–	–	–	–	–	–	–
CFWNB	9-4-23	8-4-24	–	–	–	–	–	–
FKDB	19-15-2	15-17-4	27-2-7	–	–	–	–	–
SKDB	22-10-4	15-18-3	25-3-8	11-14-11	–	–	–	–
TAODE	16-14-6	14-9-13	23-4-9	7-13-16	8-11-17	–	–	–
IWAODE	14-10-12	11-8-17	22-5-9	8-8-20	7-12-17	8-19-9	–	–
TAN ${}^{\emph{AIT}}$	11-19-6	10-11-15	25-4-7	6-9-21	7-9-20	12-13-11	12-12-12	–
KDB ${}^{\emph{AIT}}$	16-15-5	14-18-4	27-3-6	7-14-15	11-9-16	16-11-9	19-7-10	14-15-7

Figure 4.

The comparison results in terms of ZOL.

As shown in Table 3, when applied to TAN and KDB, AIT helps decrease bias more often than not. Thus the advantage in ZOL can largely be attributed to the advantage in bias. However, the difference in bias between KDB ${}^{\emph{AIT}}$ and KDB is much greater than that between TAN ${}^{\emph{AIT}}$ and TAN. KDB ${}^{\emph{AIT}}$ performs worse than FKDB and SKDB. Thus a complex network topology with weak dependencies can help reduce bias.

As shown in Fig. 5a and b, the advantage of TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ over the other five BNCs is much more significant on large datasets. Thus AIT, like attribute weighting and ensemble learning, is an effective strategy for reducing bias.

Table 4

The WDL records for all BNCs in terms of variance

	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$
KDB	7-7-22	–	–	–	–	–	–	–
CFWNB	31-1-4	31-2-3	–	–	–	–	–	–
FKDB	8-10-18	11-13-12	5-0-31	–	–	–	–	–
SKDB	14-9-13	19-9-8	6-1-29	21-6-9	–	–	–	–
TAODE	27-5-4	30-2-4	9-3-24	30-3-3	26-6-4	–	–	–
IWAODE	30-3-3	31-1-4	10-3-23	32-2-2	30-1-5	27-8-1	–	–
TAN ${}^{\emph{AIT}}$	24-8-4	27-7-2	7-2-27	25-6-5	20-5-11	8-7-21	7-3-26	–
KDB ${}^{\emph{AIT}}$	19-8-9	26-8-2	8-1-27	23-6-7	19-6-11	11-4-21	10-2-24	9-13-14

Figure 5.

The comparison results in terms of bias.

Figure 6.

The comparison results in terms of variance.

Variance-wise, as shown in Table 4, TAN ${}^{\emph{AIT}}$ performs better than TAN, and KDB ${}^{\emph{AIT}}$ performs better than KDB and its variations (i.e., FKDB and SKDB). By removing weak conditional dependencies and simplifying the network topology, AIT can reduce the possible negative effect caused by overfitting the training data. As shown in Fig. 6a and b, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ perform much better on large datasets than on small datasets in terms of variance. Thus the advantage of AIT over other learning strategies (such as attribute weighting and ensemble learning) is not significant for reducing variance.

Figure 7.

The comparison results for TAN and KDB with or without AIT in terms of rCLL.

5.3 Conditional log likelihood

Conditional log likelihood (CLL) function [42], which is defined as follows, is introduced to measure the goodness of fit of a statistical model $\mathcal{B}$ to a given set of data for given values of the model parameters.

$\displaystyle\mathrm{CLL}(\mathcal{B}|D)=\sum_{i=1}^{N}\log P_{\mathcal{B}}(y_{i}|x_{i}^{1},\ldots,x_{i}^{n})$ (10)

where $d_{i}=\{x_{i}^{1},\ldots,x_{i}^{n},y_{i}\}$ is the $i$ -th instance in dataset $D$ and $N$ is the number of instances in $D$ . In order to clarify the difference between algorithms in terms of CLL, we introduce the definition of relative conditional log likelihood (rCLL) as follows,

$\displaystyle\mathrm{rCLL}(\mathrm{BNC}^{AIT}|\mathrm{BNC})=\frac{\mathrm{CLL}(\mathrm{BNC}^{AIT}|D)}{\mathrm{CLL}(\mathrm{BNC}|D)}$ (11)

Figure 7 presents the scatter plot of rCLL. As shown in Fig. 7, the number of blue points representing $\textit{rCLL}\geqslant 1$ is greater than that of red points representing $\textit{rCLL}<1$ , which means that BNC ${}^{\emph{AIT}}$ performs much better than its base BNC. As shown in Fig. 7a and b, TAN ${}^{\emph{AIT}}$ outperforms TAN on 31 datasets, and KDB ${}^{\emph{AIT}}$ outperforms KDB on 32 datasets among the 36 UCI datasets. The experimental results indicate that the performance of the base BNC is improved significantly in terms of CLL when AIT is applied.
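Equations (10) and (11) can be evaluated directly from the probabilities that each classifier assigns to the true class of every instance. The following Python sketch shows one way to do so. The function names are hypothetical, the natural logarithm is assumed (the ratio in Eq. (11) does not depend on the base), and all probabilities are assumed to be strictly positive with at least one of the base classifier's probabilities below 1, so that the denominator is non-zero.

```python
import math


def conditional_log_likelihood(true_class_probs):
    """CLL as in Eq. (10): sum over instances of log P(y_i | x_i).

    true_class_probs: for each instance, the probability that the model
    assigns to the instance's true class label (assumed to be > 0).
    """
    return sum(math.log(p) for p in true_class_probs)


def relative_cll(probs_ait, probs_base):
    """rCLL as in Eq. (11): CLL of the refined BNC over CLL of the base BNC."""
    return (conditional_log_likelihood(probs_ait)
            / conditional_log_likelihood(probs_base))
```

Both functions take one probability per instance, so the two lists passed to `relative_cll` are expected to cover the same instances of $D$ in the same order.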

5.4 Efficiency comparisons

The comparison results of training and classification time are shown in Fig. 8, where each bar represents the average time over all 36 datasets. The structure complexity of the base classifier and the size of the test data are the main factors that affect the training and classification time.

As shown in Fig. 8a, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ spend more training time than CFWNB, TAODE and IWAODE, but less time than FKDB and SKDB. This is because, compared with KDB, AIT needs an extra pass over the data to identify conditional independence, and the complexity of this extra pass is linearly dependent on the number of CMI values. FKDB applies a heuristic search strategy to mine causal relationships among attributes, and SKDB uses cross validation to select among a collection of alternative models. As shown in Fig. 8b, TAN ${}^{\emph{AIT}}$ and KDB ${}^{\emph{AIT}}$ require less classification time than CFWNB, TAODE and IWAODE, but more time than FKDB and SKDB. This is because, for AIT, redundant weak dependencies have been filtered out, which simplifies the computation of the joint probability. In contrast, the variations of AODE (e.g., TAODE and IWAODE) utilize a restricted class of one-dependence estimators (ODEs) and aggregate the predictions of all qualified estimators within this class.

Figure 8.

The comparison results of training and classification time.

6. Conclusions and future work

Because of the inconsistency between conditional independence in information theory and that in probability theory, it is not appropriate to describe the independence relationship between attributes by using an information-theoretic criterion only. In this paper, an adaptive independence thresholding (AIT) scheme is proposed to automatically identify the conditional independence between attributes or attribute values, and dynamically remove the redundant edges to build a robust network topology. We explore the reasons for the effectiveness of AIT. Extensive experimental results on 36 datasets show that AIT significantly improves the generalization performance of base BNCs (including TAN and KDB). From the experimental results presented in this paper, weighting and independence identification are two effective approaches to improving the estimates of conditional probabilities and they should be mutually compatible. Exploring techniques for combining weighting and AIT remains a direction for future research.

Appendix

Table A1

Experimental results of zero-one loss

Datasets	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$	KDB ${}^{\emph{AIT}}$
Labor-negotiations	0.1053	0.0702	0.1053	0.0351	0.0877	0.0526	0.0526	0.0533	0.0511
Post-operative	0.3667	0.3778	0.3000	0.3778	0.3556	0.3333	0.3556	0.3333	0.3220
Zoo	0.0099	0.0495	0.0396	0.0198	0.0297	0.0198	0.0198	0.0099	0.0099
Wine	0.0337	0.0225	0.0056	0.0562	0.0337	0.0281	0.0169	0.0393	0.0181
Sonar	0.2212	0.2452	0.1587	0.2452	0.2212	0.2260	0.2260	0.2260	0.2356
Glass-id	0.2196	0.2196	0.1729	0.2103	0.2196	0.2523	0.2196	0.2150	0.2056
Hungarian	0.1701	0.1803	0.1599	0.1769	0.1837	0.1599	0.1599	0.1612	0.1497
Heart-disease-c	0.2079	0.2244	0.1683	0.2145	0.2079	0.2013	0.1980	0.1881	0.1863
Primary-tumor	0.5428	0.5723	0.5634	0.5546	0.5693	0.5782	0.5457	0.5575	0.5520
Ionosphere	0.0684	0.0741	0.0855	0.0741	0.1026	0.0741	0.0712	0.0741	0.0684
House-votes-84	0.0552	0.0506	0.0782	0.0437	0.0529	0.0529	0.0483	0.0552	0.0529
Soybean	0.0469	0.0556	0.0615	0.0630	0.0600	0.0483	0.0542	0.0425	0.0498
Credit-a	0.1507	0.1464	0.1333	0.1536	0.1551	0.1507	0.1391	0.1391	0.1435
Crx	0.1478	0.1565	0.1304	0.1391	0.1507	0.1391	0.1319	0.1386	0.1362
Tic-tac-toe	0.2286	0.2035	0.3100	0.0689	0.1514	0.2630	0.2662	0.2413	0.2093
German	0.2730	0.2890	0.2370	0.2780	0.2540	0.2550	0.2560	0.2590	0.2500
Yeast	0.4171	0.4387	0.4319	0.4319	0.4373	0.4218	0.4232	0.4225	0.4225
Mfeat-mor	0.2970	0.3060	0.3060	0.3080	0.2990	0.3105	0.3120	0.3060	0.3000
Splice-c4.5	0.0444	0.0941	0.0375	0.0416	0.0349	0.0365	0.0101	0.0348	0.0344
Kr-vs-kp	0.0776	0.0416	0.0645	0.0472	0.0329	0.0773	0.0826	0.0569	0.0507
Dis	0.0159	0.0138	0.0156	0.0141	0.0127	0.0125	0.0127	0.0130	0.0128
Abalone	0.4587	0.4563	0.4755	0.4563	0.4654	0.4465	0.4482	0.4554	0.4506
Spambase	0.0669	0.0635	0.0859	0.0728	0.0643	0.0602	0.0646	0.0624	0.0615
Phoneme	0.2733	0.1984	0.2407	0.2444	0.1912	0.2427	0.2104	0.2059	0.1889
Page-blocks	0.0415	0.0391	0.0417	0.0396	0.0340	0.0327	0.0325	0.0342	0.0347
Optdigits	0.0407	0.0372	0.0676	0.0356	0.0374	0.0290	0.0276	0.0378	0.0340
Mushrooms	0.0001	0.0000	0.0080	0.0000	0.0000	0.0002	0.0002	0.0002	0.0000
Pendigits	0.0321	0.0294	0.1130	0.0272	0.0294	0.0200	0.0185	0.0276	0.0285
Sign	0.2755	0.2539	0.3701	0.2463	0.2125	0.2743	0.2789	0.2712	0.2826
Seermdl	0.2376	0.2555	0.2330	0.2600	0.2361	0.2340	0.2325	0.2331	0.2330
Magic	0.1675	0.1637	0.2034	0.1589	0.1626	0.1725	0.1744	0.1790	0.1746
Letter-recog	0.1300	0.0986	0.2479	0.0974	0.1013	0.0838	0.0854	0.0972	0.0893
Adult	0.1380	0.1383	0.1499	0.1363	0.1358	0.1558	0.1502	0.1326	0.1325
Shuttle	0.0015	0.0009	0.0021	0.0007	0.0008	0.0008	0.0011	0.0010	0.0010
Waveform	0.0202	0.0256	0.0199	0.0202	0.0241	0.0182	0.0181	0.0187	0.0190
Localization	0.3575	0.2964	0.4936	0.2963	0.3013	0.3544	0.3593	0.3209	0.3054

Table A2

Experimental results of bias

Datasets	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$	KDB ${}^{\emph{AIT}}$
Labor-negotiations	0.0716	0.0553	0.0349	0.0795	0.0442	0.0474	0.0268	0.0411	0.0326
Post-operative	0.2687	0.2737	0.2928	0.2703	0.2077	0.2403	0.2190	0.2207	0.2467
Zoo	0.0303	0.0403	0.0840	0.0373	0.0400	0.0282	0.0282	0.0342	0.0367
Wine	0.0507	0.0520	0.0137	0.0483	0.0388	0.0376	0.0317	0.0339	0.0175
Sonar	0.1646	0.1686	0.1230	0.1633	0.1700	0.1707	0.1694	0.1625	0.1806
Glass-id	0.2756	0.2713	0.1197	0.2865	0.2785	0.2785	0.2818	0.2770	0.2752
Hungarian	0.1424	0.1480	0.1484	0.1466	0.1583	0.1581	0.1597	0.1363	0.1467
Heart-disease-c	0.1263	0.1299	0.1440	0.1029	0.1194	0.1134	0.1160	0.1088	0.1185
Primary-tumor	0.4249	0.4184	0.3417	0.4138	0.4068	0.4324	0.4188	0.4388	0.4397
Ionosphere	0.0804	0.0855	0.0813	0.0710	0.0897	0.0764	0.0881	0.0861	0.0804
House-votes-84	0.0410	0.0258	0.0575	0.0302	0.0272	0.0429	0.0493	0.0437	0.0361
Soybean	0.0522	0.0491	0.0695	0.0472	0.0508	0.0515	0.0693	0.0464	0.0482
Credit-a	0.1171	0.1137	0.1301	0.0955	0.1024	0.0940	0.0893	0.1144	0.1118
Crx	0.1180	0.1197	0.1332	0.1030	0.1046	0.0985	0.0904	0.1096	0.1129
Tic-tac-toe	0.1746	0.1367	0.2257	0.0351	0.1207	0.2008	0.1994	0.1752	0.1665
German	0.2057	0.2108	0.2075	0.2058	0.2001	0.2052	0.2112	0.2015	0.2091
Yeast	0.3481	0.3462	0.3644	0.3449	0.3469	0.3457	0.3458	0.3466	0.3444
Mfeat-mor	0.2077	0.2142	0.2455	0.2134	0.2071	0.2431	0.2492	0.2130	0.2177
Splice-c4.5	0.0395	0.0961	0.0345	0.0395	0.0289	0.0315	0.4576	0.0320	0.0609
Kr-vs-kp	0.0702	0.0417	0.0583	0.0416	0.0284	0.0688	0.0763	0.0525	0.0372
Dis	0.0193	0.0191	0.0127	0.0201	0.0182	0.0178	0.0168	0.0191	0.0191
Abalone	0.3126	0.3033	0.3728	0.3033	0.3102	0.3183	0.3199	0.3285	0.3146
Spambase	0.0570	0.0497	0.0750	0.0580	0.0483	0.0541	0.0602	0.0562	0.0522
Phoneme	0.2394	0.1572	0.2003	0.1641	0.1514	0.2186	0.1829	0.2026	0.1409
Page-blocks	0.0308	0.0280	0.0331	0.0259	0.0263	0.0248	0.0257	0.0278	0.0282
Optdigits	0.0275	0.0250	0.0594	0.0230	0.0252	0.0224	0.0200	0.0289	0.0228
Mushrooms	0.0001	0.0001	0.0103	0.0000	0.0000	0.0004	0.0004	0.0002	0.0001
Pendigits	0.0314	0.0207	0.1011	0.0197	0.0207	0.0225	0.0200	0.0293	0.0187
Sign	0.2420	0.2161	0.3435	0.1993	0.1802	0.2446	0.2510	0.2387	0.2176
Seermdl	0.2114	0.2100	0.2252	0.2077	0.2260	0.2150	0.2214	0.2144	0.2099
Magic	0.1252	0.1241	0.1898	0.1251	0.1244	0.1546	0.1595	0.1364	0.1340
Letter-recog	0.1032	0.0806	0.2133	0.0738	0.0782	0.0814	0.0877	0.1038	0.0730
Adult	0.1312	0.1220	0.1461	0.1230	0.1249	0.1459	0.1437	0.1297	0.1261
Shuttle	0.0008	0.0007	0.0024	0.0006	0.0007	0.0006	0.0007	0.0008	0.0007
Waveform	0.0152	0.0210	0.0199	0.0144	0.0137	0.0149	0.0157	0.0162	0.0177
Localization	0.3106	0.2134	0.4746	0.2137	0.1949	0.3010	0.3126	0.3112	0.2175

Table A3

Experimental results of variance

Datasets	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$	KDB ${}^{\emph{AIT}}$
Labor-negotiations	0.1389	0.1289	0.0320	0.1258	0.1137	0.0789	0.0626	0.1285	0.1146
Post-operative	0.1513	0.1697	0.0467	0.1730	0.1590	0.1563	0.1510	0.1297	0.1309
Zoo	0.0606	0.0658	0.0675	0.0536	0.0570	0.0445	0.0445	0.0574	0.0598
Wine	0.0493	0.0649	0.0042	0.0483	0.0341	0.0251	0.0141	0.0391	0.0459
Sonar	0.1165	0.1199	0.0432	0.1193	0.1097	0.0959	0.0929	0.1015	0.0191
Glass-id	0.1075	0.1189	0.0492	0.1065	0.1075	0.1004	0.0999	0.1077	0.1052
Hungarian	0.0596	0.0561	0.0201	0.0442	0.0366	0.0287	0.0270	0.0428	0.0495
Heart-disease-c	0.0479	0.0582	0.0389	0.0645	0.0489	0.0361	0.0305	0.0407	0.0561
Primary-tumor	0.2424	0.2391	0.2117	0.2419	0.2215	0.1880	0.1785	0.2363	0.2467
Ionosphere	0.0401	0.0581	0.0087	0.0563	0.0462	0.0381	0.0238	0.0389	0.0404
House-votes-84	0.0170	0.0197	0.0108	0.0222	0.0203	0.0081	0.0079	0.0250	0.0272
Soybean	0.0654	0.0439	0.0378	0.0541	0.0457	0.0331	0.0290	0.0461	0.0426
Credit-a	0.0555	0.0768	0.0205	0.0588	0.0380	0.0360	0.0276	0.0480	0.0510
Crx	0.0520	0.0663	0.0203	0.0501	0.0310	0.0310	0.0240	0.0421	0.0534
Tic-tac-toe	0.0824	0.1125	0.0550	0.0699	0.1417	0.0528	0.0529	0.1064	0.1034
German	0.1009	0.1192	0.0473	0.1216	0.0816	0.0789	0.0692	0.0903	0.0919
Yeast	0.1037	0.1020	0.1073	0.1014	0.0990	0.0972	0.0967	0.1002	0.0990
Mfeat-mor	0.1020	0.1031	0.0563	0.1010	0.1040	0.0730	0.0676	0.0289	0.0228
Splice-c4.5	0.0289	0.0800	0.0068	0.1492	0.0133	0.0119	0.1250	0.0196	0.02457
Kr-vs-kp	0.0152	0.0111	0.0169	0.0191	0.0126	0.0208	0.0185	0.0108	0.0110
Dis	0.0005	0.0011	0.0050	0.0006	0.0016	0.0040	0.0036	0.0009	0.0009
Abalone	0.1693	0.1769	0.0904	0.1772	0.1679	0.1561	0.1539	0.1526	0.1409
Spambase	0.0158	0.0214	0.0054	0.0205	0.0240	0.0124	0.0094	0.0137	0.0106
Phoneme	0.1828	0.1064	0.0961	0.1349	0.0867	0.1355	0.1270	0.1388	0.0976
Page-blocks	0.0143	0.0177	0.0070	0.0169	0.0162	0.0122	0.0113	0.0138	0.0169
Optdigits	0.0185	0.0254	0.0137	0.0239	0.0253	0.0140	0.0132	0.0008	0.0007
Mushrooms	0.0002	0.0002	0.0001	0.0000	0.0002	0.0002	0.0001	0.0002	0.0002
Pendigits	0.0200	0.0236	0.0126	0.0265	0.0236	0.0130	0.0107	0.0190	0.0208
Sign	0.0386	0.0596	0.0250	0.0642	0.0725	0.0406	0.0380	0.0464	0.0482
Seermdl	0.0381	0.0613	0.0130	0.0692	0.0155	0.0295	0.0200	0.0362	0.0522
Magic	0.0490	0.0491	0.0092	0.0479	0.0483	0.0313	0.0291	0.0320	0.0409
Letter-recog	0.0591	0.0709	0.0498	0.0797	0.0745	0.0457	0.0417	0.0512	0.0651
Adult	0.0165	0.0285	0.0071	0.0252	0.0209	0.0174	0.0109	0.0122	0.0177
Shuttle	0.0004	0.0003	0.0006	0.0004	0.0004	0.0004	0.0003	0.0003	0.0003
Waveform	0.0053	0.0037	0.0004	0.0047	0.0088	0.0034	0.0024	0.0034	0.0045
Localization	0.0594	0.1099	0.0186	0.1096	0.1337	0.0657	0.0577	0.0321	0.0300

Table A4

Experimental results of RMSE

Datasets	TAN	KDB	CFWNB	FKDB	SKDB	TAODE	IWAODE	TAN ${}^{\emph{AIT}}$	KDB ${}^{\emph{AIT}}$
Labor-negotiations	0.2778	0.2477	0.2810	0.1952	0.2153	0.2029	0.1739	0.2453	0.2243
Post-operative	0.5340	0.5632	0.3970	0.4499	0.4389	0.4184	0.4101	0.4113	0.4113
Zoo	0.1309	0.1815	0.0933	0.0659	0.0817	0.0686	0.0650	0.2368	0.0819
Wine	0.1746	0.1501	0.0532	0.1641	0.1194	0.1021	0.4689	0.2787	0.1275
Sonar	0.4131	0.4084	0.3409	0.4280	0.4100	0.4202	0.1001	0.4054	0.4005
Glass-id	0.3332	0.3395	0.2952	0.3241	0.3376	0.3409	0.4246	0.3273	0.3284
Hungarian	0.3429	0.3552	0.3384	0.3567	0.3434	0.3443	0.3237	0.3337	0.3284
Heart-disease-c	0.3775	0.3963	0.3417	0.3942	0.386	0.3677	0.345	0.3577	0.3656
Primary-tumor	0.7170	0.7262	0.1790	0.1841	0.1828	0.1864	0.3620	0.1773	0.1781
Ionosphere	0.2615	0.2714	0.2765	0.2612	0.2817	0.2464	0.1778	0.2518	0.2487
House-votes-84	0.2181	0.1969	0.2558	0.1977	0.2044	0.1968	0.2546	0.2128	0.2068
Soybean	0.2014	0.2063	0.0723	0.0734	0.0676	0.0657	0.1998	0.0591	0.0620
Credit-a	0.3415	0.3480	0.3116	0.3386	0.3346	0.3305	0.0697	0.3301	0.3260
Crx	0.3411	0.3525	0.3142	0.3289	0.3304	0.3322	0.3271	0.3245	0.3221
Tic-tac-toe	0.4023	0.3772	0.1116	0.0984	0.1030	0.0865	0.3259	0.0923	0.0933
German	0.4367	0.4665	0.4080	0.4567	0.4236	0.4224	0.3992	0.4206	0.4209
Yeast	0.5994	0.6035	0.2423	0.2387	0.2392	0.2371	0.4157	0.1292	0.2370
Mfeat-mor	0.4657	0.4707	0.1943	0.1978	0.1944	0.1980	0.2363	0.1936	0.1933
Splice-c4.5	0.1917	0.2756	0.4334	0.2260	0.3268	0.3984	0.1979	0.4216	0.4193
Kr-vs-kp	0.2358	0.1869	0.2779	0.1935	0.1573	0.2561	0.2635	0.2418	0.2189
Dis	0.1103	0.1024	0.1130	0.1045	0.0987	0.1047	0.1058	0.1084	0.1041
Abalone	0.5638	0.5646	0.4433	0.4277	0.4269	0.4195	0.4191	0.4209	0.4222
Spambase	0.2403	0.2300	0.2657	0.2418	0.2298	0.2239	0.2317	0.2432	0.242
Phoneme	0.5048	0.4195	0.0806	0.0865	0.0756	0.0891	0.0795	0.0844	0.0762
Page-blocks	0.1894	0.1811	0.1117	0.1123	0.1020	0.1013	0.0986	0.1026	0.1031
Optdigits	0.1906	0.1806	0.1075	0.0761	0.0792	0.0727	0.0686	0.0803	0.0800
Mushrooms	0.0083	0.0001	0.0857	0.0017	0.0000	0.0121	0.0114	0.0288	0.0267
Pendigits	0.1640	0.1588	0.1318	0.0671	0.0687	0.0565	0.0540	0.1565	0.0700
Sign	0.3505	0.3334	0.3929	0.3333	0.3151	0.3487	0.3516	0.3521	0.3551
Seermdl	0.4131	0.4340	0.4121	0.4386	0.4083	0.4139	0.4106	0.4071	0.4069
Magic	0.3461	0.3470	0.3709	0.3410	0.3471	0.3519	0.3534	0.3541	0.3502
Letter-recog	0.3350	0.2963	0.1139	0.0764	0.0778	0.0691	0.0693	0.0842	0.0825
Adult	0.3076	0.3089	0.3150	0.3083	0.3043	0.3297	0.3250	0.3024	0.3014
Shuttle	0.0356	0.0290	0.0270	0.0127	0.0139	0.0124	0.0159	0.0147	0.0156
Waveform	0.1164	0.1402	0.0068	0.0198	0.0297	0.0198	0.0859	0.0529	0.2776
Localization	0.5656	0.5106	0.2402	0.1962	0.2010	0.2081	0.2093	0.2094	0.2105

References

Scanagatta

Salmeron

and Stella

, A survey on Bayesian network structure learning from data, Progress in Artificial Intelligence 8(4) (2019), 425–539.

Sendari

Zaeni

I.A.E.

and Lestari

D.C.

, Opinion analysis for emotional classification on emoji tweets using the naive bayes algorithm, Knowledge Engineering and Data Science 3(1) (2020), 50–59.

Zhang

and Li

, Bayesian classifiers for positive unlabeled learning, in: Proceedings of the 12th International Conference on Web-Age Information Management, Springer, Berlin, Heidelberg, 2011, pp. 81–93.

Chai

K.M.A.

Chieu

H.L.

and Ng

H.T.

, Bayesian online classifiers for text classification and filtering, in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 97–104.

Brain

and Cotton

, Explanation and Justification in Machine Learning: A Survey, in: Proceedings of the 17th IJCAI Explainable AI (XAI) Workshop, Melbourne, Australia, 2017, pp. 8–13.

Qiu

and Ding

, A survey of machine learning for big data processing, EURASIP Journal on Advances in Signal Processing (2016), 67–83.

Friedl

M.A.

and Brodley

C.E.

, Decision tree classification of land cover from remotely sensed data, Remote Sensing of Environment 61(3) (1997), 399-409.

Bielza

and Larranaga

, Discrete bayesian network classifiers: A survey, ACM Computing Surveys (CSUR) 47(1) (2014), 1–43.

Jiang

Wang

and Cai

, Survey of Improving Naive Bayes for Classification, in: Proceedings of the International Conference on Advanced Data Mining and Applications, Springer, Berlin, Heidelberg, 2007, pp. 134–145.

10.

Suykens

J.A.K.

and Vandewalle

, Least squares support vector machine classifiers, Neural Processing Letters 9(3) (1999), 293–300.

11.

Richard

M.D.

and Lippmann

R.P.

, Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation 3(4) (1991), 461–483.

12.

Wang

Chen

and Mammadov

, Target learning: A novel framework to mine significant dependencies for unlabeled data, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018, pp. 106–117.

13.

Chickering

D.M.

Heckerman

and Meek

, Large-sample learning of Bayesian networks is NP-hard, Journal of Machine Learning Research 5 (2004), 1287–1330.

14.

Koivisto

and Sood

, Exact Bayesian structure discovery in Bayesian networks, Journal of Machine Learning Research 5 (2004), 549–573.

15.

Sutrisnowati

R.A.

Bae

and Park

, Learning bayesian network from event logs using mutual information test, in: Proceedings of the IEEE 6th International Conference on Service-Oriented Computing and Applications, 2013, pp. 356–360.

16.

Cai

and Zeng

, Artificial immune system for attribute weighted naive bayes classification, in: Proceedings of the International Joint Conference on Neural Networks, 2013, pp. 798–805.

17.

Mukherjee

Asnani

and Kannan

, CCMI: Classifier based conditional mutual information estimation, in: Proceedings of the Uncertainty in Artificial Intelligence, 2020, pp. 1083–1093.

18.

Liu

W.Y.

Yue

and Li

W.H.

, Constructing the Bayesian network structure from dependencies implied in multiple relational schemas, Expert Systems with Applications 38(6) (2011), 7123–7134.

19.

Jiang

Cai

and Zhang

, Not so greedy: Randomly selected naive bayes, Expert Systems with Applications 39(12) (2012), 11022–11028.

20.

Jiang

Cai

and Wang

, Improving Tree augmented Naive Bayes for class probability estimation, Knowledge-Based Systems 26 (2012), 239–245.

21.

Zaidi

N.A.

Cerquides

and Carman

M.J.

, Alleviating naive bayes attribute independence assumption by attribute weighting, Journal of Machine Learning Research 14 (2013), 1947–1988.

22.

Heckerman

, A tutorial on learning with Bayesian networks, in: Proceedings of the Innovations in Bayesian Networks, 2008, pp. 33–82.

23.

Friedman

Geiger

and Goldszmidt

, Bayesian network classifiers, Machine Learning 29 (1997), 131–163.

24.

Sahami

, Learning Limited Dependence Bayesian Classifiers, in: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1996, pp. 335–338.

25.

Jiang

Zhang

and Cai

, A novel Bayes model: Hidden naive Bayes, IEEE Transactions on Knowledge and Data Engineering 21(10) (2008), 1361-1371.

26.

Lee

Gutierrez

and Dou

, Calculating Feature Weights in Naive Bayes with Kullback-Leibler Measure, in: Proceedings of IEEE 11th International Conference on Data Mining, 2011, pp. 1146–1151.

27.

Jiang

Zhang

and Li

, A correlation-based feature weighting filter for naive bayes, IEEE Transactions on Knowledge and Data Engineering 31(2) (2019), 201–213.

28.

Wang

Zhao

and Sun

, General and local: Averaged k-dependence bayesian classifiers, Entropy 17(6) (2015), 4134-4154.

29.

Duan

Wang

and Chen

, Instance-based weighting filter for superparent one-dependence estimators, Knowledge-Based Systems 203 (2020), 106085.

30.

Frank

Hall

and Pfahringer

, Locally weighted naive bayes, in: Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, 2002, pp. 249–256.

31.

Wang

Chen

and Liu

, Self-adaptive attribute value weighting for averaged one-dependence estimators, IEEE Access 8 (2020), 27887–27900.

32.

Pensar

Nyman

and Lintusaari

, The role of local partial independence in learning of Bayesian networks, International Journal of Approximate Reasoning 69 (2016), 91–105.

33. Dawid A.P., Conditional independence in statistical theory, Journal of the Royal Statistical Society: Series B (Methodological) 41(1) (1979), 1–15.

34. Vadivel A.K., Majumdar A.K. and Sural, Performance comparison of distance metrics in content-based image retrieval applications, in: Proceedings of the International Conference on Information Technology, 2003, pp. 159–164.

35. Frank and Asuncion, UCI Machine learning repository, [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Sciences (2010).

36. Wang and Duan, Optimizing the topology of Bayesian network classifiers by applying conditional entropy to mine causal relationships between attributes, IEEE Access 7 (2019), 134271–134279.

37. Martinez A.M., Webb G.I. and Chen, Scalable learning of bayesian network classifiers, Journal of Machine Learning Research 17(44) (2016), 1515–1549.

38. Kohavi and Wolpert D.H., Bias plus variance decomposition for zero-one loss functions, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 275–283.

39. Brain and Webb G.I., On the effect of dataset size on bias and variance in classification learning, in: Proceedings of the 4th Australian Knowledge Acquisition Workshop, 1999, pp. 117–128.

40. Karlis and Ntzoufras, Bayesian modelling of football outcomes: Using the Skellam’s distribution for the goal difference, IMA Journal of Management Mathematics 20(2) (2009), 133–145.

41. Briscoe and Feldman, Conceptual complexity and the bias/variance tradeoff, Cognition 118(1) (2009), 2–16.

42. Kalbfleisch J.D. and Prentice R.L., Marginal likelihoods based on Cox’s regression and life model, Biometrika 60(2) (1973), 267–278.

43. Gianni and Cornelis, Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Transactions on Information Systems (TOIS) 20(4) (2002), 357–389.
