Exploiting the implicit independence assumption for learning directed graphical models

Abstract

Bayesian network classifiers (BNCs) provide a sound formalism for representing probabilistic knowledge and reasoning with uncertainty. Explicit independence assumptions can effectively and efficiently reduce the size of the search space for solving the NP-complete problem of structure learning. Strong conditional dependencies, when added to the network topology of a BNC, can relax the independence assumptions, whereas weak ones may result in biased estimates of conditional probability and degraded generalization performance. In this paper, we propose an extension to the $k$ -dependence Bayesian classifier (KDB) that achieves the bias/variance trade-off by verifying the rationality of the implicit independence assumptions. The informational and probabilistic dependency relationships represented in the learned robust topologies will be more appropriate for fitting labeled and unlabeled data, respectively. Comprehensive experimental results on 40 UCI datasets show that our proposed algorithm achieves competitive classification performance when compared to state-of-the-art BNC learners and their efficient variants in terms of zero-one loss, root mean square error (RMSE), bias and variance.

Keywords

Bayesian network classifier; implicit independence assumption; informational independence; probabilistic independence

1. Introduction

Classification is one of the key issues in data mining and machine learning [1]. The classifier learned from data can be described as the mapping relationship between predictive attributes $\emph{{X}}=\{X_{1},X_{2},\ldots,X_{n}\}$ and class variable $Y$ , and numerous state-of-the-art classification algorithms, e.g., Bayesian network classifier (BNC) [2], decision tree [3], support vector machine [4] and neural network [5], have been proposed. Among these algorithms, BNC can graphically represent the joint probability distribution over variables of interest in a compact manner [6], and its topology encodes the dependency relationships among attributes in the form of a directed acyclic graph (DAG). The study of BNC has attracted extensive attention in the past decades due to its interpretability and simplicity [8, 7]. However, learning an optimal BNC has been proven to be NP-hard [9]. Explicit or implicit independence assumptions have been proven to be efficient and feasible, and are commonly applied to simplify the topologies of DAGs and achieve high-confidence estimates of probability distribution [10, 11, 12].

Naive Bayes (NB) [13] is an extremely simple and remarkably effective BNC. Its explicit independence assumption determines the network topology and the form of factorization of joint probability. However, its independence assumption is often violated in practice and as a result its probability estimates are often suboptimal. A large literature explores the approaches, e.g., attribute weighting [14, 15], attribute selection [16, 17], instance weighting [18, 19], instance selection [20, 21] and structure extension [1, 22, 23], to improving the classification performance while handling real-world problems with complex dependencies. Among these approaches, structure extension is the most natural and direct way to alleviate NB’s unrealistic independence assumption by adding augmented edges between attributes to the network topology.

Among the BNCs based on structure extension, the $k$ -dependence Bayesian classifier (KDB) [24] is appealing due to its flexibility, especially when the structure complexity and computational complexity cannot be pre-determined. KDB controls its bias/variance trade-off with a single parameter, $k$ , which is the maximum number of parents allowed for each node. Correspondingly, some non-significant dependency relationships are neglected and implicit independence assumptions are thereby introduced. Unverified independence assumptions may result in suboptimal network topology and biased estimates of conditional probability. Furthermore, information-theoretic metrics for measuring mutual dependence or conditional dependence between attributes, e.g., mutual information $I(X_{i};Y)$ and conditional mutual information $I(X_{i};X_{j}|Y)$ , cannot measure probabilistic dependence between attribute values. Information-theoretic (in)dependence does not correspond to probability-theoretic (in)dependence, and vice versa.

To the best of our knowledge, previous research has focused on (in)dependency relationships learned from training data only. However, if these relationships cannot be mapped onto those implicated in the testing instance, the learned network topology may be biased and the generalization performance will be degraded. The interoperability between these two kinds of relationships has not been fully investigated in previous studies. To achieve the trade-off between data fitting and classification, information-theoretic metrics should be redefined to mine significant instantiated (in)dependency relationships from a specific testing instance. A general learning framework is needed to identify the significant conditional (in)dependencies implicated in labeled and unlabeled data, respectively. The main contributions are listed as follows:

Information-theoretic and probability-theoretic metrics are introduced to measure explicit dependency relationships or to identify implicit independency relationships. By inferring the implicit independence assumption from the approximate expression of conditional probability, we extend KDB to learn conditional dependence conditioned on class variable or predictive attribute. The resulting highly scalable algorithm combines the low variance of instance learning with the low bias of ensemble learning.

We compare our proposed $k$ -independence Bayesian classifier (KIBC) with other competitors on 40 UCI datasets in terms of zero-one loss (ZOL), root mean square error (RMSE), bias and variance. The Friedman and Nemenyi tests are also used to explore the statistical significance of the experimental results. The results show that KIBC achieves competitive classification performance compared to a range of state-of-the-art single-model BNCs (e.g., SKDB and CFWNB) and ensemble BNCs (e.g., WATAN, IWAODE, WAODE-MI, TAODE and DWAODE).

Section 2 reviews the state-of-the-art BNCs and then discusses the issue of implicit independence assumption based on the analysis of explicit independence assumption. Our novel techniques for identifying implicit independence implicated in labeled and unlabeled data are described in Section 3 where we also discuss their connection with KDB in terms of information-theoretic metrics. Section 4 presents the experimental evaluation of our proposed algorithm with related approaches. Section 5 shows the main conclusions and outlines future work.

2. Related work

A BNC provides a framework for encoding a joint probability distribution over a set of finite attributes $\emph{{X}}=\{X_{1},X_{2},\ldots,X_{n}\}$ and class variable $Y$ , denoted by $\mathcal{B}=<\mathcal{G},\Theta>$ . The component $\mathcal{G}=\{\textbf{V},\textbf{E}\}$ is a directed acyclic graph (DAG). V is a set of nodes corresponding to the attributes in $\{\emph{{X}},Y\}$ . E is a set of directed edges, which represents the dependencies between nodes. An arbitrary edge $X_{j}\rightarrow X_{i}$ in E implies that $X_{j}$ is the parent of $X_{i}$ , and the set of parents of $X_{i}$ is denoted as $\Pi_{i}$ . $\Theta$ is a set of parameters that quantifies the dependencies among nodes in V. The structure $\mathcal{G}$ encodes conditional independencies among attributes, that is, each attribute $X_{i}$ is conditionally independent of its non-descendants given $\Pi_{i}$ [25]. Given a restricted BNC $\mathcal{B}$ in which class node $Y$ is the root of all attribute nodes, it assigns the optimal class label to an instance x by using Bayes formula as follows:

$\displaystyle y^{*}=\arg\max_{y\in Y}P(y|\emph{{x}})=\arg\max_{y\in Y}\frac{P(y,\emph{{x}})}{P(\emph{{x}})}\propto\arg\max_{y\in Y}P(y,\emph{{x}})=\arg\max_{y\in Y}P(y)\prod_{i=1}^{n}P(x_{i}|\pi_{i},y),$ (1)

where $\pi_{i}$ denotes the values of parent attributes $\Pi_{i}$ . Given a limited number of instances, a biased estimate of $P(y|\emph{{x}})$ will result in wrong classification results. Since $P(\emph{{x}})$ is a constant that does not depend on the class variable, according to Eq. (1) the estimate of $P(y,\emph{{x}})$ can also be applied for classification. The network topology of a BNC provides a feasible approach to decomposing $P(y,\emph{{x}})$ into the product of a set of conditional probabilities. Full BNC (FBC) is an optimal BNC and it perfectly represents the joint distribution [26] in the form of $P(y,\emph{{x}})=P(y)P(x_{1}|y)P(x_{2}|x_{1},y)\cdots P(x_{n}|x_{1},x_{2},\ldots,x_{n-1},y)$ . In other words, $\Pi_{i}=\{X_{1},X_{2},\ldots,X_{i-1}\}$ . If the true distribution is available to us, we can theoretically achieve the optimal classification performance. However, FBC represents all possible dependency relationships rather than only the significant ones in the network topology, and thus it is highly undesirable. On the other hand, high computational complexity and the limited number of instances for training may result in explicit or implicit independence assumptions and then biased estimates of conditional probability for some attributes. According to the network topology of $\mathcal{B}$ , if the estimates of the joint probability distribution over all attributes can approximate that for FBC, the resulting BNC will be suboptimal.

2.1 Explicit independence assumption

An explicit independence assumption directly defines the independency relationships under certain conditions without any prior domain knowledge. NB takes the independence assumption to the extreme by assuming that the attributes are conditionally independent given the class, and as shown in Fig. 1a the network topology of NB is determined by the independence assumption rather than learned from training data. The estimate of $P(x_{i}|\pi_{i},y)$ is simplified to $P(x_{i}|y)$ , and the estimate of the joint probability $P(y,\emph{{x}})$ becomes

$\displaystyle P_{\rm{NB}}(y,\emph{{x}})=P(y)\prod_{i=1}^{n}P(x_{i}|y).$ (2)

NB’s unrealistic assumption can be described in the probabilistic form as $P(\emph{{x}}|y)=\prod_{i=1}^{n}P(x_{i}|y)$ , which should hold for all possible combinations of attribute values. Although NB has demonstrated excellent classification performance and surprisingly outperformed many sophisticated learners, its independence assumption rarely holds in real-world problems and thus it is hard for NB to fit data well. NB implicitly assumes that the significance for different attributes is the same, and consequently some researchers propose to apply attribute weighting [11, 27] to calibrate the estimates of conditional probabilities. Jiang et al. [28] observe that for different class labels, the attribute importance should also be different and the proposed weighting method, called class-specific attribute weighted NB (CAWNB), discriminatively assigns weights to attributes according to different classes. Wu et al. [29] propose to apply evolutionary computation to automatically determine the best attribute weights for NB. Jiang et al. [15] assume that significant attributes should be highly correlated with the class whereas uncorrelated with each other. Thus the normalized $I(X_{i};X_{j})$ and $I(X_{i};Y)$ are introduced to compute the weighting metric for weighted NB.
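As an illustration of Eq. (2), the following minimal Python sketch evaluates the NB joint probability and the resulting class label. The probability tables `prior` and `cond` are hypothetical toy values chosen for illustration only, and the function names are not part of any referenced implementation.

```python
from math import prod

# Hypothetical toy tables (illustrative values, not taken from any dataset).
prior = {"a": 0.6, "b": 0.4}                      # P(y)
cond = [                                          # cond[i][y][x_i] = P(x_i | y)
    {"a": {0: 0.8, 1: 0.2}, "b": {0: 0.3, 1: 0.7}},
    {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.1, 1: 0.9}},
]


def nb_joint(y, x):
    """P_NB(y, x) = P(y) * prod_i P(x_i | y), cf. Eq. (2)."""
    return prior[y] * prod(cond[i][y][xi] for i, xi in enumerate(x))


def nb_classify(x):
    """Return arg max_y P_NB(y, x); P(x) is omitted since it is constant in y."""
    return max(prior, key=lambda y: nb_joint(y, x))


print(nb_classify((1, 1)), nb_classify((0, 0)))
```

Because every factor conditions on the class only, no structure learning is involved; only the tables need to be estimated.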

Figure 1.

An example of (a) NB and (b) SPODE ${}^{\alpha}$ .

Superparent-one-dependence estimator (SPODE) assumes that all attributes depend on the same attribute, namely the superparent attribute, in addition to the class [30]. As shown in Fig. 1b, the SPODE with the superparent attribute $X_{\alpha}$ , denoted as SPODE ${}^{\alpha}$ , assumes that the non-superparent attributes are independent of each other given $X_{\alpha}$ and $Y$ . This explicit independence assumption is weaker than that of NB since the network topology of SPODE represents the dependency relationships between the superparent attribute and non-superparent ones. Then the estimate of $P(x_{i}|\pi_{i},y)$ turns to be $P(x_{i}|x_{\alpha},y)$ . If we combine $X_{\alpha}$ and $Y$ into one, the network topology of SPODE is the same as that of NB. SPODE’s unrealistic assumption can be described in the probabilistic form as $P(\emph{{x}}|x_{\alpha},y)=\prod_{i=1,i\neq\alpha}^{n}P(x_{i}|x_{\alpha},y)$ , which should also hold for all possible combinations of attribute values. Then the estimate of joint probability $P(y,\emph{{x}})$ turns to be

$\displaystyle P_{\rm{SPODE}^{\alpha}}(y,\emph{{x}})=P(x_{\alpha},y)\prod_{i=1,i\neq\alpha}^{n}P(x_{i}|x_{\alpha},y).$ (3)

An ensemble of SPODEs performs much better than one single SPODE more often than not. Averaged one-dependence estimators (AODE) [30] uses uniform weights to ensemble all qualified SPODE members and estimates posterior probability by averaging them. For different SPODE members in AODE, the independence assumptions are different due to the variations in the superparent attributes. Thus they can not fit training data to the same extent and demand differential treatment. By assigning distinctive weights to the SPODE members, model weighting can help calibrate the joint probability of the final weighted AODE. Duan et al. [31] propose an instance-based weighting approach, for which the weights are defined by instantiated information-theoretic metrics and may vary from instance to instance. Wang et al. [32] propose to assign each SPODE a discriminative weight by identifying the differences among these SPODEs in terms of log likelihood. Jiang et al. [33] propose to respectively apply area under the ROC curve (AUC), classification accuracy, conditional log likelihood and mutual information as the weighting metrics of AODE.
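The uniform averaging of SPODE members can be sketched as follows. This is a simplified illustration of Eq. (3) and of uniform-weight averaging, under stated assumptions: the toy data are hypothetical, add-one smoothing is used in place of the estimator of the original AODE, and the frequency-based filtering of qualified superparents is omitted.

```python
# Hypothetical toy data: each row is (x_1, x_2, x_3, y) with binary attributes.
rows = [(0, 0, 1, "a"), (0, 1, 1, "a"), (1, 1, 0, "b"),
        (1, 0, 0, "b"), (0, 1, 0, "a"), (1, 1, 1, "b")]
classes = sorted({r[-1] for r in rows})
n_attrs = len(rows[0]) - 1
n_values = 2  # assumed number of values per attribute


def spode_joint(alpha, y, x):
    """Estimate P(x_alpha, y) * prod_{i != alpha} P(x_i | x_alpha, y), cf. Eq. (3)."""
    match = [r for r in rows if r[alpha] == x[alpha] and r[-1] == y]
    # Add-one smoothing is an assumption made here to avoid zero probabilities.
    p = (len(match) + 1) / (len(rows) + n_values * len(classes))
    for i in range(n_attrs):
        if i != alpha:
            hits = sum(1 for r in match if r[i] == x[i])
            p *= (hits + 1) / (len(match) + n_values)
    return p


def aode_classify(x):
    """Average all SPODE joint estimates with uniform weights, then take the arg max."""
    def score(y):
        return sum(spode_joint(a, y, x) for a in range(n_attrs)) / n_attrs
    return max(classes, key=score)


print(aode_classify((0, 0, 1)))
```

Replacing the uniform average in `score` by a weighted sum gives the model-weighting variants discussed above.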

2.2 Implicit independence assumption

By adding augmented edges to the network topology of NB, its independence assumption can be alleviated to a certain extent. However, no BNC can fully represent the dependency relationships in practice due to the restriction in computational complexity and precision of the probability estimates, and implicit independence assumptions are introduced accordingly. Given the network topology $\mathcal{B}$ , the corresponding joint probability will be factorized into the form as follows:

$\displaystyle P_{\mathcal{B}}(y,\emph{{x}})=P(y)\prod_{i=1}^{n}P(x_{i}|\pi_{i}^{\mathcal{B}},y)=P(y)P(x_{1}|y)\prod_{i=2}^{n}P(x_{i}|\pi_{i}^{\mathcal{B}},y),$ (4)

where $\pi_{i}^{\mathcal{B}}$ is the parent attribute values of $\Pi_{i}^{\mathcal{B}}$ and $\Pi_{i}^{\mathcal{B}}\subseteq\Pi_{i}$ . If $P(x_{i}|\pi_{i},y)=P(x_{i}|\pi_{i}^{\mathcal{B}},y)$ or $P(x_{i}|\pi_{i},y)\approx P(x_{i}|\pi_{i}^{\mathcal{B}},y)$ holds for all the predictive attributes, then the joint probability encoded in $\mathcal{B}$ is expected to be the same as that in FBC. That is, the conditional probability $P(x_{i}|\pi_{i},y)$ for FBC is simplified to $P(x_{i}|\pi_{i}^{\mathcal{B}},y)$ . Since

$\displaystyle P(x_{i}|\pi_{i},y)=P(x_{i}|\pi_{i}^{\mathcal{B}},y)\Rightarrow P(x_{i}|\pi_{i}-\pi_{i}^{\mathcal{B}},\pi_{i}^{\mathcal{B}},y)=P(x_{i}|\pi_{i}^{\mathcal{B}},y)\Rightarrow P(x_{i},\pi_{i}-\pi_{i}^{\mathcal{B}}|\pi_{i}^{\mathcal{B}},y)=P(\pi_{i}-\pi_{i}^{\mathcal{B}}|\pi_{i}^{\mathcal{B}},y)P(x_{i}|\pi_{i}^{\mathcal{B}},y).$ (5)

Thus the network topology $\mathcal{B}$ encodes implicit conditional independencies among attributes, that is, each attribute $X_{i}$ is conditionally independent of its non-descendants $\Pi_{i}-\Pi_{i}^{\mathcal{B}}$ given $\{\Pi_{i}^{\mathcal{B}},Y\}$ . For example, Tree Augmented Naive Bayes (TAN) [1], as shown in Fig. 2a, is one of the most classical 1-dependence BNCs in which $\Pi_{i}^{\rm{TAN}}$ is allowed to contain at most one other attribute as the parent attribute of $X_{i}$ . TAN extends NB by applying the Chow-Liu tree learning algorithm [34] to build a maximum weighted spanning tree (MWST). The implicit independence assumption for attribute $X_{i}$ in TAN is

$\displaystyle P(x_{i},\pi_{i}-\pi_{i}^{\rm{TAN}}|\pi_{i}^{\rm{TAN}},y)=P(\pi_{i}-\pi_{i}^{\rm{TAN}}|\pi_{i}^{\rm{TAN}},y)P(x_{i}|\pi_{i}^{\rm{TAN}},y),$ (6)

where $\pi_{i}^{\rm{TAN}}$ denotes the values of the parent attributes of $X_{i}$ in MWST and $|\pi_{i}^{\rm{TAN}}|\leqslant 1$ . For FBC, $\Pi_{i}=\{X_{1},X_{2},\ldots,X_{i-1}\}$ , thus TAN can only represent the most significant dependence between $X_{i}$ and at most one of its $i-1$ candidate parent attributes. To introduce more dependency relationships while keeping the same topology as TAN, Jiang et al. [35] propose to learn the Hidden Naive Bayes (HNB), for which the hidden parent of attribute $X_{i}$ combines the weighted influences from all the other attributes, and the estimates of conditional probability will be finely tuned. As shown in Eq. (4), attribute $X_{1}$ takes class $Y$ as its only parent, thus the mutual dependence $X_{i}-Y$ rather than any conditional dependence $X_{i}-X_{j}$ is represented in the network topology. Since any attribute can be selected as $X_{1}$ , Jiang et al. [36] propose to improve TAN’s performance in class probability estimation by taking each attribute in turn as the root to create the corresponding MWST, and the final ensemble model, called averaged TAN (ATAN), estimates the class probabilities by averaging all the TAN models. Because different ensemble members may fit training data to different extents, Jiang et al. [36] further propose to apply an information-theoretic weighting metric to assign a distinct weight to each TAN model. This model weighting strategy can help improve the estimates of joint probability even when some ensemble members are weak learners.

Figure 2.

An example of (a) TAN and (b) KDB with $k=$ 2.

TAN can clearly improve the prediction accuracy of NB when the independence assumption of NB is violated [1], but it ignores the influences from other attributes due to its restriction in topology complexity. Hence, when stronger and more complex attribute dependencies do exist, some dependencies have to be discarded. KDB avoids TAN’s restriction by allowing each attribute to have at most $k$ parent attributes, where $k$ can be set arbitrarily, as shown in Fig. 2b. Within this framework, NB, TAN and FBC are special cases of KDB, with $k=0$ , $k=1$ and $k=n-1$ , respectively. The KDB algorithm adopts a heuristic search strategy by comparing mutual information $I(X_{i};Y)$ and conditional mutual information $I(X_{i};X_{j}|Y)$ . The joint probability $P(y,\emph{{x}})$ for KDB is estimated by

$\displaystyle P_{\rm{KDB}}(y,\emph{{x}})=P(y)P(x_{1}|y)\prod_{i=2}^{n}P(x_{i}|\pi_{i}^{\rm{KDB}},y),$ (7)

where $\pi_{i}^{\rm{KDB}}$ denotes the values of parent attributes of $X_{i}$ and $|\pi_{i}^{\rm{KDB}}|\leqslant k$ . The implicit independence assumption for attribute $X_{i}$ in KDB is

$\displaystyle P(x_{i},\pi_{i}-\pi_{i}^{\rm{KDB}}|\pi_{i}^{\rm{KDB}},y)=P(\pi_{i}-\pi_{i}^{\rm{KDB}}|\pi_{i}^{\rm{KDB}},y)P(x_{i}|\pi_{i}^{\rm{KDB}},y).$ (8)

KDB can tune the parameter $k$ and represent arbitrary $k$ -dependence relationships to make the final topology fit datasets of different sizes. Wang et al. [37] propose to apply the log likelihood function as the objective function to measure the extent to which the learned topology fits labeled and unlabeled data. More informational dependence and probabilistic dependence can be mined based on semi-supervised learning. Martínez et al. [38] propose to extend KDB by selecting both the attribute subset for training and the proper value of $k$ in a single additional pass through the training data. By comparing Eqs (6) and (8), we can see that the implicit independence assumption for KDB is weaker than that for TAN. These two state-of-the-art BNCs mine significant dependency relationships and then encode them in network topologies of different complexity, whereas the rationality of the implicit independence assumption has not been verified. Furthermore, information-theoretic metrics, e.g., conditional mutual information (CMI), cannot identify probabilistic (in)dependence, and the CMI conditioned on the class only cannot measure the CMI conditioned on the combination of the parent attributes and the class.
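To make the factorization in Eq. (7) concrete, the sketch below evaluates the joint probability for a given set of parent attributes. It is only an illustration under stated assumptions: the toy data and the `parents` map are hypothetical, and add-one smoothing stands in for whatever estimator a real implementation would use.

```python
# Hypothetical toy data: each row is (x_1, x_2, x_3, y) with binary variables.
rows = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1)]
# Hypothetical parent sets (attribute indices) for a structure with k = 2.
parents = {0: [], 1: [0], 2: [0, 1]}


def kdb_joint(x, y, n_values=2):
    """P(y) * prod_i P(x_i | pi_i, y) for the parent sets above, cf. Eq. (7)."""
    in_class = [r for r in rows if r[-1] == y]
    p = len(in_class) / len(rows)  # P(y) by relative frequency
    for i, pa in parents.items():
        # Rows of class y that agree with x on every parent of X_i.
        match = [r for r in in_class if all(r[j] == x[j] for j in pa)]
        hits = sum(1 for r in match if r[i] == x[i])
        # Add-one smoothing is an assumption made for this sketch.
        p *= (hits + 1) / (len(match) + n_values)
    return p


def kdb_classify(x):
    """Return the class label with the largest joint probability estimate."""
    return max(sorted({r[-1] for r in rows}), key=lambda y: kdb_joint(x, y))


print(kdb_joint((0, 1, 1), 0), kdb_joint((0, 1, 1), 1))
```

Setting every parent list to `[]` recovers the NB factorization, which corresponds to $k=0$ .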

3. $\bm{K}$-independence Bayesian classifier (KIBC)

KDB simply applies $I(X_{i};X_{j}|Y)$ defined by Eq. (9) to measure the conditional dependence between attributes, whereas the conditional dependence conditioned on other attributes rather than the class variable is neglected. Parameter $k$ determines the maximum number of parent attributes and the topology complexity of KDB. KDB can be regarded as a full $k$ -dependence BNC since it fully represents the $k$ -dependence relationships implicated in training data. Thus some independency relationships may be wrongly treated as weak dependencies, which may bias the estimates of conditional probability and degrade classification performance. To make the estimated conditional probability approximate the true one, our proposed KIBC applies information-theoretic metrics to measure informational or probabilistic conditional independence, so that the learned topology can fit data well in terms of both conditional dependence and independence.

$\displaystyle I(X_{i};X_{j}|Y)=\sum_{x_{i}\in X_{i}}\sum_{x_{j}\in X_{j}}\sum_{y\in Y}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}.$ (9)

The prior and joint probabilities can be estimated using the $m$ -estimation ( $m=$ 1) as follows [39]:

$\displaystyle\left\{\begin{array}{l}P(y)=\frac{\sum^{t}_{l=1}\delta(y_{l},y)+1/n_{y}}{t+1}\\ P(x_{i},y)=\frac{\sum^{t}_{l=1}\delta(x_{li},x_{i})\delta(y_{l},y)+1/(n_{i}*n_{y})}{t+1}\\ P(x_{i},x_{j},y)=\frac{\sum^{t}_{l=1}\delta(x_{li},x_{i})\delta(x_{lj},x_{j})\delta(y_{l},y)+1/(n_{i}*n_{j}*n_{y})}{t+1}\end{array}\right.$ (10)

where $t$ is the number of training instances, $n_{i}$ is the number of values of the $i$ -th attribute, $n_{y}$ is the number of classes, $y_{l}$ and $x_{li}$ are the class label and the $i$ -th attribute value of the $l$ -th training instance respectively. $\delta(\cdot)$ is a binary function, which is one if its two parameters are identical and zero otherwise. Then, the conditional probability $P(x_{i}|y)$ and $P(x_{i},x_{j}|y)$ in Eq. (9) can be computed as follows:

$\displaystyle\left\{\begin{array}[]{l}P(x_{i}|y)=\frac{P(x_{i},y)}{P(y)}\\ P(x_{i},x_{j}|y)=\frac{P(x_{i},x_{j},y)}{P(y)}\\ \end{array}\right.$ (11)
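Eqs (10) and (11) translate almost directly into code. The sketch below is a minimal illustration on a hypothetical toy training set with binary variables; the function and variable names are chosen here for readability and are not taken from any referenced implementation.

```python
# Hypothetical toy training set: each row is (x_1, x_2, y); all variables are binary.
rows = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 1, 1)]
t = len(rows)       # number of training instances
n_vals = [2, 2]     # n_i: number of values of the i-th attribute
n_y = 2             # number of classes


def p_y(y):
    """P(y) with m-estimation (m = 1), first line of Eq. (10)."""
    return (sum(r[-1] == y for r in rows) + 1 / n_y) / (t + 1)


def p_xy(i, xi, y):
    """P(x_i, y), second line of Eq. (10)."""
    count = sum(r[i] == xi and r[-1] == y for r in rows)
    return (count + 1 / (n_vals[i] * n_y)) / (t + 1)


def p_xxy(i, xi, j, xj, y):
    """P(x_i, x_j, y), third line of Eq. (10)."""
    count = sum(r[i] == xi and r[j] == xj and r[-1] == y for r in rows)
    return (count + 1 / (n_vals[i] * n_vals[j] * n_y)) / (t + 1)


def p_x_given_y(i, xi, y):
    """P(x_i | y) = P(x_i, y) / P(y), first line of Eq. (11)."""
    return p_xy(i, xi, y) / p_y(y)


print(p_y(0), p_xy(0, 0, 0), p_x_given_y(0, 0, 0))
```

The correction terms $1/n_{y}$ , $1/(n_{i}*n_{y})$ and $1/(n_{i}*n_{j}*n_{y})$ each add up to one over all cells, so every estimated table still sums to one.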

3.1 Verification of implicit independence assumption implicated in training data

KDB assumes that the attributes that are more correlated with the class are preferable, and the attributes are sorted in descending order of $I(X_{i};Y)$ defined by Eq. (12). The impacts of the first few attributes on classification are supposed to be significant.

$\displaystyle I(X_{i};Y)=\sum_{x_{i}\in X_{i}}\sum_{y\in Y}P(x_{i},y)\log\frac{P(x_{i},y)}{P(x_{i})P(y)}.$ (12)
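The two metrics in Eqs (9) and (12), together with the attribute ordering, can be sketched as follows. This is an illustration under stated assumptions: plain relative frequencies replace the $m$ -estimation of Eq. (10) to keep the code short, the natural logarithm is used because the base is not fixed by the text, and the toy data are hypothetical.

```python
from collections import Counter
from math import log

# Hypothetical toy data: each row is (x_1, x_2, x_3, y). Here x_1 equals y,
# and x_3 is the exclusive-or of x_1 and x_2.
rows = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1)]
t = len(rows)


def mi(i):
    """I(X_i; Y) from relative frequencies, cf. Eq. (12)."""
    c_xy = Counter((r[i], r[-1]) for r in rows)
    c_x = Counter(r[i] for r in rows)
    c_y = Counter(r[-1] for r in rows)
    return sum(c / t * log(c * t / (c_x[x] * c_y[y]))
               for (x, y), c in c_xy.items())


def cmi(i, j, z=-1):
    """I(X_i; X_j | Z) with Z given by column z (default: the class), cf. Eq. (9)."""
    c_abz = Counter((r[i], r[j], r[z]) for r in rows)
    c_az = Counter((r[i], r[z]) for r in rows)
    c_bz = Counter((r[j], r[z]) for r in rows)
    c_z = Counter(r[z] for r in rows)
    return sum(c / t * log(c * c_z[v] / (c_az[(a, v)] * c_bz[(b, v)]))
               for (a, b, v), c in c_abz.items())


# Attribute order used by KDB: descending mutual information with the class.
order = sorted(range(3), key=mi, reverse=True)
print(order, mi(0), cmi(1, 2))
```

In this toy data $X_{2}$ and $X_{3}$ each have zero mutual information with the class, yet they are strongly dependent on each other given the class, which is the kind of relationship that the CMI is meant to capture.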

Given the attribute order $\{X_{1},X_{2},\cdots,X_{n}\}$ , KDB requires that attribute $X_{i}$ take at most $k$ of the attributes before it as its parents, and in the corresponding network topology there are directed edges pointing from the parents to $X_{i}$ . $\Pi_{i}=\{X_{1},X_{2},\cdots,X_{i-1}\}$ is the set of candidate parents for $X_{i}$ . KDB identifies the right parents from $\Pi_{i}$ by comparing $I(X_{i};X_{j}|Y)$ where $1\leqslant j\leqslant i-1$ . Note that each of the first $k+1$ attributes that enter the network will select all the attributes in $\Pi_{i}$ as its parents, that is, $\Pi_{i}^{\rm{KDB}}=\Pi_{i}$ for $i\leqslant k+1$ . Each of the remaining $n-k-1$ attributes will select the best $\Pi_{i}^{\rm{KDB}}$ as parents, where $\Pi_{i}^{\rm{KDB}}\subset\Pi_{i}$ . Thus the implicit independence assumptions exist between these $n-k-1$ attributes and their unselected candidate parents, or more precisely, between $X_{i}$ and $\Pi_{i}-\Pi_{i}^{\rm{KDB}}$ given $\{\Pi_{i}^{\rm{KDB}},Y\}$ as shown in Eq. (8).

For simplicity, the criterion $I(X_{i};\Pi_{i}-\Pi_{i}^{\rm{KDB}}|\Pi_{i}^{\rm{KDB}},Y)=0$ can be applied to identify conditional independence. However, the computational complexity increases as the values of $k$ and $i$ increase, and that may result in biased estimate of conditional probability and wrong identification of conditional (in)dependence. To address this issue, we transform the conditional joint mutual information into two sets of CMIs: $I(X_{i};X_{j}|X_{q})$ and $I(X_{i};X_{j}|Y)$ , where $X_{j}\in\Pi_{i}-\Pi_{i}^{\rm{KDB}}$ and $X_{q}\in\Pi_{i}^{\rm{KDB}}$ . If the maximum of $I(X_{i};X_{j}|X_{q})$ and $I(X_{i};X_{j}|Y)$ is smaller than the given threshold $\varepsilon$ , then we assume that there exist strong conditional independencies between $X_{i}$ and $\Pi_{i}-\Pi_{i}^{\rm{KDB}}$ given $\{\Pi_{i}^{\rm{KDB}},Y\}$ .

The attributes enter into the network topology of KDB according to a pre-determined attribute order, e.g., $\{X_{1},X_{2},\cdots,X_{n}\}$ , and for attribute $X_{i}$ , KDB selects at most $k$ attributes before it as the parent attributes. KIBC also follows this rule. To illustrate the detailed learning procedure of KIBC, we take dataset mfeat-mor (see Table 2 for detail) as an example and set $k=$ 1 for simplicity. After sorting the attributes in descending order of $I(X_{i};Y)$ , the attribute order is $\{X_{6},X_{2},X_{5},X_{1},X_{4},X_{3}\}$ . As described above, $X_{6}$ is the first one in the topology and thus it has no parents. The second one, i.e., $X_{2}$ , takes $X_{6}$ as its only parent attribute. The initial topology composed of the first $k+1$ (i.e., 2) attributes is shown in Fig. 3a.

Starting from $X_{5}$ , KIBC needs to select $k$ parents from the candidate ones for each attribute in the order. The candidate parents for $X_{5}$ are $\{X_{6},X_{2}\}$ , which correspond to two potential directed edges, represented by the two red dashed directed lines in Fig. 3b. If $X_{2}$ is selected as the only parent of $X_{5}$ , then the complete form of the conditional probability for attribute $X_{5}$ , i.e., $P(x_{5}|x_{2},x_{6},y)$ , is simplified to $P(x_{5}|x_{2},y)$ . According to Eq. (5), the maximum of $I(X_{5};X_{6}|X_{2})$ and $I(X_{5};X_{6}|Y)$ , which is denoted by $\max I(X_{56|2})$ , is applied to measure the significance of possible conditional dependence between $X_{5}$ and $X_{6}$ given $\{X_{2},Y\}$ . On the other hand, if $X_{6}$ is selected as the only parent of $X_{5}$ , then the maximum of $I(X_{5};X_{2}|X_{6})$ and $I(X_{5};X_{2}|Y)$ , which is denoted by $\max I(X_{52|6})$ , measures the significance of possible conditional dependence between $X_{5}$ and $X_{2}$ given $\{X_{6},Y\}$ . The values of corresponding CMIs are listed in Table 1, from which we can see that $\max I(X_{56|2})<\max I(X_{52|6})$ , thus $X_{2}$ is selected as the only parent of $X_{5}$ . The learning procedure repeats until the parent attributes are found for each attribute, and the resulting KIBC ${}_{\mathcal{T}}$ algorithm is described in Algorithm 3.

Table 1

The results of $I(X_{52|6})$ and $I(X_{56|2})$ on mfeat-mor

$X_{j}$	$I(X_{5};X_{j}|X_{6})$	$I(X_{5};X_{j}|X_{2})$	$I(X_{5};X_{j}|Y)$
$X_{2}$	0.104879	$\sim$	0.085645
$X_{6}$	$\sim$	0.926403	0.504635

Figure 3.

The learning process of KIBC with $k=$ 1 on data set mfeat-mor.

The learning process of KIBC ${}_{\mathcal{T}}$ .
Input: Training set $\mathcal{T}$ with attribute set $\textbf{X}=\{X_{1},\ldots,X_{n}\}$ , class $Y$ and $k$ .
Output: $\mathcal{G}$ , the structure of KIBC ${}_{\mathcal{T}}$ .

Calculate MI for all attributes. // See Eq. (12)
Let $\mathcal{G}$ be a DAG and $L$ be a list of all attributes in descending order of MI.
Let $\Pi$ be a two-dimensional vector of candidate parents for attributes, where $\Pi_{i}$ is the vector of candidate parents for $X_{i}$ and $\Pi_{ij}=L_{j}$ .
for $i=1\to n$ do
    Add a node to $\mathcal{G}$ representing $L_{i}$ and an edge from $Y$ to $L_{i}$ ;
    if $i>k+1$ then
        for $q=1\to\Pi_{i}$ .size do
            Calculate CMIs $I(L_{i};\Pi_{ij}|\Pi_{iq})$ and $I(L_{i};\Pi_{ij}|Y)$ for each pair of $L_{i}$ and $\Pi_{ij}$ ( $j\neq q$ ). // See Eq. (9)
            Let CMI ${}^{q}_{\max}$ be the maximum of all CMIs given $\Pi_{iq}$ .
        end for
        Sort attributes in $\Pi_{i}$ in ascending order of their corresponding CMI ${}_{\max}$ .
        Add edges from the first $k$ attributes in $\Pi_{i}$ to $L_{i}$ .
    else
        Add edges from each attribute in $\Pi_{i}$ to $L_{i}$ .
    end if
end for
return $\mathcal{G}$ .
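One possible reading of the structure learning procedure of KIBC ${}_{\mathcal{T}}$ is sketched below in Python. It assumes that the MI and CMI values have already been computed and are passed in as plain containers; the function name, the `(i, j, q)` key layout with `"Y"` marking the class as the conditioning variable, and the numeric values are all hypothetical and serve only as an illustration.

```python
def learn_kibc_structure(mi, cmi, k):
    """Return the attribute order and the parent list of every attribute.

    mi[i] is I(X_i; Y); cmi[(i, j, q)] is I(X_i; X_j | X_q), and
    cmi[(i, j, "Y")] is I(X_i; X_j | Y).
    """
    order = sorted(range(len(mi)), key=lambda i: mi[i], reverse=True)
    parents = {}
    for pos, xi in enumerate(order):
        cand = order[:pos]          # candidate parents: attributes ranked earlier
        if pos > k:                 # same as i > k + 1 with 1-based positions
            def cmi_max(q):
                # Largest CMI between X_i and any other candidate X_j (j != q),
                # conditioned on the candidate X_q or on the class.
                return max((max(cmi[(xi, j, q)], cmi[(xi, j, "Y")])
                            for j in cand if j != q), default=0.0)
            # Keep the k candidates with the smallest CMI_max.
            cand = sorted(cand, key=cmi_max)[:k]
        parents[xi] = cand
    return order, parents


# Hypothetical MI and CMI values for three attributes and k = 1.
mi = [0.9, 0.5, 0.1]
cmi = {(2, 1, 0): 0.05, (2, 1, "Y"): 0.02, (2, 0, 1): 0.40, (2, 0, "Y"): 0.30}
order, parents = learn_kibc_structure(mi, cmi, k=1)
print(order, parents)
```

In this toy example, taking attribute 0 as the parent of attribute 2 leaves a maximum CMI of 0.05 with the unselected candidate, whereas taking attribute 1 would leave 0.40, so attribute 0 is retained. The edge from $Y$ to every attribute is implicit and not stored in `parents`.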

3.2 Verification of implicit independence assumption implicated in testing instance

How to improve the generalization performance of BNCs learned from training data is always a challenging issue in data mining [40]. For one specific testing instance, KIBC ${}_{\mathcal{T}}$ may only represent part of the “right” dependencies implicated in this instance, and the remaining dependencies in KIBC ${}_{\mathcal{T}}$ may not have a positive effect on classification. To this end, we apply an instance-based learning method to build a specific model, called KIBC ${}_{\mathcal{P}}$ , for each testing instance. KIBC ${}_{\mathcal{P}}$ encodes the dependency relationships among the attribute values implicated in that instance. The specific testing instance $\emph{{x}}=\{x_{1},x_{2},\ldots,x_{n}\}$ is transformed into a pseudo training dataset $\mathcal{P}$ , which contains $m$ instances $\mathcal{P}_{i}=\{x_{1},x_{2},\ldots,x_{n},y_{i}\}$ , where $1\leqslant i\leqslant m$ and $m$ is the number of class labels. The corresponding instance-based MI and the instance-based CMI are defined as follows [41]:

$\displaystyle\left\{\begin{array}{l}I_{\mathcal{P}}(x_{i};Y)=\sum\limits_{y\in Y}P(x_{i},y)\log\frac{P(x_{i},y)}{P(x_{i})P(y)}\\ I_{\mathcal{P}}(x_{i};x_{j}|Y)=\sum\limits_{y\in Y}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}\end{array}\right.$ (13)

The criterion for identifying the implicit independence assumption becomes $I_{\mathcal{P}}(x_{i};\pi_{i}-\pi_{i}^{\rm{KIBC}}|\pi_{i}^{\rm{KIBC}},Y)=0$ . Similar to the learning procedure described in Section 3.1, the maximum of $I_{\mathcal{P}}(x_{i};x_{j}|x_{q})$ and $I_{\mathcal{P}}(x_{i};x_{j}|Y)$ , where $x_{j}\in\pi_{i}-\pi_{i}^{\rm{KIBC}}$ and $x_{q}\in\pi_{i}^{\rm{KIBC}}$ , is applied for comparison. The detailed learning procedure is shown in Algorithm 3.2.

The learning process of KIBC ${}_{\mathcal{P}}$ .
Input: Training set $\mathcal{D}$ with attribute set $X=\{X_{1},\ldots,X_{n}\}$ , class $Y$ , testing instance $\emph{{x}}=\{x_{1},x_{2},\ldots,x_{n}\}$ and $k$ .
Output: $\mathcal{G}$ , the structure of KIBC ${}_{\mathcal{P}}$ .

Calculate instance-based MI for all attributes. // See Eq. (13)
Let $\mathcal{G}$ be a DAG and $L$ be a list of all attributes in descending order of instance-based MI.
Let $\Pi$ be a two-dimensional vector of candidate parents for attributes, where $\Pi_{i}$ is the vector of candidate parents for $X_{i}$ and $\Pi_{ij}=L_{j}$ .
for $i=1\to n$ do
    Add a node to $\mathcal{G}$ representing $L_{i}$ and an edge from $Y$ to $L_{i}$ ;
    if $i>k+1$ then
        for $q=1\to\Pi_{i}$ .size do
            Calculate instance-based CMIs $I_{\mathcal{P}}(l_{i};\pi_{ij}|\pi_{iq})$ and $I_{\mathcal{P}}(l_{i};\pi_{ij}|Y)$ for each pair of attribute values $l_{i}$ and $\pi_{ij}$ ( $j\neq q$ ). // See Eq. (13)
            Let CMI ${}^{q}_{\max}$ be the maximum of all instance-based CMIs given $\pi_{iq}$ .
        end for
        Sort attributes in $\Pi_{i}$ in ascending order of their corresponding CMI ${}_{\max}$ .
        Add edges from the first $k$ attributes in $\Pi_{i}$ to $L_{i}$ .
    else
        Add edges from each attribute in $\Pi_{i}$ to $L_{i}$ .
    end if
end for
return $\mathcal{G}$ .
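The instance-based MI of Eq. (13) and the pseudo training dataset $\mathcal{P}$ can be illustrated with the short sketch below. It relies on assumptions that are not prescribed by the text: probabilities are plain relative frequencies from a hypothetical toy training set, the natural logarithm is used, and empty cells are skipped.

```python
from math import log

# Hypothetical toy training data: each row is (x_1, x_2, x_3, y).
rows = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1)]
classes = sorted({r[-1] for r in rows})
t = len(rows)


def instance_mi(i, xi):
    """I_P(x_i; Y): unlike Eq. (12), the sum runs over class labels only."""
    n_x = sum(1 for r in rows if r[i] == xi)
    total = 0.0
    for y in classes:
        n_y = sum(1 for r in rows if r[-1] == y)
        n_xy = sum(1 for r in rows if r[i] == xi and r[-1] == y)
        if n_xy:  # cells with zero count contribute nothing
            total += n_xy / t * log(n_xy * t / (n_x * n_y))
    return total


def pseudo_dataset(x):
    """Pair the testing instance with every class label, one row per label."""
    return [tuple(x) + (y,) for y in classes]


x = (0, 1, 1)
print(pseudo_dataset(x))
print([instance_mi(i, xi) for i, xi in enumerate(x)])
```

Only the attribute values that actually occur in the testing instance are evaluated, so the attribute order may differ from one instance to another.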

3.3 Posterior probability for classification and complexity analysis

Figure 4.

The architecture of KIBC.

Since the joint probability distributions encoded in base learners are approximations of the true one, it is natural to consider aggregating them together to yield a much more accurate probability distribution estimation [42]. Meanwhile, to achieve a better result, the base learners should be as accurate as possible, and as diverse as possible [43]. As shown in Fig. 4, KIBC ${}_{\mathcal{T}}$ and KIBC ${}_{\mathcal{P}}$ respectively learn the structure from $\mathcal{T}$ and $\mathcal{P}$ by applying the same learning strategy, thus they are complementary in nature. In this paper, the linear combiner is used and the resulting KIBC makes the final decision by using

$\displaystyle y^{*}=\arg\max_{y\in Y}\{\omega_{\mathcal{T}}P_{\mathrm{KIBC}_{\mathcal{T}}}(y|\emph{{x}})+\omega_{\mathcal{P}}P_{\mathrm{KIBC}_{\mathcal{P}}}(y|\emph{{x}})\},$ (14)

where $\omega_{\mathcal{T}}$ and $\omega_{\mathcal{P}}$ are the weights of KIBC ${}_{\mathcal{T}}$ and KIBC ${}_{\mathcal{P}}$ , respectively. Since KIBC ${}_{\mathcal{T}}$ and KIBC ${}_{\mathcal{P}}$ learn from different types of data, distinct weights may achieve better performance than uniform weights. However, because suitable weights may vary from instance to instance, the uniform weights $\omega_{\mathcal{T}}=\omega_{\mathcal{P}}=1/2$ are applied for convenience.
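The decision rule of Eq. (14) can be illustrated with a minimal Python sketch. The function name `combine_posteriors` and the posterior values below are made up for illustration; in KIBC the two posteriors would come from KIBC ${}_{\mathcal{T}}$ and KIBC ${}_{\mathcal{P}}$ .

```python
from typing import Dict, Hashable


def combine_posteriors(
    p_t: Dict[Hashable, float],
    p_p: Dict[Hashable, float],
    w_t: float = 0.5,
    w_p: float = 0.5,
) -> Hashable:
    """Return the class label maximising w_t * P_T(y|x) + w_p * P_P(y|x)."""
    scores = {y: w_t * p_t[y] + w_p * p_p[y] for y in p_t}
    return max(scores, key=scores.get)


# Illustrative posteriors for one testing instance (made-up numbers).
p_kibc_t = {"yes": 0.60, "no": 0.40}
p_kibc_p = {"yes": 0.30, "no": 0.70}
prediction = combine_posteriors(p_kibc_t, p_kibc_p)  # uniform weights 1/2
```

With uniform weights the combined scores are 0.45 for "yes" and 0.55 for "no", so the sketch predicts "no". A nonuniform choice such as `w_t=0.9, w_p=0.1` would flip the decision to "yes".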

When $k=2$ , at training time, KIBC generates a three-dimensional table of occurrence counts for each pair of attribute values and each class value. The corresponding time complexity is $\mathcal{O}(tn^{2})$ , where $t$ and $n$ are the numbers of training instances and attributes, respectively. Calculating MI and CMI requires $\mathcal{O}(m(nv)^{2})$ time, where $m$ is the number of class labels and $v$ is the maximum number of values per attribute. KIBC learns the network topology by repeatedly identifying the significant conditional (in)dependencies, and building the network of KIBC ${}_{\mathcal{T}}$ takes $\mathcal{O}(n^{2}\log n)$ time. Summing these terms, the time complexity at training time is $\mathcal{O}(tn^{2}+m(nv)^{2}+n^{2}\log n)$ . At classification time, KIBC considers only the specific attribute values in the given instance, so calculating instance-based MI and instance-based CMI requires $\mathcal{O}(mn^{2})$ time. Building the network of KIBC ${}_{\mathcal{P}}$ has the same time complexity as building that of KIBC ${}_{\mathcal{T}}$ , and classifying a testing instance needs only $\mathcal{O}(mn)$ time. Keeping only the dominant terms, the time complexity at classification time is $\mathcal{O}(n^{2}\log n+mn^{2})$ .
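To make the $\mathcal{O}(tn^{2})$ term concrete, the following Python sketch builds such a count table in a single pass over the training data. The function name `pairwise_counts` and the sparse `Counter` representation are assumptions chosen for brevity, not the data structure used in the actual implementation.

```python
from collections import Counter
from itertools import combinations
from typing import Hashable, Sequence


def pairwise_counts(
    rows: Sequence[Sequence[Hashable]], labels: Sequence[Hashable]
) -> Counter:
    """Count co-occurrences of each attribute-value pair with each class value.

    Keys are (i, x_i, j, x_j, y) with i < j. Each of the t instances touches
    n * (n - 1) / 2 attribute pairs, i.e. O(t * n^2) updates overall.
    """
    counts: Counter = Counter()
    for row, y in zip(rows, labels):
        for i, j in combinations(range(len(row)), 2):
            counts[(i, row[i], j, row[j], y)] += 1
    return counts
```

All probability estimates needed for MI and CMI can then be read from these counts without revisiting the training instances.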

Table 2

Descriptions of 40 UCI datasets used in the experiments

No.	Dataset	Instance	Attribute	Class	No.	Dataset	Instance	Attribute	Class
1	lymphography	148	18	4	21	segment	2310	19	7
2	iris	150	4	3	22	hypothyroid ${}^{*}$	3163	25	2
3	teaching-ae	151	5	3	23	kr-vs-kp	3196	36	2
4	wine	178	13	3	24	dis ${}^{*}$	3772	29	2
5	glass-id	214	9	3	25	hypo ${}^{*}$	3772	29	4
6	primary-tumor ${}^{*}$	339	17	22	26	sick ${}^{*}$	3772	29	2
7	ionosphere	351	34	2	27	spambase	4601	57	2
8	dermatology ${}^{*}$	366	34	6	28	phoneme	5438	7	50
9	horse-colic ${}^{*}$	368	21	2	29	wall-following	5456	24	4
10	house-votes-84	435	16	2	30	page-blocks	5473	10	5
11	chess	551	39	2	31	satellite	6435	36	6
12	credit-a ${}^{*}$	690	15	2	32	mushrooms ${}^{*}$	8124	22	2
13	crx ${}^{*}$	690	15	2	33	thyroid ${}^{*}$	9169	29	20
14	vehicle	846	18	4	34	sign	12546	8	3
15	anneal ${}^{*}$	898	38	6	35	magic	19020	10	2
16	tic-tac-toe	958	9	2	36	letter-recog	20000	16	26
17	vowel	990	13	11	37	adult ${}^{*}$	48842	14	2
18	led	1000	7	10	38	shuttle	58000	9	7
19	contraceptive-mc ${}^{*}$	1473	9	3	39	connect-4	67557	42	3
20	mfeat-mor	2000	6	10	40	localization	164860	5	11

The datasets with missing values are denoted with the symbol “*”.

4. Experiments

To evaluate the performance of our proposed KIBC, we conduct a group of experiments on 40 UCI datasets$^{1}$ in terms of ZOL, RMSE, bias and variance. The Friedman and Nemenyi tests are used to assess the statistical significance of the experimental results. The details of all datasets are shown in Table 2, including the numbers of instances, attributes and classes. All of these datasets except lymphography, house-votes-84, chess, crx, tic-tac-toe, led, kr-vs-kp, phoneme, mushrooms and connect-4 contain at least one numeric attribute. For each dataset, numeric attributes are discretized using Minimum Description Length (MDL) discretization [44]. Missing values of qualitative attributes are replaced with the most frequent value, and missing values of quantitative attributes are replaced with the mean value. Each algorithm is evaluated with 10 rounds of 10-fold cross-validation. We compare KIBC with seven competitors, comprising two single models and five ensemble models:

SKDB [38], selective KDB with $k=$ 2.

CFWNB [15], correlation-based feature weighting filter for NB.

WATAN [36], weighted averaged TAN.

IWAODE [31], instance-based weighting AODE.

WAODE-MI [33], weighted AODE by assigning weights using MI.

TAODE [32], targeted AODE.

DWAODE [45], double weighting schema of AODE.

Note that, to balance efficiency and classification accuracy, we restrict the structure complexity of KIBC to be two-dependence (i.e., $k=$ 2). The detailed experimental results in terms of ZOL, RMSE, bias and variance are shown in Tables A1–A4. To compare the models, we apply one-tailed $t$ -tests on all datasets [46]. The results are reported as $w(w_{s})$ , where $w$ is the number of datasets on which the model in the column performs better than the model in the corresponding row, and $w_{s}$ is the number of datasets on which the model in the column achieves a significant win ( $p\leqslant 0.05$ ) over the model in the corresponding row.
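As a small illustration of how the count $w$ is obtained, the following Python sketch tallies wins, ties and losses for one pair of models on five datasets whose ZOL values are copied from Table A1. The function name `win_tie_loss` is hypothetical. The significance count $w_{s}$ is not computed here, since it requires the per-fold results on which the $t$ -test is run.

```python
from typing import Dict, Tuple

# Zero-one loss on five datasets, copied from Table A1 (lower is better).
zol: Dict[str, Dict[str, float]] = {
    "lymphography": {"SKDB": 0.2365, "KIBC": 0.1486},
    "ionosphere": {"SKDB": 0.0912, "KIBC": 0.0741},
    "kr-vs-kp": {"SKDB": 0.0416, "KIBC": 0.0457},
    "mushrooms": {"SKDB": 0.0000, "KIBC": 0.0000},
    "localization": {"SKDB": 0.2964, "KIBC": 0.3064},
}


def win_tie_loss(
    results: Dict[str, Dict[str, float]], column: str, row: str
) -> Tuple[int, int, int]:
    """Count datasets where `column` has lower, equal or higher loss than `row`."""
    wins = ties = losses = 0
    for scores in results.values():
        if scores[column] < scores[row]:
            wins += 1
        elif scores[column] > scores[row]:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses


record = win_tie_loss(zol, column="KIBC", row="SKDB")  # (2, 1, 2)
```

On this subset KIBC wins on two datasets, ties on one and loses on two. Applied to all 40 datasets, the first component corresponds to the entry $w$ in the SKDB row and KIBC column of the comparison table.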

4.1 Zero-one loss and RMSE

ZOL [47] is a commonly used loss function for measuring prediction accuracy. Table 3 shows the ZOL results of the one-tailed $t$ -test in the form of $w(w_{s})$ . SKDB beats CFWNB on 27 datasets and loses on 13, which suggests that structure extension is more effective in improving classification performance than attribute weighting. Owing to the advantage of ensemble models, AODE’s variants, such as WAODE-MI, perform better than CFWNB (28 wins and 11 losses) and WATAN (25 wins and 14 losses). KIBC outperforms the other high-dependency models, including SKDB (34 wins and 5 losses) and WATAN (36 wins and 4 losses). KIBC also compares favorably with AODE’s variants, namely IWAODE (26 wins and 10 losses), WAODE-MI (30 wins and 6 losses), TAODE (30 wins and 7 losses) and DWAODE (29 wins and 11 losses). These results show that KIBC achieves the best performance among all classifiers in terms of ZOL.

Table 3
Comparisons for KIBC and the alternative classifiers in terms of ZOL

	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
SKDB	–	13 (10)	15 (10)	17 (13)	19 (15)	18 (16)	21 (18)	34 (25)
CFWNB	27 (26)	–	26 (21)	27 (25)	28 (26)	25 (23)	23 (23)	31 (28)
WATAN	21 (16)	14 (10)	–	20 (17)	25 (19)	23 (17)	24 (18)	36 (30)
IWAODE	20 (14)	12 (6)	19 (9)	–	20 (10)	21 (9)	24 (13)	26 (22)
WAODE-MI	19 (13)	11 (5)	14 (5)	15 (7)	–	20 (7)	22 (11)	30 (20)
TAODE	20 (13)	14 (7)	15 (8)	16 (8)	17 (7)	–	22 (13)	30 (21)
DWAODE	17 (11)	17 (11)	15 (7)	14 (10)	16 (9)	12 (4)	–	29 (21)
KIBC	5 (1)	7 (5)	4 (2)	10 (3)	6 (1)	7 (4)	11 (10)	–

To illustrate the advantages of KIBC more intuitively, Fig. 5 shows scatter plots comparing KIBC against the other algorithms in terms of ZOL. Points that fall close to the diagonal line indicate that KIBC performs similarly to the alternative algorithm. Most data points lie under the diagonal line, which means that KIBC performs better than the other algorithms on most datasets, and the advantage is clearly visible.

Figure 5.

Scatter plot of comparisons in terms of ZOL.

Figure 6.

Scatter plot of comparisons in terms of RMSE.

Table 4

Comparisons for KIBC and the alternative classifiers in terms of RMSE

	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
SKDB	–	14 (8)	19 (5)	19 (12)	19 (12)	20 (12)	22 (12)	36 (14)
CFWNB	26 (25)	–	27 (22)	28 (24)	26 (24)	25 (25)	23 (23)	28 (27)
WATAN	21 (11)	13 (9)	–	26 (11)	26 (13)	26 (12)	22 (13)	35 (23)
IWAODE	21 (12)	12 (4)	13 (7)	–	24 (8)	22 (5)	19 (6)	27 (18)
WAODE-MI	21 (8)	14 (6)	14 (2)	16 (4)	–	21 (2)	18 (3)	26 (12)
TAODE	20 (9)	15 (8)	14 (7)	18 (5)	19 (3)	–	16 (2)	26 (12)
DWAODE	18 (10)	17 (10)	18 (7)	21 (7)	22 (6)	22 (3)	–	29 (14)
KIBC	4 (2)	11 (6)	5 (1)	13 (2)	14 (3)	14 (5)	11 (6)	–

RMSE is usually used to measure the deviation between the observed value and the true value [48]. In this section, we use RMSE to measure the calibration of the class probability predictions of a model. The comparison results against the other seven algorithms in terms of RMSE are shown in Table 4. As can be seen, KIBC enjoys obvious advantages over SKDB (36 wins and 4 losses), CFWNB (28 wins and 11 losses) and WATAN (35 wins and 5 losses). Meanwhile, KIBC performs slightly better than AODE’s variants including IWAODE (27 wins and 13 losses), WAODE-MI (26 wins and 14 losses), TAODE (26 wins and 14 losses) and DWAODE (29 wins and 11 losses). The scatter plots in Fig. 6 show the comparison results of KIBC against the other classifiers in terms of RMSE. A diamond symbol indicates that KIBC performs better than the alternative classifier on the corresponding dataset in terms of RMSE, and a cross under the dotted line means a worse result for KIBC. Note that some outlier points are removed for significance analysis. As can be seen, most data points are diamond symbols, which indicates that KIBC performs better than the other algorithms on most datasets.
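The two metrics of this section can be sketched in Python as follows. The formulas are assumptions, since the text does not spell them out: ZOL is taken as the fraction of misclassified instances, and RMSE is computed from the probability assigned to the true class, which is one common definition for probabilistic classifiers. The function names and the example posteriors are made up for illustration.

```python
import math
from typing import Dict, Hashable, Sequence


def zero_one_loss(
    posteriors: Sequence[Dict[Hashable, float]], truths: Sequence[Hashable]
) -> float:
    """Fraction of instances whose most probable class differs from the truth."""
    wrong = sum(1 for p, y in zip(posteriors, truths) if max(p, key=p.get) != y)
    return wrong / len(truths)


def rmse(
    posteriors: Sequence[Dict[Hashable, float]], truths: Sequence[Hashable]
) -> float:
    """Assumed definition: sqrt of the mean of (1 - P(true class | x))^2."""
    squared = sum((1.0 - p[y]) ** 2 for p, y in zip(posteriors, truths))
    return math.sqrt(squared / len(truths))


# Made-up posteriors for three testing instances.
posteriors = [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}, {"a": 0.7, "b": 0.3}]
truths = ["a", "b", "b"]
zol_value = zero_one_loss(posteriors, truths)  # one of three is misclassified
rmse_value = rmse(posteriors, truths)
```

Unlike ZOL, the RMSE sketch also penalises correct predictions made with low confidence, which is why it reflects calibration rather than accuracy alone.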

4.2 Bias and variance

In this section, the bias-variance decomposition is used to further analyze the performance of the models. Bias measures how closely the model can describe the decision surfaces, and variance reflects the model’s sensitivity to variations in the training set [49]. The experimental results in terms of bias and variance are shown in Table 5. As can be observed, KIBC performs the best among all algorithms in terms of bias. For example, KIBC beats CFWNB on 32 datasets and loses on 8, and KIBC beats WATAN on 35 datasets and loses on 5. Meanwhile, KIBC also performs much better than AODE’s variants including IWAODE (32 wins and 8 losses), WAODE-MI (29 wins and 10 losses), TAODE (28 wins and 11 losses) and DWAODE (27 wins and 12 losses). The experimental results demonstrate that KIBC fits the datasets better than the other algorithms. Variance-wise, we can observe that KIBC performs better than SKDB (33 wins and 6 losses) and WATAN (30 wins and 10 losses). Note that CFWNB and AODE’s variants have excellent variance performance since their structures are definite, that is, they are not sensitive to variations in datasets.

Table 5
Comparisons for KIBC and the alternative classifiers in terms of bias and variance

		SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
bias	SKDB	–	9 (8)	9 (3)	10 (7)	12 (11)	13 (11)	16 (11)	25 (15)
	CFWNB	31 (29)	–	28 (26)	27 (25)	29 (27)	29 (25)	29 (24)	32 (31)
	WATAN	31 (24)	12 (9)	–	19 (17)	23 (16)	24 (18)	25 (21)	35 (30)
	IWAODE	29 (23)	13 (10)	21 (15)	–	28 (15)	26 (14)	25 (16)	32 (26)
	WAODE-MI	28 (20)	11 (7)	16 (8)	12 (8)	–	19 (4)	24 (11)	29 (25)
	TAODE	27 (19)	11 (10)	16 (10)	13 (7)	19 (6)	–	26 (12)	28 (23)
	DWAODE	24 (18)	11 (7)	15 (10)	12 (7)	15 (8)	11 (4)	–	27 (22)
	KIBC	14 (5)	8 (8)	5 (1)	8 (5)	10 (5)	11 (5)	12 (8)	–
variance	SKDB	–	33 (31)	28 (24)	30 (28)	31 (29)	31 (26)	30 (27)	33 (30)
	CFWNB	6 (6)	–	6 (4)	12 (11)	11 (10)	12 (11)	10 (10)	10 (9)
	WATAN	12 (10)	34 (34)	–	35 (31)	31 (27)	30 (25)	31 (28)	30 (28)
	IWAODE	8 (7)	27 (25)	5 (4)	–	8 (6)	5 (4)	5 (4)	8 (7)
	WAODE-MI	8 (7)	28 (27)	8 (5)	31 (22)	–	12 (6)	12 (9)	16 (11)
	TAODE	9 (7)	28 (28)	8 (7)	35 (31)	27 (15)	–	13 (10)	17 (12)
	DWAODE	8 (7)	29 (29)	9 (9)	33 (30)	25 (19)	27 (17)	–	17 (12)
	KIBC	6 (4)	30 (30)	10 (8)	30 (28)	24 (21)	22 (21)	22 (17)	–

4.3 Friedman and Nemenyi test

Table 6
Average ranks of the algorithms

Algorithm	ZOL rank	RMSE rank	bias rank	variance rank
SKDB	4.6000	4.7250	3.3750	6.4875
CFWNB	5.7375	5.5875	6.1250	2.7250
WATAN	5.1875	5.2375	5.2375	6.1125
IWAODE	4.7875	4.4625	5.4125	2.7500
WAODE-MI	4.5000	4.2500	4.5375	3.9750
TAODE	4.6000	4.2250	4.5875	4.4750
DWAODE	4.1625	4.7000	3.9750	4.8125
KIBC	2.4250	2.8125	2.7500	4.6625
Result of the Friedman test	7.3751	5.0442	10.1660	18.1097

Figure 7.

The comparison results of the Nemenyi test in terms of (a) ZOL, (b) RMSE, (c) bias and (d) variance on 40 datasets. CD $=$ 1.6601.

To explore the statistical significance of the experimental results, we perform the Friedman test [50] in terms of ZOL, RMSE, bias and variance. The null hypothesis of the Friedman test is that there is no difference in average ranks. With 8 classifiers and 40 datasets, the Friedman statistic is distributed according to the $F$ distribution with $8-1=7$ and $(8-1)\times(40-1)=273$ degrees of freedom. The critical value of $F$ (7,273) for $\alpha=$ 0.05 is 2.0432. As Table 6 shows, the Friedman statistics for ZOL, RMSE, bias and variance in our experiments are 7.3751, 5.0442, 10.1660 and 18.1097, respectively. Therefore, we can reject the null hypothesis.
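For reference, the $F$ -distributed form of the Friedman statistic can be computed directly from the average ranks, as in the following Python sketch. The function name `friedman_f` is hypothetical. Because the sketch works from the rounded average ranks of Table 6 alone and applies no tie correction, its output need not coincide exactly with the statistics reported there.

```python
from typing import Sequence


def friedman_f(avg_ranks: Sequence[float], n_datasets: int) -> float:
    """F-distributed Friedman statistic from average ranks (no tie correction).

    Degrees of freedom: (k - 1) and (k - 1) * (n_datasets - 1).
    """
    k = len(avg_ranks)
    chi2 = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


# Average ZOL ranks from Table 6, in the order SKDB ... KIBC.
zol_ranks = [4.6000, 5.7375, 5.1875, 4.7875, 4.5000, 4.6000, 4.1625, 2.4250]
f_zol = friedman_f(zol_ranks, n_datasets=40)
reject_null = f_zol > 2.0432  # critical value of F(7, 273) quoted in the text
```

For the ZOL ranks the statistic is well above the critical value, which agrees with the rejection of the null hypothesis.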

To further explore which classifier is significantly different from the others, we conduct the Nemenyi test [51] and show the results in terms of ZOL, RMSE, bias and variance in Fig. 7. The classifiers are plotted on the left line and their corresponding average ranks are plotted on the right line. If the difference between the average ranks of a pair of classifiers is greater than the Critical Difference (CD) [51], the difference is considered significant. With 8 classifiers and 40 datasets, the CD for $\alpha=$ 0.05 is $3.031\times\sqrt{8\times(8+1)/(6\times 40)}=1.6601$ . A lower position indicates better performance. As shown in Fig. 7a, KIBC attains the lowest average ZOL rank, followed by DWAODE, WAODE-MI, TAODE, SKDB, IWAODE, WATAN and CFWNB. The Nemenyi test differentiates KIBC from SKDB, CFWNB, WATAN, IWAODE, WAODE-MI, TAODE and DWAODE in terms of ZOL. RMSE-wise, as shown in Fig. 7b, KIBC attains the lowest average rank, followed by TAODE, WAODE-MI, IWAODE, DWAODE, SKDB, WATAN and CFWNB. Meanwhile, KIBC also attains the lowest average bias rank, followed by SKDB, DWAODE, WAODE-MI, TAODE, WATAN, IWAODE and CFWNB, as shown in Fig. 7c. CFWNB has poor bias performance since ensemble learning or high-dependency relationships have a significant positive effect on reducing bias. On the other hand, as shown in Fig. 7d, CFWNB attains the lowest average variance rank since its structure is definite, that is, CFWNB is not sensitive to variations in datasets. As expected, IWAODE, WAODE-MI, TAODE and DWAODE obtain lower average variance ranks than WATAN and SKDB due to their definite structures.
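The CD computation and the pairwise comparison against KIBC can be reproduced with a short Python sketch. The function name `critical_difference` is hypothetical; the value 3.031 and the average ZOL ranks are taken from the text and Table 6.

```python
import math
from typing import Dict


def critical_difference(q_alpha: float, k: int, n_datasets: int) -> float:
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N)) for k classifiers, N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# Average ZOL ranks from Table 6.
zol_ranks: Dict[str, float] = {
    "SKDB": 4.6000, "CFWNB": 5.7375, "WATAN": 5.1875, "IWAODE": 4.7875,
    "WAODE-MI": 4.5000, "TAODE": 4.6000, "DWAODE": 4.1625, "KIBC": 2.4250,
}

cd = critical_difference(3.031, k=8, n_datasets=40)  # about 1.6601
# Classifiers whose average rank exceeds that of KIBC by more than CD.
significant = sorted(
    name
    for name, rank in zol_ranks.items()
    if name != "KIBC" and rank - zol_ranks["KIBC"] > cd
)
```

The smallest gap is the one to DWAODE (4.1625 versus 2.4250, a difference of 1.7375), which still exceeds the CD, so all seven competitors are differentiated from KIBC in terms of ZOL.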

5. Conclusions and future work

Making independence assumptions is one of the most direct and promising ways to address the NP-hard problem of learning an optimal BNC. All BNCs except FBC make independence assumptions, explicitly or implicitly. Exploring the reasonableness of these assumptions is one of the key issues in learning robust BNCs from data. We prove theoretically that the information-theoretic metrics applied by high-dependence BNCs (e.g., KDB) are not strictly appropriate for measuring the extent to which the learned joint probability fits the data. We therefore propose to verify the implicit independence assumptions behind the learned network topology, which helps build a robust BNC and improve generalization performance. By aggregating the predictions of KIBC ${}_{\mathcal{T}}$ and KIBC ${}_{\mathcal{P}}$ , which are learned from the labeled training set and the unlabeled testing instance, respectively, the ensemble BNC demonstrates a significant advantage over its competitors in the experiments on 40 UCI datasets. Different testing instances may contain different partial knowledge, so the KIBC ${}_{\mathcal{P}}$ learned from different testing instances should be assigned distinct weights. KIBC, however, currently reaches its final decision by uniformly averaging the probability estimates. Further research is needed to study the differences between the learners, since a nonuniform combination could in theory lead to lower error. In addition, it would be worthwhile to explore the reasonableness of the independence assumptions implied in other BNCs, which is another direction for our future work.

Footnotes

1. http://archive.ics.uci.edu/ml/datasets.php.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2019YFC1804804), Open Research Project of the Hubei Key Laboratory of Intelligent Geo-Information Processing (No. KLIGIP-2021A04), and the Scientific and Technological Developing Scheme of Jilin Province (No. 20200201281JC) and High Performance Computing Center of Jilin University, China.

Appendix A

See Tables A1–A4.

Table A1

Experimental results of ZOL

Dataset	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
lymphography	0.2365	0.1486	0.1689	0.1419	0.1554	0.1554	0.1554	0.1486
iris	0.0867	0.0600	0.0800	0.0867	0.0867	0.0867	0.0733	0.0667
teaching-ae	0.5364	0.5099	0.5364	0.4570	0.4503	0.4636	0.4371	0.4702
wine	0.0225	0.0056	0.0337	0.0169	0.0169	0.0281	0.0225	0.0169
glass-id	0.2196	0.1728	0.2196	0.2196	0.2570	0.2523	0.2103	0.2009
primary-tumor	0.5723	0.5634	0.5428	0.5457	0.5752	0.5782	0.5929	0.5575
ionosphere	0.0912	0.0854	0.0684	0.0712	0.0712	0.0741	0.0655	0.0741
dermatology	0.0628	0.0191	0.0328	0.0191	0.0191	0.0191	0.0164	0.0191
horse-colic	0.2446	0.1576	0.2120	0.2011	0.2011	0.2092	0.2092	0.2011
house-votes-84	0.0552	0.0781	0.0529	0.0483	0.0506	0.0529	0.0483	0.0391
chess	0.0998	0.1379	0.0926	0.1034	0.0944	0.0799	0.0653	0.0907
credit-a	0.1464	0.1333	0.1507	0.1391	0.1362	0.1507	0.1507	0.1420
crx	0.1565	0.1304	0.1478	0.1319	0.1377	0.1391	0.1406	0.1319
vehicle	0.2943	0.3711	0.2943	0.2896	0.2872	0.2766	0.2742	0.2849
anneal	0.0100	0.0534	0.0100	0.0178	0.0089	0.0078	0.0089	0.0078
tic-tac-toe	0.2035	0.3100	0.2265	0.2662	0.2724	0.2630	0.2568	0.1461
vowel	0.1818	0.3050	0.1263	0.1697	0.1949	0.1323	0.1222	0.1788
led	0.2620	0.2630	0.2660	0.2700	0.2680	0.2690	0.2690	0.2560
contraceptive-mc	0.5003	0.4677	0.4895	0.4942	0.4922	0.4902	0.4915	0.4874
mfeat-mor	0.3085	0.3060	0.2980	0.3120	0.3130	0.3105	0.3075	0.3030
segment	0.0459	0.0640	0.0394	0.0333	0.0338	0.0346	0.0325	0.0355
hypothyroid	0.0107	0.0139	0.0104	0.0123	0.0104	0.0111	0.0107	0.0095
kr-vs-kp	0.0416	0.0644	0.0776	0.0826	0.0576	0.0773	0.0726	0.0457
dis	0.0138	0.0156	0.0154	0.0127	0.0143	0.0125	0.0278	0.0122
hypo	0.0114	0.0121	0.0130	0.0114	0.0101	0.0119	0.0148	0.0098
sick	0.0223	0.0259	0.0257	0.0260	0.0244	0.0249	0.0294	0.0228
spambase	0.0641	0.0858	0.0669	0.0646	0.0648	0.0602	0.0585	0.0659
phoneme	0.1916	0.2407	0.2345	0.2104	0.2308	0.2427	0.2444	0.1694
wall-following	0.0315	0.0720	0.0550	0.0464	0.0367	0.0361	0.0372	0.0235
page-blocks	0.0391	0.0416	0.0418	0.0325	0.0347	0.0327	0.0327	0.0303
satellite	0.1085	0.1726	0.1207	0.1117	0.1148	0.1147	0.1125	0.1124
mushrooms	0.0000	0.0080	0.0001	0.0002	0.0000	0.0002	0.0002	0.0000
thyroid	0.0683	0.0817	0.0723	0.0706	0.0655	0.0629	0.0107	0.0593
sign	0.2539	0.3700	0.2752	0.2789	0.2768	0.2743	0.2748	0.2324
magic	0.1637	0.2033	0.1674	0.1744	0.1762	0.1725	0.1721	0.1588
letter-recog	0.0986	0.2479	0.1300	0.0854	0.0853	0.0838	0.0837	0.0925
adult	0.1383	0.1499	0.1380	0.1502	0.1445	0.1558	0.1601	0.1304
shuttle	0.0008	0.0020	0.0014	0.0011	0.0009	0.0008	0.0006	0.0007
connect-4	0.2283	0.2847	0.2354	0.2409	0.2406	0.2374	0.2357	0.2271
localization	0.2964	0.4936	0.3575	0.3593	0.3566	0.3544	0.3721	0.3064

The value in boldface indicates the classifier with the best performance.

Table A2

Experimental results of RMSE

Dataset	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
lymphography	0.3031	0.2419	0.2705	0.2304	0.2496	0.2501	0.2522	0.2419
iris	0.1973	0.1500	0.1958	0.2024	0.2091	0.2077	0.2132	0.1919
teaching-ae	0.4804	0.4619	0.4762	0.4689	0.4668	0.4644	0.4734	0.4728
wine	0.1214	0.0532	0.1416	0.1001	0.0983	0.1021	0.1038	0.1042
glass-id	0.3387	0.2952	0.3315	0.3237	0.3422	0.3409	0.3364	0.3263
primary-tumor	0.1851	0.1790	0.1812	0.1778	0.1855	0.1864	0.1885	0.1810
ionosphere	0.2822	0.2765	0.2613	0.2546	0.2489	0.2464	0.2446	0.2521
dermatology	0.1207	0.0648	0.0850	0.0661	0.0688	0.0698	0.0660	0.0794
horse-colic	0.4348	0.3508	0.4215	0.3990	0.4022	0.4008	0.4020	0.4040
house-votes-84	0.2107	0.2558	0.2181	0.1998	0.1927	0.1968	0.1960	0.1853
chess	0.2615	0.3208	0.2594	0.2835	0.2603	0.2502	0.2309	0.2611
credit-a	0.3480	0.3116	0.3407	0.3271	0.3236	0.3350	0.3389	0.3286
crx	0.3525	0.3142	0.3415	0.3259	0.3219	0.3322	0.3355	0.3205
vehicle	0.3123	0.3611	0.3103	0.3095	0.3099	0.3083	0.3080	0.3058
anneal	0.0519	0.1240	0.0538	0.0699	0.0536	0.0529	0.0513	0.0560
tic-tac-toe	0.3772	0.4334	0.4023	0.3992	0.4085	0.3984	0.3925	0.3320
vowel	0.1583	0.1982	0.1254	0.1463	0.1633	0.1324	0.1297	0.1538
led	0.2007	0.2163	0.1991	0.1973	0.1975	0.1980	0.1996	0.1990
contraceptive-mc	0.4485	0.4305	0.4392	0.4392	0.4385	0.4394	0.4410	0.4405
mfeat-mor	0.1978	0.1943	0.1941	0.1979	0.1983	0.1980	0.1989	0.1951
segment	0.1033	0.1195	0.0968	0.0879	0.0870	0.0881	0.0873	0.0914
hypothyroid	0.0937	0.1065	0.0951	0.0994	0.0967	0.0974	0.0969	0.0878
kr-vs-kp	0.1867	0.2779	0.2358	0.2635	0.2343	0.2561	0.2506	0.1860
dis	0.1024	0.1130	0.1098	0.1058	0.1046	0.1047	0.1466	0.0998
hypo	0.0671	0.0739	0.0723	0.0698	0.0647	0.0685	0.0751	0.0660
sick	0.1382	0.1498	0.1426	0.1547	0.1452	0.1511	0.1571	0.1353
spambase	0.2293	0.2657	0.2402	0.2317	0.2301	0.2239	0.2180	0.2266
phoneme	0.0754	0.0806	0.0844	0.0795	0.0871	0.0891	0.0912	0.0716
wall-following	0.1097	0.1744	0.1570	0.1433	0.1293	0.1298	0.1299	0.1047
page-blocks	0.1128	0.1117	0.1187	0.0986	0.1025	0.1013	0.1021	0.0979
satellite	0.1778	0.2316	0.1849	0.1774	0.1800	0.1799	0.1799	0.1715
mushrooms	0.0001	0.0857	0.0081	0.0114	0.0062	0.0121	0.0129	0.0004
thyroid	0.0731	0.0789	0.0742	0.0734	0.0715	0.0706	0.0701	0.0682
sign	0.3334	0.3929	0.3504	0.3516	0.3519	0.3487	0.3494	0.3263
magic	0.3470	0.3709	0.3461	0.3534	0.3526	0.3519	0.3501	0.3411
letter-recog	0.0768	0.1139	0.0859	0.0693	0.0695	0.0691	0.0691	0.0737
adult	0.3089	0.3150	0.3076	0.3250	0.3197	0.3297	0.3344	0.3007
shuttle	0.0140	0.0270	0.0177	0.0159	0.0131	0.0124	0.0125	0.0135
connect-4	0.3247	0.3632	0.3315	0.3359	0.3356	0.3339	0.3324	0.3259
localization	0.1960	0.2402	0.2095	0.2093	0.2087	0.2081	0.2128	0.1971

The value in boldface indicates the classifier with the best performance.

Table A3

Experimental results of bias

Dataset	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
lymphography	0.1041	0.1647	0.0978	0.0857	0.0951	0.0931	0.0853	0.0959
iris	0.0560	0.0395	0.0618	0.0664	0.0656	0.0592	0.0776	0.0570
teaching-ae	0.4606	0.3989	0.4990	0.4616	0.3984	0.4198	0.3756	0.4504
wine	0.0508	0.0137	0.0531	0.0317	0.0381	0.0376	0.0322	0.0259
glass-id	0.2713	0.1197	0.2748	0.2818	0.2780	0.2785	0.2969	0.2714
primary-tumor	0.4143	0.3417	0.4224	0.4188	0.4247	0.4324	0.4323	0.4212
ionosphere	0.0826	0.0813	0.0823	0.0881	0.0751	0.0764	0.0787	0.0631
dermatology	0.0449	0.0114	0.0263	0.0065	0.0061	0.0057	0.0058	0.0134
horse-colic	0.1689	0.1816	0.1899	0.2007	0.1897	0.1937	0.1911	0.1662
house-votes-84	0.0229	0.0575	0.0393	0.0493	0.0406	0.0429	0.0428	0.0218
chess	0.1119	0.0932	0.1398	0.1397	0.1286	0.1230	0.1143	0.1110
credit-a	0.1137	0.1301	0.1123	0.0893	0.0900	0.0940	0.0940	0.0995
crx	0.1197	0.1332	0.1148	0.0904	0.0953	0.0985	0.0970	0.0991
vehicle	0.2485	0.3016	0.2376	0.2435	0.2398	0.2412	0.2394	0.2425
anneal	0.0071	0.0610	0.0194	0.0181	0.0194	0.0214	0.0185	0.0135
tic-tac-toe	0.1367	0.2257	0.1742	0.1994	0.2104	0.2008	0.1901	0.1103
vowel	0.1755	0.2487	0.1842	0.2249	0.1811	0.1698	0.1592	0.1803
led	0.2317	0.2387	0.2242	0.2327	0.2331	0.2325	0.2327	0.2340
contraceptive-mc	0.3702	0.3759	0.3426	0.3781	0.3766	0.3735	0.3643	0.3497
mfeat-mor	0.2136	0.2455	0.2078	0.2492	0.2464	0.2431	0.2445	0.2166
segment	0.0452	0.0540	0.0489	0.0436	0.0357	0.0353	0.0342	0.0427
hypothyroid	0.0096	0.0133	0.0106	0.0093	0.0099	0.0099	0.0093	0.0085
kr-vs-kp	0.0419	0.0583	0.0700	0.0763	0.0518	0.0688	0.0613	0.0442
dis	0.0191	0.0127	0.0194	0.0168	0.0179	0.0178	0.0173	0.0186
hypo	0.0077	0.0114	0.0119	0.0080	0.0078	0.0079	0.0109	0.0076
sick	0.0198	0.0211	0.0206	0.0220	0.0216	0.0228	0.0257	0.0189
spambase	0.0501	0.0750	0.0567	0.0602	0.0574	0.0541	0.0505	0.0533
phoneme	0.1584	0.2003	0.1982	0.1829	0.2172	0.2186	0.2008	0.1275
wall-following	0.0133	0.0592	0.0482	0.0360	0.0253	0.0260	0.0256	0.0175
page-blocks	0.0280	0.0331	0.0305	0.0257	0.0243	0.0248	0.0251	0.0258
satellite	0.0818	0.1560	0.0945	0.0884	0.0902	0.0897	0.0876	0.0841
mushrooms	0.0000	0.0103	0.0001	0.0004	0.0002	0.0004	0.0004	0.0000
thyroid	0.0531	0.0694	0.0584	0.0648	0.0561	0.0550	0.0533	0.0504
sign	0.2161	0.3435	0.2419	0.2510	0.2461	0.2446	0.2382	0.2060
magic	0.1241	0.1898	0.1252	0.1595	0.1541	0.1546	0.1426	0.1320
letter-recog	0.0806	0.2133	0.1033	0.0877	0.0823	0.0814	0.0792	0.0745
adult	0.1220	0.1461	0.1312	0.1437	0.1387	0.1459	0.1516	0.1217
shuttle	0.0007	0.0024	0.0009	0.0007	0.0006	0.0006	0.0006	0.0006
connect-4	0.2022	0.2740	0.2253	0.2255	0.2237	0.2153	0.2115	0.2042
localization	0.2134	0.4746	0.3105	0.3126	0.3068	0.3010	0.3062	0.2190

The value in boldface indicates the classifier with the best performance.

Table A4

Experimental results of variance

Dataset	SKDB	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	DWAODE	KIBC
lymphography	0.1408	0.0568	0.1084	0.0408	0.0478	0.0498	0.0412	0.0653
iris	0.0400	0.0120	0.0522	0.0436	0.0364	0.0388	0.0364	0.0430
teaching-ae	0.1494	0.1798	0.1770	0.1564	0.1776	0.1622	0.1624	0.1636
wine	0.0644	0.0042	0.0486	0.0141	0.0246	0.0251	0.0153	0.0385
glass-id	0.1189	0.0492	0.1069	0.0999	0.1051	0.1004	0.0946	0.1089
primary-tumor	0.2450	0.2117	0.2413	0.1785	0.1859	0.1880	0.1934	0.2248
ionosphere	0.0584	0.0087	0.0399	0.0238	0.0368	0.0381	0.0332	0.0361
dermatology	0.0674	0.0240	0.0483	0.0189	0.0242	0.0213	0.0188	0.0316
horse-colic	0.1384	0.0203	0.1027	0.0420	0.0464	0.0514	0.0557	0.0682
house-votes-84	0.0157	0.0108	0.0172	0.0079	0.0083	0.0081	0.0089	0.0086
chess	0.0531	0.0507	0.0504	0.0379	0.0364	0.0448	0.0463	0.0420
credit-a	0.0768	0.0205	0.0555	0.0276	0.0321	0.0360	0.0412	0.0418
crx	0.0663	0.0203	0.0500	0.0240	0.0264	0.0310	0.0361	0.0365
vehicle	0.1288	0.0763	0.1294	0.1245	0.1276	0.1273	0.1287	0.1263
anneal	0.0173	0.0273	0.0158	0.0103	0.0161	0.0174	0.0146	0.0142
tic-tac-toe	0.1125	0.0550	0.0819	0.0529	0.0604	0.0528	0.0642	0.0978
vowel	0.2285	0.1437	0.2361	0.2463	0.2310	0.2284	0.2257	0.2237
led	0.0565	0.0502	0.0530	0.0372	0.0398	0.0408	0.0466	0.0483
contraceptive-mc	0.1705	0.1041	0.1641	0.1086	0.1106	0.1238	0.1437	0.1723
mfeat-mor	0.1047	0.0563	0.1020	0.0676	0.0686	0.0730	0.0725	0.0952
segment	0.0386	0.0196	0.0290	0.0204	0.0255	0.0262	0.0248	0.0250
hypothyroid	0.0024	0.0023	0.0029	0.0026	0.0033	0.0030	0.0033	0.0028
kr-vs-kp	0.0112	0.0169	0.0152	0.0185	0.0119	0.0208	0.0209	0.0049
dis	0.0011	0.0050	0.0004	0.0036	0.0021	0.0040	0.0069	0.0010
hypo	0.0069	0.0036	0.0063	0.0068	0.0056	0.0055	0.0089	0.0058
sick	0.0043	0.0026	0.0048	0.0037	0.0057	0.0045	0.0068	0.0037
spambase	0.0218	0.0054	0.0160	0.0094	0.0111	0.0124	0.0136	0.0173
phoneme	0.0773	0.0961	0.1541	0.1270	0.1311	0.1355	0.1356	0.0898
wall-following	0.0247	0.0106	0.0285	0.0283	0.0242	0.0245	0.0246	0.0165
page-blocks	0.0177	0.0070	0.0145	0.0113	0.0130	0.0122	0.0125	0.0121
satellite	0.0479	0.0099	0.0368	0.0325	0.0364	0.0362	0.0345	0.0416
mushrooms	0.0001	0.0001	0.0002	0.0001	0.0001	0.0002	0.0001	0.0000
thyroid	0.0273	0.0157	0.0253	0.0202	0.0239	0.0243	0.0251	0.0237
sign	0.0596	0.0250	0.0385	0.0380	0.0403	0.0406	0.0445	0.0514
magic	0.0491	0.0092	0.0490	0.0291	0.0289	0.0313	0.0359	0.0410
letter-recog	0.0709	0.0498	0.0588	0.0417	0.0455	0.0457	0.0467	0.0637
adult	0.0285	0.0071	0.0165	0.0109	0.0113	0.0174	0.0184	0.0156
shuttle	0.0003	0.0006	0.0004	0.0003	0.0004	0.0004	0.0003	0.0003
connect-4	0.0309	0.0092	0.0149	0.0209	0.0215	0.0301	0.0320	0.0301
localization	0.1099	0.0186	0.0594	0.0577	0.0632	0.0657	0.0836	0.1124

The value in boldface indicates the classifier with the best performance.

References

Friedman

Geiger

and Goldszmidt

, Bayesian network classifiers, Machine Learning 29 (1997), 131–163.

Pearl

, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann Publishers Inc, 1988.

Liu

Cao

Mao

and Tan

, Speech emotion recognition based on feature selection and extreme learning machine decision tree, Neurocomputing 273 (2018), 271–280.

Doran

and Ray

, A theoretical and empirical analysis of support vector machine methods for multiple-instance classification, Machine Learning 97 (2014), 79–102.

Pasa

Navarin

and Sperduti

, Polynomial-based graph convolutional neural networks for graph classification, Machine Learning, 2021.

Zhang

Wang

and Zhang

, Generalized Additive Bayesian Network Classifiers, in: 20th International Joint Conference on Artifical Intelligence, 2007, pp. 913–918.

Jiang

Wang

and Zhang

, Deep feature weighting for naive Bayes and its application to text classification, Engineering Applications of Artificial Intelligence 52 (2016), 26–39.

Liu

Wang

and Mammadov

, Learning semi-lazy Bayesian network classifier under the c.i.i.d assumption, Knowledge-Based Systems 208 (2020), 106422.

Chickering

Heckerman

and Meek

, Large-Sample Learning of Bayesian Networks is NP-Hard, Journal of Machine Learning Research 5 (2004), 1287–1330.

10.

Cai

Zhang

and Hao

, BASSUM: A Bayesian semi-supervised method for classification feature selection, Pattern Recognition 44(4) (2011), 811–820.

11.

Bielza

and Larrañaga

, Discrete Bayesian Network Classifiers: A Survey, ACM Computing Surveys 47 (2014), 1–43.

12.

Liu

Wang

Mammadov

Chen

Wang

and Sun

, Hierarchical Independence Thresholding for learning Bayesian network classifiers, Knowledge-Based Systems 212 (2021), 106627.

13.

Lewis

, Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, in: The 10th European Conference on Machine Learning, 1998, pp. 4–15.

14.

Zaidi

Cerquides

Carman

and Webb

, Alleviating naive bayes attribute independence assumption by attribute weighting, Journal of Machine Learning Research 14(1) (2013), 1947–1988.

15.

Jiang

Zhang

and Wu

, A correlation-based feature weighting filter for naive bayes, IEEE Transactions on Knowledge and Data Engineering 31(2) (2019), 201–213.

16.

Hall

, Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning, in: The Seventeenth International Conference on Machine Learning, 2000, pp. 359–366.

17.

Tang

Kay

and He

, Toward optimal feature selection in naive bayes for text categorization, IEEE Transactions on Knowledge and Data Engineering 28(9) (2016), 2508–2521.

18.

Jiang

and Wang

, Cost-sensitive Bayesian network classifiers, Pattern Recognition Letters 45 (2014), 211–216.

19.

Jiang

and Yu

, An attribute value frequency-based instance weighting filter for naive Bayes, Journal of Experimental & Theoretical Artificial Intelligence 31(2) (2019), 225–236.

20.

Frank

Hall

and Pfahringer

, Locally Weighted Naive Bayes, in: The Nineteenth Conference on Uncertainty in Artificial Intelligence, 2002, pp. 249–356.

21.

Wang

Jiang

and Li

, Adapting naive Bayes tree for text classification, Knowledge and Information Systems 44 (2015), 77–89.

22.

Jiang

Wang

and Zhang

, Structure extended multinomial naive Bayes, Information Sciences 329 (2016), 346–356.

23.

Kong

Shi

Wang

Liu

Mammadov

and Wang

, Averaged tree-augmented one-dependence estimators, Applied Intelligence 51 (2021), 4270–4286.

24.

Sahami

, Learning Limited Dependence Bayesian Classifiers, in: The Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 335–338.

25.

Friedman

and Goldszmidt

, Building Classifiers Using Bayesian Networks, in: The Thirteenth National Conference on Artificial Intelligence, 1996, pp. 1277–1284.

26.

and Zhang

, Full Bayesian Network Classifiers, in: The 23rd International Conference on Machine Learning, 2006, pp. 897–904.

27.

Zhang

Jiang

and Yu

, Attribute and instance weighted naive Bayes, Pattern Recognition 111 (2021), 107674.

28.

Jiang

Zhang

and Wang

, Class-specific attribute weighted naive Bayes, Pattern Recognition 88 (2019), 321–330.

29. Pan, Zhu, Cai, Zhang and Zhang, Self-adaptive attribute weighting for Naive Bayes classification, Expert Systems with Applications 42(3) (2015), 1487–1502.

30. Webb, Boughton and Wang, Not So Naive Bayes: Aggregating one-dependence estimators, Machine Learning 58 (2005), 5–24.

31. Duan, Wang, Chen and Sun, Instance-based weighting filter for superparent one-dependence estimators, Knowledge-Based Systems 203 (2020), 106085.

32. Wang, Chen, Liu and Sun, Self-adaptive attribute value weighting for averaged one-dependence estimators, IEEE Access 8 (2020), 27887–27900.

33. Jiang, Zhang, Cai and Wang, Weighted average of one-dependence estimators, Journal of Experimental & Theoretical Artificial Intelligence 24(2) (2012), 219–230.

34. Chow and Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory 14(3) (1968), 462–467.

35. Jiang, Zhang and Cai, A Novel Bayes Model: Hidden Naive Bayes, IEEE Transactions on Knowledge and Data Engineering 21(10) (2009), 1361–1371.

36. Jiang, Cai, Wang and Zhang, Improving Tree augmented Naive Bayes for class probability estimation, Knowledge-Based Systems 26 (2012), 239–245.

37. Wang, Zhang and Zhang, Semi-supervised learning for k-dependence Bayesian classifiers, Applied Intelligence 52 (2022), 3604–3622.

38. Martínez, Webb, Chen and Zaidi, Scalable Learning of Bayesian Network Classifiers, Journal of Machine Learning Research 17(1) (2016), 1515–1549.

39. Cestnik, Estimating probabilities: A crucial task in machine learning, in: The 9th European Conference on Artificial Intelligence, 1990, pp. 147–149.

40. Jabbari, Visweswaran and Cooper, Instance-Specific Bayesian Network Structure Learning, in: The Ninth International Conference on Probabilistic Graphical Models, 2018, pp. 169–180.

41. Wang, Chen and Mammadov, Target learning: A novel framework to mine significant dependencies for unlabeled data, in: Twenty-second Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018, pp. 106–117.

42. Dietterich, Ensemble learning, The Handbook of Brain Theory and Neural Networks 2(1) (2002), 110–125.

43. Krogh and Vedelsby, Neural Network Ensembles, Cross Validation and Active Learning, in: The 7th International Conference on Neural Information Processing Systems, 1994, pp. 231–238.

44. Fayyad and Irani, Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, in: The 13th International Joint Conference on Artificial Intelligence, 1993, pp. 1022–1029.

45. Wang, Xie, Pang and Wei, Alleviating the attribute conditional independence and I.I.D. assumptions of averaged one-dependence estimator by double weighting, Knowledge-Based Systems 250 (2022), 109078.

46. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

47. Domingos and Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning 29 (1997), 103–130.

48. Hyndman and Koehler, Another look at measures of forecast accuracy, International Journal of Forecasting 22(4) (2006), 679–688.

49. Kohavi and Wolpert, Bias plus Variance Decomposition for Zero-One Loss Functions, in: The Thirteenth International Conference on Machine Learning, 1996, pp. 275–283.

50. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937), 675–701.

51. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

