A novel approach to fully representing the diversity in conditional dependencies for learning Bayesian network classifier

Abstract

Bayesian network classifiers (BNCs) have proved their effectiveness and efficiency in the supervised learning framework. Numerous variations of the conditional independence assumption have been proposed to address the NP-hard problem of learning the structure of a BNC. However, researchers focus on identifying conditional dependence rather than conditional independence, and information-theoretic criteria cannot identify the diversity in conditional (in)dependencies for different instances. In this paper, the maximum correlation criterion and minimum dependence criterion are introduced to sort attributes and identify conditional independencies, respectively. A heuristic search strategy is applied to find a possible global solution that achieves a trade-off between significant dependency relationships and the independence assumption. Our extensive experimental evaluation on widely used benchmark data sets reveals that the proposed algorithm achieves competitive classification performance compared to state-of-the-art single model learners (e.g., TAN, KDB, KNN and SVM) and ensemble learners (e.g., ATAN and AODE).

Keywords

Bayesian network classifier; maximum correlation criterion; minimum dependence criterion; conditional independence; conditional mutual information

1. Introduction

Machine learning, which has attracted widespread attention in recent years [1, 2, 3], is roughly divided into supervised learning and unsupervised learning. Nowadays, numerous supervised classifiers are widely used in the real world [4, 5, 6, 7, 8, 9], among which Bayesian network classifiers (BNCs) are a popular approach due to their model interpretability and competitive classification performance [10, 11]. Formally, the topology of a BNC learned from training data is a directed acyclic graph in which vertices correspond to the $n$ predictive attributes $\{X_{1},\ldots,X_{n}\}$ and the class variable $Y$, and edges represent direct dependencies between the attributes. BNCs estimate the probability distribution $P(x_{1},\ldots,x_{n},y)$¹ with a factorization according to the network topology [12]. Classification is done by applying Bayes rule to assign to an unlabeled instance $\textbf{x}=(x_{1},\ldots,x_{n})$ the class label $y$ with the highest posterior probability.

¹ Capital letters such as $X_{i}$ and $Y$ denote attributes or variables, and lower-case letters such as $x_{i}$ and $y$ denote specific values taken by those attributes.

For a restricted BNC $\mathcal{B}$, the class variable is the root node, so $P(\textbf{x},y)$ can be decomposed as shown in Eq. (1).

$\displaystyle P(\textbf{x},y)=P(y)\prod^{n}_{i=1}P(x_{i}|y,\pi_{i}^{\mathcal{B}}).$ (1)

Each factor in Eq. (1), i.e., $P(x_{i}|y,\pi_{i}^{\mathcal{B}})$, is a categorical distribution, where $\Pi_{i}^{\mathcal{B}}$ is the set of parents of $X_{i}$ in $\mathcal{B}$ and $\pi_{i}^{\mathcal{B}}$ denotes the values they take. During the past decades, many information-theoretic criteria, e.g., conditional mutual information (CMI), joint mutual information [13], information fragments [14] and others [15, 16], have been proposed during the rapid development of BNCs. However, information theory originally studies the quantification, storage, and communication of information, not Bayesian networks. The probability distributions encoded in the network topology of a BNC quantitatively describe the dependency relationships among attribute values. For different situations or instances these relationships may differ greatly, yet the information-theoretic criteria cannot identify the difference.

In practice, most combinations of attribute values are either not represented in the training data or not present in sufficient numbers [17]. The network topology of a BNC learned from labeled training data under the supervised learning framework may therefore be unreliable and fail to generalize to unlabeled instances. To address this issue, researchers have proposed applying the semi-supervised learning framework, which uses unlabeled testing instances in conjunction with labeled data for training. Existing semi-supervised approaches can be roughly grouped into four categories: generative models [18], semi-supervised SVMs [19], graph-based semi-supervised methods [20, 21] and disagreement-based methods [22]. Some approaches, e.g., co-training and self-training, are universal and can work with any classifier. Semi-supervised learning methods generally use unlabeled data to either modify or re-prioritize hypotheses obtained from labeled data alone. To achieve this goal, the unlabeled instance must first be pre-assigned a class label. Obviously, if the label is wrong, using such an instance to re-train the classifiers learned from labeled training data will result in “noise propagation”, and the negative effect may lead to biased decision boundaries. Moreover, the dependencies that exist in different instances may differ greatly. It is impossible for one single classifier to describe all “right” dependencies between attributes when they take specific values.

In this paper, we establish the relationship between the entropy function $H_{\mathcal{B}}$ and the probabilistic topology $\mathcal{B}$, and then prove that the information-theoretic criterion $I(X_{i};\Pi_{i}|Y)$ can fully describe the significant conditional dependencies among attributes. However, the implicit independence assumption underlying the learned topology may be violated in practice. The maximum correlation criterion and minimum dependence criterion are introduced to sort attributes and identify conditional independencies, respectively. To achieve a trade-off between significant dependency relationships and the independence assumption, a heuristic search technique is applied to build two independent BNCs that respectively “best describe” the probability distributions of the labeled training data and of a single unlabeled instance.

This paper is organized as follows. In Section 2 we clarify the difference between NB and its variations in terms of log likelihood, and then review some state-of-the-art BNCs. In Section 3 we introduce the definitions of the maximum correlation criterion and minimum dependence criterion, and then describe the basic idea of ensemble learning of two independent BNCs that respectively model the training data and one single testing instance. In Section 4 we describe the experimental setup in detail and compare the results of our proposed algorithm with those of other BNCs (including single model classifiers and ensemble classifiers). We conclude and outline future work in Section 5.

2. Background theory and related research work

Given topology $\mathcal{B}$ encoding the joint probability distribution $P_{\mathcal{B}}(\textbf{x},y)$ from a given training set $D$ , corresponding BNC returns the label $y^{*}$ that maximizes the posterior probability $P_{\mathcal{B}}(y|\textbf{x})$ by applying the following classification rule,

$\displaystyle y^{*}=\mathop{\arg\max}_{Y}P_{\mathcal{B}}(y|\textbf{x})=\mathop{\arg\max}_{Y}\frac{P_{\mathcal{B}}(y,\textbf{x})}{P_{\mathcal{B}}(\textbf{x})}.$ (2)
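As a minimal sketch of Eqs (1) and (2), the following Python snippet scores each class label with the factorized joint probability and returns the arg max. The toy network (two binary attributes, with $X_{2}$ depending on $X_{1}$), the probability tables and the function names `joint` and `classify` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical restricted BNC: Y -> X1, Y -> X2, X1 -> X2 (all variables binary).
P_Y = {0: 0.6, 1: 0.4}
# P(x1 | y), keyed by (x1, y)
P_X1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
# P(x2 | y, x1), keyed by (x2, y, x1)
P_X2 = {
    (0, 0, 0): 0.9, (1, 0, 0): 0.1,
    (0, 0, 1): 0.4, (1, 0, 1): 0.6,
    (0, 1, 0): 0.5, (1, 1, 0): 0.5,
    (0, 1, 1): 0.1, (1, 1, 1): 0.9,
}


def joint(x1, x2, y):
    """P(x, y) = P(y) * P(x1 | y) * P(x2 | y, x1), following Eq. (1)."""
    return P_Y[y] * P_X1[(x1, y)] * P_X2[(x2, y, x1)]


def classify(x1, x2):
    """Return the label that maximizes the joint, and hence the posterior."""
    return max(P_Y, key=lambda y: joint(x1, x2, y))
```

Because the denominator $P_{\mathcal{B}}(\textbf{x})$ is the same for every label, the sketch never computes it.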

Figure 1.

Example of (a) full Bayesian network classifier, (b) Naive Bayes, (c) Tree augmented naive Bayes and (d) $k$ -dependence Bayesian classifier.

Because $P_{\mathcal{B}}(\textbf{x})$ does not depend on $y$, it can be regarded as a normalization constant $\alpha$ that is irrelevant to classification, so $P_{\mathcal{B}}(y|\textbf{x})=P_{\mathcal{B}}(y,\textbf{x})/\alpha$ and $y^{*}=\mathop{\arg\max}_{Y}P_{\mathcal{B}}(y,\textbf{x})$. For a specific instance $d_{i}=(y,\textbf{x})$, the probability $P(d_{i})$ can be estimated by counting the occurrences of $d_{i}$ in the training set, and $P_{\mathcal{B}}(d_{i})$ can be computed from Eq. (1). Given data set $D$ with $N$ instances, $-P(d_{i})\log P_{\mathcal{B}}(d_{i})$ measures the number of bits needed to describe $d_{i}$ based on topology $\mathcal{B}$ [12]. Correspondingly, the average number of bits encoded in each instance can be computed by the entropy function $H_{\mathcal{B}}$ as follows,

$\displaystyle H_{\mathcal{B}}=-\sum_{Y,\textbf{X}}P(y,\textbf{x})\log P_{\mathcal{B}}(y,\textbf{x})=-\sum_{Y}P(y)\log P(y)-\sum_{i=1}^{n}\sum_{Y,X_{i},\Pi_{i}^{\mathcal{B}}}P(y,x_{i},\pi_{i}^{\mathcal{B}})\log P(x_{i}|y,\pi_{i}^{\mathcal{B}})=H(Y)+\sum_{i=1}^{n}H(X_{i}|Y,\Pi_{i}^{\mathcal{B}}).$ (3)

As shown in Fig. 1a, suppose that the attribute order is $\{X_{1},\cdots,X_{n}\}$; for the full BNC the parent set of $X_{i}$ is $\Pi_{i}=\{X_{1},\cdots,X_{i-1}\}$, and learning a full BNC is an NP-hard problem. As shown in Fig. 1b, the network topology of Naive Bayes (NB) [23] is the simplest among all BNCs due to its assumption that all attributes are independent of each other given the class variable. This independence assumption can be described as

$\displaystyle P_{{\rm NB}}(x_{1},\cdots,x_{n}|y)=\prod^{n}_{i=1}P(x_{i}|y).$ (4)

For NB, Eq. (3) becomes

$\displaystyle H_{{\rm NB}}=-\sum_{Y,\textbf{X}}P(y,\textbf{x})\log{P_{{\rm NB}}(y,\textbf{x})}=-\sum_{Y,\textbf{X}}P(y,\textbf{x})\log\left\{P(y)\prod^{n}_{i=1}P(x_{i}|y)\right\}=H(Y)+\sum^{n}_{i=1}H(X_{i}|Y).$ (5)

Comparing Eqs (3) and (5), the difference in the number of bits between NB and $\mathcal{B}$ is

$\displaystyle H_{{\rm NB}}-H_{\mathcal{B}}=\sum^{n}_{i=1}H(X_{i}|Y)-\sum_{i=1}^{n}H(X_{i}|Y,\Pi_{i}^{\mathcal{B}})=\sum^{n}_{i=1}I(X_{i};\Pi_{i}^{\mathcal{B}}|Y).$ (6)
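The identity in Eq. (6) can be checked numerically. The sketch below draws an arbitrary joint distribution over three binary variables $(Y,X_{1},X_{2})$, takes a TAN-like topology in which $X_{1}$ is the only attribute parent of $X_{2}$, and compares $H_{\rm NB}-H_{\mathcal{B}}$ with $I(X_{2};X_{1}|Y)$ computed from its definition. The random distribution, the base-2 logarithm and all function names are assumptions made for illustration.

```python
import itertools
import math
import random

random.seed(0)
# Hypothetical joint distribution P(y, x1, x2) over binary variables.
states = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in states]
P = {s: w / sum(weights) for s, w in zip(states, weights)}


def marginal(keep):
    """Marginalize P onto the positions in `keep` (0 = Y, 1 = X1, 2 = X2)."""
    out = {}
    for s, p in P.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_entropy(target, given):
    """H(target | given) in bits; an empty `given` yields the plain entropy."""
    joint, cond = marginal(given + [target]), marginal(given)
    return -sum(p * math.log2(p / cond[k[:-1]]) for k, p in joint.items())


def cmi_x2_x1_given_y():
    """I(X2; X1 | Y) computed directly from the definition of CMI."""
    pyab, py = marginal([0, 1, 2]), marginal([0])
    pya, pyb = marginal([0, 1]), marginal([0, 2])
    return sum(
        p * math.log2(p * py[(y,)] / (pya[(y, a)] * pyb[(y, b)]))
        for (y, a, b), p in pyab.items()
    )


# NB: no attribute parents. Topology B: X1 is the attribute parent of X2.
h_nb = cond_entropy(0, []) + cond_entropy(1, [0]) + cond_entropy(2, [0])
h_b = cond_entropy(0, []) + cond_entropy(1, [0]) + cond_entropy(2, [0, 1])
gap = h_nb - h_b  # should match I(X2; X1 | Y) up to rounding error
```

Only the term for $X_{2}$ differs between the two entropies, so the gap reduces to a single conditional mutual information.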

NB has been demonstrated to be a competitive alternative to more complex classifiers, especially when the data quantity is small. However, the independence assumption is often violated in practice, and as a result its probability estimates are often suboptimal. From Eq. (6), strong conditional dependency relationships between each attribute $X_{i}$ and its parents reduce the number of bits needed and thus help describe the training data more fully. A large literature addresses this issue by allowing additional edges between attributes to capture correlations among them. Allowing at most one attribute parent for each attribute, TAN (see Fig. 1c) [24] takes the topology of NB as its framework and builds a maximum weighted spanning tree to describe the interdependencies among the attributes. Similarly, by allowing more parents for each attribute it is feasible to describe high-dependence relationships in the tree-shaped topology.

KDB (see Fig. 1d), which is proposed by Sahami [25], allows each attribute to be conditionally dependent on at most $k$ other attributes besides the class variable. There have also been some important refinements that optimize the topology of KDB and improve its performance. By applying forward search or backward search, subset of the attributes will be selected and applied to simplify the topology of KDB [26]. If all of the attributes are considered necessary for prediction, then removing redundant dependency relationships will help to get a $K$ -graph whose dependency complexity is less than $k$ [27]. In addition, Bouckaert [28] proposed to average the weights of all possible network structures (including these lower-order ones) with the fixed value of $k$ . Rubio et al. [29] presented a variant of KDB that employed hill-climbing search to achieve the optimal solution.

A heuristic search strategy is commonly applied due to its efficiency in reducing the search space, but it may return solutions that are locally but not globally optimal. That is, the network topology learned from $D$ may correspond to the maximum of the joint probability for instance $A$, whereas this need not be the case for instance $B$. Conditional mutual information (CMI) is commonly applied to measure conditional dependence, and is defined as follows,

$\displaystyle I(X_{i};X_{j}|Y)=\sum_{x_{i}\in X_{i}}\sum_{x_{j}\in X_{j}}\sum_{y\in Y}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}=\sum_{x_{i}\in X_{i}}\sum_{x_{j}\in X_{j}}\sum_{y\in Y}I(x_{i};x_{j}|y).$ (7)

Table 1

Data sets

No.	Data set	Instance	Attribute	Class	No.	Data set	Instance	Attribute	Class
1	Contact-lenses	24	4	3	21	Pima-ind-diabetes	768	8	2
2	Zoo	101	16	7	22	Tic-tac-toe	958	9	2
3	Lymphography	148	18	4	23	Contraceptive-mc	1473	9	3
4	Teaching-ae	151	5	3	24	Car	1728	6	4
5	Wine	178	13	3	25	Mfeat-mor	2000	6	10
6	Autos	205	25	7	26	Segment	2310	19	7
7	Glass-id	214	9	3	27	Hypothyroid	3163	25	2
8	Audio	226	69	24	28	Kr-vs-kp	3196	36	2
9	Heart	270	13	2	29	Hypo	3772	29	4
10	Hungarian	294	13	2	30	Sick	3772	29	2
11	Heart-disease-c	303	13	2	31	Phoneme	5438	7	50
12	Primary-tumor	339	17	22	32	Wall-following	5456	24	4
13	Ionosphere	351	34	2	33	Page-blocks	5473	10	5
14	Dermatology	366	34	6	34	Mushrooms	8124	22	2
15	Horse-colic	368	21	2	35	Sign	12546	8	3
16	House-votes-84	435	16	2	36	Nursery	12960	8	5
17	Cylinder-bands	540	39	2	37	Magic	19020	10	2
18	Balance-scale	625	4	3	38	Shuttle	58000	9	7
19	Credit-a	690	15	2	39	Waveform	100000	21	3
20	Breast-cancer-w	699	9	2	40	Localization	164860	5	11

Figure 2.

The value distributions of $I(x_{2};x_{8}|y)$ and $I(x_{5};x_{6}|y)$ .

$I(X_{i};X_{j}|Y)$ can be regarded as the summation of $I(x_{i};x_{j}|y)$ over all possible combinations of attribute values of $\{X_{i},X_{j}\}$ and class labels. Taking the data set Page-blocks (see details in Table 1) as an example, $I(X_{2};X_{8}|Y)$ achieves the maximum of CMI (1.4475) and $I(X_{5};X_{6}|Y)$ the minimum (0.0461). There are respectively 79 and 86 combinations of values for $\{X_{2},X_{8},Y\}$ and $\{X_{5},X_{6},Y\}$, and the value distributions of $I(x_{2};x_{8}|y)$ and $I(x_{5};x_{6}|y)$ are shown in Fig. 2. As can be seen from Fig. 2, although the value of $I(X_{2};X_{8}|Y)$ is significantly larger than that of $I(X_{5};X_{6}|Y)$, the value of $I(x_{2};x_{8}|y)$ is sometimes smaller than that of $I(x_{5};x_{6}|y)$. That is, the conditional dependency relationship between $X_{2}$ and $X_{8}$ is generally stronger than that between $X_{5}$ and $X_{6}$, but this conclusion does not always hold.
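The decomposition of CMI into per-combination terms can be sketched as follows. The snippet estimates each term $I(x_{i};x_{j}|y)$ of Eq. (7) from raw frequency counts, without smoothing, and sums them; the function names and the base-2 logarithm are assumptions for illustration. Individual terms may be negative even though their sum is not, which is one way the per-value picture can differ from the attribute-level one.

```python
import math
from collections import Counter


def pointwise_cmi_terms(samples):
    """Map each observed (xi, xj, y) to P(xi,xj,y) * log2[P(xi,xj|y) / (P(xi|y) P(xj|y))]."""
    n = len(samples)
    c_ijy = Counter(samples)
    c_iy = Counter((xi, y) for xi, _, y in samples)
    c_jy = Counter((xj, y) for _, xj, y in samples)
    c_y = Counter(y for _, _, y in samples)
    terms = {}
    for (xi, xj, y), c in c_ijy.items():
        # The ratio of conditional probabilities, written in terms of counts.
        ratio = (c * c_y[y]) / (c_iy[(xi, y)] * c_jy[(xj, y)])
        terms[(xi, xj, y)] = (c / n) * math.log2(ratio)
    return terms


def cmi(samples):
    """I(Xi; Xj | Y) as the sum of its pointwise terms, following Eq. (7)."""
    return sum(pointwise_cmi_terms(samples).values())
```

A call such as `pointwise_cmi_terms([(0, 0, "a"), (1, 1, "a")])` takes a list of hypothetical `(x_i, x_j, y)` triples and returns one term per observed combination.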

To represent such diversity, an ensemble classifier seems to be a feasible solution. The differences among individual classifiers allow the ensemble to adapt to different instances, and most of the time lead the final ensemble classifier to perform as well as or better than the best individual classifier for a domain. ATAN [30] was proposed to improve the estimate of class probability in terms of conditional log likelihood. Each attribute is taken as the root node in turn, resulting in a set of similar topologies with different directed arcs. WATAN [30] further improves ATAN by using a non-uniformly weighted average of the probability estimates, i.e., WATAN takes the mutual information between the root attribute and the class variable as the aggregation weight. To relax the independence assumption of NB while attaining the efficiency and efficacy of 1-dependence classifiers, AODE [31] utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of these ODEs by using uniform rather than non-uniform weights. Further, subsumption resolution (SR) was proposed to optimize AODE by identifying pairs of attribute values such that one is a generalization of the other and deleting the generalization [32].

3. $K$-dependence spanning tree

3.1 Learn general BNC from training data

As shown in Fig. 1a, for full BNC there exists directed arc $X_{i}\rightarrow X_{j}$ between attributes $X_{i}$ and $X_{j}$ when $i<j$ . That is, $X_{j}$ is dependent on $X_{i}$ and not vice versa. To fully represent significant conditional dependencies, the attribute that has close relationship with other attributes should get higher rank in the order. Thus maximum correlation criterion $\mathcal{C}(i)$ is introduced to sort attributes as follows,

$\displaystyle\mathcal{C}(i)=\sum_{j=1,j\neq i}^{n}I(X_{i};X_{j}|Y).$ (8)
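The ordering implied by Eq. (8) can be sketched in a few lines of Python. The CMI values below are made up for illustration, attribute indices are zero-based, and the function name `max_correlation_order` is an assumption.

```python
def max_correlation_order(cmi_matrix):
    """Return attribute indices sorted by descending C(i) = sum_{j != i} I(Xi; Xj | Y)."""
    n = len(cmi_matrix)
    scores = [sum(cmi_matrix[i][j] for j in range(n) if j != i) for i in range(n)]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical symmetric CMI values I(Xi; Xj | Y) for four attributes.
cmi_matrix = [
    [0.00, 0.30, 0.10, 0.05],
    [0.30, 0.00, 0.40, 0.20],
    [0.10, 0.40, 0.00, 0.02],
    [0.05, 0.20, 0.02, 0.00],
]
order, scores = max_correlation_order(cmi_matrix)
```

With these values the second attribute has the largest total correlation with the others and therefore comes first in the order.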

After sorting attributes in descending order of $\mathcal{C}(i)$, the basic framework of the proposed $k$-dependence Spanning Tree (KST) is a directed full BNC on $\{X_{1},\ldots,X_{n}\}$. $\Pi_{i}$ contains $i-1$ attributes as candidate parents for $X_{i}$; in particular, $\Pi_{1}$ is empty for the root attribute. A $k$-dependence tree network can be built by selecting significant parents $\Pi_{i}^{k}$ from these candidates for each attribute. For an arbitrary $k$-dependence BNC, when $i\leqslant k+1$, $\Pi_{i}^{k}=\Pi_{i}$ and $|\Pi_{i}^{k}|=i-1$; when $i>k+1$, $\Pi_{i}^{k}\subset\Pi_{i}$ and $|\Pi_{i}^{k}|=k$. Thus in practice, we will use $P(x_{i}|\Pi_{i}^{k},y)$ rather than $P(x_{i}|\Pi_{i},y)$ to estimate the conditional probability of attribute $X_{i}$. By the definition of conditional probability,

$\displaystyle P(x_{i}|\Pi_{i},y)=\frac{P(x_{i},\Pi_{i},y)}{P(\Pi_{i},y)}=\frac{P(x_{i},\Pi_{i}-\Pi_{i}^{k},\Pi_{i}^{k},y)}{P(\Pi_{i}-\Pi_{i}^{k},\Pi_{i}^{k},y)}=\frac{P(x_{i},\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y)}{P(\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y)}.$ (9)

If $P(x_{i}|\Pi_{i}^{k},y)=P(x_{i}|\Pi_{i},y)$ holds, then from Eq. (9) we will have

$\displaystyle P(x_{i}|\Pi_{i}^{k},y)=P(x_{i}|\Pi_{i},y)\Rightarrow P(x_{i}|\Pi_{i}^{k},y)=\frac{P(x_{i},\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y)}{P(\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y)}\Rightarrow P(x_{i}|\Pi_{i}^{k},y)P(\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y)=P(x_{i},\Pi_{i}-\Pi_{i}^{k}|\Pi_{i}^{k},y).$ (10)

Thus when $P(x_{i}|\Pi_{i}^{k},y)\approx P(x_{i}|\Pi_{i},y)$, $X_{i}$ should be approximately conditionally independent of each $X_{j}$ ($X_{j}\in\Pi_{i}-\Pi_{i}^{k}$) given $\{\Pi_{i}^{k},Y\}$. The minimum dependence criterion $\mathcal{D}(\Pi_{i}^{k},\Pi_{i})$ is introduced to evaluate the extent to which $P(x_{i}|\Pi_{i}^{k},y)\approx P(x_{i}|\Pi_{i},y)$ holds, where

$\displaystyle\mathcal{D}(\Pi_{i}^{k},\Pi_{i})=\sum_{X_{p}\in\{\Pi_{i}^{k}\cup Y\}}\sum_{X_{j}\in\{\Pi_{i}-\Pi_{i}^{k}\}}I(X_{i};X_{j}|X_{p}).$ (11)

We then sort the candidate parents of $X_{i}$ in ascending order of $\mathcal{D}(\Pi_{i}^{k},\Pi_{i})$ and select $\Pi_{i}^{k}$ as the parents of $X_{i}$ . The learning procedure of the heuristic search strategy proposed will be divided into two parts: given data set $\mathcal{T}$ and predictive attributes $\{X_{1},\cdots,X_{n}\}$ , we first sort the attributes by comparing $\mathcal{C}(i)$ . Then for attribute $X_{i}$ , if $i\leqslant k+1$ , its parents will be the set of $\{X_{1},\cdots,X_{i-1}\}$ ; otherwise, its parents will be selected by comparing $\mathcal{D}(\Pi_{i}^{k},\Pi_{i})$ (Algorithm 3.1). The resulting KST ${}_{\mathcal{T}}$ algorithm is presented in Algorithm 2.

Algorithm 3.1: SelectParents($\mathcal{L}$, $X_{i}$)
Input: Ordered attributes list $\mathcal{L}=\{X_{1},\cdots,X_{n}\}$, attribute $X_{i}$.
Output: The parent set $\Pi_{i}^{k}$ of attribute $X_{i}$.
1. Let $\Pi_{i}=\{X_{1},\cdots,X_{i-1}\}$, $\Pi_{i}^{k}=$ Ø.
2. if $i>k+1$ then
3.     while $|\Pi_{i}^{k}|<k$ do
4.         Select $X_{p}=\arg\min\>\mathcal{D}(\Pi_{i}^{k}\cup X_{j},\Pi_{i}/X_{j})$, where $X_{j}\in\Pi_{i}$.
5.         Let $\Pi_{i}^{k}=\Pi_{i}^{k}\cup X_{p}$, $\Pi_{i}=\Pi_{i}/X_{p}$.
6.     end while
7. else
8.     Let $\Pi_{i}^{k}=\Pi_{i}$.
9. end if
10. return the parent set $\Pi_{i}^{k}$.

Algorithm 2: The KST ${}_{\mathcal{T}}$ algorithm.
Input: Training set $\mathcal{T}$, list of attributes $\mathcal{X}=\{X_{1},\cdots,X_{n}\}$.
Output: The topology of KST ${}_{\mathcal{T}}$.
1. Initialize the Bayesian network, $BN$, with a single class node, $Y$.
2. Compute the conditional probabilities from $\mathcal{T}$.
3. Calculate $I(X_{i};X_{j}|Y)$ from $\mathcal{T}$ for each pair of attributes $(i\neq j)$.
4. Let $\mathcal{L}$ be a list of all $X_{i}$ in descending order of $\mathcal{C}(i)$.
5. for each $X_{i}$ in $\mathcal{L}$ do
6.     $\Pi_{i}^{k}\leftarrow\textbf{SelectParents}$($\mathcal{L}$, $X_{i}$). // Algorithm 3.1
7.     Add a node to $BN$ representing $X_{i}$.
8.     Add arc from class node $Y$ to $X_{i}$.
9.     Add arcs from attributes in $\Pi_{i}^{k}$ to $X_{i}$.
10. end for
11. return $BN$.
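One way to read the two procedures above is the following Python sketch of greedy parent selection under the minimum dependence criterion of Eq. (11). It is a sketch under stated assumptions, not the authors' implementation: `cmi3(i, j, p)` is a hypothetical callable returning $I(X_{i};X_{j}|X_{p})$, with `p == "Y"` standing for the class variable; attributes and positions are zero-based; the class node is treated as an implicit parent of every attribute; and the toy dependence values are invented.

```python
def min_dependence(i, selected, rest, cmi3):
    """D of Eq. (11): sum of I(X_i; X_j | X_p) over X_p in selected + class
    and over the left-out candidates X_j in rest."""
    return sum(cmi3(i, j, p) for p in list(selected) + ["Y"] for j in rest)


def select_parents(order, pos, k, cmi3):
    """Greedy parent selection for the attribute at position `pos` of `order`."""
    i, candidates = order[pos], list(order[:pos])
    if len(candidates) <= k:  # corresponds to i <= k + 1 in one-based notation
        return candidates
    parents = []
    while len(parents) < k:
        # Pick the candidate whose inclusion leaves the least residual dependence.
        best = min(
            candidates,
            key=lambda j: min_dependence(
                i, parents + [j], [c for c in candidates if c != j], cmi3
            ),
        )
        parents.append(best)
        candidates.remove(best)
    return parents


def learn_kst_structure(order, k, cmi3):
    """Return {attribute: attribute parents} for an already sorted attribute list."""
    return {order[pos]: select_parents(order, pos, k, cmi3) for pos in range(len(order))}


def toy_cmi3(i, j, p):
    """Hypothetical values: conditioning on attribute 1 removes all dependence."""
    return 0.0 if p == 1 else 0.1 * abs(i - j)


structure = learn_kst_structure([0, 1, 2, 3], k=2, cmi3=toy_cmi3)
```

In the toy run only the last attribute has more than $k$ candidates, so it is the only one for which the greedy loop is exercised.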

3.2 Learn local BNC from testing instance

A general BNC learned from training data makes the unwarranted assumption that the relationship between $X_{i}$ and its parents does not change when they take different values. This may result in biased estimates of intermediate probability values and unreliable classification results. To illustrate this, consider a hypothetical instance with attribute Gender and its parent Pregnant. When Gender and Pregnant take specific values, the confidence level of the dependency relationship may vary greatly. For example, Pregnant $=$ “yes” $\rightarrow$ Gender $=$ “female” always holds, whereas Pregnant $=$ “no” $\rightarrow$ Gender $=$ “female” does not necessarily hold.

Ideally the learned BNC can take an unlabeled instance $u$ as the target and represent the conditional dependencies between the attribute values in $u$. To achieve this goal we would need to know the class label of $u$ first, which defeats the purpose of classification. Since the class label of $u$ may take any one of the $m$ possible values of variable $Y$, $u$ can be extended to a pseudo training set $\mathcal{P}$ that consists of $m$ instances as follows,

$\displaystyle u=\{x_{1},\cdots,x_{n}\}\Leftrightarrow\mathcal{P}=\left\{\begin{array}{c}\{u,y_{1}\}\\ \{u,y_{2}\}\\ \cdots\\ \{u,y_{m}\}\\ \end{array}\right.=\left\{\begin{array}{c}\{x_{1},\cdots,x_{n},y_{1}\}\\ \{x_{1},\cdots,x_{n},y_{2}\}\\ \cdots\\ \{x_{1},\cdots,x_{n},y_{m}\}\\ \end{array}\right.$ (12)

Then we need to learn a specific BNC, i.e., KST ${}_{\mathcal{P}}$ , from $\mathcal{P}$ rather than $u$ . The entropy function $h_{\mathcal{B}}(\mathcal{P})$ , which is defined as follows, corresponds to the number of bits needed to describe the instances in $\mathcal{P}$ .

$\displaystyle h_{\mathcal{B}}(\mathcal{P})=-\sum_{i=1}^{m}\log P(u,y_{i}).$ (13)
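Eqs (12) and (13) can be illustrated with a short Python sketch that builds the pseudo training set and sums the negative log joint probabilities. The callable `joint(u, y)` stands for any estimate of $P(u,y)$ and is a hypothetical placeholder, as are the function names; the base-2 logarithm is assumed because the text speaks of bits.

```python
import math


def pseudo_training_set(u, labels):
    """Extend the unlabeled instance u to one pseudo instance per class label, Eq. (12)."""
    return [(tuple(u), y) for y in labels]


def description_length(pseudo_set, joint):
    """h(P) = -sum_i log2 P(u, y_i), Eq. (13), for a joint estimate `joint(u, y)`."""
    return -sum(math.log2(joint(u, y)) for u, y in pseudo_set)


# Hypothetical joint estimates for a two-class problem.
toy_joint = lambda u, y: {"a": 0.25, "b": 0.125}[y]
pseudo = pseudo_training_set(["sunny", "hot"], ["a", "b"])
bits = description_length(pseudo, toy_joint)
```

A zero estimate for any label would make the logarithm undefined, so a practical implementation would need smoothed probabilities; the sketch leaves that choice open.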

Equation (13) implies that, to describe $u$ as well as possible, we need to find a network structure that minimizes the entropy function $h_{\mathcal{B}}(\mathcal{P})$. Expanding Eq. (13) with the factorization in Eq. (1) gives

$\displaystyle h_{\mathcal{B}}(\mathcal{P})=-\sum_{i=1}^{m}\log P_{\mathcal{B}}(u,y_{i})=-\sum_{i=1}^{m}\log\left\{P(y_{i})\prod^{n}_{j=1}P(x_{j}|y_{i},\Pi_{j}^{\mathcal{B}})\right\}=-\sum_{i=1}^{m}\log P(y_{i})-\sum_{i=1}^{m}\sum^{n}_{j=1}\log P(x_{j}|y_{i},\Pi_{j}^{\mathcal{B}})=h(Y)+\sum_{j=1}^{n}h(x_{j}|Y,\Pi_{j}^{\mathcal{B}}).$ (14)

For NB, Eq. (14) becomes

$\displaystyle h_{{\rm NB}}(\mathcal{P})=h(Y)+\sum^{n}_{j=1}h(x_{j}|Y).$ (15)

Comparing Eqs (14) and (15), the difference in the number of bits between NB and $\mathcal{B}$ is

$\displaystyle h_{{\rm NB}}(\mathcal{P})-h_{\mathcal{B}}(\mathcal{P})=\sum^{n}_{j=1}h(x_{j}|Y)-\sum_{j=1}^{n}h(x_{j}|Y,\Pi_{j}^{\mathcal{B}})=\sum_{j=1}^{n}\sum_{i=1}^{m}\log\frac{P(x_{j},\Pi_{j}^{\mathcal{B}}|y_{i})}{P(x_{j}|y_{i})P(\Pi_{j}^{\mathcal{B}}|y_{i})}=\sum^{n}_{j=1}I(x_{j};\Pi_{j}^{\mathcal{B}}|Y).$ (16)

$I(x_{j};\Pi_{j}^{\mathcal{B}}|Y)$ can be used to measure the conditional dependence between $X_{j}$ and its parents when they take specific values. Similarly, the conditional mutual information between a pair of attribute values $x_{i}$ and $x_{j}$ in $\mathcal{P}$, with $t$ indexing the $m$ class labels, turns out to be

$\displaystyle I(x_{i};x_{j}|Y)=\sum_{t=1}^{m}\log\frac{P(x_{i},x_{j}|y_{t})}{P(x_{i}|y_{t})P(x_{j}|y_{t})}.$ (17)

We can then learn the topology of KST ${}_{\mathcal{P}}$ from $\mathcal{P}$. Analogously to Eqs (8) and (11), for the pseudo training set $\mathcal{P}$ the maximum correlation criterion and minimum dependence criterion become

$\displaystyle\left\{\begin{array}{l}c(i)=\sum\limits_{j=1,i\neq j}^{n}I(x_{i};x_{j}|Y)\\ d(\Pi_{i}^{k},\Pi_{i})=\sum\limits_{x_{p}\in\{\Pi_{i}^{k}\cup Y\}}\sum\limits_{x_{j}\in\{\Pi_{i}-\Pi_{i}^{k}\}}I(x_{i};x_{j}|x_{p}).\\ \end{array}\right.$ (18)

Similar to the learning procedure shown in Algorithm 2, to learn KST ${}_{\mathcal{P}}$ we first sort the attribute values in $u$ by comparing $c(i)$ . Then for attribute value $x_{i}$ , if $i>k+1$ , its parents will be selected by comparing $d(\Pi_{i}^{k},\Pi_{i})$ .
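The instance-level ordering can be sketched as follows: for the attribute values in $u$, the log-ratio of Eq. (17) is summed over all class labels and the resulting $c(i)$ of Eq. (18) is used for sorting. The probability estimates are passed in as the hypothetical callables `p_pair(i, j, xi, xj, y)` and `p_single(i, xi, y)`, standing for $P(x_{i},x_{j}|y)$ and $P(x_{i}|y)$; how they are estimated and smoothed is left open, and all toy values are invented.

```python
import math


def local_cmi(p_pair, p_single, i, j, u, labels):
    """I(x_i; x_j | Y) for the values in u: the log-ratio of Eq. (17) summed over labels."""
    total = 0.0
    for y in labels:
        ratio = p_pair(i, j, u[i], u[j], y) / (
            p_single(i, u[i], y) * p_single(j, u[j], y)
        )
        total += math.log2(ratio)
    return total


def local_order(p_pair, p_single, u, labels):
    """Sort attribute positions of u by descending c(i), the first line of Eq. (18)."""
    n = len(u)
    c = [
        sum(local_cmi(p_pair, p_single, i, j, u, labels) for j in range(n) if j != i)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: c[i], reverse=True), c


# Hypothetical estimates: values 0 and 1 co-occur twice as often as under independence.
toy_single = lambda i, xi, y: 0.5
toy_pair = lambda i, j, xi, xj, y: 0.5 if {i, j} == {0, 1} else 0.25
order_u, c_u = local_order(toy_pair, toy_single, ("a", "b", "c"), ["y1", "y2"])
```

Unlike the training-set criterion, no probability weights appear in the sum, so every class label contributes equally to the score of an instance.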

3.3 Ensemble learning

Figure 3.

Learning process of ensemble model.

As shown in Fig. 3, the pseudo training set complements the training set. Thus the probability distributions needed for computing $c(i)$ or $d(\Pi_{i}^{k},\Pi_{i})$ can be learned from the training set. Both KST ${}_{\mathcal{T}}$ and KST ${}_{\mathcal{P}}$ have their own distinct characteristics. As shown in Eq. (7), the definition of conditional mutual information $I(X_{i};X_{j}|Y)$ considers all possible combinations of attribute values of $\{X_{i},X_{j},Y\}$ (including those that never appear in $u$). Thus for a specific instance $u$, the estimate of $I(X_{i};X_{j}|Y)$ is polluted by redundant information. Correspondingly, the edge $X_{i}\rightarrow X_{j}$ in KST ${}_{\mathcal{T}}$ describes the general conditional dependence between attributes $X_{i}$ and $X_{j}$, not that between attribute values $x_{i}$ and $x_{j}$. In contrast, the pseudo training set $\mathcal{P}$ can fully describe the conditional dependence between attribute values when class $Y$ takes all possible values. For unlabeled instance $u$, only one class label, e.g., $y_{i}$, is the right one. The other instances, e.g., $\{u,y_{j}\}$ $(i\neq j)$, can be regarded as artificially introduced noise. Thus the topology of KST ${}_{\mathcal{P}}$ will not be the optimal one.

KST ${}_{\mathcal{T}}$ underfits $u$ and KST ${}_{\mathcal{P}}$ overfits $u$. They are complementary in nature for classifying $u$, and the decision of their ensemble should on average have better overall accuracy than any individual member. After training KST ${}_{\mathcal{T}}$ and KST ${}_{\mathcal{P}}$, ensemble learning treats them as a “committee” of decision makers for $u$ and combines the individual predictions appropriately. For a subclassifier BNC, an estimate of the probability of input $u$ with pre-assigned class label $y_{i}$ is $P(y_{i},u|\mathrm{BNC})$. If $y_{i}$ is the right class label for $u$, the ideal situation is that the estimates $P(y_{i},u|KST_{\mathcal{T}})$ and $P(y_{i},u|KST_{\mathcal{P}})$ simultaneously achieve their maximum value among all class labels. Thus a linear combiner is applied to estimate the ensemble probability,

$\displaystyle P(y_{i},u)=w_{\mathcal{T}}\cdot P(y_{i},u|KST_{\mathcal{T}})+w_{\mathcal{P}}\cdot P(y_{i},u|KST_{\mathcal{P}}).$

From Eqs (8), (11) and (18), the estimates of the maximum correlation criterion and minimum dependence criterion are biased by redundant attribute values or class labels that may not appear in $u$. The weights $w_{\mathcal{T}}$ and $w_{\mathcal{P}}$ could help to address this issue, but evaluating the negative effect caused by redundancy is intractable due to the distinctive characteristics of different instances. Because KST ${}_{\mathcal{T}}$ and KST ${}_{\mathcal{P}}$ focus on different data spaces and have different topologies, they will achieve different classification accuracies on unlabeled instances if they work independently. A nonuniform combination could therefore in theory give a lower error than a uniform combination, but without a tractable way to estimate the weights a uniform combination is more feasible. Thus in practice we choose $w_{\mathcal{T}}=w_{\mathcal{P}}=1/2$.
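The linear combiner with uniform weights can be sketched as below. The joint estimates of the two sub-classifiers are passed in as plain dictionaries with invented values; the function name `ensemble_classify` and its parameters are assumptions for illustration.

```python
def ensemble_classify(joint_t, joint_p, labels, w_t=0.5, w_p=0.5):
    """Pick the label maximizing w_T * P(y, u | KST_T) + w_P * P(y, u | KST_P)."""
    combined = {y: w_t * joint_t[y] + w_p * joint_p[y] for y in labels}
    return max(combined, key=combined.get), combined


# Hypothetical joint estimates P(y, u | model) from the two sub-classifiers.
joint_t = {"a": 0.020, "b": 0.030, "c": 0.010}
joint_p = {"a": 0.050, "b": 0.010, "c": 0.010}
label, combined = ensemble_classify(joint_t, joint_p, ["a", "b", "c"])
```

In this toy case the general model alone would prefer label "b", while the combined score prefers "a", which shows how the local model can overturn a decision.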

4. Experiments

The experiments are conducted on 40 data sets from the UCI machine learning repository [33], and these data sets are assumed to contain no noisy data. As listed in Table 1, the size of the data sets ranges from 24 instances (Contact-lenses) to 164860 instances (Localization) and the number of class labels ranges from 2 to 50, enabling us to evaluate classifiers on data sets of various sizes and with a wide range of class labels. There are missing values in 14 data sets. The Bayesian network classifiers in the comparison study can deal with discrete attributes only, thus we discretize numeric attributes using Minimum Description Length (MDL) discretization [34]. To incorporate missing values into the probability computation, we replace missing values of qualitative attributes with a distinct value in all cases and missing values of quantitative attributes with the mean value. Despite the “noise” caused by the above data pre-processing steps, we hypothesize that there is sufficient data present for every possible combination of attribute values, and that direct estimation of each relevant multivariate probability will still be reliable for BNC learning.

Each algorithm is tested on each data set using 10 rounds of 10-fold cross validation. When two algorithms are compared, win/draw/loss (W/D/L) record is applied to compare the number of data sets on which one algorithm performs better/similarly/poorer than the other on a given measure. If the outcome of a one-tailed binomial sign test is less than 0.05, the difference is regarded as significant. The following algorithms are compared:

• NB, standard naive Bayes.
• TAN, tree-augmented naive Bayes.
• KDB, $k$-dependence Bayesian classifier, $k=2$.
• ATAN, averaged tree-augmented naive Bayes.
• AODE, Averaged One-Dependence Estimators.
• KNN, $k$-Nearest Neighbor with default parameters.
• SVM, Support Vector Machine with default parameters.
• KST, $k$-dependence Spanning Tree, $k=2$.

These algorithms can be grouped into three types: single model BNCs (NB, TAN and KDB), ensemble BNCs (ATAN, AODE and KST) and non-BNC learners (KNN and SVM). For the BNC learners, all the experiments use C++ software specially designed to deal with out-of-core classification methods. For the non-BNC learners, the experiments are conducted in the widely used machine learning workbench Weka (version 3.5.7) and the default parameters are applied to all data sets.

4.1 Comparisons in terms of zero-one loss

Zero-one loss is the most common loss function to evaluate the classification performance. Table A1 reports for each data set the average zero-one loss and Table 2 summarizes corresponding W/D/L records. Cell $\left[i;j\right]$ in Table 2 contains the number of wins/draws/losses for the classifier on row $i$ against the classifier on column $j$ .

Table 2
W/D/L comparison results of 0–1 loss on all data sets

W/D/L	NB	TAN	KDB	ATAN	AODE	KNN	SVM
TAN	21/8/11
KDB	21/7/12	17/10/13
ATAN	20/8/12	0/37/3	13/10/17
AODE	24/11/5	15/14/11	17/10/13	18/12/10
KNN	17/3/20	14/2/24	16/3/21	15/3/22	13/0/27
SVM	13/2/25	8/5/27	8/3/29	9/4/27	8/4/28	10/1/29
KST	26/9/5	29/9/2	21/13/6	29/9/2	19/15/6	27/3/10	31/4/5
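The significance check applied to a W/D/L record can be sketched with the standard library alone. The snippet computes the one-tailed binomial sign test as the probability of observing at least the given number of wins under a fair coin; ignoring draws is an assumption of this sketch, since the text does not state how draws are handled, and the function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """One-tailed sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).

    Draws are ignored here, which is an assumption of this sketch.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Record of KST against NB in Table 2: 26 wins, 9 draws, 5 losses.
p_kst_vs_nb = sign_test_p(26, 5)
significant = p_kst_vs_nb < 0.05
```

Using exact integer binomial coefficients avoids the rounding problems that a floating-point factorial formula could introduce for larger numbers of data sets.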

The reasonableness of NB’s independence assumption is often questioned due to the non-negative characteristic of conditional mutual information. NB can only represent the direct dependence between any attribute $X_{i}$ and the class variable, and the number of bits encoded in NB is less than that in its extended variations. However, as shown in Table 2, NB still beats TAN and KDB on 11 and 12 data sets, respectively. Thus conditional independence does exist in some cases, and how to identify it is an important issue for learning BNCs. Generally, among single model classifiers, high-dependence BNCs show significant advantages. KDB beats TAN on 17 data sets and loses on 13, and TAN beats NB on 21 data sets and loses on 11. Ensemble classifiers can represent more conditional dependencies, and if the submodels are complementary they can perform better than single model classifiers of the same dependence degree. Each submodel in ATAN represents the same conditional dependencies, thus ATAN performs similarly to TAN. In contrast, 1-dependence AODE outperforms all single model BNCs, including NB (24 wins and 5 losses), TAN (15 wins and 11 losses) and KDB (17 wins and 13 losses). KST uses the maximum correlation criterion to mine significant conditional dependencies and the minimum dependence criterion to identify conditional independencies, in both the labeled training data and the unlabeled instance. The experimental results provide solid evidence for the effectiveness of the proposed heuristic search strategy. KST enjoys a significant advantage over other BNCs and it retains the advantage when compared to non-BNC learners.

4.2 Bias and variance

The Bias-variance decomposition provides valuable insights into the components of the error of learned classifiers [35]. Bias measures the deviation between the expected output of the learning algorithm and the real result, and describes the decision surfaces for a domain. Variance describes the component of error that stems from sampling, which reflects the sensitivity to variations in the training data.

Among single model BNCs, high-dependence BNCs enjoy a significant advantage in representing conditional dependencies and correspond to more robust topologies. Due to the ensemble learning strategy, an ensemble BNC can represent more conditional dependencies than a single model BNC of the same dependence degree. For example, TAN, KDB and AODE can respectively represent $n-1$, $nk-\frac{k^{2}}{2}-\frac{k}{2}$ and $n(n-1)$ conditional dependencies (including duplicated ones). Thus for bias, as shown in Table 3, TAN performs better than NB (23 wins and 11 losses), KDB better than TAN (16 wins and 8 losses) and AODE better than TAN (18 wins and 12 losses). KST is both a high-dependence BNC and an ensemble classifier; it beats TAN, KDB and AODE with W/D/L records of 23/11/6, 18/13/9 and 15/15/10, respectively. Variance-wise, high-dependence BNCs may overfit the training data, which may result in high variance, so BNCs with relatively simple network topologies tend to perform better in terms of variance. The topologies of NB and AODE are fixed and are therefore insensitive to variations in the training data. As shown in Table 4, NB achieves the most stable performance with its simplest topology. Although it loses to NB, the NB-based AODE holds a larger lead than the other algorithms. TAN beats KDB (20 wins and 10 losses) and ATAN beats TAN (4 wins and 3 losses), although the latter advantage is not significant. The local topology learned from the unlabeled instance helps KST avoid overfitting, and KST performs better than TAN (26 wins and 9 losses) and KDB (27 wins and 6 losses).
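The dependency counts quoted above can be checked with a short sketch. It assumes that, in KDB, the attribute at position $i$ of the attribute ordering (counting from 0) takes $\min(i,k)$ attribute parents, and that AODE consists of $n$ one-dependence submodels with $n-1$ dependencies each. The function names are illustrative only.

```python
def tan_dependencies(n):
    """A tree over n attributes has n - 1 attribute-to-attribute arcs."""
    return n - 1


def kdb_dependencies(n, k):
    """Count arcs directly: the i-th attribute (0-based) has min(i, k) parents."""
    return sum(min(i, k) for i in range(n))


def kdb_closed_form(n, k):
    """Closed form nk - k^2/2 - k/2, valid for 0 <= k <= n - 1."""
    return n * k - k * (k + 1) // 2


def aode_dependencies(n):
    """n submodels, each linking one super-parent to the other n - 1 attributes."""
    return n * (n - 1)


# Example with n = 10 attributes and k = 2.
print(tan_dependencies(10), kdb_dependencies(10, 2), aode_dependencies(10))  # 9 17 90
```

The direct count and the closed form agree for every $k\leq n-1$, which is the range in which the expression in the text applies.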

Table 3
W/D/L comparison results of bias on all data sets

W/D/L	NB	TAN	KDB	ATAN	AODE	KNN	SVM
TAN	23/6/11
KDB	23/8/9	16/16/8
ATAN	22/6/12	3/33/4	9/14/17
AODE	25/13/2	18/10/12	18/8/14	18/11/11
KNN	21/5/14	16/6/18	10/12/18	15/8/17	17/3/20
SVM	9/3/28	4/2/34	3/2/35	4/2/34	6/1/33	4/3/33
KST	27/7/6	23/11/6	18/13/9	24/11/5	15/15/10	22/5/13	36/1/3

Table 4

W/D/L comparison results of variance on all data sets

W/D/L	NB	TAN	KDB	ATAN	AODE	KNN	SVM
TAN	5/1/34
KDB	7/4/29	10/10/20
ATAN	6/1/33	4/33/3	22/9/9
AODE	7/7/26	30/5/5	29/4/7	32/3/5
KNN	3/2/35	11/3/26	14/4/22	12/3/25	2/3/35
SVM	20/1/19	26/2/12	27/0/13	26/2/12	23/3/14	27/2/11
KST	7/3/30	26/5/9	27/7/6	27/5/8	6/4/30	27/7/6	13/1/26

The bias-variance trade-off is an important issue in statistics and machine learning, because it determines how well a model trained with limited data generalizes to further data. Bias/variance estimation provides insights into how a learning algorithm will perform with varying amounts of data. We expect low-variance algorithms to have relatively low error on small data sets and low-bias algorithms to have relatively low error on large data sets. The Goal Difference function $GD(A;B|\mathcal{T})$ [36] is introduced to compare the classification performance of learners $A$ and $B$ on data sets of different sizes,

$\displaystyle GD(A;B|\mathcal{T})=|\textit{Win}|-|\textit{Loss}|$ (19)

where $\mathcal{T}$ denotes the set of data sets used in the experimental study, and $|\textit{Win}|$ and $|\textit{Loss}|$ respectively denote the number of data sets on which $A$ beats or loses to $B$ for a given measure. AODE is commonly regarded as a low-bias and low-variance BNC, so we compare it with KST in the following discussion. Figure 4 shows the learning curve of GD, in which the data sets are indexed in ascending order of size. As can be seen from Fig. 4a, when the index number is smaller than 23, i.e., the number of instances is less than 1473, KST performs similarly to AODE in terms of bias (5 wins and 7 losses) but worse in terms of variance (2 wins and 19 losses). As the data size increases, KST shows its advantage over AODE in terms of bias (10 wins and 3 losses), whereas its disadvantage in terms of variance becomes less significant (4 wins and 11 losses). However, KST retains the advantage over AODE regardless of the change in data size. Thus we conclude that KST allows a fine-tuned trade-off between expressivity on the one hand, and efficiency of learning and guarding against overfitting on the other, i.e., the bias-variance trade-off. SVM is run with default parameters and without tuning, so it performs worse than KST in terms of bias (3 wins and 36 losses) but better in terms of variance (26 wins and 13 losses).
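Eq. (19) and the learning curve in Fig. 4 can be sketched as follows. The sketch assumes that lower scores are better (as for bias, variance and 0–1 loss), that a win is any strictly lower score (the tolerance `tol` is a hypothetical parameter, since the win threshold is not restated here), and that the curve plots the cumulative GD over data sets sorted by size. These are assumptions for illustration rather than the authors' exact procedure.

```python
def win_draw_loss(scores_a, scores_b, tol=0.0):
    """Count data sets on which A beats, ties with, or loses to B.

    scores_a[i] and scores_b[i] are error-like scores on data set i
    (lower is better). Differences within `tol` count as draws.
    """
    win = draw = loss = 0
    for a, b in zip(scores_a, scores_b):
        if a < b - tol:
            win += 1
        elif a > b + tol:
            loss += 1
        else:
            draw += 1
    return win, draw, loss


def goal_difference(scores_a, scores_b, tol=0.0):
    """GD(A; B | T) = |Win| - |Loss| as in Eq. (19)."""
    win, _, loss = win_draw_loss(scores_a, scores_b, tol)
    return win - loss


def gd_curve(scores_a, scores_b, tol=0.0):
    """Cumulative GD after each data set, with data sets sorted by size."""
    return [
        goal_difference(scores_a[: m + 1], scores_b[: m + 1], tol)
        for m in range(len(scores_a))
    ]


# Toy scores for four data sets (made-up numbers, not taken from the tables).
a = [0.1, 0.2, 0.3, 0.3]
b = [0.2, 0.2, 0.1, 0.4]
print(win_draw_loss(a, b))  # (2, 1, 1)
print(gd_curve(a, b))       # [1, 1, 0, 1]
```

A rising curve indicates that learner $A$ gains ground on $B$ as the data sets grow, which is how the trends for bias and variance are read from Fig. 4a.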

Table 5

W/D/L comparison results of RMSE on all data sets

W/D/L	NB	TAN	KDB	ATAN	AODE	KNN	SVM
TAN	22/12/6
KDB	20/12/8	13/21/6
ATAN	22/12/6	1/38/1	6/20/14
AODE	21/15/4	14/20/6	17/12/11	15/19/6
KNN	14/4/22	9/5/26	10/3/27	9/5/26	7/6/27
SVM	8/2/30	4/3/33	5/1/34	4/3/33	5/2/33	9/3/28
KST	24/11/5	16/22/2	19/15/6	17/21/2	11/23/6	29/5/6	34/3/3

Figure 4.

The comparison results of GD in terms of (a) bias-variance and (b) 0–1 loss.

4.3 RMSE

The Root-Mean-Square Error (RMSE) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. The RMSE is the square root of the second sample moment of the differences between predicted and observed values, i.e., the quadratic mean of these differences. It aggregates the magnitudes of the prediction errors over individual data points into a single measure of predictive power. RMSE is a measure of accuracy used to compare the prediction errors of different models on a particular data set, not across data sets, as it is scale-dependent.

The estimate of the conditional probability $P(y|\textbf{x})$ used for computing RMSE will be more precise if the BNC represents enough conditional dependencies. As shown in Table 5, among single model classifiers, high-dependence BNCs enjoy significant advantages over low-dependence ones. KDB beats TAN on 13 data sets and loses on 6, and TAN beats NB on 22 data sets and loses on 6. However, the biased estimate of high-order conditional probabilities may result in poor RMSE performance on small data sets. The estimate of low-order conditional probabilities has a high confidence level, and ensemble learning can help to address this issue. The RMSE results for TAN and ATAN are similar, and KST performs the best among all BNCs. KST also performs better than the non-Bayesian classifiers: it beats KNN and SVM on 29 and 34 data sets, respectively.
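For a probabilistic classifier, RMSE is commonly computed from the probability assigned to the true class of each test instance. The sketch below assumes that form, $\sqrt{\frac{1}{N}\sum_{i}(1-\hat{P}(y_{i}|\textbf{x}_{i}))^{2}}$, which is a common convention and may differ in detail from the one used in the experiments. The function name and the dictionary-based input are illustrative choices.

```python
import math


def rmse(class_probs, true_labels):
    """RMSE of the probability assigned to the true class.

    class_probs[i] maps each class label to the estimated P(y | x_i);
    true_labels[i] is the observed class of instance i.
    """
    total = 0.0
    for probs, y in zip(class_probs, true_labels):
        # Error is the probability mass NOT placed on the true class.
        total += (1.0 - probs.get(y, 0.0)) ** 2
    return math.sqrt(total / len(true_labels))


# Two toy instances whose true class is "a" (made-up probabilities).
estimates = [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}]
print(round(rmse(estimates, ["a", "a"]), 4))  # 0.4472
```

Unlike 0–1 loss, this measure penalizes a correct but unconfident prediction, which is why the precision of the estimated $P(y|\textbf{x})$ matters here.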

4.4 Friedman test

The Friedman [37] and the Nemenyi [38] tests are effective for comparing multiple classifiers across multiple data sets. Classifiers are ranked on each data set according to their classification performance, and the Friedman statistic is defined as follows:

$\displaystyle\chi_{F}^{2}=\frac{12N}{k(k+1)}\sum_{j=1}^{k}R_{j}^{2}-3N(k+1)$ (20)

where $R_{j}=\frac{1}{N}\sum_{i}r_{i}^{j}$ is the average rank of the $j$-th algorithm and $r_{i}^{j}$ is the rank of the $j$-th of $k$ algorithms on the $i$-th of $N$ data sets. Under the null hypothesis that all algorithms perform equivalently, the Friedman statistic follows a $\chi^{2}$ distribution with $k-1$ degrees of freedom. Thus, for any pre-determined level of significance $\alpha$, the null hypothesis is rejected if $\chi_{F}^{2}>\chi_{\alpha}^{2}$. The critical value $\chi_{\alpha}^{2}$ for $\alpha=0.05$ with seven degrees of freedom is 14.07. The Friedman statistics of the experimental results are 49.63 for zero-one loss, 85.13 for RMSE, 73.56 for bias and 84.72 for variance, all of which are larger than 14.07. Hence, the null hypothesis is rejected in each case.
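The ranking step and Eq. (20) can be sketched in a few lines. The sketch takes $R_{j}$ to be the average rank of algorithm $j$ over the $N$ data sets, as in Demšar [37], which is the form for which Eq. (20) holds. It assigns average ranks to ties and applies no tie correction to the statistic. Function names and the score layout are assumptions for this example.

```python
def average_ranks(scores):
    """Average rank of each algorithm over all data sets.

    scores[i][j] is the error of algorithm j on data set i (lower is
    better, rank 1 is best). Tied scores share the average of their ranks.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            # Extend the block of algorithms tied with the one at `pos`.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
            for j in order[pos : end + 1]:
                totals[j] += rank
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic of Eq. (20), without tie correction."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * sum_sq - 3.0 * n_datasets * (k + 1)


# Toy example: three algorithms, three data sets, identical ordering each time.
toy = [[0.10, 0.20, 0.30], [0.05, 0.15, 0.25], [0.20, 0.30, 0.40]]
ranks = average_ranks(toy)
print(ranks, friedman_statistic(ranks, len(toy)))  # [1.0, 2.0, 3.0] 6.0
```

The resulting value is then compared with the $\chi^{2}$ critical value for $k-1$ degrees of freedom (14.07 for the eight algorithms compared here).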

Figure 5.

Zero-one loss, RMSE, bias and variance comparison with the Nemenyi test on 40 data sets. CD $=$ 1.6601.

We can then use the Nemenyi test [38] to analyze whether there are significant differences between pairs of algorithms in terms of their average ranks in the Friedman test. The critical difference (CD) is used to evaluate the difference and is defined as follows:

$\displaystyle CD=q_{\alpha}\sqrt{\frac{t(t+1)}{6N}},$ (21)

where $t$ is the number of algorithms (denoted $k$ in Eq. (20)) and the critical value $q_{\alpha}$ for $\alpha=0.05$ and $t=8$ is 3.031. With 8 algorithms ($t=8$) and 40 data sets ($N=40$), Eq. (21) gives $CD=1.6601$. The comparison results are shown in Fig. 5, in which the algorithms are listed on the left axis and their average ranks are marked on the parallel right axis. The lower the average rank of an algorithm, the better it performs. If the difference between the average ranks of two algorithms is less than the CD value, the two algorithms are connected by a line.
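The CD value can be reproduced directly from Eq. (21). The short sketch below uses the critical value 3.031 quoted above and then applies the CD to two pairs of average ranks reported for Fig. 5. The helper names are illustrative.

```python
import math


def nemenyi_cd(q_alpha, n_algorithms, n_datasets):
    """Critical difference of Eq. (21)."""
    return q_alpha * math.sqrt(n_algorithms * (n_algorithms + 1) / (6.0 * n_datasets))


def significantly_different(rank_a, rank_b, cd):
    """Two algorithms differ if their average ranks are at least CD apart."""
    return abs(rank_a - rank_b) >= cd


cd = nemenyi_cd(3.031, 8, 40)
print(round(cd, 4))  # 1.6601

# Average bias ranks quoted in the text: KST 2.9000, NB 5.2625.
print(significantly_different(2.9000, 5.2625, cd))  # True
# Average RMSE ranks quoted in the text: KST 2.5750, AODE 3.4000.
print(significantly_different(2.5750, 3.4000, cd))  # False
```

Because CD shrinks with $\sqrt{N}$, adding data sets makes smaller rank differences detectable, while adding algorithms raises the threshold.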

As shown in Fig. 5a, KST achieves the lowest mean zero-one loss rank (2.5250) and the Nemenyi test differentiates KST from all the other algorithms. AODE, TAN, ATAN and KNN perform similarly, and they enjoy a relatively significant zero-one loss advantage over KDB. NB and SVM obtain the highest ranks among all the algorithms. NB can only learn linear decision boundaries, and underfitting makes it fail to achieve the bias-variance trade-off when dealing with large data sets. When RMSE is compared, Fig. 5b shows that KST and AODE achieve the lowest and second lowest mean RMSE ranks (2.5750 and 3.4000, respectively). KDB has a lower rank than TAN, but the advantage is not significant. KNN and SVM obtain the highest RMSE ranks among all the algorithms.

From Fig. 5c, KST obtains the lowest mean bias rank (2.9000), followed by KDB (3.8000) and AODE (3.8625). TAN, ATAN, KNN and NB come next (4.2250, 4.2250, 4.6250 and 5.2625, respectively). The bias disadvantage of SVM relative to the other algorithms is clear. High-dependence relationships or ensemble learning have an obvious positive effect on reducing the bias of a BNC. When variance is compared, Fig. 5d shows that NB has the lowest mean variance rank (2.6375). AODE and SVM have the second and third lowest mean variance ranks (3.0875 and 3.4500, respectively). Although KST performs worse than these three algorithms, it achieves a significantly lower mean variance rank than TAN, ATAN, KNN and KDB.

5. Conclusions

Conditional mutual information $I(X_{i};\Pi_{i}|Y)$ can measure the conditional dependence between $X_{i}$ and its parents $\Pi_{i}$. Considering the possible interplay between attributes in $\Pi_{i}$, $I(X_{i};X_{j}|Y)$ $(X_{j}\in\Pi_{i})$ is not appropriate for identifying the conditional independence between $X_{i}$ and $X_{j}$, and the conditional dependence may vary greatly for different instances. To address this issue and fully represent the significant conditional dependencies, we propose to sort attributes by comparing $I(X_{i};X_{j}|Y)$ $(i\neq j)$. Then the minimum dependence criterion $\mathcal{D}(\Pi_{i}^{k},\Pi_{i})$ is introduced to identify the independence relationships and filter out redundant parents for each attribute. The experimental results show that the proposed algorithm achieves competitive classification performance compared to state-of-the-art single model BNCs, ensemble BNCs and other learners.

Footnotes

Appendix A

Table A1

Experimental results of 0–1 loss

No.	Data set	NB	TAN	KDB	ATAN	AODE	KNN	SVM	KST
1	Contact-lenses	0.3750	0.3750	0.2500	0.4583	0.3750	0.2083	0.3750	0.3333
2	Zoo	0.0297	0.0099	0.0495	0.0198	0.0297	0.0396	0.0891	0.0297
3	Lymphography	0.1486	0.1757	0.2365	0.1757	0.1689	0.1892	0.1824	0.1757
4	Teaching-ae	0.4967	0.5497	0.5364	0.5497	0.4901	0.3377	0.5298	0.4570
5	Wine	0.0169	0.0337	0.0225	0.0337	0.0225	0.0506	0.5562	0.0225
6	Autos	0.3122	0.2146	0.2049	0.2195	0.2049	0.2390	0.6537	0.1902
7	Glass-id	0.2617	0.2196	0.2196	0.2150	0.2523	0.2103	0.2430	0.2196
8	Audio	0.2389	0.2920	0.3230	0.2965	0.2035	0.2301	0.5487	0.2257
9	Heart	0.1778	0.1926	0.2111	0.1926	0.1704	0.2444	0.4259	0.1815
10	Hungarian	0.1599	0.1701	0.1803	0.1701	0.1667	0.2313	0.3605	0.1565
11	Heart-disease-c	0.1815	0.2079	0.2244	0.2046	0.2013	0.2442	0.4323	0.1914
12	Primary-tumor	0.5457	0.5428	0.5723	0.5398	0.5752	0.6401	0.5605	0.5634
13	Ionosphere	0.1054	0.0684	0.0741	0.0684	0.0741	0.1368	0.0655	0.0684
14	Dermatology	0.0191	0.0328	0.0656	0.0328	0.0164	0.0546	0.1694	0.0273
15	Horse-colic	0.2174	0.2092	0.2446	0.2147	0.2011	0.2228	0.3370	0.1984
16	House-votes-84	0.0943	0.0552	0.0506	0.0529	0.0529	0.0759	0.0460	0.0506
17	Cylinder-bands	0.2148	0.2833	0.2259	0.2944	0.1889	0.2556	0.2333	0.1889
18	Balance-scale	0.2720	0.2736	0.2784	0.2752	0.2832	0.1344	0.1024	0.2624
19	Credit-a	0.1406	0.1507	0.1464	0.1522	0.1391	0.1884	0.4449	0.1420
20	Breast-cancer-w	0.0258	0.0415	0.0744	0.0443	0.0358	0.0443	0.0272	0.0415
21	Pima-ind-diabetes	0.2448	0.2383	0.2448	0.2383	0.2383	0.2982	0.3490	0.2396
22	Tic-tac-toe	0.3069	0.2286	0.2035	0.2276	0.2651	0.0125	0.1221	0.0710
23	Contraceptive-mc	0.5037	0.4888	0.5003	0.4888	0.4942	0.5567	0.4515	0.4915
24	Car	0.1400	0.0567	0.0382	0.0567	0.0816	0.0648	0.0810	0.0486
25	Mfeat-mor	0.3140	0.2970	0.3060	0.2990	0.3145	0.3450	0.6450	0.3050
26	Segment	0.0788	0.0390	0.0472	0.0403	0.0342	0.0286	0.3463	0.0368
27	Hypothyroid	0.0149	0.0104	0.0107	0.0104	0.0136	0.0291	0.0417	0.0111
28	Kr-vs-kp	0.1214	0.0776	0.0416	0.0776	0.0842	0.0372	0.0610	0.0635
29	Hypo	0.0138	0.0141	0.0114	0.0135	0.0095	0.0867	0.0740	0.0090
30	Sick	0.0308	0.0257	0.0223	0.0257	0.0273	0.0382	0.0610	0.0244
31	Phoneme	0.2615	0.2733	0.1984	0.2742	0.2392	0.2560	0.4071	0.2319
32	Wall-following	0.1054	0.0554	0.0401	0.0555	0.0370	0.1182	0.0971	0.0321
33	Page-blocks	0.0619	0.0415	0.0391	0.0422	0.0338	0.0398	0.0870	0.0327
34	Mushrooms	0.0196	0.0001	0.0000	0.0001	0.0001	0.0000	0.0020	0.0000
35	Sign	0.3586	0.2755	0.2539	0.2755	0.2821	0.1340	0.3273	0.2484
36	Nursery	0.0973	0.0654	0.0289	0.0654	0.0730	0.0162	0.0244	0.0481
37	Magic	0.2239	0.1675	0.1637	0.1674	0.1752	0.1906	0.3412	0.1568
38	Shuttle	0.0039	0.0015	0.0009	0.0015	0.0008	0.0007	0.0166	0.0009
39	Waveform	0.0220	0.0202	0.0256	0.0202	0.0180	0.0404	0.0271	0.0186
40	Localization	0.4955	0.3575	0.2964	0.3575	0.3596	0.2226	0.4209	0.2794

Table A2

Experimental results of bias

No.	Data set	NB	TAN	KDB	ATAN	AODE	KNN	SVM	KST
1	Contact-lenses	0.2163	0.1825	0.3175	0.3137	0.2850	0.4352	0.4682	0.2838
2	Zoo	0.0318	0.0303	0.0403	0.0300	0.0273	0.0624	0.2022	0.0333
3	Lymphography	0.0902	0.1027	0.1041	0.0963	0.0933	0.1687	0.2802	0.1000
4	Teaching-ae	0.4836	0.4566	0.4606	0.4566	0.4370	0.3639	0.4843	0.4212
5	Wine	0.0331	0.0507	0.0520	0.0515	0.0346	0.0430	0.3955	0.0275
6	Autos	0.2181	0.2356	0.2253	0.2310	0.2165	0.2222	0.4871	0.2191
7	Glass-id	0.2901	0.2756	0.2713	0.2756	0.2785	0.1729	0.1810	0.2745
8	Audio	0.2733	0.3617	0.3493	0.3173	0.1753	0.2466	0.4860	0.2788
9	Heart	0.1368	0.1472	0.1697	0.1516	0.1403	0.1747	0.4826	0.1503
10	Hungarian	0.1646	0.1424	0.1480	0.1446	0.1582	0.1454	0.3878	0.1254
11	Heart-disease-c	0.1297	0.1263	0.1299	0.1267	0.1138	0.1360	0.3876	0.1106
12	Primary-tumor	0.4106	0.4249	0.4184	0.4244	0.4274	0.4307	0.5103	0.4226
13	Ionosphere	0.1220	0.0804	0.0855	0.0809	0.0744	0.1368	0.0973	0.0769
14	Dermatology	0.0079	0.0274	0.0489	0.0266	0.0055	0.0489	0.3482	0.0143
15	Horse-colic	0.1966	0.1848	0.1689	0.1963	0.1990	0.1717	0.3694	0.1898
16	House-votes-84	0.0899	0.0410	0.0258	0.0412	0.0430	0.0406	0.0331	0.0475
17	Cylinder-bands	0.2000	0.3117	0.1939	0.3217	0.1589	0.2012	0.3630	0.1548
18	Balance-scale	0.1840	0.1843	0.1902	0.1840	0.1905	0.1322	0.0971	0.1850
19	Credit-a	0.0912	0.1171	0.1137	0.1237	0.0921	0.1447	0.4637	0.0997
20	Breast-cancer-w	0.0187	0.0384	0.0449	0.0315	0.0338	0.0435	0.0257	0.0237
21	Pima-ind-diabetes	0.1957	0.1946	0.1944	0.1946	0.1937	0.1976	0.3359	0.1916
22	Tic-tac-toe	0.2614	0.1746	0.1367	0.1742	0.2005	0.0409	0.2041	0.0466
23	Contraceptive-mc	0.3928	0.3425	0.3702	0.3419	0.3816	0.3577	0.3568	0.3534
24	Car	0.0937	0.0478	0.0387	0.0478	0.0556	0.0799	0.1141	0.0451
25	Mfeat-mor	0.2624	0.2077	0.2142	0.2103	0.2477	0.2163	0.4895	0.2223
26	Segment	0.0785	0.0491	0.0453	0.0483	0.0367	0.0297	0.3957	0.0419
27	Hypothyroid	0.0116	0.0104	0.0096	0.0106	0.0094	0.0262	0.0449	0.0094
28	Kr-vs-kp	0.1107	0.0702	0.0417	0.0702	0.0747	0.0531	0.0812	0.0579
29	Hypo	0.0092	0.0124	0.0077	0.0120	0.0071	0.0619	0.0784	0.0087
30	Sick	0.0246	0.0207	0.0198	0.0208	0.0224	0.0354	0.0612	0.0220
31	Phoneme	0.2216	0.2394	0.1572	0.2360	0.2207	0.2088	0.7354	0.1854
32	Wall-following	0.0951	0.0491	0.0257	0.0490	0.0251	0.1105	0.1065	0.0232
33	Page-blocks	0.0451	0.0308	0.0280	0.0308	0.0251	0.0346	0.0894	0.0253
34	Mushrooms	0.0237	0.0001	0.0001	0.0001	0.0004	0.0000	0.0113	0.0002
35	Sign	0.3257	0.2420	0.2161	0.2417	0.2531	0.1140	0.3322	0.2129
36	Nursery	0.0928	0.0521	0.0281	0.0521	0.0651	0.0325	0.0535	0.0363
37	Magic	0.2111	0.1252	0.1241	0.1252	0.1600	0.1341	0.3404	0.1257
38	Shuttle	0.0040	0.0008	0.0007	0.0009	0.0006	0.0009	0.0371	0.0007
39	Waveform	0.0219	0.0152	0.0210	0.0152	0.0156	0.0245	0.0262	0.0152
40	Localization	0.4523	0.3106	0.2134	0.3105	0.3129	0.1671	0.4634	0.2084

Table A3

Experimental results of variance

No.	Data set	NB	TAN	KDB	ATAN	AODE	KNN	SVM	KST
1	Contact-lenses	0.1712	0.1925	0.1700	0.1613	0.1275	0.1484	0.0540	0.1412
2	Zoo	0.0439	0.0606	0.0658	0.0548	0.0424	0.0745	0.1062	0.0485
3	Lymphography	0.0343	0.1116	0.1408	0.1118	0.0476	0.0815	0.0796	0.0755
4	Teaching-ae	0.1484	0.1914	0.1494	0.1914	0.1650	0.1932	0.1879	0.1848
5	Wine	0.0093	0.0493	0.0649	0.0502	0.0231	0.0299	0.2273	0.0319
6	Autos	0.1349	0.1747	0.1821	0.1704	0.1541	0.1988	0.2336	0.1706
7	Glass-id	0.0930	0.1075	0.1189	0.1075	0.1004	0.1182	0.1500	0.1170
8	Audio	0.1000	0.0983	0.1373	0.1613	0.1407	0.1295	0.2437	0.1439
9	Heart	0.0443	0.0739	0.0914	0.0751	0.0497	0.0813	0.0032	0.0663
10	Hungarian	0.0201	0.0596	0.0561	0.0585	0.0255	0.0863	0.0001	0.0470
11	Heart-disease-c	0.0248	0.0479	0.0582	0.0495	0.0357	0.0829	0.0908	0.0488
12	Primary-tumor	0.1752	0.2424	0.2391	0.2428	0.1814	0.2402	0.1596	0.2040
13	Ionosphere	0.0242	0.0401	0.0581	0.0422	0.0385	0.0361	0.0281	0.0325
14	Dermatology	0.0216	0.0513	0.0684	0.0496	0.0199	0.0546	0.1335	0.0389
15	Horse-colic	0.0353	0.1021	0.1384	0.1045	0.0452	0.0776	0.0019	0.0930
16	House-votes-84	0.0066	0.0170	0.0197	0.0167	0.0094	0.0218	0.0089	0.0152
17	Cylinder-bands	0.0656	0.0739	0.0750	0.0699	0.0961	0.1373	0.0511	0.1157
18	Balance-scale	0.0848	0.0941	0.0872	0.0938	0.0854	0.0831	0.0432	0.0852
19	Credit-a	0.0249	0.0555	0.0768	0.0589	0.0305	0.0663	0.0221	0.0464
20	Breast-cancer-w	0.0010	0.0337	0.0504	0.0372	0.0134	0.0169	0.0027	0.0273
21	Pima-ind-diabetes	0.0715	0.0663	0.0689	0.0663	0.0727	0.1134	0.0000	0.0677
22	Tic-tac-toe	0.0455	0.0824	0.1125	0.0823	0.0513	0.0591	0.0528	0.0606
23	Contraceptive-mc	0.0856	0.1646	0.1705	0.1644	0.1058	0.1947	0.1090	0.1565
24	Car	0.0520	0.0376	0.0434	0.0373	0.0438	0.0756	0.0370	0.0556
25	Mfeat-mor	0.0622	0.1020	0.1031	0.1015	0.0677	0.1031	0.2973	0.0924
26	Segment	0.0259	0.0294	0.0381	0.0285	0.0255	0.0259	0.2633	0.0265
27	Hypothyroid	0.0031	0.0034	0.0024	0.0031	0.0034	0.0109	0.0006	0.0024
28	Kr-vs-kp	0.0186	0.0152	0.0111	0.0152	0.0186	0.0559	0.0026	0.0176
29	Hypo	0.0051	0.0071	0.0069	0.0069	0.0049	0.0324	0.0003	0.0062
30	Sick	0.0047	0.0051	0.0043	0.0052	0.0042	0.0197	0.0008	0.0043
31	Phoneme	0.1215	0.1828	0.1064	0.1841	0.1343	0.1559	0.0295	0.1501
32	Wall-following	0.0211	0.0288	0.0294	0.0287	0.0242	0.0696	0.0390	0.0227
33	Page-blocks	0.0135	0.0143	0.0177	0.0144	0.0124	0.0194	0.0034	0.0132
34	Mushrooms	0.0043	0.0002	0.0002	0.0002	0.0001	0.0001	0.0010	0.0005
35	Sign	0.0313	0.0386	0.0596	0.0386	0.0378	0.0673	0.0210	0.0542
36	Nursery	0.0085	0.0168	0.0195	0.0167	0.0105	0.0582	0.0081	0.0177
37	Magic	0.0174	0.0490	0.0491	0.0490	0.0297	0.0707	0.0019	0.0427
38	Shuttle	0.0009	0.0004	0.0003	0.0004	0.0004	0.0006	0.0122	0.0003
39	Waveform	0.0009	0.0053	0.0037	0.0053	0.0025	0.0181	0.0026	0.0030
40	Localization	0.0460	0.0594	0.1099	0.0594	0.0580	0.0863	0.0192	0.1033

Table A4

Experimental results of RMSE

No.	Data set	NB	TAN	KDB	ATAN	AODE	KNN	SVM	KST
1	Contact-lenses	0.3778	0.4496	0.3639	0.4101	0.4066	0.3165	0.5000	0.3904
2	Zoo	0.0802	0.0647	0.0859	0.0757	0.0677	0.0941	0.1596	0.0868
3	Lymphography	0.2446	0.2684	0.3031	0.2706	0.2478	0.2759	0.3020	0.2589
4	Teaching-ae	0.4789	0.4825	0.4804	0.4825	0.4670	0.4591	0.5943	0.4594
5	Wine	0.0926	0.1414	0.1211	0.1434	0.0976	0.1821	0.6089	0.1171
6	Autos	0.2714	0.2361	0.2323	0.2377	0.2324	0.2568	0.4322	0.2147
7	Glass-id	0.3540	0.3332	0.3395	0.3319	0.3439	0.3716	0.4025	0.3216
8	Audio	0.1254	0.1453	0.1528	0.1455	0.1194	0.1260	0.2138	0.1205
9	Heart	0.3651	0.3771	0.3949	0.3762	0.3569	0.4924	0.6526	0.3681
10	Hungarian	0.3667	0.3429	0.3552	0.3401	0.3476	0.4791	0.6005	0.3457
11	Heart-disease-c	0.3743	0.3775	0.3963	0.3767	0.3659	0.4924	0.6575	0.3590
12	Primary-tumor	0.1787	0.1814	0.1864	0.1815	0.1851	0.2243	0.2257	0.1799
13	Ionosphere	0.3157	0.2615	0.2714	0.2611	0.2506	0.3686	0.2560	0.2486
14	Dermatology	0.0631	0.0851	0.1206	0.0857	0.0692	0.1339	0.2376	0.0838
15	Horse-colic	0.4209	0.4205	0.4348	0.4217	0.4015	0.4706	0.5805	0.3899
16	House-votes-84	0.2997	0.2181	0.1969	0.2182	0.1994	0.2440	0.2144	0.2046
17	Cylinder-bands	0.4291	0.4358	0.4431	0.4469	0.4080	0.5045	0.4830	0.3786
18	Balance-scale	0.3260	0.3203	0.3177	0.3202	0.3199	0.2800	0.2613	0.3149
19	Credit-a	0.3350	0.3415	0.3480	0.3433	0.3271	0.4334	0.6670	0.3250
20	Breast-cancer-w	0.1570	0.1928	0.2497	0.1964	0.1848	0.1939	0.1649	0.1998
21	Pima-ind-diabetes	0.4147	0.4059	0.4074	0.4059	0.4078	0.5453	0.5907	0.4099
22	Tic-tac-toe	0.4309	0.4023	0.3772	0.4023	0.3995	0.2315	0.3495	0.2526
23	Contraceptive-mc	0.4506	0.4391	0.4485	0.4392	0.4398	0.5990	0.5486	0.4391
24	Car	0.2252	0.1617	0.1379	0.1616	0.2005	0.1953	0.2013	0.1541
25	Mfeat-mor	0.2086	0.1940	0.1974	0.1944	0.1985	0.2580	0.3592	0.1971
26	Segment	0.1398	0.0967	0.1034	0.0982	0.0879	0.0902	0.3146	0.0921
27	Hypothyroid	0.1138	0.0955	0.0937	0.0951	0.1036	0.1705	0.2043	0.0956
28	Kr-vs-kp	0.3022	0.2358	0.1869	0.2358	0.2638	0.1946	0.2470	0.2430
29	Hypo	0.0766	0.0738	0.0671	0.0734	0.0650	0.2081	0.1923	0.0627
30	Sick	0.1700	0.1434	0.1382	0.1434	0.1572	0.1953	0.2469	0.1425
31	Phoneme	0.0880	0.0902	0.0784	0.0902	0.0885	0.0915	0.1276	0.0846
32	Wall-following	0.2177	0.1586	0.1363	0.1588	0.1292	0.2430	0.2204	0.1192
33	Page-blocks	0.1450	0.1187	0.1128	0.1189	0.1021	0.1257	0.1865	0.1022
34	Mushrooms	0.1229	0.0083	0.0001	0.0082	0.0109	0.0000	0.0444	0.0195
35	Sign	0.3984	0.3505	0.3334	0.3504	0.3524	0.2892	0.4671	0.3363
36	Nursery	0.1766	0.1385	0.1121	0.1385	0.1571	0.1466	0.0988	0.1258
37	Magic	0.3974	0.3461	0.3470	0.3461	0.3541	0.4366	0.5841	0.3412
38	Shuttle	0.0298	0.0182	0.0146	0.0180	0.0126	0.0137	0.0689	0.0137
39	Waveform	0.1176	0.0951	0.1145	0.0951	0.0860	0.1641	0.1345	0.0894
40	Localization	0.2390	0.2095	0.1960	0.2095	0.2095	0.2012	0.2766	0.1881

References

1. Yang, Liu and Liu, An autonomy-oriented computing approach to community mining in distributed and dynamic networks, Autonomous Agents and Multi-Agent Systems 20(2) (2010), 123–157.

2. Dong, Zhang, Ren and Li, Classifier learning algorithm based on genetic algorithms, International Journal of Innovative Computing Information and Control 6(4) (2010), 1973–1981.

3. Wang, Liu and Jiao, Modeling generation of the router-level topology of an ISP network, Computing 90(1–2) (2010), 73–88.

4. Yan, Tan, Min and Tsang, Online heterogeneous transfer by hedge ensemble of offline and online decisions, IEEE Transactions on Neural Networks and Learning Systems 29(7) (2018), 3252–3263.

5. Zhou, Tan, Yan and Hao, Online transfer learning with multiple homogeneous or heterogeneous sources, IEEE Transactions on Knowledge and Data Engineering 29(7) (2017), 1494–1507.

6. Tan, Song, Chen and Ng, ML-Forest: a multi-label tree ensemble method for multi-label classification, IEEE Transactions on Knowledge and Data Engineering 28(10) (2016), 2665–2680.

7. Zhang and Chen, Data-intensive applications, challenges, techniques and technologies: a survey on big data, Information Sciences 275(11) (2014), 314–347.

8. Vandal, Kodra, Ganguly, Nemani and Ganguly, Quantifying Uncertainty in Discrete-Continuous and Skewed Data with Bayesian Deep Learning, in: 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2018, pp. 2377–2386.

9. Sun, Liu, Chen, Han and Wang, Feature selection using dynamic weights for classification, Knowledge-Based Systems 37 (2013), 541–549.

10. Concha and Pedro, Discrete bayesian network classifiers: a survey, ACM Computing Surveys 47 (2014), 1–43.

11. Liu, Huang, Chen and Jia, A search problem in complex diagnostic bayesian networks, Knowledge-Based Systems 30 (2012), 95–103.

12. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann: Burlington, 1988.

13. Yang and Moody, Data visualization and feature selection: new algorithms for nongaussian data, in: Advances in Neural Information Processing Systems, 2000, pp. 687–693.

14. Vidal-Naquet and Ullman, Object recognition with informative features and linear classification, in: 9th IEEE International Conference on Computer Vision, 2003, pp. 281–288.

15. Fleuret, Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5 (2004), 1531–1555.

16. Estévez, Tesmer, Perez and Zurada, Normalized mutual information feature selection, IEEE Transactions on Neural Networks 20(2) (2009), 189–201.

17. Zaidi, Cerquides, Carman and Webb G.I., Alleviating naive bayes attribute independence assumption by attribute weighting, Journal of Machine Learning Research 14(1) (2013), 1947–1988.

18. Nigam, McCallum, Thrun and Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39(2–3) (2000), 103–134.

19. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 200–209.

20. Belkin, Niyogi and Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399–2434.

21. Liu, Wang and Chang, Robust and scalable graph-based semisupervised learning, Proceedings of the IEEE 100(9) (2012), 2624–2638.

22. Zhou and Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17(11) (2005), 1529–1541.

23. Jiang, Wang and Zhang, Structure extended multinomial naive bayes, Information Sciences 329 (2016), 346–356.

24. Friedman, Geiger and Goldszmidt, Bayesian network classifiers, Machine Learning 29(2) (1997), 131–163.

25. Sahami, Learning Limited Dependence Bayesian Classifiers, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 335–338.

26. Blanco, Inza, Merino, Quiroga and Larrañaga, Feature selection in bayesian classifiers for the prognosis of survival of cirrhotic patients treated with TIPS, Journal of Biomedical Informatics 38(5) (2005), 376–388.

27. Xiao and Jiang, Structure identification of bayesian classifiers based on GMDH, Knowledge-Based Systems 22 (2009), 461–470.

28. Bouckaert R.R., Voting massive collections of bayesian network classifiers for data streams, in: 19th Australian Joint Conference on Artificial Intelligence, Vol. 4304, 2006, pp. 243–252.

29. Rubio and Gamez J.A., Flexible learning of k-dependence bayesian network classifiers, in: 13th Annual Genetic and Evolutionary Computation Conference, 2011, pp. 1219–1226.

30. Jiang, Cai, Wang and Zhang, Improving tree augmented naive bayes for class probability estimation, Knowledge-Based Systems 26 (2012), 239–245.

31. Webb G.I., Boughton and Wang, Not so naive bayes: aggregating one-dependence estimators, Machine Learning 58(1) (2005), 5–24.

32. Zheng, Webb G.I., Suraweera and Zhu, Subsumption resolution: an efficient and effective technique for semi-naive bayesian learning, Machine Learning 87(1) (2012), 93–125.

33. Bache and Lichman, UCI Machine Learning Repository, Available online: https://archive.ics.uci.edu/ml/datasets.html.

34. Fayyad U.M. and Irani K.B., Multi-interval Discretization of Continuous-Valued Attributes for Classification Learning, in: 13th International Joint Conference on Artificial Intelligence, 1993, pp. 1022–1029.

35. Kohavi and Wolpert D.H., Bias plus variance decomposition for zero-one loss functions, in: Thirteenth International Conference on Machine Learning, 1996, pp. 275–283.

36. Duan and Wang, K-dependence bayesian classifier ensemble, Entropy 19(12) (2017), 651–671.

37. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7(1) (2006), 1–30.

38. Nemenyi, Distribution-Free Multiple Comparisons, Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963.

¹ Capital letters such as $X_{i}$ and $Y$ denote attributes or variables, and lower-case letters such as $x_{i}$ and $y$ denote specific values taken by those attributes.