From undirected dependence to directed causality: A novel Bayesian learning approach

Abstract

The Bayesian network (BN) is one of the most powerful probabilistic models for uncertain knowledge representation and reasoning. Over the past decade, numerous approaches have been proposed to build a directed acyclic graph (DAG) as the structural specification of a BN. However, for most Bayesian network classifiers (BNCs), the directed edges in the DAG represent assertions of conditional independence rather than causal relationships, even though the learned joint probability distributions may fit the data well; such classifiers therefore cannot be applied to causal reasoning. In this paper, conditional entropy is introduced to measure causal uncertainty because of its asymmetry, and a heuristic search strategy is applied to build a Bayesian causal tree (BCT) by identifying significant causalities. The resulting highly scalable topology represents causal relationships in terms of causal science, and the corresponding joint probability fits the training data in terms of data science. An ensemble learning strategy is then applied to build a Bayesian causal forest (BCF) from a set of BCTs, each taking a different attribute as the root node to represent the root cause for causality analysis. Extensive experiments on 32 public datasets from the UCI machine learning repository show that BCF achieves outstanding classification performance compared to state-of-the-art single-model BNCs (e.g., CFWNB), ensemble BNCs (e.g., WATAN, IWAODE, WAODE-MI and TAODE) and non-Bayesian learners (e.g., SVM, k-NN, LR).

Keywords

Bayesian networks, conditional entropy, causal relationships, Bayesian causal forest

1. Introduction

Training a classifier is one of the great challenges in data mining and machine learning [1]. Several state-of-the-art supervised learning algorithms (e.g., Decision Tree [2, 3], Bayesian network (BN) [4, 5], Support Vector Machine [6] and Neural Network [7]) have been introduced for learning from data. When applied to classification problems, the Bayesian network classifier (BNC) is much more powerful than other algorithms due to its knowledge expressivity [8, 9, 10]. The network topology of a BNC and the corresponding joint probability describe the statistical knowledge qualitatively and quantitatively, respectively. One of the most exciting prospects in recent years has been the possibility of using BNCs to discover causal structures implicated in training data [11, 12, 13, 14], a task previously considered impossible without controlled experiments.

Learning a BNC from data includes structure learning and parameter learning. Naive Bayes (NB) [15] has received considerable attention from researchers due to its extreme simplicity and superior performance. The network structure of NB is commonly used as the basic framework of restricted BNCs, including state-of-the-art single-model BNCs (e.g., tree augmented naive Bayes (TAN) [16]) and ensemble BNCs (e.g., averaged one-dependence estimators (AODE) [17]). To improve generalization performance and fully cover the dependency relationships implicated in training data, ensemble BNCs focus on the diversity among sub-classifiers and distribute the errors of base BNCs across different parts of the instance space. Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint probability distribution of every possible event defined by the values of all the variables. BNCs achieve compactness by factoring the joint probability distribution into conditional probabilities for each variable given its parents. After the network structure has been learned, weighting can be applied to further improve the estimate of the conditional probability for a single-model BNC [18] or of the joint probability for each sub-classifier in an ensemble BNC [19, 20, 21, 22].

Judea Pearl, the Turing Award winner recognized for his work in artificial intelligence, especially the invention of Bayesian networks, proposed a shift from the Big Data Revolution to the Causal Science Revolution in his 2020 keynote speech “The New Science of Cause and Effect with reflections on data science and artificial intelligence”. Since our intuitive understanding is usually framed in terms of how one variable influences another [23], it can be more helpful to reason in terms of causality in different situations, such as the study of causal dependence between time series [24, 25], analysis of the direction and frequency content of the brain activity flow [26] and identification of nonlinear input-state-output systems [27], among others [28, 29, 30]. Pearl [31] takes the directed acyclic graph (DAG) as the structural specification of causal Bayesian networks, in which the directed edges represent causal relationships pointing from cause attributes to effect attributes. However, when information theory is applied to learn the network structure of a BNC from data and information-theoretic metrics (e.g., mutual information and conditional mutual information) are introduced to measure mutual dependence or conditional dependence, the arrows in the diagram represent the flow of information rather than causal connections. The learned acausal BNC represents assertions of conditional independence, whereas probabilistic independencies in themselves do not imply a causal structure, and vice versa.

The contributions of this paper are introduced as follows:

•
We prove the reasonableness of using conditional entropy to measure the causality between attributes. Then heuristic search strategy is applied to build Bayesian causal tree (BCT) by identifying significant causalities. The resulting highly scalable topology can represent causal relationship in terms of causal science, and corresponding joint probability can fit training data in terms of data science.
•
The interactions among variables in a causal network are asymmetric, and this asymmetry leads to a causal ordering of the variables, which implicitly assumes that the first variable in the order is the only cause variable that no other variable influences. To address this issue, we apply ensemble learning to build a Bayesian causal forest (BCF) from a set of BCTs, each taking a different attribute as the root node to represent the root cause for causality analysis.
•
We compare the classification performance of BCF with five variations of the state-of-the-art NB, TAN and AODE, including single-model BNCs (e.g., CFWNB [18]), ensemble BNCs (e.g., WATAN [20], IWAODE [21], WAODE-MI [22], TAODE [32]) and non-Bayesian learners (e.g., SVM [6], k-NN [33], LR [34]). The experimental results show that BCF demonstrates competitive classification performance in terms of zero-one loss, RMSE, F1-Score and AUC on 32 datasets with 4 to 42 attributes and 57 to 299,285 instances.

The rest of this paper is organized as follows. Section 2 reviews some information-theoretic metrics and clarifies the difference between undirected dependence and directed causality for BNC learning. Section 3 describes in detail the learning procedure of our algorithm, BCF. Section 4 presents the extensive experimental results and compares the performance of BCF with eight other algorithms. Finally, Section 5 presents the conclusions and puts forward directions for future work.

2. Related work

2.1 Definitions and notions

A BNC is a probabilistic model designed for classification, which consists of two parts: (1) the network structure $G$ in the form of a DAG, which describes the (in)dependency relationships among variables $\{X_{1},\cdots X_{n},C\}$ , and (2) the parameters $\Theta$ , which quantify the (in)dependencies within the structure and comprise a set of parameters $\theta_{x_{i}|\pi_{i}}=P(x_{i}|\pi_{i})$ for each variable $X_{i}$ given its parents $\Pi_{i}$ in $G$ , where $x_{i}$ denotes the values of $X_{i}$ and $\pi_{i}$ denotes the values of $\Pi_{i}$ . If every directed edge points from a cause attribute $X_{c}$ to an effect attribute $X_{e}$ , then the DAG is a causal network.

When given a testing instance $\textbf{x}=\{x_{1},\cdots,x_{n}\}$ , classification can be performed by applying Bayes theorem to compute the maximum a-posteriori (MAP) as follows:

$\displaystyle c^{*}=\arg\max_{c\in C}P(c|\textbf{x})=\arg\max_{c\in C}\frac{P(\textbf{x},c)}{P(\textbf{x})}=\arg\max_{c\in C}P(\textbf{x},c).$ (1)

For the joint probability distribution described in Eq. (1), restricted BNC $\mathcal{B}$ takes class variable $C$ as the common parent and assumes that each attribute can have only a limited number of attributes as its parents. Then corresponding joint probability is factorized according to the chain rule as follows:

$\displaystyle P_{\mathcal{B}}(\textbf{x},c)=P_{\mathcal{B}}(c)\prod_{i=1}^{n}{P_{\mathcal{B}}(x_{i}|\pi_{i},c)}.$ (2)

Each factor in Eq. (2), i.e., $P(x_{i}|\pi_{i},c)$ , is the estimate of conditional probability derived from the frequency of training instances.
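As a minimal sketch of Eqs. (1) and (2), the snippet below scores each class by $\log P(c)+\sum_{i}\log P(x_{i}|\pi_{i},c)$ and returns the maximizer. The function name `map_classify`, the table layout (`prior`, `cpts`, `parents`) and all probability values are hypothetical and chosen only for illustration; summing in log space is an implementation choice, not something the paper prescribes.

```python
import math

def map_classify(x, classes, prior, cpts, parents):
    """Return argmax_c of log P(c) + sum_i log P(x_i | pi_i, c), as in Eqs. (1)-(2)."""
    best_c, best_score = None, -math.inf
    for c in classes:
        score = math.log(prior[c])
        for i, xi in enumerate(x):
            p = parents[i]                          # parent attribute index, or None
            pi = x[p] if p is not None else None    # value taken by that parent
            score += math.log(cpts[i][(xi, pi, c)])
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Hypothetical two-attribute model: X0 has only C as parent, X1 has parents {X0, C}.
classes = ["a", "b"]
prior = {"a": 0.6, "b": 0.4}
parents = [None, 0]
cpts = [
    {(0, None, "a"): 0.7, (1, None, "a"): 0.3,
     (0, None, "b"): 0.2, (1, None, "b"): 0.8},
    {(0, 0, "a"): 0.9, (1, 0, "a"): 0.1, (0, 1, "a"): 0.5, (1, 1, "a"): 0.5,
     (0, 0, "b"): 0.4, (1, 0, "b"): 0.6, (0, 1, "b"): 0.3, (1, 1, "b"): 0.7},
]

print(map_classify((1, 1), classes, prior, cpts, parents))
print(map_classify((0, 0), classes, prior, cpts, parents))
```

In this toy model the instance (1, 1) has joint estimates 0.09 for class a and 0.224 for class b, so b is returned. A zero entry in any table would make the logarithm undefined, which is one reason probability estimates are usually smoothed.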

Information theory is a science that studies the law of information measurement, transmission and transformation with mathematical statistics. In this paper, entropy $H(X)$ and conditional entropy $H(X|C)$ are introduced to respectively measure uncertainty and conditional uncertainty.

Entropy $H(X)$ [35] is an information-theoretic function that measures the uncertainty of a discrete random variable $X$ with probability distribution $P(x)$ , and is defined as:

$\displaystyle H(X)=-\sum_{X}P(x)\log{P(x)}.$ (3)

Conditional entropy $H(X_{i}|X_{j})$ [35] measures the conditional uncertainty of variable $X_{i}$ given random variable $X_{j}$ , and is defined as:

$\displaystyle H(X_{i}|X_{j})=-\sum_{X_{i}}\sum_{X_{j}}P(x_{i},x_{j})\log P(x_{i}|x_{j}).$ (4)

Conditional mutual information (CMI) $I(X_{i};X_{j}|C)$ [35] measures the conditional dependence between $X_{i}$ and $X_{j}$ given all possible values of variable $C$ , and is defined as:

$\displaystyle I(X_{i};X_{j}|C)=H(X_{i}|C)-H(X_{i}|X_{j},C)=\sum_{X_{i}}\sum_{X_{j}}\sum_{C}P(x_{i},x_{j},c)\log\frac{P(x_{i},x_{j}|c)}{P(x_{i}|c)P(x_{j}|c)}.$ (5)
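The definitions in Eqs. (3) to (5) can be checked numerically. The sketch below computes conditional entropy from a small joint table and derives CMI as $H(X_{i}|C)-H(X_{i}|X_{j},C)$ . The joint distribution is hypothetical, the helper names `marginal` and `cond_entropy` are illustrative, and base-2 logarithms are an assumption (the text does not fix the base).

```python
import math
from collections import defaultdict

def marginal(joint, axes):
    """Sum a joint table {(x_i, x_j, c): p} down to the listed axes."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[a] for a in axes)] += p
    return out

def cond_entropy(joint, target, given=()):
    """H(target | given) in bits; with given=() this is the plain entropy of Eq. (3)."""
    num = marginal(joint, target + given)
    den = marginal(joint, given)
    h = 0.0
    for key, p in num.items():
        if p > 0:
            h -= p * math.log2(p / den[key[len(target):]])
    return h

# Hypothetical joint distribution P(x_i, x_j, c); axes are 0: X_i, 1: X_j, 2: C.
joint = {
    (0, 0, 0): 0.20, (0, 1, 0): 0.05, (1, 1, 0): 0.05, (1, 2, 0): 0.20,
    (0, 0, 1): 0.10, (1, 0, 1): 0.10, (0, 2, 1): 0.15, (1, 2, 1): 0.15,
}

h_i_given_jc = cond_entropy(joint, (0,), (1, 2))   # H(X_i | X_j, C)
h_j_given_ic = cond_entropy(joint, (1,), (0, 2))   # H(X_j | X_i, C)
cmi_ij = cond_entropy(joint, (0,), (2,)) - h_i_given_jc
cmi_ji = cond_entropy(joint, (1,), (2,)) - h_j_given_ic

print(round(h_i_given_jc, 4), round(h_j_given_ic, 4))
print(round(cmi_ij, 4), round(cmi_ji, 4))
```

For this table the two conditional entropies differ (0.6 bits versus about 0.846 bits), while the CMI is 0.4 bits whichever attribute is taken as the target: CMI is symmetric in $X_{i}$ and $X_{j}$ , conditional entropy is not.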

Given an instance $\textbf{d}_{i}=\{\textbf{x},c\}$ from the training dataset $\mathcal{D}$ , the negative log likelihood $-\log P_{\mathcal{B}}(\textbf{d}_{i})$ measures the number of bits needed to encode $\textbf{d}_{i}$ using the joint probability distribution of the restricted BNC $\mathcal{B}$ . The entropy function $H_{\mathcal{B}}(X,C)$ then measures the extent to which the learned BNC fits the training data $\mathcal{D}$ and can be defined as [36]

$\displaystyle H_{\mathcal{B}}(X,C)=-\sum_{X,C}P(\textbf{x},c)\log P_{\mathcal{B}}(\textbf{x},c)=-\sum_{X,C}P(\textbf{x},c)\log\left\{P(c)\prod_{i=1}^{n}P(x_{i}|\pi_{i}^{\mathcal{B}},c)\right\}=-\sum_{C}P(c)\log P(c)-\sum_{i=1}^{n}\sum_{C,X_{i},\Pi_{i}^{\mathcal{B}}}P(x_{i},\pi_{i}^{\mathcal{B}},c)\log P(x_{i}|\pi_{i}^{\mathcal{B}},c)=H(C)+\sum_{i=1}^{n}H(X_{i}|\Pi_{i}^{\mathcal{B}},C),$ (6)

where $\Pi_{i}^{\mathcal{B}}$ denotes the set of parents of $X_{i}$ in $\mathcal{B}$ and $\pi_{i}^{\mathcal{B}}$ denotes the values of $\Pi_{i}^{\mathcal{B}}$ . Equation (6) indicates that finding a learned BNC $\mathcal{B}$ that can fit the training data $\mathcal{D}$ to the maximum is equivalent to finding a network structure that can minimize the value of entropy function $H_{\mathcal{B}}(X,C)$ .

2.2 Bayesian network classifiers

In the following discussion, we will clarify the basic ideas of different BNCs from the perspective of dependence analysis and causality analysis.

2.2.1 Single-model BNC

Among all the BNCs, NB has the simplest topology due to its assumption that each attribute is independent of the other attributes given the class variable $C$ [37]. NB therefore represents neither conditional dependence nor causal relationships between attributes. According to the chain rule and the independence assumption, the joint probability distribution for NB can be factorized as

$\displaystyle P_{\rm{NB}}(\textbf{x},\ c)=P(c)\prod_{i=1}^{n}{P(x_{i}|c)}.$ (7)

NB does not need to learn the network structure, so only the estimates of the probabilities $P(c)$ and $P(x_{i}|c)$ are necessary to compute the joint probability. However, the attribute independence assumption of NB rarely holds in real situations, and adding directed edges between attributes is an effective way to improve NB. Because the network structure of a BNC must be a DAG, its edges must be directed and acyclic. Given the attribute order $\{X_{1},\cdots,X_{n}\}$ , attribute $X_{i}$ takes a subset of $\{X_{1},\cdots,X_{i-1}\}$ as its parents $\Pi_{i}$ . Similarly, for a causal BNC the cause and effect attributes of each directed edge are determined by the attribute order. To obtain such an order, researchers have proposed pre-ordering and post-ordering strategies to sort the attributes.

Figure 1.

Example of (a) KDB with $k=2$ , (b) TAN with $X_{1}$ as the root node and (c) TAN with $X_{2}$ as the root node.

The $k$ -dependence Bayesian classifier (KDB) [38, 39] applies pre-ordering to determine which attribute is the candidate parent or cause. KDB first sorts the attributes in descending order of $I(X_{i};C)$ , which measures the mutual dependence between attribute $X_{i}$ and class variable $C$ , and then extends the network structure of NB by using the parameter $k$ to limit the number of parents of each attribute. As shown in Fig. 1a, for the $i$ -th attribute in the order, at most $k$ of the first $i-1$ attributes are selected as its parents $\Pi_{i}^{\textit{KDB}}$ . Because $I(X_{i};C)$ involves only one attribute and the class, it cannot measure the directed causality between two attributes $X_{i}$ and $X_{j}$ . If the network topology of a BNC is used to represent information flow, the BNC needs to fully identify the significant conditional dependencies among attributes. Tree augmented naive Bayes (TAN) extends NB by building a one-dependence maximum weighted spanning tree without any prior attribute order [16]. In the topology of TAN, each attribute node $X_{i}$ has at most one other attribute node $\Pi_{i}^{\textit{TAN}}$ as its parent. To convert the undirected spanning tree into a directed one, TAN randomly selects one attribute (e.g., $X_{1}$ or $X_{2}$ ) as the root node and makes all the edges point outward from it. The attribute order is then determined accordingly and, as shown in Fig. 1b and c, this post-ordering strategy admits either $X_{1}\rightarrow X_{3}$ or $X_{3}\rightarrow X_{1}$ as the directed causality, depending on the root node.

Roughly speaking, in a causal BN each non-root node is the direct effect of its parents, and the directed edges are considered to represent spontaneous causality from the cause variables to the effect variables. However, a directed edge in a BN only depicts statistical dependence between variables and does not necessarily correspond to causality. $I(X_{j};X_{i}|C)$ quantifies the statistical correlation between $X_{i}$ and $X_{j}$ , and can be used to measure conditional dependence for BNC learning in the context of information flow. The directed edge $X_{i}\rightarrow X_{j}$ in KDB or TAN describes only the conditional dependence between $X_{i}$ and $X_{j}$ , not the causal relationship in which $X_{i}$ is the cause and $X_{j}$ is the effect. The learned conditional dependencies thus serve to make the corresponding probability distributions achieve “curve fitting” or “data fitting”. Because of the restrictions on structure complexity and computational complexity, some researchers have proposed improving the estimates of the conditional or joint probability by weighting. To make the attribute independence assumption reasonable, Hidden Naive Bayes (HNB) [40] provides a weighted variation of NB. By combining the influences of all other attributes, HNB assumes a hidden parent for each attribute $X_{i}$ and uses $I(X_{i};X_{j}|C)$ as the weight $W_{ij}$ to represent the importance of attribute $X_{j}$ . Correlation-based feature weighting filter for NB (CFWNB) [18] improves NB by attribute weighting: it assigns a weight $W_{i}$ to each attribute $X_{i}$ by computing the attribute-class correlation and the attribute-attribute inter-correlation. The estimate of the conditional probability $P(x_{i}|\pi_{i},c)$ for $X_{i}$ is therefore replaced by $P(x_{i}|c)^{W_{i}}$ .

2.2.2 Ensemble BNC

Ensemble learning is widely used to deal with classification problems and can greatly improve the classification accuracy of base classifiers. Its underlying idea is to combine multiple weak supervised learners to obtain a better and more comprehensive strong supervised learner. Even if one weak classifier gets the wrong prediction, other weak classifiers can correct the error. AODE is an ensemble of superparent one-dependence estimators (SPODEs). Figure 2 shows an example of SPODE, which chooses $X_{i}$ as the superparent and the pair of $\{X_{i},C\}$ will be considered as the root node. Correspondingly, the estimate of joint probability is

$\displaystyle P^{i}_{\mathrm{SPODE}}(\textbf{x},c)=P(x_{i},c)\prod_{j=1,j\neq i}^{n}P(x_{j}|x_{i},c).$ (8)

Comparing Eqs. (7) and (8) shows that the independence assumption of SPODE can be regarded as an explicit variation of that of NB. The SPODEs in AODE take each attribute in turn as the superparent that points to the rest of the attributes [41], so the directed edges $X_{i}\rightarrow X_{j}$ and $X_{j}\rightarrow X_{i}$ appear in the $i$ -th and $j$ -th SPODEs, respectively. Thus, no matter whether causality $X_{i}\rightarrow X_{j}$ or $X_{j}\rightarrow X_{i}$ holds, the right causality appears in AODE. That may be one of the reasons why AODE generally performs better than single-model BNCs. On this basis, the learning strategies applied to improve the estimate of conditional probabilities are also applicable to improving the estimate of the joint probability. WAODE-MI, proposed by Jiang et al. [22], improves AODE by using the mutual information $I(X_{i};C)$ between the superparent $X_{i}$ and the class variable $C$ as the weight of the $i$ -th SPODE. It can significantly improve the classification performance of AODE with minimal computing overhead. AODE-SR, proposed by Zheng et al. [42], can efficiently identify occurrences of the specialization-generalization relationship and eliminate generalizations at classification time by using subsumption resolution (SR).
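To make Eq. (8) concrete, the following sketch estimates each SPODE's joint probability from raw frequencies and averages the estimates over all superparents, as AODE does. The function names, the tiny dataset and the absence of smoothing are all assumptions made for illustration; practical implementations smooth the counts and may skip superparents with too few supporting instances.

```python
def spode_joint(data, x, c, i):
    """Frequency estimate of P(x_i, c) * prod_{j != i} P(x_j | x_i, c), as in Eq. (8)."""
    # Training rows that agree with the superparent value x_i and the class c.
    parent_rows = [attrs for attrs, label in data if attrs[i] == x[i] and label == c]
    if not parent_rows:
        return 0.0
    p = len(parent_rows) / len(data)                 # P(x_i, c)
    for j in range(len(x)):
        if j != i:
            match = sum(1 for attrs in parent_rows if attrs[j] == x[j])
            p *= match / len(parent_rows)            # P(x_j | x_i, c)
    return p

def aode_predict(data, x, classes):
    """Average the SPODE estimates over every superparent and pick the best class."""
    n = len(x)
    scores = {c: sum(spode_joint(data, x, c, i) for i in range(n)) / n for c in classes}
    return max(scores, key=scores.get), scores

# Hypothetical training set: (attribute values, class label).
data = [
    ((0, 0, 1), "a"), ((0, 0, 0), "a"), ((0, 1, 1), "a"), ((1, 0, 1), "a"),
    ((1, 1, 0), "b"), ((1, 1, 1), "b"), ((1, 0, 0), "b"), ((0, 1, 0), "b"),
]

label, scores = aode_predict(data, (0, 0, 1), ["a", "b"])
print(label, scores)
```

Here every SPODE assigns 1/6 to class a and 0 to class b for the query (0, 0, 1). The zero estimates for class b come from unseen value combinations, which illustrates why unsmoothed frequencies are fragile on small samples.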

Figure 2.

The topology of SPODE.

Similar to the learning strategy of AODE, ATAN [20] also selects each attribute in turn as the root node to build several different directed maximum weighted spanning trees. The directed edges $X_{i}\rightarrow X_{j}$ and $X_{j}\rightarrow X_{i}$ appear in different spanning trees, in which $X_{i}$ or $X_{j}$ is the root node. For AODE, the reasonableness of causality $X_{i}\rightarrow X_{j}$ for any attribute pair $\{X_{i},X_{j}\}$ is not verified: the edge simply appears in the $i$ -th SPODE with $X_{i}$ as the superparent, so the confidence levels of all causalities represented by any SPODE are implicitly the same. In contrast, ATAN makes no independence assumption, and the causality $X_{i}\rightarrow X_{j}$ holds only for some attribute pairs. The same causality may appear several times in ATAN, so different causalities implicitly carry different degrees of importance. To make the classification advantage of ATAN more pronounced, weighted ATAN (WATAN) [20] takes $I(X_{i};C)$ between the root attribute $X_{i}$ and class variable $C$ as the weight of each learned TAN classifier.

3. Bayesian causal forest

3.1 Bayesian causal tree

Information theory is widely used to identify information flow in complex systems. For BNC learning, the conditional mutual information (CMI) $I(X_{i};X_{j}|C)$ is often applied to measure the conditional dependence between attributes $X_{i}$ and $X_{j}$ in order to quantify the amount of information implicated in the network topology of restricted BNC [36]. However, it is inappropriate to identify $X_{i}$ or $X_{j}$ as the cause due to the symmetry characteristic of CMI. In contrast to $I(X_{i};X_{j}|C)$ , conditional entropy $H(X_{i}|X_{j},C)$ quantifies statistical causation from $X_{j}$ to $X_{i}$ due to its asymmetric characteristic. Figure 3a and b respectively show the symmetric distribution of $I(X_{i};X_{j}|C)$ and asymmetric distribution of $H(X_{i}|X_{j},C)$ learned from dataset Vowel (see detail in Table 3).

Figure 3.

The comparison between the distribution of $I(X_{i};X_{j}|C)$ and that of $H(X_{i}|X_{j},C)$ on dataset Vowel.

If directed edge $\Pi_{i}\rightarrow X_{i}$ in the topology of BNC represents causality from cause $\Pi_{i}$ to effect $X_{i}$ , i.e., $\Pi_{i}$ reduces the uncertainty of $X_{i}$ to the minimum, then the value of $H(X_{i}|\Pi_{i})$ will be minimized or the estimate of $P(x_{i}|\pi_{i})$ will be maximized. Suppose that the Bayesian causal tree starts from the root attribute $X_{r}$ and the estimate of the joint probability distribution over all the $n$ attributes $\{X_{1},\cdots,X_{n}\}$ is shown as follows,

$\displaystyle P(\textbf{x},c)=P(c)P(x_{r}|c)\prod_{i=1,i\neq r}^{n}{P(x_{i}|\pi_{i},c)}.$ (9)

Each factor $P(x_{i}|\pi_{i},c)$ in Eq. (9) denotes the conditional probability corresponding to the causality $\{\Pi_{i},C\}\rightarrow X_{i}$ . Based on Eq. (6), the joint entropy function corresponding to Eq. (9) is

$\displaystyle H(X,C)=H(C)+H(X_{r}|C)+\sum_{i=1,i\neq r}^{n}H(X_{i}|\Pi_{i},C)=H(X_{r},C)+\sum_{i=1,i\neq r}^{n}H(X_{i}|\Pi_{i},C).$ (10)

Suppose that the attributes have been sorted implicitly or explicitly, and the order is $\{X_{1},X_{2},\cdots,X_{n}\}$ . For the network topology of the BNC to be a DAG, a directed edge should point from $X_{i}$ to $X_{j}$ only when $i<j$ holds. To minimize the value of the joint entropy function $H(X,C)$ , one effective and feasible approach is to identify $\Pi_{i}$ for each $X_{i}$ in turn, such that the information introduced by $\Pi_{i}$ helps reduce the uncertainty of $X_{i}$ . In this paper, the network topology of BCT is restricted to be a one-dependence DAG, i.e., $X_{i}$ can take only one attribute from $\{X_{1},X_{2},\cdots,X_{i-1}\}$ as $\Pi_{i}$ . That is,

$\displaystyle\Pi_{i}=\arg\min_{X_{j}\in\{X_{1},X_{2},\cdots,X_{i-1}\}}H(X_{i}|X_{j},C).$ (11)

Then conditional entropy $H(X_{i}|\Pi_{i},C)$ is introduced to identify significant causality between cause attribute $\Pi_{i}$ and effect attribute $X_{i}$ , and heuristic search strategy is used to build Bayesian causal tree for fully representing causal relationships implicated in training data. In the topology of Bayesian causal tree, directed edges represent causal relationships pointing from causes to effects.

Suppose that node $X_{r}$ is the first attribute in the order. Then $H(X_{i}|X_{r},C)$ $(i\neq r)$ is computed and compared to select the child node $X_{s}$ with the minimum value of $H(X_{s}|X_{r},C)$ . The directed edge $\{X_{r},C\}\rightarrow X_{s}$ represents the causality, and $X_{s}$ is regarded as the second attribute in the order. The same procedure repeats until all the attributes have been added to the order. Finally, we obtain a set of causalities that correspond to the directed edges in the causal tree pointing to every attribute except $X_{r}$ . Thus the procedure of minimizing the conditional entropy function to sort attributes is also the procedure of building the topology of the Bayesian causal tree (BCT). The learning procedure of BCT is depicted by Algorithm 3 as follows.

Algorithm 3: BCT-Learning( $r$ )
Input: Training dataset $\mathcal{D}$ with attribute set $\textbf{X}=\{X_{1},\cdots,X_{n}\}$ and class $C$ .
Output: The topology of BCT.
Let $\mathcal{T}=(\mathcal{U},\mathcal{V})$ be a causal tree, in which $\mathcal{U}$ represents the node set and $\mathcal{V}$ represents the directed edge set.
$\mathcal{U}=\{C\}$ , $\mathcal{V}=$ Ø.
Compute $H(X_{i}|X_{j},C)$ for each pair of attributes $\{X_{i},X_{j}\}$ $(i\neq j)$ .
Select the attribute $X_{r}$ from X as the root attribute.
$\mathcal{U}=\mathcal{U}\cup\{{X}_{r}\}$ , $\textbf{X}=\textbf{X}\backslash\{{X}_{r}\}$ , $\mathcal{V}=\mathcal{V}\cup\{C\to{X}_{r}\}$ .
while $\textbf{X}\neq$ Ø do
    Select the attribute $X_{i}\notin\mathcal{U}$ corresponding to the minimum value of $H(X_{i}|\Pi_{i},C)$ (where $\Pi_{i}\subseteq\mathcal{U}$ and $|\Pi_{i}|=1$ ).
    $\mathcal{U}=\mathcal{U}\cup\{{X}_{i}\}$ , $\textbf{X}=\textbf{X}\backslash\{{X}_{i}\}$ , $\mathcal{V}=\mathcal{V}\cup\{C\to{X}_{i},\Pi_{i}\to{X}_{i}\}$ .
end while
return BCT
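The greedy loop of the BCT-Learning procedure can be transliterated almost line by line. The sketch below assumes the pairwise conditional entropies have already been computed and stored as `H[j][i]` $=H(X_{i}|X_{j},C)$ ; the function name, the zero-based attribute indices and the 3-attribute matrix are hypothetical values used only to exercise the loop.

```python
def bct_learning(H, r):
    """Greedy construction of a one-dependence causal tree rooted at attribute r.

    H[j][i] is assumed to hold H(X_i | X_j, C). Returns the node list U
    (class first, attributes in the order they were added) and the edge list V.
    """
    n = len(H)
    U = ["C", r]                       # node set: class plus root attribute
    V = [("C", r)]                     # edge set: C -> X_r
    X = [i for i in range(n) if i != r]
    while X:
        # Pick the (parent, child) pair with minimum conditional entropy,
        # where the parent is already in the tree and the child is not.
        pi, xi = min(
            ((j, i) for j in U[1:] for i in X),
            key=lambda e: H[e[0]][e[1]],
        )
        U.append(xi)
        X.remove(xi)
        V += [("C", xi), (pi, xi)]     # C -> X_i and Pi_i -> X_i
    return U, V

# Hypothetical conditional entropies for three attributes (diagonal unused).
H = [
    [None, 0.9, 0.4],
    [1.2, None, 0.7],
    [1.1, 0.3, None],
]

print(bct_learning(H, 0))
print(bct_learning(H, 1))
```

With root 0 the tree is the chain 0 → 2 → 1, and with root 1 it becomes 1 → 2 → 0, so the direction of the edge between attributes 0 and 2 depends on the chosen root.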

3.2 Ensemble learning

BCT builds a causal tree to identify the causal relationships with a predefined root attribute as the root cause. The attribute order is then uniquely determined by comparing the conditional entropy $H(X_{i}|X_{j},C)$ . If different attributes are chosen as the root cause for causal reasoning, the causality inferred from the training data may vary greatly, which may result in different causal trees. To demonstrate the diversity of causality in different causal trees more clearly, we take the dataset Localization as an example, which can be obtained from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity). Table 1 presents a detailed description of the dataset Localization, and Table 2 shows the values of $H(X_{i}|X_{j},C)$ for each pair of attributes on Localization. The topologies of the causal trees with $X_{1}$ or $X_{2}$ as the root node are depicted in Fig. 4a and b, respectively.

Table 1
Dataset Localization for experimental study

Attribute	Type	Explanation	Symbol
Sequence Name	Nominal	Sequence Name of five people	$X_{1}$
Tag identificator	Nominal	No. of tag identificator	$X_{2}$
x coordinate	Numeric	x coordinate of the tag	$X_{3}$
y coordinate	Numeric	y coordinate of the tag	$X_{4}$
z coordinate	Numeric	z coordinate of the tag	$X_{5}$
Activity	Nominal	Person activity prediction	C

Table 2

The results of $H(X_{i}|X_{j},C)$ for each pair of attributes on Localization

$X_{j}$ (given) \ $X_{i}$	$X_{1}$	$X_{2}$	$X_{3}$	$X_{4}$	$X_{5}$
$X_{1}$	$\sim$	1.9645	3.9366	3.5781	3.4521
$X_{2}$	4.5681	$\sim$	3.9659	3.7586	3.1136
$X_{3}$	4.1346	1.5606	$\sim$	3.4214	3.3711
$X_{4}$	4.3226	1.8999	3.9679	$\sim$	3.396
$X_{5}$	4.4439	1.5022	4.1648	3.6431	$\sim$

For a 1-dependence topology, each child attribute can have at most one parent attribute, i.e., each effect attribute can have at most one cause attribute. If attribute $X_{1}$ is chosen as the root node, by comparing $H(X_{i}|X_{1},C)$ $(i\neq 1)$ we can see from Table 2 that $H(X_{2}|X_{1},C)$ is the minimum, so $X_{2}$ is the effect attribute of cause attribute $X_{1}$ . After that, $X_{1}$ and $X_{2}$ become the candidate cause attributes. By comparing $H(X_{i}|X_{j},C)$ $(j\in\{1,2\},i\notin\{1,2\})$ , $H(X_{5}|X_{2},C)$ is the minimum, so $X_{5}$ is the effect attribute of cause attribute $X_{2}$ . This procedure repeats until a cause attribute has been identified for every remaining attribute. Finally, as shown in Fig. 4a, the causal relationships $\{X_{1}\rightarrow X_{2},X_{2}\rightarrow X_{5},X_{1}\rightarrow X_{4},X_{1}\rightarrow X_{3}\}$ are identified. In contrast, if attribute $X_{2}$ is chosen as the root node, as shown in Fig. 4b, the causal relationships $\{X_{2}\rightarrow X_{5},X_{5}\rightarrow X_{4},X_{2}\rightarrow X_{3},X_{3}\rightarrow X_{1}\}$ are identified. Reasoning from different root causes may thus yield different results, which is consistent with human cognition. When one ensemble member differs substantially from the others, it may make accurate predictions in cases where another member has made errors, and the ensemble learner should perform better, on average, than any individual member [43]. Diversity can therefore be recognized as a very important characteristic of these causal trees. Bayesian causal forest (BCF) is the ensemble of these causal trees and can be expected to achieve competitive classification performance.

Figure 4.

Example of (a) BCT with $X_{1}$ as the root and (b) BCT with $X_{2}$ as the root.

The learning framework of BCF is described in Fig. 5, and its training and testing procedures are described by Algorithm 4 (BCF-Training) and the BCF-Test procedure, respectively. In the training phase, BCF selects each attribute $X_{r}$ in turn as the root node of a Bayesian causal tree $\textit{BCT}_{r}$ , so there are $n$ different BCTs corresponding to the $n$ attributes in the training dataset. BCF forms a three-dimensional probability table containing $P(c)$ , $P(x_{i},c)$ , $P(x_{i}|x_{j},c)$ and $P(x_{i},x_{j},c)$ by counting the frequencies of instances in the training dataset, with time complexity $\mathcal{O}(tn^{2})$ , where $t$ denotes the number of training instances. To calculate the conditional entropy $H(X_{i}|X_{j},C)$ between each pair of attributes, BCF needs to consider every combination of values of any two attributes and the class variable, with time complexity $\mathcal{O}(mn^{2}v^{2})$ , where $m$ and $v$ respectively denote the number of class labels and the maximum number of values of discrete attributes. The time complexity of determining the parents of each attribute by comparing $H(X_{i}|X_{j},C)$ is $\mathcal{O}(n^{2}\log n)$ . Consequently, BCF takes $\mathcal{O}(n^{3}\log n)$ time to generate $n$ BCT classifiers.

Algorithm 4: BCF-Training $(\mathcal{D})$
Input: Training dataset $\mathcal{D}$ with attribute set $\textbf{X}=\{X_{1},\cdots,X_{n}\}$ and class $C$ .
Output: $\textit{BCT}_{1},\textit{BCT}_{2},\cdots,\textit{BCT}_{n}$ .
Compute the conditional entropy $H(X_{i}|X_{j},C)$ between each pair of attributes, $i\neq j$ .
for $r=1\to n$ do
    Choose $X_{r}$ as the root node to build a Bayesian causal tree: $\textit{BCT}_{r}=\textbf{BCT-Learning}(r)$ (Algorithm 3).
end for
return $\textit{BCT}_{1},\textit{BCT}_{2},\cdots,\textit{BCT}_{n}$ .
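The training loop can be checked against the Localization example. The sketch below copies the conditional entropies from Table 2, builds one tree per root attribute and prints the attribute-to-attribute edges of each tree. The dictionary layout and the names `build_tree` and `forest` are illustrative choices, and the edges from $C$ to every attribute are left implicit.

```python
# H[j][i] = H(X_i | X_j, C), copied from Table 2 (outer key: given X_j, inner key: X_i).
H = {
    1: {2: 1.9645, 3: 3.9366, 4: 3.5781, 5: 3.4521},
    2: {1: 4.5681, 3: 3.9659, 4: 3.7586, 5: 3.1136},
    3: {1: 4.1346, 2: 1.5606, 4: 3.4214, 5: 3.3711},
    4: {1: 4.3226, 2: 1.8999, 3: 3.9679, 5: 3.396},
    5: {1: 4.4439, 2: 1.5022, 3: 4.1648, 4: 3.6431},
}

def build_tree(root):
    """Return the (cause, effect) edges of the causal tree rooted at X_root."""
    placed, edges = [root], []
    while len(placed) < len(H):
        # Smallest H(X_i | X_j, C) with X_j already placed and X_i not yet placed.
        cause, effect = min(
            ((j, i) for j in placed for i in H if i not in placed),
            key=lambda e: H[e[0]][e[1]],
        )
        edges.append((cause, effect))
        placed.append(effect)
    return edges

# One causal tree per root attribute, as in the training loop.
forest = {r: build_tree(r) for r in H}

for r, edges in forest.items():
    print(f"root X{r}:", ", ".join(f"X{a}->X{b}" for a, b in edges))
```

For roots $X_{1}$ and $X_{2}$ the resulting edge sets match the causal relationships listed for Fig. 4a and b, and the edge between $X_{1}$ and $X_{3}$ changes direction between the two trees.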

BCF-Test $(\textit{BCT}_{1},\textit{BCT}_{2},\cdots,\textit{BCT}_{n},\textbf{x})$
Input: The topologies $(\textit{BCT}_{1},\textit{BCT}_{2},\cdots,\textit{BCT}_{n})$ and a test instance $\textbf{x}=\{x_{1},\cdots,x_{n}\}$ .
Output: Predicted class label $y^{*}$ .
for $p=1\to m$ do
    for $r=1\to n$ do
        Use $\textit{BCT}_{r}$ to estimate the probability $P_{\textit{BCT}_{r}}(\textbf{x},c_{p})$ according to Eq. (9), where $c_{p}$ is one of the class labels $C=\{c_{1},\cdots,c_{m}\}$ .
    end for
    Average the estimates of joint probabilities: $P(\textbf{x},c_{p})=\frac{1}{n}\sum_{r=1}^{n}P_{\textit{BCT}_{r}}(\textbf{x},c_{p})$ .
end for
$\displaystyle y^{*}=\arg\max_{c_{p}\in C}\frac{P(\textbf{x},c_{p})}{\sum_{p=1}^{m}P(\textbf{x},c_{p})}$ .
return $y^{*}$

Figure 5.

The learning framework of BCF.

Domingos [44] pointed out that Bayesian model averaging is theoretically the optimal method for combining learned models. In the testing phase, BCF needs $\mathcal{O}(mn^{2})$ time to estimate the posterior probability $P(c|\textbf{x})$ for each class label; as shown in Eq. (12), the posterior probability is calculated by averaging the joint probability estimates of all BCTs.

$\displaystyle P_{\textit{BCF}}(c|\textbf{x})=\frac{P_{\textit{BCF}}(\textbf{x},c)}{P_{\textit{BCF}}(\textbf{x})}=\frac{\displaystyle\sum_{r=1}^{n}P_{\textit{BCT}_{r}}(\textbf{x},c)}{\displaystyle\sum_{c'\in C}\sum_{r=1}^{n}P_{\textit{BCT}_{r}}(\textbf{x},c')},$ (12)

where $P_{\textit{BCT}_{r}}(\textbf{x},c)$ denotes the estimate of joint probability $P(\textbf{x},c)$ for $\textit{BCT}_{r}$ .
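A short sketch of the averaging step in Eq. (12): given the joint estimates produced by each tree for every class label, sum them per class, normalize across classes and take the maximizer. The function name `bcf_posterior` and the numeric estimates are hypothetical, and the code assumes at least one tree assigns non-zero probability to some class.

```python
def bcf_posterior(joint_by_tree):
    """joint_by_tree[r][c] is assumed to hold P_BCT_r(x, c) for one test instance x.

    Returns the posterior P(c | x) of Eq. (12) and the predicted class label.
    """
    classes = list(joint_by_tree[0])
    # Numerator of Eq. (12): sum of the per-tree joint estimates for each class.
    summed = {c: sum(tree[c] for tree in joint_by_tree) for c in classes}
    # Denominator: the same sum taken over all classes.
    z = sum(summed.values())
    posterior = {c: summed[c] / z for c in classes}
    return posterior, max(posterior, key=posterior.get)

# Hypothetical joint estimates from three causal trees for classes "a" and "b".
joint_by_tree = [
    {"a": 0.02, "b": 0.01},
    {"a": 0.01, "b": 0.03},
    {"a": 0.04, "b": 0.02},
]

posterior, label = bcf_posterior(joint_by_tree)
print(posterior, label)
```

The second tree alone would predict class b, but the averaged estimate favors class a with posterior 7/13. The factor $\frac{1}{n}$ from the averaging step cancels in the ratio, so summing and averaging give the same posterior.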

4. Experimental study

4.1 Experimental setting and benchmark datasets

To illustrate the difference in classification performance of the algorithms being compared, we perform experiments on 32 benchmark datasets from the UCI machine learning repository [45] and record the number of instances, attributes, and classes for each dataset in Table 3. The datasets were divided into two categories, i.e., large datasets with the number of instances $\geqslant$ 2000 and small datasets with the number of instances $<$ 2000. In order to facilitate the calculation and training of the model, the missing values in the datasets are directly processed into distinct values suitable for training and classification. Numeric attributes are discretized using Minimum Description Length (MDL) [46]. In order to test the accuracy of the algorithms, each algorithm is processed with 10 rounds of 10-fold cross validation. Probability estimates are smoothed using $m$ -estimation with $m=1$ [47]. The source code of BCF can be obtained from the website, https://github.com/Bayes514/BCF. We compare our proposed BCF with the other eight algorithms (including one single-model BNC, four ensemble BNCs and three non-Bayesian learners). The details of the above algorithms are shown as follows:

• CFWNB [18], correlation-based feature weighting filter for naive Bayes.
• WATAN [20], weighted averaged tree augmented naive Bayes.
• IWAODE [21], instance-based weighting filter for SPODE.
• WAODE-MI [22], mutual information weighted AODE.
• TAODE [32], targeted AODE.
• SVM [6], support vector machine with default parameters.
• LR [34], logistic regression with default parameters.
• k-NN [33], k-Nearest Neighbor with default parameters.

4.2 Statistics employed

To evaluate the effectiveness and efficiency of BCF, some statistics were employed to interpret the results.

• Zero-one loss (ZOL). In statistics and decision theory, zero-one loss is one of the most commonly used metrics to evaluate the classification models in terms of generalization. The cost of predicting class label $\hat{y}$ can be measured by $L(\hat{y},y)=0$ if $\hat{y}=y$ and $L(\hat{y},y)=1$ otherwise, where $y$ is the true class label. The smaller the zero-one loss value, the better the performance of the model in general [48].
• The Root Mean Squared Error (RMSE). RMSE is used to measure the deviation between the observed and true values [49]. RMSE squares the error of each instance, where the error is the difference between 1.0 and the estimated posterior probability of the true class, then averages the squared errors over all instances, and finally takes the square root of that mean.
• F1-Score. The F1-Score [50] can evaluate the extent to which the BNC works consistently while dealing with different parts of imbalanced data. For a dataset with class labels $\{C_{1},C_{2},\cdots,C_{m}\}$, each entry $N_{ij}$ in the confusion matrix denotes the number of instances whose true class is $C_{i}$ but which are actually assigned to $C_{j}$. The F1-score is the harmonic mean of the precision and recall, averaged over the $m$ classes, and is defined as follows:

$\displaystyle F1=\frac{1}{m}\sum_{i=1}^{m}\frac{2\cdot\textit{Precision}_{i}\cdot\textit{Recall}_{i}}{\textit{Precision}_{i}+\textit{Recall}_{i}},$ (13)

where

$\displaystyle\left\{\begin{array}{l}\textit{Precision}_{i}=\frac{N_{ii}}{\sum_{j=1}^{m}N_{ji}}\\ \textit{Recall}_{i}=\frac{N_{ii}}{\sum_{j=1}^{m}N_{ij}}.\end{array}\right.$ (14)
• Area Under Curve (AUC). AUC [51] is the area under the ROC curve which can be used to measure the quality of binary classifications. It is an effective and combined measure of sensitivity and specificity for assessing inherent validity of a diagnostic test.
• Win/Draw/Loss (W/D/L) record. When comparing the classification performance of algorithms $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ over multiple datasets, W/D/L provides a more intuitive explanation for comparing results, which records the number of datasets for which $\mathcal{A}_{1}$ performs better, equally well or worse than $\mathcal{A}_{2}$ if the performance difference between them is greater than 0.05 on a given measurement.
• Significance (Friedman and Nemenyi) test. The Friedman test [52] is a non-parametric equivalent of the repeated-measures ANOVA, which explores the statistical significance of multiple algorithms over multiple datasets, and is defined as follows [53]:

$\displaystyle F_{F}=\frac{(D-1)\chi^{2}_{F}}{D(t-1)-\chi^{2}_{F}},$ (15)

where

$\displaystyle\chi^{2}_{F}=\frac{12D}{t(t+1)}\left(\sum_{j}R^{2}_{j}-\frac{t(t+1)^{2}}{4}\right),$ (16)

where $D$, $t$ and $R_{j}$ respectively represent the number of datasets, the number of algorithms and the average rank of the $j$-th algorithm. The null hypothesis of the Friedman test, that all algorithms perform equivalently, is rejected if $F_{F}$ exceeds the critical value of the $F$ distribution. In that case the Nemenyi test is performed to further analyze the differences by comparing $d_{ij}$ and $CD$ (Critical Difference), where $d_{ij}$ denotes the difference between the average ranks of the $i$-th algorithm and the $j$-th algorithm. The value of CD can be calculated as follows:

$\displaystyle CD=q_{\alpha}\sqrt{\frac{t(t-1)}{6D}},$ (17)

where $q_{\alpha}$ are the critical values that are calculated by dividing the values in the row for the infinite degree of freedom of the table of Studentized range statistics by $\sqrt{2}$ .
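The three per-dataset metrics defined above (zero-one loss, RMSE and the class-averaged F1-Score of Eqs. (13) and (14)) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the integer encoding of class labels and the toy data are hypothetical, and classes with an empty row or column in the confusion matrix are assigned a score of zero, which the text does not specify.

```python
import math
from typing import List, Sequence


def zero_one_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of instances whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[int], posteriors: Sequence[Sequence[float]]) -> float:
    """Error per instance is 1.0 minus the estimated posterior of the true class."""
    squared = [(1.0 - p[t]) ** 2 for t, p in zip(y_true, posteriors)]
    return math.sqrt(sum(squared) / len(squared))


def macro_f1(confusion: List[List[int]]) -> float:
    """Eq. (13): mean over classes of the harmonic mean of precision and recall.

    confusion[i][j] is N_ij, the number of instances of true class i assigned to class j.
    """
    m = len(confusion)
    total = 0.0
    for i in range(m):
        n_ii = confusion[i][i]
        assigned_to_i = sum(confusion[j][i] for j in range(m))  # column sum, Eq. (14)
        truly_i = sum(confusion[i][j] for j in range(m))        # row sum, Eq. (14)
        precision = n_ii / assigned_to_i if assigned_to_i else 0.0  # assumption: 0 if undefined
        recall = n_ii / truly_i if truly_i else 0.0                 # assumption: 0 if undefined
        denom = precision + recall
        total += 2.0 * precision * recall / denom if denom else 0.0
    return total / m


# Toy example with two classes (0 and 1) and four instances.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
confusion = [[2, 0], [1, 1]]

zol = zero_one_loss(y_true, y_pred)
err = rmse(y_true, posteriors)
f1 = macro_f1(confusion)
```

Note that RMSE depends on the estimated posterior probabilities rather than only on the predicted labels, which is why two classifiers with identical zero-one loss can still differ in RMSE.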

Tables A1–A3 in Appendix A respectively show the detailed results of zero-one loss, RMSE and F1-Score of all the 9 algorithms on 32 datasets. The symbols $\bullet$ and $\circ$ in Tables A1–A3 respectively denote statistically significant improvement or degradation over our proposed algorithm BCF. Tables A4 and A5 in Appendix A respectively show the detailed results of rank.

Table 3
Datasets

Index	Dataset	Instances	Attributes	Classes
1	labor $\ast$	57	16	2
2	labor-negotiations $\ast$	57	16	2
3	lymphography	148	18	4
4	iris	150	4	3
5	autos	205	25	7
6	sonar $\ast$	208	60	2
7	glass-id	214	9	3
8	heart $\ast$	270	13	2
9	hungarian $\ast$	294	13	2
10	soybean-large	307	35	19
11	dermatology	366	34	6
12	cylinder-bands $\ast$	540	39	2
13	chess $\ast$	551	39	2
14	balance-scale	625	4	3
15	soybean	683	35	19
16	credit-a $\ast$	690	15	2
17	crx $\ast$	690	15	2
18	tic-tac-toe $\ast$	958	9	2
19	vowel	990	13	11
20	contraceptive-mc	1,473	9	3
21	mfeat-mor	2,000	6	10
22	kr-vs-kp $\ast$	3,196	36	2
23	dis $\ast$	3,772	29	2
24	hypo	3,772	29	4
25	sign	12,546	8	3
26	magic $\ast$	19,020	10	2
27	adult $\ast$	48,842	14	2
28	shuttle	58,000	9	7
29	connect-4	67,557	42	3
30	waveform	100,000	21	3
31	localization	164,860	5	11
32	census-income $\ast$	299,285	41	2

4.3 BCF vs. the state-of-the-art BNCs

4.3.1 Zero-one loss

The W/D/L records summarizing the zero-one loss of the state-of-the-art BNCs are shown in Table 4, from which we can see that weighted AODEs perform better in terms of zero-one loss than CFWNB and WATAN. For example, TAODE beats CFWNB on 15 datasets and loses on 8, and beats WATAN on 15 datasets and loses on 5. The reason may be that whether causality $X_{i}\rightarrow X_{j}$ or $X_{j}\rightarrow X_{i}$ holds, the right causality always appears in weighted AODEs. TAODE performs the best among all the weighted AODEs, and its advantage over WAODE-MI is less significant relative to IWAODE. TAODE beats IWAODE on 7 datasets and loses on 4. BCF performs the best among all of the above algorithms. When compared with single-model BNC, BCF performs much better than CFWNB (17/9/6). When compared with ensemble BNCs, the advantages of BCF over WATAN and IWAODE are also significant (18/11/3 and 16/12/4, respectively).

Table 4
Win-Draw-Loss results of zero-one loss on 32 datasets

		CFWNB	WATAN	IWAODE	WAODE-MI	TAODE
ZOL	WATAN	12/6/14
	IWAODE	16/9/7	13/12/7
	WAODE-MI	18/9/5	16/11/5	9/19/4
	TAODE	15/9/8	15/12/5	7/21/4	4/24/4
	BCF	17/9/6	18/11/3	16/12/4	15/12/5	14/13/5

To further analyze the effectiveness of the mechanism of BCF, we present the scatter plot of zero-one loss for BCF and two other algorithms (WATAN and TAODE) in Fig. 6, where the X-axis shows the index number of datasets and the Y-axis corresponds to the values of relative zero-one loss $\delta=\textit{ZOL}(\textit{BNC})/\textit{ZOL}(\textit{BCF})$. BCF performs better or worse than the corresponding BNC in terms of zero-one loss when $\delta>1$ or $\delta<1$, respectively. Figure 6a shows the comparison results of BCF and WATAN. When dealing with small datasets, we can see that BCF achieves significant advantage over WATAN especially on three datasets (labor, labor-negotiations, dermatology) and seldom performs worse than WATAN. For large datasets, BCF also performs better than WATAN. Figure 6b shows the comparison results of BCF and TAODE. The advantage of BCF over TAODE is obvious but less significant.

The sub classifiers in WATAN apply the same topology and the number of undirected edges measured by conditional mutual information is limited, which cannot help demonstrate the diversity among sub classifiers, and thus weighting is the only effective approach to tune the estimate of joint probability for data fitting. In contrast, the SPODE members in weighted AODE apply different independence assumptions and the corresponding network topologies vary greatly. All possible conditional dependencies or causal relationships between attributes, significant or non-significant, are fully represented. Some inappropriate independence assumptions may degrade the classification performance of SPODE members and then the final weighted AODE. Weighting is more effective for improving AODE, and weighted AODEs perform much better than AODE in general. In contrast, the BCT members in BCF make no independence assumption and respectively represent directed causalities of high confidence level with different root causes. The topologies of these causal trees are built independently, whereas weighting effectively combines them into one and achieves the trade-off between diversity and complementarity.

4.3.2 RMSE

Table 5
Win-Draw-Loss results of RMSE on 32 datasets

		CFWNB	WATAN	IWAODE	WAODE-MI	TAODE
RMSE	WATAN	14/9/9
	IWAODE	15/13/4	8/18/6
	WAODE-MI	16/11/5	11/18/3	6/21/5
	TAODE	17/7/8	9/18/5	5/22/5	2/28/2
	BCF	17/9/6	11/18/3	14/14/4	11/19/2	15/14/3

Figure 6.

The scatter plot of zero-one loss for BCF and other two algorithms.

We compare the RMSE results for all of the above 6 state-of-the-art BNCs on 32 datasets to clarify their differences and present the W/D/L records in Table 5. The joint probability distribution can be factorized into $n$ conditional probabilities, and weighting can help finely tune the estimates of conditional probabilities or joint probability. Due to the limitation in structure complexity and computational complexity, restricted BNCs assume independence assumptions implicitly or explicitly, and only limited number of dependency relationships in the network topology can be represented. Weighting can be regarded as an effective approach to mitigating the negative effect of independence assumption by applying weighting metrics which measure the dependency relationship between attributes.

As shown in Table 5, ensemble learners perform better than the single-model learner in terms of RMSE, e.g., WATAN beats CFWNB on 14 datasets and loses on 9. Thus the strict independence assumption of NB makes the positive effect of weighting less significant. Weighted AODEs perform better than WATAN in general, e.g., WAODE-MI beats WATAN on 11 datasets and loses on 3. The topology diversity is explicable in terms of the underlying independence assumptions of SPODE members, and the advantage of weighted AODEs can be attributed to the diversity and complementarity. BCF also embodies these characteristics in its learning procedure. It applies conditional entropy to measure directed causality and uses different root causes to learn different causal trees, thus achieving the trade-off between data fitting and causal reasoning. BCF demonstrates significant advantage over other learners, e.g., BCF respectively beats CFWNB, WATAN and TAODE on 17, 11 and 15 datasets.

Figure 7.

The scatter plot of RMSE for BCF and other two algorithms.

In order to give the results a more intuitive explanation, we present the scatter plot of RMSE for BCF and two other algorithms in Fig. 7, where the X-axis shows the index number of datasets and the Y-axis corresponds to the values of $\eta=\textit{RMSE}(\textit{BNC})/\textit{RMSE}(\textit{BCF})$. BCF performs better or worse than the corresponding BNC in terms of RMSE when $\eta>1$ or $\eta<1$, respectively. As shown in Fig. 7, BCF performs better than WATAN and TAODE much more often than not while dealing with small or large datasets. Given training data $\mathcal{D}$ with a limited number of instances, ideally the estimate of joint probability distribution corresponding to the network topology learned from $\mathcal{D}$ should approximate the true one, and RMSE provides a more fine-grained calibration metric for probability estimation than zero-one loss.

Since $I(X_{i};X_{j},Y)=I(X_{j};X_{i},Y)$ rarely holds in practice, the confidence level of directed edge $X_{i}\rightarrow X_{j}$ corresponding to the estimate of $P(x_{j}|x_{i},y)$ may differ greatly from that of directed edge $X_{j}\rightarrow X_{i}$ corresponding to the estimate of $P(x_{i}|x_{j},y)$. That makes the estimates of joint probability for different sub classifiers in WATAN fit data to different extents, although the dependency relationships in different topologies are the same. For the same reason, although the difference in independence assumptions makes the SPODE members in AODE avoid structure learning, these assumptions may make none of the SPODE members fit $\mathcal{D}$ well, and an unverified directed edge $X_{i}\rightarrow X_{j}$ will certainly appear in one of the SPODE members, which can help improve the generalization performance of the final ensemble but will degrade the estimates of joint probability distributions. In contrast, by applying heuristic search strategy to maximize the joint entropy function, each BCT in BCF is trained well to fit the training data without any independence assumption, and the corresponding estimates of joint probability distribution approximate the true one.

4.3.3 F1-Score

Table 6
Win-Draw-Loss results of F1-Score on 32 datasets

		CFWNB	WATAN	IWAODE	WAODE-MI	TAODE
F1-Score	WATAN	5/17/10
	IWAODE	5/18/9	4/25/3
	WAODE-MI	5/15/12	5/24/3	1/28/3
	TAODE	5/15/12	5/25/2	2/29/1	1/30/1
	BCF	12/13/7	13/18/1	11/19/2	16/14/2	14/16/2

The W/D/L records summarizing the F1-Score of the state-of-the-art BNC algorithms are shown in Table 6, from which we can see that CFWNB performs better than weighted AODEs and WATAN in terms of F1-Score, e.g., CFWNB respectively beats WATAN, IWAODE and TAODE on 10, 9 and 12 datasets. Weighted AODEs perform better in terms of F1-Score than WATAN. For example, TAODE beats WATAN on 5 datasets and loses on 2. The experimental results of IWAODE, WAODE-MI and TAODE are very similar: TAODE only outperforms IWAODE on 2 datasets, and TAODE and WAODE-MI perform almost the same (30 draws). BCF performs the best among all of the above algorithms. When compared with the single-model BNC, BCF performs better than CFWNB (12/13/7). When compared with ensemble BNCs, the advantages of BCF over WATAN, WAODE-MI and TAODE are also significant (13/18/1, 16/14/2 and 14/16/2, respectively). The results indicate that BCF can achieve significant improvements in terms of F1-Score.

4.3.4 Efficiency comparisons

Figure 8 shows the empirical time comparisons of the different out-of-core BNCs relative to BCF. All the experiments have been conducted on a desktop computer with a 64-bit Intel(R) Core(TM) i5-8265U CPU @ 1.6 GHz and 8 GB of memory. The algorithms described above are implemented in C++ software specially designed to deal with classification problems. Each bar represents the average of experimental results on all the 32 datasets through 10 rounds of 10-fold cross validation. The time required for discretization has not been included. No parallelization techniques have been used, although most algorithms could be parallelized.

As shown in Fig. 8, the training procedures of IWAODE and TAODE are just the same as that of AODE, i.e., no structure learning and weighting are required. Thus they need the least time for training. During the testing phase, IWAODE and TAODE learn weights from each testing instance, and this instantiated weighting approach adjusts the weights flexibly but requires more time for testing. WAODE-MI needs to compute mutual information $I(X_{i};C)$ between the superparent $X_{i}$ and class variable $C$ to learn weights for SPODE members from training data, thus WAODE-MI needs a bit more time for training and less time for testing in contrast to IWAODE and TAODE. For single-model BNC, i.e., CFWNB, the network structure is pre-determined and it learns weight for each attribute to improve the estimate of conditional probabilities. Feature-class correlation and the average feature-feature intercorrelation respectively measured by $I(X_{i};C)$ and $I(X_{i};X_{j})$ are considered for assigning weights. BCF only needs to learn the topology of each causal tree. In contrast, WATAN needs to construct $n$ directed maximum weighted spanning trees for $n$ attributes, and it uses the mutual information $I(X_{i};C)$ between the root attribute $X_{i}$ and class variable $C$ as the weight. Thus WATAN requires more time for training in contrast to WAODE-MI and BCF. BCF assigns the same weight to its BCT members, whereas the weights assigned to the members of WAODE-MI or WATAN will be different for most datasets. That makes WAODE-MI and WATAN need a bit more time to compute the joint probability for classifying.

4.4 BCF vs. SVM, LR and k-NN

Table 7
W/D/L records of all compared algorithms in terms of zero-one loss, RMSE and F1-Score on 30 datasets

Compared algorithms	W/D/L
	ZOL	RMSE	F1-Score
BCF versus SVM	23/1/6	25/2/3	19/6/5
BCF versus LR	20/5/5	19/7/4	14/10/6
BCF versus k-NN	20/1/9	19/3/8	15/8/7

Figure 8.

Comparison of averaged training time and classification time for state-of-the-art BNCs on 32 datasets.

In this section, we compare BCF with SVM, LR and k-NN. The W/D/L records for BCF vs. SVM, LR and k-NN in terms of zero-one loss, RMSE and F1-Score are given in Table 7. As we cannot obtain the results of SVM, LR and k-NN on two datasets (labor and census-income), the comparison in this section only includes 30 datasets. From Table 7 we can see that BCF performs better than SVM, LR and k-NN in terms of zero-one loss, e.g., BCF beats SVM on 23 datasets and loses on 6, and beats LR on 20 datasets and loses on 5. When compared with non-Bayesian learners in terms of RMSE, the advantages of BCF over SVM, LR and k-NN are also significant (25/2/3, 19/7/4 and 19/3/8, respectively). BCF also performs better than SVM, LR and k-NN in terms of F1-Score. For example, BCF beats SVM on 19 datasets and loses on 5, and beats k-NN on 15 datasets and loses on 7. For these non-Bayesian learners (SVM, LR and k-NN), the experiments with 30 datasets were performed on Weka (version 3.5.7), a widely used machine learning workbench.

Figure 9 displays the empirical time comparisons of the different non-Bayesian learners relative to BCF. During the training phase, SVM maps training instances to points in space so as to maximise the width of the gap between the two categories. Test instances are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Thus SVM requires more time for training and testing in contrast to BCF. During the testing phase, k-NN needs to compute the Euclidean distance between a test instance and the specified training instances, thus k-NN needs a bit less time for training and more time for testing in contrast to SVM, LR and BCF. The goal of LR is to model the probability of a random variable being 0 or 1 given experimental data. During the training phase, LR uses optimization techniques such as gradient descent to minimize the loss function. Therefore, training LR is very time-consuming, but LR needs the least time for testing. In contrast, BCF considers more appropriate low-level heuristics, in addition to more iterations, during the heuristic search to facilitate more opportunity to undertake effective search.

Figure 9.

Comparison of averaged training time and classification time for SVM, LR, k-NN and BCF on 30 datasets.

4.5 AUC

In this section, we compare the AUC results for all of the above 9 algorithms on the 15 two-class datasets in our experiments to clarify their differences and present the detailed results of AUC in Table 8. All 15 datasets are denoted with the symbol $\ast$ in Table 3. In order to give the results a more intuitive explanation, we present the line chart of AUC for BCF and four other algorithms in Fig. 10, where the X-axis shows the index number of datasets and the Y-axis corresponds to the values of AUC. Note that the curves of BCF are almost always above those of non-Bayesian learners. From Fig. 10a, we can see that BCF achieves significant advantage over SVM on 14 out of 15 datasets in terms of AUC. We can also observe from Fig. 10b that BCF achieves significant advantage over k-NN and seldom performs worse than k-NN. Figure 10c shows the comparison results of BCF and WATAN. The advantage of BCF over WATAN is obvious but less significant. From Fig. 10d it can be seen that BCF performs similarly to TAODE. The experimental results demonstrate that BCF can deal with binary classification problems.
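For readers who want to reproduce an AUC value for a binary dataset, the following Python sketch uses the rank-based (Mann-Whitney) formulation, in which AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. This is one standard way to compute AUC and is not necessarily the procedure used in the experiments; the function name `auc`, the 0/1 label encoding and the toy scores are assumptions of this sketch.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: scores play the role of the estimated posterior of the positive class.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.7]
value = auc(labels, scores)
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a sketch but would be replaced by a sort-based computation on the larger datasets of Table 3.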

Table 8
Detailed results in terms of AUC on 15 datasets

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor	0.8860	0.9840	0.9770	0.9570	0.9878	0.9973	0.9946	0.9946	1.0000
labor-negotiations	0.8500	0.8180	0.9770	0.9740	0.9662	0.9959	0.9932	0.9932	0.9946
sonar	0.8460	0.8700	0.7710	0.9280	0.8651	0.8646	0.8708	0.8663	0.8671
heart	0.8350	0.7520	0.9390	0.9280	0.8821	0.9041	0.9018	0.9006	0.8994
hungarian	0.8040	0.7330	0.9450	0.9250	0.9066	0.9186	0.9173	0.9155	0.9102
cylinder-bands	0.5240	0.7220	0.8510	0.8520	0.8122	0.9200	0.9076	0.9112	0.9187
chess	0.6870	0.9680	0.9480	0.9290	0.9575	0.9469	0.9633	0.9677	0.9721
credit-a	0.8030	0.8080	0.9040	0.9340	0.9148	0.9243	0.9226	0.9161	0.9243
crx	0.8640	0.8660	0.9000	0.9340	0.9123	0.9262	0.9256	0.9189	0.9311
tic-tac-toe	0.8290	1.0000	0.9960	0.7680	0.8227	0.8370	0.8050	0.8383	0.8222
kr-vs-kp	0.9380	0.9900	0.9960	0.9830	0.9816	0.9765	0.986	0.9785	0.9824
dis	0.5000	0.7080	0.8310	0.9280	0.9000	0.9111	0.9357	0.9267	0.9054
magic	0.8110	0.7800	0.8390	0.8730	0.8935	0.8940	0.8936	0.8961	0.8947
adult	0.7640	0.7160	0.9050	0.9170	0.9195	0.9224	0.9236	0.9198	0.9198
census-income	0.7850	0.6850	0.9020	0.9310	0.9398	0.9041	0.9329	0.9297	0.9430

Figure 10.

The line chart of AUC comparisons for BCF and four other algorithms.

4.6 Friedman and Nemenyi test

Table 9
Average ranks of the algorithms

Algorithm	Zero-one loss rank	RMSE rank
SVM	6.667	7.583
k-NN	5.667	5.750
LR	6.317	5.617
CFWNB	4.817	4.917
WATAN	5.133	4.950
IWAODE	4.733	4.567
WAODE-MI	4.250	4.019
TAODE	4.217	4.433
BCF	3.200	3.000
Result of the $F_{F}$ statistics	5.541	7.880

In this section, we perform the Friedman test followed by the Nemenyi test to explore the statistical significance of experimental results of these 9 algorithms on 30 datasets. The average ranks of the algorithms obtained by applying the Friedman test with respect to zero-one loss and RMSE are shown in Table 9. The Friedman statistic $F_{F}$ is distributed according to the $F$ distribution with $9-1=8$ and $(9-1)\times(30-1)=232$ degrees of freedom. The critical value of $F(8,232)$ is 1.920 for $\alpha=0.05$. At the bottom of Table 9, we can see that the Friedman statistics $F_{F}$ for zero-one loss and RMSE are 5.541 and 7.880 respectively, which are greater than 1.920. Therefore, the null hypothesis is rejected, indicating that there are significant differences among those 9 algorithms. In order to further explore which algorithms differ significantly, we conduct the Nemenyi test and show the comparison for zero-one loss and RMSE in Fig. 11. The left line in the graph is the axis on which we plot the average ranks of different algorithms, the lower the better. For $\alpha=0.05$ with 9 algorithms and 30 datasets, the critical difference $CD=3.102\times\sqrt{\frac{9\times(9-1)}{9\times 30}}=1.602$ is also plotted in the graph. If the average rank difference between two algorithms is less than the CD value, a line segment is used to connect the two algorithms.
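The Friedman decision described above can be sketched in Python from the average ranks alone. The sketch uses the standard average-rank form of the statistic, $\chi^{2}_{F}=\frac{12D}{t(t+1)}\left(\sum_{j}R_{j}^{2}-\frac{t(t+1)^{2}}{4}\right)$, together with Eq. (15); the function name `friedman_statistics` is hypothetical. The input below is the zero-one loss column of Table 9. Because those ranks are rounded to three decimals, the value computed this way should be read as an approximation and need not reproduce the reported statistic exactly.

```python
from typing import Sequence, Tuple


def friedman_statistics(avg_ranks: Sequence[float], n_datasets: int) -> Tuple[float, float]:
    """Return (chi2_F, F_F) computed from the average ranks of t algorithms on D datasets."""
    t = len(avg_ranks)
    d = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    # Friedman chi-square for average ranks.
    chi2 = 12.0 * d / (t * (t + 1)) * (sum_sq - t * (t + 1) ** 2 / 4.0)
    # Eq. (15): F-distributed correction with (t-1) and (t-1)(D-1) degrees of freedom.
    f_f = (d - 1) * chi2 / (d * (t - 1) - chi2)
    return chi2, f_f


# Average zero-one loss ranks from Table 9 (SVM, k-NN, LR, CFWNB, WATAN,
# IWAODE, WAODE-MI, TAODE, BCF), obtained on 30 datasets.
zol_ranks = [6.667, 5.667, 6.317, 4.817, 5.133, 4.733, 4.250, 4.217, 3.200]
chi2, f_f = friedman_statistics(zol_ranks, n_datasets=30)
dof = (len(zol_ranks) - 1, (len(zol_ranks) - 1) * (30 - 1))
reject_null = f_f > 1.920  # critical value of F(8, 232) quoted in the text
```

With these inputs the statistic lies well above the quoted critical value, so the sketch reaches the same decision as the text: the null hypothesis is rejected.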

Figure 11.

The comparison results of Nemenyi test in terms of (a) zero-one loss and (b) RMSE.

When the experimental results of zero-one loss are compared, from Fig. 11a we can see that BCF achieves the lowest mean zero-one loss rank (3.200) followed by TAODE (4.217), WAODE-MI (4.250) and IWAODE (4.733). BCF enjoys a significant zero-one loss advantage over CFWNB, WATAN, k-NN, LR and SVM. When RMSE is compared, from Fig. 11b we can see that BCF gets the first position in terms of RMSE, which is significantly different from CFWNB, WATAN, LR, k-NN and SVM, but there are no statistical differences between WAODE-MI, TAODE, IWAODE and BCF. These results illustrate that causal relationships and ensemble learning have an obvious positive effect on reducing the zero-one loss and RMSE of the algorithm.

5. Conclusions and future work

Our work has been primarily motivated by the intuitive understanding of how one variable influences another in the network topology. Information-theoretic metrics, e.g., conditional mutual information $I(X_{i};X_{j}|C)$, can be applied to measure undirected conditional dependence between an attribute pair, and the learned topology of BNC achieves data fitting from the viewpoint of data science. However, the symmetric form of $I(X_{i};X_{j}|C)$ makes it inappropriate for measuring directed causality. In this paper, conditional entropy $H(X_{i}|X_{j},C)$ is introduced to measure the causal uncertainty of $X_{i}$, and by applying heuristic search strategy the learned topology can also achieve data fitting in terms of the log likelihood function. The asymmetric characteristic of $H(X_{i}|X_{j},C)$ makes it possible to build different BCTs with different root causes, which is consistent with human cognition, and ensemble learning is then introduced to build BCF from the viewpoint of causal science. The experimental results on widely used benchmark datasets from the UCI machine learning repository show that BCF is a competitive alternative to single-model or ensemble BNCs in terms of zero-one loss and RMSE. The effectiveness of the weighting approach and instantiated dependence analysis has been proven. More appropriate ensemble learning strategies to encourage diversity are worth exploring. Moreover, it remains a direction for future research to explore the causality between attribute values for a specific instance or situation.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2019YFC1804804) and the Scientific and Technological Developing Scheme of Jilin Province (No. 20200201281JC).

Appendix

See Tables A1–A5.

Table A1

Detailed results in terms of zero-one loss

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor	Null	Null	Null	0.0701 $\circ$	0.0526 $\circ$	0.0526 $\circ$	0.0527 $\circ$	0.0526 $\circ$	0.0175
labor-negotiations	0.0702 $\circ$	0.1754 $\circ$	0.0702 $\circ$	0.1052 $\circ$	0.1053 $\circ$	0.0526	0.0526	0.0526	0.0526
lymphography	0.1824 $\circ$	0.1891 $\circ$	0.2162 $\circ$	0.1486	0.1689 $\circ$	0.1419	0.1554 $\circ$	0.1554 $\circ$	0.1419
iris	0.0333 $\bullet$	0.0467 $\bullet$	0.0400 $\bullet$	0.0600 $\bullet$	0.0800 $\circ$	0.0867 $\circ$	0.0867 $\circ$	0.0867 $\circ$	0.0667
autos	0.6537 $\circ$	0.2390 $\circ$	0.2878 $\circ$	0.1902	0.2146 $\circ$	0.2098	0.1951	0.1951	0.2000
sonar	0.3413 $\circ$	0.1346 $\bullet$	0.2740 $\circ$	0.1586 $\bullet$	0.2212	0.2260 $\circ$	0.2260 $\circ$	0.2260 $\circ$	0.2115
glass-id	0.2430 $\circ$	0.2103 $\circ$	0.3224 $\circ$	0.1728 $\bullet$	0.2196 $\circ$	0.2196 $\circ$	0.2570 $\circ$	0.2523 $\circ$	0.1963
heart	0.4259 $\circ$	0.2444 $\circ$	0.2556 $\circ$	0.1703 $\bullet$	0.1926	0.1704 $\bullet$	0.1778 $\bullet$	0.1778 $\bullet$	0.1926
hungarian	0.3605 $\circ$	0.2313 $\circ$	0.1599	0.1598	0.1735 $\circ$	0.1599	0.1565	0.1599	0.1565
soybean-large	0.2573 $\circ$	0.0782	0.1531 $\circ$	0.0977 $\circ$	0.1010 $\circ$	0.0912 $\circ$	0.0814	0.0783	0.0782
dermatology	0.1694 $\circ$	0.0546 $\circ$	0.0301 $\circ$	0.0191	0.0328 $\circ$	0.0191	0.0192	0.0191	0.0191
cylinder-bands	0.2333 $\circ$	0.2556 $\circ$	0.2130 $\circ$	0.1981 $\circ$	0.2463 $\circ$	0.1926	0.1796	0.1880	0.1870
chess	0.1615 $\circ$	0.1208 $\circ$	0.1143 $\circ$	0.1379 $\circ$	0.0926	0.1034 $\circ$	0.0944	0.0799 $\bullet$	0.0907
balance-scale	0.1024 $\bullet$	0.1344 $\bullet$	0.1040 $\bullet$	0.2496	0.2736 $\circ$	0.2832 $\circ$	0.2816 $\circ$	0.2832 $\circ$	0.2512
soybean	0.1127 $\circ$	0.0878 $\circ$	0.0615	0.0614	0.0527 $\bullet$	0.0542 $\bullet$	0.0483 $\bullet$	0.0483 $\bullet$	0.0615
credit-a	0.4449 $\circ$	0.1884 $\circ$	0.1478	0.1333 $\bullet$	0.1507	0.1391	0.1362 $\bullet$	0.1507	0.1464
crx	0.1719 $\circ$	0.1678 $\circ$	0.1477 $\circ$	0.1304	0.1478 $\circ$	0.1319	0.1377 $\circ$	0.1391 $\circ$	0.1290
tic-tac-toe	0.1221 $\bullet$	0.0125 $\bullet$	0.0167 $\bullet$	0.3100 $\circ$	0.2265	0.2662 $\circ$	0.2724 $\circ$	0.2630 $\circ$	0.2286
vowel	0.1495 $\bullet$	0.0071 $\bullet$	0.1818 $\bullet$	0.3050 $\circ$	0.1263 $\bullet$	0.1697 $\bullet$	0.1949 $\bullet$	0.1323 $\bullet$	0.2172
contraceptive-mc	0.4515	0.5567 $\circ$	0.4881	0.4677	0.4895	0.4942	0.4922	0.4902	0.4718
mfeat-mor	0.6450 $\circ$	0.3450 $\circ$	0.3240 $\circ$	0.3060	0.2980	0.3120 $\circ$	0.3130 $\circ$	0.3105 $\circ$	0.2945
kr-vs-kp	0.0610 $\bullet$	0.0372 $\bullet$	0.0244 $\bullet$	0.0644 $\bullet$	0.0776	0.0826 $\circ$	0.0576 $\bullet$	0.0773	0.0748
dis	0.0154 $\circ$	0.0170 $\circ$	0.0170 $\circ$	0.0156 $\circ$	0.0154 $\circ$	0.0127 $\bullet$	0.0143	0.0125 $\bullet$	0.0146
hypo	0.0740 $\circ$	0.0867 $\circ$	0.0339 $\circ$	0.0121 $\circ$	0.0130 $\circ$	0.0114 $\circ$	0.0101 $\circ$	0.0119 $\circ$	0.0082
sign	0.3273 $\circ$	0.3340 $\circ$	0.4063 $\circ$	0.3700 $\circ$	0.2752	0.2789	0.2768	0.2743	0.2653
magic	0.3412 $\circ$	0.1906 $\circ$	0.2089 $\circ$	0.2033 $\circ$	0.1674 $\circ$	0.1744 $\circ$	0.1762 $\circ$	0.1725 $\circ$	0.1581
adult	0.2416 $\circ$	0.2049 $\circ$	0.1484 $\circ$	0.1499 $\circ$	0.1380	0.1502 $\circ$	0.1445 $\circ$	0.1558 $\circ$	0.1363
shuttle	0.0166 $\circ$	0.0007 $\bullet$	0.0315 $\circ$	0.0020 $\circ$	0.0014 $\circ$	0.0011 $\circ$	0.0009 $\circ$	0.0008	0.0008
connect-4	0.2242 $\bullet$	0.1900 $\bullet$	0.2423	0.2847 $\circ$	0.2354	0.2409	0.2406	0.2374	0.2468
waveform	0.0271 $\circ$	0.0404 $\circ$	0.0279 $\circ$	0.0198 $\circ$	0.0202 $\circ$	0.0181	0.0181	0.0189	0.0186
localization	0.4209 $\circ$	0.2226 $\bullet$	0.5950 $\circ$	0.4936 $\circ$	0.3575 $\circ$	0.3593 $\circ$	0.3566 $\circ$	0.3544 $\circ$	0.3265
census-income	Null	Null	Null	0.1666 $\circ$	0.0636 $\bullet$	0.0994 $\circ$	0.0883 $\circ$	0.0871 $\circ$	0.0762

Table A2

Detailed results in terms of RMSE

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor	Null	Null	Null	0.2545 $\circ$	0.2105 $\circ$	0.1649 $\circ$	0.1920 $\circ$	0.1931 $\circ$	0.1062
labor-negotiations	0.2649 $\circ$	0.4113 $\circ$	0.2438 $\circ$	0.2810 $\circ$	0.2778 $\circ$	0.1739	0.2057 $\circ$	0.2029 $\circ$	0.1736
lymphography	0.302 $\circ$	0.2759 $\circ$	0.3242 $\circ$	0.2419	0.2705 $\circ$	0.2304	0.2496 $\circ$	0.2501 $\circ$	0.2341
iris	0.1491 $\bullet$	0.1747	0.1424 $\bullet$	0.1500 $\bullet$	0.1958 $\circ$	0.2024 $\circ$	0.2091 $\circ$	0.2077 $\circ$	0.1825
autos	0.4322 $\circ$	0.2568 $\circ$	0.2828 $\circ$	0.2099	0.2320 $\circ$	0.2317 $\circ$	0.2290	0.2306	0.2194
sonar	0.5842 $\circ$	0.365 $\bullet$	0.5235 $\circ$	0.3409 $\bullet$	0.4130	0.4246 $\circ$	0.4091	0.4202	0.3992
glass-id	0.4025 $\circ$	0.3716 $\circ$	0.3812 $\circ$	0.2952 $\bullet$	0.3315	0.3237	0.3422 $\circ$	0.3409 $\circ$	0.3174
heart	0.6526 $\circ$	0.4924 $\circ$	0.3485	0.3429 $\bullet$	0.3765	0.3548	0.3572	0.3696	0.3640
hungarian	0.6005 $\circ$	0.4791 $\circ$	0.3372	0.3384	0.3418	0.3450	0.3380	0.3443	0.3285
soybean-large	0.1646 $\circ$	0.0855 $\circ$	0.1248 $\circ$	0.0886 $\circ$	0.0902 $\circ$	0.0866 $\circ$	0.0860 $\circ$	0.0856 $\circ$	0.0761
dermatology	0.2376 $\circ$	0.1339 $\circ$	0.0975 $\circ$	0.0648 $\bullet$	0.0850 $\circ$	0.0661 $\bullet$	0.0688	0.0718	0.0709
cylinder-bands	0.483 $\circ$	0.5045 $\circ$	0.4465 $\circ$	0.4111 $\circ$	0.4277 $\circ$	0.3952 $\circ$	0.4016 $\circ$	0.4056 $\circ$	0.3696
chess	0.4019 $\circ$	0.2608	0.2771 $\circ$	0.3208 $\circ$	0.2594	0.2835 $\circ$	0.2603	0.2510	0.2509
balance-scale	0.2613 $\bullet$	0.28 $\bullet$	0.2092 $\bullet$	0.3589 $\circ$	0.3203	0.3201	0.3203	0.3198	0.3291
soybean	0.1089 $\circ$	0.0879 $\circ$	0.0758 $\circ$	0.0723 $\circ$	0.0654	0.0697	0.0646	0.0657	0.0674
credit-a	0.667 $\circ$	0.4334 $\circ$	0.3329 $\circ$	0.3116	0.3407 $\circ$	0.3271	0.3236	0.3350 $\circ$	0.3159
crx	0.3142	0.3259 $\circ$	0.3415 $\circ$	0.3142	0.3415 $\circ$	0.3259 $\circ$	0.3219	0.3322 $\circ$	0.3077
tic-tac-toe	0.3495 $\bullet$	0.2315 $\bullet$	0.1289 $\bullet$	0.4334 $\circ$	0.4023	0.3992	0.4085	0.3984 $\circ$	0.3961
vowel	0.1649	0.0358 $\bullet$	0.1722	0.1982 $\circ$	0.1254 $\bullet$	0.1463 $\bullet$	0.1633	0.1324 $\bullet$	0.1669
contraceptive-mc	0.5486 $\circ$	0.599 $\circ$	0.4384	0.4305	0.4392	0.4392	0.4385	0.4394	0.4324
mfeat-mor	0.3592 $\circ$	0.258 $\circ$	0.181	0.1943	0.1941	0.1979 $\circ$	0.1983 $\circ$	0.1980 $\circ$	0.1860
kr-vs-kp	0.247 $\ \$	0.1946 $\circ$	0.1474 $\circ$	0.2779 $\circ$	0.2358 $\ \$	0.2635 $\circ$	0.2343 $\ \$	0.2561 $\circ$	0.2315
dis	0.124 $\circ$	0.1302 $\circ$	0.1228 $\circ$	0.1130 $\ \$	0.1098 $\ \$	0.1058 $\ \$	0.1046 $\ \$	0.1047 $\ \$	0.1092
hypo	0.1923 $\circ$	0.2081 $\circ$	0.1147 $\circ$	0.0739 $\circ$	0.0723 $\circ$	0.0698 $\circ$	0.0647 $\circ$	0.0685 $\circ$	0.0592
sign	0.4671 $\circ$	0.2892 $\bullet$	0.421 $\circ$	0.3929 $\circ$	0.3504 $\ \$	0.3516 $\ \$	0.3519 $\ \$	0.3487 $\ \$	0.3397
magic	0.5841 $\circ$	0.4366 $\circ$	0.3839 $\circ$	0.3709 $\circ$	0.3461 $\ \$	0.3534 $\circ$	0.3526 $\circ$	0.3519 $\circ$	0.3320
adult	0.4915 $\circ$	0.4526 $\circ$	0.3202 $\ \$	0.3150 $\ \$	0.3076 $\ \$	0.3250 $\circ$	0.3197 $\ \$	0.3297 $\circ$	0.3078
shuttle	0.0689 $\circ$	0.0137 $\bullet$	0.0852 $\circ$	0.0270 $\bullet$	0.0177 $\bullet$	0.0159 $\bullet$	0.0131 $\bullet$	0.0124 $\bullet$	0.0331
connect-4	0.3866 $\circ$	0.3069 $\bullet$	0.3358 $\ \$	0.3632 $\circ$	0.3315 $\ \$	0.3359 $\ \$	0.3390 $\ \$	0.3339 $\ \$	0.3384
waveform	0.1345 $\circ$	0.1641 $\bullet$	0.1158 $\ \$	0.1116 $\circ$	0.0951 $\ \$	0.0859 $\bullet$	0.0860 $\bullet$	0.0865 $\bullet$	0.0977
localization	0.2766 $\circ$	0.2012 $\circ$	0.261 $\circ$	0.2402 $\circ$	0.2095 $\ \$	0.2093 $\ \$	0.2087 $\ \$	0.2081 $\ \$	0.2002
census-income	Null	Null	Null	0.3638 $\circ$	0.2168 $\bullet$	0.2785 $\circ$	0.2604 $\circ$	0.2599 $\ \$	0.2373

Table A3

Detailed results in terms of F1-Score

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor	Null	Null	Null	0.9300 $\circ$	0.9429 $\circ$	0.9429 $\circ$	0.9429 $\circ$	0.9429 $\circ$	0.9810
labor-negotiations	0.9290 $\ \$	0.8260 $\circ$	0.9310 $\ \$	0.8930 $\ \$	0.8845 $\ \$	0.9429 $\ \$	0.9429 $\ \$	0.9429 $\ \$	0.9230
lymphography	0.7630 $\circ$	0.7530 $\circ$	0.7840 $\circ$	0.8510 $\ \$	0.5614 $\circ$	0.8897 $\ \$	0.6929 $\circ$	0.5720 $\circ$	0.8715
iris	0.9670 $\bullet$	0.9530 $\ \$	0.9600 $\bullet$	0.9470 $\ \$	0.9200 $\ \$	0.9133 $\ \$	0.9133 $\ \$	0.9133 $\ \$	0.9133
autos	0.7210 $\circ$	0.7640 $\circ$	0.7120 $\circ$	0.8100 $\circ$	0.8482 $\ \$	0.8590 $\ \$	0.5845 $\circ$	0.5838 $\circ$	0.8825
sonar	0.6200 $\circ$	0.8650 $\bullet$	0.7260 $\circ$	0.8410 $\bullet$	0.7778 $\ \$	0.7721 $\ \$	0.7721 $\ \$	0.7721 $\ \$	0.7818
glass-id	0.7470 $\circ$	0.7900 $\ \$	0.6790 $\circ$	0.8230 $\ \$	0.7863 $\ \$	0.7868 $\ \$	0.7519 $\circ$	0.7571 $\circ$	0.8089
heart	0.4480 $\circ$	0.7550 $\circ$	0.8430 $\ \$	0.8290 $\ \$	0.8035 $\circ$	0.8269 $\ \$	0.8190 $\circ$	0.8190 $\circ$	0.8639
hungarian	0.8010 $\ \$	0.7690 $\circ$	0.8390 $\ \$	0.8360 $\ \$	0.8082 $\ \$	0.8224 $\ \$	0.8257 $\ \$	0.8232 $\ \$	0.8207
soybean-large	0.8230 $\circ$	0.8321 $\circ$	0.8212 $\circ$	0.8020 $\circ$	0.8621 $\circ$	0.8800 $\circ$	0.9456 $\ \$	0.9468 $\ \$	0.9899
dermatology	0.8140 $\circ$	0.9450 $\ \$	0.9700 $\ \$	0.9970 $\ \$	0.9635 $\ \$	0.9795 $\ \$	0.9773 $\ \$	0.9773 $\ \$	0.9820
cylinder-bands	0.7670 $\circ$	0.7370 $\circ$	0.7830 $\circ$	0.9930 $\bullet$	0.7310 $\circ$	0.7927 $\circ$	0.8091 $\circ$	0.8005 $\circ$	0.9270
chess	0.8100 $\circ$	0.9310 $\ \$	0.8860 $\circ$	0.8550 $\circ$	0.8713 $\circ$	0.8562 $\circ$	0.8698 $\circ$	0.8893 $\circ$	0.9794
balance-scale	0.8620 $\bullet$	0.8420 $\bullet$	0.9010 $\bullet$	0.4823 $\bullet$	0.5041 $\bullet$	0.4974 $\bullet$	0.4984 $\bullet$	0.4974 $\bullet$	0.3929
soybean	0.9230 $\ \$	0.9100 $\circ$	0.9390 $\ \$	0.9380 $\ \$	0.9632 $\ \$	0.9685 $\ \$	0.9748 $\ \$	0.9727 $\ \$	0.9649
credit-a	0.4770 $\circ$	0.8110 $\ \$	0.8530 $\ \$	0.8670 $\ \$	0.8470 $\ \$	0.8587 $\ \$	0.8614 $\ \$	0.8470 $\ \$	0.8527
crx	0.8033 $\ \$	0.8132 $\ \$	0.8233 $\ \$	0.8700 $\ \$	0.8502 $\ \$	0.8658 $\ \$	0.8599 $\ \$	0.8584 $\ \$	0.9662
tic-tac-toe	0.8710 $\bullet$	0.9870 $\bullet$	0.9830 $\bullet$	0.6750 $\circ$	0.7330 $\ \$	0.6832 $\ \$	0.6724 $\circ$	0.6872 $\ \$	0.7188
vowel	0.8480 $\circ$	0.9930 $\bullet$	0.8180 $\circ$	0.6960 $\circ$	0.8735 $\circ$	0.8302 $\circ$	0.8045 $\circ$	0.8679 $\circ$	0.9288
contraceptive-mc	0.5400 $\circ$	0.4430 $\circ$	0.5080 $\circ$	0.5310 $\circ$	0.4978 $\circ$	0.4957 $\circ$	0.4978 $\circ$	0.5006 $\circ$	0.6039
mfeat-mor	0.3710 $\circ$	0.6540 $\circ$	0.7350 $\bullet$	0.6760 $\ \$	0.6994 $\ \$	0.6831 $\ \$	0.6814 $\ \$	0.6843 $\ \$	0.6986
kr-vs-kp	0.9390 $\ \$	0.9630 $\ \$	0.9760 $\bullet$	0.9350 $\ \$	0.9223 $\ \$	0.9170 $\ \$	0.9421 $\ \$	0.9224 $\ \$	0.9258
dis	0.9780 $\ \$	0.9820 $\ \$	0.9760 $\ \$	0.9850 $\ \$	0.5818 $\circ$	0.7746 $\circ$	0.7553 $\circ$	0.7851 $\circ$	0.9613
hypo	0.7820 $\circ$	0.8923 $\circ$	0.9650 $\bullet$	0.9880 $\bullet$	0.7057 $\circ$	0.7161 $\circ$	0.7210 $\circ$	0.7140 $\circ$	0.8265
sign	0.6660 $\circ$	0.8660 $\bullet$	0.5890 $\circ$	0.6260 $\circ$	0.7230 $\ \$	0.7195 $\ \$	0.7215 $\ \$	0.7241 $\ \$	0.7256
magic	0.5350 $\bullet$	0.8070 $\bullet$	0.7840 $\bullet$	0.7800 $\bullet$	0.8076 $\bullet$	0.7887 $\bullet$	0.7880 $\bullet$	0.7913 $\bullet$	0.9039
adult	0.6850 $\circ$	0.7950 $\circ$	0.8460 $\circ$	0.8530 $\circ$	0.8063 $\circ$	0.8062 $\circ$	0.8095 $\circ$	0.8011 $\circ$	0.9066
shuttle	0.9532 $\bullet$	0.999 $\bullet$	0.967 $\bullet$	0.998 $\bullet$	0.8294 $\ \$	0.8590 $\ \$	0.8010 $\circ$	0.8198 $\ \$	0.8496
connect-4	0.7320 $\bullet$	0.7710 $\bullet$	0.7170 $\bullet$	0.6520 $\bullet$	0.5170 $\ \$	0.5321 $\bullet$	0.5377 $\bullet$	0.5506 $\bullet$	0.4984
waveform	0.9730 $\ \$	0.9600 $\ \$	0.9720 $\ \$	0.9800 $\ \$	0.9798 $\ \$	0.9819 $\ \$	0.9819 $\ \$	0.9818 $\ \$	0.9812
localization	0.7820 $\circ$	0.7750 $\circ$	0.7520 $\circ$	0.3523 $\circ$	0.3953 $\circ$	0.3677 $\circ$	0.3757 $\circ$	0.3813 $\circ$	0.8971
census-income	Null	Null	Null	0.8720 $\bullet$	0.7662 $\ \$	0.7175 $\circ$	0.7312 $\circ$	0.7306 $\circ$	0.7514

Table A4

Detailed results of rank in terms of zero-one loss

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor-negotiations	5.5	9.0	5.5	7.0	8.0	2.5	2.5	2.5	2.5
lymphography	7.0	8.0	9.0	3.0	6.0	1.5	4.5	4.5	1.5
iris	1.0	3.0	2.0	4.0	6.0	8.0	8.0	8.0	5.0
autos	9.0	7.0	8.0	1.0	6.0	5.0	2.5	2.5	4.0
sonar	9.0	1.0	8.0	2.0	4.0	6.0	6.0	6.0	3.0
glass-id	6.0	3.0	9.0	1.0	4.5	4.5	8.0	7.0	2.0
heart	9.0	7.0	8.0	1.0	5.5	2.0	3.5	3.5	5.5
hungarian	9.0	8.0	5.0	3.0	7.0	5.0	1.5	5.0	1.5
soybean-large	9.0	1.5	8.0	6.0	7.0	5.0	4.0	3.0	1.5
dermatology	9.0	8.0	6.0	2.5	7.0	2.5	5.0	2.5	2.5
cylinder-bands	7.0	9.0	6.0	5.0	8.0	4.0	1.0	3.0	2.0
chess	9.0	7.0	6.0	8.0	3.0	5.0	4.0	1.0	2.0
balance-scale	1.0	3.0	2.0	4.0	6.0	8.5	7.0	8.5	5.0
soybean	9.0	8.0	6.5	5.0	3.0	4.0	1.5	1.5	6.5
credit-a	9.0	8.0	5.0	1.0	6.5	3.0	2.0	6.5	4.0
crx	9.0	8.0	6.0	2.0	7.0	3.0	4.0	5.0	1.0
tic-tac-toe	3.0	1.0	2.0	9.0	4.0	7.0	8.0	6.0	5.0
vowel	4.0	1.0	6.0	9.0	2.0	5.0	7.0	3.0	8.0
contraceptive-mc	1.0	9.0	4.0	2.0	5.0	8.0	7.0	6.0	3.0
mfeat-mor	9.0	8.0	7.0	3.0	2.0	5.0	6.0	4.0	1.0
kr-vs-kp	4.0	2.0	1.0	5.0	8.0	9.0	3.0	7.0	6.0
dis	5.5	8.5	8.5	7.0	5.5	2.0	3.0	1.0	4.0
hypo	8.0	9.0	7.0	5.0	6.0	3.0	2.0	4.0	1.0
sign	6.0	7.0	9.0	8.0	3.0	5.0	4.0	2.0	1.0
magic	9.0	6.0	8.0	7.0	2.0	4.0	5.0	3.0	1.0
adult	9.0	8.0	4.0	5.0	2.0	6.0	3.0	7.0	1.0
shuttle	8.0	1.0	9.0	7.0	6.0	5.0	4.0	2.5	2.5
connect-4	2.0	1.0	7.0	9.0	3.0	6.0	5.0	4.0	8.0
waveform	7.0	9.0	8.0	5.0	6.0	1.5	1.5	4.0	3.0
localization	7.0	1.0	9.0	8.0	5.0	6.0	4.0	3.0	2.0
Average rank	6.667	5.667	6.317	4.817	5.133	4.733	4.250	4.217	3.200
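
The ranks in Table A4 are consistent with sorting the zero-one losses of each dataset in ascending order and letting tied algorithms share the mean of the positions they occupy. The following Python sketch illustrates that tie-aware ranking on the `dis` row of Table A1; the function name `average_ranks` is illustrative and is not taken from the paper, and the ranking rule itself is an assumption inferred from the tabulated values.

```python
def average_ranks(losses):
    """Rank values in ascending order (lowest loss gets rank 1).

    Tied values share the mean of the positions they occupy.
    """
    # Indices of the algorithms, ordered from lowest to highest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    ranks = [0.0] * len(losses)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and losses[order[end + 1]] == losses[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


# Zero-one loss on the `dis` dataset (Table A1), in column order:
# SVM, k-NN, LR, CFWNB, WATAN, IWAODE, WAODE-MI, TAODE, BCF
dis_losses = [0.0154, 0.0170, 0.0170, 0.0156, 0.0154, 0.0127, 0.0143, 0.0125, 0.0146]
dis_ranks = average_ranks(dis_losses)
print(dis_ranks)
```

For this input the sketch reproduces the `dis` row of Table A4, including the shared ranks 5.5 (SVM and WATAN) and 8.5 (k-NN and LR).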

Table A5

Detailed results of rank in terms of RMSE

Dataset	SVM	k-NN	LR	CFWNB	WATAN	IWAODE	WAODE-MI	TAODE	BCF
labor-negotiations	6.0	9.0	5.0	8.0	7.0	2.0	4.0	3.0	1.0
lymphography	8.0	7.0	9.0	3.0	6.0	1.0	4.0	5.0	2.0
iris	2.0	4.0	1.0	3.0	6.0	7.0	9.0	8.0	5.0
autos	9.0	7.0	8.0	1.0	6.0	5.0	3.0	4.0	2.0
sonar	9.0	2.0	8.0	1.0	5.0	7.0	4.0	6.0	3.0
glass-id	9.0	7.0	8.0	1.0	4.0	3.0	6.0	5.0	2.0
heart	9.0	8.0	2.0	1.0	7.0	3.0	4.0	6.0	5.0
hungarian	9.0	8.0	2.0	4.0	5.0	7.0	3.0	6.0	1.0
soybean-large	9.0	2.0	8.0	6.0	7.0	5.0	4.0	3.0	1.0
dermatology	9.0	8.0	7.0	1.0	6.0	2.0	3.0	5.0	4.0
cylinder-bands	8.0	9.0	7.0	5.0	6.0	2.0	3.0	4.0	1.0
chess	9.0	5.0	6.0	8.0	3.0	7.0	4.0	2.0	1.0
balance-scale	2.0	3.0	1.0	9.0	6.5	5.0	6.5	4.0	8.0
soybean	9.0	8.0	7.0	6.0	2.0	5.0	1.0	3.0	4.0
credit-a	9.0	8.0	5.0	1.0	7.0	4.0	3.0	6.0	2.0
crx	2.5	5.5	8.5	2.5	8.5	5.5	4.0	7.0	1.0
tic-tac-toe	3.0	2.0	1.0	9.0	7.0	6.0	8.0	5.0	4.0
vowel	6.0	1.0	8.0	9.0	2.0	4.0	5.0	3.0	7.0
contraceptive-mc	8.0	9.0	3.0	1.0	5.5	5.5	4.0	7.0	2.0
mfeat-mor	9.0	8.0	1.0	4.0	3.0	5.0	7.0	6.0	2.0
kr-vs-kp	6.0	2.0	1.0	9.0	5.0	8.0	4.0	7.0	3.0
dis	8.0	9.0	7.0	6.0	5.0	3.0	1.0	2.0	4.0
hypo	8.0	9.0	7.0	6.0	5.0	4.0	2.0	3.0	1.0
sign	9.0	1.0	8.0	7.0	4.0	5.0	6.0	3.0	2.0
magic	9.0	8.0	7.0	6.0	2.0	5.0	4.0	3.0	1.0
adult	9.0	8.0	5.0	3.0	1.0	6.0	4.0	7.0	2.0
shuttle	8.0	3.0	9.0	6.0	5.0	4.0	2.0	1.0	7.0
connect-4	9.0	1.0	4.0	8.0	2.0	5.0	7.0	3.0	6.0
waveform	8.0	9.0	7.0	6.0	4.0	1.0	2.0	3.0	5.0
localization	9.0	2.0	8.0	7.0	6.0	5.0	4.0	3.0	1.0
Average rank	7.583	5.750	5.617	4.917	4.950	4.567	4.183	4.433	3.000
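
Average ranks such as those in the last rows of Tables A4 and A5 are the usual input to the Friedman test for comparing several classifiers over multiple datasets. Assuming that this is their intended use (the appendix itself does not say so), the sketch below computes the Friedman chi-square statistic, 12N / (k(k+1)) * (sum of squared average ranks - k(k+1)^2 / 4), for k algorithms over N datasets. The function name `friedman_chi_square` is hypothetical, and the example call uses the average ranks of Table A4 with N = 30.

```python
def friedman_chi_square(avg_ranks, n_datasets):
    """Friedman chi-square statistic from the average rank of each algorithm."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    # Under the null hypothesis every average rank equals (k + 1) / 2,
    # which makes the bracketed term, and hence the statistic, zero.
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Average ranks from Table A4 (zero-one loss), in column order:
# SVM, k-NN, LR, CFWNB, WATAN, IWAODE, WAODE-MI, TAODE, BCF
a4_avg_ranks = [6.667, 5.667, 6.317, 4.817, 5.133, 4.733, 4.250, 4.217, 3.200]
stat = friedman_chi_square(a4_avg_ranks, n_datasets=30)
print(round(stat, 3))
```

The statistic would then be compared against a chi-square distribution with k - 1 = 8 degrees of freedom; a large value indicates that the algorithms do not all perform equivalently.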
