Abstract
In this paper, we propose a new probabilistic modeling approach for interpretable inference and classification using the maximum likelihood evidential reasoning (MAKER) framework. This approach integrates statistical analysis, hybrid evidence combination and belief rule-based (BRB) inference, and machine learning. Statistical analysis is used to acquire evidence from data. The BRB inference is applied to analyze the relationship between system inputs and outputs. An interdependence index is used to quantify the interdependence between input variables. An adapted genetic algorithm is applied to train the models. The model established by the approach features a unique strong interpretability, which is reflected in three aspects: (1) interpretable evidence acquisition, (2) interpretable inference mechanism, and (3) interpretable parameter determination. The MAKER-based model is shown to be a competitive classifier for the Banana, Haberman’s survival, and Iris data sets, and generally performs better than other interpretable classifiers, e.g., complex tree, logistic regression, and naive Bayes.
Keywords
Introduction
Machine learning has attracted great attention, for its astounding capability to accurately predict a wide range of complicated phenomena [1]. Despite these successes, some black box machine learning models also have limitations and drawbacks [2]. One of the most prominent issues is insufficient transparency in models’ decision behaviors, which leaves human observers with very limited understanding and knowledge of how certain decisions are generated [2].
Broadly speaking, the term “interpretability” describes the ability to explain or present model predictions to human observers in understandable terms [3, 4]. In low-risk environments, some machine learning models may not require interpretability, whether because mistakes will not have serious consequences (e.g., movie recommender systems) or because the methods used have been extensively studied and assessed (e.g., optical character recognition) [5]. In other environments, however, a lack of interpretability can have more damaging consequences [6]. For example, a driverless car equipped with black box machine learning models may fail to brake when confronted with a stationary fire truck while driving at highway speeds [2]. This incorrect decision can have serious consequences (e.g., crashing into the fire truck) [2]. Such decisions may be related to bias in the models’ training set, which can lead models to discriminate against a feature to maximize prediction accuracy [5]. Additionally, the black box nature of certain models makes it more challenging to identify the factors and logic leading to a wrong decision, knowledge of which could be used to prevent future problems. To take another example, a black box neural network outperforms other candidate models in prediction accuracy when predicting risk of death among pneumonia patients [6, 7]. The neural network predicts that pneumonia patients who also have asthma will have a lower probability of death, when in reality patients with both pneumonia and asthma have a higher risk of death [6, 7]. The misleading pattern arises because such patients receive more timely and aggressive treatment and consequently have better survival prospects than their non-asthmatic counterparts [6]. This type of data leakage can feed misinformation to models or artificially inflate test performance [8].
When such problems surface, interpretation is necessary to interrogate and rectify models, so that they can learn sensible features rather than spurious and misleading correlations [6, 9].
Interpretability is thus necessary in many contexts, due to incomplete problem formalization which creates a major barrier to optimization and evaluation [10]. Prediction accuracy alone cannot justify a model’s validity [11], and problems require both correct prediction and explicit interpretation [10]. Interpretability bridges the gap between domain knowledge and data science [12]. It facilitates a system’s learning, verification, and improvement [13], and reinforces human trust in a system [14]. Understanding model predictions and how information is coded in models helps humans understand why models fail, avoiding undesirable trials and flawed development procedures [15]. In domains such as medicine, justice, and education, model interpretability helps decision makers criticize, refine, and trust in a model, based on their expert knowledge [16].
A growing body of research has focused on interpretability [1–4, 14]. However, there is no consensus on either the definition of “interpretability” in the machine learning context or how interpretability can be evaluated for benchmarking [10]. This study proposes a new probabilistic modeling approach to define “interpretability” in the context of machine learning. Evidence is first acquired from data using statistical analysis according to the maximum likelihood evidential reasoning (MAKER) framework. The MAKER framework is used for data-driven inferential modeling and decision making under different types of uncertainty [17]. It consists of a state space model and an evidence space model, which are driven by the data reflecting the input-output relationship. The reliability of evidence and the interdependence between a pair of evidence can be explicitly measured in the MAKER framework. In such a framework, various types of uncertainty can be considered for inferential modeling, probabilistic prediction and decision making [17]. Based on the acquired evidence, the belief rule base (BRB), which features a unique strong interpretability, is established between the inputs and outputs of a numerical system. A machine learning algorithm is used to train the parameters of the interpretable model. The mechanism of the model’s inference and classification is briefly analyzed and graphically presented. The model is then validated through three representative data sets, and its performance is compared with that of other models.
In the remainder of this paper, Section 2 reviews related studies and highlights the major innovations of this modeling approach. Section 3 discusses the underlying methodologies of the new approach using the Iris data set. In Section 4, we analyze the newly proposed approach and discuss the interpretability it achieves. In Section 5, the models constructed by the new approach and other alternatives are compared in terms of their prediction performance on the Banana, Haberman’s survival, and Iris data sets. Finally, Section 6 offers concluding remarks and suggestions for further research.
Related research
One of the simplest methods of achieving interpretability is creating interpretable models. Rule-based models, which are represented as a set of IF-THEN rules, are among the most interpretable models. The BRB proposed by Yang et al. [18] was developed by adding a belief structure to the conventional IF-THEN rule base [19]. In a BRB system, information is integrated using the evidential reasoning (ER) rule to implement inference [20, 21]. ER can handle both qualitative and quantitative information under uncertainty and incompleteness [18], based on Dempster-Shafer (D-S) theory [22–24]. In the D-S theory, a frame of discernment (FoD) is used to contain pre-assigned classes to which basic probabilities are assigned to generate a belief distribution (BD). Basic probabilities are used to measure the extent to which observations of input variables point to different classes or their subsets. The BD for each observation of an input variable is referred to as a piece of evidence. While Dempster’s rule can be used to combine multiple pieces of evidence, it can be applied only if certain conditions are satisfied, including that any evidence is assumed to be fully reliable. However, this assumption is impractical and can lead to counter-intuitive results when Dempster’s rule is used to combine highly or completely conflicting evidence [25].
Built on the basic concepts of the D-S theory, the ER rule [26] eliminates this assumption by taking into account the reliability and relative importance of evidence, while still preserving the desirable features of Dempster’s rule. One of the most important features of the ER rule is that it constitutes a unique probabilistic inference process for conjunctive combination of independent evidence. The ER rule is used to deal with discrete probabilistic inference problems where both input and output variables are assumed to take discrete numerical or categorical values. In reality, however, inference and classification problems can have both discrete and continuous variables. As such, it is necessary to develop new methods to help solve such problems.
The new modeling approach proposed in this paper is developed to address these issues in the classification context. Several researchers have applied the BRB inference methodology and the ER approach to classification problems. Chang et al. [27] proposed a new rule activation and weight calculation procedure to construct a BRB classifier. Jiao et al. [28] developed a BRB classification system to deal with incomplete or imprecise information. Xu et al. [29] proposed a new classification method based on the ER approach. Yang et al. [21] proposed an ensemble BRB modeling method to handle classification problems. In healthcare classification problems, the BRB methodology has been used for risk stratification of patients with cardiac chest pain [19], trauma outcome prediction [20], and diagnosis of lymph node metastasis in gastric cancer [30, 31].
In addition, researchers have attempted to use other methods to interpret the results of computational intelligence methods. Du et al. [32] proposed a guided feature inversion framework which not only determines each input variable’s contribution but also provides insights into the decision-making process of deep neural network models. In the study of Brian and McKeown [33], evidence is defined as the intersection of a feature’s actual and expected contribution. Based on this definition, the study categorizes features which are important to prediction. A Long Short-Term Memory (LSTM) caption generation model [34], whose loss function encourages class discriminative information, is used to generate justifications for the image classifications of a convolutional neural network. Non-iterative supervised learning models [35, 36] provide a fast solution to classification and regression problems with increased accuracy. They are based on the use of Ito decomposition (Kolmogorov–Gabor polynomial) and the neural-like structure of the successive geometric transformations model (SGTM). A simple interpretation of the results of the regression or classification tasks can be established on the basis of the transition from the neural-like structure to the solution of the task, which is in the form of a linear polynomial.
Compared with these studies, this paper’s major innovations are as follows. (1) Continuous data from an input space are discretized based on referential values at which evidence is generated through statistical analysis. (2) Acquired evidence is combined in the MAKER framework [17], which captures the interdependence between any two pieces of evidence. (3) Based on the MAKER framework, combined evidence is used to generate belief rules, so as to formulate a BRB [18] for interpretable inference that deduces an explicit probabilistic relationship between input and output variables. (4) The parameters of the MAKER-based models such as referential values and weights are trained using machine learning algorithms. In sum, the model built by the new approach is characterized by a unique strong interpretability. It provides a specific definition of “interpretability” in the machine learning context.
The proposed probabilistic modeling approach and its application to the Iris data set
This section provides a step-by-step description of how the proposed probabilistic modeling approach is used to establish a model for a classification problem. Suppose that a training input-output data set with N instances is indicated by x = {xn | n = 1, …, N}. Each instance, featuring M continuous input variables x(l) (l = 1, …, M), is denoted by xn = {xn,l | l = 1, …, M}. These instances are classified by a nominal output variable y = {k | k = 1, …, K}, where each integer represents a class of the output variable. yn indicates the specific class of the output variable for the nth instance. For example, in the Iris data set, the input variables include sepal length, sepal width, petal length, and petal width, which are denoted by x(1), x(2), x(3), and x(4), respectively. The output variable contains three classes: Iris Setosa, Iris Versicolor, and Iris Virginica, which are signified by “1”, “2”, and “3”, respectively. The output variable is hence represented as y = {1, 2, 3}.
In this section, a training set of a fold of the Iris data set (hereinafter referred to as “training set”) is used as an example to demonstrate how a MAKER-based classifier is constructed using the training set based on the proposed approach for the classification of the Iris data set. The associated trained referential values (displayed in Table 1) and weights are used to develop a MAKER-based classifier for the “training set” (the complete training set is provided in Table S1 of the supplementary materials). The rest of this section is organized into five subsections: statistical evidence acquisition, evidence independence analysis, belief rule-base inference, rules combination for classification, and training of model parameters. Figure 1 shows the flow diagram describing the probabilistic modeling based on the training data set, adapted genetic algorithm, classification based on BRB for test data set, and their interrelationship.
Table 1. Referential values of input variables of the example training set used for demonstration

Figure 1. Probabilistic modeling, training of model parameters, and classification.
Since the input variables are continuous, referential values of each variable are used for probabilistic modeling, while any observation lying between a pair of adjacent referential values is represented by probabilistic interpolation [37]. Matching degrees are generated to construct a probability distribution over the adjacent referential values that is equivalent to the observation being interpolated. Referential values can be initially assigned through statistical analysis and subsequently fine-tuned by optimal training [29, 38].
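The interpolation described above can be sketched in Python. The sketch assumes the linear interpolation form commonly used with referential values in the BRB literature, in which the matching degree to the lower referential value falls linearly from 1 to 0 across the interval and the two degrees sum to 1. The function name `matching_degrees` is illustrative and is not taken from the paper; the referential values are those quoted for sepal length.

```python
def matching_degrees(x, refs):
    """Return the matching degree of observation x to each referential value.

    Assumes refs is strictly increasing and refs[0] <= x <= refs[-1].
    Only the two referential values adjacent to x receive non-zero degrees.
    """
    if not refs[0] <= x <= refs[-1]:
        raise ValueError("observation lies outside the referential range")
    degrees = [0.0] * len(refs)
    for i in range(len(refs) - 1):
        lo, hi = refs[i], refs[i + 1]
        if lo <= x <= hi:
            # Linear interpolation between the two adjacent referential values.
            degrees[i] = (hi - x) / (hi - lo)
            degrees[i + 1] = 1.0 - degrees[i]
            break
    return degrees


# Referential values of sepal length quoted in the text.
sepal_length_refs = [4.3000, 4.9991, 7.7000]
print(matching_degrees(5.0, sepal_length_refs))
```

An observation of 5.0000 lies just above 4.9991, so almost all of its matching degree goes to that referential value and the small remainder goes to 7.7000.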
Using the referential values, we can transform an observation xn,l into a belief distribution of a referential value Ai,l, which is shown in Equation (1).
In Equation (1),
Matching degrees for each referential value are aggregated over the instances of each class to generate a total matching degree for that class, as shown in Equation (2). The total matching degree of a referential value for a class is then treated as the frequency with which the referential value matches the class. Using the referential values shown in Table 1, we can generate all the frequencies of the referential values for an input variable. Table 2 displays the frequencies with which the referential values 4.3000, 4.9991, and 7.7000 of the input variable sepal length match each class of the output variable.
The frequencies of referential values for input variable of sepal length
For each row of a frequency table such as Table 2, we can generate a sum of frequency values (
The likelihoods of referential values of input variable: sepal length matching classes of output variable
Based on
In Equation (5), Θ ={ h1, …, h N } is referred to as a frame of discernment, which denotes a set of mutually exclusive and collectively exhaustive hypotheses [39]. (θ, pθ,j) is an element of evidence e j that points to proposition θ, which can be any subset of Θ other than the empty set, with a probability: pθ,j [39]. For instance, the classes of output variable: Iris setosa, Iris versicolor, and Iris virginica can be signified by “1”, “2”, and “3”, respectively. The frame of discernment is hence denoted as Θ ={ 1, 2, 3 }. As shown in Table 4, the probabilities: 0.9816, 0.0000, and 0.0184, of a referential value: 4.3000, indicate that if sepal length is 4.3000 cm, the probability of this flower being Iris setosa (class: “1”) is 0.9816, and that of this flower being Iris versicolor (class: “2”) is 0.0000, and that of this flower being Iris virginica (class: “3”) is 0.0184. Thus, we can define a piece of evidence at sepal length of 4.3000 cm, where it points to class: “1”, “2”, and “3” with a probability of 0.9816, 0.0000, and 0.0184, respectively.
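The acquisition steps of this subsection (matching degrees aggregated into frequencies, frequencies converted into likelihoods, likelihoods normalized into probabilities) can be sketched as follows. This is a minimal sketch under stated assumptions: the likelihood of a referential value given a class is taken to be its frequency divided by the total frequency of that class, and the probability is the likelihood normalized across classes. The function name `acquire_evidence` and the toy inputs are illustrative only.

```python
def acquire_evidence(degrees, labels, n_classes):
    """Turn matching degrees into frequencies, likelihoods and probabilities.

    degrees[n][i]: matching degree of instance n to referential value i.
    labels[n]: class index of instance n, in 0..n_classes-1.
    Returns (freq, likelihood, prob), each indexed as [referential value][class].
    """
    n_refs = len(degrees[0])
    # Frequencies: matching degrees summed per referential value and class.
    freq = [[0.0] * n_classes for _ in range(n_refs)]
    for d, k in zip(degrees, labels):
        for i, a in enumerate(d):
            freq[i][k] += a
    # Assumed likelihood: frequency divided by the class total.
    totals = [sum(freq[i][k] for i in range(n_refs)) for k in range(n_classes)]
    likelihood = [
        [freq[i][k] / totals[k] if totals[k] > 0 else 0.0 for k in range(n_classes)]
        for i in range(n_refs)
    ]
    # Probabilities: likelihoods normalized across classes per referential value.
    prob = []
    for row in likelihood:
        s = sum(row)
        prob.append([c / s if s > 0 else 0.0 for c in row])
    return freq, likelihood, prob


# Toy data: four instances, two referential values, two classes.
toy_degrees = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
toy_labels = [0, 0, 1, 1]
freq, likelihood, prob = acquire_evidence(toy_degrees, toy_labels, 2)
print(prob)
```

Each row of `prob` plays the role of one piece of evidence: a belief distribution over the classes defined at one referential value.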
The probabilities of referential values of input variable: sepal length matching classes of output variable
To achieve greater predictive power, it is necessary to combine multiple pieces of evidence to generate predicted probabilities for certain classes. In the ER rule, independence is assumed between a pair of evidence. Under the MAKER framework, the degree of interdependence between a pair of evidence is measured by an interdependence index “α” generated from the marginal and joint likelihood functions [17]. To generate the interdependence index between a pair of evidence, we first need to estimate the joint probabilities for the pair. Let xn,l and xn,m be the nth observations of the lth and mth input variables, respectively. Given that the simultaneous observations of xn,l and xn,m are characterized by a probability distribution, we can use Equation (6) to generate
Similarly, we can generate joint matching degrees for each instance in a data set matching associated combinations of referential values which point to different classes of output variable. These joint matching degrees can be further aggregated to generate frequency values for associated referential values combinations pointing to classes of output variable (
Based on these frequency values, the joint likelihoods for referential values combinations pointing to classes of output variable (
With joint likelihoods, Equation (10) is used to obtain the joint probabilities for referential values combinations pointing to classes of output variable (
The probabilities of referential values of input variable: sepal width matching classes of output variable
The joint probabilities of referential values combinations of input variables: sepal length and sepal width pointing to different classes of output variable (a referential value of sepal length and that of sepal width are represented by Ai,1 and Aj,2, respectively)
Using Equations (4) and (10), we can generate αA,B,i,j representing interdependence indices between a pair of evidential elements indicated by ei,l (A) and ej,m (B), which is shown in Equation (11).
In Equation (11), pA,i,l and pB,j,m are the basic probabilities for the single input variables xl at xi,l and xm at xj,m, which point to propositions A and B, respectively. pA,B,il,jm represents the joint basic probability for both xi,l and xj,m being observed for the proposition θ (θ = A ∩ B, θ ⊆ Θ), which is generated using Equation (10) based on the joint table for pA,i,l and pB,j,m. When ei,l (A) and ej,m (B) are disjoint, αA,B,i,j is given as 0. If ei,l (A) is independent of ej,m (B), 1 is assigned to αA,B,i,j [17]. For example, the interdependence index between a piece of evidence from sepal length at 4.9991 and another piece from sepal width at 2.8969, which points to class Iris versicolor (indicated by “2”), is obtained by
The interdependence indices between a piece of evidence from sepal length (ei,1) and another piece from sepal width (ej,2)
From Table 7, it is clear that the majority of interdependence indices are moderate, as they lie between 1.0000 and 7.0000. For example, the interdependence between sepal length 4.9991 and sepal width 4.4000 is considered to be moderate, as the associated indices matching Iris setosa (class “1”), Iris versicolor (class “2”), and Iris virginica (class “3”) are 2.8891, 2.1496, and 1.9491, respectively. An exceptional example is that sepal length 4.3000 and sepal width 2.0000 are highly interdependent, as the associated interdependence index is 181.8097, far from the value of 1 that indicates independence.
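The behavior of the index can be illustrated with a short sketch. It assumes that the index is the ratio of the joint basic probability to the product of the two marginal basic probabilities, which is consistent with the stated values of 0 for disjoint evidence and 1 for independent evidence. The function name `interdependence_index` and the numeric inputs are illustrative, not values from the tables.

```python
def interdependence_index(p_a, p_b, p_joint):
    """Assumed form: joint basic probability over the product of the marginals.

    Returns 0 when either marginal probability is zero, so the ratio is
    never evaluated with a zero denominator.
    """
    if p_a == 0.0 or p_b == 0.0:
        return 0.0
    return p_joint / (p_a * p_b)


# Independent evidence: the joint probability equals the product of marginals.
print(interdependence_index(0.2, 0.5, 0.1))
# Disjoint evidence: the two elements never occur together.
print(interdependence_index(0.2, 0.5, 0.0))
# Joint probability well above the product: strong interdependence.
print(interdependence_index(0.2, 0.5, 0.4))
```

Under this form, values above 1 mean that two evidential elements occur together more often than independence would predict.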
Based on the acquired evidence and interdependence analysis, we are now in a position to construct a BRB to infer the likelihood of a class for an instance in a data set. A BRB can capture incomplete, fuzzy, and ignorant information, along with nonlinear causal relationship between antecedent attributes and consequents [18]. It consists of a finite number of belief rules, which are defined as follows [18].
In (12), R_k denotes the kth (k = 1, …, L) belief rule. M_k represents the number of antecedent attributes in the kth rule, and x_m (m = 1, …, M_k) indicates the mth antecedent attribute.
As per the belief rule described in (12), the antecedent of a belief rule, which is represented in the form of “
To obtain the probabilities for the consequents, multiple pieces of evidence from input variables are combined using the conjunctive MAKER rule. To combine evidence, it is necessary to consider the reliability of evidence to measure the degree of its support for proposition θ, as evidence is seldom fully reliable. Let rθ,i,l be the reliability of evidence ei,l pointing to proposition θ. As displayed in Equation (13), rθ,i,l, which essentially measures the quality of ei,l, is defined as the conditional probability of θ being true given that ei,l points to θ [17]. rθ,i,l is related to how data are generated and how ei,l is acquired from data [17].
Using Equation (13), we can further obtain the reliability of a piece of evidence ei,l, which is displayed in Equation (14).
To consider reliability in the process of combining multiple pieces of evidence, it is necessary to use the probability mass to combine evidence and reliability. There are two possible scenarios about generating the probability mass for the proposition θ being supported by ei,l, in terms of whether ei,l and other pieces of evidence are acquired from the same data source. These scenarios are shown as follows.
In Equation (16), wθ,i,l = ωi,l p_l(θ|ei,l (θ)) denotes the weight of an evidential element ei,l (θ). wθ,i,l is proportional to the conditional probability of θ being true provided that ei,l points to θ [17]. The conditional probability is measured by a probability function p_l built from data for x_l only [17]. Of note is that if p = p_l, then wθ,i,l = rθ,i,l, which indicates that ωi,l = 1.
In each classification experiment of this paper, the associated data set is obtained from a single data source. Hence, in each classification experiment, the reliability (rθ,i,l) and weight (wθ,i,l) for any evidential element are essentially the same. To make it simple, rθ,i,l = wθ,i,l, which is used in the process of evidence combination.
Based upon the above definitions and discussions, the conjunctive MAKER rule is employed to combine multiple pieces of evidence to generate the combined probabilities for an evidence combination or antecedent of a belief rule. Equations (17) and (18) display the conjunctive MAKER rule to obtain pθ,(2), which represents the combined probability for the proposition θ being jointly supported by a pair of evidence ei,l and ej,m.
In Equations (17) and (18),
Using the conjunctive MAKER rule, the combined probabilities of classes of the output variable can be generated for all the possible combinations of multiple pieces of evidence in a data set. Each possible evidence combination and its associated combined probabilities are used to generate a belief rule. All the possible belief rules constitute a BRB. In the Iris data set, there are four input variables. We can define three pieces of evidence at the associated referential values of each input variable, which are displayed in Table 1. Thus, there are 3^4 = 81 possible combinations of four pieces of evidence. Each evidence combination and its associated combined probabilities are used to form a belief rule, which is exhibited in Table S2 of the supplementary materials. For example, four pieces of evidence defined at 7.7000 of sepal length, 4.4000 of sepal width, 6.7000 of petal length, and 2.5000 of petal width are combined to generate probabilities 0.0066, 0.0015, and 0.9919 matching classes “1”, “2”, and “3”, respectively. Hence, the associated belief rule is that if sepal length is 7.7000, and sepal width is 4.4000, and petal length is 6.7000, and petal width is 2.5000, then the probability for Iris setosa is 0.0066, that for Iris versicolor is 0.0015, and that for Iris virginica is 0.9919.
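A pairwise combination of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the paper's implementation: it follows one form of the conjunctive rule reported in the MAKER literature [17], restricts propositions to single classes, uses one scalar reliability per piece of evidence, sets weight equal to reliability, and fixes the reliability ratio `gamma` at 1. The function name `combine_pair` and all numeric inputs are hypothetical.

```python
def combine_pair(p1, p2, alpha, r1, r2, gamma=1.0):
    """Combine two pieces of evidence over singleton classes (assumed form).

    p1, p2: basic probabilities of the two pieces of evidence per class.
    alpha:  interdependence index per class.
    r1, r2: scalar reliabilities; weights are assumed equal to reliabilities.
    gamma:  assumed ratio of joint to marginal reliabilities, fixed at 1 here.
    """
    # Probability masses: weight (= reliability here) times basic probability.
    m1 = [r1 * p for p in p1]
    m2 = [r2 * p for p in p2]
    combined = []
    for k in range(len(p1)):
        bounded = (1.0 - r2) * m1[k] + (1.0 - r1) * m2[k]  # individual support
        joint = gamma * alpha[k] * m1[k] * m2[k]            # joint support
        combined.append(bounded + joint)
    total = sum(combined)
    if total == 0.0:
        raise ValueError("the two pieces of evidence give no joint support")
    return [c / total for c in combined]


# Fully reliable, independent evidence reduces to a normalized product.
print(combine_pair([0.5, 0.5], [0.8, 0.2], alpha=[1.0, 1.0], r1=1.0, r2=1.0))
```

Repeating the pairwise step over four pieces of evidence, one per input variable, would yield the combined probabilities stored in the consequent of one belief rule.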
Based on a BRB, the conjunctive MAKER rule is further used to generate the predicted probabilities of classes of the output variable for an instance of a data set. Each observation of an input variable can activate the two adjacent referential values between which it lies. An instance featuring two input variables can activate 2^2 = 4 combinations of referential values from the two input variables. In other words, it can activate 4 belief rules featuring two input variables. Similarly, in the example of the “training set”, the instance {5.0000, 2.3000, 3.3000, 1.0000} can activate 2^4 = 16 belief rules out of the BRB, which are displayed in Table 8.
Table 8. The belief rules activated by an instance: {5.0000, 2.3000, 3.3000, 1.0000} out of a belief rule base
Equation (19) is used to generate αn,il,jm, which denotes a joint matching degree for an instance characterized by two input variables ({xn,l, xn,m}) matching referential values combinations of a belief rule ({Ai,l, Aj,m}). It can be further extended to the instances featuring more input variables. A joint matching degree indicates the degree to which we should use the activated belief rules to predict the probability for each class of output variable for an instance.
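Rule activation and joint matching degrees can be sketched for the instance {5.0000, 2.3000, 3.3000, 1.0000}. The sketch assumes that the joint matching degree of an instance to a rule antecedent is the product of the per-variable matching degrees, and that each per-variable degree comes from linear interpolation between adjacent referential values. The function names are illustrative; the referential values are those quoted in the text for the example training set.

```python
from itertools import product


def activated(x, refs):
    """Return [(ref_index, matching_degree)] for the two values adjacent to x."""
    for i in range(len(refs) - 1):
        lo, hi = refs[i], refs[i + 1]
        if lo <= x <= hi:
            a = (hi - x) / (hi - lo)
            return [(i, a), (i + 1, 1.0 - a)]
    raise ValueError("observation lies outside the referential range")


def activated_rules(instance, ref_table):
    """Enumerate activated antecedents with their joint matching degrees."""
    per_variable = [activated(x, refs) for x, refs in zip(instance, ref_table)]
    rules = []
    for combo in product(*per_variable):
        antecedent = tuple(i for i, _ in combo)
        joint = 1.0
        for _, a in combo:
            joint *= a  # assumed: joint degree is the product of the degrees
        rules.append((antecedent, joint))
    return rules


ref_table = [
    [4.3000, 4.9991, 7.7000],  # sepal length
    [2.0000, 2.8969, 4.4000],  # sepal width
    [1.0000, 4.4044, 6.7000],  # petal length
    [0.1000, 1.3389, 2.5000],  # petal width
]
rules = activated_rules([5.0, 2.3, 3.3, 1.0], ref_table)
print(len(rules), sum(joint for _, joint in rules))
```

Because the two matching degrees of each variable sum to 1, the joint matching degrees of the 16 activated rules also sum to 1, so they can act as relative weights when the rules are combined.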
In the instance {5.0000, 2.3000, 3.3000, 1.0000}, the observations 5.0000, 2.3000, 3.3000, and 1.0000 activate the pairs of adjacent referential values {4.9991, 7.7000}, {2.0000, 2.8969}, {1.0000, 4.4044}, and {0.1000, 1.3389}, respectively. The matching degree of 5.0000 to 4.9991, that of 2.3000 to 2.8969, that of 3.3000 to 1.0000, and that of 1.0000 to 1.3389 are generated by
Having generated all the associated joint matching degrees to which an instance matches referential values combinations of belief rules, we can combine the activated belief rules to predict the probabilities of classes of output variable for an instance. To combine these belief rules, their reliabilities and weights need to be considered. Let e(L) represent L pieces of evidence. The combination of referential values that e(L) are defined at constitutes the antecedent of a belief rule. Let re(L) and we(L) be the reliability and weight, respectively, of a belief rule. re(L) and we(L) (re(L) = we(L) in this paper) can be initially determined based on expert knowledge or trained using data sets.
Based on the joint matching degree for an instance matching an activated belief rule and the activated one’s reliability (weight), we are able to generate the updated reliability (weight) of the activated belief rule for an instance (represented by ‘
In Equation (20), αn,e(L) indicates the matching degree for an instance matching an activated belief rule, which is generated using the method extended from Equation (19). The updated reliability (weight) helps us consider how reliable and important an activated belief rule is in the combination of activated belief rules. With the associated updated reliability (weight) of a belief rule, we can use the conjunctive MAKER rule to combine the activated belief rules to predict the probabilities for classes of the output variable assigned to an instance. For example, based on the MAKER rule, the belief rules activated by the instance {5.0000, 2.3000, 3.3000, 1.0000} can be combined to generate probabilities 0.1079, 0.8288, and 0.0632 for classes “1”, “2”, and “3”, respectively. In other words, if a flower has a sepal length of 5.0000, a sepal width of 2.3000, a petal length of 3.3000, and a petal width of 1.0000, the predicted probabilities of the flower being Iris setosa, Iris versicolor, and Iris virginica are 0.1079, 0.8288, and 0.0632, respectively.
In the above process, Ai,l, rθ,i,l, wθ,i,l, etc., are the adjustable parameters of models assigned for inference and prediction. These parameters can be trained based on the data sets for classification. An optimal learning model is proposed for parameters training based on the principle of maximizing the likelihood of true class of output variable, which is displayed in Equation (21).
In Equation (22),
As detailed in Section 3.1, the decomposition of the input space is implemented based on the referential-value-based discretization [29, 38]. Continuous data from an input space are discretized using referential values, whereby evidence is generated for inference. For example, as presented in Fig. 2, a three-dimensional input space: x(1) × x(2) × x(3) can be decomposed into 2 × 2 ×2 = 8 cubic local regions using three referential values (including those at the minima and maxima of input variables, e.g., A1,3 and A3,3) for each input variable (signified by an axis of the plot, e.g., the axis of x(2)). Each observation of an instance (x n ) of a data set, which is denoted by a data point in Fig. 2, lies in between two adjacent referential values of an input variable. A piece of evidence is directly acquired from a referential value using statistical analysis: sample casting and likelihoods normalization as shown in Section 3.1, which requires no assumptions about specific statistical input distributions and input-output relationships.

Fig. 2. Decomposition of three-dimensional input space.
The data point (xn) in Fig. 2, which represents an instance of a data set, is located within a local cubic region determined by the magenta points (at the intersections of dotted lines) that denote referential value combinations. Thus, in order to produce the predicted probabilities for an instance, it is necessary to generate the probabilities of classes of the output variable for these referential value combinations. Namely, we need to generate the probabilities for the consequents of the belief rules located at the magenta points. Using the conjunctive MAKER rule (Section 3.3), we can combine multiple pieces of evidence relating to a magenta point to generate the associated belief rule, while considering the interrelationship between each pair of evidence to be combined. This allows us to determine the probabilities of belief rules at the magenta points. All such belief rules located at magenta points in an input space constitute a BRB, or a MAKER-based model, for inference and prediction. The BRB is used to further generate predicted probabilities of classes of the output variable for an instance by combining the activated belief rules at the magenta points using the MAKER rule. In the process of generating predicted probabilities, the matching degree of an instance to the belief rules (shown in Equation (19)) is used to measure the proximity of a data point in Fig. 2 to the magenta points (combinations of referential values). Thus, based on referential values and matching degrees, a complete description is provided for the relative location of a data point in an input space, which represents an instance of a data set. Following this, MAKER-based model parameters can be trained via a machine learning algorithm to minimize the mean squared error (MSE), thus reducing the difference between the predicted and observed probability of a proposition being true.
Under the structure described above, a MAKER-based model is essentially an approximator that combines decomposed submodels, each corresponding to a local region, to describe the general pattern of a numerical system [41]. For each submodel (i.e., local region), an explicit input-output relationship can be formulated [41]. As such, the MAKER-based model established by the proposed probabilistic modeling approach features a unique strong interpretability, which is specified in the following three aspects.
(1) Evidence acquisition is interpretable. Evidence is acquired directly from the referential values of continuous data by statistical analysis, including sample casting and likelihood normalization. Under the MAKER framework, multiple pieces of evidence can be combined and the interdependence between each pair of pieces of evidence can be captured. This interdependence is quantified by an interdependence index based on marginal and joint likelihood functions, rather than being assumed.
(2) Inference mechanism is interpretable. An instance of a continuous classification data set can activate multiple pieces of evidence from different input variables. It is therefore necessary to combine the activated evidence to generate belief rules that reflect the actual information the instance contains. In a BRB built from such belief rules, each given instance activates belief rules, which are further combined to generate a predicted output distribution. Because we can record how changes in the input variables influence the output variable, the BRB inference process is essentially transparent, which ensures that the MAKER-based models are transparent and interpretable.
(3) Parameters determination is interpretable. The parameters of MAKER-based models consist of the referential values of the input variables and the reliabilities (weights) of the evidential elements. Both can be trained by a machine learning algorithm so that the difference between the predicted and observed probabilities of a proposition being true is as small as possible.
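The training objective can be illustrated with a minimal sketch of a mean squared error between predicted and observed class probabilities. The averaging over all instances and classes is an assumption of this illustration, and the paper's exact normalization may differ; the function name `mean_squared_error` is hypothetical.

```python
def mean_squared_error(predicted, observed):
    """Average squared gap between predicted and observed class probabilities.

    Each argument is a list of rows, one row per instance and one entry
    per class. Observed rows are typically one-hot (assumption).
    """
    total = 0.0
    count = 0
    for p_row, o_row in zip(predicted, observed):
        for p, o in zip(p_row, o_row):
            total += (p - o) ** 2
            count += 1
    if count == 0:
        raise ValueError("no probabilities to compare")
    return total / count
```

A training algorithm such as the adapted genetic algorithm would then search over referential values and weights for the setting that gives the smallest value of this objective on the training set.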
These definitions provide a clear path for understanding and evaluating interpretability in the context of machine learning. They may be developed further to guide improvements in model interpretability.
Classification experiments for the MAKER-based model and other models are carried out on three classification data sets: Banana, Haberman’s survival, and Iris (downloaded from the KEEL website: https://sci2s.ugr.es/keel/category.php?cat=clas). These benchmark data sets are useful for validating the MAKER-based model constructed by the proposed modeling approach, and a larger number of data sets will be included in the classification experiments of further research. A comparative analysis is conducted to compare the performance of the models in these experiments. Each classification data set has already been divided into five subsets using distribution optimally balanced stratified cross-validation [42], which ensures that all subsets have a similar class distribution resembling that of the entire data set. One of the five subsets is retained as a test set, and the remaining subsets are used as a training set. This process is repeated five times (folds), with each of the five subsets used exactly once as the test set. Thus, each classification data set is partitioned into five folds of training and test sets for cross-validation.
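The fold structure can be sketched as follows. This is a minimal illustration of plain stratified five-fold partitioning and does not reproduce the additional distribution-balancing step of the method in [42]; the function names and the random seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of each class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

With a hypothetical label list of 25 positive and 75 negative samples, each of the five folds receives 5 positive and 15 negative samples, so every test set mirrors the 1:3 class ratio of the whole set.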
In the comparative analysis, the MAKER-based model constructed using the proposed probabilistic approach is compared with conventional alternative models (listed in Table 9), which comprise several groups of candidate submodels: decision tree, discriminant analysis, logistic regression, support vector machine (SVM), k-nearest neighbor (KNN), ensembles, and naïve Bayes. The candidate submodels of the MAKER-based and conventional alternative models are trained on each training set. Within each group of candidate submodels, the submodel with the highest average training accuracy is chosen as the group representative model. The candidate submodels of the conventional alternative models are trained in the “Classification Learner” application in Matlab, using the default parameters of the application. For example, for the “Complex Tree” submodel, the maximum number of splits is 100 and the split criterion is Gini’s diversity index. For “Quadratic Discriminant”, the associated regularization is diagonal covariance. For “Fine Gaussian SVM”, the box constraint level is 1 and the manual kernel scale is 0.5. For “Weighted KNN”, the number of neighbors is 10, the distance metric is Euclidean, and the distance weight is squared inverse. The testing of the candidate submodels of the conventional alternative models and the visualization of the testing results are implemented with built-in Matlab functions. The training and testing of the MAKER-based submodels are implemented with self-developed Matlab code. Note that, following the stopping criteria proposed by Yao [25], the training of the MAKER-based submodels continues until each input variable of the training sets has five trained referential values, which balances the complexity and accuracy of the model. The representative MAKER-based submodel for the Banana data set has five trained referential values for each input variable, while those for the Haberman’s survival and Iris data sets have one trained referential value for each input variable.
The candidate submodels of conventional alternative models for the classification of the data sets
All the group representative models are then tested on each test set, and the predicted outcomes are generated in the form of probabilities or scores. The predicted outcomes are used to compute the area under the receiver operating characteristic curve (AUROC), which is subsequently employed to compare the classification models. AUROC is one of the most commonly used global indices for classifier evaluation [43]. Although accuracy is widely employed to compare the predictive capability of different classifiers, it completely ignores the probability estimates that most classifiers generate [44]. AUROC is argued to be an improved measure, with higher values indicating greater classification capability (1.0 is the optimum) [44]. A general rule of thumb for using AUROC to judge the classification capability of a classifier [45, 46] is that an AUROC between 0.7 and 0.8 is considered acceptable, one between 0.8 and 0.9 indicates excellent discrimination, and one larger than 0.9 implies outstanding discrimination.
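As a minimal sketch of the evaluation step, AUROC can be computed from predicted scores as the fraction of positive-negative pairs in which the positive instance receives the higher score, with ties counted as one half. The band boundaries and the label for values under 0.7 in `discrimination_label` are assumptions of this illustration, since the rule of thumb does not state how boundary values are assigned; both function names are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    labels hold 1 for the positive class and 0 for the negative class;
    tied scores count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def discrimination_label(value):
    """Map an AUROC to the rule-of-thumb bands quoted in the text."""
    if value > 0.9:
        return "outstanding"
    if value >= 0.8:  # boundary assignment is an assumption
        return "excellent"
    if value >= 0.7:
        return "acceptable"
    return "below acceptable"  # label chosen for this sketch
```

The pairwise form is quadratic in the number of samples, which is adequate for small benchmark test sets; a rank-based computation gives the same value more efficiently on larger sets.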
Tables 10–12 report the AUROCs associated with each representative model for the Banana, Haberman’s survival, and Iris data sets, respectively, as well as the average AUROCs of the models across the five test sets. A comparison of the receiver operating characteristic (ROC) curves of the MAKER-based models with those of the optimum conventional models (in terms of AUROC) is presented in the supplementary material.
The AUROCs of alternative models for the Banana data set
The AUROCs of alternative models for the Haberman’s survival data set
The AUROCs of alternative models for the Iris data set
The average AUROC of the MAKER-based model across the five test sets is 0.95968, which is the second largest among the AUROCs of all models for the Banana data set (Table 10). The weighted KNN model achieves the optimum AUROC for this data set. In addition, although both the logistic regression and naïve Bayes models are interpretable, their average AUROCs are much lower than that of the MAKER-based model. Furthermore, the complex tree model has a slightly lower average AUROC than the MAKER-based model. This indicates that for the complex Banana data set, simple interpretable models (e.g., logistic regression and naïve Bayes) are unable to perform as well as their more complex counterparts (e.g., the complex tree and MAKER-based models).
The average AUROC of the MAKER-based model for the Haberman’s survival data set is 0.69057, again ranking second among all models in terms of AUROC (Table 11). Based on these AUROCs, the classification performance of the MAKER-based model is considered acceptable. Note that the Haberman’s survival data set is imbalanced, with a ratio of positive to negative samples of approximately 1:3, which can affect the classification results. Moreover, the AUROC of the MAKER-based model surpasses those of the complex tree and logistic regression models. These results demonstrate the acceptable classification performance of the MAKER-based model for the Haberman’s survival data set.
To determine the ROC curves and AUROCs of each model for the classification of the Iris data set, Iris Versicolor is taken as the positive class, while Iris Setosa and Iris Virginica are combined into the negative class. Table 12 indicates an average AUROC of 0.9955 for the MAKER-based model, which is the third largest among all the AUROCs for the Iris data set. This indicates the outstanding classification performance of the MAKER-based model for the Iris data set.
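The one-versus-rest relabeling used for the Iris data set can be sketched as below. The class name strings are assumptions about how the labels are encoded in the data file, and the function name `binarize` is hypothetical.

```python
def binarize(labels, positive_class):
    """Relabel a multi-class target as 1 (positive class) or 0 (all others)."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical label strings; the encoding in the actual file may differ.
iris_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
binary_labels = binarize(iris_labels, "Iris-versicolor")
```

The resulting binary labels can then be paired with the predicted probability of the positive class to obtain a single ROC curve per model.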
The AUROCs in Tables 10–12 indicate that the MAKER-based model is an outstanding classifier for the Banana and Iris data sets, and a generally acceptable one for the Haberman’s survival data set. In addition, it generally performs better than other interpretable models such as complex tree, logistic regression, and naïve Bayes. However, the interpretable MAKER-based models constructed by the proposed approach involve higher computational complexity, because the complexity of a BRB grows multiplicatively with the number of referential values of the input variables [47]. Further research is necessary to improve the training efficiency of the MAKER-based models.
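The multiplicative growth can be illustrated with a short sketch that counts the combinations of referential values, and hence the belief rules, in a BRB. It assumes one belief rule per combination, as described for the magenta points in Fig. 2; the counts passed in are hypothetical and the function name `belief_rule_count` is not from the paper.

```python
from math import prod


def belief_rule_count(ref_value_counts):
    """Number of referential value combinations across all input variables.

    ref_value_counts holds the number of referential values per input
    variable; one belief rule per combination is assumed.
    """
    return prod(ref_value_counts)


# Hypothetical counts: five referential values per input variable.
two_inputs = belief_rule_count([5, 5])          # 25 combinations
four_inputs = belief_rule_count([5, 5, 5, 5])   # 625 combinations
```

Doubling the number of input variables in this hypothetical case raises the rule count from 25 to 625, which is why training cost increases quickly with input dimension.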
This paper presents a new probabilistic modeling approach to construct a MAKER-based classifier for interpretable inference and classification. A comparative analysis is conducted between the MAKER-based model built by the proposed modeling approach and conventional alternative models to evaluate their classification performance on the Banana, Haberman’s survival, and Iris data sets. Experimental results demonstrate the general robustness of the MAKER-based model in classifying the data sets. For example, AUROCs of 0.95968, 0.69507, and 0.99550 were obtained for the Banana, Haberman’s survival, and Iris data sets, respectively. The lower value for the Haberman’s survival data set may be attributed to the imbalance between its negative and positive samples.
Furthermore, the MAKER-based model is characterized by a unique strong interpretability, which is specified in three aspects: (1) interpretable evidence acquisition, (2) interpretable inference mechanism, and (3) interpretable parameters determination. This provides a clear definition of “interpretability” in the context of machine learning. The proposed probabilistic modeling approach has great potential for solving different types of modeling and prediction problems in complex systems. However, further research is necessary to handle the multiplicative growth in complexity with the number of referential values of the input variables in a BRB [47], to address the relatively poor sensitivity in the classification of imbalanced data sets (e.g., the Haberman’s survival data set), and to establish MAKER-based models for data sets with an “unknown” class.
Acknowledgments
The authors express their sincere thanks for the support from NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization under Grant No. U1709215, the European Union’s Horizon 2020 Research and Innovation Programme RISE under Grant No. 823759 (REMESH), and the National Natural Science Foundation of China under Grant No. 72071056.
