Abstract
Wang and Ghosh (2011) proposed a Kullback-Leibler divergence (KLD) which is asymptotically equivalent to the KLD by Goutis and Robert (1998) when the reference model (in comparison with a competing fitted model) is correctly specified and when certain regularity conditions hold true. While properties of the KLD by Wang and Ghosh (2011) have been investigated in the Bayesian framework, this paper further explores the property of this KLD in the frequentist framework using four application examples, each fitted by two competing non-nested models.
1 Introduction
The Kullback-Leibler divergence (KLD) is perhaps the most commonly used information criterion for assessing model discrepancy (Akaike, 1974; Bernardo, 1979; Kullback and Leibler, 1951; Lindley, 1956; Schwarz, 1978; Shannon, 1948). A KLD is the expectation of the logarithm of the ratio of the probability density functions (p.d.f.s) of two models: one being a fitted model and the other being the reference model, where the expectation is taken with respect to the reference model. Thus a KLD can be viewed as a measure of the information loss in the fitted model relative to that in the reference model.
KLDs that are suitable for model comparison in the Bayesian framework typically involve the integrated likelihoods of the competing models, where the integrated likelihood under each model is obtained by integrating the likelihood with respect to the prior distribution of model parameters (e.g., Lindley, 1956; Schwarz, 1978). KLDs based on the ratio of integrated likelihoods, however, have been challenged by identifying priors that are compatible under the competing models and that produce proper integrated likelihoods. Goutis and Robert (1998) proposed a Kullback-Leibler projection (henceforth G-R KLD) that overcame the challenges associated with prior elicitation in calculating KLD under the Bayesian framework. More specifically, the G-R KLD is the infimum KLD between the likelihood under the reference model and all possible likelihoods arising from the competing fitted model. When the reference model is correctly specified, the G-R KLD is asymptotically equivalent to the KLD between the reference model and the competing fitted model evaluated at its maximum likelihood estimate (MLE) (ref. Akaike, 1974). The Bayesian estimate of G-R KLD is obtained by integrating the G-R KLD with respect to the posterior distribution of model parameters under the reference model. Thus the Bayesian estimate of G-R KLD is clearly not subject to the drawback due to impropriety of the prior as long as the posterior under the reference model is proper. In addition, G-R KLD is suitable for comparing the predictivity of the competing models since it is calculated with respect to the posterior density of model parameters under the reference model. However, the G-R KLD was originally developed for comparing nested generalized linear models while assuming a known true model, and its extension to general model comparison (e.g., in the frequentist framework) remains limited.
Stemming from the G-R KLD, Wang and Ghosh (2011) proposed a more tractable model discrepancy measure, the Goutis-Robert-Akaike KLD (or G-R-A KLD)—the KLD between the reference model and the competing fitted model evaluated at its MLE. The G-R-A KLD is equivalent to G-R KLD when the reference model is correctly specified. Unlike G-R KLD, the G-R-A KLD does not require the reference model r to be the true model, nor is the true model to be specified. Also, while G-R KLD has been limited to comparing nested generalized linear models, the G-R-A KLD is more flexible for comparing nested or non-nested models that are broader than generalized linear models. There are two appealing asymptotic properties associated with G-R-A KLD (Wang and Ghosh, 2011). First, under certain regularity conditions, the asymptotic expression of the posterior estimator of G-R-A KLD associated with a statistic Tn (pertinent to the inference or model diagnostics) fitted by two competing models is comprised of a leading term for model discrepancy in the mean of Tn, a term for model discrepancy in the variance of Tn and a constant to penalize model complexity, plus a smaller order term. The leading terms in the G-R-A KLD estimates are functions of MLEs, which make them also appropriate for model comparison in the frequentist framework. Second, the G-R-A KLD estimator is intimately related to Bayesian model discrepancy measures based on predictive statistics (Gelman et al., 1996; Guttman, 1967; Rubin, 1984), such as a one-sided weighted posterior predictive p-value (WPPP). The WPPP of Tn evaluates the predictive distribution of Tn under an assumed model f at
While properties of G-R-A KLD have been investigated in the Bayesian framework (Wang and Ghosh, 2011), the goal of this paper is to further explore the utility of G-R-A KLD in the frequentist framework through applying G-R-A KLD to comparing non-nested models in four real application examples. The paper is organized as follows. Section 2 describes the G-R-A KLD and its associated asymptotic properties. Section 3 demonstrates model comparison using G-R-A KLD in four examples.
2 A proposed KLD
2.1 Notations
Throughout this paper, we assume that Xi s are i.i.d. with some common c.d.f. g and parameter(s) θ for θ ∈ Θ
g
, where Θ
g
is a closed set. Denote r for the reference model and f for the fitted model, both governed by θ, where Θ
r
and Θ
f
are the corresponding parameter spaces. Also, a capital letter is used to denote for random variables (or estimators), and the corresponding lower case denotes its realization (or an estimate). Let Un = U(X1, …, Xn) be the base for deriving the posterior density of θ under model r. We shall denote the p.d.f.s of * given θ under model r and f by
2.2 Define Goutis-Robert-Akaike KLD
Suppose that model adequacy is evaluated based on its fit for certain statistics Tn = T(X1, …, Xn) that is pertinent to the inference or model diagnostics. Following Wang and Ghosh (2011), we assess the relative fit between models using the KLD of the distribution of Tn under r and f evaluated at
Wang and Ghosh (2011) refer to (2.1) as the Goutis-Robert-Akaike KLD or G-R-A KLD since (2.1) is asymptotically equivalent to the KLD proposed by Goutis and Robert (1998) when the reference model r is the true model (ref. Akaike, 1974). In general, for each θ ∈ Θr, G-R-A KLD given in (2.1) can be regarded as a measure of the minimal information gain regarding Tn in model r from model f since the minimum information loss under f is achieved at
2.3 Estimation of G-R-A KLD
Estimating (2.1) involves approximating the integral with respect to r(tn|θ) and estimating unknown model parameters θ. To account for the uncertainty of model parameters, Wang and Ghosh (2011) considered the following estimator:
where Un, as a function of (X1, …, Xn), is used for deriving the posterior of θ, and
In what follows, let the O statements be interpreted as ‘almost sure’ statements. Also, define Q(y) ≡ y − log(y) − 1 for y > 0 which has a unique minimum at y = 1. Denote θ for model parameters, and
Throughout the paper, assume the following regularity conditions (A1)−(A5).
(A1) For each x, both log(r(x|θ)) and log(f(x|θ)) are three times continuously differentiable in θ. Further, there exist neighbourhoods Nr (δ) = (θ − δr, θ + δr) and Nf (δ) = (θ − δf, θ + δf) of θ and integrable functions
and
for k = 1, 2, 3.
(A2) For all sufficiently large λ > 0,
and
(A3)
and
as δ → 0.
(A4) The prior density π(θ) is continuously differentiable in a neighbourhood of θ and π (θ) > 0.
(A5) Tn = T(X1, …, Xn) is asymptotically normally distributed under both models r and f with
and
which is further reduced to
when µf(θ) ≠ µr(θ), and
when
(A6) Tn is a p-dimensional statistic (p > 1 and p being independent of n) with
and
Then
when µf (θ) ≠ µr (θ), and
when µr (θ) = µf(θ) and ∑r(θ) ≠ ∑f(θ).
Proofs of Properties 1.1 and 1.2 can be found in Wang and Ghosh (2011).
From the asymptotic results given in (2.6)−(2.9), it is clear that when n is sufficiently large, G-R-A KLD is appropriate for model comparison in the frequentist framework. This is because the leading terms of G-R-A KLD involve functions of MLEs. Specifically,
The above results also suggest ways of choosing Tn in KLDt(r, f |Un) that would yield effective model diagnostics. For example, suppose that the estimate of E(Xi) is unbiased under both models r and f (i.e., µf (θ) = µr (θ)), but Var(Xi) differs between the two models. Then the empirical variance of Xi would be a better choice for Tn compared to the sample mean of Xi since according to (2.6) and (2.7), KLDt(r, f |Un) = Op(n) when Tn is the empirical variance of Xi, while KLDt(r, f|Un) = Op(1) when Tn is the sample mean of Xi.
Regarding the magnitude of G-R-A KLD, since asymptotically, G-R-A KLD differs from the Bayes factor by a term associated with the logarithm of priors under r and f (Lewis and Raftery, 1994), one could use the Jeffreys’ rule (1961) as guidance to assess model discrepancy: ‘worth not more than a bare mention’ if
where r and f are the density functions of Tn under r and f, respectively. Then the asymptotic relationship between KLDt(r, f |Un) and WPPPr(Un) follows:
when µf (θ) ≠ µr(θ); and
and
when
A Proof of Property 1.3 is given in Wang and Ghosh (2011).
Note that WPPP, developed based on the posterior predictive p-value technique previously proposed for assessing absolute model fit (Gelman et al., 1996; Guttman, 1967; Rubin, 1984), is a model discrepancy measure that evaluates the predictivity of model f under model r. The expression given in (2.10) shows that WPPP calculates the predictive p-value of Tn under an assumed model f should Tn originate from the reference model r. The asymptotic relationship between KLDt(r, f |Un) and WPPPr(Un) shown above suggests the common nature between WPPP and G-R-A KLD: a measure for comparing model predictivity. More specifically, when µf (θ) ≠ µr (θ) (i.e., the estimate of the mean of Tn differs between r and f), both KLDt(r, f|Un) and Φ−1(WPPPr(Un)) are of order Op(n). In contrast, suppose that
In terms of practical applications of G-R-A KLD, it is important to note that Properties 1.1 and 1.3 hold under assumptions (A1)−(A5), while Property 1.2 holds under assumptions (A1)−(A4) and (A6). Assumptions (A1)−(A3) are fairly standard, and they can be met by most distributions used in the literature (e.g., the exponential family). Assumption (A4) holds for priors that are commonly used in the literature, such as a (multivariate) normal prior for location parameter(s), a gamma prior for the shape parameter, a Dirichlet prior for the mixture component in a mixture distribution and an inverse Wishart prior for the covariance, where the extent to which the prior is informative is governed by the hyper-parameters associated with precision of the prior distribution. Under (A1)−(A4), the asymptotic normality assumption given in (A5) or (A6) is key for the approximation given in (2.5)−(2.12) to hold. When this normality assumption is violated, results in (2.5)−(2.12) may not be credible. In this case, based on (A1)−(A3) and a reviewer’s comment, we approximate G-R-A KLD by
3 Application of G-R-A KLD in frequentist analyses
In this section, we apply G-R-A KLD to compare non-nested models in four real-world studies. In each of these applications, we assess the model fit to the entire dataset; that is,
Regarding the calculation of G-R-A KLD in these examples, we first verify the assumptions required for (2.6)−(2.9). According to Cox and Hinkley (1974) and Cramer (1946), assumptions (A1)−(A3) are met in Examples 1 and 2 described below; and they can also be met in Examples 3 and 4 under additional restrictions on the model parameters (McLachlan and Basford, 1988). In Bayesian analyses, assumption (A4) can be met by choosing proper noninformative priors. However, (A4) is not intimately relevant when applying G-R-A KLD to frequentist analyses (similar to that when applying the Bayesian information criterion [Schwartz, 1978] to model comparison in frequentist analyses). The asymptotic normality assumption given in (A5) or (A6) does not hold in any of the examples below. However, since n in these examples is sufficiently large, we approximate G-R-A KLD by an empirical estimate:
To assess the performance of G-R-A KLD in these examples, we verify whether the result of G-R-A KLD is consistent with that of alternative model diagnostic tools, such as the quasi-likelihood information criterion (Pan, 2001), the Akaike Information Criterion (Akaike, 1974) and model fit plots.
3.1 Example I: Trend analysis of physician vs resident ratio pre- and post-TORT reform in Texas
In the USA, it is commonly perceived that medical malpractice litigation is one major source for the country’s skyrocketing health care costs and steadily diminishing access to care. In 2003, the state of Texas imposed a cap on the amount of non-economic damages patients could recover from doctors and shielded emergency room doctors from liability except in cases of ‘willful and wanton’ acts of negligence (see
The purpose of our analyses is to assess whether the temporal trend of the number of doctors practicing per capita in Texas between 1995 and 2011 changed after 2003. To this end, we considered statistical models to characterize the temporal trend of the proportion of physicians among Texas residents (per year) between 1995 and 2011. Two non-nested model candidates with the same number of parameters were compared: a generalized linear mixed effect model (GLMM) assuming a binomial distribution and the logit link vs GLMM assuming a binomial distribution with the probit link. That is, r(xi | θ) : log {Pr(xi = 1|t)/Pr(xi = 0|t)} = θ0 + θ1 (t − 1995) + θ2 1[t ≥ 2003] + θ3 (t − 2003)1[t≥ 2003] + δ
it
vs f(xi | θ) : Φ− (Pr(xi = 1|t) θ0 + θ1 (t − 1995) + θ21[t ≥ 2003] + θ3 (t − 2003)1[t≥ 2003] + δ
it
, where xi is the indicator for being a physician, t for year and δ
it
for the random effect following a normal distribution. The MLEs are:

The NHANES 2003–2004 hospitalization dataset
3.2 Example 2: Hospitalization in The National Health and Nutrition Examination Survey (NHANES)
The National Health and Nutrition Examination Survey (NHANES) of 2003−2004 has obtained hospitalization information from 10,117 individuals. Table 1 shows that around 10% of these individuals were hospitalized in the 12-month survey period. The objective of our analyses was to model the distribution of the number of hospitalizations per person in a 12-month period. Based on the skewed distribution of the hospitalization count, and that the mean (0.147) is smaller than the variance (0.302), a zero-inflated Poisson (or ZIP) model could be appropriate: a portion of the population were more likely to be hospitalized, and these count data can be fitted by a Poisson distribution, while the rest of the population had a very low hospitalization incidence. Alternatively, the negative binomial (NB) distribution is also plausible for fitting the skewed distribution of the hospitalization count. Both model candidates are generalization of the Poisson distribution, but they are not nested. To some extent, the NB model is more flexible than the ZIP model since it allows the Poisson rate to follow a continuous mixture distribution while ZIP restricts it to be a two-component mixture. Thus in our G-R-A KLD assessment, we let the ZIP model be model f, which arises as a discrete mixture of a point mass at zero with probability θ2 and a Poisson distribution with probability (1 − θ2), where the point mass portion is posited to address excess 0’s in the distribution. Denote the NB model by r, which arises as a continuous mixture of Poisson distributions (i.e., a compound probability distribution), where the mixing distribution of the Poisson rate is distributed according to
3.3 Example 3: The Johns Hopkins PIRC Study
This example used data from a school intervention trial conducted by the Johns Hopkins University Preventive Intervention Research Center (hereafter PIRC) in 1993−1994 (Ialongo et al., 1999). The PIRC study was designed to improve academic achievement and to reduce early behavioural problems. Teachers and first-grade children were randomly assigned to the control and intervention conditions (either the Classroom-Centered intervention or the Family-School Partnership intervention). The TOCA-R or Teacher Observation of Classroom Adaptation-Revised (Werthamer-Larsson et al., 1991) measures, each item with scale ranging from 1 to 6, were assessed at baseline, the 6-month follow-up and the 18-month follow-up.

In this paper, we focused on using growth mixture modelling (GMM; Jo et al., 2009) to identify distinct trajectories of TOCA-R attention deficit (AD) measures in the control group (n = 184)—this analysis is critical to deriving credible causal inference associated with the intervention effect under GMM framework. The TOCA-R AD score consists of TOCA-R items that measure hyperactivity, concentration problem and impulsiveness. Denote Xi1, Xi2 and Xi3 for the AD score measured at baseline, the 6-month follow-up and the 18-month follow-up, respectively. Missing data on AD assessments were analyzed under the missing at random (MAR) assumption (Little and Rubin, 2002) using the likelihood of the observed data. Covariates included in the analyses were child’s sex (zi1 = 1 for male, zi1 = 0 for female), ethnicity (zi2 = 1 for African-American, zi2 = 0 other), parent’s education (zi3 = 1 some education beyond high school, zi3 = 0 high school or less), marital status (zi4 = 1 for married, zi4 = 0 for not married) and employment status (zi5 = 1 for employed, zi5 = 0 for not employed). GMM with two classes of quadratic AD trajectories were employed (Jo et al., 2009). Model r included covariates as predictors for AD trajectory class membership:
for s = 1, 2, where esi ∼ N(0, Ω) and
for s = 1, 2, where esi ∼ N(0, Ω). Note that rigorously speaking, model f is not nested within model r. This is because S is not observed, and that S derived from the two models can be different. MLEs under f are:

3.4 Example 4: A longitudinal study of substance use in adolescents
This example demonstrates analyses of longitudinal tobacco use in a diverse sample of adolescents from age 11 to 21 years derived from a longitudinal study of substance use in adolescents (Dishion and Kavanagh, 2000). The study began in 1996, recruiting all students from 6th grade classrooms in three schools. Half of the classrooms within these schools received a class-wide intervention program. Participants in the study completed questionnaires once a year in their classroom from 6th grade through 9th grade, and then in 11th grade, 12th grade and at ages 19, 22, and 23 years via e-mail responses (a total of eight repeated measures).
In this paper, we present analyses of repeated measures tobacco use in 498 subjects in the control classrooms. We consider two model candidates, both are suitable for characterizing count data with excessive zeros: a mixture of six distinct ZIP distributions (henceforth referred to as ZIP mixture) and a mixture of six distinct 2-part distributions by Olsen and Schafer (2001), which is referred to as 2-part mixture below. Compared to the 2-part mixture model, the ZIP mixture model is scientifically more plausible since (i) ZIP is more natural for characterizing the discrete count data and (ii) according to the literature on smoking (Dishion and Kavanagh, 2000), ‘0’ tobacco users at a time point can consist of two types of subjects: abstainers and occasional (or infrequent) users, which is consistent with the ZIP model assumption, while the 2-part model limits ‘0’ tobacco users solely to abstainers. Thus in our G-R-A KLD calculation, we let model r be the ZIP mixture and model f be the 2-part mixture. That is, model r assumes:
for s = 1, … , 6; and model f assumes:
for s = 1, …, 6 , where es,ij ∼ N(0, Ω
jj
), with Ω
jj
being the jth diagonal element of Ω, the covariance matrix of (es,i1, …, es,i8). Following the argument in Example 3, we employed the pseudoclass technique to calculate KLDt(r, f |Un). We first calculated estimates of S for each individual under both r and f by drawing a random sample from a multinomial distribution with probabilities being the individual’s posterior class probabilities {Pr(Si = s|Xi,


4 Summary
Researchers often face the challenge of choosing between competing models that are scientifically plausible but non-nested. This paper shows that the G-R-A KLD is appropriate for comparing non-nested models in the frequentist framework. Our work draws on asymptotic properties of G-R-A KLD shown in Section 2: the G-R-A KLD quantifies information discrepancy between competing models in terms of their MLEs. Four novel examples given in Section 3 have confirmed the utility of G-R-A KLD as a measure of model discrepancy in the frequentist framework: the result due to G-R-A KLD was consistent with that due to AIC and model diagnostic plots. We found emerging conclusions between G-R-A KLD and AIC on model comparison in these four examples with moderate to large sample sizes, which could be due to that the similar asymptotics revealed in these examples. As a preliminary investigation of the finite sample property of the proposed G-R-A KLD for scenarios considered in Section 3, we have conducted simulations to assess the extent to which model discrepancy is revealed by G-R-A KLD under various situations. Based on the recommendation by Jeffreys (1961), one may regard that there is a ‘decisive evidence’ of model discrepancy if the expected G-R-A KLD exceeds 5. In our simulations, the result of each scenario was drawn from 100 independently simulated datasets of size ranging from 50 to 2000. For practical reasons, we only consider situations where the reference model is the true model. Findings of our simulations are summarized below.
To differentiate between models for a binary outcome with a logistic link vs a probit link using G-R-A KLD, for a given k0 > 0 the minimum sample size required to achieve KLDt(r, f|Un) > k0 depends on the true prevalence of the binary outcome (say θ): a larger sample size is required if θ is closer to 0.5 (see Figure 6a). For example, n = 100, 200, 300, 700 is required to achieve KLDt(r, f |Un) > 5 for
For a continuous outcome with a bell-shaped distribution, to differentiate the fit of a t-distribution from the fit of a 2-normal mixture (with the same mean but different variances between the two mixture components) using G-R-A KLD, it was shown that a larger sample size is required to achieve KLDt (r, f |Un) > k0 (where k0 > 0) when there is less discrepancy in the variance between two components of the normal mixture (see Figure 6b). For example, n = 50 is adequate to achieve KLDt(r, f|Un) > 5 if the more variable class in the normal mixture distribution has a variance (say
For count data with a right-skewed distribution, to compare the fit between a ZIP model and a negative binomial model using G-R-A KLD, the minimum sample size required to achieve KLDt(r, f|Un) > k0 (where k0 > 0) depends on the proportion of “excess zeros” (i.e.,

To sum up, our preliminary simulations suggest that the finite-sample performance of G-R-A KLD depends on the discrepancy and the underlying distributions of the models under comparison. Further investigation of the consistency between G-R-A KLD and other model fit indices (such as AIC) is needed for the finite sample situation. For example, while G-R-A KLD was originally developed from the Bayesian framework, it is not yet clear the extent to which (2.6)−(2.9), (2.11) and (2.12) would hold under different choices of priors when the sample size is limited.
In general, the G-R-A KLD provides the relative fit between competing models without requiring the true model to be specified (see (2.5)−(2.7) which show that G-R-A KLD estimate only involves MLEs under models r and f). This implies that G-R-A KLD is a practical measure for model discrepancy since it does not need to assume the true model to yield a meaningful interpretation regarding model discrepancy. For example, when the superior model is designated as model r and the inferior model is designated as model f in the G-R-A KLD calculation (like G-R KLD and G-R-A KLD considered herein), then the resulting G-R-A KLD can be interpreted as a measure of the minimum information gain in model r from model f (since the inferior model f is evaluated at the MLEs). In contrast, if the superior model is designated as model f (hence the inferior model is designated as model r) in the G-R-A KLD calculation, then the resulting G-R-A KLD would have been a measure of the maximum discrepancy between model r and model f (since the superior model f is evaluated at the MLEs). Further research is needed to help us better understand this type of KLD.
Finally, for the purpose of assessing model adequacy (rather than relative model fit), G-R-A KLD should be used in conjunction with absolute model departure indices such as residuals. Nevertheless, the G-R-A KLD can be regarded as a measure of the absolute fit (or adequacy) of model f when the superior model r is the true model.
Footnotes
Acknowledgements
Wang’s work on this paper was partially supported by K25-DK075092 and R21-CA161180. Jo’s work on this paper is partially supported by R01-DA031698 and R01-MH086043. We thank Dr Ron Stewart, Dr Nick Ialongo and Dr Tom Dishion for generously providing the datasets for Examples 1, 3 and 4, respectively. We thank the reviewers for their helpful comments.
