Bayesian semiparametric latent variable model with DP prior for joint analysis: Implementation with nimble

Abstract

Multiple responses of mixed types arise naturally in many data analysis problems and should be analysed jointly to gain efficiency. As an efficient approach to joint modelling, the latent variable model induces dependence among the mixed outcomes through a shared latent variable. The latent variable is generally assumed to be normal, an assumption that is often neither flexible nor realistic in practice. This tutorial article demonstrates how to jointly analyse mixed continuous and ordinal responses using a semiparametric latent variable model in which the latent variable follows a Dirichlet process (DP) prior, and illustrates how to implement Bayesian inference with the powerful R package nimble. Two model comparison criteria, the deviance information criterion (DIC) and the logarithm of the pseudo-marginal likelihood (LPML), are employed for model selection. Simulated data and data from a social survey study are used to illustrate the proposed method with nimble. An extension of the DP prior to a DP mixture prior is introduced as well.

Keywords

mixed continuous and ordinal responses, joint modelling, Dirichlet process prior, tutorial

1 Introduction

Multivariate responses of mixed data types are common in a variety of data analysis contexts, for example when a survey measures multiple aspects of a topic of interest. To account for the relationships among multiple responses, joint modelling is preferred over separate analyses because of its efficiency gains (McCulloch, 2008). The key step in joint modelling is to build a joint distribution for the multivariate responses, which is not straightforward when the responses are of mixed data types. One approach employs copulas to model the joint distribution, which is relatively difficult to implement (Song and Song, 2007). Another approach factorizes the joint distribution of the multivariate responses, for example, by specifying the joint distribution of mixed continuous and discrete responses as the product of a marginal distribution of one set of responses and a conditional distribution of the other set. This approach is easy to understand and implement, but its main drawback is that different directions of factorization may lead to different results (Wu, 2013). The latent variable model (LVM), also referred to as the generalized latent trait model in Sammel et al. (1997) and Moustaki and Knott (2000), is another useful tool for jointly modelling mixed responses, and it is relatively straightforward and easy to implement. In an LVM, a shared latent variable induces dependence among the mixed continuous and discrete outcomes. Under the assumption that the outcomes are conditionally independent given this latent variable, a joint model of the mixed outcomes can be built. For parameter estimation in LVMs, both the expectation–maximization (EM) algorithm and the Bayesian approach are used in the literature. In this article, we employ the Bayesian approach for inference because of its flexibility.

Generally, the latent variable in an LVM is assumed to follow a normal distribution. However, this assumption may not always hold in practice. To overcome the limitation of the normality assumption, robust methods have been proposed in the literature, including the multivariate t-distribution, the skew-normal distribution and nonparametric priors. The multivariate t-distribution is usually used for non-normal data with symmetric heavy tails (Shapiro and Browne, 1987; Kano et al., 1993; Lee and Xia, 2006). The skew-normal distribution is able to model skewed, asymmetric continuous responses through a parametric distribution (Azzalini, 1985; Azzalini and Capitanio, 1999; Lin et al., 2009; Baghfalaki and Ganjali, 2011; Lu and Huang, 2014; Teimourian et al., 2015). Both of these methods are parametric and therefore less flexible than nonparametric methods. In the Bayesian framework, nonparametric priors provide a flexible way to address uncertainty about distributions; see Müller et al. (2015) for a review. In practice, the Dirichlet process (DP) prior is a widely used nonparametric prior in Bayesian analysis; see, for example, Ferguson (1973), Antoniak (1974), Escobar (1994) and MacEachern and Müller (1998), among others. To allow direct inference on the posterior distributions of more general functions, Ishwaran and Zarepour (2000) proposed a truncated version of the DP, while Ishwaran and Zarepour (2001) introduced a stick-breaking representation and a blocked Gibbs sampler for Bayesian inference to achieve practical and computational convenience. Some applications of the DP prior are given by Lee et al. (2008), Gill and Casella (2009), Hwang and Pennell (2014) and Xia and Gou (2016).

In this tutorial article, we consider an LVM for mixed continuous and ordinal responses to fit data from the Chinese General Social Survey (CGSS) 2013. To relax the normality assumption on the latent variable, a DP prior is assigned. Bayesian inference for this semiparametric LVM, using a finite-dimensional approximation of the DP prior, is carried out in nimble. In practice, Bayesian inference is usually implemented in software packages such as WinBUGS (Spiegelhalter et al., 2003), JAGS (Plummer, 2003) and Stan (Team, 2017). WinBUGS is quite powerful and can handle various types of problems, but its Markov chains can converge slowly for large, hierarchically structured datasets. JAGS, similar to WinBUGS, is an open-source implementation of the BUGS model specification and can be interfaced with R. Stan is another open-source software package with functionality similar to WinBUGS, but it uses a more complicated Markov chain Monte Carlo (MCMC) algorithm, which allows it to converge quickly under complex model settings (Ma and Chen, 2018). However, it is not straightforward to deal with discrete variables in Stan. Considering the complexity of our problem, we use a relatively new and powerful R package, nimble (de Valpine et al., 2017), for Bayesian inference. nimble is an R package for programming with BUGS models using syntax similar to WinBUGS and JAGS, but with more flexibility in defining the models and algorithms. Users can operate from within R, and nimble generates C++ code for faster computation.

The aim of this tutorial article is to show how to jointly analyse mixed continuous and ordinal responses using a Bayesian semiparametric latent variable model in which the shared latent variable follows a DP prior, and how to implement the model in the relatively new and powerful R package nimble. Data from CGSS 2013 are used for illustration. The proposed semiparametric LVM is introduced in Section 2. The procedures for Bayesian inference and model selection are provided in Section 3. In Sections 4 and 5, a simulation study and an empirical analysis are carried out, respectively, to illustrate the proposed methods in nimble. An extension of the DP prior to a DP mixture prior is introduced in Section 6. Finally, conclusions and discussion are given in Section 7.

2 Model specification

Consider a dataset with N subjects and K outcome variables, such that the first K₁ outcome variables are continuous, while the remaining $K - K_{1}$ outcome variables are ordinal. For $i = 1, \dots, N$ , the continuous measurement of the ith subject for the kth variable ( $k = 1, \dots, K_{1}$ ) is denoted by $Y_{ik}$ , while the ordinal measurement of the ith subject for the kth variable ( $k = K_{1} + 1, \dots, K$ ) with $C_{k}$ levels is denoted by $Z_{ik}$ . The mixed outcome variables can be collected as $\{Y = (Y_{1}, \dots, Y_{K_{1}})^{'}, Z = (Z_{K_{1} + 1}, \dots, Z_{K})^{'}\}$ , where $Y_{k} = (Y_{1k}, \dots, Y_{Nk})^{'}$ and $Z_{k} = (Z_{1k}, \dots, Z_{Nk})^{'}$ . Since Y and Z may be correlated, an LVM is built to model these mixed outcomes jointly.

Let L denote the shared latent variable in the LVM, and assume that all the mixed outcomes are conditionally independent given this latent variable. The joint distribution of these mixed continuous and ordinal outcomes is given by

\begin{matrix} f (Y, Z) & = \int f (Y, Z, L) dL = \int f (Y, Z | L) f (L) dL = \int f (Y | L) f (Z | L) f (L) dL \\ & = \int \prod_{k = 1}^{K_{1}} f (Y_{k} | L) \prod_{k = K_{1} + 1}^{K} f (Z_{k} | L) f (L) dL, \end{matrix}

where $f (Y_{k} | L)$ and $f (Z_{k} | L)$ denote the conditional distributions of $Y_{k}$ and $Z_{k}$ given L, respectively, and $f (L)$ denotes the density of the latent variable L.

For the continuous outcome variable $Y_{k}$ ( $k = 1, \dots, K_{1}$ ), a normal distribution is assigned as

Y_{k} | L \sim N (μ_{yk}, τ_{yk}^{- 1}),

μ_{yk} = β_{0 k} + β_{k} X + L,

(2.1)

where X represents the covariate vector and $β_{k}$ denotes the corresponding coefficient vector.

For the ordinal outcome variable $Z_{k}$ , an ordered probit distribution is assigned. The ordinal outcome variable is on a scale of $1, \dots, C_{k}$ , for which $C_{k} - 1$ thresholds are set up as $θ_{1}, \dots, θ_{C_{k} - 1}$ . This set of thresholds divides the real line into $C_{k}$ disjoint segments, corresponding to the $C_{k}$ categorical levels. Let $Z_{k}^{*}$ denote the underlying continuous variable of $Z_{k}$ , assumed to be normal with mean $μ_{zk}$ and unit variance; then for $i = 1, \dots, N$ and $k = K_{1} + 1, \dots, K$ , the ordinal measurement can be written as

Z_{k} = \{\begin{matrix} 1, & Z_{k}^{*} < θ_{1} \\ 2, & θ_{1} \leq Z_{k}^{*} < θ_{2} \\ ⋮ & ⋮ \\ C_{k}, & θ_{C_{k} - 1} \leq Z_{k}^{*} \end{matrix}

The $C_{k} - 1$ thresholds are strictly increasing such that $θ_{1} < \dots < θ_{C_{k} - 1}$ . In an abbreviated form, with the conventions $θ_{0} = - \infty$ and $θ_{C_{k}} = + \infty$ , $Z_{k} = c$ if $θ_{c - 1} \leq Z_{k}^{*} < θ_{c}$ , for $c = 1, \dots, C_{k}$ . The ordered probit distribution for $Z_{k}$ is given by

P (Z_{k} = c | μ_{zk}) = P (θ_{c - 1} \leq Z_{k}^{*} < θ_{c}) = Φ (θ_{c} - μ_{zk}) - Φ (θ_{c - 1} - μ_{zk}),

μ_{zk} = α X + L,

(2.2)

where $c \in \{1, \dots, C_{k}\}$ , Φ is the cumulative distribution function of the standard normal distribution, the intercept is set to 0 for identification and α is the coefficient vector of the covariate vector X .
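As a small numerical illustration of Equation (2.2), the following base R sketch computes the category probabilities of an ordered probit model. The function name `ordered_probit_probs` and the numeric values are illustrative assumptions and are not taken from the analysis.

```r
# Category probabilities P(Z = c | mu) = Phi(theta_c - mu) - Phi(theta_{c-1} - mu),
# using the conventions theta_0 = -Inf and theta_C = +Inf.
ordered_probit_probs <- function(mu, theta) {
  stopifnot(!is.unsorted(theta, strictly = TRUE))  # thresholds must increase
  cuts <- c(-Inf, theta, Inf)
  pnorm(cuts[-1] - mu) - pnorm(cuts[-length(cuts)] - mu)
}

# Hypothetical values: three levels, thresholds at -1 and 1, linear predictor 0.3
p <- ordered_probit_probs(mu = 0.3, theta = c(-1, 1))
print(round(p, 4))
```

The probabilities are differences of the normal distribution function at consecutive thresholds, so they are positive whenever the thresholds are strictly increasing and they always sum to one.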

In classic LVMs, the latent variable L is generally assumed to follow a normal distribution. However, this normality assumption may not always hold in practice. To overcome this limitation, we consider a semiparametric approach that assigns a DP prior (Ferguson, 1973) to the distribution of L. Suppose that the latent variable L follows an unknown distribution G, that is, $L \sim G$ . For the random distribution function G, a DP model is assigned with a positive real number κ and a continuous distribution function G₀, where G₀ is the base distribution around which G is centred. This is denoted by

\begin{matrix} G \sim DP (κ, G_{0}), \end{matrix}

where the concentration parameter κ represents the weight given to the base distribution. A realization of the DP can be written as an infinite mixture of point masses; Sethuraman (1994) made this precise by providing a constructive definition of the DP called the stick-breaking construction. It is given by

\begin{matrix} θ_{j} \sim G_{0}, G = \sum_{j = 1}^{\infty} π_{j} δ_{θ_{j}} (\cdot), \end{matrix}

where $θ_{j}$ is the jth atom, that is, a possible value of L drawn from G₀ (not to be confused with the thresholds $θ_{c}$ of the ordered probit model), $δ_{θ_{j}} (\cdot)$ denotes a discrete probability measure concentrated at $θ_{j}$ , and $π_{j}$ denotes a random probability weight between 0 and 1. For practical purposes, Ishwaran and James (2001) proposed a truncated version of the DP as

G \approx p_{J} (\cdot) = \sum_{j = 1}^{J} π_{j} δ_{θ_{j}} (\cdot), 1 < J < \infty,

where $π = \{π_{j} : j = 1, \dots, J\}$ is a random vector that can be obtained as

π_{1} = V_{1}, π_{j} = (1 - V_{1}) \dots (1 - V_{j - 1}) V_{j} (j = 2, \dots, J - 1), π_{J} = (1 - V_{1}) \dots (1 - V_{J - 1}) .

(2.3)

The weights satisfy $0 \leq π_{j} \leq 1$ and $\sum_{j = 1}^{J} π_{j} = 1$ . Following Sethuraman (1994), the $V_{j}$ can be taken to be independent $Beta (1, κ)$ variables, and the base distribution G₀ can be assumed to be normal, Cauchy, logistic or another distribution.
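The truncated stick-breaking construction in Equation (2.3) can be sketched in a few lines of base R. The helper name `stick_breaking_weights`, the truncation level J = 20 and the value κ = 1 are assumptions chosen only for illustration.

```r
# Truncated stick-breaking weights, Equation (2.3).
# V holds V_1, ..., V_{J-1}; the last weight takes whatever stick is left.
stick_breaking_weights <- function(V) {
  remaining <- cumprod(1 - V)   # (1 - V_1) ... (1 - V_j) for j = 1, ..., J - 1
  c(V, 1) * c(1, remaining)
}

set.seed(42)
J     <- 20                     # truncation level (assumed value)
kappa <- 1                      # concentration parameter (assumed value)
V     <- rbeta(J - 1, 1, kappa)
w     <- stick_breaking_weights(V)
atoms <- rnorm(J)               # atoms drawn from the base distribution N(0, 1)

# Draw 1000 latent values from the discrete measure sum_j w_j * delta_{atoms_j}
L <- sample(atoms, size = 1000, replace = TRUE, prob = w)
```

Because the last weight absorbs the remaining stick, the weights sum to one for any truncation level, and every draw of L coincides with one of the J atoms.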

3 Bayesian analysis

3.1 Bayesian inference

Suppose Ω denotes the unknown parameter vector and $D = \{Y, Z, X\}$ denotes the observed data; then the joint posterior distribution of the unknown parameters can be written as

\begin{matrix} f (Ω | D) \propto f (D | Ω) f (Ω), \end{matrix}

where $f (D | Ω)$ represents the likelihood function of the model and $f (Ω)$ denotes the joint prior distribution of the parameters. In practice, $f (Ω)$ is usually specified as a product of the prior distributions of the parameters by assuming independence among these parameters a priori.

The response model described in Section 2 can be written as follows:

\begin{matrix} Y_{k} | β_{0 k}, β_{k}, τ_{yk}, L \sim N (μ_{yk}, τ_{yk}^{- 1}), \\ Z_{k} | α, θ_{c}, L \sim OP (μ_{zk}), \\ L | G \sim G, G | κ, G_{0} \sim DP (κ, G_{0}), \end{matrix}

where $OP$ denotes the ordered probit distribution, $μ_{yk}$ and $μ_{zk}$ are given in Equations (2.1) and (2.2), respectively, and DP is constructed using the stick-breaking representation in Equation (2.3). In the Bayesian framework, it is necessary to specify prior distributions for the unknown parameters. For the latent variable, a DP prior is assigned for L. For the other parameters, the following prior distributions are given, for $k = 1, \dots, K_{1}$ :

\begin{matrix} β_{0 k} \sim N (0, 0.001^{- 1}), β_{mk} \sim N (0, 0.001^{- 1}), β_{mk} \in β_{k}, \\ τ_{yk} \sim Gamma (0.001, 0.001) . \end{matrix}

For $k = K_{1} + 1, \dots, K$ and $c = 1, \dots, C_{k} - 1$ ,

\begin{matrix} α_{mk} \sim N (0, 0.001^{- 1}), α_{mk} \in α_{k}, \\ θ_{c} \sim N (0, 0.001^{- 1}) I (θ_{c - 1}, θ_{c + 1}) . \end{matrix}

For $j = 1, \dots, J$ ,

\begin{matrix} θ_{j} \sim N (0, 1), V_{j} \sim Beta (1, κ), κ \sim Gamma (1, 1) . \end{matrix}

To obtain the posterior distributions, the full conditional distributions of these parameters are derived first. Posterior samples can then be drawn from the respective full conditional distributions through MCMC algorithms, which are implemented in software packages such as WinBUGS and JAGS. In this article, we use the R package nimble for estimation and briefly compare it with another common package, JAGS, in terms of computational efficiency.

To obtain reliable posterior estimates from MCMC samples, it is necessary to monitor and assess the convergence behaviour of the Markov chains. Only samples obtained after the Markov chains have converged to their stationary distributions should be kept for posterior estimation. Either graphical tools or summary statistics can be used to monitor convergence. Gelman and Rubin's convergence diagnostic (Gelman and Rubin, 1992) is a general measure that is computed from multiple Markov chains. However, for complex models and large datasets it is time-consuming to run multiple chains, so graphical tools are an alternative for assessing convergence. Trace plots and plots of autocorrelation functions (ACFs) are two important tools. When the sampling path in the trace plot of a Markov chain shows no trend and the autocorrelations in the ACF plot are low, the Markov chain can be regarded as having converged.
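For readers who want to see what these diagnostics compute, the base R sketch below evaluates a basic form of the Gelman and Rubin statistic and a lag-1 autocorrelation. It is a simplified version (without the degrees-of-freedom correction that some software applies), the function name `gelman_rubin` is an assumption, and the draws are random placeholders rather than output from the model in this article.

```r
# chains: matrix with one column per chain and one row per saved iteration
gelman_rubin <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # average within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled estimate of the posterior variance
  sqrt(var_plus / W)                    # values close to 1 suggest convergence
}

set.seed(1)
chains <- matrix(rnorm(4 * 1000), nrow = 1000, ncol = 4)  # placeholder draws
rhat   <- gelman_rubin(chains)

# Lag-1 autocorrelation of the first chain; acf(chains[, 1]) would plot the full ACF
lag1 <- acf(chains[, 1], plot = FALSE)$acf[2]

# A trace plot of a single chain can be drawn with:
# plot(chains[, 1], type = "l", xlab = "Iteration", ylab = "Draw")
```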

3.2 Model selection

Model selection is an important topic in statistical modelling since the true model is actually unknown. Here we employ two commonly used criteria, DIC and LPML, for model comparison. DIC is defined as

\begin{matrix} DIC & = Dev (\hat{Ω}) + 2 p_{D} = 2 E [Dev (Ω)] - Dev (\hat{Ω}), \end{matrix}

where $Dev (Ω)$ is the deviance function, $p_{D}$ is the effective number of model parameters and $\hat{Ω}$ is the posterior mean of the parameters. Since it is computationally intensive to integrate out the latent variable L and obtain a closed form of the observed likelihood, here we employ the conditional DIC proposed by Celeux et al. (2006) and calculate the DIC based on the conditional likelihood. For simplicity, we assume that $K_{1} = 1$ and $K = 2$ ; then the deviance function of our proposed model is given as

\begin{matrix} Dev (Ω) & = - 2 \log f (Y, Z | Ω, X, L) = - 2 \log [f (Y | Ω, X, L) f (Z | Ω, X, L)] \\ & = - 2 [\log f (Y | Ω, X, L) + \log f (Z | Ω, X, L)] \\ & = - 2 [(- \frac{N}{2} \log (2 π) + \frac{N}{2} \log (τ_{y}) - \frac{τ_{y}}{2} (Y - β_{0} - β X - L)^{'} (Y - β_{0} - β X - L)) \\ & + \sum_{i = 1}^{N} \sum_{c = 1}^{C} I (Z_{i} = c) \log (Φ (θ_{c} - (α X_{i} + L_{i})) - Φ (θ_{c - 1} - (α X_{i} + L_{i})))] . \end{matrix}
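The conditional deviance can be evaluated directly from its two parts, the normal log-density of Y with precision $τ_{y}$ and the ordered probit log-probabilities of Z. The base R sketch below is one way to do this; the function names `cond_deviance` and `dic`, and the toy inputs, are assumptions made for illustration.

```r
# Conditional deviance for one continuous and one ordinal response (K1 = 1, K = 2).
# tau is a precision, so the standard deviation of Y is 1 / sqrt(tau).
cond_deviance <- function(Y, Z, X, L, beta0, beta, alpha, theta, tau) {
  mu_y <- beta0 + as.vector(X %*% beta) + L
  mu_z <- as.vector(X %*% alpha) + L
  cuts <- c(-Inf, theta, Inf)                       # theta_0 = -Inf, theta_C = +Inf
  p_z  <- pnorm(cuts[Z + 1] - mu_z) - pnorm(cuts[Z] - mu_z)
  -2 * (sum(dnorm(Y, mu_y, sd = 1 / sqrt(tau), log = TRUE)) + sum(log(p_z)))
}

# DIC = 2 * E[Dev(Omega)] - Dev(Omega_hat), with the expectation replaced by the
# average deviance over MCMC draws.
dic <- function(dev_draws, dev_at_mean) 2 * mean(dev_draws) - dev_at_mean

# Toy inputs (placeholders, not results from the article)
set.seed(1)
N <- 5
X <- matrix(rnorm(2 * N), N, 2)
L <- rnorm(N)
Y <- rnorm(N)
Z <- c(1L, 2L, 3L, 2L, 1L)
dev <- cond_deviance(Y, Z, X, L, beta0 = 2, beta = c(2, 3),
                     alpha = c(2, 3), theta = c(-1, 1), tau = 0.8)
```

In practice `cond_deviance` would be evaluated once per saved MCMC draw and once at the posterior means, and the two results passed to `dic`.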

Another criterion is LPML, which is obtained from the conditional predictive ordinate (CPO). Similarly, we calculate the CPO and LPML based on the conditional likelihood. Let $D_{(- i)} = \{(Y_{j}, Z_{j}) : j = 1, \dots, i - 1, i + 1, \dots, N\}$ denote the observed data with the responses of the ith subject deleted. The CPO of the ith subject is defined as

\begin{matrix} {CPO}_{i} = \int f (Y_{i}, Z_{i} | Ω, X, L) π (Ω | D_{(- i)}) d Ω, \end{matrix}

where $π (Ω | D_{(- i)}) = \frac{\prod_{j \neq i} f (Y_{j}, Z_{j} | Ω, X, L) π (Ω)}{m (D_{(- i)})}$ , $π (Ω)$ is the prior and $m (D_{(- i)})$ denotes the normalizing constant. In practice, a Monte Carlo estimate of the CPO can be obtained from MCMC samples of the posterior distribution. To be specific, let $Ω_{t} (t = 1, \dots, T)$ denote the MCMC sample of the unknown parameters drawn from the posterior $π (Ω | D)$ in the tth iteration; then the Monte Carlo estimate of ${CPO}_{i}^{- 1}$ is given by

{\hat{CPO}}_{i}^{- 1} = \frac{1}{T} \sum_{t = 1}^{T} \frac{1}{f (Y_{i}, Z_{i} | Ω_{t}, X, L)},

and LPML can be obtained as

\begin{matrix} \hat{LPML} = \sum_{i = 1}^{N} \log ({\hat{CPO}}_{i}) . \end{matrix}
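The Monte Carlo estimate of the CPO is the harmonic mean of the per-subject conditional likelihood over the MCMC draws, and LPML is the sum of the log CPOs. A minimal base R sketch follows; the function name `lpml_from_loglik` is an assumption, and the calculation is carried out on the log scale to avoid overflow when some likelihood values are very small.

```r
# loglik: T x N matrix with entry [t, i] = log f(Y_i, Z_i | Omega_t, X, L)
lpml_from_loglik <- function(loglik) {
  # log of (1/T) * sum_t 1 / f_ti, computed stably for each subject i
  log_inv_cpo <- apply(-loglik, 2, function(v) {
    m <- max(v)
    m + log(mean(exp(v - m)))
  })
  log_cpo <- -log_inv_cpo
  list(cpo = exp(log_cpo), lpml = sum(log_cpo))
}

# Toy check: two draws with likelihoods 0.2 and 0.4 for a single subject
res <- lpml_from_loglik(matrix(log(c(0.2, 0.4)), nrow = 2, ncol = 1))
```

For this toy input the harmonic mean is 1 / ((5 + 2.5) / 2), so the estimated CPO is 1 / 3.75.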

A model with a smaller DIC value is preferred, as is a model with a larger LPML value.

4 Simulation study and implementation with nimble

4.1 Simulation study

In this section, a simulation study is carried out to illustrate the performance of the model selection criteria and the posterior estimates of the proposed model. For the performance of DIC and LPML, we compare models with different priors on the latent variable and calculate the percentage of replications in which the criteria choose the correct model. For the empirical performance of the posterior estimates, the bias, the root mean square error (RMSE) and the coverage probability (CP) are employed. Suppose $θ$ is an unknown parameter of interest, $θ_{0}$ is its true value and ${\hat{θ}}_{t}$ denotes its posterior estimate in the tth replication; then the bias and RMSE of the parameter can be calculated as

\begin{matrix} Bias (θ) = \frac{1}{T} \sum_{t = 1}^{T} ({\hat{θ}}_{t} - θ_{0}), \\ RMSE (θ) = [\frac{1}{T} \sum_{t = 1}^{T} ({\hat{θ}}_{t} - θ_{0})^{2}]^{1 / 2}, \end{matrix}

where T is the number of replications in the simulation. In our simulation, T was set to 70. CP is the proportion of replications in which the 95% credible interval contains the true value of the parameter.
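These three summaries are straightforward to compute once the estimates and credible limits from each replication are stored. The base R sketch below uses a hypothetical helper `summarise_replications` and three made-up replications purely to show the arithmetic.

```r
# est, lower, upper: one entry per replication; truth: the true parameter value
summarise_replications <- function(est, lower, upper, truth) {
  c(bias = mean(est - truth),
    rmse = sqrt(mean((est - truth)^2)),
    cp   = mean(lower <= truth & truth <= upper))
}

# Made-up estimates from three replications of a parameter whose true value is 2
est <- c(1.9, 2.1, 2.2)
out <- summarise_replications(est, lower = est - 0.15, upper = est + 0.15, truth = 2)
print(round(out, 4))
```

Here two of the three intervals contain the true value, so the coverage is 2/3.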

4.1.1 Simulation 1: L is bimodal

Here the non-normal latent variable L was generated from a mixture of two normal distributions, $0.5 N (- 2, 1) + 0.5 N (2, 1)$ , so that L was bimodal. In each replication, simulated data with a sample size of 1 000 were generated from the model proposed in Section 2 with a continuous outcome variable Y and an ordinal outcome variable $Z$ . Two covariates $X_{1}$ and $X_{2}$ were randomly generated from $N (0, 1)$ . $Y$ followed a normal distribution with precision parameter $τ_{y} = 0.8$ and location parameter $μ_{y} = β_{0} + β_{1} X_{1} + β_{2} X_{2} + L$ , where $β_{0} = 2, β_{1} = 2, β_{2} = 3$ . $Z$ was a three-level ordinal variable that followed an ordered probit distribution with $μ_{z} = α_{1} X_{1} + α_{2} X_{2} + L$ , where $α_{1} = 2$ and $α_{2} = 3$ .
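One replication of this design can be generated in base R as sketched below. The threshold values are not reported in the text, so the values `c(-2, 2)` are an assumption made only to produce a runnable example; the seed is likewise arbitrary.

```r
set.seed(2013)
N <- 1000

# Bimodal latent variable: 0.5 N(-2, 1) + 0.5 N(2, 1)
L <- rnorm(N, mean = sample(c(-2, 2), N, replace = TRUE), sd = 1)

X1 <- rnorm(N)
X2 <- rnorm(N)

# Continuous response with precision 0.8, i.e. standard deviation 1 / sqrt(0.8)
Y <- rnorm(N, mean = 2 + 2 * X1 + 3 * X2 + L, sd = 1 / sqrt(0.8))

# Ordinal response: adding N(0, 1) noise to mu_z and cutting at the thresholds
# gives the ordered probit probabilities of Equation (2.2)
mu_z  <- 2 * X1 + 3 * X2 + L
theta <- c(-2, 2)                     # ASSUMED thresholds, not given in the text
Z <- findInterval(mu_z + rnorm(N), theta) + 1L

sim1 <- data.frame(Y = Y, Z = Z, X1 = X1, X2 = X2)
```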

In order to reveal the performance of DIC and LPML in model selection, we fitted the simulated datasets with the following different priors on the latent variable L:

\begin{matrix} M_{0} : L \sim DP (κ, G_{0}), θ_{j} \sim N (0, 1); \\ M_{1} : L \sim DP (κ, G_{0}), θ_{j} \sim Logistic (0, 1); \\ M_{2} : L \sim DP (κ, G_{0}), θ_{j} \sim Cauchy (0, 1); \\ M_{3} : L \sim N (0, 1); \\ M_{4} : L \sim t_{1} (0, 1) . \end{matrix}

The R package nimble was used for programming and estimation. A total of 100 000 iterations were carried out with a thinning interval of 10, giving 10 000 retained samples. The first 500 retained samples were discarded as the burn-in phase, while the remaining 9 500 samples were saved for posterior estimation. Because of the computational burden, we ran a single Markov chain and used trace plots and ACF plots for convergence assessment.

Average DIC and LPML values for the five models, M₀–M₄, are shown in Table 1.

Table 1:

Average DIC and LPML for the competing models when L is bimodal

Model	M₀	M₁	M₂	M₃	M₄
DIC	1950.36	1897.38	1898.23	3069.78	2688.95
LPML	−2302.90	−2270.65	−2271.21	−2734.42	−2619.87

From Table 1, we can see that model M₁, with a DP prior on L and a logistic base distribution, has the smallest DIC value and the largest LPML value among the alternative models. Both model selection criteria therefore choose M₁ as the best model. Comparing the first three models, which use a DP prior on L, we can see that their DIC and LPML values are similar, especially for models M₁ and M₂. For models M₃ and M₄, the DIC values are much larger than those of the other models, especially for M₃, in which the latent variable follows a normal distribution.

We also calculated the percentage of replications in which DIC and LPML chose the best model. DIC chose the best model M₁ in 71.43% of the replications and never chose M₃ or M₄. LPML chose M₁ in 67.43% of the replications and likewise never chose M₃ or M₄. These values indicate that DIC and LPML perform well in selecting the best model. The results in Table 1 indicate that when the latent variable is actually non-normal and bimodal, the DP prior performs well and is a good choice in practice.

The Bayesian estimates, bias, RMSE and CP of the parameters in the chosen model M₁ are shown in Table 2. We can see that the bias and RMSE in M₁ are relatively small and all the CPs are larger than 0.92. Table 2 also shows the results of model M₃, which places a normal prior on L. Comparing M₁ and M₃, we can see that the bias and RMSE in M₃ are larger than those in M₁ and the CPs are smaller, indicating that the model with a normal latent variable performs worse than the model with a DP prior on the latent variable when the latent variable is actually bimodal. These results agree with the model selection results.

Table 2:

Simulation results of models M₁ and M₃ when L is actually bimodal

Parameter	True value	M₁				M₃
		Estimate	Bias	RMSE	CP	Estimate	Bias	RMSE	CP
$β_{0}$	2	2.0240	0.0240	0.3241	0.9892	2.1103	0.1103	0.3772	0.9014
$β_{1}$	2	2.0037	0.0037	0.0528	0.9859	2.1093	0.1093	0.1650	0.9177
$β_{2}$	3	3.0015	0.0015	0.0670	0.9296	3.1008	0.1008	0.1838	0.8538
$α_{1}$	2	1.9827	−0.0173	0.1061	0.9437	1.2322	−0.7678	0.7714	0
$α_{2}$	3	2.9628	−0.0372	0.1257	0.9577	1.8413	−1.1588	1.1638	0
$τ_{y}$	0.8	0.8651	0.0651	0.1228	0.9296	0.2477	−0.5523	0.5521	0

In our model settings, we used an informative hyperprior Gamma(1, 1) for the concentration parameter κ in the DP prior. Here we perform a sensitivity analysis to study the effect of the hyperprior on κ. We fitted model M₁ with two other hyperpriors on κ: Gamma(0.001, 0.001) and Gamma(100, 100). The simulation results of these three models are shown in Table 3. From Table 3, we can see that the performance of the posterior estimates does not change much across hyperpriors, indicating that the model is robust to the choice of hyperprior on the concentration parameter of the DP prior.

Table 3:

Simulation results for model M₁ with different priors for κ

Parameter	True value	M₁ with $κ \sim Gamma (1, 1)$				M₁ with $κ \sim Gamma (0.001, 0.001)$
		Estimate	Bias	RMSE	CP	Estimate	Bias	RMSE	CP
$β_{0}$	2	2.0240	0.0240	0.3241	0.9892	1.9518	−0.0482	0.2915	1.00
$β_{1}$	2	2.0037	0.0037	0.0528	0.9859	2.0017	0.0017	0.0510	1.00
$β_{2}$	3	3.0015	0.0015	0.0670	0.9296	3.0024	0.0024	0.0669	0.9286
$α_{1}$	2	1.9827	−0.0173	0.1061	0.9437	1.9841	−0.0159	0.1077	0.9429
$α_{2}$	3	2.9628	−0.0372	0.1257	0.9577	2.9682	−0.0318	0.1346	0.9571
$τ_{y}$	0.8	0.8651	0.0651	0.1228	0.9296	0.8788	0.0788	0.1302	0.9286
Parameter	True value	M₁ with $κ \sim Gamma (100, 100)$
		Estimate	Bias	RMSE	CP
$β_{0}$	2	1.9884	−0.0116	0.3812	1.00
$β_{1}$	2	2.0015	0.0015	0.0511	1.00
$β_{2}$	3	3.0023	0.0023	0.0674	0.9429
$α_{1}$	2	1.9760	−0.0240	0.1104	0.9286
$α_{2}$	3	2.9564	−0.0436	0.1387	0.9429
$τ_{y}$	0.8	0.8474	0.0474	0.1120	0.9429

We also fitted model M₁ with JAGS to compare the efficiency of the two tools. For 100 000 iterations with a thinning interval of 10 and a sample of $N = 1 000$ , the total computation time for fitting model M₁ with nimble was 1.13 h (Intel Core Processor 2GHz, 4GB RAM PC), while JAGS took 4.94 h to run the same number of MCMC iterations. This result shows that nimble is computationally more efficient than JAGS for our proposed model, being roughly four times faster in this setting.

4.1.2 Simulation 2: L is skewed

In this section, the simulation settings were identical to those of Simulation 1, except for the true distribution of the latent variable. Here the non-normal latent variable L was drawn from $\sum_{k = 0}^{7} N (3 \{(2 / 3)^{k} - 1\} + 1.919, (2 / 3)^{2 k}) / 8$ (Ibrahim et al., 2001), so that L followed a skewed distribution. As before, model comparison was carried out among models M₀–M₄. The average DIC and LPML values of these models are given in Table 4.
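Sampling from this eight-component mixture is simple in base R. The sketch below assumes that the component means are $3 \{(2 / 3)^{k} - 1\} + 1.919$ , the reading under which the mixture mean is approximately zero, and that $(2 / 3)^{2 k}$ is a variance; the function name `r_skewed_latent` is hypothetical.

```r
# Draw n values from the skewed mixture with equal weights 1/8 on k = 0, ..., 7
r_skewed_latent <- function(n) {
  k <- sample(0:7, n, replace = TRUE)
  rnorm(n, mean = 3 * ((2 / 3)^k - 1) + 1.919, sd = (2 / 3)^k)
}

# Theoretical mean of the mixture: the average of the eight component means
mix_mean <- mean(3 * ((2 / 3)^(0:7) - 1) + 1.919)

set.seed(7)
L <- r_skewed_latent(1000)
```

Under this reading the shift 1.919 centres the mixture, which matches the zero-mean latent variable used in the other simulation settings.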

Table 4:

Average DIC and LPML for the competing models when L is skewed

Model	M₀	M₁	M₂	M₃	M₄
DIC	1762.90	1755.55	1754.08	1788.69	1780.92
LPML	−1948.09	−1943.65	−1943.50	−1960.36	−1954.79

From Table 4, we can see that model M₂ has the smallest DIC value and the largest LPML value among the alternative models. Both model selection criteria chose M₂ as the best model. As before, the DIC and LPML values of the models with a DP prior on L are similar, especially for M₁ and M₂. For models M₃ and M₄, the DIC values are larger than those of the other models, while their LPML values are smaller.

DIC chose the best model M₂ in 73.19% of the replications and never chose M₃ or M₄. LPML chose M₂ in 70.27% of the replications and likewise never chose M₃ or M₄. These values indicate that DIC and LPML perform well in selecting the best model under our simulation settings. The results in Table 4 indicate that when the latent variable is actually non-normal and skewed, the DP prior performs well and is a good choice in practice.

The posterior estimates, bias, RMSE and CP of the parameters in the chosen model M₂ are given in Table 5. We can see that the bias and RMSE in M₂ are relatively small and all the CPs are larger than 0.92. Table 5 also shows the simulation results of model M₃. Compared to M₂, the bias and RMSE in M₃ are larger for most parameters, while the CPs are smaller. These results agree with the model selection results.

Table 5:

Simulation results of models M₂ and M₃ when L is actually skewed

Parameter	True value	M₂				M₃
		Estimate	Bias	RMSE	CP	Estimate	Bias	RMSE	CP
$β_{0}$	2	2.0592	0.0592	0.1772	0.9752	1.8925	−0.1075	0.1163	0.4143
$β_{1}$	2	1.9951	−0.0049	0.0380	0.9851	1.9905	−0.0095	0.0423	0.9657
$β_{2}$	3	2.9950	−0.0050	0.0445	0.9255	2.9919	−0.0081	0.0518	0.8906
$α_{1}$	2	2.0143	0.0143	0.1533	0.9232	2.2969	0.2969	0.3380	0.4143
$α_{2}$	3	3.0106	0.0106	0.2004	0.9403	3.4427	0.4427	0.4921	0.3143
$τ_{y}$	0.8	0.7909	−0.0091	0.1201	0.9204	1.0690	0.2690	0.3200	0.7864

4.1.3 Simulation 3: L is normal

In this simulation, the true distribution of the latent variable L was assumed to be the standard normal distribution $N (0, 1)$ . Similarly, model comparison was carried out among models M₀–M₄. The average DIC and LPML values of these models are given in Table 6.

Table 6:

Average DIC and LPML for the competing models when L is normal

Model	M₀	M₁	M₂	M₃	M₄
DIC	1859.12	1839.94	1845.37	1806.79	1824.09
LPML	−1985.42	−1980.12	−1981.64	−1971.97	−1978.88

From Table 6 we can see that when L is actually normal, model M₃ has the smallest DIC and the largest LPML values among the five models. Both criteria therefore choose M₃ as the best model, which agrees with the fact that L actually follows a normal distribution. Among the models with DP priors on the latent variable, model M₁ with the logistic base distribution performs better than the other two models, M₀ and M₂, on both DIC and LPML. DIC chose the best model M₃ in 75.71% of the replications, while LPML chose M₃ in 74.29%. These results show that DIC and LPML perform well in selecting the correct model when L actually follows a normal distribution.

The posterior estimates, bias, RMSE and CP of the parameters in the chosen model M₃ are given in Table 7. The results of model M₁ with a DP prior are shown as well for comparison.

Table 7:

Simulation results of model M₃ and M₁ when L is actually normal

Parameter	True value	M₃				M₁
		Estimate	Bias	RMSE	CP	Estimate	Bias	RMSE	CP
$β_{0}$	2	2.0028	0.0028	0.0352	1.00	2.0273	0.0273	0.2413	1.00
$β_{1}$	2	1.9868	−0.0132	0.0409	0.9571	1.9871	−0.0129	0.0421	0.9571
$β_{2}$	3	2.9924	−0.0076	0.0445	0.9571	2.9933	−0.0064	0.0450	0.9571
$α_{1}$	2	2.0400	0.0400	0.1349	0.9571	2.0040	0.0040	0.1267	0.9857
$α_{2}$	3	3.0797	0.0797	0.1922	0.9429	3.0268	0.0268	0.1886	0.9857
$τ_{y}$	0.8	0.8165	0.0165	0.0813	0.9571	0.7625	−0.0375	0.1237	0.9286

From Table 7 we can observe that when L is actually normally distributed, the posterior estimates in M₃ are of small bias and RMSE. However, model M₁ with a DP prior for L also behaves well on posterior estimation and the results are close to those in M₃. These results indicate that even when L actually follows a normal distribution, a model with a DP prior on the latent variable also performs well and can obtain relatively precise estimation for the parameters.

4.2 Implementation with nimble

The proposed semiparametric LVM and Bayesian inference were implemented via nimble. nimble is an R package for programming with BUGS models, which allows models to be specified using syntax similar to WinBUGS and JAGS, but with more flexibility in defining the models and algorithms. Users can operate from within R, and nimble generates C++ code for faster computation. In this section, we show how to carry out the proposed analysis using nimble 0.6-10 in R 3.4.1.

The first step is to create the proposed semiparametric LVM, which includes four parts: model code, constants, data and initial values for MCMC. The syntax of the model code is similar to the BUGS language, which is attractive for users who have experience coding in WinBUGS and JAGS.

The model code is stated inside the function nimbleCode(). Within it, we loop from 1 to the sample size N over the mixed responses and the latent variable. For the continuous response $Y$ , a normal distribution is specified with dnorm(), where $b [1]$ denotes the intercept $β_{0}$ and $b [2 : 3]$ represents the coefficient vector $β$ . In nimble, the second parameter of dnorm() can be given in several ways: the default precision $tau$ , the standard deviation $sd$ or the variance $var$ . Users can choose whichever is most convenient.

For the ordinal response $Z$ , an ordinal probit model is specified,

where dcat() is the categorical distribution and phi() is the standard normal cumulative distribution function. Similarly, $a [1 : 2]$ corresponds to the coefficient vector α.
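The listing for the ordinal part is likewise missing. The sketch below assumes three ordered categories, two thresholds `gam[1] < gam[2]`, and the common convention $P (Z_{i} \le k) = Φ (γ_{k} - μ_{z i})$; these choices and the node names `muz`, `pz` and `gam` are assumptions made for illustration. The last lines repeat the same probability calculation in plain R for a single value of the linear predictor.

```r
library(nimble)

# Sketch only: three categories and the names muz, pz, gam are assumptions.
zCode <- nimbleCode({
  for (i in 1:N) {
    muz[i] <- a[1] * X[i, 1] + a[2] * X[i, 2] + L[i]
    # category probabilities as differences of standard normal CDF values
    pz[i, 1] <- phi(gam[1] - muz[i])
    pz[i, 2] <- phi(gam[2] - muz[i]) - phi(gam[1] - muz[i])
    pz[i, 3] <- 1 - phi(gam[2] - muz[i])
    Z[i] ~ dcat(pz[i, 1:3])
  }
})

# The same probabilities in plain R for one illustrative linear predictor value
gam <- c(-0.5, 1.0)
muz <- 0.3
pz <- diff(c(0, pnorm(gam - muz), 1))
sum(pz)  # the three category probabilities add up to one
```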

For the latent variable L, we first assume underlying groups with corresponding probabilities for each component of L, so we have

The statements above form the likelihood inside the for-loop. Outside the for-loop we declare the priors for the unknown parameters. For the latent variable L, a finite-dimensional stick-breaking representation of the DP prior is assigned:
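Neither the group-membership lines nor the stick-breaking lines survive in this copy. A hedged sketch of how the two pieces could fit together is given below. The truncation level `G` (at least 3 here), the concentration `kappa`, the precision `tauL` and the normal base measure are all assumptions, as are the node names `g`, `p`, `v`, `rest` and `theta`.

```r
library(nimble)

# Sketch only: G, kappa, tauL and the normal base measure are assumptions.
lCode <- nimbleCode({
  for (i in 1:N) {
    g[i] ~ dcat(p[1:G])      # group membership of subject i
    L[i] <- theta[g[i]]      # latent variable takes the atom of its group
  }
  # truncated stick-breaking weights
  for (j in 1:(G - 1)) {
    v[j] ~ dbeta(1, kappa)
  }
  p[1] <- v[1]
  rest[1] <- 1 - v[1]        # length of stick left after the first break
  for (j in 2:(G - 1)) {
    p[j] <- v[j] * rest[j - 1]
    rest[j] <- rest[j - 1] * (1 - v[j])
  }
  p[G] <- rest[G - 1]        # last weight takes what remains, so the weights sum to one
  # atoms drawn from the base measure (a normal is assumed here)
  for (j in 1:G) {
    theta[j] ~ dnorm(0, tau = tauL)
  }
})
```

Writing the remaining stick length as its own node keeps every declaration scalar, which is the form most BUGS dialects accept.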

For priors of the thresholds of the ordinal response $Z$ , we have

Here, we should ensure that the thresholds are in an ascending order. However, nimble does not support the JAGS sort() syntax, so we have to write a nimble function to implement this. Fortunately, nimble provides ways to write self-defined functions just as in R by using a function nimbleFunction(). In addition, it can also call R functions with similar functionality using nimbleRcall(). Here we define a nimble function Rsort() by calling the R function sort() using nimbleRcall() as

where ‘double(1)’ denotes the data type of the input and output. This self-defined function should be built before nimbleCode(). Finally the priors for $b [1 : 3]$ , $a [1 : 2]$ and the precision of $Y$ are given as
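The listings for the sorting helper and the priors are not reproduced here. The following sketch shows one way they could look, assuming two thresholds and placeholder hyperparameters; the names `gam0` and `gam` and all prior values are illustrative and are not the values used in the article.

```r
library(nimble)

# Wrap R's sort() so that it can be called from model code;
# double(1) declares a numeric vector for both the argument and the return value.
Rsort <- nimbleRcall(function(x = double(1)) {}, Rfun = "sort",
                     returnType = double(1))

# Sketch only: the hyperparameters below are placeholders.
priorCode <- nimbleCode({
  for (k in 1:2) {
    gam0[k] ~ dnorm(0, tau = 0.01)   # unordered draws
  }
  gam[1:2] <- Rsort(gam0[1:2])       # thresholds in ascending order
  for (k in 1:3) {
    b[k] ~ dnorm(0, tau = 0.01)      # intercept and slopes for Y
  }
  for (k in 1:2) {
    a[k] ~ dnorm(0, tau = 0.01)      # slopes for Z
  }
  tauy ~ dgamma(0.01, 0.01)          # precision of Y
})
```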

After defining the model code, we should define the constants, initial values and data list. Compared to WinBUGS and JAGS, data and initial values can be defined in the same way, while ‘constants’ is a new list that contains the values that would not change, including the variables that define for-loop indices. In our settings, the lists of data, constants and initial values are given as follows:
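Since the three lists are not shown in this copy, a small illustration with simulated stand-in data follows. Every value is invented for the example, and names such as `gam0`, `v`, `theta` and `g` are hypothetical node names for unordered thresholds, stick-breaking fractions, atoms and group labels.

```r
set.seed(2018)
N <- 200   # sample size (illustrative)
G <- 10    # truncation level of the stick-breaking prior (illustrative)

# Simulated stand-ins for covariates and responses; real data would replace these.
X <- cbind(rnorm(N), rbinom(N, 1, 0.5))
Y <- rnorm(N)
Z <- sample(1:3, N, replace = TRUE)

# constants: fixed quantities, including everything used as a loop bound
lvmConsts <- list(N = N, G = G, kappa = 1, X = X)
# data: the observed responses
lvmData <- list(Y = Y, Z = Z)
# inits: one starting value for every unknown node
lvmInits <- list(b = rep(0, 3), a = rep(0, 2), tauy = 1,
                 gam0 = c(-1, 1),
                 v = rep(0.5, G - 1), theta = rnorm(G),
                 g = sample(1:G, N, replace = TRUE))
```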

The second step is to build and compile the nimble model. The function nimbleModel() is used for building the model and compileNimble() is used for compiling it:

The function compileNimble() generates the C++ code, compiles that code and then loads it back into R, leading to faster computation. When running any nimble algorithm via C++, the model needs to be compiled before any algorithm, including an MCMC algorithm, is compiled.

Before we create the MCMC algorithms, we can see the default MCMC samplers that nimble assigned for the unknown parameters:

print=TRUE allows printing the MCMC samplers that nimble assigned for the parameters and thin=10 denotes the thinning interval. In our simulation study, the default samplers that nimble assigned for the precision parameter and the coefficient parameters of the continuous outcome $Y$ are conjugate samplers and for the coefficient parameters of the ordinal response $Z$ , random walk samplers are assigned.

The next step is to create the MCMC algorithm by using the function buildMCMC(), and compile it again and then run it:

The function addMonitors() allows for declaring the parameters that we want to estimate and run() represents running the MCMC iterations. After that, we can obtain the MCMC samples, and then obtain the posterior estimates of the parameters.
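The build, configure and run steps described above can be sketched end to end. To keep the example short, a toy normal model stands in for the LVM; the LVM code, constants, data and initial values would be passed to the same functions in the same way. The object names (`Rmodel`, `Cmodel`, `conf`, `Rmcmc`, `Cmcmc`) and the iteration count are illustrative.

```r
library(nimble)
set.seed(1)

# Toy stand-in model (normal mean and precision), not the LVM itself.
toyCode <- nimbleCode({
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau = tau)
  }
  mu ~ dnorm(0, tau = 0.01)
  tau ~ dgamma(0.01, 0.01)
})

Rmodel <- nimbleModel(toyCode, constants = list(N = 50),
                      data = list(y = rnorm(50, mean = 2)),
                      inits = list(mu = 0, tau = 1))
Cmodel <- compileNimble(Rmodel)                  # compile the model first

conf <- configureMCMC(Rmodel, print = TRUE, thin = 10)  # show default samplers
conf$addMonitors(c("mu", "tau"))                 # parameters to record

Rmcmc <- buildMCMC(conf)
Cmcmc <- compileNimble(Rmcmc, project = Rmodel)  # then compile the MCMC
Cmcmc$run(5000)

samples <- as.matrix(Cmcmc$mvSamples)            # one row per saved iteration
colMeans(samples)                                # posterior means
```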

nimble also provides a one-line function of MCMC for the users to invoke the MCMC engine directly, which generally takes the code, data, constants and initial values as input, and provides a variety of options for executing and controlling multiple chains, iterations, thinning intervals, etc.
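A hedged illustration of the one-line interface, nimbleMCMC(), is given below, again with a toy normal model standing in for the LVM. The numbers of iterations, burn-in draws and chains are illustrative and are not the settings of the article.

```r
library(nimble)
set.seed(1)

# Toy stand-in model (normal mean and precision), not the LVM itself.
toyCode <- nimbleCode({
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau = tau)
  }
  mu ~ dnorm(0, tau = 0.01)
  tau ~ dgamma(0.01, 0.01)
})

out <- nimbleMCMC(code = toyCode,
                  constants = list(N = 50),
                  data = list(y = rnorm(50, mean = 2)),
                  inits = list(mu = 0, tau = 1),
                  monitors = c("mu", "tau"),
                  niter = 5500, nburnin = 500, thin = 10,
                  nchains = 2,
                  samplesAsCodaMCMC = TRUE)  # return coda objects, one per chain
```

With two chains and `samplesAsCodaMCMC = TRUE`, the result is a list of coda chains that can be passed directly to coda's summary and diagnostic functions.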

For the outputs, we can use another R package coda (Plummer et al., 2006) to summarize and generate plots just as in JAGS.
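As a small illustration of this step, the sketch below applies a few coda functions to synthetic draws that stand in for real MCMC output; the parameter names and values are invented for the example.

```r
library(coda)
set.seed(1)

# Synthetic draws standing in for MCMC output (500 saved iterations, two parameters)
draws <- cbind(mu = rnorm(500, 2, 0.1), tau = rgamma(500, 50, 50))
chain <- mcmc(draws, thin = 10)

summary(chain)         # posterior means, SDs and quantiles
effectiveSize(chain)   # effective sample size per parameter
HPDinterval(chain)     # 95% highest posterior density intervals
# traceplot(chain); autocorr.plot(chain)   # trace and ACF plots
```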

5 Application: Chinese General Social Survey 2013

In this section we apply our proposed model and method to analyse data from the Chinese General Social Survey (CGSS) 2013. CGSS, launched by the Department of Sociology of the Renmin University of China and the Survey Research Center of Hong Kong University of Science and Technology, is a survey project aimed at systematically monitoring the changing relationship between social structure and quality of life in both urban and rural China. CGSS collects quantitative data about social structure, quality of life and the underlying mechanisms linking these two aspects.

For this dataset, we considered two important variables, annual income ( $Y$ ) and degree of life happiness ( $Z$ ), as mixed outcome variables. Annual income is a continuous variable, which refers to the amount of annual income of the respondents in the survey year and is transformed into its logarithmic form when analysing. Degree of life happiness is an ordinal variable with five levels. The related question of this ordinal variable in CGSS 2013 is ‘How unhappy or happy are you with your life overall?’. There are five ordered categories for this question, including ‘unhappy at all’, ‘unhappy’, ‘neither unhappy nor happy’, ‘happy’ and ‘very happy’, which are recorded from scale 1 to 5.

In order to find out the factors that may have impacts on these two responses, we also considered the following explanatory variables: gender ( $X_{1}$ ), age ( $X_{2}$ ), education level ( $X_{3}$ ), health status ( $X_{4}$ ) and marital status ( $X_{5}$ ). Age is a continuous variable, while the remaining four covariates are categorical, which are represented as dummy variables in analysis. After removing the records with missing values, the total sample size is 10 166. The summary of these variables is displayed in Table 8.

Table 8:

Summary of response and explanatory variables in CGSS 2013

Variable	Data type	Levels	Mean/Proportion
log annual income	Continuous	–	8.52
life happiness	Ordinal	unhappy at all	1.51%
		unhappy	7.22%
		neither unhappy nor happy	18.31%
		happy	59.03%
		very happy	13.94%
gender	Categorical	female	48.90%
		male	51.10%
age (standardized)	Continuous	–	0.40
education level	Ordinal	primary school and below	35.42%
		junior middle school	29.69%
		senior high school	18.90%
		college and above	15.99%
health status	Ordinal	very unhealthy	2.85%
		unhealthy	13.42%
		general	19.09%
		healthy	38.54%
		very healthy	26.10%
marital status	Categorical	single	8.98%
		married or living as couple	80.62%
		separated or divorced	2.28%
		widowed	8.12%

Figure 1:

Box plot of log annual income versus life happiness

The box plot of these two responses is shown in Figure 1. From Figure 1 we can observe that annual income tends to be higher at higher levels of life happiness. Therefore, we cannot assume that these two variables are independent and analyse them separately; a joint analysis is more suitable. With the data described earlier, we set up the models for the mixed continuous and ordinal outcomes according to Section 2 as follows:

\begin{matrix} Y_{i} | L \sim N (μ_{y i}, τ_{y}^{- 1}), \\ μ_{y i} = β_{0} + β_{1} X_{1 i} + β_{2} X_{2 i} + {β^{'}}_{3} X_{3 i} + {β^{'}}_{4} X_{4 i} + {β^{'}}_{5} X_{5 i} + L_{i}, \\ Z_{i} | L \sim OP (μ_{z i}), \\ μ_{z i} = α_{1} X_{1 i} + α_{2} X_{2 i} + {α^{'}}_{3} X_{3 i} + {α^{'}}_{4} X_{4 i} + {α^{'}}_{5} X_{5 i} + L_{i}, \\ L_{i} \sim DP (κ, H) . \end{matrix}

Since the covariates education level ( $X_{3}$ ), health status ( $X_{4}$ ) and marital status ( $X_{5}$ ) are all categorical variables with more than two levels, they are written as vectors of dummy variables. The corresponding regression coefficients are denoted as vectors ${β^{'}}_{3} = (β_{3}^{1}, β_{3}^{2}, β_{3}^{3}), {β^{'}}_{4} = (β_{4}^{1}, β_{4}^{2}, β_{4}^{3}, β_{4}^{4}), {β^{'}}_{5} = (β_{5}^{1}, β_{5}^{2}, β_{5}^{3})$ with their first levels as the reference categories. Coefficients ${α^{'}}_{3}, {α^{'}}_{4}$ and ${α^{'}}_{5}$ are written in a similar way.

For Bayesian implementation, priors on the unknown parameters are given as described in Section 3. R and nimble were used for programming, and a single Markov chain was used with the first 500 iterations being the burn-in phase and the remaining 5 000 samples used for posterior estimation. The thinning interval was set to be 10. Convergence was monitored by trace plots and ACF plots of the estimates.

We also assumed different priors on L, as in models M₀–M₄ of Section 4, and applied DIC and LPML for model selection. DIC and LPML values for these five models are shown in Table 9.

Table 9:

DIC and LPML values of the competing models for dataset CGSS 2013

Model	M₀	M₁	M₂	M₃	M₄
DIC	6621.16	6789.04	6632.99	16140.01	23835.08
LPML	−3530.92	−3530.83	−3530.25	−3557.55	−3584.05

From the model comparison results, we can see that model M₀ has the smallest DIC value, while LPML selects model M₂ as the best model. However, since the difference between the LPML values of M₀ and M₂ is small, we choose M₀ as the best model for this dataset. The Bayesian estimates of the parameters in model M₀ are shown in Table 10. The traceplots and ACF plots of these parameters are given in the supplementary materials online.
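For readers who want to compute an LPML value from MCMC output, a minimal sketch follows. It assumes the usual estimator based on conditional predictive ordinates, in which CPO for observation i is the harmonic mean of its likelihood over the posterior draws and LPML is the sum of the log CPO values; this is assumed, not confirmed, to match the definition in Section 3.2. The matrix `loglik` of pointwise log-likelihoods is synthetic.

```r
set.seed(1)

# loglik: draws x observations matrix of pointwise log-likelihoods (synthetic here)
loglik <- matrix(rnorm(1000 * 20, mean = -1, sd = 0.2), nrow = 1000)

# log CPO_i = -log( mean over draws of exp(-loglik[, i]) ), computed stably
log_cpo <- apply(-loglik, 2, function(x) {
  m <- max(x)
  -(m + log(mean(exp(x - m))))
})

lpml <- sum(log_cpo)   # larger values indicate better predictive fit
lpml
```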

Table 10:

Bayesian estimation of parameters in model M₀

Parameter	Estimate	SD	95% CI	Parameter	Estimate	SD	95% CI
$β_{0}$	3.9388	0.3194	(3.2934, 4.5803)	$α_{1}$	−0.1333	0.0228	(−0.1772, −0.0882)
$β_{1}$	1.4367	0.0612	(1.3173, 1.5567)	$α_{2}$	0.6792	0.0729	(0.5355, 0.8233)
$β_{2}$	0.6406	0.1991	(0.2518, 1.0368)	$α_{3}^{1}$	0.0302	0.0298	(−0.0278, 0.0880)
$β_{3}^{1}$	0.8055	0.0800	(0.6461, 0.9589)	$α_{3}^{2}$	0.0542	0.0343	(−0.0129, 0.1207)
$β_{3}^{2}$	1.2095	0.0906	(1.0330, 1.3881)	$α_{3}^{3}$	0.2012	0.0381	(0.1265, 0.2773)
$β_{3}^{3}$	2.1189	0.1015	(1.9212, 2.3210)	$α_{4}^{1}$	0.2402	0.0697	(0.1011, 0.3764)
$β_{4}^{1}$	0.3941	0.1990	(0.0027, 0.7788)	$α_{4}^{2}$	0.4609	0.0679	(0.3289, 0.5925)
$β_{4}^{2}$	1.1276	0.1963	(0.7450, 1.5110)	$α_{4}^{3}$	0.6784	0.0664	(0.5492, 0.8072)
$β_{4}^{3}$	1.3489	0.1921	(0.9746, 1.7231)	$α_{4}^{4}$	1.0497	0.0698	(0.9097, 1.1854)
$β_{4}^{4}$	1.3532	0.1968	(0.9722, 1.7453)	$α_{5}^{1}$	0.1443	0.0420	(0.0635, 0.2275)
$β_{5}^{1}$	1.4663	0.1174	(1.2331, 1.6925)	$α_{5}^{2}$	−0.5028	0.0817	(−0.6680, −0.3432)
$β_{5}^{2}$	2.1397	0.2268	(1.7002, 2.5860)	$α_{5}^{3}$	−0.0117	0.0630	(−0.1308, 0.1146)
$β_{5}^{3}$	1.3378	0.1748	(0.9906, 1.6807)	$τ_{y}$	0.1093	0.0016	(0.1064, 0.1124)

From Table 10, we can observe that, at the 95% credible level, gender has an impact on both annual income and life happiness. Males tend to have higher annual income than females, but females have a higher level of life happiness than males. Age has a positive impact on both annual income and life happiness, meaning that older people tend to have higher annual income and a higher level of life happiness than younger people. People with a higher education level generally have higher annual income than people with a lower education level. For life happiness, only people educated to college level or above are significantly happier than those in the reference category (primary school and below). People with a better health status tend to have higher income and a higher level of life happiness. Regarding marital status, people who are single have a lower annual income than people in the other categories. The level of life happiness is highest for people who are married or living as a couple and lowest for people who are separated or divorced.

6 Extension to the Dirichlet process mixtures prior

In this article we focus on the use of the DP as a prior for the latent variable in our joint model. However, the discreteness of the DP makes the model less appealing for some applications. When the unknown distribution is known to be continuous, the discrete nature of DP realizations is awkward. For some hierarchical models, the DP prior may lead to inconsistent estimates if the true distribution is continuous (Rodriguez and Müller, 2013). To mitigate this limitation, the discrete distribution can be convolved with a continuous kernel, that is, a DP mixture of continuous distributions can be assigned as the prior for the latent variable.

Suppose for the random samples of latent variables $L_{i} (i = 1, \dots, N)$ , the unknown distribution is denoted by $F$ . A DP mixtures prior on $F$ is denoted by

\begin{matrix} L_{i} \sim F (L_{i}) = \int f (L_{i} | θ) G (d θ), \\ G \sim DP (κ, G_{0}), \end{matrix}

where $f (L_{i} | θ)$ is the kernel of the DP mixtures prior, which is indexed by a finite-dimensional parameter $θ$ . The DP mixtures prior can also be represented in a way similar to the stick-breaking construction of DP. Therefore, we have

\begin{matrix} L_{i} | π_{j}, θ_{j} \sim \sum_{j = 1}^{\infty} π_{j} f (L_{i} | θ_{j}), \end{matrix}

where $θ_{j} \sim G_{0}$ , $π_{j} = V_{j} \prod_{k < j} (1 - V_{k})$ and $V_{j} \sim Beta (1, κ)$ .

The DP mixtures are countable mixtures with an infinite number of components and specific priors on the weights $π_{j}$ and the component-specific parameters $θ_{j}$ . For appropriate choices of the kernel $f (L_{i} | θ)$ , the DP mixtures prior has support on a large class of distributions (Rodriguez and Müller, 2013). For a DP mixtures of normals prior, we have

\begin{matrix} L_{i} | G \sim \int N (L_{i} | μ_{j}, τ_{j}^{- 1}) G (d μ_{j}, d τ_{j}), \end{matrix}

which is equivalent to

\begin{matrix} L_{i} \sim N (μ_{j}, τ_{j}^{- 1}), (μ_{j}, τ_{j}) \sim G, \\ G = \sum_{j = 1}^{\infty} π_{j} N (μ_{j}, τ_{j}^{- 1}), (μ_{j}, τ_{j}) \sim G_{0}, \sum_{j = 1}^{\infty} π_{j} = 1 . \end{matrix}

The DP mixtures of normals prior allows a flexible continuous distribution for the latent variable. A normal distribution $N (μ_{0}, τ_{0}^{- 1})$ is assigned for $μ_{j}$ , with a normal hyperprior for $μ_{0}$ and a Gamma distribution for $τ_{0}$ .
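To make the construction concrete, the sketch below draws latent values from a truncated version of a DP mixture of normals in plain R. The truncation level `J`, the concentration `kappa` and the base-measure choices for the component means and precisions are illustrative assumptions, not the settings of the article.

```r
set.seed(1)
kappa <- 1    # concentration parameter (illustrative)
J <- 50       # truncation level of the infinite sum (illustrative)

# stick-breaking fractions; setting the last one to 1 closes the stick
v <- c(rbeta(J - 1, 1, kappa), 1)
# pi_j = V_j * prod_{k < j} (1 - V_k)
pi_j <- v * cumprod(c(1, 1 - v[-J]))

# component-specific parameters drawn from an assumed base measure
mu_j  <- rnorm(J, 0, 1)
tau_j <- rgamma(J, 1, 1)

# draw a component label for each subject, then the latent value from that normal
comp <- sample.int(J, 500, replace = TRUE, prob = pi_j)
L <- rnorm(500, mean = mu_j[comp], sd = 1 / sqrt(tau_j[comp]))

sum(pi_j)   # the truncated weights sum to one
```

Unlike a draw from the DP itself, which repeats a finite set of atoms, the values in `L` are almost surely distinct because each one comes from a continuous normal kernel.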

In order to show the performance of DP mixtures prior, we also built the model using DP mixtures of normals prior with a logistic base function for the latent variable L to fit the simulated data in Simulation 1. Hyperpriors for $μ_{j}$ and $τ_{j}$ are N(0,1) and Gamma(1,1), respectively. The average DIC and LPML values of this new model, denoted by M₁^*, are 1823.29 and −2226.74, indicating that the DP mixtures of normals prior performs better than DP prior under the settings in Simulation 1. The simulation results of the parameters are given in Table 11.

Table 11:

Simulation results of model M₁^* when L is actually bimodal

Parameter	True value	Estimate	Bias	RMSE	CP
$β_{0}$	2	2.0738	0.0738	0.2239	1.00
$β_{1}$	2	2.0000	0.0000	0.0513	1.00
$β_{2}$	3	2.9999	0.0001	0.0643	0.9242
$α_{1}$	2	2.0096	0.0096	0.1108	0.9697
$α_{2}$	3	3.0073	0.0073	0.1353	0.9545
$τ_{y}$	0.8	0.8068	0.0068	0.1403	0.9697

From Table 11, we can see that the bias and RMSE of the posterior estimates are relatively small and the CP values are all greater than 0.92, showing that the posterior estimates in model M₁^* are precise. Comparing with the results for model M₁ in Table 2, the biases of the estimates in model $M_{1}^{*}$ are smaller than those in model M₁ except for parameter $β_{0}$ , while the RMSEs of these estimates in $M_{1}^{*}$ are generally slightly greater than those in M₁. Overall, the DP mixtures of normals prior performs better than the DP prior, but it should be noted that the estimation of parameter $β_{1}$ is sensitive to the hyperpriors on $μ_{j}$ and $τ_{j}$ . With non-informative hyperpriors $μ_{j} \sim N (0, 0.001^{- 1})$ and $τ_{j} \sim Gamma (0.001, 0.001)$ , the posterior estimate of β₁ is highly biased and has a large standard deviation. Thus, the DP mixtures prior is a good choice for mitigating the limitation of the DP prior, but the hyperpriors for the hyperparameters should be chosen carefully. It is also interesting to explore the performance of DP mixtures priors based on other continuous distributions, which can be a direction for further research.

7 Concluding remarks

In this tutorial article, a Bayesian semiparametric latent variable model with the latent variable following a DP prior is applied to the joint modelling of mixed continuous and ordinal outcome variables, and the R package nimble is used for implementation. A simulation study and an empirical analysis of data from CGSS 2013 indicate that the semiparametric LVM with DP prior and the model selection criteria perform well. In addition, an extension of the DP prior to the DP mixtures prior is introduced for greater flexibility.

There are several possible extensions that can be made to improve the current model. First, the current analysis is cross-sectional, so we can generalize the model to more complex data structures, including clustered data, longitudinal data, and other mixed types of outcomes. Second, the DP prior can be extended to DP mixtures prior introduced in Section 6, and it is also valuable to explore the performance of DP mixtures priors of different continuous distributions. Third, missing data are another important issue to be considered. Missing data are common in scientific research and simply ignoring or inappropriately handling these missing values would lead to biased estimates. With missing values, the analysis procedure will be more complicated, especially for non-ignorable missing values.

Supplementary Materials

Supplementary materials for this article, including the model code, the traceplots and ACF plots mentioned in Section 5, are available from http://www.statmod.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

This work was funded by Chinese National Program for Support of Top-notch Young Professionals (Grant number 2015338).

References

1. Antoniak (1974) Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–74.
2. Azzalini (1985) A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–78.
3. Azzalini and Capitanio (1999) Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Statistical Methodology, Series B, 61, 579–602.
4. Baghfalaki and Ganjali (2011) An EM estimation approach for analyzing bivariate skew normal data with non monotone missing values. Communications in Statistics: Theory and Methods, 40, 1671–86.
5. Celeux, Forbes, Robert and Titterington (2006) Deviance information criteria for missing data models. Bayesian Analysis, 1, 651–73.
6. de Valpine, Turek, Paciorek, Anderson-Bergman, Lang and Bodik (2017) Programming with models: Writing statistical algorithms for general model structures with nimble. Journal of Computational and Graphical Statistics, 26, 403–13.
7. Escobar (1994) Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, 89, 268–77.
8. Ferguson (1973) A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–30.
9. Gelman and Rubin (1992) Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–72.
10. Gill and Casella (2009) Nonparametric priors for ordinal Bayesian social science models: Specification and estimation. Journal of the American Statistical Association, 104, 453–54.
11. Hwang and Pennell (2014) Semiparametric Bayesian joint modeling of a binary and continuous outcome with applications in toxicological risk assessment. Statistics in Medicine, 33, 1162–75.
12. Ibrahim, Chen and Sinha (2001) Criterion-based methods for Bayesian model assessment. Statistica Sinica, 11, 419–43.
13. Ishwaran and James (2001) Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161–73.
14. Ishwaran and Zarepour (2000) Markov Chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika, 87, 371–90.
15. Kano, Berkane and Bentler (1993) Statistical inference based on pseudo-maximum likelihood estimators in elliptical populations. Journal of the American Statistical Association, 88, 135–43.
16. Lee and Song (2008) Semiparametric Bayesian analysis for structural equation model with fixed covariates. Statistics in Medicine, 27, 2341–60.
17. Lee and Xia (2006) Maximum likelihood methods in treating outliers and symmetrically heavy-tailed distributions for nonlinear structural equation models with missing data. Psychometrika, 71, 565–85.
18. Lin and Chen (2009) Analysis of multivariate skew normal models with incomplete data. Journal of Multivariate Analysis, 100, 2337–51.
19. Huang (2014) Bayesian analysis of nonlinear mixed-effects mixture models for longitudinal data with heterogeneity and skewness. Statistics in Medicine, 33, 2830–49.
20. Chen (2018) Bayesian methods for dealing with missing data problems. Journal of the Korean Statistical Society, 47, 297–313.
21. MacEachern and Müller (1998) Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7, 223–38.
22. McCulloch (2008) Joint modelling of mixed outcome types using latent variables. Statistical Methods in Medical Research, 17, 53–73.
23. Moustaki and Knott (2000) Generalized latent trait models. Psychometrika, 65, 391–411.
24. Müller, Quintana, Jara and Hanson (2015) Bayesian Nonparametric Data Analysis. Berlin: Springer.
25. Plummer (2003) JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd international workshop on ‘Distributed Statistical Computing’, edited by K Hornik, F Leisch and A Zeileis, 20–22 March 2003, Vienna, Austria. URL https://www.r-project.org/conferences/DSC-2003/Proceedings/Plummer.pdf (last accessed 11 December 2018).
26. Plummer, Best, Cowles, Vines and Plummer (2006) The coda package. International Agency for Research on Cancer, France (cited 30 December 2004). URL http://www-fis.iarc.fr/coda (last accessed 11 December 2018).
27. Rodriguez and Müller (2013) Nonparametric Bayesian inference. NSF-CBMS Regional Conference Series in Probability and Statistics, 9, i–110. URL http://www-jstor-org.web.bisu.edu.cn/stable/nsfcbmsregconf.9.01 (last accessed 7 December 2018).
28. Sammel, Ryan and Legler (1997) Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59, 667–78.
29. Sethuraman (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–50.
30. Shapiro and Browne (1987) Analysis of covariance structures under elliptical distributions. Journal of the American Statistical Association, 82, 1092–97.
31. Song and Song (2007) Correlated Data Analysis: Modeling, Analytics, and Applications. Berlin: Springer Science + Business Media.
32. Spiegelhalter, Thomas, Best and Lunn (2003) WinBUGS Version 1.4.1 User Manual. MRC Biostatistics Unit, University of Cambridge. URL https://www.mrcbsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs/ (last accessed 7 December 2018).
33. Stan Development Team (2017) Stan Modeling Language Users Guide and Reference Manual Version 2.17.0. URL http://mc-stan.org (last accessed 7 December 2018).
34. Teimourian, Baghfalaki, Ganjali and Berridge (2015) Joint modeling of mixed skewed continuous and ordinal longitudinal responses: A Bayesian approach. Journal of Applied Statistics, 42, 2233–56.
35. Wu (2013) Contributions to Copula Modeling of Mixed Discrete-Continuous Outcomes. PhD thesis, University of Calgary, Calgary, Canada.
36. Xia and Gou (2016) Bayesian semiparametric analysis for latent variable models with mixed continuous and ordinal outcomes. Journal of the Korean Statistical Society, 45, 451–65.
