Abstract
There is a rich literature on the analysis of longitudinal data with missing values. However, the analysis becomes complex for semi-continuous (zero-inflated) longitudinal response with missingness. In this article, we propose a partially varying coefficients regression model for analysing such data. We use a two-part model, where in the first part we propose a latent dynamic model for accounting a ‘zero’ or a ‘non-zero’ response, and in the second part we use another dynamic model for estimating the mean trajectories of non-zero responses. Two dynamic models are linked through subject-specific random effects. The missing covariates are imputed repeatedly based on their respective posterior predictive distributions and the missing responses are imputed using the working model under different identifying restrictions. We analyse data from the Health and Retirement Study (HRS) for aged individuals and develop a dynamic model for predicting out-of-pocket medical expenditures (OOPME) containing excess zeros. The operating characteristics of the proposed model are investigated through extensive simulation studies.
Keywords
Introduction
Analysis of longitudinal data is typically challenging due to inherent dependence among measurements from the same subject over different time points. Despite the difficulty, various methods have been proposed for modelling mean trajectory and covariance structure for univariate and multivariate longitudinal data. Random effects models proposed by Laird and Ware (1982) have been successfully used for analysing longitudinal response. Pourahmadi (1999, 2000), Pan and Mackenzie (2003) proposed a joint mean-covariance model for univariate longitudinal data. Das et al. (2013) generalized Pourahmadi's approach to bivariate longitudinal outcomes. Flexible semi-parametric Bayesian models have also been proposed for grouped longitudinal measurements by Das and Daniels (2014) and Das et al. (2015).
In many real applications, we have longitudinal measurements with excess zeros. Statistical methods based on mixture models have been used for analysing longitudinal measurements with ‘excess zeros’. Lambert (1992) proposed a zero-inflated Poisson (ZIP) model which can handle such excess zeros in count data. Hall (2000) proposed a zero-inflated binomial (ZIB) model for upper-bounded count data containing a significant proportion of zeros. Min and Agresti (2002) provided a nice tutorial on different methods for handling data with excess zeros. Most popularly used methods for analysing zero-inflated data include two-part models (Duan et al., 1983), Tobit models (Amemiya, 1984) and finite mixture models (Aitkin and Rubin 1985). Ghosh and Tu (2008), Boone et al. (2012), Buu et al. (2012), Alfo and Maruotti (2010) and Yang et al. (2016) developed models for zero-inflated longitudinal count data. Similar models have also been proposed for continuous longitudinal measurements with excess zeros (Olsen and Schafer, 2001; Mukherji et al., 2016, Farewell et al., 2017). In this article, we propose a Bayesian approach of analysing zero-inflated continuous longitudinal response with incomplete covariates and response. Specifically, we consider monotone incompleteness which is quite common in clinical trials and/or biomedical experiments. In longitudinal studies, incompleteness typically occurs due to dropouts. In the presence of dropouts, marginal distribution of the response variable is not identified in general (Little, 1993). Selection models (Diggle and Kenward, 1994) and pattern mixture models (PMM; Little, 1993, 1995) are commonly used for modelling incomplete longitudinal data. In a selection model all subjects contribute to the model at all time points, but in a PMM subjects only contribute to the model prior to the respective dropout times. Thus PMMs are usually under-identified (Little, 1993). Little (1995) proposed several identifying restrictions for overcoming the under-identification problem which are discussed with examples in Daniels and Hogan (2008). A PMM for handling zero-inflation and incompleteness in a longitudinal setting is proposed by Maruotti (2001). In our work, however, we do not model the missingness but impute missing covariates and missing response using a working model under different identifying restrictions (Molenberghs et al., 1998). We analyse data from the Health and Retirement Study (HRS) conducted by University of Michigan for an ageing population. Subjects were either retired or about to retire (with age 50 years or above), and they were followed for 16 years with a follow-up frequency of 2 years which we refer as waves. The goal of this study was to understand the challenges of ageing in terms of health condition and economy. According to a joint report by the US Department of State and the National Institute on Aging, nearly 500 million people in the world were 65 years or above in 2006 (Dobriansky et al., 2007). By 2030, this number is expected to increase to 1 billion. In the United States life expectancy has increased from 49 years to 78 years in the last century (Arias, 2006), and in China it has increased from 40 years to 76 years in the last 50 years. Thus, it is expected that the growth of the older population will have a global impact on the health care system, economy and insurance.
In the HRS data, variables related to physical health, financial health and demographies are measured from the subjects longitudinally. This dataset has been analysed by several authors from various perspective. Brown et al. (2015) proposed a Bayesian two-part model for household debt based on the HRS data. Bhuyan et al. (2018) proposed a two-stage regression model for handling endogeneity and incompleteness due to censoring for the HRS data. They used a Bayesian Tobit model for the response and the endogenous covariate. Das et al. (2017) proposed a dynamic hurdle model for predicting the rate of hospital admission of aged individuals. Our current work is different from the previous works (based on the HRS data) since we consider the problem of monotone missingness and zero-inflation in a longitudinal setting. Our variable of interest is out-of-pocket medical expenditure (OOPME). Note that at each wave, OOPME is the total cost of all medical services excluding the cost that is reimbursed or paid through health insurance. The OOPME is an accurate measure of the actual financial burden of medical expenditure incurred by individuals (Mukherji et al., 2016), and contains excess zeros. The objective of our study is to model and predict OOPME in terms of the covariates measuring physical and financial health. We use a two-part model for our analysis. In the first part, we consider a latent variable for the binary indicator of zero/non-zero response, and use a dynamic probit model similar to the model proposed in Albert and Chib (1993). In the second part, we propose a partially varying coefficients dynamic model for non-zero response. Missing covariates are imputed from respective posterior predictive distributions, and missing responses are imputed based on the working model under different identifying restrictions. Model parameters are estimated by Markov Chain Monte Carlo (MCMC) in a multiple imputation framework.
The rest of our article is organized as the following. In Section 2, we propose two dynamic models for predicting zero-inflated longitudinal response under monotone missingness. Results from the analysis of the HRS data are shown in Section 3. In Section 4, we show the results from the simulation study. Some concluding remarks are given in Section 5.
Proposed model and methods
We consider a non-negative zero-inflated longitudinal response measured from N subjects at T different time points. The set of covariates contains
Dynamic model for the probability of a zero response
Define binary random variables
We model
and model
where
where Φ denotes the cumulative distribution function of a standard normal random variable. One can model the time-varying coefficients
A Legendre polynomial of order K as a function of time T is expressed as the following:
where
Next we model the non-zero responses for each subject at different time points. We note that in the original data different subjects are non-zero at different time points. Let the ith subject give non-zero responses at
where
In our application, a positive response at one time point can affect the probability of a zero response at other time points. Such dependence can be modelled by assuming a joint distribution for (
Let
where
Assuming that the prior distributions for
We compute full conditional distributions of the model parameters from the aforementioned joint posterior density (details are given in the Appendix). Model parameters are estimated by their respective sample posterior means. Convergence of the Markov chains are monitored graphically and also by computing the multivariate potential scale reduction factors as suggested in Brooks and Gelman (1998).
Xie and Paik (1997), Schafer and Yucel (2002), Searman et al. (2012), Chen and Fu (2011) and Maruotti (2011) proposed methods for modelling or imputing the missing covariates. For a set of correlated variables, one can use a joint model and update the missing values based on the respective full conditional distributions. However, in a regression setting with a set of uncorrelated predictors, joint modelling is not meaningful. Also joint modelling becomes cumbersome for larger number of variables under consideration. Here we propose an alternative multiple imputation method where univariate models are used for imputing the predictors, and the missing responses are imputed based on the working model under a set of identifying restrictions. In real applications (Section 3), we first verify the absence of multicollinearity in our data by computing variance inflation factor (VIF) for each covariate at each time point. Additionally, we assume that the covariates and the response are ‘missing at random’ (Rubin, 1976).
In our proposed dynamic models, we have
In other words, we first consider a suitable prior distribution for θ, and then compute the posterior distribution of θ for the given observed data (
Once the missing covariates are imputed, we use the dynamic models given in Equations (2.2) and (2.4) for imputing the missing response values. We start with some initial parameter values for these two dynamic models. Typically we fit two models on the subjects which are fully observed (no missing values) and compute the maximum likelihood estimates (mle) of the model parameters. These estimates are taken as the initial choice of the model parameters. For one set of imputed covariates, we impute the missing
Considering the model in Equation (2.4) as our working model, let
Under complete case missing value (CCMV) restriction, we have
for
For available case missing value (ACMV) restriction, we have
for
The ACMV restriction in (2.9) can further be expressed as the following:
where
Alternatively, under neighbouring case missing value (NCMV) restriction we have
for
Once the missing non-zero responses are imputed, we obtain one set of complete data
We consider data from the HRS for our analysis. This study started in 1992 with 1 929 subjects of age 50 years or above, and ended in 2008 with a follow-up frequency of two years which we refer as waves. The goal of the study was to understand the challenges of ageing related to physical health condition and additional cost due to hospital admission. We note that for 1 227 subjects, we have the complete data (no missing observation) while for the others we have missingness at one or more waves but the missingness is monotone. At any particular wave, the medical expenditure of a subject is covered by exactly one health insurance.
For our analysis, we focus on modelling and predicting OOPME. Note that OOPME includes costs related to medical services (e.g., doctor's visit, hospital visits, drugs, etc.), but excludes the costs reimbursed through health insurance. In health economics, OOPME is treated as an important measure of financial risk (Mukherji et al., 2016). In our data, we have three different health insurances: (a) employment insurance, (b) government insurance and (c) other (private) insurance. Note that government insurance includes medicaid insurance for people with low income, and medicare insurance for people with age 65 years or above. Clearly, for an individual if there is no medical expenditure or if all the medical costs are paid by health insurance at a particular wave, then OOPME will be zero for that person at that wave. Thus, our response variable contains excess zeros, and hence we use the proposed two-part model for data analysis. In our data, we have 9–16% of zeros on average in OOPME across the waves. The set of covariates contains body mass index (BMI) as a continuous variable and the presence (yes/no) of eight chronic diseases: hypertension, diabetes, cancer, lung problem, heart problem, stroke, arthritis and psychological problems. We consider BMI, calendar age and three indicator variables for three different health insurances as the covariates with potential time-varying effects on the response. The effects of all the other covariates are considered as fixed.
Dynamic models and priors
Let
for
For non-zero response, however, we consider log transformed data following Mukherji et al. (2016). Typically, medical expenditure data are highly positively skewed and hence a log transformation is useful from the modelling perspective. Thus, our model for the non-zero OOPME is the following:
and we assume that
Sensitivity analysis for the HRS data analysis
We consider the following prior distributions for model parameters. For each coefficient vector
(
First, we check the presence of multicollinearity in our data. Note that we have five covariates with time-varying effects and eight covariates with fixed effects. At each wave, we compute the VIF for each predictor variable, and it turns out the computed VIFs are all below 7. This indicates the absence of multicollinearity in our dataset, and hence we impute the missing covariates separately from a set of univariate models. We impute missing calendar age directly since we have eight waves, and the length of each wave is two years. For the other missing covariates, we use a model-based approach for imputation. We note that the missing covariates should not be imputed for each wave, simply because subjects are of different ages at a particular wave. Hence, we divide the subjects into five ‘age groups’ across all the eight waves: 50–54 (group 1), 55–59 (group 2), 60–64 (group 3), 65–69 (group 4) and ≥70 (group 5). The subjects belonging to the same age group are expected to be similar in terms of physical health and insurance status. Covariates are imputed for each age group. A wider age interval (for example, 50–59) might be inefficient for grouping simply because health conditions change for the older people at a much faster rate. Our choice for five groups in this dataset is subjective, but we consider this reasonable enough for effective imputation of the covariates. Thus, for imputing the missing covariates, we ignore the ‘waves’ but consider the aforementioned ‘age groups’. However, we note that our grouping (based on calendar age) is only for imputing the covariates, and not for any other inference purpose. For each age group, we fit a suitable probability distribution for each covariate based on the available data. For BMI, we fit
Results
In Table 2, we show the parameter estimates and the corresponding 95% credible intervals computed from MCMC iterations for the eight predictors with fixed effects on the response. Results are shown for the dynamic models given in the Equations (3.1) and (3.2). We note that for the dynamic model in Equation (3.1), the coefficients corresponding to cancer, heart problem and stroke are significantly different from zero, and the corresponding Bayesian credible intervals do not contain zero. Similar results are obtained for the dynamic model in Equation (3.2). The results are reassuring the fact that serious condition of the patients suffering from cancer, stroke or cardiovascular diseases are associated with frequent hospitalization. The additional cost due to cancer, stroke and heart problem are not, in general, covered by health insurance. We also note that the effects of blood pressure, diabetes and psychological problems are not significant for both the models. The coefficients for lung problem and arthritis are different from zero, but the corresponding 95% credible intervals contain zero. This again accords to our intuition that these diseases are chronic and hence not associated with frequent hospitalization. Also the medical costs for treating these diseases are covered by the health insurance.
Fixed covariate effects for two dynamic models in the HRS data
Fixed covariate effects for two dynamic models in the HRS data
Time-varying effects of different health insurances in probit model
In Figure 1, we plot the time-varying effects of three insurances for probit model and also plot the corresponding 95% point-wise credible intervals. We note that the curve for employment insurance shows a slow decreasing trend. This indicates that employment insurances cover less for the relatively older people, and hence the effect of this insurance on the probability of zero/non-zero OOPME gradually decreases. The increasing trend of the curve for government insurance after wave 4 reflects that the medicare insurance is more useful than the medicaid insurance. For the other (private) insurance, we observe a sharp decreasing trend after wave 4, indicating that the private insurances are less effective for the relatively older people (age 62 years or above).
Time-varying effects of calendar age and BMI in probit model
In Figure 2, we plot the time-varying effects of BMI and calendar age. We also plot the corresponding 95% point-wise confidence intervals. Clearly, the effect of the calendar age has a strict increasing trend after wave 3 on the zero-response probability. This is quite intuitive since the older people visit hospital more frequently and hence will result in higher OOPME compared to the relatively younger people. For BMI, the curve shows an increasing trend, indicating that obesity has higher effect on the probability of zero-response for the older people. Since there is a well-known relationship between obesity and many other diseases, this trend accords to our expectation.
Time-varying effects of different health insurances on non-zero out-of-pocket medical expenditure
Figure 3 shows the effects of the insurances on the non-zero OOPMEs. We observe similar trends for the insurance groups as in Figure 1. The explanations will go along the same line. As mentioned in Min and Agresti (2002), for a two-part model, the inference on the covariates are typically same for both the models. Our approach considers the same set of covariates for both the models and hence get similar inference from those.
Time-varying effects of calendar age and BMI on non-zero out-of-pocket medical expenditure
In Figure 4, we show the time-varying effects of BMI and the calendar age on the non-zero OOPME, and the corresponding 95% point-wise credible intervals. The trends are similar to the trends observed in Figure 2. Thus, we have the similar statistical inferences for two dynamic model per our expectation. From the estimated covariance matrix
We perform a simulation study to investigate the operating characteristics of the proposed approach of modelling incomplete zero-inflated longitudinal response. We consider 200 subjects for which a continuous response is measured at 10 different evenly spaced time points. The response for the ith subject (
where
where we consider
Once the complete dataset is simulated, we create ‘monotone missingness’ in the data. We choose 30 subjects randomly, and for each of these 30 subjects we randomly select a time point T between 3 and 9 after which the subject is missing. Note that for a missing subject we consider missing response as well as missing covariates. We analyse the generated incomplete data using the method discussed in Section 2. For the imputation of the missing covariates, we use the method proposed in Section 2.4, and we impute the missing responses under three different model restrictions: CCMV, ACMV and NCMV. First, we analyse the complete data with no missing value and estimate model parameters for two dynamic models. We replicate 100 datasets and estimate the model parameters for each dataset separately. Based on replicated data, we compute the average bias and average width of 95% credible intervals, as well as the corresponding coverage probability. Next, we consider the incomplete data generated by the aforementioned method and impute the missing response values under three model restrictions. We consider 100 replications of such incomplete dataset and compute average bias, width of 95% CI and corresponding coverage probability. The results are summarized in Table 3.
Estimated bias, width of 95% CI and coverage probabilities of model parameters for complete data and incomplete data under different imputation methods for the simulation study
Estimated bias, width of 95% CI and coverage probabilities of model parameters for complete data and incomplete data under different imputation methods for the simulation study
We note that for complete data, our proposed approach estimates the model parameters with very low bias and moderate width of the CIs with desirable coverage probabilities. For the incomplete data almost for all the model parameters, ACMV provides lower bias and shorter CIs. CCMV shows the worst performance. Both CCMV and NCMV result in larger bias and wider CIs compared to ACMV for most of the model parameters. For some parameters, NCMV and ACMV provide similar inference, but CCMV performs worse than the others. We also assess the accuracy of the first dynamic model in predicting a missing zero/non-zero response. For each replicated dataset, we compute the percentage of ‘mismatch’ of the predicted and the actual zero or non-zero response. Based on 100 replicates, we find that our model provides 86% ‘match’ with a standard error of 0.43. Next, we compare the performance of the proposed method to the commonly used Tobit model and the two-part model for our simulated (incomplete) data without imputing the missing data. We consider 100 replicates of the dataset, and for each dataset we fit five competing models: (a) Tobit model for incomplete data, (b) two-part model for incomplete data, (c) proposed model with CCMV imputation, (d) proposed model with ACMV imputation and (e) proposed model with NCMV imputation. In the first two cases missingness is modeled, and in the last three cases missing observations are imputed. We compute three important model selection criteria, for example, (a) mean squared error (MSE), (b) log pseudo-marginal likelihood and (c) deviance information criterion for all the models under consideration. For each replicated dataset, once we fit the models, we compute MSE. Based on 100 replicates, we obtain 100 MSE values for all the five models and compute average MSE (AMSE). A smaller AMSE value indicates a better fit. Second, we use the conditional predictive ordinate (CPO; Gelfand et al., 1992) defined as
Finally, we compute conditional deviance information criteria (DIC) proposed in Celeux et al. (2006). This DIC is based on the conditional likelihood
In Table 4, we show three different measures for goodness of fit. We note that Tobit model and two-part model based on the incomplete data provide poor fit, that is, higher AMSE, lower LPML and higher DIC values. The proposed model with ACMV imputation provides the smallest AMSE and largest LPML. For DIC, the proposed model with NCMV imputation and ACMV imputation are quite comparable. Thus, the effectiveness of the proposed model and ACMV restriction is reassured.
Goodnessof fit measures for different competing models in the simulation study
Missing observations in the longitudinal study are likely to be informative and hence the analysis of incomplete longitudinal data is challenging. In this article we have addressed the issue of analysing zero-inflated incomplete longitudinal response in a Bayesian framework using a two-part model. Bayesian approach makes the data-augmentation simple and computationally fast because it is based on Gibbs sampler. Our approach is based on the assumption that the covariates are uncorrelated, and hence we impute each covariate separately from their respective posterior predictive distributions. However, in some practical application the covariates can be correlated. In that case, a joint model can be developed for simultaneously imputing all the variables through MCMC iterations. We note that in real applications, longitudinal data can be unbalanced in many different ways. In our current work (both in simulation study and real application), we have considered only the unbalanced case due to monotone missingness. Our estimation method and hence our inference heavily depends on this assumption. Thus our findings should not be generalized for any kind of unbalanced longitudinal dataset and should be interpreted with caution.
Supplementary materials
Supplementary materials for this article (sample data and R code) are available from http://www.statmod.org/smij/archive.html.
Appendix
The model is:
Rewrite the earlier model:
The priors are the following:
The model is:
Rewrite the earlier model:
The priors are the following:
Acknowledgments
The authors thank Professor Pulak Ghosh, Indian Institute of Management, Bangalore, India, for sharing the HRS data.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
