Abstract
Given the questionnaire design and the nature of the problem, partially ordered data that are neither completely ordered nor completely unordered are frequently encountered in social, behavioral, and medical studies. However, early developments in partially ordered data analysis are very limited and restricted only to cross-sectional data. In this study, we propose a Bayesian two-level regression model for analyzing repeated partially ordered responses in longitudinal data. The first-level model is defined for partially ordered observations of interest that are taken at each time point nested within individuals, while the second-level model is defined for individuals to assess the effects of their characteristics on the first-level model. A full Bayesian approach with the Markov chain Monte Carlo algorithm is developed for statistical inference. Simulation studies demonstrate the satisfactory performance of the developed methodology. The methodology is then applied to a longitudinal study on adolescent smoking behavior.
Introduction
Partially ordered data are categorical data that are frequently encountered in behavioral, social, psychological, and medical sciences. Unlike the well-studied nominal and ordinal data, partially ordered data that are neither completely ordered nor completely unordered can have highly complex data structures. A typical example of partially ordered data is the disability states of older adults in the Health, Aging, and Body Composition dataset described by Ip et al. (2013). As shown in Figure 1, different categories represent different disability states. Category 1 represents the nondisabled state; categories 2–5 indicate having one form of difficulty in performing daily activities, such as vertical mobility difficulty, horizontal mobility difficulty, and four other forms of difficulties; categories 6–8 indicate having two forms of difficulties; category 9 indicates having all six forms of difficulties captured by the study; and category D represents death. An arrow from one category to another indicates the dominance relationship. For example, the arrow between categories 1 and 4 shows that category 1 (no disability) dominates category 4 (horizontal mobility difficulty), whereas categories 4 and 5 (vertical mobility difficulty) are incomparable and are not linked by an arrow. In this typical example of partially ordered data, some categories follow rank order while others do not.

Figure 1. The partially ordered structure of the 10 disability states of the older adults in the Health ABC data in Ip et al. (2013).
To cope with this special data structure, one may combine the incomparable categories at nearly the same level into one equivalence class and then treat the combined classes as completely ordered categories (Wilson 1992) so that well-known ordinal models can be used. In the aforementioned disability states example, applying this method leads to five ordinal categories, {1}, {2,3,4,5}, {6,7,8}, {9}, and {D}, which can be interpreted as categories of no disability, mild disability, severe disability, worst disability, and death, respectively. However, this combination procedure jeopardizes the interpretability of the model because the combined categories can no longer be differentiated. For example, one cannot find the unique risk factors corresponding to category 4 after categories 2–5 have been combined. Another simple approach for analyzing partially ordered data is to regard all categories as nominal. However, such a method ignores the valuable ordinal information and leads to information loss. Recently, researchers have proposed improved methods to model the dominance relationship for every category in the partially ordered structure without combining the incomparable categories. Meulders, Ip, and Boeck (2005) used a dominance matrix to describe the dominance relationship among the partially ordered categories and proposed latent variable models to analyze partially ordered responses to anger-related feelings. Ip et al. (2013) developed a partially ordered mixed hidden Markov model to analyze the hidden disability states of older adults with the structure of states shown in Figure 1. However, this approach does not model the relationships among incomparable categories, such as categories 2–5 in Figure 1, and is unable to manage highly complicated partially ordered data structures, such as disjoint ones.
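The information loss caused by combining categories can be illustrated with a minimal Python sketch. The grouping follows the five classes listed above; the function name `collapse` and the example responses are hypothetical and serve only as an illustration.

```python
# Collapse the ten disability states of Figure 1 into five ordinal classes.
# The grouping follows the text: {1}, {2,3,4,5}, {6,7,8}, {9}, {D}.
ORDINAL_CLASS = {
    "1": 0,
    "2": 1, "3": 1, "4": 1, "5": 1,
    "6": 2, "7": 2, "8": 2,
    "9": 3,
    "D": 4,
}
CLASS_LABELS = ["no disability", "mild disability", "severe disability",
                "worst disability", "death"]


def collapse(states):
    """Map each partially ordered state to its combined ordinal class."""
    return [ORDINAL_CLASS[s] for s in states]


# Hypothetical responses, for illustration only.
observed = ["1", "4", "5", "7", "9", "D"]
collapsed = collapse(observed)

# States 4 and 5 are incomparable in the poset, yet they receive the same
# class after collapsing, so their specific risk factors cannot be separated.
for state, cls in zip(observed, collapsed):
    print(state, "->", CLASS_LABELS[cls])
```

After collapsing, any model fitted to `collapsed` sees states 4 and 5 as the same outcome, which is the loss of differentiability described above.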
The ultimate goal of partially ordered data analysis is to model the relationships among the incomparable categories while retaining all ordinal information of the data structure. To this end, several studies have proposed alternative ways of modeling partially ordered data precisely. Zhang and Ip (2012) regarded the partially ordered structure as a hierarchy of nominal and ordinal substructures and developed partitioned conditional models through several classes of generalized linear models. Peyhardi, Trottier, and Guédon (2016) defined a class of partitioned conditional generalized linear models through a partition tree of categories to handle complex data structures with any level of hierarchy and applied this method to partially ordered responses. Ip, Chen, and Quandt (2016) developed an item response model for analyzing multiple disjoint partially ordered responses by using a two-stage estimation algorithm. However, the aforementioned developments have focused only on cross-sectional data, and none of them considers repeated partially ordered responses in the longitudinal setting. Moreover, a Bayesian method for analyzing partially ordered data is yet to be proposed.
Substantive research often encounters longitudinal data where the observed variables are measured repeatedly over time. Given that the classical assumption of independence between observations does not hold, those statistical methods that ignore the dependency structure can underestimate the variance of the estimated parameters and are therefore inapplicable. Statistical models, such as the random effects model, that consider the interdependence among repeated measurements of longitudinal data are popular in the statistical literature and have attracted wide attention in applications (see, e.g., Diggle 2002; Laird and Ware 1982; Verbeke and Molenberghs 2009). Longitudinal data can also be viewed as two-level data with repeated measurements (first level) nested within individuals (second level; Hedeker and Gibbons 2006). The two-level type model not only introduces random coefficients in the first-level regression to address the interdependence among the repeated measurements but also includes time-invariant covariates as explanatory variables in the second-level model to examine the influence of individual characteristics on the first-level model (Hox, Moerbeek, and van de Schoot 2010; Longford 1995). One of the primary advantages of two-level models is that they provide insights into the effects of time-invariant factors, such as gender and race, on the random coefficients in order to better reflect the relationships between the longitudinal covariates and the outcomes for each individual (Muñoz and Chang 2007; Raudenbush and Bryk 2002; Willms and Raudenbush 1989).
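The variance underestimation mentioned above can be seen in a small simulation. The following Python sketch is illustrative only and is not part of the proposed methodology: it assumes a simple random-intercept model for a continuous outcome, and all names and parameter values (`simulate_once`, 50 subjects, 5 time points, unit variances) are hypothetical.

```python
import math
import random
import statistics


def simulate_once(rng, n_subjects=50, n_times=5, sd_subject=1.0, sd_noise=1.0):
    """Return (grand mean, naive SE, cluster-aware SE) for one simulated data set."""
    data = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, sd_subject)  # random intercept shared by one subject
        data.append([u + rng.gauss(0.0, sd_noise) for _ in range(n_times)])
    flat = [y for row in data for y in row]
    grand_mean = statistics.fmean(flat)
    # Naive SE: treats all n_subjects * n_times observations as independent.
    naive_se = statistics.stdev(flat) / math.sqrt(len(flat))
    # Cluster-aware SE: uses one mean per subject as the independent unit.
    subject_means = [statistics.fmean(row) for row in data]
    cluster_se = statistics.stdev(subject_means) / math.sqrt(n_subjects)
    return grand_mean, naive_se, cluster_se


rng = random.Random(2024)
results = [simulate_once(rng) for _ in range(500)]

# Spread of the estimated mean across replications (the "true" sampling SD).
empirical_sd = statistics.stdev([r[0] for r in results])
mean_naive_se = statistics.fmean([r[1] for r in results])
mean_cluster_se = statistics.fmean([r[2] for r in results])

print("empirical SD of the mean:", round(empirical_sd, 3))
print("average naive SE:        ", round(mean_naive_se, 3))
print("average cluster-aware SE:", round(mean_cluster_se, 3))
```

Under these assumptions, the naive standard error is noticeably smaller than the actual sampling variability of the estimate, whereas the standard error that respects the nesting within individuals is close to it.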
This research is motivated by a data set on adolescent smoking behavior taken from the National Longitudinal Surveys of Youth (NLSY). Adolescent smoking is one of the driving forces of many public health problems and has been linked to negative physical, psychological, and interpersonal consequences (Petraitis, Flay, and Miller 1995; Repetti, Taylor, and Seeman 2002a; Weschler et al. 1994). Adolescent smokers have different smoking behaviors, and each smoking behavior has unique risk factors (Schane, Ling, and Glantz 2010). Obtaining a comprehensive understanding of behavior-specific risk factors can inform treatment and prevention strategies for teenagers in various conditions and stages of smoking. Adolescent smokers can be classified into several ordered or not strictly ordered categories, namely, nonsmoker, former smoker, light-nonfrequent smoker, light-frequent smoker, heavy-nonfrequent smoker, and heavy-frequent smoker, and a comprehensive approach for dealing with such a partially ordered data structure needs to be developed (Zhang and Ip 2012). Moreover, the NLSY is a longitudinal study that contains repeated partially ordered responses. This longitudinal feature introduces additional difficulties in modeling the partially ordered data structure. To our knowledge, this is the first study that precisely models partially ordered data in the longitudinal setting.
In this study, a Bayesian two-level model for analyzing repeated partially ordered adolescent smoking behavior data is proposed. In the first-level model, both the response and covariates are time varying, and the regression coefficients are regarded as random so that both the intercepts and slopes differ across individuals. We propose partitioned conditional models to formulate the accurate structure of the smoking behaviors without combining incomparable categories. A conditional coding algorithm is simultaneously developed to facilitate the formulation of the complex data structure. The model enables the identification of different risk factors for smoking behaviors from incomparable categories. In the second-level model, we model the variation of random coefficients through time-invariant covariates and random components at the individual level. This model also explains how the individual-level covariates affect the relationship between the time-varying covariates and the response in the first-level model. In the proposed model setting, the observed data likelihood involves intractable high-dimensional integrals, so the existing maximum likelihood methods are not directly applicable to partially ordered data. This study develops a Bayesian approach, along with efficient Markov chain Monte Carlo (MCMC) techniques, for statistical inference. The Bayesian approach is proposed not only for its basic advantages, such as utilizing useful prior information and producing reliable results in small samples (Ansari and Jedidi 2000; Lee and Song 2004), but also for its power and efficiency in managing complicated models and data structures for which other methods are either infeasible or work poorly (Congdon 2007; Skrondal and Rabe-Hesketh 2004). In this study, due to the complex dependence of the likelihood of the proposed model, many numerical optimization procedures are likely to fail in parameter estimation. By contrast, the sampling-based Bayesian approach with MCMC methods has proven to be a powerful tool for overcoming the various challenges confronting the conventional optimization procedures (e.g., Congdon 2007; Dunson 2000; Feng, Wu, and Song 2017; Song et al. 2013). Another appealing feature of MCMC is that various characteristics of the posterior distribution, such as the posterior mean, mode, and percentiles, can be investigated based on stationary simulated observations.
The rest of this article is organized as follows. Section 2 describes the partially ordered set theory and the model formulation. Section 3 briefly introduces the Bayesian analysis procedure. Section 4 evaluates the empirical performance of the proposed model through a simulation study. Section 5 presents a real data example concerning the risk factors of adolescent smoking. Section 6 concludes the article with a discussion. Technical details are provided in the Online Appendix (which can be found at http://smr.sagepub.com/supplemental/).
Model Description
Partially Ordered Set (Poset) Theory
The mathematical model for partially ordered data is a poset, which has been intensively studied as a mathematical object (Dushnik and Miller 1941; Birkhoff 1940). A poset is a set (P) equipped with a binary relation (≤). Any two distinct elements in a poset can be comparable or incomparable with respect to the binary relation. As a typical poset example, assume four elements e1, e2, e3, and e4, where

Figure 2. A simple partially ordered set structure with four categories.
The poset structure can be regarded as a hierarchy of nominal and ordinal substructures. The following ordered partition theorem, which is proven in Zhang and Ip (2012), provides a basis for specifying the hierarchical substructures of a poset starting from a set of given pairwise comparable and incomparable relationships among elements.
We take the poset mentioned in Zhang and Ip (2012) as an illustrative example. As shown in Figure 3, the poset

Figure 3. The Hasse diagram of the partially ordered set of nine categories in the illustrative example. Note that there are three connected components.
Let Y be a random variable that takes the values in the poset categories from 1 to 9. The conditional modeling technique uses three random variables, namely,
First, we partition the poset into the disjoint connected components {1}, {2,3,4,5,6,7}, and {8,9} by using the breadth-first search algorithm in graph theory (Cormen 2009). The random variable
Second, each connected component is partitioned into totally weakly ordered antichains. The first connected component {1} has only one category. We do not consider a model for it because the random variable is degenerate. The second connected component {2,3,4,5,6,7} corresponds to
Similarly, for the third connected component {8,9}, the ordinal random variable
Third, the elements in each antichain are pairwise incomparable and are thus modeled through nominal models. Taking the antichain {4,5,6} corresponding to
The conditional models for other antichains are similarly defined.
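As a minimal sketch of the first two partitioning steps, the following Python code finds the connected components by breadth-first search and then groups each component into ordered antichains. The edge list is hypothetical: only the component memberships {1}, {2,3,4,5,6,7}, {8,9} and the antichain {4,5,6} are taken from the text, and the function names are illustrative rather than part of the authors' software.

```python
from collections import deque

# Hypothetical Hasse diagram for the nine categories; an edge (a, b) means
# "a dominates b". The edges are assumed for illustration only.
CATEGORIES = range(1, 10)
EDGES = [(2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6),
         (4, 7), (5, 7), (6, 7), (8, 9)]


def connected_components(nodes, edges):
    """Split the nodes into connected components with breadth-first search."""
    neighbours = {v: set() for v in nodes}
    for a, b in edges:  # direction is ignored when testing connectivity
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in sorted(neighbours[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components


def antichain_levels(component, edges):
    """Group a component into ordered antichains by depth below its top level."""
    members = set(component)
    parents = {v: [a for a, b in edges if b == v and a in members]
               for v in component}
    depth = {}

    def depth_of(v):
        # Depth is the length of the longest chain from a top element to v.
        if v not in depth:
            depth[v] = 0 if not parents[v] else 1 + max(depth_of(p) for p in parents[v])
        return depth[v]

    levels = {}
    for v in component:
        levels.setdefault(depth_of(v), []).append(v)
    return [sorted(levels[d]) for d in sorted(levels)]


components = connected_components(CATEGORIES, EDGES)
partition = [antichain_levels(c, EDGES) for c in components]
print(components)
print(partition)
```

With the assumed edges, the second component is split into three ordered levels, and the elements within each level are pairwise incomparable, which is where the nominal models of the third step would apply.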
Through these steps, we can clearly demonstrate the partially ordered structure of the categories in a Hasse diagram as shown in Figure 3. Based on the definition of
The One-to-one Mapping Between Y and
Note. NA = not applicable.
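The idea of a one-to-one conditional coding can be illustrated with a small Python sketch that maps each category to a triple of codes and back. The partition used here, with antichains {2,3}, {4,5,6}, and {7} in the second component, is an assumption made for illustration (only {4,5,6} is stated in the text), and the triple (component, level, position) is a hypothetical stand-in for the coding variables of the table above, with `None` playing the role of NA.

```python
# Assumed partition of the nine categories: components -> ordered antichains.
PARTITION = [
    [[1]],
    [[2, 3], [4, 5, 6], [7]],
    [[8], [9]],
]


def build_coding(partition):
    """Return (encode, decode) dicts between categories and code triples."""
    encode = {}
    for comp_idx, component in enumerate(partition, start=1):
        for level_idx, antichain in enumerate(component, start=1):
            for pos_idx, category in enumerate(antichain, start=1):
                # A code is "not applicable" (None) when it is degenerate:
                # a single-level component or a single-element antichain.
                level = level_idx if len(component) > 1 else None
                position = pos_idx if len(antichain) > 1 else None
                encode[category] = (comp_idx, level, position)
    decode = {code: category for category, code in encode.items()}
    return encode, decode


encode, decode = build_coding(PARTITION)
for category in sorted(encode):
    print(category, encode[category])
```

Because every category receives a distinct triple, the original response can be recovered from its codes, which is what allows the conditional models to be specified level by level.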
Two-level Model for Repeated Partially Ordered Responses
Consider a set of partially ordered observations
The first-level models for the repeated partially ordered responses are defined through the following conditional models:
1. For where
2. For where For where
3. We regard the elements within antichains as nominal categories and propose conditional nominal models as follows: For where For where
The second-level model that examines the interdependence among the repeated measurements and the variation of random coefficients
where
The two-level model defined above is particularly appropriate for research designs where data for participants are collected in a nested structure (e.g., repeated measurements nested within individuals). In the longitudinal study of adolescent smoking behavior discussed in the Introduction section, while potential predictors, such as age and educational enrollment status, can influence adolescent smoking behavior over time, these effects may exhibit heterogeneity due to their dependence on certain time-invariant individual characteristics, such as gender and race. A simple way to incorporate this into the regression model is to add binary indicator variables for gender (male = 0) and race (nonwhite = 0). This would have the effect of shifting the log odds in equations (1)–(5) up or down, but it still assumes that the effects of age and educational enrollment status on adolescents’ smoking behavior are the same regardless of their gender and race. This may not be the case in reality. Gender or racial differences are likely to cause the predictors to have different effects for males and females or for whites and nonwhites. Moreover, compared with the fixed effect model that treats cluster-specific characteristics as nuisances and does not estimate them, the proposed two-level random effects model views the clustering of individuals’ repeated measurements as a feature of interest in its own right, and not just a nuisance to be adjusted for, thereby providing additional insights into the rationale behind the clustering.
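The difference between adding an indicator variable and using a second-level model can be made explicit with a schematic single-covariate version. The notation below is hypothetical and simplified (a generic log odds $\eta_{it}$, one time-varying covariate $x_{it}$, and one binary individual-level covariate $g_i$); it is a sketch of the general idea and not a restatement of equations (1)–(6).

```latex
% Hypothetical notation: \eta_{it} is a log odds for individual i at time t,
% x_{it} is a time-varying covariate, g_i is a time-invariant binary covariate,
% and u_{0i}, u_{1i} are individual-level random components.
\begin{align*}
\text{Indicator only:}\quad
  \eta_{it} &= \beta_0 + \beta_1 x_{it} + \gamma g_i, \\
\text{Two-level:}\quad
  \eta_{it} &= b_{0i} + b_{1i} x_{it}, \\
  b_{0i} &= \gamma_{00} + \gamma_{01} g_i + u_{0i}, \qquad
  b_{1i} = \gamma_{10} + \gamma_{11} g_i + u_{1i}, \\
\Rightarrow\quad
  \eta_{it} &= \gamma_{00} + \gamma_{01} g_i
             + (\gamma_{10} + \gamma_{11} g_i)\, x_{it}
             + u_{0i} + u_{1i} x_{it}.
\end{align*}
```

In this sketch, the cross-level coefficient $\gamma_{11}$ lets the effect of $x_{it}$ differ between the two groups defined by $g_i$, whereas the indicator-only specification implicitly fixes this term at zero and only shifts the log odds by $\gamma$.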
Bayesian Inference
Let
where
Let
Simulation Study
In this section, we conduct two simulations to evaluate the performance of the proposed method. Simulation 1 (Simulation 1 subsection) examines the performance of parameter estimation. Simulation 2 (Simulation 2 subsection) checks the robustness of the proposed model and MCMC method to prior inputs, the burn-in phase and length of MCMC, the initial values of the unknown parameters, and model misspecification.
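The kind of robustness check described for Simulation 2 can be illustrated on a toy problem. The following Python sketch is not the sampler used in this article: it assumes a random-walk Metropolis algorithm for the mean of normal data with unit variance and a normal prior, for which the exact posterior mean is available for comparison. All names, seeds, chain lengths, and starting values are hypothetical.

```python
import math
import random
import statistics

PRIOR_SD = 10.0  # assumed prior: mu ~ N(0, PRIOR_SD^2)


def log_posterior(mu, data):
    """Log posterior (up to a constant) of a normal mean with unit data variance."""
    log_prior = -0.5 * (mu / PRIOR_SD) ** 2
    log_lik = -0.5 * sum((y - mu) ** 2 for y in data)
    return log_prior + log_lik


def metropolis(data, init, n_iter, step=0.5, seed=0):
    """Random-walk Metropolis sampler; returns the full chain including burn-in."""
    rng = random.Random(seed)
    current, current_lp = init, log_posterior(init, data)
    chain = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


def summarize(chain, burn_in):
    """Posterior mean and 2.5% / 97.5% percentiles after discarding burn-in."""
    kept = chain[burn_in:]
    cuts = statistics.quantiles(kept, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return statistics.fmean(kept), cuts[0], cuts[-1]


data_rng = random.Random(1)
data = [data_rng.gauss(1.5, 1.0) for _ in range(40)]
exact_mean = sum(data) / (len(data) + 1.0 / PRIOR_SD ** 2)

# Two chains started far apart, on opposite sides of the posterior mode.
chain_a = metropolis(data, init=-20.0, n_iter=12000, seed=11)
chain_b = metropolis(data, init=20.0, n_iter=12000, seed=22)
summary_a = summarize(chain_a, burn_in=2000)
summary_b = summarize(chain_b, burn_in=2000)

print("exact posterior mean:", round(exact_mean, 3))
print("chain A (mean, 2.5%, 97.5%):", [round(v, 3) for v in summary_a])
print("chain B (mean, 2.5%, 97.5%):", [round(v, 3) for v in summary_b])
```

If the chains have reached stationarity by the end of the burn-in phase, their summaries should agree with each other regardless of the starting value; lengthening the burn-in phase or the retained run and observing similar summaries is the analogous check for chain length.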
Simulation 1
We consider the model defined by equations (1)–(6) with
To generate
Afterward, we generate the partially ordered
In the simulation study, the prior inputs for the hyperparameters in the prior distributions are assigned as follows: (prior 1)
Estimation Results in Simulation 1.
Note.
The computer code is written in R. The R code for this simulation is provided on a freely accessible website (http://www.sta.cuhk.edu.hk/xysong/BTLM).
Simulation 2
To check the sensitivity of Bayesian results to prior inputs, we disturb prior 1 by modifying the hyperparameters as follows: (prior 2)
Estimation Results Under Different Priors in Simulation 2.
Note.
To examine the robustness of MCMC to its burn-in phase and length, we change the burn-in phase and length from {8,000, 10,000} to {15,000, 15,000} iterations. Specifically, for each of the above 100 data sets with
Estimation Results Under Different Burn-in Phase and Initial Values of the Unknown Parameters in Simulation 2.
Note.
To evaluate the robustness of MCMC to the initial values of the unknown parameters, we consider two different initial values as follows: Initial 1 (used in simulation 1): the elements of
Finally, we investigate the robustness of the proposed model to the misspecification of distributional assumptions. We consider a model setup that is the same as simulation 1, except that the distribution of
Estimation Results Under
Note.
Estimation Results Under
Note.
A Study on the Partially Ordered Adolescent Smoking Behavior Dataset
In this section, we illustrate the proposed methodology by using a longitudinal data set extracted from NLSY (Center for Human Resource Research 2003). The NLSY is one of the most comprehensive longitudinal studies of youth with a representative sample of adolescent respondents aged from 12 to 18 years at the time of the first interview in 1997 and respondents aged from 28 to 34 years at the 16th round (the last round) of interviews in 2013–2014. We focused on years 1998, 1999, 2000, 2002, and 2004 because the questionnaire design in these years is relatively consistent.
This analysis focused on adolescent smokers, who can be classified according to their different smoking behaviors. Following the medical and sociological literature (Ham and Hope 2003; Zhang and Ip 2012), we applied three related questions, namely, “ever smoked,” “number of cigarettes smoked per day in the last 30 days,” and “number of smoking days in the last 30 days,” to form the categories of our outcome smoking behavior variable, including nonsmoker, former smoker, light-nonfrequent smoker (smoked on

The partially ordered set structure of adolescent smoking categories in the National Longitudinal Surveys of Youth dataset.
The following
We applied the following two-level model to analyze the longitudinal, partially ordered adolescent smoking behavior data set described above. For the first-level model, the partially ordered smoking behavior variable has only one connected component, which means that
For
where
For
The second-level model to examine the influence of time-invariant individual characteristics, such as gender and race, on the varying coefficients
where
The proposed Bayesian method was applied to conduct the statistical inference. The hyperparameters in the prior distributions were set to the same values as in prior 1 in simulation 1. The algorithm converged within 10,000 iterations in the analysis of this data set. Thus, we collected 10,000 posterior observations after discarding the first 10,000 burn-in iterations. The estimation results of the model parameters, together with their SE estimates, are presented in Table 7. We also conducted the analysis using prior 2 specified in simulation 2. The estimation results are similar to those reported in Table 7 and are not reported. All the estimated variances
Regression Coefficients in the Analysis of the Adolescent Smoking Behavior Dataset.
We start from the conditional ordinal model (8) of
For the conditional nominal model (9) of
In summary, the above analysis revealed how teenagers’ age, education enrollment status, household size, parental help, and residence area influence their smoking behavior. The results of the second-level model provided additional insights into how these effects vary according to certain individual characteristics, such as gender and race. These insights cannot be revealed by conventional single-level regression.
Discussion
In this study, we propose a Bayesian two-level model to analyze longitudinal data with a partially ordered structure. The model comprises a first-level random coefficients model to investigate the relationship between time-varying covariates and partially ordered outcomes and a second-level regression model to examine the effects of individual characteristics on the random coefficients. The proposed model accommodates the exact structure of the poset so that all the ordinal information among the categories is kept and the incomparable categories at the same ordinal level can be well separated in the analysis of the partially ordered data. The model also accounts for the interdependency among the repeated partially ordered responses and examines how individual characteristics affect the first-level random coefficients. The simulation results show that the proposed method performs well. Applying the proposed method to the longitudinal adolescent smoking behavior data set provides new insights into the effects of risk factors for adolescent smoking behaviors that are characterized through partially ordered categories. Such a data structure cannot be accommodated by the existing methods. The analysis results deepen our understanding of adolescent smoking behaviors and can inform effective strategies to prevent adolescents from smoking.
Although other methods, such as generalized estimating equations (GEE) and robust standard errors, can perform the same or similar analyses without causing many numerical problems, they may encounter difficulty when the model and data structures become increasingly complicated. For instance, in the presence of missing data, GEE can accommodate only data that are missing completely at random. By contrast, MCMC can easily manage various types of missing data by adding additional posterior sampling step(s). Thus, the proposed MCMC method has a high potential to be extended to address various challenging problems for which other methods are either infeasible or work poorly.
In this study, the partially ordered structure of the response is constructed based on the relations between categories in the poset. Such a poset structure must be determined beforehand according to the nature and aim of the substantive study and/or experts’ knowledge. This structure may vary according to the objective of the study. For instance, in the study of adolescent smoking behavior, {former smoker} can be regarded as a lighter category than {light-nonfrequent smoker} if our focus is on adolescents’ misconduct behavior. However, if the primary goal of a study is to investigate the effect of smoking on the risk of lung cancer, then the two categories cannot be clearly ranked because it is unknown which of them is associated with a higher risk of lung cancer.
Notably, the parameters in the conditional models for partially ordered data only allow for conditional interpretations. Moreover, we assume that the regression coefficients in the second-level model (6) are invariant to r and k. Although extensions to allow r- and/or k-specific regression coefficients are straightforward, such extensions dramatically increase the number of parameters and thus cause difficulties in interpretation.
Supplemental Material
Supplemental Material, Appendix_(1) for Bayesian Two-level Model for Repeated Partially Ordered Responses: Application to Adolescent Smoking Behavior Analysis by Xiaoqing Wang, Haotian Wu, Xiangnan Feng, and Xinyuan Song in Sociological Methods & Research
Footnotes
Acknowledgment
The authors thank the editor, the associate editor, and the two reviewers for their valuable comments, which substantially improved this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by GRF grants 14301918 and 14303017 from the Research Grant Council of HKSAR, the National Natural Science Foundation of China (11671054, 11871263, 71802166), and the Direct Grant of the Chinese University of Hong Kong.
Supplemental Material
Supplemental material for this article is available online.