Abstract
In categorical data analysis, several regression models have been proposed for hierarchically structured responses, such as the nested logit model, the two-step model or the partitioned conditional model for partially ordered set. The specifications of these models are heterogeneous and they have been formally defined for only two or three levels in the hierarchy. Here, we introduce the class of partitioned conditional generalized linear models (PCGLMs) that encompasses all these models and is defined for any number of levels in the hierarchy. The hierarchical structure of these models is fully specified by a partition tree of categories. Using the genericity of the recently introduced
Introduction
Response categories are often hierarchically organized, that is, partitioned into subsets of related categories. Several partitioned conditional regression models have been proposed in different applied fields, including econometrics, medicine and psychology. This is the case, for instance, when the assessment of a disease after a treatment is measured using a coarse scale with ordered categories (‘worse’, ‘same’ and ‘improvement’) and a fine scale to decompose the category ‘improvement’ into several degrees. Tutz (1989) introduced the two-step model to take account of the hierarchy among ordinal choices in medicine (see also Morawitz and Tutz, 1990). Although the hierarchical structure among categories may seem natural for ordinal responses, it also makes sense for nominal responses. This is the case in econometrics, when an individual compares different alternatives and makes a decision in several steps. This is illustrated by the well-known example of travel demand with the four alternatives: ‘air’, ‘bus’, ‘car’ and ‘train’. The nested logit model, introduced by McFadden (1978), makes it possible to decompose the decision mechanism into two steps: (a) ‘air’ versus other alternatives and (b) choice among the three alternatives ‘bus’, ‘car’ and ‘train’. Zhang and Ip (2012) introduced the partitioned conditional model for partially ordered set (POS-PCM). We will show that introducing a hierarchical structure among partially ordered categories is very relevant for analyzing the influence of growth on the branching pattern in plants.
Until now, partitioned conditional models have been formally defined for only two or three levels in the hierarchy. Furthermore, they all assume that the hierarchy among categories is a priori known. Our first contribution is to use partition trees to specify the hierarchy among categories. This enables us to define partitioned conditional models for an arbitrary number of levels in the hierarchy. Moreover, using the genericity of the
In Section 2, the
Partitioned conditional GLMs
(r, F, Z) specification of GLMs for categorical responses
Consider the regression context, where the response
Each linear predictor has the form
The
We first need to introduce definitions and notations of partition trees.
child vertices constitute a partition of their parent vertex and every singleton
Let
a partition tree a collection of models
Therefore, we have for each category
The class of PCGLMs for categorical response variables is the set of
(a)
-partition tree and (b)
-partition tree.
There are exactly
See Appendix B for the proof. Henceforth, we will specify any PCGLM by its graphical representation, with each non-terminal vertex being labelled by an
Graphical representation of a PCGLM.
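To make the partition-tree structure concrete, the following minimal Python sketch encodes a tree as nested tuples, checks the partition property, and obtains the probability of each category as the product of the conditional probabilities along its path from the root. The nested-tuple encoding, the function names and the numerical conditional probabilities are illustrative assumptions, not part of the original specification; in a PCGLM each set of conditional probabilities would come from a fitted GLM at the corresponding non-terminal vertex.

```python
from math import isclose

# Assumed encoding: a leaf is a category label (str); a non-terminal
# vertex is a tuple of child subtrees. Travel demand example:
# 'air' versus the ground alternatives, then 'bus', 'car', 'train'.
travel_tree = ("air", ("bus", "car", "train"))


def categories(tree):
    """List the categories (leaves) below a vertex, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [cat for child in tree for cat in categories(child)]


def is_partition_tree(tree):
    """Check that every non-terminal vertex has at least two children
    and that no category appears twice, so that the children of each
    vertex form a partition of the categories of that vertex."""
    if isinstance(tree, str):
        return True
    if len(tree) < 2:
        return False
    cats = categories(tree)
    return len(cats) == len(set(cats)) and all(
        is_partition_tree(child) for child in tree
    )


def category_probabilities(tree, cond):
    """Multiply conditional probabilities along each root-to-leaf path.

    `cond` maps frozenset(categories of a non-terminal vertex) to the
    list of conditional probabilities of its children, in order.
    """
    if isinstance(tree, str):
        return {tree: 1.0}
    out = {}
    for child, p in zip(tree, cond[frozenset(categories(tree))]):
        for cat, q in category_probabilities(child, cond).items():
            out[cat] = p * q
    return out


# Hypothetical conditional probabilities, for illustration only.
cond = {
    frozenset({"air", "bus", "car", "train"}): [0.3, 0.7],
    frozenset({"bus", "car", "train"}): [0.2, 0.5, 0.3],
}
probs = category_probabilities(travel_tree, cond)
```

With these illustrative numbers, the probability of ‘car’ is the probability of choosing a ground alternative times the conditional probability of ‘car’ among ground alternatives, and the four category probabilities sum to one.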
Using the partitioned conditional structure of the model, the log-likelihood can be decomposed as follows
Concavity of
Binary partition trees: In the Bernoulli regression case, the log-likelihood is strictly concave when
Canonical GLMs: In the case of a canonical GLM for categorical responses—that is, (reference, logistic,
PCGLM specification of the nested logit model
The nested logit model for nominal responses was introduced by McFadden (1978) in the framework of individual choice behaviour. This model was designed in order to avoid the inconsistency of the independence of irrelevant alternatives (IIA) property in some situations. This is well illustrated by the paradox of the classical blue and red buses example (Debreu 1960). In the following model, the nested logit model is presented with only two levels. Let
Multinomial and conditional logit models, that share the canonical link function, are specified by the (reference, logistic,
The dataset (Greene 2003) contains
Estimates of the nested logit model and three PCGLMs.
(a) PCGLM specification of a nested logit model and (b) PCGLM 1 for the travel demand between Melbourne and Sydney.
In this context, the hierarchy among response alternatives is a priori fixed. We used a classical hierarchy for this dataset: the first level separates air from other alternatives and the second level separates bus, train and car. The nested logit model (Hensher and Greene 2002) can be specified in the PCGLM context by a partition tree and a collection of
(a) PCGLM 2 and (b) PCGLM 3 for the travel demand between Melbourne and Sydney.
Looking at the log-likelihoods, we see that replacing the inclusive values by the characteristics which composed them improved the fit. Moreover, the corresponding parameters are easy to interpret (effect of cost and transfer time on choice between air and ground). Let us remark that PCGLMs 1 and 2 locally respect the RUM principle since each conditional model of the collection
PCGLM specification and extension of the two-step model
The two-step or compound model was defined by Tutz (1989) in order to decompose the latent mechanism of an ordinal response into two levels. Ordinal responses are commonly used in medicine and psychology, for instance, to assess a patient's condition. This ordinal scale is often built from a coarse and a fine scale.
For the back pain prognosis dataset (Doran and Newell 1975), the response variable
Two-scale back pain assessment.
The two-step model can be extended in different ways. A partition tree with a depth greater than two that preserves the ordering among categories can be used. Furthermore, different link functions appropriate for ordinal responses can be used for each non-terminal vertex. For instance, the (adjacent,
(a) PCGLM specification of a two-step model and (b) PCGLM for the back pain prognosis example.
As we saw in our first two examples, the hierarchy is often a priori fixed according to the relations between response categories. In the case of a symmetric scale for item responses, Thissen-Roe and Thissen (2013) proposed to automatically determine the hierarchy by decomposing the decision into two steps. At the second level, the two child vertices share a common distribution, so the likelihood cannot be maximized separately. Note that the hierarchical structure of this two-decision model does not respect the partition tree definition when there is an odd number of response categories.
In this section, we extend the indistinguishability procedure of Anderson (1984) to select the hierarchy among ordered response categories. He introduced the stereotype model derived from the classical multinomial logit model
Anderson (1984) proposed a testing procedure to identify consecutive categories that can be clearly distinguished by the explanatory variables
PCGLM specification of indistinguishability hypothesis.
The indistinguishability procedure can be viewed as a partitioning procedure using the PCGLM specification; see Figure 7 and Appendix C for details. Since the partition tree
First level: Note that all the PCGLMs with only a root proportional model (and minimal response models for other non-terminal vertices) have exactly the same number of parameters:
PCGLM selected by the extended indistinguishability procedure with the back pain prognosis dataset.
PCGLM specification of a one-dimensional stereotype model of Anderson (1984) with the back pain prognosis dataset.
It can be seen that the original categories need to be aggregated in order to efficiently describe the back pain prognosis. At the first level, our results are similar to those of Anderson, that is, the partition (worse), (same, slight improvement, moderate improvement) and (marked improvement, complete relief) for the three explanatory variables. Anderson (1984) obtained a log-likelihood of
In categorical data analysis, the case of nominal and ordinal responses has already been investigated in depth, while the case of partially ordered responses has been comparatively neglected. Zhang and Ip (2012) introduced the partitioned conditional model for partially ordered set. In the following text, the method of Zhang and Ip (2012) and our method will be described and compared using the pear tree dataset.
The pear tree dataset
The branching process in temperate trees is a sequential process where successive developmental events are modulated by the growth of the parent shoot (made of a succession of internodes separated by nodes where the axillary shoots are located). We here focus on the immediate axillary shoots, that is, developed without delay with respect to the parent node establishment date. The first event
Successive nodes and axillary productions of pear tree.
Let us assume that every partially ordered variable
Correspondence between the latent process and the response.
Hasse diagrams obtained assuming that (a) the elongation precedes the transformation into spine and (b) the elongation and the transformation into spine are non-ordered.
The main idea of Zhang and Ip (2012) was to propose an algorithm that builds a partitioned conditional model from any Hasse diagram.
An antichain is a set of pairwise incomparable elements. In our example, the partially ordered set summarized by the Hasse diagram in Figure 11(a) is partitioned into the three totally ordered antichains (l), (u, s) and (U, S). Zhang and Ip (2012) proposed to use the proportional odds logit model, considering the antichains as ordered categories, and then used a multinomial logit model within each antichain. The corresponding partitioned conditional model is specified in Figure 12 (BIC = 4 333.939). It should be noted that Zhang and Ip (2012) used a weak ordering relation between sets of categories and, thus, some information is lost during the partition tree construction. In our example, the same POS-PCM would be obtained using the second Hasse diagram in Figure 11(b).
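The partition of a partially ordered set into ordered antichains can be illustrated with a short Python sketch that groups elements by their height in the Hasse diagram, that is, the length of the longest chain below each element. This is one standard construction and is not claimed to be the exact algorithm of Zhang and Ip (2012). The cover relations below are hypothetical, chosen only so that the result matches the three antichains quoted above; the actual diagram of Figure 11(a) is not reproduced here.

```python
def antichain_levels(elements, covers):
    """Partition a finite poset into antichains ordered by height.

    `covers` is a list of (lower, upper) pairs from the Hasse diagram.
    Two elements with the same height are necessarily incomparable,
    so each returned level is an antichain.
    """
    below = {e: [] for e in elements}
    for lower, upper in covers:
        below[upper].append(lower)

    height = {}

    def h(e):
        # Height is 0 for minimal elements, otherwise one more than
        # the largest height among the elements directly below.
        if e not in height:
            height[e] = 1 + max((h(a) for a in below[e]), default=-1)
        return height[e]

    levels = {}
    for e in elements:
        levels.setdefault(h(e), []).append(e)
    return [levels[k] for k in sorted(levels)]


# Hypothetical cover relations (assumption, for illustration only).
elements = ["l", "u", "s", "U", "S"]
covers = [("l", "u"), ("l", "s"), ("u", "U"), ("s", "S")]
levels = antichain_levels(elements, covers)
```

Because only the height of each element is retained, different Hasse diagrams can yield the same levels, which is consistent with the loss of information noted above.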
POS-PCM obtained from the Hasse diagram in Figure 11(a) and equivalently from the Hasse diagram in Figure 11(b).
We propose to directly use the latent variables to build a PCGLM bypassing the Hasse diagram construction. We assume that the elongation
PCGLM for the latent process, assuming that the elongation precedes the transformation into spine.
PCGLM for the latent process, assuming that the elongation and the transformation into spine are non-ordered.
Parameter estimates for the PCGLM specified in Figure 13.
PCGLMs constitute a flexible and interpretable framework for analyzing categorical responses. Explanatory variables can be selected for each non-terminal vertex. An explanatory variable may thus have an effect on one partition of categories, not on another. It should be kept in mind that the non-effect of a variable is as interesting as the effect. As for other regression models, various variable selection procedures can be applied to PCGLMs. Because of the small number of explanatory variables (
An important issue with PCGLMs is the selection of the partition tree. The proposed indistinguishability procedure for jointly selecting the partition tree and the explanatory variables could be used in the supervised classification context with ordered classes. This procedure selects the best split between categories, starting from the entire set
Appendices
Examples of (r, F, Z) specification
(r, F, Z) specification of four generalized linear models for categorical responses
The cardinal of vertex
Indistinguishability procedure
Indistinguishability procedure with (r, F, Z) specification
Here, we express the indistinguishability procedure in terms of canonical models by simply changing the design matrix. In fact, the hypothesis
Indistinguishability procedure with PCGLM specification
Here we express the indistinguishability procedure in terms of PCGLM by simply changing the partition tree. In fact, any canonical (reference, logistic,
PCGLM specification of indistinguishability hypothesis
Using this proposition, the canonical (reference, logistic,
Acknowledgments
The authors thank Yves Caraglio for the pear tree dataset and for the representation of pear tree axillary production.
