Abstract
In multilevel data, cross-classified data structures are common. For example, this occurs when individuals move to different regions in longitudinal data or students go to different secondary schools than their primary school peers. In both cases, the data structure is no longer fully nested. Estimating cross-classified multilevel models is computationally intensive, so researchers have used several shortcuts to decrease run time. We consider how these shortcuts affect parameter estimates. In particular, we compare parameter estimates from fully nested and cross-classified models using a series of Monte Carlo simulations. When the outcome is continuous, we identify systematic differences in estimated standard errors and some differences in the estimated variance components. When the outcome is binary, we also find differences in the estimated coefficients. Accordingly, we caution researchers to avoid fully nested model specifications when cross-classification exists but suggest some limited conditions under which parameter estimates are unlikely to be different.
1. Introduction
As nested data structures in the social sciences have become increasingly complex, researchers have sought statistical models that appropriately account for the inherent dependence that comes with these structures. Multilevel models (also known as mixed-effects models or hierarchical linear or generalized linear models) represented a major innovation in modeling nested structures by accounting for the correlation within nested units and the variability across higher-level units such that they are now an indispensable component of the statistical repertoire of the social sciences. Yet, social scientists continue to encounter even more complex data structures where lower-level units are not strictly hierarchically organized. Such models are known as cross-classified hierarchical models, which are much more computationally intensive. In this article, we consider whether there are scenarios in which cross-classified models can be fit using standard modeling procedures for fully nested hierarchical models. In other words, if we momentarily ignore the cross-classification structure, under what circumstances are the results biased? We use a series of Monte Carlo simulations to answer this question.
Limited research examines the difference between fully nested and cross-classified model specifications. Existing studies have found that when cross-classification is ignored in linear models, fixed-effects estimates are unaffected but standard errors and the variance components are biased (Luo and Kwok 2009, 2012; Meyers and Beretvas 2006). In the context of linear models, Meyers and Beretvas (2006) show, with both empirical and synthetic data, that variance components and standard errors are biased when estimating fully nested models compared to the more appropriate cross-classified model. Similarly, Luo and Kwok (2012) show that the rate of mobility, or the share of cross-classified cases, results in increasingly biased estimates of standard errors when using a fully nested specification. More generally, Luo and Kwok (2009), using empirical and synthetic data, again with a continuous outcome, show that ignoring cross-classification at a given level results in underestimated standard errors at that level and overestimated standard errors one level lower in the hierarchical structure. These past studies have examined bias under a variety of circumstances, and we build on this research in several key ways. First, we consider many more data-generating conditions. Second, we conduct a thorough examination of which factors explain any differences between the two modeling approaches. Third, we expand the examination to the common social science issue of longitudinal cross-classification as well as the case of a binary outcome. We believe this examination will bring these distinctions to a broader social science audience as this discussion has been mostly confined to educational researchers.
We use two examples to guide our discussion. The first is a classic example of children within schools, which has been the predominant focus in this literature. In this example, we know students’ primary and secondary schools. 1 In a strictly hierarchical model, students would be located in a primary school, and several primary schools would feed into a single secondary school. Panel A of Figure 1 shows this three-level hierarchical model structure and provides an example of the data structure. Here, two students each attend one of four primary schools. Then, two primary schools each feed into one of two secondary schools. In reality, however, students from the same primary school may not attend the same secondary school. In such a case, the fully nested structure breaks down and becomes the cross-classified structure shown in Panel B, where a student is cross-classified by primary and secondary school. Turning to the data structure, the two students in both primary school 1 and primary school 3 each attend secondary school 1 and secondary school 2, respectively. For these data points, the fully nested structure remains intact. The two students who attended primary school 2, however, each attend different secondary schools, necessitating a consideration of cross-classification for this data.

Hierarchical example of students within primary and secondary schools. (Panel A) Fully nested structure. (Panel B) Cross-classified structure.
In our second example, we consider a longitudinal data structure. Here, individuals are sampled and followed through time, and each observation is nested within the individual. The data are geocoded because we are also interested in the effect of some geographic unit, here the city. In Figure 2, Panel A, observations are fully nested within individuals, who are fully nested within cities in a three-level hierarchical structure. In the data structure, two time points are observed for each individual, and two people each live in cities 1 and 2. However, people often move between repeated observations, such that the data become cross-classified by person and city, as shown in Panel B. Here, person 1 and person 3 never move: Their two repeated observations are always located in the same city. Like the previous example, the fully nested structure remains intact for these two individuals. By contrast, person 2 moves from city 1 to city 2 between repeated observations, such that we must consider the cross-classification.

Hierarchical example of longitudinal observations within individuals and cities. (Panel A) Fully nested structure. (Panel B) Cross-classified structure.
We use these two examples for two reasons. First, examples of students dominate the statistical literature on cross-classification models. Yet, as analysis of geocoded, longitudinal data grows, the use of cross-classification models for structures such as those described previously will become more important (see e.g., Hall and Crowder 2014). Second, longitudinal data deserve their own consideration due to the statistical property of exchangeability. A randomly selected student, for example, can be exchanged for another randomly selected student. In longitudinal data, sampling is not done at the lowest level. Unlike students, that lowest level (the repeated observation) cannot simply be swapped out for another randomly sampled observation. These differences imply very different sample sizes at each level. In the example of students, the sample size becomes smaller as you move up the levels, and all three levels can vary in sample size. In the longitudinal example, the sample size at the lowest level is fixed by the number of repeated measures, and the second level for the individual will be the largest. We recognize that the underlying statistical structure is the same, but the distinction between these data structures is consequential as it puts different restrictions on data generation such that we consider both in the simulations that follow.
As these two examples show, a certain portion of the data remains fully nested. Even in cases of movement, students in primary schools will typically only feed into certain secondary schools, and individuals only move across cities a limited number of times over repeated observations. When movement across either secondary schools or cities is low, the cross-classification will represent a sparse matrix. Goldstein (1999) notes there is considerable computational advantage to simplifying the variance structure in the case of cross-classified models. In eliminating infrequent cross-classifications with a fully nested specification, the loss in precision of the estimation of the variance components is slight relative to the great computational advantage (Goldstein 1999:117). Here, we build on prior research (Luo and Kwok 2009, 2012; Meyers and Beretvas 2006) and consider the implications for bias that result in ignoring cross-classification in favor of the computationally less intensive fully nested model while also providing a thorough examination of the sources of bias in parameter estimates.
2. Background for the Models
2.1. Defining the Models
We will briefly review the fully nested and cross-classified hierarchical statistical models to highlight the differences. For further details, we refer the reader to Goldstein ([1995] 2011), Hong and Raudenbush (2008), Raudenbush and Bryk (2001), and Raudenbush and colleagues (2011). In a fully nested model, we typically refer to the levels as level 1, 2, and 3. In the cross-classified structure, the two higher levels technically both reside at a single second level, requiring different terminology. The lowest level effects (students or repeated measures in our two examples) are still referred to as level-1 effects. The two higher level effects are then referred to as row effects and column effects. Here, we use notation found in Raudenbush and Bryk (2001) and Raudenbush and colleagues (2011). 2
For both the fully nested and cross-classified hierarchical models, the level-1 model is the same. For our student example, we use subscripts
where
As is typical in hierarchical models, each of the
where
where
For the cross-classified model, the difference is that the equation at level 2 accounts for both the row and column cross-classified effects:
where
2.2. Computation Considerations for Each Model
The statistical theory behind cross-classification models has been established (Goldstein [1995] 2011; Hong and Raudenbush 2008; Raudenbush and Bryk 2001), but the computational techniques have lagged behind. The difference in computational intensiveness between the fully nested model and the cross-classified model is nontrivial. In addition to higher sample sizes at each level and increasing the number of fixed effects, as the complexity of the variance components specification increases (e.g., random slopes, covariance between random effects, or here a cross-classified random-effects structure), computational times tend to increase dramatically. In the simplest form of the three-level fully nested model, the dependence structure is simplified by the fact that all level-2 units are contained within a single level-3 unit. In the cross-classified model, however, the dependence structure across the random effects is complicated by the fact that level-2 units can be contained within multiple level-3 units. However, not all possible combinations will be present. For example, primary schools only feed into a limited number of secondary schools (with an upper bound of the most populous primary school), and individuals only move across a limited number of cities (with an upper bound of the number of repeated observations). Thus, the amount of movement is consequential to the difference between the two modeling structures.
With consistent exponential growth in computing capabilities, one might conclude that the time to estimate the cross-classified model will eventually decrease to the point that any examination of the differences in the two models is premature. However, fitting variance components over multiple cores has been difficult to implement. The Stata documentation on their multiple core products demonstrates this difficulty (StataCorp 2017). The procedure “mixed” (StataCorp 2017:27) that fits the nested and cross-classified models for linear outcomes currently experiences no speed advantage from use of any number of multiple cores relative to a single core. Likewise, in R, the package lme4 for mixed-effects models does not allow the parallel option on a single model to take advantage of multiple cores, except for estimating bootstrapped standard errors (Bates et al. 2015). Currently, any gains in computing time for such models would be due solely to improvements in the processing speed of a single core, which grossly constrains computational abilities. This fact makes Goldstein’s (1999) assertion concerning computational advantages of simplifying the variance components structure all the more relevant. At present, one reasonable way to shorten computation time for the cross-classified model is to take a shortcut (e.g., assume a fully nested or simpler structure).
Past research has used several strategies regarding issues of movement across higher levels. First, some work has successfully utilized hierarchical cross-classification models. Most of these studies fall under our first example, with cross-classifications of schools (e.g., Goldstein, Burgess, and McConnell 2007), neighborhoods (e.g., Sharp, Denney, and Kimbro 2015), or both (e.g., Lee 2012). Examples of longitudinal cross-classification are increasingly common (e.g., Hall and Crowder 2014). With this cross-classification in particular, researchers are using a host of alternative modeling strategies, likely due to the severe computational limitations on the ability to run these models, a point we return to in the Discussion section. Given these difficulties, a common approach for longitudinal data is to pick one wave or location across observations, such as the first or most recent, effectively reducing the data to a fully nested structure (e.g., Kimbro and Denney 2013; Owens 2010; Warner et al. 2011). Of course, this would eliminate the possibility of examining the effect of changes in geographic-level variables. If we are interested in these changes, or in the case of nonlongitudinal cross-classification like our student example, we cannot simplify the third level in this manner. Another approach might be to ignore the top level of the hierarchy and fit these effects at level 2 (e.g., Clark, Ailshire, and Latnz 2009; Vogel, Porter, and McCuddy 2017), especially if there are few clustered observations at the highest level. 4
Another possible shortcut is to violate the assumption of independence within the cross-classification structure and fit the cross-classified model as a fully nested three-level model. In Figures 1 and 2, this choice would mean fitting the data structure in Panel B with the model structure in Panel A. Substantively, this disconnects observations at the lowest level from related observations for each combination of level-2 units within level-3 units. For the case of students nested within primary schools within secondary schools, this implies there is a different primary school intercept for each secondary school that students subsequently attend. Thus, the independence assumption is violated because the model will not recognize that some students attended the same primary school if they later attend different secondary schools. For the example of repeated observation of individuals within cities, individuals would have a new individual-level intercept for each city in which they resided, again violating independence because the model would view the time they spent in different cities as unique individuals. To be clear, this approach would still account for level-3 (or column level) dependency and dependency based on the combinations in the cross-classification, but it would not fully account for the dependency at level 2 (or the row level). The question then becomes how sensitive model estimates are to this violation if one were to take such a shortcut via the fully nested model.
Consider the following formula defining the linear solution to fixed effects in the context of multilevel models (Littell et al. 2006:751):
In Equation 5, which is in matrix notation,
In Table 1, we explore these possible differences with an empirical example using the National Longitudinal Survey of Youth 1997 (NLSY97). We examine the longitudinal data structure shown in Figure 2 for both a continuous (a 15-point depression scale) and binary (marijuana use) outcome. The NLSY97 is a large, nationally representative, geocoded sample (N = 8,984) designed to track the transition of youth into adulthood. Adolescents age 12 to 16 were randomly sampled in 1997 and surveyed annually with an excellent retention rate. Using the restricted-access, geocoded NLSY97, we analyzed a subset of respondents whose city of residence could be identified by combining core-based statistical area (CBSA) and county information with a variable assessing whether the respondent lived in a principal city within the metro area. We restrict analyses to begin at age 23 as CBSA data are available only beginning in 2004, and we extend our analysis through age 26 for four time points. We also randomly dropped a subset of respondents that did not move across cities to achieve a movement rate of approximately 40 percent (about 20 percent moved prior to dropping). Finally, the depression scale was only collected every other survey year, decreasing the number of observations for that outcome. We obtained city demographic variables from the U.S. census for predictors, namely, the percentage of minors, 5 owner-occupied housing, non-Hispanic whites, and female-headed households. For the individual, we use gender, age, and marital status as predictors. We estimated the continuous model using maximum likelihood estimation and the binary outcome with a logit link using maximum likelihood with a LaPlacian approximation to the log-likelihood.
Empirical Comparison of Fully Nested and Cross-Classified Models with the National Longitudinal Survey of Youth 1997 (NLSY97)
p < .05. **p < .01. ***p < .001.
For the continuous example of depression in Table 1, the estimates between the fully nested and cross-classified structures are somewhat similar for the predictors in terms of the fixed effects and their standard errors. We observe some differences (e.g., percentage minor, marital status), but they do not affect inference in this example except for the precise p level for marital status. For percentage non-Hispanic whites, however, the shortcut fully nested Model 1 would lead to different inferential conclusions as it is significant in this specification but not in the cross-classified Model 2. Thus, one would conclude that the higher the percentage of nonwhites, the higher individual-level depression is in that area. Note, too, that the intercept and variance components are different between the models.
For the binary outcome of marijuana use, we observe many differences in the fixed effects and their standard errors that would affect inferential conclusions. The nonsignificant city predictors and the significant effect of marital status is the same between the models, but the shortcut fully nested Model 3 would lead one to conclude that age is negatively related to marijuana use and that gender is not significantly related to marijuana use. Using the cross-classified model, however, we see no significant age effect and a significant positive effect of gender whereby males are more likely to use marijuana. The intercept and the variance components are also very different between specifications. These empirical examples demonstrate differences between the two modeling approaches, but we emphasize that general expectations regarding differences should not be drawn from this example. Any empirical example represents only a single sample drawn from a population. Instead, we are interested in systematic differences between fully nested and cross-classified models, which simulations provide.
We can judge the estimates from the fully nested model by considering two desired statistical properties of an estimator, bias and consistency. First, bias is the difference between an estimator’s expected value and the observed value. Thus, computing the difference between the cross-classified, which represents the expected value, and the observed fully nested model provides a measure of bias. In multilevel linear models, bias is confined to the standard errors and variance components by improperly accounting for dependencies in the data (Hox 2010), which could result in incorrect inference. In generalized multilevel models, however, bias can occur in both the fixed and random effects due to the estimation procedures necessary for such specifications (Goldstein and Rasbash 1996; Jang and Lim 2009; Rodriguez and Goldman 1995, 2001). Thus, we consider bias in both estimates.
Second, statistical consistency is defined as whether an estimate converges to the true value as the sample size increases. Both common estimation procedures for multilevel linear models, maximum likelihood and generalized estimating equations, produce consistent estimates even when the random-effects specification is incorrect (Goldstein [1995] 2011; Hox 2010; Raudenbush and Bryk 2001). Given the different estimation procedures, whether this result holds for generalized linear models is less clear. Thus, although we expect consistent estimators, we can still formally assess consistency through manipulation of the sample size in our simulations.
3. Study 1
To discern the differences between fully nested and cross-classified multilevel models, we used a Monte Carlo simulation experiment. 6 Study 1 mimics the structure of individuals who are cross-classified by primary and secondary schools, with movement occurring such that not all individuals from the same primary school end up in the same secondary school. We simulated data with a known nested structure while systematically varying the sample size at level 2 (Meyers and Beretvas 2006), the sample size at level 3 (Meyers and Beretvas 2006), the size of the variance component on the random intercept associated with level 2 (Luo and Kwok 2009), and the percentage of cases that moved. 7 Each simulated data set had a continuous outcome and six predictors. The six predictors correspond to a dummy variable and a continuous predictor at levels 1, 2, and 3. We include both dummy and continuous predictors given that their source distributions (uniform and normal, respectively) could lead to differing results. We constrained all main effects to be (asymptotically) .3; the actual quantity is arbitrary, and we focus only on the difference in estimates. For the sake of simplicity, the nested structure of the data was balanced, but the total sample size depended on the experimental condition (see the following). The simulation entailed generating the data; estimating a cross-classified and a fully nested multilevel model 8 ; saving the coefficients, standard errors, and variance components; and then repeating this process 1,000 times per condition of the Monte Carlo experiment. 9
3.1. Manipulations
The level-2 sample size was 200, 600, or 2,000, and the level-3 sample size was 20, 35, or 50. The total sample size for a simulated data set was the product of the level-2 and level-3 sample sizes. For example, when the level-2 sample size was 600 and the level-3 sample size was 50, the total number of cases at level 1 was 30,000. The random intercept variance associated with level-2 was 0, .2, or .4. This was accomplished by generating a distribution of values for each level-2 unit with the desired variance and adding it to the outcome. To disassociate the level-2 variance from level-3 variance, the distribution of betas was randomly sampled. This ensures that asymptotically the variance at level-2 is set by design and the level-3 variance is asymptotically zero (because there were 200, 600, or 2,000 values added to the outcome nested within 20, 35, or 50 level-2 units).
Movement or shuffling across secondary schools was modeled with 5, 10, 15, 20, 40, and 75 percent of the cases not being fully nested (i.e., the percentage of secondary school students that do not go to the same secondary school as their primary school peers). Moving entailed randomly selecting cases and then randomly reallocating them to a new level of the secondary school nesting ID variable. For example, if the level-2 sample size is 200 and the level-3 sample size is 20, there is a total sample of 4,000 cases. With 5 percent movement, 200 cases were selected to move. If one of the cases selected to move was coded 1 in the secondary school nesting ID variable, it would have randomly been reassigned to another level of that variable. The initial sample had balanced level-2 sample sizes within level-3 units, but movement was implemented, as is realistic, such that balance was not maintained in the movement process. Thus, we accomplished contextual effects by simulating data within level-2 units, and those who moved then took the slope that was used to generate their responses with them to new level-3 units.
The Monte Carlo experiment varied the level-2 sample size (three categories), the level-3 sample size (three categories), the level-2 variance component (three categories), and the percentage of movers (six categories). Thus, the entire experiment included 162 conditions (3 × 3 × 3 × 6 = 162). With 1,000 replications per condition, the entire sample space is the estimates from 162,000 cross-classified and fully nested multilevel models. We next present summary statistics of the differences in parameter estimates, and then we model those differences as a function of the manipulated factors in the Monte Carlo experiment.
3.2. Results
We report descriptive statistics for the difference between the parameter estimates from each model in Table 2. These values correspond to the parameters from the cross-classified model minus the parameters from the fully nested model; positive values indicate that the fully nested model is underestimating the coefficient, and negative values indicate that the fully nested model is overestimating the coefficient. In terms of the estimated coefficients or slopes, on average, there is no difference in the parameter estimates. We see some small discrepancies in the estimates (as indicated in the minimum and maximum observed values), but there is nothing systematic about the differences in parameter estimates. In addition to descriptive statistics, Table 2 shows the adjusted R2 from an ANOVA model predicting the differences as a function of all the manipulated factors (e.g., level-2 and level-3 sample size, percentage of movers, and size of the variance component) in the experiment and all possible interactions. The manipulated factors explain no variance in the differences in coefficients. This is consistent with results reported elsewhere (Luo and Kwok 2009; Myers and Beretvas 2006).
Summaries of Differences in Parameter Estimates from Cross-Classified and Fully Nested Models from Study 1
Note: Each estimate is based on the differences in parameters from 162,000 models.
From an ANOVA model predicting the differences with each manipulated factor and all their interactions.
Turning to differences in the estimated standard errors, on average, there are no differences in the standard errors for the level-1 predictors. What small differences exist are explained well by the manipulated factors (R2 = .991). For the differences in the level-2 and level-3 standard errors, the fully nested model underestimates the standard errors. Specifically, on average, the standard error for the level-2 continuous variable is underestimated by 33.3 percent, for the level-2 dummy variable it is underestimated by 32.5 percent, for the level-3 continuous variable it is underestimated by 29.6 percent, and for the level-3 dummy variable it is underestimated by 30 percent (percentages come from dividing the differences by the average standard error from the cross-classified models). This means researchers using a fully nested specification are more likely to commit a Type I statistical error in wrongly rejecting a correct null hypothesis. Furthermore, the manipulated factors account for much of the variation in the differences in coefficients (R2
To illustrate the patterns in differences in parameter estimates, we model the difference between the parameter estimates using ANOVA. For model building purposes, we used an effect size of .06
The differences in standard errors are all systematic with respect to the manipulated factors in the experiment. For the standard errors associated with the level-1 predictors, the four-way interaction between level-1 sample size, level-2 sample size, the percentage of movers, and the size of the variance component has a large effect size. To illustrate this interaction, Figure 3 10 presents plots of the marginal means from the model for the difference in the standard error of the level-1 continuous predictor. When the variance component is zero (Figure 3, Panel A), there are no differences in the standard error. As the variance component increases, so do the differences in the standard errors. When the variance component is non-zero (Figure 3, Panels B and C), the percentage of movers increases differences in the standard errors, and as the sample sizes at levels 1 and 2 increase, the differences decrease. Here, as the variance component increases and as the sample sizes decrease, the fully nested specification overestimates the standard error. The pattern for the standard error for the level-1 dummy variable is the same, except the differences are about twice as large as they are for the continuous variable.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 1 continuous variable for Study 1. Variance components are equal to (A) 0, (B) .2, and (C) .4. Confidence intervals are too small to be discerned in the graph.
Similarly, the differences in standard errors for the level-2 predictors require the four-way interaction effect. Figure 4 shows the marginal means plots for the differences in the standard error for the continuous level-2 predictor 4. Again, when the variance component associated with level 2 is zero, there are no differences in the parameter estimates (Figure 4, Panel A). When the variance component is non-zero (Figure 4, Panels B and C), as the percentage of movers and the level-2 sample size increases, so do the differences in estimated standard errors. As the level-1 sample size increases, the differences in estimated standard errors decrease. The pattern for the standard error for the level-2 dummy variable is the same, except the differences are about half as large as they are for the continuous variable. As depicted in Figure 5, the differences in the estimated standard errors for the continuous level-3 predictor follow basically the same pattern as those for the level-2 predictors, except the effect is about twice as large. The differences in the estimated standard errors for the level-3 dummy variance are similar to the level-3 continuous variable, but the effect of the manipulated factors are half as strong. We tried to model the differences in estimates of the variance components, but we found no systematic patterns.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 2 continuous variable for Study 1. Variance components are equal to (A) 0, (B) .2, and (C) .4. Confidence intervals are too small to be discerned in the graph.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 3 continuous variable for Study 1. Variance components are equal to (A) 0, (B) .2, and (C) .4. Confidence intervals are too small to be discerned in the graph.
Summarizing the main results from Study 1, we find that when the variance component is small, there are no differences in parameter estimates between fully nested and cross-classified model specifications. When nesting accounts for variance in the outcome (i.e., when the variance component is nontrivial), the percentage of movers and the sample size are important determinants of differences in the standard errors. Study 1 sheds light on the factors affecting parameter estimates under conditions typically associated with people nested in two different units (e.g., primary and secondary schools), but the data structure is atypical of other types of nested data structures, such as repeated observations nested in individuals. Study 2 addresses the conditions more likely to be encountered in longitudinal analyses.
4. Study 2
Study 2 mimics the structure of repeated observations nested in individuals and individuals nested in cities. Again, we constrained all main effects to be (asymptotically) .3, and we focus on the difference in estimates. We manipulated the level-1 sample size (number of time points), the level-2 sample size (number of individuals), the percentage of cases that moved, and the size of the variance component. 11 As above, each simulated dataset had a continuous outcome and six predictors. The six predictors correspond to a dummy variable and a continuous predictor at levels 1, 2, and 3. The first nesting ID variable corresponds to time points nested in individuals, and the second nesting ID variable corresponds to individuals in cities. We included 50 cities, with individuals evenly distributed between them before moving took place. To account for exchangeability, the same individuals (level 2) are used across occasions. This approach differs from Study 1, in which resampling took place across all three levels. The Study 2 simulation entailed generating the data; estimating a cross-classified and a fully nested multilevel model; saving the coefficients, standard errors, and variance components; and then repeating this process 1,000 times per condition of the Monte Carlo experiment.
4.1. Manipulations
We varied the number of time points to be 2, 5, or 10. We varied the number of individuals to be 1,000 or 5,000. The smallest simulated data files thus had 2,000 observations (i.e., 2 × 1,000), and the largest had 50,000 (i.e., 10 × 5,000), with all combinations observed in the factorial experiment. For movement, we used the same percentages as Study 1: 5, 10, 15, 20, 40, and 75 percent. However, we implemented moving differently in Study 2. We randomly selected cases, and then we randomly selected a time point after 1 for the case to move. We then randomly reassigned them to another city. Each case could only move once, and the move could happen at any point in time except the first time point. Finally, we manipulated the variance component on the random intercept to be either 1 or 1.5, more closely corresponding to repeated observations through time where there is more stability.
The Monte Carlo experiment thus varied the number of time points (three categories), the number of individuals (two categories), the percentage of movers (six categories), and the variance component (two categories). The entire experiment included 72 conditions (3 × 2 × 6 × 2 = 72). With 1,000 replications per condition, the entire sample space is the estimates from 72,000 cross-classified and fully nested multilevel models.
4.2. Results
Table 3 presents the descriptive statistics. As in Study 1, the cross-classified and fully nested models estimate basically the same coefficients. What little differences exist between the estimated coefficients is noise (the adjusted R2s when predicting the differences in the coefficients with fully interacted models are .000). Likewise, the differences in the variance components are quite small and unsystematic (the adjusted R2s from fully interacted models are .025, .144, and .000, respectively). We do, however, find some systematic differences in the standard errors.
Summaries of Differences in Parameter Estimates from Cross-Classified and Fully Nested Models from Study 2
Note: Each estimate is based on the differences in parameters from 72,000 models.
From an ANOVA model predicting the differences with each manipulated factor and all their interactions.
In terms of modeling the differences in the standard errors, we again used an effect size of .06 as a threshold for retaining a factor in model building. For the difference in the estimates of standard errors for the level-1 predictors, we find that the fully nested model overestimates the standard errors by 4.62 percent (the difference divided by the average for the cross-classified model). This difference is systematic with respect to the manipulated factors, as indicated by the R2 values in Table 3. In terms of modeling the differences in level-1 standard errors, the results indicated that we should retain a four-way interaction between the sample size at levels 1 and 2, the variance component, and the percentage of movers. Figure 6 presents plots of the marginal means from the model for the difference in the standard error of the level-1 continuous predictor. When the percentage of movers is small, so is the difference in estimates of the standard errors. As either the percentage of movers increases or sample size decreases, the fully nested specification overestimates the standard error. The pattern of results for the level-1 dummy-variable standard error is the same, except the effects are twice as large.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 1 continuous variable for Study 2. Variance components are equal to (A) 1.0 and (B) 1.5. Confidence intervals are too small to be discerned in the graph.
Turning to the difference in the standard errors associated with the level-2 predictors, the fully nested model specification underestimates the standard errors for the continuous and dummy variables by 8.6 and 8.3 percent, respectively. In terms of modeling the differences in the level-2 standard error for the continuous variable, the effect-size approach suggested the best model included a two-way interaction between the percentage of movers and the number of time points and a three-way interaction between the percentage of movers, the number of people, and the size of the variance component. Figure 7 presents plots of the marginal means from the model for the difference in the standard error of the level-2 continuous predictor. The percentage of movers again has a strong effect, but it is moderated such that when there are 5,000 people and the variance component is 1, the fully nested model largely underestimates the standard error. For the level-2 dummy variable, we observe the same pattern, but the effects are twice as large.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 2 continuous variable for Study 2. Variance components are equal to (A) 1.0 and (B) 1.5. Confidence intervals are too small to be discerned in the graph.
Finally, the fully nested model specification underestimates the standard errors for the level-3 continuous and dummy variables by 8.4 and 8.2 percent, respectively. Here, the effect-size approach suggested the level-1 sample size—the number of time points—was not relevant to differences in the estimated standard errors, but there is a three-way interaction between the number of people, the percentage of movers, and the size of the variance component. Figure 8 presents marginal means from this model. For the smaller sample size, the variance component does not have a large effect, but the percentage of movers increases the extent to which the fully nested specification underestimates the standard errors. For the larger sample size, the smaller variance component results in a larger effect of the percentage of movers. Again, we find that the differences in the standard error associated with the level-3 dummy variable have the same pattern, but the effects are twice as large.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 3 continuous variable for Study 2. Confidence intervals are too small to be discerned in the graph.
In summary, Study 2 finds that the estimated coefficients and the variance components are not affected by whether the model is estimated using a fully nested or cross-classified specification. However, the standard errors vary systematically with the sample size, the percentage of movers, and the magnitude of the variance components. As such, we caution against estimating fully nested models when the data are technically cross-classified. In Studies 1 and 2, we find that fully nested model specifications result in a significant underestimation of the standard errors associated with level-2 and level-3 predictors, which increases the likelihood of a Type 1 statistical error. In particular, as the percentage of movers increases and the sample sizes decrease, the underestimation gets worse.
Studies 1 and 2 shed light on key differences in parameter estimates for fully nested and cross-classified model specifications, but there are many unexplored possibilities. For example, Studies 1 and 2 only examine models with a continuous outcome. For specification differences in parameter estimates in the context of limited dependent variables, we turn to Study 3.
5. Study 3
Study 3 implements a binary outcome and a data structure similar to Study 1 (e.g., people nested in two units with movement going from unit 1 to unit 2). We again constrained the effects to be (linearly) .3 for each of the four predictor variables. The outcome was a binomial with the probability of a 1 equal to 50 percent. Based on results from Studies 1 and 2, we manipulated the level-2 sample size (200 or 2,000), the level-3 sample size (20 or 50), the percentage of movers (2.5, 5, 10, 15, 20, or 40 percent), and the variance component (0 or .4). This experiment therefore had 48 conditions (2× 2× 2 × 6) and a sample space of 48,000 pairs of parameter estimates. 12
Table 4 shows the descriptive statistics. Unlike in Studies 1 and 2, here we see differences in the estimated coefficients. Specifically, the fully nested model underestimates the slope for both of the level-1 variables by .6 percent. Likewise, the fully nested model underestimates the slopes for the level-2 and level-3 predictors by about 1 percent. We find some patterns to the differences at level 1 but not at levels 2 and 3. In particular, there is a two-way interaction between the percentage of movers and the size of the variance component in predicting the differences of the level-1 standard errors. Figure 9 plots the marginal means from this model. When the variance component is non-zero, as the percentage of movers increases, the fully nested model increasingly underestimates the coefficient. The pattern is the same for the level-1 dummy variable except the effects are about half.
Summaries of Differences in Parameter Estimates from Cross-Classified and Fully Nested Models from Study 3
Note: Each estimate is based on the differences in parameters from 48,000 models.
From an ANOVA model predicting the differences with each manipulated factor and all their interactions.

Marginal means from an ordinary least squares regression predicting the differences in the level 1 continuous variable for Study 3. Confidence intervals are too small to be discerned in the graph.
Turning to the differences in standard errors, the fully nested model underestimates the level-1 standard errors by about 2 percent, but these differences are not systematic—none of the manipulated factors predicted them with a medium effect size. There are sizable and systematic differences in the level-2 and level-3 standard errors. The fully nested model underestimates the level-2 standard errors by 26 percent and the level-3 standard errors by 28 percent. In terms of the level-2 standard errors, we found a two-way interaction between the percentage of movers and the size of the variance component. Figure 10 illustrates this interaction for the continuous predictor. For the larger variance component, movement results in the fully nested model underestimating the standard error. We observe the same pattern and effect size for the level-2 dummy variable. In terms of the level-3 standard errors, we found a three-way interaction between the percentage of movers, the size of the variance component, and the level-1 sample size. Figure 11 illustrates this interaction for the continuous predictor. When the variance component is larger, the fully nested model increasingly underestimates the standard error. This effect is worse for the smaller level-1 sample size. As with level 2, we observe the same pattern and effect size for the level-3 dummy variable.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 2 continuous variable for Study 3. Confidence intervals are too small to be discerned in the graph.

Marginal means from an ordinary least squares regression predicting the differences in the standard error for the level 3 continuous variable for Study 3. Confidence intervals are too small to be discerned in the graph.
In Study 3, unlike in Studies 1 and 2, we find differences in the estimated variance components. The simulation experiment embedded a (linear) variance component of either 0 or .4 between levels 1 and 2. Level-2 units were then randomly sampled within level 3 to approximate no variance between levels 1 and 3. The fully nested model underestimates the variance component for level 2 by 14.6 percent, and it underestimates the variance component for level 3 by 20.8 percent. We found systematic variation in the level-2 variance component but not in the level-3 variance component. 13 For the level-2 variance component, we found a three-way interaction between the level-2 sample size, the percentage of movers, and the variance component in predicting the differences in the estimates of the variance component. Figure 12 illustrates this interaction. When the variance component is non-zero, the percentage of movers has a large impact on the extent to which the fully nested model underestimates the variance component.

Marginal means from an ordinary least squares regression predicting the differences in the variance component for the second level in Study 3. Confidence intervals are too small to be discerned in the graph.
Although limited in the observed conditions, Study 3 illustrates that researchers should use an abundance of caution in using a fully nested specification for cross-classified data with a binary outcome. Not only are the standard errors systematically underestimated, but the coefficients and the variance components are as well. Accordingly, we advocate for the use of a cross-classified specification, particularly in the context of limited dependent variables.
6. Summary of Results
We find that estimating a fully nested model specification when in the presence of a cross-classification results in underestimated standard errors. This is true for level-2 and level-3 predictors. The extent of underestimation is largely driven by the magnitude of the variance component, the percentage of movers, and the sample size. In the case of a nonlinear outcome, we found that estimated slopes and variance components were also underestimated. In this context, the size of the variance component and the percentage of movers seem to be the strongest sources of bias in parameter estimates. Underestimated standard errors imply an increased likelihood of Type 1 statistical error—wrongly rejecting a true null hypothesis—and as such, fully nested specifications should be avoided in the presence of cross-classification. Fully nested specifications have limited bias only under conditions of very small variances, large sample sizes, and limited cross-classification.
7. Discussion
This article adds to the growing literature that examines the differences between hierarchical modeling when a cross-classification is present. Past studies found that when cross-classification is ignored, fixed-effects estimates are unaffected, but standard errors and the variance components are biased (Luo and Kwok 2009, 2012; Meyers and Beretvas 2006). Our results confirm these findings, except the variance component is only systematically related to our manipulated factors for the generalized linear model. In addition to adding many more data-generating conditions as recommended by past research (Meyers and Beretvas 2006), we expand to a longitudinal case and the case of logistic regression. In contrast to past examinations of cross-sectional linear models, the former show that even the variance components may be unbiased for longitudinal data, and the latter reveal that even fixed effects may be biased for limited outcomes. We also conducted a more thorough examination of the sources of bias by considering factorial combinations of those sources. This exercise revealed some very limited cases where the two modeling approaches were close to one another.
Perhaps most revealing from our analysis is how the combination of our manipulated factors contributes to bias. Past studies have found that similar factors are related to bias, but they did not thoroughly consider factorial combinations. In almost all cases, we found incredible systematic sources of bias, typically in the form of a four-way interaction between the row or level-2 sample size, the column or level-3 sample size, the variance component, and the percentage of movers. The models with this interaction explained an overwhelming amount of the variation in the bias of the standard errors and variance components.
The effect on the bias was summarized through a series of figures that demonstrate the importance of the combination of the conditions. Aside from the coefficients and variance components from the models with continuous outcomes, we found evidence that the difference between parameter estimates from fully nested and cross-classified models varies systematically with the factors we manipulated in our simulations. Even the coefficients differed when the outcome was binomial (empirical example and Study 3). Trends in these differences between parameter estimates suggest that larger variance components, smaller sample sizes, and fewer movers each reduce differences in estimates, but the effect of each of these likely depends on the state of the others.
An additional advantage of examining these combinations was the ability to determine if there were any conditions where the less computationally intensive fully nested model might produce similar estimates to the cross-classified model. In general, we emphasize that this was not the case. However, when the percentage of movers was small, the sample sizes were large, and the variance component was small, the models performed similarly. First, the percentage of movers appears to be particularly consequential. When movement is low (here, 5 percent), the models are nearly identical. This result makes sense as the hierarchical structure is fully nested for the vast majority of observations. Our simulations used several values of movement near the lower end of the spectrum (i.e., 5, 10, 15, and 20 percent). Although not universally true in all our models, we generally found that increased movement at the lower end of the spectrum increases bias faster than at the upper end (e.g., the figures do not show a “linear” pattern). But we caution researchers from extrapolating that any point on the movement continuum is “safe.” Our data are synthetic, meaning assumptions are ensured (e.g., the error variance was exactly 1) and are not likely to match the exact structure of real-world applications.
Second, as the sample sizes increase, the two models produce similar results, confirming past studies regarding the consistency of estimates in multilevel models (Goldstein [1995] 2011; Hox 2010; Raudenbush and Bryk 2001). We note that increasing the sample size implies two different considerations based on our two examples. In the case of students, the researcher has control over level-1 sampling of students and typically level-2 sampling to the degree that primary schools are used in the sampling strategy. In the case of repeated observations, the researcher has control over level-2 sampling of individuals and typically level-3 sampling to the degree that geographic units (e.g., census tracts, metros, counties) are used in the sampling strategy. Here, the level-1 sample size is a function of the number of time points and level-2 individual sample size, and thus it increases with each new data collection. The sample size at level 3 also has the potential to increase if individuals move to cities not represented in the original data collection.
Finally, the variance component at the row or level-2 level also matters: When close to zero, there is very little difference between the fully nested and cross-classified model, even in the logistic case. Such low variance is possible, especially in models with many predictors that explain the variance, but it is not possible to know this value ahead of modeling, unlike the sample size and percentage of movers. Thus, we would not advocate using the variance component to make any decisions regarding a shortcut.
To be clear, where computationally manageable, cross-classification models should be used. Currently, multilevel models can only be fit on a single core such that computational times can only decrease with enhanced processing power of one core. The structure of the variance components is not the only factor related to computation time. Larger sample sizes across all levels (including longer time series), higher numbers of fixed effects, and generalized models all increase computational time, which only grows with a complex variance components specification. Each of these factors are becoming more common in the social sciences. Fortunately, higher sample sizes are associated with less biased models when fit with a fully nested variance structure. Together with a low percentage of movers, researchers can make informed decisions when cross-classification models cannot be estimated.
Despite the results gleaned from this study, we left many issues unexplored. Several factors are important to consider in subsequent work. In particular, other limited outcomes, serial correlation, and random slopes are all common issues in the social sciences, but to date, we know very little about the consequences for bias when estimating a fully nested specification in the presence of cross-classification in these contexts. Random slopes, as part of the variance specification, will substantially increase run time when estimating a cross-classified specification. We speculate that the variance components associated with random slopes will be biased under the same conditions observed here (large variance, small sample size, many movers), but we cannot know without further examination. Similarly, we would expect similar patterns of bias as those reported in Study 3 in cases of other limited dependent variables (e.g., count or ordinal), but at this point, we cannot be certain. With respect to serial correlation, we found no systematic differences in the error variance in Study 2 (mimicking time nested in people), but there were differences in the other variance components. This leads us to expect differences in parameter estimates associated with the specification of serial correlation (e.g., an AR[1] specification) because serial correlation is estimated simultaneously with the random effects. We hope subsequent work will address these important gaps in the literature.
Our work sheds light on the implications of estimating fully nested models under conditions of cross-classification, but there remain several factors worthy of consideration. For example, we did not consider serial correlation, random effects on predictors, or the covariance between random effects. Each of these entail estimable parameters within the context of mixed-effects models that may be biased by ignoring cross-classification. Our work contributes to a growing understanding of the consequence of specifying cross-classification as fully nested, but much remains to be understood about these processes.
