Abstract
Multilevel modeling techniques are becoming more popular in handling data with multilevel structure in educational and behavioral research. Recently, researchers have paid more attention to cross-classified data structure that naturally arises in educational settings. However, unlike traditional single-level research, methodological studies about multilevel effect size have been rare and those that have recently appeared had an emphasis on strictly hierarchical data structure. This article extends the work on multilevel standardized mean differences from strictly hierarchical structure to both fully and partially cross-classified structures. Analytically derived formulae for calculating effect sizes and the corresponding sampling variances (or standard errors) are presented, verified by simulation results, and illustrated with real data examples. Implications for primary research studies and meta-analyses are discussed.
Keywords
The field of educational and psychological science has witnessed a movement from the obsession of statistical significance testing to the evaluation of effect size (Ferguson, 2009). However, for studies with multilevel data, effect size is still underreported, even though some effect size statistics have been recently developed for multilevel data with a nested structure (e.g., standardized mean difference; Hedges, 2007, 2011) and a cross-classified structure (e.g., proportion of variance accounted for; Luo & Kwok, 2010). As cross-classified random effects models (CCREMs) have gained increasing attention in educational research (e.g., Friedel, Cortina, Turner, & Midgley, 2010; Johnson, 2011), there is a need to develop relevant effect size measures for CCREMs. The intent of the present article is to (a) analytically develop the standardized mean difference measure for two-level CCREM, (b) verify the performance of the mathematically derived formulae using simulated data, and (c) illustrate the computation of the effect size statistic with real data example.
The American Educational Research Association (AERA, 2006, p. 37), in its Standards for Reporting on Empirical Social Science Research in AERA Publications, recommended the use of effect size statistics with the rationale that “[i]nterpretation of statistical analyses is enhanced by reporting magnitude of relations.” The American Psychological Association (2010, p. 34), in its Publication Manual, took a stronger stance to deem the reporting of effect size statistics as “almost always necessary.” The attention given to effect size can be attributed to three important advantages of reporting such statistics. First, effect sizes, rather than statistical significance tests, directly answer research questions such as how strong two variables are associated or how effective an intervention is (see Thompson, 2007). Second, to date, effect size is the element to be synthesized in almost all meta-analytic studies (Lipsey & Wilson, 2001). Third, effect size estimation also plays a critical role in research planning, such as power analysis (Cohen, 1988). Two of the commonly reported effect size families include the standardized mean difference (i.e., group difference divided by sample standard deviation [SD] or the d family) and the proportion of variance accounted for (or the r family; Grissom & Kim, 2012). Whereas the effect size statistics in single-level studies are already well developed, effect size in multilevel modeling has appeared only recently and is generally limited to strictly nested data. Therefore, more discussion on this topic is necessary and valuable.
Effect Size in Multilevel Analyses
Although techniques for handling data with cluster structures have been developed for several decades (e.g., Goldstein, 1986; Mason, Wong, & Entwisle, 1983), in the past 10 years, they have gained much more attention in educational and behavioral research. This is not surprising, given that in these fields much data collected have intrinsically nested structure. For example, in the field of education, students are naturally nested within classrooms, and classrooms are naturally nested within schools. Because traditional data analytic techniques ignore the multilevel structure and give incorrect standard errors (SEs; Hox, 2010), new methods are proposed that provide correct SEs and hence accurate statistical inference. One of the most popular approach is multilevel modeling (Goldstein 2011), which is synonyously called hierarchical linear modeling (Raudenbush & Bryk, 2002), linear mixed modeling (Littell, Milliken, Stroup, & Wolfinger, 1996), and other similar names.
Despite the rapid growth in the number of multilevel studies, rarely did researchers utilize effect size statistics in reporting multilevel results. Most of these studies used proportion of variance accounted for, or R 2 (see Luo & Kwok, 2010; Snijders & Bosker, 2011). However, for studies with a binary covariate, such as treatment–control or male–female, the standardized mean difference is a more natural choice and is more easily understood by researchers.
In addition to the point estimates of an effect size, its sampling variance (or SE) is important. As commented by Cohen (1994), it is “far more informative to provide a confidence interval” (p. 1310), and the computation of (asymptotic) confidence interval (CI) requires the sampling variance of the effect size. This is particularly important for meta-analysts (Hedges, 1981; Lipsey & Wilson, 2001), because both point and variance estimates of effect size are required to get an overall average effect size and to understand the influence of study-level covariates including publication bias in the literature. Given the importance of the point estimate and the sampling variance of the standardized mean difference effect size, as well as the lack of discussion about them in complex multilevel models, research efforts to supplement methods for their calculation are warranted.
Recently, Hedges (2007, 2009, 2011) made a seminal effort in defining standardized mean difference statistics for data with two-level and three-level nested structures. Particularly, he suggested that, depending on the context, there could be different choices of SDs in computing the effect size. Hedges (2007) illustrated the calculation of effect size in two-level studies with an example about the effect of connected mathematics in classrooms. In that example, students were nested within classrooms, and the treatment (i.e., connected mathematics) was defined at the classroom level (i.e., Level 2). He showed that the overall effect size was 0.15 (95% CI [0.29, 0.59]) and the within-classroom effect size was 0.17 (95% CI [0.34, 0.69]). In a three-level cluster-randomized design, five possible effect size statistics can be computed depending on which variance component is invoked. The formulae given by Hedges (2007, 2009, 2011) do not require researchers to have the raw data to obtain an effect size estimate; instead, only the estimated treatment effect (i.e., grand mean difference between the treatment and the control arm), sample sizes for all levels of clustering, and the corresponding intraclass correlations (ICCs) are needed. In the context of meta-analysis, Ahn, Myers, and Jin (2012) have suggested methods to estimate ICCs when the original research report does not include the relevant information.
CCREMs
The number of published articles adopting the CCREM method, a more complicated structure than nested multilevel models, has increased dramatically in recent years. A simple search in the Educational Research Information Center database with the keyword “cross-classified” found only three articles during 2000–2005 but 32 articles during 2006–2012. One reason for the increasing adoption of CCREM is that multilevel data may not always have a strictly hierarchical structure. A typical example is given by Beretvas (2010), where students are nested within both primary schools (PSs) and high schools (HSs), but PS is not nested within HS nor vice versa. That is, not all students in one HS come from the same PS, nor do all students from one PS go to the same HS. In this case, PS and HS are labeled as crossed factors. If both PS and HS are assumed to be random effects, then CCREM can be used to analyze such kind of data. Luo and Kwok (2010) have discussed the R 2 effect size for CCREMs. However, to the best of our knowledge, no discussion has taken place about standardized mean difference for CCREMs. Standardized mean difference would be suitable, for instance, in describing the effects of a school-based intervention on students’ learning, where students are nested in both schools and neighborhoods. Based on the framework of previous studies (Hedges, 2007, 2011), in this article, we develop effect sizes for CCREMs through mathematical derivation and evaluate their performances using both simulated and real data sets.
The purpose of this article is to analytically develop the standardized mean difference measure for two-level CCREMs for both balanced and unbalanced designs and to verify the performance of the mathematically derived formulae. Because of the complexity of the formulae, we also provide real data examples for pedagogical purposes, so that applied researchers can better understand how those formulae can apply to their research. In the following sections, we would (a) briefly introduce the notations for a two-level CCREM with two crossed factors; (b) discuss two estimation approaches to obtain the standardized mean difference, D, and the corresponding sampling variance, V(D) (where
Model and Notation
In a balanced design with the cross-classification of two random effects A and B at Level 2, let J and K be the number of clusters in effect A and in effect B, respectively. In the context of education, A can be classrooms in a school and B can be neighborhoods. To make things more concrete, in the following sections, we would use an hypothetical example where effect A is the classroom effect and effect B is the neighborhood effect, although the notation is equally applicable to other contexts such as therapy grouping effect by classroom effect or in longitudinal settings with person effect by time effect. As a result, there are J classrooms and K neighborhoods, and J × K combinations of classroom and neighborhood or J × K cells. Further, let
Let
where
In a balanced design, the variance of Y can be partitioned into three independent components, which are denoted as
ICC
The ICC quantifies the degree to which two randomly drawn observations within a cluster are correlated. In CCREMs, there are different possible ICCs depending on how a cluster is defined. For instance, for observations in the same classroom (random effect A) but in different neighborhoods (random effect B), the ICC can be defined as follows:
where
Standardized Mean Differences for Fully Cross-Classified Data
In educational research, the standardized mean difference is defined as the ratio of (a) the difference between the population means of the treatment arm and of the control arm to (b) an SD. Hedges (2009) defined different effect sizes associated with different levels. For example, a researcher may be interested in how an intervention is effective in group level and can use only the between-level SD, while ignoring the within-group variations. Similarly, in CCREM, one can consider using σ
W
, σ
A
, σ
B
,
On the population level, the effect size is defined as
where
In a balanced design, the average of the cell means,
is in general a biased estimator of the total population variance
Estimation of D 1
With a balanced data structure assumed, and when the sets of clusters of random effect B in the treatment arm and in the control arm overlap completely (e.g., students receiving treatment come from the same set of neighborhoods as those in the control arm), the sample estimator of δ T , D 1, and the corresponding sampling variance V(D 1) are
and
where
where
Estimation of D 2
The derivation of D
1 is based on assumptions that (a) the cluster size is constant and (b) the ICCs are known or estimated with a reasonable accuracy. In real research, these assumptions may not hold. If either (a) or (b) or both (a) and (b) are violated, then D
1 and V(D
1) calculated from Equations 6 and 7 can be biased and inefficient. For unbalanced designs, the close forms of D
1 and V(D
1) are very complex and are functions of the cell sizes in addition to the components in Equations 6 and 7. Because information about the cell sizes is rarely available from published research reports, it is difficult to obtain efficient estimates of δ
T
and V(δ
T
) for unbalanced data starting from expected mean squares. However, if consistent estimates of the variance components (from maximum likelihood, REML, or Bayesian estimation, etc.) are available, researchers can use both the point estimates and the SEs of the random effects to calculate the effect size. Specifically, if estimates of the treatment effect and the variance components,
Derivations of Equations 9 and 10 can be found in Appendix A (see the online Appendix, available at http://jeb.sagepub.com/supplemental).
If meta-analysts can obtain neither information about the degree of imbalance of the data nor unbiased estimates of the variance components and their variances, the best they can do is to compute the effect size assuming a balanced design and replace n in Equations 6 and 7 with the average cell size, N/(JK), to obtain D 1. If information about ICC is not obtainable, they can put in a reasonable guess of ρ A and ρ B by referring to research with similar designs and variables. There are also several articles summarizing what typical ICCs are for various designs and areas (Hedges & Hedberg, 2007; Murray & Blitstein, 2003).
Monte Carlo Study for Evaluating the Two Effect Size Estimation Approaches Under Unbalanced Designs
We used a 3 × 2 × 2 × 2 full factorial simulation study to empirically check the performance of D
1 and D
2 with unbalanced designs. The design factors included (a) the population effect size (δ
T
= 0.2, 0.5, or 0.8), (b) the number of clusters in random effect A (e.g., classrooms) per treatment status (JT
= JC
=20 or 50), (c) the average cell size (n = 0.25 or 1), and (d) the ICC of random effect
The estimation of D
1 and V(D
1) was performed in R with ρ
A
and ρ
B
being fixed to the population value. On the other hand, D
2 and V(D
2) were obtained using
As shown in Table 1, both effect size estimators had relative bias less than 5%. The relative SE bias was stronger, but the impact was small as the percentage coverage of the 95% symmetric CI 1 was close to nominal value and ranged from 93.6% to 95.6%. This is within the expected interval [92.7%, 96.6%] when the true coverage is 95% with 500 replications, and thus, we conclude that the performance of both estimators was satisfactory. As expected, D 1 was less efficient under unbalanced structure, but the loss of efficiency was little. Specifically, the relative efficiency, RE = V(D 2)/V(D 1), of D 1 was lowest when both JT and n were small and when ρ A was large, but it was still acceptable as RE = 86%. In summary, both approaches to obtain δ T and V(δ T ) performed well in unbalanced designs.
Simulation Results for Unbalanced CCREMs (With Clusters in Random Effect B Completely Overlapped)
Note. Based on 500 replications for each condition. Pearson’s contingency coefficients for all conditions ranged from .84 to .92, indicating strong associations between the clusterings of random effects A and B. δ
T
= population effect size. JT
= JC
= number of clusters of random effect A in the treatment (control) arm; Number of clusters of random effect B = JT
. NT
= NC
= total sample size of the treatment (control) arm. ρ
A
= intraclass correlation of effect A; ρ
B
= 0.1 for all conditions. Coverage refers to the percentage of replications in which the 95% confidence interval includes δ
T
. For 500 replications, the Monte Carlo coverage percentage have a confidence interval of [92.7%, 96.6%] if the true coverage percentage is 95%. RE(D
1, D
2) = relative efficiency of estimator D
1 to the Mplus (version 7.0) estimation of D
2 using
Next, we generated data with K
overlap > 0. We kept K = JT
= JC
, but set KT
= KC
= 3JT
/5 and so K
overlap = K/5. In other words, 20% of the clusters in random effect B were overlapped between the treatment and the control arms, with a correlation approximately equal to
Simulation Results for Unbalanced CCREMs (With Clusters in Random Effect B Partially Overlapped)
Note. Based on 500 replications for each condition. Pearson’s contingency coefficients for all conditions ranged from .84 to .92, indicating strong associations between the clusterings of random effects A and B. δ
T
= population effect size. JT
= JC
= number of clusters of random effect A in the treatment (control) arm. Total number of clusters of random effect B = JT
, with 20% of the sets of clusters in B overlapping across treatment arms. NT
= NC
= total sample size of the treatment (control) arm. ρ
A
= intraclass correlation of effect A; ρ
B
= 0.1 for all conditions. Coverage refers to the percentage of replications in which the 95% confidence interval includes δ
T
. For 500 replications, the Monte Carlo coverage percentage have a confidence interval of [92.7%, 96.6%] if the true coverage percentage is 95%. RE(D
1, D
2) = relative efficiency of the estimator D
1 to the Mplus (version 7.0) estimation of D
2 using
Real Data Example
We would use part of the National Educational Longitudinal Study data set (NELS:88; Ingles et al., 1990) to illustrate the calculation of D
1 and D
2. This longitudinal study followed a nationally representative sample of students starting from their eighth grade and recorded students’ experiences in a variety of areas such as home and working. A hypothetical research question is whether availability of a mathematics club in middle school (MS) predicted students’ mathematics achievement at 10th grade. Using only cases with complete data on all the variables related to the analysis, the data consisted of 15,611 students cross-classified by 986 MSs (383 with math club) and 1,418 HSs (KT
= 625, KC
= 940, so K
overlap = 147). The average cell size was thus 15,611/(986 × 1,418 = 0.0112). Only 147 HS had both students from MSs with math club and those from MSs without math club, so correlation rK
of the HS effect was
Standardized Mean Differences for Partially Cross-Classified Data
Thus far, we have considered cross-classified data, where observations in both the treatment and the control arms are cross-classified by effects A and B. However, there are designs where only observations in the treatment arm are cross-classified, but the observations in the control arm are nested only in effect A but not in effect B. In this article, we denote such a data structure as partially cross-classified. This is similar to the partially nested design discussed in Bauer, Sterba, and Hallfors (2008), where the observations in the treatment arm are nested within random effect A, whereas those in the control arm are not. The difference is that in partially cross-classified data, there is one more level of nested structure, random effect B, present in both the treatment and the control arms.
Consider a hypothetical example, where students from JT classrooms are randomly assigned to K emotion management groups (i.e., the treatment), and those from other JC classrooms did not receive any treatment. Further, assume that each emotion management group includes students from different classrooms to avoid situations where group members are very familiar with each other. Suppose that a researcher is interested in the effectiveness of the emotion management group on students’ life satisfaction (Y). Such a design can be represented by the model equation
where
Estimation of D 1
As shown in Appendix A (see the online Appendix, available at http://jeb.sagepub.com/supplemental), the sample estimators D 1 and V(D 1) of the effect size δ T are given as
where
where
Estimation of D 2
When maximum likelihood or other unbiased estimates of the fixed effect, the variance components, and their sampling variances are available, one can calculate the standardized mean difference and its sampling variance as
where
If the weighted average of the two variance components is difficult to obtain, researchers can replace
Monte Carlo Study for Evaluating the Two Effect Size Estimation Approaches Under Unbalanced Designs
Similar to what we did for fully cross-classified designs, we used simulations to empirically check the performance of D
1 and D
2 for PCCREMs. The simulation conditions were the same as those used for the previous CCREM simulation study except that the random effect B was not present in the control arm, and the associated variances were added to
As shown in Table 3, both D 1 and D 2 have relative bias less than 5% and coverage percentage of the 95% CI ranging from 92.8% to 95.2%, so their performances are satisfactory. In general, D 1 was less efficient under unbalanced structure when ρ A was large and the average cell size was small, but the loss of efficiency was negligible (minimum relative efficiency being 96.4%). In summary, both approaches to estimate the effect size δ T and V(δ T ) perform similarly well in the chosen unbalanced designs.
Simulation Results for Unbalanced PCCREMs
Note. Based on 500 replications for each condition. δ
T
= population effect size. Pearson’s contingency coefficients for all conditions ranged from .84 to .92, indicating strong associations between the clusterings of random effects A and B for the treatment arm. JT
= number of clusters of random effect A in the treatment arm; number of clusters of random effect B for the treatment arm = JT
. NT
= NC
= total sample size of the treatment (control) arm. ρ
A
= intraclass correlation of effect A; ρ
B
= 0.1 for the treatment group for all conditions. Coverage refers to the percentage of replications in which the 95% confidence interval includes δ
T
. For 500 replications, the Monte Carlo coverage percentage have a confidence interval of [92.7%,96.6%] if the true coverage percentage is 95%. RE(D
1, D
2) = relative efficiency of the estimator D
2 to the estimator D
2 computed from SAS 9.3 with
Real Data Example
For illustrative purpose, the data used in Coyne et al. (2013) were analyzed. The data consisted of 103 kids receiving Early Reading Intervention. The treatment arm consisted of 70 kids who have changed group membership based on their performance over the course of the study, which created a cross-classified data structure, given that the initial group membership of each kid was different from the final group membership. In other words, kids were cross-classified by the initial and final group memberships in the treatment arm. On the other hand, the control arm consisted of 33 kids who were randomly assigned to groups at the beginning of the study and were kept in the same group over the course of study, which created a strictly hierarchical structure for this group of kids. The dependent variable is the score on a word identification test at the final stage. There were 19 groups among those receiving intervention (i.e., JT = 19) and 10 groups among those receiving regular reading instruction (i.e., JT = 10) at the the initial stage. At the final stage, those in regular reading stayed in their original groups, but some of those receiving intervention had moved to different groups. In summary, students receiving regular reading instruction were nested within initial groupings, whereas those receiving intervention were cross-classified by the initial and final groupings (K = 19). The design was not balanced, and the average cell size was 0.187.
Using SPSS
Conclusion
Unlike single-level research studies in which effect sizes are regularly reported, effect size statistics for multilevel studies, in particular standardized mean difference, are still not fully investiaged. Effect size is extremely important because it directly quantifies the effect of interest (e.g., the effect of the treatment and gender difference), regardless of whether the study consists of single-level or multilevel data.
Our article has included analytically derived formulae of the standardized mean difference for fully and partially cross-classified treatment–control arm designs, as well as methods for obtaining the effect size when reliable and consistent estimates of variance components are available. Although the analytical formulae for D 1 are tedious to use and can lose efficiency when the design is unbalanced or when the sample size is small, they are nevertheless important. In secondary analyses and meta-analyses where the clustering is not taken into account in the original analyses or when information about the variance components is not available, D 1 can still be computed when the following information is available: number of clusters, cluster size, and intraclass correlations are available. For occasions where intraclass correlations are not available, Hedges (2007) provided an example of substituting values reported from other studies or with an educated guess, and Ahn et al. (2012) suggested quantitative procedures to estimate the ICCs. In addition, one can perform sensitivity analyses to examine whether different choices of ICCs result in substantial differences in the estimated effect size and the corresponding SEs (Hedges, 2007, 2011).
We have also suggested a method to estimate effect size D 2 using maximum likelihood or Bayesian estimates of variance components. It is easier to implement, and we thus recommend its use when raw data are available. To facilitate future replication and research synthesis, we also recommend researchers analyzing primary data to report the effect size and its sampling variance, or at least the estimated values and the sampling variances (or the SEs) of the variance components.
Given the complexity associated with the effect size estimation equation for CCREMs, a logical question would be when a researcher can ignore one level of clustering (i.e., random effect B) but still get a good estimate of the effect size and the sampling variance. We have reanalyzed the two real data examples by ignoring one level of clustering, and it appears that when the two crossed random effects are highly correlated, omitting one random effect does not lead to substantial differences in point and interval estimates of effect size. This makes sense because when the two effects share a lot of common information, most of the information is still preserved when one effect was omitted (see Luo & Kwok, 2009). On the other hand, if the crossed random effects were only weakly correlated or uncorrelated (such as when the design is balanced), in general, the bias on the estimated sampling variance increases when number of clusters K and the ICC ρ B of the omitted random factor are large, based on Equations 7 and 8. For example, assuming a balanced design, when δ T = 0.5, J = K = 30, n =1, ρ A = 0.25, ρ B = 0.1, ignoring the clustering of B results in an underestimation of V by 19.1% (from 0.046 to 0.037); when n = 10 and other things being unchanged but ρ B = 0.2, then V(D 1) is underestimated by 32.2% (from 0.055 to 0.037); and when n = 1 but K is doubled to 60, V(D 1) is underestimated by 32.8% (from 0.028 to 0.019); This would result in CIs of the effect size that are too narrow and not valid (see Hedges, 2011). Nevertheless, because CCREMs are complex models, further studies are needed to fully understand the impact of ignoring one or more levels of clustering on effect size estimation in real research.
This article is limited to only two-level CCREM with two crossed random effects, which is an extension of two-level multilevel models. However, the framework can be extended to CCREMs with three or more levels and with three or more random effects, or to CCREMs where treatment is defined as a Level 1 variable (i.e., the treatment is individually randomized). Future research can investigate perhaps effect size estimations in more complicated designs, as well as in other models in the multilevel family such as the multiple membership models. Simulation studies comparing different variance component estimation methods (e.g., Bayesian vs. REML) in the process of computing effect size are also highly encouraged. Also, in this article, we assumed that the effect size of interest has σ T as the denominator for standardization. There are occasions where researchers may be interested in effect size with σ B or σ W or other alternatives as the denominator, but they are left for discussions in future studies. Finally, procedures to convert standardized mean difference effect sizes with multilevel structure into proportion of variance accounted for effect sizes (see Luo & Kwok, 2010) will be highly valuable for research synthesis methodology.
Footnotes
Acknowledgments
We thank Dr. Deborah Simmons’s data from the Early Reading Intervention Project (supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R324E060067 to Texas A&M University; the opinions expressed are those of the authors and do not represent views of the U.S. Department of Education). We are also grateful to Dr. Victor Willson for his invaluable suggestions on this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Note
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
