Abstract
Multilevel modeling (MLM) is frequently used to detect cluster-level group differences in cluster randomized trials and observational studies. Group differences on the outcomes (posttest scores) are detected by controlling for the covariate (pretest scores) as a proxy variable for unobserved factors that predict future attributes. The pretest and posttest scores most often used in MLM are total scores. Prior research has raised concerns about measurement error when total scores are used in MLM. In this article, using ordinary least squares and an attenuation formula, we derive the measurement error correction formula for cluster-level group difference estimates from MLM in the presence of measurement error in the outcome, the covariate, or both. Examples are provided to illustrate the correction formula in cluster randomized and observational studies using recently developed between-cluster reliability coefficients.
Introduction
Multilevel designs have been widely adopted in education because individuals (e.g., students) are naturally nested within clusters (e.g., classrooms or schools) in educational settings. In cluster randomized studies, clusters of individuals are assigned at random to treatments. Random assignment may occur at the classroom level rather than at the student level because researchers cannot control students’ class assignment (Raudenbush, 1997). In observational studies, researchers do not have control over the assignment of clusters into groups. For example, the effect of school type (e.g., traditional schools vs. nontraditional schools) on student-level outcomes can be of interest, and school type assignment cannot be controlled by researchers. In both kinds of studies with multilevel designs, one objective of statistical analysis is to explore cluster-level group differences: between a control group and a treatment group in cluster randomized studies, and between cluster-level groups (e.g., groups defined by cluster-level demographic information) in observational studies.
In practice, multilevel modeling (MLM) is the general approach used to detect cluster-level group differences on posttest outcomes; often, related covariates at different levels of multilevel data are controlled in the model (Aitkin & Longford, 1986; Goldstein, 2003, chap. 2). Pretest scores are important covariates to be controlled because they serve as proxy variables for unobserved factors that predict future attributes (e.g., Bloom, Hayes, & Black, 2005). Also, many educational and psychological outcomes, such as ability, are unobservable. Thus, multiple indicators (or items) are often collected to infer the unobserved attributes. When using MLM, the multiple indicators on pre- and posttest measures are frequently summed to form a total score. The question addressed in this article is whether total scores are appropriate to use either as a covariate (i.e., pretest scores) or as the outcome (i.e., posttest scores) in MLM analyses that are used to detect cluster-level group differences. Previous research raises two major concerns about using total scores in MLM: measurement error in covariates (e.g., Lüdtke et al., 2008; Lüdtke, Marsh, Robitzsch, & Trautwein, 2011; Shin & Raudenbush, 2010) and in outcomes (e.g., Fox, 2004; Raudenbush & Sadoff, 2008). However, these previous studies have not examined how measurement error affects the detection of cluster-level group differences when using MLM.
There are two common practices used to ameliorate concerns about measurement error (Cohen, Cohen, West, & Aiken, 2003; Cole & Preacher, 2014): using measurement error correction methods and using latent variable models for explicit modeling of construct(s) with multiple indicators. An attenuation formula to correct for measurement error (Lord & Novick, 1968; Spearman, 1904) has been used for correlation coefficients as evidence of validity or criterion reliability. Cohen et al. (2003) used the attenuation formula to correct for measurement error in outcomes and covariates in linear regression models. Other kinds of measurement error correction methods include errors-in-variables regressions (e.g., Camilli, 2006) and simulation extrapolation (see Carroll, Ruppert, Stefanski, & Crainiceanu, 2006; Fuller, 1987, for reviews on the correction methods). Recently, Lockwood and McCaffrey (2014) presented such correction methods to correct for measurement error in analysis of covariance (ANCOVA) and multilevel ANCOVA, for estimating treatment effects in observational studies. On the other hand, previous research has shown that covariate measurement error is not a problem in treatment effects for experimental or randomized designs with groups that do not differ in average covariate values in ANCOVA (Culpepper & Aguinis, 2011; Porter & Raudenbush, 1987). That is, measurement error in covariates (e.g., pretest scores) may not be of concern for detecting group differences in a randomized design.
Within a structural equation modeling (SEM) framework, there are several studies demonstrating the use of latent variable models to test analysis of variance (ANOVA)–like mean differences across groups at the latent construct level, including structured means models (SMMs; Sörbom, 1974) and multiple-indicator multiple-cause (MIMIC; Jöreskog & Goldberger, 1975) models. A relatively novel analytic framework, multilevel SEM (MSEM), has been used to account for multilevel data in the use of SEM (McDonald, 1993; L. K. Muthén & Muthén, 1998-2014; Rabe-Hesketh, Skrondal, & Pickles, 2004). MSEM for categorical variables is also referred to as explanatory item response modeling (De Boeck & Wilson, 2004) or nonlinear multilevel latent variable modeling (e.g., Yang & Cai, 2014). Multilevel item response models have been used to account for measurement error in covariate(s) (Battauz, Bellio, & Gori, 2011; Fox & Glas, 2003) or in outcomes (e.g., Fox, 2004) or in both covariates and outcomes (e.g., Raudenbush & Sampson, 1999).
However, situations may arise where researchers need to choose measurement error correction methods instead of latent variable modeling approaches. First, the item-level data needed to specify a measurement model are not always available. Many studies consider MLM because only a single outcome (e.g., a standardized test score) is available for analysis. Second, latent variable models often require larger sample sizes than MLM because they have more measurement model parameters to be estimated. When cluster sizes and the number of clusters are not large enough, the use of latent variable models may not be feasible. Third, when MLM is the dominant analytic method in a substantive area, researchers may use MLM to communicate their study results more easily with others in the area.
The purpose of this study is to provide a measurement error correction formula for the cluster-level group difference estimate in the presence of measurement error in outcomes (e.g., posttest) for cluster randomized studies and in outcomes (e.g., posttest) or a covariate (e.g., pretest) or both for observational studies. In the current study, the derivation of the measurement error correction formula is based on ordinary least squares (OLS; e.g., Cohen et al., 2003) and an attenuation formula (Lord & Novick, 1968; Spearman, 1904). It has been shown that the OLS principle can be applied to fixed effect estimation in a two-level random intercept model when covariates are not correlated across levels (Lüdtke et al., 2008; Lüdtke et al., 2011). With an assumption of normally distributed errors, the OLS estimator is a maximum likelihood estimator (MLE; e.g., Neter, Kutner, Nachtsheim, & Wasserman, 1996).
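The two ingredients named above can be illustrated with a minimal Python sketch. It is not the article's correction formula: it shows only the textbook OLS slope for a single predictor and Spearman's classical disattenuation of a correlation coefficient. The function names and all numeric inputs are hypothetical and chosen purely for illustration.

```python
import math


def ols_slope(x, y):
    """OLS slope of y on a single predictor x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def disattenuate_correlation(r_xy, rel_x=1.0, rel_y=1.0):
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    for rel in (rel_x, rel_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliability must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical illustration values (not taken from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
slope = ols_slope(x, y)

# The same observed correlation corrected under a high and a low
# outcome reliability: the lower the reliability, the larger the correction.
corrected_high_rel = disattenuate_correlation(0.40, rel_y=0.90)
corrected_low_rel = disattenuate_correlation(0.40, rel_y=0.50)
```

With perfectly reliable measures (both reliabilities equal to 1) the function returns the observed correlation unchanged, which is a convenient sanity check when adapting the sketch.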
This article is organized as follows. We first specify a two-level random intercept MLM to estimate a cluster-level group difference parameter and describe a multilevel extension of classical test theory (CTT) to characterize measurement error in multilevel modeling. We then provide a measurement error correction formula for the cluster-level group difference estimate from MLM. Finally, we illustrate the formula by applying it to two empirical studies, one cluster randomized and one observational, to detect cluster-level group differences.
Assessing Group Differences and Measurement Error in Multilevel Modeling
A two-level random intercept MLM is chosen to detect the cluster-level group difference on posttest scores, controlling for pretest scores (equation 4.6 in Moerbeek, Van Breukelen, & Berger, 2008). Here, the individual level is called Level 1 (e.g., student-level) and the cluster level is called Level 2 (e.g., classroom level). Denote
A model at Level 1 (e.g., the student level) can be specified as follows:
where
and
Characterizing Measurement Error in Multilevel Modeling
In this subsection, a multilevel extension of CTT (Geldhof, Preacher, & Zyphur, 2014; Lüdtke et al., 2011; B. O. Muthén, 1991) is presented to characterize measurement error in MLM. The observed posttest total score for the outcome,
where
Geldhof et al. (2014) presented separate reliability estimates at each level based on Equation (4). Specifically, within-cluster reliability can be defined as the ratio of the within-level true score variance to total within-level variance (
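The variance-ratio definition of level-specific reliability can be sketched in Python as follows. The sketch assumes, as in CTT, that the total variance at a level is the sum of the true score variance and the error variance at that level. The function name and the variance values are hypothetical; in practice these variance components would have to be estimated from data, which this sketch does not attempt.

```python
def level_reliability(true_score_var, error_var):
    """Reliability at one level: true score variance / total variance at that level."""
    if true_score_var < 0 or error_var < 0:
        raise ValueError("variances must be non-negative")
    total_var = true_score_var + error_var
    if total_var == 0:
        raise ValueError("total variance must be positive")
    return true_score_var / total_var


# Hypothetical variance components (not estimates from the article).
within_reliability = level_reliability(true_score_var=6.0, error_var=2.0)
between_reliability = level_reliability(true_score_var=1.7, error_var=0.3)
```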
Measurement Error Correction for Cluster-Level Group Differences
In this section, we first show the correction formula for measurement error only in the outcome, which can be used for cluster randomized studies. Subsequently, we provide the correction formula for measurement error in the covariate, which can be applied to observational studies. We further show that the latter correction formula is not necessary for cluster randomized studies. Finally, the correction formula for measurement error in the outcome and the covariate is presented for observational studies.
Let
where
Cluster-Level Group Difference With Measurement Error in the Outcome
When there is measurement error only in the outcome,
The following two attenuation formulas can be used to compute
and
where
Substituting Equations (7) and (8) into Equation (6), the measurement error corrected group difference estimate,
As shown in Equation (9), the correction formula is a function of the between-cluster reliability for the outcome.
Cluster-Level Group Difference With Measurement Error in the Covariate
Referring to equation 4.3.6 (p. 122) in Cohen et al. (2003), the unstandardized regression coefficient corrected for measurement error in
where
Measurement error in a covariate is expected in observational studies so that the correction formula provided in Equation (10) can be used to correct for such measurement error. However, the expected bias in
Because every term in Equations (5) and (10) is a constant, Equation (11) is calculated simply as follows:
The expected bias is 0 in the case of
Cluster-Level Group Difference With Measurement Error in the Outcome and Covariate
When there is measurement error in the outcome (
Equation (7) to calculate
and
Substituting Equations (7), (14), and (15) into Equation (13), the measurement error corrected group difference estimate,
As presented in Equation (16), the correction formula is a function of the between-cluster reliability for the outcome and the covariate.
Examples
In this section, we illustrate the use of the correction formula for measurement error only in the outcome (Equation 9) in a cluster randomized study and the use of the correction formula for the outcome and the covariate (Equation 16) in an observational study. In both examples, the main analytic goal is to detect cluster-level group differences using the two-level random intercept MLM based on total pretest and posttest scores (Equation 3). In the examples, for comparison with the OLS estimate for
Example 1: Measurement Error Correction for the Outcome in a Cluster Randomized Study
The data used in the first example were collected in an efficacy trial of an instructional intervention called Enhanced Anchored Instruction (EAI). EAI aims to improve math achievement in middle and high school students (see Bottge et al., 2015, for details on EAI). The efficacy trial used a pretest–posttest cluster randomized design in which schools, rather than classes or students, were randomly assigned to EAI or business as usual (BAU). A main focus of the analysis was whether group (EAI vs. BAU) differences in math computation skill emerged after EAI.
Measure and Samples
The researcher-developed Fraction Computation Test was administered at pretest and posttest. The test has 20 items assessing students’ ability to manually add and subtract fractions. There were a total of 42 points on the test. For 18 of the 20 items, students could earn 0, 1, or 2 points. On the other two items, students could earn up to 3 points if they simplified the answer (i.e., reduced the fraction to simplest terms). Interrater agreement was 99% on the pretest and 97% on the posttest.
Twenty-four middle schools in the Southeastern United States participated in the study. The schools were randomly assigned to EAI and BAU with equal probability. Of the initial sample, 25 students did not respond to all items on the pretest or the posttest and were excluded from the analysis. Accordingly, the final sample included 232 BAU and 214 EAI students. The cluster size (i.e., the number of students for each teacher) ranged from 7 to 28 students and the average cluster size was 17.84. Based on chi-square tests of equal proportions, students were comparable across instructional conditions in terms of gender, ethnicity, subsidized lunch, and disability area (see Bottge et al., 2015). Each school had one participating inclusive math classroom, except for one school that had two participating classrooms. Therefore, clustering due to schools was ignored and a two-level structure (i.e., 446 students nested in 25 teachers) was considered.
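The sample figures reported above can be checked with a few lines of Python. All numbers come from the text; the variable names are illustrative only.

```python
# Figures reported in the text for Example 1.
n_bau = 232        # students in the BAU condition
n_eai = 214        # students in the EAI condition
n_clusters = 25    # teachers (24 schools, one of which had two classrooms)

# Final sample size and average number of students per teacher.
n_students = n_bau + n_eai
average_cluster_size = n_students / n_clusters

# Maximum total score: 18 items worth up to 2 points, 2 items worth up to 3.
max_points = 18 * 2 + 2 * 3

print(n_students, round(average_cluster_size, 2), max_points)
```

The computed values agree with the reported final sample of 446 students, the average cluster size of 17.84, and the 42-point test total.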
Analysis
A group (i.e., treatment condition) covariate was coded with a value of −0.5 for members of the BAU group and a value of 0.5 for the EAI group. The intraclass correlation (based on results of the unconditional two-level random intercept MLM using ML) on the outcome was 0.232, indicating that 23.2% of the total variance is explained by teachers. The between-cluster reliability estimates for posttest scores were 0.8501 for
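The intraclass correlation reported in this analysis is the share of total outcome variance that lies between clusters. A minimal Python sketch of that ratio follows. The variance components are hypothetical and were chosen only so that the ratio reproduces the reported value of 0.232; the article's estimated variance components are not shown in this section, and the function name is illustrative.

```python
def intraclass_correlation(between_var, within_var):
    """ICC: between-cluster variance / (between-cluster + within-cluster variance)."""
    if between_var < 0 or within_var < 0:
        raise ValueError("variances must be non-negative")
    total_var = between_var + within_var
    if total_var == 0:
        raise ValueError("total variance must be positive")
    return between_var / total_var


# Hypothetical variance components, scaled so that the ICC is 0.232.
icc = intraclass_correlation(between_var=2.32, within_var=7.68)
print(f"{icc:.3f}")
```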
Group Difference Estimate Without and With Measurement Error Correction for the Outcome
Example 2: Measurement Error Correction for the Outcome and the Covariate in an Observational Study
In the second example, we use data from an instructional intervention to improve word knowledge of adolescents (see Goodwin, 2016, for details). The design of the study was a pretest–posttest randomized design at the student level, where students nested within teachers were randomly assigned within class to the intervention or comparison instruction. For the illustrative purpose of detecting a cluster (i.e., teacher)-level group difference in the current study, the analysis goal was to detect group differences between teachers from traditional and nontraditional schools. For this analysis goal, we consider this example to be an observational study at the cluster level even though the data are from an (individual-level) randomized study.
Measure and Samples
Word knowledge was measured at pretest and posttest by three researcher-created measures: a multiple-choice measure, a self-report measure, and a measure of depth shown by producing related words. The multiple-choice measure was used in this study. In this measure, students were presented with an underlined target word within a short phrase without context clues and then circled one of five answer choices. There were 16 words (i.e., items), and each item was scored as correct (score of 1) or incorrect (score of 0).
The sample consisted of 202 students (118 fifth-grade; 84 sixth-grade) who were diverse (113 Black, 47 Hispanic, 37 Caucasian, 5 Asian), were mostly in poverty (173 receiving free and reduced lunch services), and spoke a range of languages at home (128 native English speakers, 28 English language learners, 46 language minority youth). These students were taught by 21 teachers with a range of experience levels. Cluster size (i.e., the number of students for each teacher) ranged from 1 to 35 and the average cluster size was 9.619. The study took place within four schools (School A = 13, B = 35, C = 98, D = 56) in the Southeastern United States. Schools A and D were traditional middle schools and Schools B and C were nontraditional schools (i.e., STEM [science, technology, engineering, and math] magnet and charter school, respectively). One student missed the last five items at pretest and was omitted. There were 201 students nested within 21 teachers in the final sample for analysis.
Analysis
Twenty-one teachers were grouped into two groups, teachers in traditional schools (
Based on results of the unconditional two-level random intercept MLM using ML, the intraclass correlation was 0.234 for the outcome (i.e., posttest scores) and 0.196 for the covariate (i.e., pretest scores). These results indicate that 23.4% of the total variance in the outcome and 19.6% of the total variance in the covariate are explained by teachers. The between-cluster reliability estimates for pretest scores were 0.7551 for
Group Difference Estimate Without and With Measurement Error Correction for the Outcome and the Covariate
Summary and Discussion
The number of studies with multilevel designs has increased in educational research. Many researchers collect multiple indicators to measure educational and psychological attributes, which are often subject to measurement error. It has been increasingly common to use latent variable models to explicitly model measurement error in outcomes and/or covariates using multiple indicators (e.g., Fox, 2004; B. O. Muthén & Asparouhov, 2013; Rabe-Hesketh et al., 2004; Raudenbush & Sampson, 1999; Yang & Cai, 2014). However, as noted earlier, researchers may encounter situations where latent variable models cannot be used for measurement error adjustment.
In this article, measurement error correction formulas for a cluster-level group difference estimate from MLM were provided when there is measurement error in the outcome (e.g., posttest scores) in cluster randomized studies and there is measurement error in the outcome (e.g., posttest scores) and the covariate (e.g., pretest scores) in observational studies. We showed that the measurement error correction formula is a function of the between-cluster reliability recently developed by Geldhof et al. (2014). In the examples, we illustrated how to obtain disattenuated cluster-level group difference estimates using the formula in cluster randomized and observational studies.
There are methodological limitations to the current study. First, we limited our focus to the two-level random intercept MLM because it is one of the more popular analytic methods for estimating cluster-level group differences (e.g., Moerbeek et al., 2008; Raudenbush, 1997). In addition, only pretest covariates (at Levels 1 and 2) and a binary group covariate (at Level 2) were considered in the model because we focus attention on cluster-level group effects and pretest scores. Additional work is required for other specifications of MLM having more hierarchical levels and additional covariates.
Second, the measurement error correction formula we provided assumes that an unbiased estimate of the between-cluster reliability is available to researchers. The empirical illustration was based on the between-cluster reliability coefficients described by Geldhof et al. (2014). In a simulation study, they found that between-cluster reliability coefficient estimates cannot be trusted when cluster size is small (i.e., 15 or fewer individuals per cluster) and intraclass correlation is low (i.e.,
Third, this study used an attenuation formula for measurement error correction. As shown in the formula, the lower the between-cluster reliability, the greater will be the correction. Unlike the measurement error correction for correlations, there is no range restriction for the disattenuated estimate in MLM using the disattenuation formula. However, the correction formula we provided for MLM estimates shares limitations of the correction formula for correlation coefficients (see Muchinsky, 1996, for a review of the limitations). For example, the interpretation of a dramatically elevated disattenuated estimate is challenging, especially when the between-cluster reliability coefficient estimate is small (e.g.,
Fourth, scale scores calculated from measurement models (e.g., factor analytic models, item response models) can be used as the outcome in MLM when they are available to researchers in addition to the total scores. This use of scale scores can be called a two-stage procedure: the scale scores are calculated using measurement models in the first stage and are then used as outcomes and covariates in MLM in the second stage. Further study is needed to compare the two-stage procedure with the measurement error correction method presented here in terms of their relative performance in detecting cluster-level group differences.
Despite these limitations, this article highlighted that the cluster-level group difference estimate from MLM can be attenuated in the presence of measurement error in the outcome in cluster randomized studies and in the presence of measurement error in the outcome and the covariate in observational studies. Attenuation due to measurement error is a well-known problem for correlations and for linear regression models. However, no study to date has shown that the attenuation formula is also applicable to MLM for detecting cluster-level group differences. Furthermore, substantive researchers continue to use cluster-level group difference estimates from MLM based on total scores from unreliable measures. This article showed one possible measurement error correction method when researchers need to report group difference estimates from MLM in the presence of measurement error and between-cluster reliability is available to them.
Acknowledgements
We are grateful to Dr. Brian Bottge (University of Kentucky) and Dr. Amanda Goodwin (Vanderbilt University) for making the data available for applications.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
