Multilevel Factor Analysis by Model Segregation

Abstract

Measures of classroom environments have become central to policy efforts that assess school and teacher quality. This has sparked wide interest in using multilevel factor analysis to test measurement hypotheses about classroom-level variables. One approach partitions the total covariance matrix and tests models separately on the between-classroom and within-classroom levels. This article shows that when using this approach, robust test statistics, including rescaled and residual-based test statistics, provide better inferences about the classroom-level measurement structure than the widely used likelihood ratio test statistic, even when the number of classrooms is large and there is no excess kurtosis in the observed variables. This article then presents an empirical example and a simulation study to demonstrate how item intraclass correlations and within-group sample sizes influence test statistic performance. The results have implications for the study of classroom environments.

Keywords

learning environments, factor analysis, multilevel modeling

Survey-based measures of classroom quality have become a staple of many teacher performance portfolios. Seventeen states and many local education agencies, including Chicago and Memphis, Tennessee, include student surveys as measures of teacher quality or professional practice (National Council for Teacher Quality, 2013). Measures of teacher quality and professional practice are constructed based on aggregated student survey responses. The applied literature has paid increasing attention to measurement models that account for the hierarchical structure of these surveys and the fact that individual students are associated with specific classrooms. There is a long tradition of literature (e.g., Cronbach, 1976; Harnqvist, 1978; Julian, 2001; Longford & Muthén, 1992; Zyphur, Kaplan, & Christian, 2008) suggesting that single-level analytic methods that do not account for hierarchical data structures are problematic and can be “substantively misleading” (Reise, Ventura, Nuechterlein, & Kim, 2005, p. 130).

Multilevel factor analysis (e.g., Goldstein, 2003; Lee, 1990; Longford & Muthén, 1992; McDonald & Goldstein, 1989; Muthén, 1991, 1994; Rabe-Hesketh, Skrondal, & Zheng, 2007) provides a method to analyze multivariate data that are hierarchically structured. One widely used framework (Muthén, 1994) partitions the total covariance matrix into independent between-group (or group level) and within-group (or individual level) covariance matrices. As in conventional single-level factor analysis, researchers are often interested in testing measurement hypotheses in multilevel factor analysis by using test statistics. There are several different approaches that can be used to assess the adequacy of measurement models (e.g., Hox, 2010; Ryu & West, 2009) in multilevel factor analysis. These include simultaneously modeling both within-group and between-group covariance structures (e.g., Muthén, 1994), saturating (i.e., estimating all item covariances) the model at one level and fitting a factor model at the other level (e.g., Hox, 2010), and segregating the between and within covariance matrices and conducting factor analysis one level at a time (e.g., Yuan & Bentler, 2007).

In conventional factor analysis, the commonly used likelihood ratio test statistic is derived under the assumption that the observed data are continuous and multivariate normal (e.g., Bollen, 1989). Asymptotically, when this assumption holds, this test statistic will be appropriately distributed and inferences drawn from the model will be valid. In fact, it has been shown that normal theory estimators generally remain consistent and test statistics are correctly distributed unless kurtosis in the observed variables is excessive (Browne, 1984; Muthén & Kaplan, 1985, 1992).

Because the segregating method proceeds by conducting two conventional factor analyses, it is often assumed that if sample sizes are sufficiently large, there is no excess kurtosis, and the measurement model is correctly specified, then inferences about the between-group covariance structure based on the likelihood ratio test statistic will be valid (e.g., Goldstein, 2003; Hox & Maas, 2004; Ryu & West, 2009). However, there are situations where this is not the case. While the statistical basis for this phenomenon has been developed elsewhere (Yuan & Bentler, 2002, 2006, 2007), the poor performance of this test statistic is not widely known and is rarely mentioned in the multilevel factor analysis literature. In fact, the poor performance of the likelihood ratio test statistic is frequently characterized as evidence of model misspecification in applied literature (Mathisen, Torsheim, & Einarsen, 2006).

This article is organized as follows. First, the multilevel factor analysis framework is briefly described, along with the rationale for model testing at the between level. Second, the three major approaches to testing multilevel factor models are summarized. Third, four test statistics are presented, including the conventional likelihood ratio test statistic, Satorra and Bentler’s (1988) rescaled test statistic, Browne’s (1982, 1984) residual-based test statistic, and Yuan and Bentler’s (1998) adjusted residual-based test statistic. Fourth, an empirical example from a classroom environment survey illustrates how these statistics may influence inferences about the measurement model in multilevel contexts. Finally, a simulation study is presented to demonstrate the specific conditions under which test statistics may yield valid inferences. The final section discusses the implications of the results for the use of the segregating method to investigate the between-classroom factorial structure of classroom environment surveys and other surveys that have a group or a cluster as the primary unit of analysis.

Theoretical Background

Multilevel Factor Analysis

The multilevel factor analysis framework used in this study (e.g., Goldstein, 2003; Lee, 1990; Longford & Muthén, 1992; McDonald & Goldstein, 1989; Muthén, 1991, 1994) is based on a two-level score decomposition (Liang & Bentler, 2004; Longford & Muthén, 1992; Yuan & Bentler, 2007):

y_{i j} = μ + u_{j} + e_{i j},

where the vector of p observed scores for individual i in group j (y_ij) can be decomposed into a vector of means (μ) and independent between-group (u_j) and within-group (e_ij) random components. The associated covariance matrix of the observed scores can be expressed as

Σ_{T} = Σ_{B} + Σ_{W},

where Σ _T , Σ _B , and Σ _W are symmetric p × p covariance matrices. The covariance matrices can be expressed in two separate factor models (e.g., Bollen, 1989), one for the between-group level:

Σ_{B} = Λ_{B} Φ_{B} Λ_{B}^{T} + Ψ_{B},

and another for the within-group level

Σ_{W} = Λ_{W} Φ_{W} Λ_{W}^{T} + Ψ_{W} .

Here Λ_B is a p × k matrix of factor loadings for p items on k factors and Λ_W is a p × r matrix of factor loadings for p items on r factors. Note that while it is possible for k = r and for Λ_B = Λ_W, this is not necessary. Φ_B and Φ_W are k × k and r × r matrices of factor covariances, respectively, and Ψ_W and Ψ_B are p × p diagonal matrices containing unique (residual) variances. Likewise, Φ_B need not equal Φ_W, and Ψ_W need not equal Ψ_B.
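As a minimal sketch of the two level-specific factor models, the following NumPy code builds model-implied between and within covariance matrices and sums them to obtain the total covariance matrix. The numerical values (four items, one between factor, two within factors) and the helper name `implied_cov` are hypothetical and chosen only for illustration.

```python
import numpy as np

def implied_cov(loadings, factor_cov, uniquenesses):
    """Model-implied covariance: Lambda Phi Lambda^T + Psi, with Psi diagonal."""
    return loadings @ factor_cov @ loadings.T + np.diag(uniquenesses)

# Hypothetical illustration: p = 4 items, k = 1 between factor, r = 2 within factors.
Lam_B = np.array([[0.7], [0.7], [0.7], [0.7]])
Phi_B = np.array([[1.0]])
Psi_B = np.full(4, 0.51)

Lam_W = np.array([[0.7, 0.0], [0.7, 0.0], [0.0, 0.7], [0.0, 0.7]])
Phi_W = np.array([[1.0, 0.3], [0.3, 1.0]])
Psi_W = np.full(4, 0.51)

Sigma_B = implied_cov(Lam_B, Phi_B, Psi_B)   # between-group structure
Sigma_W = implied_cov(Lam_W, Phi_W, Psi_W)   # within-group structure
Sigma_T = Sigma_B + Sigma_W                  # total covariance matrix
```

The sketch makes the point of the preceding paragraph concrete: the number of factors and the loading pattern differ across levels, yet both structures contribute additively to the total covariance matrix.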

The Rationale for Between-Level Model Testing With Student Surveys

Surveys of classroom environments often assume a specific measurement model where students are treated as objective raters of the classrooms in which they study (e.g., Ferguson, 2012; Follman, 1992; Worrell & Kuterbach, 2001). Variance between students within the same classroom is attributable to sampling error and represents “noise.” Averaging over individual students, variance between classrooms represents true variance in classroom quality. In this way, these surveys are often designed to measure climate variables (Marsh et al., 2012), and the primary unit of analysis is the classroom. Accordingly, understanding the between-classroom factor structure is critical for developing and testing theories about how the classroom climate relates to other variables of substantive interest, such as student achievement and persistence in school. There is a long tradition of research suggesting that multilevel factor analysis is the appropriate tool for testing the between-level measurement models in hierarchically structured data (Cronbach, 1976; Harnqvist, 1978; Julian, 2001; Longford & Muthén, 1992; Marsh et al., 2012; Reise et al., 2005; Zyphur et al., 2008).

Three Approaches to Multilevel Fit Testing

Though multilevel factor analysis provides a framework to test between-classroom measurement models, there is little consensus on the best approach to evaluate models within that framework. There are three primary approaches described in the methodological literature on multilevel factor analysis: (1) simultaneously modeling the within-level and between-level structures (Muthén, 1994); (2) fitting an unrestricted (saturated) model at the within level and testing a measurement model at the between level (Hox, 2010; Muthén, 1994; Ryu & West, 2009) referred to as the “partially saturated model method” (Ryu & West, 2009, p. 589); and (3) segregating the between and within covariance matrices and conducting separate factor analyses referred to as the “segregating” method (Ryu & West, 2009, p. 592; Yuan & Bentler, 2007).

It has been shown in several studies (e.g., Hox, 2010; Ryu & West, 2009; Yuan & Bentler, 2007) that simultaneously modeling the within- and between-level structures does not produce meaningful diagnostic information about the between-level factor structure. Thus, the simultaneous modeling of between- and within-factor structures makes model or theory revision difficult (Yuan & Bentler, 2007), and this approach is not recommended in the literature. The partially saturated model method, on the other hand, does provide level-specific diagnostic information but was not meant to provide parameter estimates or standard errors (Ryu & West, 2009, p. 599; Yuan & Bentler, 2007). A practical issue with this method is that estimates of fit indices such as the root mean square error of approximation (Steiger & Lind, 1980) and the comparative fit index (Bentler, 1990) provided by software programs will spuriously show good fit (Hox, 2010, p. 307) and so may be misinterpreted (e.g., Kunter et al., 2008; Rosenberg, 2009).

The segregating method (Yuan & Bentler, 2007), which is the focus of this article, is operationalized in two steps. First, the total covariance matrix is partitioned and maximum likelihood estimates (MLEs) of Σ _B and Σ _W in Equation 2 are obtained. For balanced data, the MLEs of these two matrices are unbiased estimates of the population matrices, even when the data are not normally distributed (Muthén, 1994). Once the matrices have been separated, conventional single-level factor analyses can be conducted. Similar approaches are described in Goldstein (2003, p. 189) and Hox (2010). This approach potentially allows for a wide variety of test statistics and fit indices (Yuan & Bentler, 2007), since the model testing proceeds as two separate conventional single-level analyses. It also allows for parameter estimates and standard errors (Ryu & West, 2009, p. 599) to be obtained.
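The first step can be sketched for balanced data as follows. This is a simplified moment-type estimator based on the pooled within-group and between-group sample covariance matrices, written under the assumption of equal group sizes; it is not necessarily the routine used by any particular software package, and the function name `segregate` is hypothetical.

```python
import numpy as np

def segregate(Y, groups):
    """Estimate (Sigma_B, Sigma_W) from an N x p data matrix with balanced groups."""
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups).ravel()
    labels, inverse = np.unique(groups, return_inverse=True)
    J = labels.size                      # number of groups
    N = Y.shape[0]
    m = N // J                           # common group size (balanced case assumed)

    # Group means, one row per group.
    means = np.vstack([Y[inverse == j].mean(axis=0) for j in range(J)])

    # Pooled within-group covariance matrix.
    within_dev = Y - means[inverse]
    S_pw = within_dev.T @ within_dev / (N - J)

    # Scaled between-group covariance matrix; its expectation is Sigma_W + m * Sigma_B.
    between_dev = means - Y.mean(axis=0)
    S_b = m * between_dev.T @ between_dev / (J - 1)

    Sigma_W_hat = S_pw
    Sigma_B_hat = (S_b - S_pw) / m
    return Sigma_B_hat, Sigma_W_hat

# Tiny toy data set: one variable, two groups of two observations each.
Y_toy = np.array([[0.0], [2.0], [4.0], [6.0]])
Sigma_B_hat, Sigma_W_hat = segregate(Y_toy, [0, 0, 1, 1])
```

Once the two matrices are available, each can be passed to a conventional single-level factor analysis routine, which is the second step of the method.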

Because the segregating method is a two-step procedure, parameter estimation may be less efficient than estimation under the partially saturated model method (Goldstein, 2003). However, Yuan and Bentler (2007) suggested that, in small- to medium-sized samples, particularly with larger models, estimation under the segregating method may actually be more efficient than the partially saturated model method, because parameter estimates based on a smaller model will have more numerical stability (the segregating method will, in general, have far fewer parameters than the partially saturated model method; Yuan & Bentler, 2007, p. 56). The author is unaware of any systematic comparison of the relative efficiency of estimation under the segregating and partially saturated modeling methods in the literature.

Four Test Statistics

Test statistics used in conjunction with the segregating method can be considered from a conventional, single-level framework, since the segregating method is operationalized by performing a series of conventional factor analyses. Before defining the test statistics used in this analysis, some general notation will be presented. Given a symmetric matrix A, let vech(A) be the half-vectorization of A. If the dimension of A is p × p, it has $p^{*} = \frac{(p + 1) p}{2}$ unique elements, and vech(A) is a p* × 1 vector. The matrix D_p is a p² × p* duplication matrix (e.g., Magnus & Neudecker, 1988). Additionally, let a function with a dot on top denote a derivative. For example, let $\dot{σ} (θ)$ denote the derivative of σ(θ) with respect to θ. For a total sample size N, let n = N − 1.
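The notation above can be made concrete with a short NumPy sketch of the half-vectorization and the duplication matrix. The function names are hypothetical; the only property relied on is the standard identity D_p vech(A) = vec(A) for symmetric A, with vec stacking columns.

```python
import numpy as np

def vech(A):
    """Stack the lower-triangular elements of a symmetric matrix column by column."""
    i, j = np.triu_indices(A.shape[0])   # pairs with i <= j, row-major
    return A[j, i]                       # element (row j, column i), column-major over the lower triangle

def duplication_matrix(p):
    """Return the p^2 x p* matrix D_p such that D_p @ vech(A) equals vec(A)."""
    cols, rows = np.triu_indices(p)      # (row >= col), in the same order as vech
    D = np.zeros((p * p, p * (p + 1) // 2))
    for k, (r, c) in enumerate(zip(rows, cols)):
        D[c * p + r, k] = 1.0            # position of element (r, c) in column-major vec
        D[r * p + c, k] = 1.0            # position of the symmetric element (c, r)
    return D

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 6.0]])
D3 = duplication_matrix(3)
```

For p = 3 there are p* = 6 unique elements, so `vech(A)` has length 6 and `D3` has shape 9 × 6.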

Given a p × p population covariance matrix Σ and a q-vector of free parameters θ, a testable null hypothesis can be expressed as Σ(θ) = Σ. In other words, the population covariance matrix, Σ, can be expressed as a function of the model parameters, θ (Bollen, 1989). This null hypothesis can be tested using a test statistic obtained from minimizing a discrepancy function, F[S, Σ(θ)], which indicates the discrepancy between the sample covariance matrix, S, and the model-implied covariance matrix Σ(θ). Optimal estimates of model parameters, $\hat{θ}$, are found at the minimum of F.

Bentler and Dudgeon (1996) note that all discrepancy functions are associated with a weight matrix, W, and an asymptotic covariance matrix, Γ, which is given by the distribution of $\sqrt{n} (s - σ (θ))$:

\sqrt{n} (s - σ (θ)) \overset{d}{\to} N (0, Γ),

where s = vech(S) and σ(θ) = vech(Σ(θ)). Γ is a symmetric positive definite p* × p* matrix. In the case of the between-group covariance matrix, $\sqrt{J} ({\hat{σ}}_{B} - σ_{B} (θ)) \overset{d}{\to} N (0, Γ_{B})$, where ${\hat{σ}}_{B} = \mathrm{vech} ({\hat{Σ}}_{B})$, σ_B(θ) = vech(Σ_B(θ)), Σ_B(θ) is the model-implied between-level covariance matrix, and J is the number of groups (Yuan & Bentler, 2007).

Following Browne (1984; see also Bentler & Dudgeon, 1996; Foldnes, Foss, & Olsson, 2012), a discrepancy function is correctly specified for W if

W \overset{p}{\to} Γ^{- 1} .

When the model is correct and the discrepancy function is correctly specified:

n \hat{F} \overset{d}{\to} χ_{d}^{2},

where $\hat{F} = F [S, Σ (\hat{θ})]$ , the minimized value of the discrepancy function. The degrees of freedom, d, is given by d = p* − q (e.g., Bollen, 1989).

The likelihood ratio test statistic T_ML

The ML discrepancy function (Jöreskog, 1967) is derived from the normal theory log likelihood (e.g., Bentler & Yuan, 1999). Optimal estimates of model parameters, $\hat{θ}$ , are found by minimizing

F_{ML} = \log |Σ (θ)| + \mathrm{tr} (S Σ {(θ)}^{- 1}) - \log |S| - p,

where |·| denotes the determinant, and tr denotes the trace of a matrix. In conventional factor analysis, S is the typical sample covariance matrix. In using the segregating method to investigate the between-level covariance structure, S is given by ${\hat{Σ}}_{B}$ and Σ(θ) is given by Σ _B (θ). Corresponding to this discrepancy function, the test statistic T_ML can be defined as $T_{M L} = n {\hat{F}}_{M L} .$ In conventional factor analysis, n = N – 1, one less than the total sample size. In using the segregating method to investigate between-level covariance structure n = J − 1.
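A minimal sketch of the ML discrepancy function and the corresponding test statistic is given below. It evaluates F_ML at a supplied model-implied matrix rather than minimizing over θ, so the fitted matrix is assumed to come from some separate estimation routine; the function names and the example matrices are hypothetical.

```python
import numpy as np

def f_ml(S, Sigma):
    """ML discrepancy: log|Sigma| + tr(S Sigma^{-1}) - log|S| - p."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p

def t_ml(S, Sigma_hat, n_units):
    """T_ML = n * F_ML, with n = n_units - 1 (N - 1 single-level, J - 1 between level)."""
    return (n_units - 1) * f_ml(S, Sigma_hat)

# Hypothetical example: a 2 x 2 "sample" matrix against an identity model matrix.
S_example = np.array([[1.0, 0.5], [0.5, 1.0]])
Sigma_example = np.eye(2)
F_example = f_ml(S_example, Sigma_example)
T_example = t_ml(S_example, Sigma_example, n_units=285)
```

For the between-level analysis, `S` would be the estimated between-group covariance matrix and `n_units` the number of groups J.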

F _ML can be understood as asymptotically equivalent to a special member of a class of generalized least squares estimators (Browne, 1974) with a weight matrix given by:

W_{ML} = .5 D_{p}^{T} (Σ {(\hat{θ})}^{- 1} \otimes Σ {(\hat{θ})}^{- 1}) D_{p} .

When the model is correct, under the assumption of multivariate normality, W_ML satisfies Equation 6, F_ML is asymptotically optimal (Browne, 1974; Foldnes et al., 2012, p. 373), and T_ML will be asymptotically distributed as a central χ² variate. In fact, Browne (1984) suggests that under some conditions, the weight matrix given in Equation 9 may still be correctly specified, provided there is no excess multivariate kurtosis in the observed variables.

The residual-based test statistics T_RADF and T_CRADF

Browne (1982, 1984) described a class of residual-based test statistics based on arbitrary distributional assumptions. A thorough discussion of these statistics can be found in Foldnes et al. (2012). Yuan and Bentler (2007) adapt Browne’s (1984) residual-based asymptotically distribution-free statistic for use in conjunction with the segregating method. The residual-based test statistic, T _RADF, is given by

T_{RADF} = n {\hat{e}}^{T} \{{\dot{σ}}_{c} (\hat{θ}) {[{\dot{σ}}_{c} {(\hat{θ})}^{T} \hat{Γ} {\dot{σ}}_{c} (\hat{θ})]}^{- 1} {\dot{σ}}_{c} {(\hat{θ})}^{T}\} \hat{e} .

In conventional factor analysis, $\hat{e} = s - σ (\hat{θ})$, ${\dot{σ}}_{c} (\hat{θ})$ is a p* × (p* − q) full-rank orthogonal complement of $\dot{σ} (\hat{θ})$, and $\hat{Γ}$ is a sample estimate of Γ. $\hat{Γ}$ is often obtained by calculating the fourth-order central sample moments (e.g., Bentler, 2006). In using the segregating method to investigate the between-level covariance structure, $\hat{e} = {\hat{σ}}_{B} - σ_{B} (\hat{θ})$, ${\dot{σ}}_{c} (\hat{θ})$ is the full-rank orthogonal complement of ${\dot{σ}}_{B} (\hat{θ})$, $\hat{Γ}$ is replaced with ${\hat{Γ}}_{B}$, and n = J − 1. Yuan and Bentler (2002, 2006, 2007) proposed using generalized estimating equations (Liang & Zeger, 1986; Yuan & Jennrich, 1998) to obtain ${\hat{Γ}}_{B}$.

Yuan and Bentler (1998, 2007) suggested a small sample corrected version to T _RADF for use in conjunction with the segregating method:

T_{CRADF} = \frac{T_{RADF}}{1 + \frac{T_{RADF}}{J}} .

Neither T_RADF nor T_CRADF will be defined unless ${\dot{σ}}_{c} {(\hat{θ})}^{T} \hat{Γ} {\dot{σ}}_{c} (\hat{θ})$ in Equation 10 is invertible.
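The two residual-based statistics can be sketched as follows, assuming the residual vector, the Jacobian of σ(θ), and an estimate of Γ have already been obtained elsewhere. The orthogonal complement is computed here from a singular value decomposition, which is one of several valid choices; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def orth_complement(sigma_dot):
    """Return a p* x (p* - q) full-rank basis orthogonal to the columns of sigma_dot."""
    U, _, _ = np.linalg.svd(sigma_dot, full_matrices=True)
    q = np.linalg.matrix_rank(sigma_dot)
    return U[:, q:]

def t_radf(e, sigma_dot, Gamma, n):
    """Residual-based ADF statistic; fails if C^T Gamma C is singular."""
    C = orth_complement(sigma_dot)
    middle = C.T @ Gamma @ C
    return n * (e @ C @ np.linalg.solve(middle, C.T @ e))

def t_cradf(T_radf, J):
    """Small-sample corrected version: T / (1 + T / J)."""
    return T_radf / (1.0 + T_radf / J)

# Toy inputs (hypothetical): p* = 3 moments, q = 1 parameter, Gamma = identity.
e_toy = np.array([0.5, 0.1, 0.2])
sigma_dot_toy = np.array([[1.0], [0.0], [0.0]])
T1 = t_radf(e_toy, sigma_dot_toy, np.eye(3), n=99)
T2 = t_cradf(T1, J=100)
```

Because the correction divides by a quantity greater than 1 whenever the uncorrected statistic is positive, the corrected statistic is always the smaller of the two.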

The rescaled test statistic T_RML

T_RML was designed to rescale T_ML based on excess skew and kurtosis in the observed variables (Satorra & Bentler, 1988). Let

\hat{U} = {\hat{W}}_{ML} - {\hat{W}}_{ML} \dot{σ} (\hat{θ}) {(\dot{σ} {(\hat{θ})}^{T} {\hat{W}}_{ML} \dot{σ} (\hat{θ}))}^{- 1} \dot{σ} {(\hat{θ})}^{T} {\hat{W}}_{ML} .

Also let $k = t r (\hat{U} \hat{Γ}) / d$ . Then:

T_{RML} = \frac{T_{ML}}{k} .

Yuan and Bentler (2007) proposed a version of T_RML for use in conjunction with the segregating method, where W_ML = W_B , the weight matrix in Equation 9 evaluated at Σ _B (θ), ${\dot{σ}}_{c} (\hat{θ})$ is the full-rank orthogonal complement of ${\dot{σ}}_{B} (\hat{θ})$ , and $\hat{Γ}$ is replaced with ${\hat{Γ}}_{B}$ . While T_RML is not generally χ² distributed, its first moment is asymptotically equal to the first moment of $χ_{d}^{2}$ (e.g., Bentler & Yuan, 1999).
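A minimal sketch of the rescaling step is shown below, assuming the weight matrix, the Jacobian of σ(θ), and an estimate of Γ are supplied by the user. The function name and the toy inputs are hypothetical. The example also illustrates a property that follows from the definitions: when Γ equals the inverse of the weight matrix, the trace of UΓ equals d, so the scaling constant is 1 and T_ML is left unchanged.

```python
import numpy as np

def t_rml(T_ml, W, sigma_dot, Gamma, d):
    """Rescaled statistic T_ML / k with k = tr(U Gamma) / d; returns (T_RML, k)."""
    WD = W @ sigma_dot
    # U = W - W sd (sd^T W sd)^{-1} sd^T W, using the symmetry of W.
    U = W - WD @ np.linalg.solve(sigma_dot.T @ WD, WD.T)
    k = np.trace(U @ Gamma) / d
    return T_ml / k, k

# Toy inputs (hypothetical): p* = 3, q = 1, so d = 2.
W_toy = np.diag([1.0, 2.0, 3.0])
sd_toy = np.array([[1.0], [1.0], [0.0]])

# Correctly specified weight matrix: Gamma = W^{-1}.
T_same, k_same = t_rml(10.0, W_toy, sd_toy, np.linalg.inv(W_toy), d=2)

# Gamma twice as large as W^{-1}: the statistic is halved.
T_half, k_two = t_rml(10.0, W_toy, sd_toy, 2.0 * np.linalg.inv(W_toy), d=2)
```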

Behavior of T_ML in the Segregating Methodology

In using the segregating method, T_ML is often expected to converge to a central χ² distribution with d degrees of freedom if the model is correct and there is no excess skew or kurtosis in the observed variables. Several sources (Goldstein, 2003; Hox, 2010; Hox & Maas, 2004; Ryu & West, 2009) suggested that T_ML will behave in this way and can be used to evaluate the between-level measurement models.

In practice, however, and contrary to the advice given in these sources, T_ML may be inflated and may not have the correct asymptotic distribution, even when the data are normally distributed and the model is correctly specified. The extent of the inflation will be related to (1) the proportion of total observed variance attributable to group membership (i.e., the intraclass correlations [ICCs] of the observed variables) and (2) within-group sample size.

For clarity of presentation, we will assume that the groups are balanced (i.e., m₁ = m₂ = … = m_J = m) and that the observed scores (as defined in Equation 1) are multivariate normal in distribution. The ICC represents the proportion of observed variance attributable to group membership and can be obtained from the diagonal elements of Σ_B and Σ_W. For any given item p, the ICC can be expressed as

\mathrm{ICC}_{p} = \frac{Σ_{B p p}}{Σ_{B p p} + Σ_{W p p}},

where Σ _Bpp and Σ _Wpp are the diagonal elements of Σ _B and Σ _W , respectively. ICC values range between 0 and 1, and for a fixed value of $σ_{B}^{2}$ , the ICCs will increase as $σ_{W}^{2} \to 0$ .

Under normal theory, the asymptotic covariance matrix of $\sqrt{J} ({\hat{σ}}_{B}^{2} - σ_{B}^{2})$ is given by the inverse of the Fisher information (Yuan & Bentler, 2006):

\begin{aligned} Γ_{B} = {[2^{- 1} D_{p}^{T} ({(Σ_{B} + \frac{1}{m} Σ_{W})}^{- 1} \otimes {(Σ_{B} + \frac{1}{m} Σ_{W})}^{- 1}) D_{p}]}^{- 1} \\ + \frac{1}{m^{2}} {((m - 1) 2^{- 1} D_{p}^{T} (Σ_{W}^{- 1} \otimes Σ_{W}^{- 1}) D_{p})}^{- 1} \end{aligned} .

Γ _B depends not only on information from Σ _B but on information from Σ _W as well.

When a factor analysis is performed on ${\hat{Σ}}_{B}$ using ML estimation in conventional software, it is associated with the weight matrix W_B , described previously. However, Equation 6 implies that in order for F_ML to be correctly specified for W_B in using the segregating method, W_B must converge to $Γ_{B}^{- 1}$ . In order for W_B to converge to $Γ_{B}^{- 1}$ , the terms involving Σ _W in Equation 15 need to be ignorable.

The ignorability of these terms is directly related to the ICC of the observed variables and within-group sample size. Keeping Σ _B fixed, as the ICC increases, Σ _W approaches zero, and the terms involving Σ _W in Equation 15 become ignorable. For low ICCs, where Σ _W is relatively large, these terms will not be ignorable. Alternatively, keeping ICC fixed, as m, the within-group sample size, increases, the terms involving Σ _W in Equation 15 become ignorable. For small within-group sample sizes, these terms will not be ignorable.
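The argument can be illustrated numerically with a short sketch that evaluates the normal-theory expression for Γ_B and compares it with the inverse of the conventional ML weight matrix evaluated at Σ_B. The single-variable case is used so that the arithmetic is easy to follow; the variance values and function names are hypothetical.

```python
import numpy as np

def duplication_matrix(p):
    """p^2 x p* matrix D_p with D_p @ vech(A) = vec(A) for symmetric A."""
    cols, rows = np.triu_indices(p)
    D = np.zeros((p * p, p * (p + 1) // 2))
    for k, (r, c) in enumerate(zip(rows, cols)):
        D[c * p + r, k] = 1.0
        D[r * p + c, k] = 1.0
    return D

def normal_theory_acov(Sigma):
    """[0.5 * D^T (Sigma^{-1} kron Sigma^{-1}) D]^{-1}."""
    D = duplication_matrix(Sigma.shape[0])
    Si = np.linalg.inv(Sigma)
    return np.linalg.inv(0.5 * D.T @ np.kron(Si, Si) @ D)

def gamma_b(Sigma_B, Sigma_W, m):
    """Normal-theory Gamma_B for balanced groups of size m."""
    first = normal_theory_acov(Sigma_B + Sigma_W / m)
    second = normal_theory_acov(Sigma_W) / ((m - 1) * m ** 2)
    return first + second

def inflation_ratio(Sigma_B, Sigma_W, m):
    """tr(Gamma_B) relative to tr(W_B^{-1}); equals 1 when the Sigma_W terms vanish."""
    return np.trace(gamma_b(Sigma_B, Sigma_W, m)) / np.trace(normal_theory_acov(Sigma_B))

Sigma_B_1 = np.array([[1.0]])
low_icc = np.array([[9.0]])    # ICC = 1 / (1 + 9) = .10
high_icc = np.array([[1.0]])   # ICC = 1 / (1 + 1) = .50

ratio_low_m10 = inflation_ratio(Sigma_B_1, low_icc, m=10)
ratio_low_m50 = inflation_ratio(Sigma_B_1, low_icc, m=50)
ratio_high_m10 = inflation_ratio(Sigma_B_1, high_icc, m=10)
```

In this toy setting the ratio moves toward 1 as either the ICC or the within-group sample size increases, which is the sense in which the Σ_W terms become ignorable.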

This implies that F_ML is particularly likely to be misspecified for W_B when ICCs are low or within-group sample sizes are small. Under those conditions, T_ML will not converge in distribution to a centrally distributed χ² variate, even when the model is correct and the number of groups is sufficiently large. As a result, inferences about model structure based on T_ML may not be valid for the segregated analysis of ${\hat{Σ}}_{B}$ even when the data are normally distributed. It should be noted that, while the above argument assumed that the groups were balanced, this assumption was made only to simplify the presentation. Results in Yuan and Bentler (2002, 2006) suggest that similar results would hold for the case of unbalanced groups.

Behavior of the Residual-Based Test Statistics

Unlike T_ML , the residual-based and rescaled test statistics use information from both between and within covariance sources through ${\hat{Γ}}_{B}$ . Thus, these test statistics are expected to converge to the correct distribution regardless of ICC or within-group sample size.

T_RML is expected to converge to a distribution with the correct first moment regardless of ICC and within-group sample size. The scaling constant, k, will be greater than 1. In conventional, single-level factor analysis, Bentler (2006) explained that $t r (\hat{U} \hat{Γ})$ can be thought of as a way to determine the discrepancy between the hypothesized model and data distribution (carried by $\hat{U}$ ) and the true data distribution (carried by $\hat{Γ}$ ). In analyzing ${\hat{Σ}}_{B}$ , the discrepancy between $\hat{U}$ and ${\hat{Γ}}_{B}$ occurs because ${\hat{Γ}}_{B}$ is based on information from both ${\hat{Σ}}_{B}$ and ${\hat{Σ}}_{W}$ , and $\hat{U}$ is based on information from $Σ_{B} (\hat{θ})$ alone.

Issues With T_ML in the Study of Classroom Climate

The relationship between the discrepancy function, the weight matrix, item ICCs, and within-group sample size is rarely made explicit in the methodological literature on multilevel factor analysis. Even when the poor performance of T_ML is noted (Muthén, 1994, p. 389; Hox, 2010; Yuan & Bentler, 2007), the possible role of either item ICC or within-group sample size in the misspecification of F _ML for W_B is not described. In fact, several sources (Goldstein, 2003; Hox, 2010; Hox & Maas, 2004; Ryu & West, 2009) suggested that the segregating method is a “viable method” (Hox & Maas, 2004, p. 145) that can be “implemented within the preexisting ML SEM framework” (Ryu & West, 2009, p. 600).

As a result, there is confusion in the applied literature on the interpretation of T_ML. There are many cases in the applied literature where an inflated value of T_ML is interpreted as suggesting model misspecification, and the theorized between-group model is often then modified by removing items, adding additional factors, or modifying paths (e.g., Mathisen et al., 2006). The possibility that T_ML may also reflect the fact that F_ML is misspecified for W_B is unexplored and untested.

The advice to use T_ML for model fit assessment is particularly problematic when the segregating method is used to assess the factor structure of classroom climate surveys, because the two conditions most likely to cause issues with the performance of T_ML (low item ICCs and relatively small within-group sample sizes) are particularly common in this field. Generally speaking, item ICCs for climate variables are “often less than .1 and rarely greater than .3” (den Brok, Bergen, Stahl, & Brekelmans, 2004; Marsh et al., 2012, p. 115; Toland & De Ayala, 2005). Class sizes typically range between 12 and 25 students per class (e.g., Holfve-Sabel & Gustaffson, 2005; Kunter et al., 2008). Under these conditions, the inflation of T_ML is likely to be severe. Relatedly, Type I error rates are likely to be far higher than the nominal level. It is unlikely that inferences about the between-classroom measurement models based on T_ML would be valid.

Because T_ML is expected to perform poorly in the evaluation of between-level measurement models for classroom climate surveys, it may seem reasonable to recommend the use of alternative test statistics, such as the residual-based and rescaled test statistics, since the theory outlined previously suggests these statistics should perform well asymptotically. In fact, Yuan and Bentler (2007) recommended the use of T_RML and T_CRADF for model evaluation in conjunction with the segregating method. However, there is only limited simulation work with the residual-based and rescaled test statistics in a multilevel context, and there are many known issues with statistics like T_RADF , T_CRADF , and T_RML in conventional factor analysis, particularly with small sample sizes and large models, which may be expected to present problems in multilevel investigations (e.g., Bentler & Yuan, 1999; Curran, West, & Finch, 1996; Hu, Bentler, & Kano, 1992; Muthén & Kaplan, 1985, 1992; Powell & Schaefer, 2001; Yuan & Bentler, 1998). In conventional factor analysis, when models are large and sample sizes are small, T_RADF and T_RML tend to overreject correct models, and T_CRADF tends to underreject correct models (Yuan & Bentler, 1999).

In fact, as it turns out, these specific conditions (small sample sizes and large models) are also likely to occur with student surveys of classroom climate. In the literature on student surveys of classroom climate, the number of classrooms (i.e., the group-level sample size) is typically between 50 and 500 (e.g., Fauth, Decristan, Riser, Klieme, & Buttner, 2014; Holfve-Sabel & Gustaffson, 2005; Kunter et al., 2008; Toland & De Ayala, 2005). Measurement models range from 25 degrees of freedom to well over 150 degrees of freedom (e.g., den Brok et al., 2004; Fauth et al., 2014; Holfve-Sabel & Gustaffson, 2005; Kunter et al., 2008; Toland & De Ayala, 2005).

It is not clear whether, under these conditions, the residual-based test statistics or the rescaled test statistics would continue to perform well. It is also unclear whether Yuan and Bentler’s (2007) recommendations to use T_RML and T_CRADF , which were based on a simulation study using high item ICCs, large within-group sample sizes, relatively small measurement models, and a large number of groups, would be supported under a wider range of conditions, particularly those typically found in survey-based research on classroom climate.

This study uses an illustrative example and a simulation in order to (a) illustrate the extent to which the ML test statistic will be inflated, (b) demonstrate how item ICC and within-group sample size influence the distribution of T_ML, and (c) investigate the performance of several alternative test statistics (specifically T_RML, T_RADF, and T_CRADF) under a broader range of conditions, particularly those that are frequently encountered in survey-based research on classroom climate. The empirical example comes from the Tripod Classroom Environment Survey (Ferguson, 2010), which is administered to measure aspects of classroom environment. Using the illustrative example and the simulation study, the following four research questions were addressed:

Research Question 1: To what extent can inferences about the measurement structure of the Tripod Classroom Environment Survey based on T_ML differ from those based on residual-based and rescaled test statistics?

Research Question 2: Is there a loss of estimator efficiency in using the segregating method, as compared to the partially saturated model method (e.g., Hox, 2010; Muthén, 1994; Ryu & West, 2009) described previously?

Research Question 3: How do item ICC and within-group sample size influence the distribution of T_ML ? How do item ICC and within-group sample size influence the differences between the two estimated matrices, $W_{B}^{- 1}$ and ${\hat{Γ}}_{B}$ ?

Research Question 4: How do T_RML , T_RADF , and T_CRADF perform under a broader range of conditions, particularly those that are frequently encountered in survey-based research on classroom climate?

Method

Data Sources

The Tripod Classroom Environment Survey

The Tripod Survey (Ferguson, 2010) is designed to assess seven dimensions of teaching practice, often referred to as the “Seven Cs”: caring, captivating, conferring, clarifying, challenging, controlling, and consolidating. This version of the Tripod Survey was administered in an urban school district in California in 2010. This example uses 5 items from the “challenging” dimension that are rated on 5-point scales (1 = totally untrue and 5 = totally true). The sample used in this analysis contained 5,508 students in 285 classrooms. The average classroom size was approximately 17 students. Students are treated as nested within classrooms, and it is assumed that each student has rated only one classroom. Descriptive information about the survey items is summarized in Table 1.

Table 1

Descriptive Statistics for Tripod Survey Variables

Item	Wording	Mean	SD	ICC
1	My teacher asks questions to be sure we are following along when he or she is teaching	4.11	0.40	.07
2	My teacher asks students to explain more about answers they give	3.93	0.39	.10
3	In this class, my teacher accepts nothing less than our full effort	3.93	0.46	.13
4	My teacher doesn’t let people give up when the work gets hard	4.07	0.38	.10
5	My teacher wants us to use our thinking skills, not just memorize things	3.95	0.42	.12

Note. SD = standard deviation; ICC = intraclass correlation.

Simulated data sets

Data were generated from multivariate normal distributions and a population model with two within-level factors and one between-level factor. This population model was selected because several sources suggest that the between-level factor structure is likely to be simpler than the structure at the within level (e.g., Holfve-Sabel & Gustaffson, 2005; Muthén & Asparouhov, 2011). Simulation conditions were selected in order to reflect the conditions commonly reported in survey-based research on classroom climate. The following four design factors were manipulated: (1) item ICCs (ICC = .50, ICC = .20, ICC = .10, and ICC = .05), (2) Level 2 sample size (J = 100, J = 200, and J = 500), (3) group size (n = 10, n = 30, and n = 50), and (4) the size of the measurement model (df = 9, df = 54, and df = 135).

For the ICC = .5 condition with six observed variables, the generating model used the following parameters:

$$
\Lambda_{W} = \begin{pmatrix} .7 & 0 \\ .7 & 0 \\ .7 & 0 \\ 0 & .7 \\ 0 & .7 \\ 0 & .7 \end{pmatrix}, \quad
\Lambda_{B} = \begin{pmatrix} .7 \\ .7 \\ .7 \\ .7 \\ .7 \\ .7 \end{pmatrix}, \quad
\Phi_{W} = \begin{pmatrix} 1 & .3 \\ .3 & 1 \end{pmatrix},
$$

$$
\Psi_{B} = \Psi_{W} = \mathrm{diag}(.51).
$$

Ψ_B and Ψ_W were 6 × 6 diagonal matrices with all diagonal elements equal to .51. For the other ICC conditions used in this simulation, Λ_B, Φ_W, and Ψ_B were held fixed, and Λ_W and Ψ_W were varied. For the ICC = .2 condition, the nonzero elements of Λ_W were set to 1.41, and the diagonal elements of Ψ_W were set to 2.00. For the ICC = .10 condition, the nonzero elements of Λ_W were set to 2.10, and the diagonal elements of Ψ_W were set to 4.59. For the ICC = .05 condition, the nonzero elements of Λ_W were set to 3.08, and the diagonal elements of Ψ_W were set to 9.50. Model size was varied by adding additional items but keeping the general pattern of factor loadings the same as given in Equation 23.

In total, this simulation contained 4 × 3 × 3 × 3 = 108 conditions. While certain constellations of conditions may be unlikely to occur in practice (e.g., many large classrooms, high ICCs, and a small model), the inclusion of conditions across this range allows for a more comprehensive study of the behavior of the test statistics. Five hundred replications were conducted in each condition for a total of 54,000 replications. Simulations were conducted using the Monte Carlo capabilities of Mplus (Muthén & Muthén, 2010). For each of the replicated data sets, the MplusAutomation package (Hallquist, 2012) in R (R Core Team, 2012) was used to obtain saturated estimates of Σ_B and Σ_W. T_ML, T_RADF, T_CRADF, and T_RML were estimated in EQS (Bentler, 2006) using the REQS (Mair & Wu, 2012) package.
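As a check on how the generating parameters map onto item ICCs, the following Python sketch computes the ICC implied for a single item by the loadings and unique variances given above. It is a minimal illustration under two assumptions: the between-level factor variance is taken to be 1 (it is not stated in the text), and the within-level factor variances are 1, as on the diagonal of Φ_W. The function name is hypothetical. Under these assumptions the four parameter sets give ICCs of approximately .50, .20, .10, and .05.

```python
def implied_icc(lam_b, psi_b, lam_w, psi_w, phi_b=1.0, phi_w=1.0):
    """ICC of one item that loads on a single factor at each level.

    phi_b = 1.0 is an assumption (the between-level factor variance is
    not stated in the text); phi_w = 1.0 matches the diagonal of Phi_W.
    """
    var_between = lam_b ** 2 * phi_b + psi_b
    var_within = lam_w ** 2 * phi_w + psi_w
    return var_between / (var_between + var_within)

# (within-level loading, within-level unique variance) for each condition;
# the between-level loading (.7) and unique variance (.51) are held fixed.
within_params = [(0.70, 0.51), (1.41, 2.00), (2.10, 4.59), (3.08, 9.50)]
iccs = [implied_icc(0.7, 0.51, lam_w, psi_w) for lam_w, psi_w in within_params]

for (lam_w, psi_w), icc in zip(within_params, iccs):
    print(f"lambda_W = {lam_w:.2f}, psi_W = {psi_w:.2f} -> ICC = {icc:.3f}")
```

Because each item loads on only one within-level factor, the factor correlation of .3 does not enter the item variances and therefore does not affect the item ICCs.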

Analytic Approach

To address the first research question, the Tripod Survey data were used. ML estimates of Σ_B and Σ_W were obtained. ${\hat{Σ}}_{B}$ was then used as the input covariance matrix for a confirmatory factor analysis, where the hypothesized model was unidimensional (i.e., all 5 items loaded onto one factor). Then, the four test statistics T_ML, T_RADF, T_CRADF, and T_RML were estimated. To address the second research question, the simulated data were used. The parameters were estimated using both the segregating method and the partially saturated model method for each simulated data set. For each approach, estimated factor loadings for the between-group model were stored, and the ratio of the mean square errors of the parameter estimates was calculated:

$$
e(\hat{\theta}_{\mathrm{SEG}}, \hat{\theta}_{\mathrm{SAT}}) = \frac{E[(\hat{\theta}_{\mathrm{SEG}} - \theta)^{2}]}{E[(\hat{\theta}_{\mathrm{SAT}} - \theta)^{2}]}.
$$

If $e(\hat{\theta}_{\mathrm{SEG}}, \hat{\theta}_{\mathrm{SAT}})$ is less than 1, the segregating method would be preferable to the partially saturated model method, since this would indicate that the mean square error of parameter estimation under the segregating method would be lower than the mean square error of the partially saturated model method (e.g., Hoel, Port, & Stone, 1971).
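The ratio of mean square errors can be estimated from replicated parameter estimates by replacing each expectation with an average over replications. The Python sketch below shows the arithmetic; the function names and the replicate values are hypothetical and are not results from this study.

```python
def mean_squared_error(estimates, theta):
    """Average squared deviation of the estimates from the true value theta."""
    return sum((est - theta) ** 2 for est in estimates) / len(estimates)

def relative_efficiency(est_seg, est_sat, theta):
    """MSE under the segregating method divided by MSE under the
    partially saturated model method; values below 1 favor segregation."""
    return mean_squared_error(est_seg, theta) / mean_squared_error(est_sat, theta)

# Hypothetical replicate estimates of a between-level loading
# whose population value is .7 (made-up numbers for illustration).
theta = 0.7
est_seg = [0.68, 0.73, 0.71, 0.66, 0.72]
est_sat = [0.65, 0.76, 0.70, 0.63, 0.74]

ratio = relative_efficiency(est_seg, est_sat, theta)
print(f"relative efficiency = {ratio:.3f}")
```

In this made-up example the ratio is below 1, which would be read as the segregating method having the smaller mean square error.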

Simulated data were also used to answer the third research question. For each simulation condition, the mean and standard deviation of T_ML were estimated, and an empirical Type I error rate was calculated. For the purpose of this study, the Type I error rate was calculated at the nominal α = .05 level. Because it is expected that the empirical error rates will differ somewhat from the nominal rate, an acceptable empirical error rate is taken as one that falls in the interval [.028, .079], the estimated two-sided 99% adjusted Wald confidence interval (e.g., Agresti & Coull, 1998). In addition to T_ML, $W_{B}^{-1}$ and ${\hat{Γ}}_{B}$ were compared through their squared distances from each other: $\| W_{B}^{-1} - {\hat{Γ}}_{B} \|^{2}$. Based on Equation 17, it is anticipated that the distance between these two matrices should increase as ICCs and within-group sample sizes decrease.
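The empirical Type I error rate described above is the proportion of replications in which a correctly specified model is rejected. The Python sketch below illustrates the calculation for df = 9, using the upper 5% point of the chi-square distribution with 9 degrees of freedom (16.919) and the acceptance interval stated in the text. The function names and the replicated statistics are hypothetical.

```python
CRITICAL_VALUE_DF9 = 16.919  # upper 5% point of a chi-square with 9 df

def empirical_rejection_rate(statistics, critical_value):
    """Proportion of replications whose statistic exceeds the critical value."""
    rejections = sum(1 for t in statistics if t > critical_value)
    return rejections / len(statistics)

def is_acceptable(rate, lower=0.028, upper=0.079):
    """True when the rate falls inside the interval used in the text."""
    return lower <= rate <= upper

# Hypothetical test statistics from ten replications of a correct model.
replicated_stats = [7.2, 11.5, 9.8, 17.4, 5.1, 8.9, 12.3, 6.6, 10.0, 14.7]

rate = empirical_rejection_rate(replicated_stats, CRITICAL_VALUE_DF9)
print(rate, is_acceptable(rate))
```

With only ten made-up replications the rate is coarse; with 500 replications per condition, as in the simulation, the same function yields rates on a grid of .002.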

In order to address the fourth research question, investigating the performance of T_RML, T_RADF, and T_CRADF under a range of conditions similar to those encountered in survey-based research on classroom climate, means and standard deviations of these three test statistics were estimated for each simulation condition, and an empirical model rejection rate was calculated. As in the case of T_ML, the rejection rate was calculated at the nominal α = .05 level and acceptable rates were those in the interval [.028, .079]. It should be noted that for several conditions (when J = 100 and df = 135), the residual-based test statistics are not estimable because ${\dot{σ}}_{c} {(\hat{θ})}^{T} {\hat{Γ}}_{B} {\dot{σ}}_{c} (\hat{θ})$ is not invertible under these conditions, and so those statistics are not included in those specific analyses.

Results

To What Extent Can Inferences About the Measurement Structure of the Tripod Classroom Environment Survey Based on T_ML Differ From Those Based on Residual-Based and Rescaled Test Statistics?

The estimate of T_ML is 136.9. Referred to a $χ_{5}^{2}$ distribution, this value provides strong evidence against the null hypothesis that the proposed model holds in the population (p < .0001). If T_ML were used as the basis for model evaluation, it would be concluded that these 5 items are not unidimensional.

However, based on the theoretical results presented previously, there is reason to suspect that the T_ML test statistic should not be trusted in this particular case. First, the item ICCs are fairly low, ranging from .07 to .13 (Table 1). Second, the average number of individuals in each classroom is fairly small. Even if all of the distributional assumptions were satisfied, with ICCs that are in this range, the correct specification of F_ML for W_B would require much larger classroom sizes in order for T_ML to have the correct distribution. Thus, it may be more appropriate to make model inferences based on rescaled or residual-based test statistics. Here, T_RADF (4.54, p = .454), T_CRADF (4.47, p = .484), and T_RML (5.50, p = .358) all fail to reject the null hypothesis. In other words, these three test statistics are all consistent with the items being unidimensional, an inference that contradicts the one based on T_ML.
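The contrast between the statistics can be reproduced by referring each value to a chi-square distribution with 5 degrees of freedom. The Python sketch below does this for T_ML and, as examples of the robust statistics, T_CRADF and T_RML. It uses only the standard library and a closed-form expression for the chi-square upper tail at integer degrees of freedom; the helper name chi2_sf is hypothetical and is not part of any package used in the study.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of normal-density terms.
    total = 0.0
    term = math.sqrt(2.0 / math.pi) * math.exp(-half) / math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + total

# Test statistic values reported for the Tripod example (df = 5).
statistics = {"T_ML": 136.9, "T_CRADF": 4.47, "T_RML": 5.50}
p_values = {name: chi2_sf(value, 5) for name, value in statistics.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

The same helper can be used with df = 9, 54, or 135 to convert simulated test statistics into p values.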

It should be noted that while this example provides a clear illustration of how low ICCs and small within-group sample sizes can distort inferences about the between-classroom model based on T_ML, it is also limited in some important ways. First, the data-generating mechanism is unknown. While the inflation of T_ML relative to T_RADF, T_CRADF, and T_RML is related to ICC and within-group sample size, it is possible that other factors, including multivariate kurtosis, play a role in model appraisal. Second, the model is relatively small, containing only five variables and 5 degrees of freedom, and so while the rescaled and residual-based test statistics provide valid inferences in this case, these results may not generalize to larger models. These issues are addressed in the analyses that follow.

Is There a Loss of Estimator Efficiency in Using the Segregating Method, as Compared to the Partially Saturated Model Method?

Table 2 displays the relative efficiency of parameter estimation under both the segregating method and the partially saturated model method across all simulation conditions. Table 2 displays results for only one model size condition (df = 9), but the results are consistent across all model sizes. These results suggest that there is, in general, no loss of efficiency that comes from using the segregating method. Supporting the hypotheses of Yuan and Bentler (2007), there is even a slight gain in efficiency for the segregating method as the ICCs get smaller and the group sizes get smaller. At ICC = .05 with a small number of groups and only 10 individuals in each group, the segregating method is far more efficient than the partially saturated model method. Although not evident from Table 2, it should be noted that while the segregating method is relatively more efficient in the condition with ICC = .05, n = 10, and J = 100, there is considerably more variability in the parameter estimates overall, perhaps reflecting some numerical instability.

Table 2

Efficiency of the Segregating Method Relative to the Partially Saturated Model Method

		Group Size
		10	30	50
A. Relative efficiency (ratio of mean square errors), df = 9
J = 500	ICC = .50	1.00	1.00	1.00
	ICC = .26	1.00	1.00	1.00
	ICC = .05	1.01	1.01	1.01
J = 200	ICC = .50	1.00	1.00	1.00
	ICC = .26	1.00	1.00	1.00
	ICC = .05	0.90	1.02	1.01
J = 100	ICC = .50	1.00	1.00	1.00
	ICC = .26	1.00	1.00	1.00
	ICC = .05	0.48	1.04	1.01

Note. ICC = intraclass correlation. Values <1 indicate that estimation under the segregating method is relatively more efficient than the partially saturated model method.

How Do Item ICC and Within-Group Sample Size Influence the Distribution of T_ML? How Do Item ICC and Within-Group Sample Size Influence the Differences Between $W_{B}^{-1}$ and ${\hat{Γ}}_{B}$?

Table 3 presents the test statistic means, standard deviations, and empirical Type I error rates across a selected subset of simulation conditions. For compactness of presentation, results for ICC = .10 are not displayed but are consistent with the results displayed here. As expected, as either ICC or within-group sample size decreases, the mean of T_ML increases and, relatedly, Type I error rates increase. T_ML is only well behaved with 500 groups, 30 or more individuals per group, and an ICC of .50 (Table 3, Panel A). This condition is most similar to the simulation conditions of Ryu and West (2009) and Hox and Maas (2004) and offers some insight into why those studies concluded that ML methods and the likelihood ratio test statistic were appropriate for use in conjunction with the segregating method.

Table 3

Test Statistic Means, Standard Deviations, and Type I Error Rates for T_ML

	Group Size
	10		30		50
	Mean (SD)	Rejection Rate	Mean (SD)	Rejection Rate	Mean (SD)	Rejection Rate
A. Test statistic performance, df = 9
J = 500
ICC = .50	12.55 (6.15)	.216	10.05 (4.65)	.072	9.64 (4.29)	.054
ICC = .26	18.37 (9.55)	.490	11.68 (5.38)	.142	10.58 (4.75)	.102
ICC = .05	176.35 (143.71)	1.000	35.87 (19.34)	.896	22.6 (11.55)	.644
J = 200
ICC = .50	12.34 (6.1)	.198	10.11 (4.73)	.088	9.77 (4.52)	.074
ICC = .26	18.32 (9.55)	.492	11.71 (5.56)	.186	10.75 (4.99)	.104
ICC = .05	193.74 (120.24)	1.000	39.36 (26.93)	.872	24.25 (14.80)	.690
J = 100
ICC = .50	12.52 (6.22)	.190	10.27 (4.81)	.092	10.2 (5.09)	.102
ICC = .26	19.55 (11.10)	.514	12.01 (5.73)	.180	11.22 (5.62)	.158
ICC = .05	192.25 (96.32)	1.000	47.54 (41.23)	.906	26.96 (19.26)	.690
B. Test statistic performance, df = 54
J = 500
ICC = .50	73.08 (14.27)	.502	58.87 (11.55)	.124	58.33 (11.42)	.110
ICC = .26	107.62 (24.04)	.964	67.79 (13.55)	.356	63.7 (12.80)	.248
ICC = .05	1,054.79 (336.47)	1.000	211.24 (66.09)	1.000	133.36 (33.78)	.994
J = 200
ICC = .50	73.46 (14.97)	.508	61.18 (12.24)	.192	58.67 (11.50)	.110
ICC = .26	113.2 (32.08)	.962	70.76 (14.57)	.432	64.35 (12.70)	.248
ICC = .05	1,233.88 (299.86)	1.000	262.93 (109.39)	1.000	150.01 (50.77)	.994
J = 100
ICC = .50	78.61 (16.36)	.630	64.67 (12.76)	.272	61.09 (11.66)	.172
ICC = .26	136.81 (54.23)	.982	76.33 (16.40)	.564	67.57 (13.38)	.350
ICC = .05	1,144.92 (205.67)	1.000	341.59 (110.41)	1.000	190.39 (78.90)	1.000
C. Test statistic performance, df = 135
J = 500
ICC = .50	177.93 (22.55)	.728	151.27 (18.48)	.248	144.41 (17.36)	.122
ICC = .26	263.08 (48.18)	.998	174.22 (22.50)	.668	157.51 (19.23)	.384
ICC = .05	2,552.47 (512.77)	1.000	574.96 (170.26)	1.000	342 (64.83)	1.000
J = 200
ICC = .50	185.74 (23.9)	.814	154.93 (18.9)	.310	148.12 (17.44)	.180
ICC = .26	300.61 (79.66)	1.000	180.36 (23.26)	.770	162.35 (19.74)	.466
ICC = .05	3,009.19 (437.05)	1.000	660.18 (202.01)	1.000	408.67 (123.65)	1.000
J = 100
ICC = .50	204.02 (30.38)	.926	163.11 (19.66)	.470	156.3 (19.01)	.340
ICC = .26	378.52 (106.55)	1.000	195.5 (29.51)	.910	173.29 (22.64)	.668
ICC = .05	2,485.48 (248.38)	1.000	911.03 (175.26)	1.000	520.02 (123.77)	1.000

Note. SD = standard deviation. Empirical Type I error rates in the interval [.028, .079] are shown in boldface.

T_ML inflation can be quite severe. When ICCs are low and the within-group sample sizes are small, the correct model is rejected 100% of the time, and the test statistic mean is about 20 times larger than expected, for all model sizes. This pattern of inflation suggests that T_ML will not provide valid inferences about between-classroom measurement models in survey-based research on classroom climate. The results presented in Table 3 also suggest little evidence that T_ML would ever converge to the correct distribution, regardless of the number of groups that are included in the sample. For example, for ICC = .50, with within-group sample sizes of 10, there is little evidence of convergence as the number of groups increases from 100 to 500.
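The size of the inflation can be read directly from Table 3 by dividing the mean of T_ML by the model degrees of freedom, since a chi-square variate has an expected value equal to its degrees of freedom. The short Python sketch below does this for the ICC = .05, n = 10, J = 500 cells; the variable names are illustrative.

```python
# Mean T_ML for ICC = .05, n = 10, J = 500, keyed by model df (Table 3).
mean_t_ml = {9: 176.35, 54: 1054.79, 135: 2552.47}

# Under a correct model, the expected value of a chi-square statistic is its df,
# so mean / df measures how many times too large the statistic is on average.
inflation = {df: mean / df for df, mean in mean_t_ml.items()}

for df, ratio in inflation.items():
    print(f"df = {df}: mean T_ML is {ratio:.1f} times its expected value")
```

The ratios are similar across the three model sizes, which is consistent with the statement that the inflation holds for all model sizes.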

Table 4 presents the distances between $W_{B}^{-1}$ and ${\hat{Γ}}_{B}$. As anticipated by theory, the squared distance between these matrices increases as either ICC decreases or within-group sample size decreases. At ICC = .50, with group sizes of 50, the distance between the covariance matrices is relatively small, implying a small amount of misspecification of the discrepancy function for W_B. Relatedly, T_ML is relatively well behaved. For ICC = .05, however, the distances between the covariance matrices are quite large, the misspecification of F_ML for W_B is more severe, and T_ML is more inflated.
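For readers who wish to compute a comparable quantity, the Python sketch below calculates a squared distance between two matrices as the sum of squared element-wise differences (the squared Frobenius norm of their difference). Treating the distance as a Frobenius norm is an assumption here, since the norm is not named explicitly in this section; the function name and the 2 × 2 matrices are hypothetical stand-ins.

```python
def squared_distance(a, b):
    """Sum of squared element-wise differences between two equally sized matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical 2 x 2 stand-ins for the inverse of W_B and the estimate of Gamma_B.
w_b_inv = [[2.0, 0.5], [0.5, 1.0]]
gamma_b_hat = [[1.0, 0.0], [0.0, 1.0]]

distance = squared_distance(w_b_inv, gamma_b_hat)
print(distance)
```

Because the number of matrix elements grows with the number of items, squared distances of this kind are not directly comparable across model sizes, only within a panel.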

Table 4

Squared Distances Between Asymptotic Covariance Matrices

		Group Size
		10	30	50
A. Squared distance between asymptotic covariance matrices, df = 9
J = 500	ICC = .50	13.37	6.93	5.89
	ICC = .26	79.20	13.59	8.45
	ICC = .05	7,371.88	388.32	121.33
J = 200	ICC = .50	24.97	16.36	14.40
	ICC = .26	98.54	25.16	17.67
	ICC = .05	7,864.97	440.61	141.81
J = 100	ICC = .50	43.98	27.98	28.42
	ICC = .26	130.40	13.59	33.41
	ICC = .05	8,502.18	486.50	178.86
B. Squared distance between asymptotic covariance matrices, df = 54
J = 500	ICC = .50	132.70	77.79	72.13
	ICC = .26	437.04	112.50	86.69
	ICC = .05	58,907.71	3,317.32	1,097.86
J = 200	ICC = .50	261.67	190.56	176.47
	ICC = .26	959.62	282.88	212.88
	ICC = .05	63,017.13	3,871.44	1,301.13
J = 100	ICC = .50	458.82	364.11	324.45
	ICC = .26	1,293.77	507.18	373.91
	ICC = .05	69,335.72	4,908.45	1,669.69
C. Squared distance between asymptotic covariance matrices, df = 135
J = 500	ICC = .50	572.75	335.90	343.99
	ICC = .26	3,099.04	617.33	468.70
	ICC = .05	234,646.43	13,045.81	4,553.31
J = 200	ICC = .50	1,158.64	759.65	756.52
	ICC = .26	4,092	1,101.68	922.85
	ICC = .05	248,861.52	15,157.23	5,553.22
J = 100	ICC = .50	2,036.26	1,361.54	1,508.60
	ICC = .26	5,746.05	1,829.12	1,772.68
	ICC = .05	282,624.76	18,259.65	7,540.48

Note. ICC = intraclass correlation.

How Do T_RML, T_RADF, and T_CRADF Perform Under a Broader Range of Conditions, Particularly Those That Are Frequently Encountered in Survey-Based Research on Classroom Climate?

Performance of T_RADF

Consistent with theoretical expectation, there is evidence that T_RADF converges to the correct distribution as the number of groups increases regardless of ICC or within-group sample size. This pattern of convergence is most apparent in Table 5 (Panel A), as the number of groups increases from 100 to 500. With 100 groups, T_RADF overrejects the correct model for nearly all ICC and within-group sample size conditions. Contrary to this pattern, T_RADF has a mean that is too low at ICC = .05 and n = 10, which may reflect some of the instability of the estimates at low ICCs and small sample sizes. With 500 groups, T_RADF is much better behaved. However, when the model is sufficiently large, the number of groups would have to be enormous in order for T_RADF to provide correct inferences. In Table 5 (Panel C), when the model is large and the number of groups is small, T_RADF rejects the correct model 100% of the time. Even with 500 groups, the empirical Type I error rates approach 90%. This is consistent with results from both conventional and multilevel factor analysis, where it has been shown that T_RADF and other similar statistics converge slowly to the appropriate distribution (e.g., Curran et al., 1996; Hu et al., 1992; Muthén & Kaplan, 1985, 1992; Powell & Schaefer, 2001; Yuan & Bentler, 1998, 2003, 2007).

Table 5

Test Statistic Means, Standard Deviations, and Type I Error Rates for T_RML, T_RADF, and T_CRADF

		Group Size
		10		30		50
		Mean (SD)	Rejection Rate	Mean (SD)	Rejection Rate	Mean (SD)	Rejection Rate
A. Test statistic performance, df = 9
J = 500
T_RML	ICC = .50	9.59 (4.69)	.078	9.19 (4.23)	.062	9.16 (4.06)	.046
	ICC = .26	9.56 (4.97)	.082	9.26 (4.23)	.054	9.21 (4.12)	.048
	ICC = .05	10.31 (7.33)	.128	9.6 (5.02)	.098	9.43 (4.80)	.084
T_RADF	ICC = .50	9.79 (4.79)	.072	9.41 (4.41)	.060	9.42 (4.28)	.056
	ICC = .26	9.69 (4.84)	.070	9.49 (4.45)	.070	9.44 (4.33)	.058
	ICC = .05	9.02 (4.27)	.046	9.6 (4.63)	.086	9.53 (4.70)	.094
T_CRADF	ICC = .50	9.56 (4.57)	.064	9.2 (4.21)	.050	9.21 (4.10)	.040
	ICC = .26	9.46 (4.61)	.062	9.28 (4.24)	.060	9.23 (4.14)	.052
	ICC = .05	8.82 (4.10)	.034	9.38 (4.41)	.074	9.31 (4.48)	.078
J = 200
T_RML	ICC = .50	9.42 (4.61)	.074	9.25 (4.30)	.064	9.30 (4.32)	.062
	ICC = .26	9.43 (4.84)	.058	9.28 (4.38)	.052	9.35 (4.35)	.056
	ICC = .05	9.91 (6.40)	.130	10.02 (6.26)	.102	9.88 (5.75)	.088
T_RADF	ICC = .50	9.96 (5.04)	.096	9.81 (4.73)	.088	9.87 (4.78)	.078
	ICC = .26	9.76 (4.88)	.076	9.77 (4.66)	.072	9.90 (4.83)	.086
	ICC = .05	8.21 (3.69)	.026	9.70 (4.56)	.072	9.97 (4.86)	.076
T_CRADF	ICC = .50	9.38 (4.45)	.064	9.26 (4.22)	.05	9.30 (4.25)	.054
	ICC = .26	9.20 (4.32)	.052	9.22 (4.16)	.044	9.33 (4.29)	.054
	ICC = .05	8.83 (3.97)	.038	9.18 (4.12)	.05	9.36 (4.28)	.052
J = 100
T_RML	ICC = .50	9.59 (4.76)	.072	9.42 (4.39)	.050	9.72 (4.85)	.086
	ICC = .26	9.91 (5.49)	.098	9.49 (4.49)	.064	9.75 (4.90)	.074
	ICC = .05	8.99 (5.75)	.096	11.33 (9.48)	.124	10.63 (7.25)	.128
T_RADF	ICC = .50	10.86 (5.60)	.122	10.56 (5.29)	.118	10.89 (5.75)	.152
	ICC = .26	10.7 (5.51)	.134	10.55 (5.25)	.122	10.89 (5.85)	.150
	ICC = .05	7.45 (3.51)	.020	10.26 (4.81)	.086	10.88 (5.81)	.150
T_CRADF	ICC = .50	9.55 (4.32)	.064	9.33 (4.16)	.054	9.57 (4.45)	.070
	ICC = .26	9.43 (4.28)	.048	9.33 (4.13)	.048	9.56 (4.50)	.076
	ICC = .05	6.82 (2.94)	.006	9.12 (3.84)	.034	9.55 (4.46)	.078
B. Test statistic performance, df = 54
J = 500
T_RML	ICC = .50	56.31 (11.02)	.084	53.96 (10.59)	.052	55.38 (10.84)	.074
	ICC = .26	57.11 (12.64)	.122	54.09 (10.82)	.062	55.52 (11.16)	.068
	ICC = .05	68.80 (21.62)	.380	58.39 (17.60)	.164	56.86 (14.24)	.124
T_RADF	ICC = .50	64.28 (13.30)	.282	61.62 (12.72)	.192	63.48 (12.97)	.242
	ICC = .26	64.68 (13.34)	.278	61.74 (12.88)	.200	63.47 (13.31)	.248
	ICC = .05	60.24 (11.88)	.168	61.66 (12.95)	.192	63.22 (13.91)	.260
T_CRADF	ICC = .50	56.68 (10.37)	.072	54.61 (9.98)	.060	56.07 (10.12)	.062
	ICC = .26	57.00 (10.38)	.074	54.69 (10.10)	.052	56.05 (10.38)	.054
	ICC = .05	53.54 (9.42)	.032	54.63 (10.16)	.054	55.83 (10.83)	.064
J = 200
T_RML	ICC = .50	56.47 (11.50)	.082	56.11 (11.11)	.088	55.79 (10.88)	.084
	ICC = .26	59.55 (16.50)	.172	56.48 (11.55)	.102	56.19 (11.05)	.092
	ICC = .05	76.74 (21.21)	.540	70.37 (28.15)	.354	62.96 (20.48)	.224
T_RADF	ICC = .50	81.20 (19.29)	.648	81.26 (19.61)	.662	81.2 (18.75)	.668
	ICC = .26	81.76 (20.10)	.660	81.02 (19.26)	.656	81.55 (18.98)	.660
	ICC = .05	65.97 (13.13)	.312	80.43 (19.01)	.658	81.86 (19.36)	.668
T_CRADF	ICC = .50	56.93 (9.56)	.054	56.94 (9.77)	.054	56.97 (9.18)	.052
	ICC = .26	57.16 (9.85)	.068	56.84 (9.61)	.058	57.14 (9.28)	.056
	ICC = .05	49.12 (7.32)	.004	56.56 (9.44)	.048	57.27 (9.47)	.064
J = 100
T_RML	ICC = .50	60.24 (12.45)	.170	59.41 (11.58)	.152	58.42 (11.22)	.108
	ICC = .26	70.61 (27.19)	.370	60.77 (12.93)	.194	59.17 (11.77)	.136
	ICC = .05	73.09 (20.97)	.464	89.20 (29.85)	.658	77.78 (31.76)	.450
T_RADF	ICC = .50	143.50 (45.63)	.988	144.83 (40.46)	.990	142.43 (42.71)	.986
	ICC = .26	144.59 (46.84)	.986	144.95 (40.85)	.986	142.84 (42.32)	.984
	ICC = .05	86.07 (23.38)	.682	137.14 (40.21)	.980	143.45 (41.21)	.982
T_CRADF	ICC = .50	56.91 (7.23)	.008	57.41 (6.45)	.006	56.84 (7.05)	.006
	ICC = .26	57.04 (7.33)	.012	57.40 (6.55)	.008	56.94 (6.96)	.006
	ICC = .05	45.03 (6.33)	.000	56.04 (6.75)	.004	57.12 (6.66)	.006
C. Test statistic performance, df = 135
J = 500
T_RML	ICC = .50	137.54 (17.49)	.090	138.66 (16.87)	.080	137.05 (16.46)	.062
	ICC = .26	141.22 (25.47)	.154	139.46 (17.94)	.100	137.56 (16.78)	.064
	ICC = .05	179.15 (35.42)	.670	160.21 (45.33)	.350	146.69 (26.88)	.220
T_RADF	ICC = .50	195.17 (29.99)	.866	198.46 (28.92)	.904	197.07 (29.26)	.886
	ICC = .26	194.25 (29.28)	.860	198.83 (29.28)	.898	197.13 (28.95)	.886
	ICC = .05	180.78 (25.25)	.758	197.38 (29.59)	.898	197.89 (29.44)	.892
T_CRADF	ICC = .50	139.55 (15.39)	.056	141.30 (14.68)	.078	140.57 (15.00)	.064
	ICC = .26	139.10 (15.14)	.064	141.48 (14.87)	.078	140.61 (14.87)	.060
	ICC = .05	132.13 (13.57)	.008	140.72 (14.97)	.078	140.99 (14.99)	.070
J = 200
T_RML	ICC = .50	143.36 (18.40)	.158	142.22 (17.30)	.124	140.87 (16.61)	.100
	ICC = .26	159.57 (41.28)	.344	144.19 (18.54)	.140	141.91 (17.30)	.112
	ICC = .05	220.39 (38.75)	.946	180.52 (56.55)	.590	172.92 (51.39)	.460
T_RADF	ICC = .50	483.47 (113.75)	1.000	491.95 (111.34)	1.000	485.81 (98.1)	1.000
	ICC = .26	487.13 (110.77)	1.000	491.28 (108.80)	1.000	487.08 (101.72)	1.000
	ICC = .05	353.32 (69.62)	1.000	436.21 (127.34)	.998	491.01 (119.39)	1.000
T_CRADF	ICC = .50	138.98 (9.26)	.004	139.80 (8.87)	.002	139.51 (8.27)	.000
	ICC = .26	139.35 (9.05)	.000	139.79 (8.74)	.002	139.53 (8.59)	.000
	ICC = .05	125.79 (8.89)	.000	133.48 (14.03)	.000	139.51 (9.53)	.004
J = 100
T_RML	ICC = .50	157.21 (23.39)	.372	150.08 (18.19)	.232	149.30 (18.01)	.208
	ICC = .26	200.01 (57.33)	.656	155.98 (23.37)	.330	151.80 (19.63)	.258
	ICC = .05	209.04 (39.39)	.886	250.46 (49.47)	.976	218.84 (53.8)	.844

Note. SD = standard deviation; ICC = intraclass correlation. Empirical Type I error rates in the interval [.028, .079] shown in boldface.

Performance of T_CRADF

T_CRADF shows well-behaved means, standard deviations, and empirical Type I error rates across a wide variety of simulation conditions. Table 5 (Panel A), which displays results for the small models (df = 9), shows that T_CRADF performs well when the number of groups is sufficiently large, relative to the size of the model. This pattern continues in Table 5 (Panel B), with the medium-sized models (df = 54), provided that the number of groups is sufficiently large (J = 200 or J = 500). However, when the number of groups is small relative to the size of the model (e.g., in the condition with J = 100 and df = 54), the multilevel version of T_CRADF performs similarly to the conventional version (Bentler & Yuan, 1999; Yuan & Bentler, 1998). That is, the statistic accepts more correct models than would be expected by chance.

Performance of T_RML

Consistent with theory, in all ICC and group size conditions, the scaling constant for T_RML, k, is larger than 1. The amount of rescaling changes as a function of within-group sample size and item ICC. At ICC = .50, there is virtually no rescaling at all. At ICC = .05, the T_ML value is rescaled by almost 90%. There is also some evidence that the mean of T_RML converges appropriately as the number of groups increases. For small models, relatively large sample sizes, and high ICCs, T_RML behaves well. These conditions are most similar to the conditions that led Yuan and Bentler (2007) to recommend T_RML for model testing. However, when the full range of simulation conditions is considered, it becomes clear that T_RML cannot adequately control Type I errors when group sizes are small or when ICCs are low. In the conditions where ${\dot{σ}}_{c} {(\hat{θ})}^{T} {\hat{Γ}}_{B} {\dot{σ}}_{c} (\hat{θ})$ is not invertible and neither T_RADF nor T_CRADF is estimable (J = 100 and df = 135), T_RML does not control Type I errors at any ICC or group size. The current study suggests that, contrary to the recommendation of Yuan and Bentler (2007), even though T_RML always performs better than T_ML, it cannot be relied on to make inferences about model fit in conjunction with the segregating method.

Summary and Conclusion

As surveys of the classroom environment have gained traction as components of teacher evaluation portfolios, there has been an increased amount of attention paid to using multilevel factor analyses to explore hypotheses about the measurement structure of between-classroom phenomena. The segregating method has many theoretical benefits. It allows for the separate testing and identification of measurement models at the between level and within level. This is a key advantage over approaches that simultaneously fit models to Σ_B and Σ_W, since many studies have found that the simultaneous testing of between and within models can make diagnosing sources of model misfit difficult (e.g., Hox, 2010; Ryu & West, 2009; Yuan & Bentler, 2007).

The current study, however, clarifies an important characteristic of the segregating method for applied research. Namely, at ICC and sample size configurations likely to be encountered when data about the classroom environment are collected by surveying students, the commonly used ML test statistic obtained from the segregating method is likely not asymptotically distributed as central χ² variates under the null hypothesis. This suggests that the reliance on the ML test statistic can result in unwarranted model modifications or revisions.

The current study used an illustrative example and a simulation study to investigate the performance of test statistics under conditions likely to be encountered in using the segregating method in applied research on classroom climate. The results reflect some general patterns that are worth noting here. As with any simulation study, caution should be used in generalizing these results to other conditions not included in the study. More work would be needed to investigate how other conditions, such as differences in factor loadings, alternative (nonnormal) distributions, and unbalanced group sizes, would influence the performance of the test statistics.

Estimation Under the Segregating Method Can Be as Efficient as the Partially Saturated Model Method

It is often assumed that because the segregating method requires two separate steps for implementation, there will be a loss of efficiency as compared to single-step approaches such as the partially saturated model method. The results of this study suggest, however, that the segregating method can be as efficient as the partially saturated model method and that under certain conditions (small sample sizes, low ICCs, and large models) the segregating method can be considerably more efficient. Goldstein (2003) suggested that the segregating method may lose efficiency if the group sizes are highly unbalanced, a condition which is beyond the scope of the current investigation. Further research would be needed to investigate the joint influence of group size imbalance, ICC, and model size on efficiency.

Inferences About Model Fit Based on T_ML Can Lead to Invalid Conclusions About the Between-Classroom Factor Structure

At ICC and sample size configurations likely to be encountered when data about classroom climate are collected by surveying students, T_ML is not asymptotically distributed as a central χ² variate under the null hypothesis. The Tripod Survey example demonstrated that inferences based on T_ML can lead to invalid conclusions about the between-classroom factor structure when ICCs are low and within-classroom sample sizes are small. The simulation study shows the extent to which T_ML can be inflated. The simulation results show that in some conditions, the mean of the test statistic was nearly 20 times too large, and every model was rejected. Thus, T_ML behaved very poorly in general and should not be used to make inferences about the between-level measurement models. While beyond the scope of the current study, these results suggest that, beyond issues with assessing model fit, ML estimation would result in biased standard errors. This is because, typically speaking, estimated standard errors are computed using $\dot{σ} {(\hat{θ})}^{T} W_{ML} \dot{σ} (\hat{θ})$ (e.g., Bentler, 2006). Based on the results presented elsewhere (Hox, 2010; Yuan & Bentler, 2007), it is anticipated that the parameter estimates themselves will be unbiased. Further research is needed to address these issues.

T_CRADF Can Provide Valid Inferences, Provided the Number of Groups Is Sufficiently Large

While both the rescaled and residual-based test statistics show evidence supporting the hypothesis that they would converge to the appropriate distribution regardless of item ICC and within-group sample size, only T_CRADF showed adequate performance over a wide range of conditions. T_RADF showed a tendency to overreject correct models for all but the largest samples, consistent with findings in conventional factor analysis. T_RML, too, overrejected correct models, particularly for small between-level sample sizes. Only T_CRADF is recommended for use in conjunction with the segregating methodology. However, caution should be used with small samples, where T_CRADF shows a tendency to underreject correct models.

Footnotes

Acknowledgments

The author is grateful to Joan Herman, Jia Wang, and Noelle Griffin for their support and to Peter Bentler, Li Cai, and Jose Felipe Martinez for their valuable advice and feedback.

Author’s Note

The findings and opinions expressed in this report are those of the author and do not necessarily reflect the positions or policies of the Bill and Melinda Gates Foundation or the U.S. Department of Education.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research for this article was supported in part by grant number 52306 from the Bill and Melinda Gates Foundation with funding to the National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Part of this research was made possible by a predoctoral advanced quantitative methodology training grant (#R305B080016) awarded to UCLA by the Institute of Education Sciences of the U.S. Department of Education.

