Abstract
Studies that use the same measure to compare groups at different levels of multilevel data (namely, cross-level groups) are not unusual; examples include student and teacher agreement in education and congruence between patient and physician perceptions in health research. Although establishing measurement invariance (MI) between these groups is important, testing MI is methodologically challenging because the groups compared are at different levels, with one group nested within the other. We propose a multilevel confirmatory factor analysis (CFA) model that allows MI testing between cross-level groups at the between level and demonstrate MI testing between students and teachers using the promoting social interaction scale. Along with the demonstration, we discuss methodological issues in implementing the proposed model (e.g., cluster invariance and reliability), issues in evaluating the model fit of multilevel CFA (e.g., ΔCFI and level-specific fit indices), and alternative approaches to the proposed model.
Multilevel data are prevalent in social sciences. Multilevel data consist of multiple units of analysis, for example, students and teachers as Level-1 and Level-2 units of analysis when students are nested within teachers. It is also not unusual that researchers compare these units across levels in multilevel data (Level-1 units vs. Level-2 units such as students vs. teachers; patients vs. caregivers, team members vs. leaders) where one group is at Level 1 and the other is at Level 2 (e.g., 200 students and their 30 teachers). For example, in educational research, congruence or disagreement between teachers’ and students’ perceptions has been studied for decades in many different aspects such as teaching effectiveness (e.g., Beaudrie, 2015), classroom learning environment (Chambers, 2015), and the use of power in the classroom (McCroskey & Richmond, 1982) because understanding the gap between teachers’ and students’ perceptions is important to improve educational and instructional practices and outcomes in the classrooms and schools (Beaudrie, 2015). Similar examples of congruence studies can be found across disciplines (e.g., Krupat et al., 2000; Lukowski, 2004; Watson, 2009).
In studies that use quantitative measures to investigate agreement between two groups at different levels of multilevel data, such as students and teachers, the majority have employed some type of mean comparison (e.g., t tests), correlation, or interrater reliability to test congruence using either items or scale composite scores (e.g., Beaudrie, 2015; Chambers, 2015). In student-teacher agreement studies, for example, student ratings or scores are averaged within each classroom and compared with teacher scores. The mean differences between students and teachers are then tested for statistical significance to assess whether students and teachers agree on the tested measures. One fundamental assumption of these practices is measurement invariance between students and teachers, that is, students and teachers interpret and respond to the items of a measure in the same way. Even though it is widely recognized among applied researchers and practicing scientists that measurement invariance is a prerequisite to a meaningful comparison of scores between groups (Raykov, Marcoulides, & Li, 2012), measurement invariance between students and teachers, or between any two groups at different levels of multilevel data, is often overlooked when agreement (i.e., a mean difference in this example) is assessed.
Before we move on to the issues of measurement invariance in agreement studies, it should be noted that the current study is based on research settings where agreement between groups is of focal interest. In the organizational climate research, there has been a long history of research on agreement within group (Schneider, Ehrhart, & Macey, 2011) which will be discussed later because agreement within group (e.g., agreement among students about their classroom learning environment) is essential to agreement between groups at different levels of multilevel data (e.g., agreement between students and teachers about the classroom learning environment). Another note is that agreement between groups can be measured in different ways. For demonstration purposes, we consider a mean difference between two groups as a measure of agreement.
When researchers are interested in testing measurement invariance (MI) between groups across levels of multilevel data, they may encounter several methodological challenges in implementing the MI test in this circumstance. Because one group (students) is nested within the other group (teachers), multilevel modeling that takes into account the dependency of students in the same classroom is essential. Previous simulation studies (French & Finch, 2010; Kim, Kwok, & Yoon, 2012) showed that failing to model data dependency in testing MI with multilevel data resulted in high Type I error (falsely rejecting invariance when a measure is invariant). In addition, the groups that are compared for MI are at different levels of multilevel data (students at Level 1 and teachers at Level 2). We term this type of group formation as cross-level groups hereafter. Because two groups in comparison are at different levels and the higher level group members are clusters to which the lower level group members belong, multiple group confirmatory factor analysis (CFA), which is a typical approach to MI testing for two groups (e.g., males and females), is not feasible.
The purpose of this article is twofold. First, we advocate testing measurement invariance in studies that investigate agreement between groups, such as student-teacher agreement studies. Second, we address the aforementioned methodological challenges in modeling measurement invariance when two groups are at different levels of multilevel data (i.e., cross-level groups) and introduce a multilevel CFA model with MI tested at Level 2. To this end, we demonstrate the proposed model by testing measurement invariance between student and teacher ratings of promoting social interaction in the classroom. Although this demonstration is based on educational data in classroom contexts, the proposed model may be applied in a wide variety of settings across disciplines where units at different levels of multilevel data are compared for MI.
Multilevel Confirmatory Factor Analysis With Measurement Invariance Tested at Level 2
Measurement Invariance
Measurement invariance is defined as the equal probability of endorsing an item given latent ability regardless of group membership (Millsap, 2011). In other words, an item or a test is noninvariant or biased if individuals with the same latent ability have different expected item or test scores depending on their group membership. In CFA, measurement invariance is commonly tested with a multiple group analysis by comparing measurement models across groups. A measurement model is formulated for group
where Y,
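For reference, the multiple-group measurement model described above is conventionally written as shown below. This is only a sketch in standard CFA notation: the symbols (intercepts $\tau_g$, loadings $\Lambda_g$, factor scores $\eta_{ig}$, residuals $\varepsilon_{ig}$ with covariance $\Theta_g$) are assumed here for illustration and may differ from the notation of Equation (1).

```latex
% Standard multiple-group CFA measurement model (notation assumed)
\begin{equation*}
  Y_{ig} = \tau_g + \Lambda_g \eta_{ig} + \varepsilon_{ig},
  \qquad \varepsilon_{ig} \sim N(0, \Theta_g)
\end{equation*}
% Commonly tested levels of invariance across groups g = 1, 2:
%   configural: same pattern of fixed and free loadings in each group
%   metric:     \Lambda_1 = \Lambda_2
%   scalar:     \Lambda_1 = \Lambda_2 and \tau_1 = \tau_2
```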
Building a Multilevel Model With Student Data
Multilevel Confirmatory Factor Analysis
When data are multilevel (i.e., two levels for exposition such as students nested within teachers), each score
respectively, where W and B denote within (Level 1) and between (Level 2), respectively. Because the within data are basically deviation scores from the mean of each teacher, the intercepts are constrained at zero and not shown in Equation (2). Similar to the decomposition of observed scores, the latent factor scores are decomposed into within and between components (Muthén, 1994):
Errors at the within and between levels are assumed to be normally distributed with mean zero and variance
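The two-level decomposition described in this section can be sketched as follows, in notation commonly used after Muthén (1994). The indices ($i$ for students, $j$ for teachers) and all symbols are assumptions chosen for illustration, not a reproduction of Equations (2) and (3).

```latex
% Two-level CFA sketch (notation assumed); W = within, B = between
\begin{align*}
  Y_{ij}   &= Y_{W,ij} + Y_{B,j} \\
  Y_{W,ij} &= \Lambda_W \eta_{W,ij} + \varepsilon_{W,ij}
              && \text{(within intercepts fixed at zero)} \\
  Y_{B,j}  &= \tau_B + \Lambda_B \eta_{B,j} + \varepsilon_{B,j} \\
  \eta_{ij} &= \eta_{W,ij} + \eta_{B,j} \\
  \varepsilon_{W,ij} &\sim N(0, \Theta_W), \qquad
  \varepsilon_{B,j} \sim N(0, \Theta_B)
\end{align*}
```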
Conceptualization of Latent Factors in Multilevel Confirmatory Factor Analysis
When researchers model latent factors at both within and between levels (Level 1 and Level 2, interchangeably), the meaning of those factors needs to be conceptualized at each level. There are different types of multilevel constructs depending on which level construct is interpretable and of focal interest: for example, within-cluster construct, shared cluster construct, and configural cluster construct (Bliese, 2000; Kim, Dedrick, Cao, & Ferron, 2016; Stapleton, Yang, & Hancock, 2016). The multilevel constructs based on student ratings of classroom contexts (e.g., promoting social interaction) are considered as shared or reflective cluster constructs. For shared cluster constructs, the reference of the items is at Level 2, for example, classrooms or teachers, although students are the respondents (e.g., My teacher often allows us to discuss our work with classmates). That is, Level-2 factors are generally more meaningful as collective perceptions of students on teacher’s practices than Level-1 factors that represent individual deviations from the common perception of teacher practices and classroom environment. For the same reason, the between measurement model of students is of interest and compared with the teachers’ measurement model. For readers interested in different types of multilevel construct conceptualization and model specification, refer to Stapleton et al. (2016) and Bliese (2000). There is also a history of debate about the conceptualization of organizational climate as individual experiences and/or organizational attributes. An overview of the conceptualization of climate and its level of analysis theoretically and statistically is found in Schneider et al. (2011) as well as Yammarino and Dansereau (2011).
Agreement Within Groups
One underlying assumption of shared cluster constructs is agreement among individuals at Level 1 about the construct measured at Level 2. That is, it is assumed that individuals perceive the Level 2 construct (e.g., classroom environment) similarly. Although there are several measures that assess the degree of agreement within groups, we use intraclass correlation (ICC) in this article: ICC1 and ICC2. The first type of ICC is estimated using the formula,
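As a point of reference, the two intraclass correlations are commonly defined as below (e.g., Bliese, 2000). This is a sketch under the usual one-way random effects setup; the symbols $\sigma^2_B$ (between-cluster variance), $\sigma^2_W$ (within-cluster variance), and $k$ (cluster size) are assumed notation.

```latex
% Common definitions of ICC1 and ICC2 (notation assumed)
\begin{align*}
  \mathrm{ICC1} &= \frac{\sigma^2_B}{\sigma^2_B + \sigma^2_W} \\
  \mathrm{ICC2} &= \frac{k \cdot \mathrm{ICC1}}{1 + (k - 1)\,\mathrm{ICC1}}
\end{align*}
```

ICC1 expresses the proportion of item variance that lies between clusters, and ICC2 expresses the reliability of the cluster means, which increases with cluster size.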
Cluster Invariance
When a multilevel CFA model is constructed with student ratings, it is assumed that the student rating scale is invariant across classrooms (so called, cluster invariance) in order that one representative measurement model of students at the classroom level can be compared to the teachers’ measurement model. Cluster invariance is referred to as the invariance of a measure across Level-2 units of analysis in multilevel data (e.g., invariance across classrooms). Cluster invariance can be tested using multilevel CFA, which is formulated as Equations (2) and (3) (Jak, Oort, & Dolan, 2013). In cluster invariance testing, configural invariance means the equal factor structure across levels (i.e., factor structure of within measurement model = factor structure of between measurement model). Then, the equality of factor loadings across levels (
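The cluster invariance constraints used later in the analysis (equal factor loadings across levels and zero Level-2 residual variances) can be written compactly as below. The symbols $\Lambda_W$, $\Lambda_B$, and $\Theta_B$ are assumed notation for the within loadings, between loadings, and between-level residual covariance matrix.

```latex
% Cluster invariance constraints on the student model (notation assumed)
\begin{align*}
  \text{equal loadings across levels:} \quad & \Lambda_W = \Lambda_B \\
  \text{zero between-level residual variances:} \quad & \Theta_B = 0
\end{align*}
```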
Testing Measurement Invariance Between Students and Teachers
Multilevel Confirmatory Factor Analysis With Measurement Invariance Tested at Level 2
Once the multilevel CFA model based on student data is ready with cluster invariance, the teachers’ measurement model is added to the Level-2 part of students’ multilevel model. As illustrated in Figure 1, teachers’ perceptions of their own practices in the classroom is another Level-2 factor on which the variables of teacher ratings load. As pointed out earlier, students and teachers cannot be considered as two independent groups because students are nested within teachers. Instead, they are represented by two separate but correlated factors at Level 2. Then, the between measurement model of students is compared with that of teachers for equivalence. The configural invariance refers to the identical structures of student and teacher factors at Level 2. If the loadings of the student and teacher factors are equal for all items (

Figure 1. Model constructed for testing measurement invariance (MI) between students and teachers. FW = factor at Level 1 for student data; FB = factor at Level 2 for student data; FT = factor at Level 2 for teacher data. MI between students and teachers is tested at Level 2. Cluster invariance is imposed on the student measurement model.
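The invariance constraints tested at Level 2 between the student factor (FB) and the teacher factor (FT) can be summarized as in the sketch below. The symbols are assumed notation: $\Lambda_B$ and $\tau_B$ are the between-level loadings and intercepts of the student items, and $\Lambda_T$ and $\tau_T$ are those of the teacher items.

```latex
% Student-teacher invariance constraints at Level 2 (notation assumed)
\begin{align*}
  \text{configural:} \quad & \text{same factor structure for FB and FT} \\
  \text{metric:}     \quad & \Lambda_B = \Lambda_T \\
  \text{scalar:}     \quad & \Lambda_B = \Lambda_T \ \text{and}\ \tau_B = \tau_T
\end{align*}
```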
In the model specification for student-teacher measurement invariance testing, error covariance is worthy of note. The student and teacher measures use the same items, differing only in the referent (“my teacher” for students; “I” for teachers). Also, a teacher's rating and the average rating of his or her students are paired. In this case, it is common to allow the residual of a student item to covary with that of the corresponding teacher item. However, when cluster invariance holds across classrooms for the student measurement model, the residual variances at the between level are zero for student items, and thus the covariance of errors between student and teacher items is not modeled (i.e., it is fixed at zero).
Alternative Models for Testing Measurement Invariance Between Students and Teachers
Although we promote multilevel CFA with MI tested at Level 2, researchers may consider single-level alternatives to multilevel CFA by either disaggregating or aggregating multilevel data. First, Level-2 data (e.g., teacher data) can be disaggregated to Level 1 (e.g., student level), and only the between-level part of the proposed model in Figure 1 is constructed as a student-level CFA model, ignoring the dependency of students nested within teachers. The disadvantages of single-level approaches with multilevel data are widely discussed in the multilevel modeling literature, and thus we strongly discourage using this approach. Second, a disaggregated single-level CFA model with adjusted standard errors that take account of the dependency of multilevel data can be considered as an alternative. There are two major issues with this approach that applied researchers should be aware of. One is that creating a single-level model with multilevel data requires cross-level measurement equivalence (i.e., identical measurement models across levels) because a single set of factor loadings is estimated given the identical factor structure across levels (Zyphur et al., 2008). When the within and between factor structures are not identical, this approach with adjusted standard errors (also called a design-based approach to multilevel data) has been reported to yield biased parameter estimates (Wu & Kwok, 2012) and inflated false detection of noninvariance in MI testing (Kim, Yoon, Wen, Luo, & Kwok, 2015). The other, more problematic, issue is that when the construct tested for MI is a shared cluster construct with the focus on the between-level measurement model, constructing a student-level measurement model and testing measurement invariance between students and teachers at the student level is not conceptually appropriate, even when this model is statistically feasible with cross-level equivalence.
Aggregating the student data to the teacher level could reduce model complexity if classroom means of student scores are reliable, because the focal interest is at the between level in this case. As with the disaggregated single-level approach, cross-level measurement equivalence is assumed in this approach. As pointed out earlier, observed cluster means (e.g., classroom averages of student ratings) are not always reliable, especially when cluster size is small, and previous studies showed that this approach yielded biased estimates of regression coefficients in regression analysis (Croon & van Veldhoven, 2007; Lüdtke et al., 2008). Croon and van Veldhoven suggested using adjusted group means to correct the bias. On the other hand, the multilevel CFA approach, which is demonstrated in this study using Mplus, is known to take into account the reliability of group means at the between level and to produce unbiased estimates (Lüdtke et al., 2008).
Finally, a multilevel model with a saturated within model is another alternative to the proposed multilevel model. The within model is completely relaxed by allowing covariance among all within-level variables without imposing any specific model, which yields perfect model fit at the within level. This model is reasonable because the focal interest of the research is at the between level and MI is tested across groups at the between level. This saturated within model approach can be advantageous because the fit of the within-level measurement model is not of concern and all fit indices used for model evaluation and comparison are relevant to the level where MI is tested (i.e., between level). However, it should be kept in mind that the within level model can be theoretically important and informative although not directly used for MI testing between students and teachers. Thus, researchers “miss the opportunity to discover similarities and differences across levels of analysis in the functioning of their observed variables, differences that could have interesting theoretical importance” (Zyphur et al., 2008, p. 127).
In the following section, the proposed model is demonstrated with student and teacher ratings of promoting social interaction in the classroom. In this demonstration, we first tested measurement invariance of the promoting social interaction scale between students and teachers with the proposed multilevel CFA model and subsequently evaluated the agreement between students and teachers by comparing their factor means. For demonstration purposes, we also tested MI between students and teachers using four alternative models and compared the results with those of the proposed model: a single-level model ignoring data dependency, a single-level approach with adjusted standard errors, an aggregated data approach, and multilevel CFA with a saturated within model.
Method
Participants
Three hundred and thirty-six middle school students were nested within 31 teachers. The average cluster size, that is, the average number of participating students per teacher, was 10.84 (SD = 3.72), with a minimum of 2 and a maximum of 19. The student sample was 52% male (48% female), and several ethnicities were represented (54% White, 21% Latino, 12% Other/Multiracial, 6% Asian, and 5% African American). The teacher sample was 23% male (77% female) and predominantly White (74% White, 13% Latino, 7% African American, 3% Asian American, and 3% did not report their ethnicity in the survey).
Procedure
Students and teachers were recruited from social studies classrooms in three middle schools in the southeastern United States using a convenience sampling method. The research team requested one subject area to be recruited from; principals at participating schools selected social studies class due to practicality issues with state testing. Data used in this study were collected in fall 2009 during the first year of middle school (sixth grade) as part of a larger longitudinal study that investigated student motivation. The three middle schools recruited served a large, ethnically diverse, urban community. Active parental consent and participant assent were obtained prior to data collection; the average consent return rate for students was 57%. Participants were representative of demographics at each school and overall district demographics. Surveys were administered to participants during school hours. The research team returned one additional day to administer make-ups for participants who were absent. Of note is that measurement invariance among the three schools was not explicitly tested because of a small number of teachers per school (e.g., five teachers in one school) and nonconvergence, but indirectly evaluated in testing measurement invariance between students and teachers. 1
Measures
Variables in the current study included demographics and the Teacher Promotion of Social Interaction scale from the Classroom Social Environment developed by Ryan and Patrick (2001). The Promoting Social Interaction scale was self-report, used a 5-point Likert scale (1 = not at all true; 5 = very true), and was positively worded (i.e., higher scores indicated higher degrees of a given attribute).
Promoting Social Interaction Scale: Student Version
The student version of the Teacher Promotion of Social Interaction scale assessed the extent to which students perceived teachers as encouraging students to interact with one another during academic activities (Ryan & Patrick, 2001). This subscale comprised four items: “My teacher often allows students to discuss their work with classmates”, “My teacher encourages us to share ideas with one another in class”, “My teacher lets us ask other students when we need help with our work”, and “My teacher encourages us to get to know all the other students in class.” This scale has previously been administered to early adolescents and has been found to be valid and reliable (Patrick, Ryan, & Kaplan, 2007; Ryan & Patrick, 2001).
Promoting Social Interaction Scale: Teacher Version
The teacher version of the Teacher Promotion of Social Interaction scale assessed the extent to which teachers perceived themselves as encouraging students to interact with one another during academic activities (Ryan & Patrick, 2001). This measure was based on the student version of the Teacher Promotion of Social Interaction subscale (Ryan & Patrick, 2001). The measure used similar items and the same Likert-type scale as the student version but was minimally reworded to reflect teachers’ own perceptions (“I” instead of “My teacher”). Like the student version, this measure contained four items: “I often allow students to discuss their work with classmates”, “I encourage students to share ideas with one another in class”, “I let students ask other students when they need help with their work”, and “I encourage students to get to know all the other students in the class.” This is the first time this scale has been administered to teachers; psychometric information is not available from prior research studies.
Data Analytic Plan
First, descriptive statistics including mean, standard deviation, skewness, kurtosis, and correlation of items were examined for student and teacher data, separately, using SPSS version 22. For the student data, we also reported two types of intraclass correlation (ICC1 and ICC2). Regarding reliability, we estimated Cronbach’s alpha (α) and composite reliability omega (ω) using confirmatory factor analysis. Because Cronbach’s alpha is based on the stringent assumption of essentially tau-equivalent model and estimated by weighting all items equally (Raykov, 1997), we also examined omega that takes into account heterogeneous relations of items to the factor (i.e., unequal factor loadings) in its estimation. For the teacher measure, the Cronbach’s alpha and composite reliability omega were estimated using single-level CFA. For the student data we constructed multilevel CFA. Because multilevel constructs (within and between) were modeled, we reported level-specific reliability estimates, that is, alpha and omega for within and between constructs separately from Level-1 and Level-2 measurement models, respectively (Geldhof, Preacher, & Zyphur, 2014).
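The contrast between alpha and omega drawn above can be illustrated with a minimal Python sketch. It uses the common textbook formulas for a one-factor model; the function names and the numeric loadings are hypothetical and are not estimates from this study, and the level-specific versions would apply the same formulas to the within and between parameter estimates separately.

```python
def implied_cov(loadings, residual_vars, factor_var=1.0):
    """Model-implied item covariance matrix of a one-factor model."""
    k = len(loadings)
    return [
        [
            loadings[i] * loadings[j] * factor_var
            + (residual_vars[i] if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]


def cronbach_alpha(cov):
    """Cronbach's alpha from a k x k item covariance matrix."""
    k = len(cov)
    total_var = sum(sum(row) for row in cov)      # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))   # sum of the item variances
    return (k / (k - 1)) * (1 - item_var / total_var)


def composite_omega(loadings, residual_vars, factor_var=1.0):
    """Composite reliability omega for a one-factor model."""
    true_var = (sum(loadings) ** 2) * factor_var
    return true_var / (true_var + sum(residual_vars))


# Hypothetical standardized loadings (not values from this study).
equal = [0.7, 0.7, 0.7, 0.7]
unequal = [0.9, 0.7, 0.5, 0.3]
for lam in (equal, unequal):
    theta = [1 - l ** 2 for l in lam]
    alpha = cronbach_alpha(implied_cov(lam, theta))
    omega = composite_omega(lam, theta)
    print(round(alpha, 3), round(omega, 3))
```

With equal loadings the two coefficients coincide; with unequal loadings alpha falls below omega, which is why both are reported.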
For measurement invariance testing between students and teachers in terms of promoting social interaction, we first checked student and teacher measurement models separately including cluster invariance testing for the student model. Then, we applied the proposed model shown in Figure 1 to test configural invariance. We imposed cluster invariance for the student Level-1 and Level-2 measurement models with equal factor loadings across levels and zero Level-2 residual variances. For the identification of the covariance structure, the factor loading of one item (i.e., referent item) per factor was fixed at one. For the identification of the mean structure, the intercepts of the referent item were constrained equal between teachers and students, and the mean of the student Level-2 factor was fixed at zero (see the Mplus code in the appendix).
Metric invariance was tested by imposing the equality constraints on all factor loadings between students and teachers. After the establishment of metric invariance, the intercepts of all items were constrained equal between students and teachers for scalar invariance. Finally, factor means were compared between students and teachers to assess agreement between them. Because the mean of the student Level-2 factor was constrained at zero, the estimated mean of the teacher factor could be interpreted as the factor mean difference between students and teachers when scalar invariance was satisfied. We used Mplus version 7.1 (Muthén & Muthén, 2012) to conduct measurement invariance testing. The program default robust maximum likelihood (MLR) was adopted for model estimation which adjusts standard errors and chi-square statistics for nonnormal and nonindependent data.
In determining the level of invariance, we evaluated overall model fit and conducted likelihood ratio tests (LRTs) between two competing models (i.e., configural vs. metric invariance models; metric vs. scalar invariance models). Although the Satorra-Bentler scaled likelihood ratio test (Satorra & Bentler, 2001) is considered optimal with MLR estimation, we used the regular LRT because the Satorra-Bentler scaled LRT frequently produces negative values and because good performance of the regular LRT with MLR has also been reported (Jak et al., 2014; Kim et al., 2017). The overall model fit was considered reasonable if chi-square p ≥ .05, comparative fit index (CFI) ≥ .95, and root mean square error of approximation (RMSEA) ≤ .06 (Hu & Bentler, 1999). We also considered the changes in CFI and RMSEA. When ΔCFI < .01 (Cheung & Rensvold, 2002) and ΔRMSEA < .015 (Chen, 2007), the model with additional constraints (invariance model) was supported. These model fit criteria were suggested in the literature based on single-level CFA models, but for demonstration purposes, we used these long-established conventional criteria for multilevel CFA. Considering that some scholars (e.g., Hsu, Kwok, Lin, & Acosta, 2015; Ryu & West, 2009) have criticized the overall fit evaluation of multilevel models, we also assessed level-specific fit indices and discussed their usability for evaluating the proposed model.
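The model comparison rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' procedure as implemented in Mplus: the function names are hypothetical, the chi-square tail probability is computed with a standard recurrence for integer degrees of freedom, and the fit values in the usage example are made up rather than taken from Table 2.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    while k < df:
        # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def compare_nested(chisq_constrained, df_constrained, chisq_free, df_free,
                   cfi_constrained, cfi_free, rmsea_constrained, rmsea_free):
    """Regular (unscaled) LRT plus the change-in-fit rules stated in the text."""
    delta_chisq = chisq_constrained - chisq_free
    delta_df = df_constrained - df_free
    delta_cfi = cfi_free - cfi_constrained        # drop in CFI
    delta_rmsea = rmsea_constrained - rmsea_free  # rise in RMSEA
    return {
        "lrt_p": chi2_sf(delta_chisq, delta_df),
        "delta_cfi_ok": delta_cfi < 0.01,         # Cheung & Rensvold (2002)
        "delta_rmsea_ok": delta_rmsea < 0.015,    # Chen (2007)
    }


# Hypothetical fit values for a metric model vs. a configural model.
result = compare_nested(12.0, 8, 9.5, 5, 0.990, 0.992, 0.030, 0.028)
print(result)
```

A nonsignificant `lrt_p` together with both change-in-fit flags being true would support the more constrained (invariance) model under the criteria stated above.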
The four alternative models were constructed with two correlated factors of students and teachers. 2 The first two alternative models were disaggregated single-level models without and with standard error adjustment, respectively (n = 336). We used TYPE = COMPLEX for adjusted standard errors in Mplus. For the aggregated approach, student scores of each item were averaged by teacher and combined with the teacher data (n = 31). Finally, for the saturated within model, the between-level model was identical to that of the proposed model, but at the within level student perception of teacher promoting social interaction was not modeled and all items were simply allowed to be correlated. For all four models, MI was tested with the same aforementioned procedure with the same model evaluation and comparison methods. For the single-level approach ignoring data dependency and the aggregated approach, maximum likelihood was used for model estimation. The MLR was used for the other two models.
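The data preparation for the aggregated approach (averaging student item scores by teacher and merging them with the teacher data) can be sketched as follows. The record layout, the `teacher_id` key, and the `s_`/`t_` prefixes are hypothetical names chosen for illustration; the original analysis was carried out with SPSS and Mplus, not with this code.

```python
from collections import defaultdict


def aggregate_by_teacher(student_rows, teacher_rows, items):
    """Average student item scores within teacher and merge with teacher data.

    Returns one record per teacher with student means (prefix "s_") and the
    teacher's own ratings (prefix "t_"). Missing student scores (None) are
    skipped when averaging.
    """
    scores = defaultdict(lambda: {item: [] for item in items})
    for row in student_rows:
        for item in items:
            value = row.get(item)
            if value is not None:
                scores[row["teacher_id"]][item].append(value)

    merged = []
    for teacher in teacher_rows:
        record = {"teacher_id": teacher["teacher_id"]}
        for item in items:
            values = scores[teacher["teacher_id"]][item]
            record["s_" + item] = sum(values) / len(values) if values else None
            record["t_" + item] = teacher[item]
        merged.append(record)
    return merged


# Tiny made-up example: two teachers, three students, two items.
students = [
    {"teacher_id": 1, "item1": 4, "item2": 3},
    {"teacher_id": 1, "item1": 2, "item2": None},
    {"teacher_id": 2, "item1": 5, "item2": 5},
]
teachers = [
    {"teacher_id": 1, "item1": 5, "item2": 4},
    {"teacher_id": 2, "item1": 4, "item2": 5},
]
print(aggregate_by_teacher(students, teachers, ["item1", "item2"]))
```

The resulting file has one row per teacher, which is why the aggregated analysis has n = 31 rather than n = 336.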
Results
Descriptive Statistics
Table 1 presents descriptive statistics and correlations between items separately for student and teacher data. Overall, student ratings were on average lower than teacher ratings across all items. Items 2 (“Share ideas”) and 4 (“Get to know others”) had higher responses than the other items for both students and teachers, suggesting that both groups perceived these two aspects of social interaction as promoted more than the others. Greater variability in student responses was observed across all items. Items were approximately normally distributed except for the slightly negatively skewed distribution of teacher responses for Items 2 and 4. Item responses were more highly correlated for students than for teachers. For the student data, the item ICC1s ranged between .05 and .10, indicating small variability across classrooms relative to within-classroom variability. The estimated ICC2s were higher, ranging from .36 to .55, but all except one fell below the minimum cutoff of .50 for a moderately reliable shared cluster construct, indicating considerable variability among students in their perceptions of their teacher’s promotion of social interaction.
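The relation between the reported ICC1 and ICC2 ranges can be checked with a short Python sketch based on the Spearman-Brown formula. Two assumptions are made here: the average cluster size (336 students / 31 teachers) is used as the cluster size, and the rounded ICC1 endpoints (.05 and .10) are used as inputs, so the result is an approximate consistency check rather than a reproduction of the original computation.

```python
def icc1(between_var, within_var):
    """Proportion of total item variance that lies between clusters."""
    return between_var / (between_var + within_var)


def icc2(icc1_value, cluster_size):
    """Reliability of cluster means via the Spearman-Brown formula."""
    return cluster_size * icc1_value / (1 + (cluster_size - 1) * icc1_value)


avg_size = 336 / 31  # average number of students per teacher
print(round(icc2(0.05, avg_size), 2), round(icc2(0.10, avg_size), 2))
# prints: 0.36 0.55
```

Under these assumptions, the endpoints agree with the reported ICC2 range of .36 to .55.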
Table 1. Descriptive Statistics and Correlations Between Items by Students and Teachers.
Note. Item descriptions are paraphrased. Items were measured on a 5-point Likert-type scale (1 = not at all true, 5 = very true).
For the teacher responses, there were no missing cases (n = 31). For the student responses, missing rates were very small across items. The missing cases were handled using full information maximum likelihood estimation (with the MLR estimator), which included all 336 students in the data analyses.
Reliability
Based on student data, Cronbach’s alphas for the scale were .72 and .94, at the within and between levels, respectively, and composite reliability omegas .73 and .97, respectively. Cronbach’s alpha for the scale using teacher data was .51 and composite reliability omega .53.
Measurement Invariance Testing
Student and teacher measurement models were constructed separately and the model fit was checked prior to measurement invariance testing. The multilevel CFA model with cluster invariance fitted student data well,
3
To test measurement invariance between students and teachers, the multilevel CFA model shown in Figure 1 was constructed with factor loadings and intercepts freely estimated for both student and teacher factors at the between level except the first item for model identification. This configural invariance model showed good fit,
Table 2. Measurement Invariance Testing of the Promoting Social Interaction Scale Across Students and Teachers.
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR-W and SRMR-B = standardized root mean square residual at the within and between levels, respectively; Δ
a The mean of the teacher factor, which indicates the factor mean difference between students and teachers. b SRMR was reported because the model involved only one level of analysis.
With scalar invariance established, factor means were compared. Results showed that the mean of the promoting social interaction factor for teachers (factor variance = 0.060) was 0.907 higher than that for students (factor variance = 0.046). After the factor scores were converted from the raw scale to the standardized scale (factor variances equal to 1.00), the standardized factor mean difference between teachers and students was 4.23. That is, the factor mean for teachers was 4.23 SD higher than that for students. In other words, teachers perceived themselves as actively promoting social interaction in class, whereas from the students’ perspective, social interaction was promoted to a lesser degree. The correlation between the student and teacher factors was .43, but it was not statistically significant (p = .15).
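The reported standardized difference is numerically consistent with dividing the raw factor mean difference by the standard deviation of the student factor. This reading is inferred from the reported numbers and is not stated explicitly in the text; the symbols $\Delta\kappa$ and $\psi_S$ are assumed notation for the raw mean difference and the student factor variance.

```latex
% Worked arithmetic (interpretation inferred from the reported values)
\begin{equation*}
  d = \frac{\Delta\kappa}{\sqrt{\psi_S}}
    = \frac{0.907}{\sqrt{0.046}}
    \approx \frac{0.907}{0.2145}
    \approx 4.23
\end{equation*}
```

Dividing by the teacher factor standard deviation instead ($\sqrt{0.060} \approx 0.245$) would give approximately 3.70, which does not match the reported value.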
The bottom panels of Table 2 show the results of MI testing with the four alternative models. When the single-level model ignoring data dependency was used to test MI, only configural invariance was satisfied, with excellent fit. The model fit deteriorated notably from configural to metric invariance and from metric to scalar invariance. The single-level model with adjusted standard errors showed mixed results. The likelihood ratio test did not support metric invariance. However, both ΔCFI and ΔRMSEA supported scalar invariance, and the fit of the scalar invariance model was excellent. The aggregated data analysis and the multilevel CFA with a saturated within model supported scalar invariance between students and teachers, which is consistent with the conclusion from the proposed model. In all three approaches in which scalar invariance was fully or partly supported, the teacher factor mean was statistically significantly higher than the student factor mean, indicating a considerable difference between students and teachers in their perceptions of promoting social interaction in the classroom. The smallest factor mean difference (0.881) was observed with the aggregated approach; the largest (0.997) was observed with the single-level analysis with adjusted standard errors. The proposed multilevel CFA yielded the smallest standard error.
Discussion
This study demonstrated the model specification and procedures for measurement invariance testing when the groups being compared are at different levels of multilevel data, in other words, when one group is nested within the other. In the following sections, we first briefly discuss congruence or disagreement between student and teacher perceptions of promoting social interaction, given measurement invariance of the scale between students and teachers. We then discuss remaining methodological issues in implementing the proposed model and directions for future research.
Overall, the findings indicated that students and teachers perceived the promotion of social interaction in the classroom differently. This adds to the literature on congruence or disagreement between student and teacher perceptions of the classroom (Beaudrie, 2015; Chambers, 2015; McCroskey & Richmond, 1982) and may have implications for improving educational outcomes and instructional practices (Beaudrie, 2015). The findings extend prior studies using multilevel models by proposing a multilevel CFA model that allows MI testing between cross-level groups and demonstrating testing MI between students and teachers. One possible explanation for differing student and teacher perceptions is that reports of promoting social interaction may be more subjective than other aspects of the classroom such as instruction (Desimone, Smith, & Frisvold, 2010; Mitchell, Bradshaw, & Leaf, 2010). Given that data for the current study were collected during the first year of middle school, students and teachers may differ in their perceptions of the classroom environment due to systematic changes in classroom and school environments associated with the transition into middle school (Eccles et al., 1993). Students’ perceptions of the classroom environment may be in flux following the transition into middle school as a result of experiencing changes in school structure, motivational and instructional techniques, grouping practices, and quality of teacher-student and student-student relationships (Eccles et al., 1993; Eccles & Roeser, 2011). Additionally, prior research suggests that student perceptions cannot be reliably aggregated at the classroom level (Lam, Ruzek, Schenke, Conley, & Karabenick, 2015; Miller & Murdock, 2007; Schenke, Ruzek, Lam, Karabenick, & Eccles, 2017; Schweig, 2014).
This aligns with theory that student perceptions reflect individual interpretations of their experiences in the same environment (Ames, 1992; Maehr & Midgley, 1991) and with research indicating teachers’ differential treatment of students within the same classroom (Brattesani, Weinstein, & Marshall, 1984; Eccles & Blumenfeld, 1985; Kuklinski & Weinstein, 2000). Future research is needed to further examine the heterogeneity of student and teacher perceptions of the classroom and individual differences in the influence of classroom contextual factors (Schenke et al., 2017). Such research may inform future educational interventions and aligns with recommendations for researchers to examine individual- and contextual-level moderating variables that may limit intervention results (Rosenzweig & Wigfield, 2016) and to consider individuals’ beliefs and understandings of themselves and their environments (Wilson & Buttrick, 2016).
Although this study showed scalar invariance of the promoting social interaction measure between students and teachers, several remaining issues are worth noting. We observed low reliability estimates of teacher scores, which raises a concern about the use of observed teacher scores in examining teacher-student congruence. In this case, the advantage of multilevel CFA over conventional approaches (e.g., an observed mean comparison) is more prominent because CFA allows researchers to compare factor means after taking measurement error into account.
The item ICC2s of the student data were low, ranging from .36 to .55. The shared cluster construct requires an ICC2 greater than .50 for meaningful interpretation of its between-level construct (Klein et al., 2000). The low ICC2 values of the student variables indicate that students within the same classroom rated their teacher very differently. Thus, although scalar invariance holds between students and teachers on average, it should be noted that students also disagree with each other about their teachers to some degree. The low ICC2s might not be surprising given the small cluster size (10 students on average). In the extreme case, a teacher was rated by only two students, which obviously raises a concern about the reliability of the average student rating. Because the reliability of cluster means partly depends on the cluster size, researchers who conduct studies with the shared cluster construct should ensure that the cluster size is large enough to yield reliable cluster means for the between-level construct.
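The dependence of ICC2 on cluster size can be illustrated with the Spearman-Brown relation commonly used for the reliability of cluster means, ICC2 = k × ICC1 / (1 + (k − 1) × ICC1), where k is the cluster size. The sketch below is only an illustration: the ICC1 value of .10 is hypothetical and not taken from the data, and the function names are ours.

```python
import math


def icc2(icc1: float, cluster_size: float) -> float:
    """Reliability of a cluster mean (ICC2) from ICC1 via Spearman-Brown."""
    if not 0 < icc1 < 1:
        raise ValueError("icc1 must lie strictly between 0 and 1")
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    k = cluster_size
    return k * icc1 / (1 + (k - 1) * icc1)


def min_cluster_size(icc1: float, target: float = 0.50) -> int:
    """Smallest integer cluster size at which ICC2 reaches the target."""
    if not 0 < icc1 < 1 or not 0 < target < 1:
        raise ValueError("icc1 and target must lie strictly between 0 and 1")
    # Solving k*icc1 / (1 + (k-1)*icc1) >= target for k gives this bound.
    k = target * (1 - icc1) / (icc1 * (1 - target))
    # Small tolerance guards against floating-point overshoot before rounding up.
    return math.ceil(k - 1e-9)


HYPOTHETICAL_ICC1 = 0.10  # assumed for illustration, not estimated from the study

for k in (2, 10, 30):
    print(f"cluster size {k:2d}: ICC2 = {icc2(HYPOTHETICAL_ICC1, k):.3f}")

print("cluster size needed for ICC2 of .50:", min_cluster_size(HYPOTHETICAL_ICC1))
```

Under this assumed ICC1, a cluster of two raters falls well short of the .50 cutoff while a cluster of ten barely passes it, which is in line with the concern about small classrooms raised above.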
To establish measurement invariance between students and teachers using multilevel CFA, we mainly depended on the fit criteria developed for single-level CFA. Recently, Kim et al. (2017) investigated the conventional model comparison criteria, ΔCFI < .01 and ΔRMSEA < .015, in measurement invariance testing with multilevel CFA and showed reasonable performance of ΔCFI < .01. Although they examined these cutoffs specifically for comparisons across many groups, it appears that the conventional cutoff is generally robust for multilevel CFA. However, researchers should be mindful of some criticism of the general use of CFI in measurement invariance testing. Lai and Yoon (2015) noted that CFI values depend on how the baseline model is specified and that the conventional baseline model is not appropriate for MI testing. The conventional baseline model freely estimates the means and variances of observed variables between groups; thus, the scalar invariance model, which constrains the intercepts of observed variables, is not nested within the baseline model. They proposed a modified baseline model and showed that CFI with the modified baseline model was more sensitive to the violation of scalar invariance than the conventional CFI. Although ΔCFI performed reasonably with large noninvariance, its performance was not as good as that of the proposed CFI. Note that their investigation was conducted with single-level multiple-group CFA. Given the popularity of ΔCFI in MI testing, further investigation of the adequacy of ΔCFI in a variety of MI research settings is called for.
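The change-in-fit comparison described above can be written as a small helper. This is a sketch under stated assumptions: the cutoffs are the conventional ΔCFI < .01 and ΔRMSEA < .015 mentioned in the text, the fit values in the usage lines are hypothetical rather than taken from Table 2, and the function name is ours.

```python
def fit_change_supports_invariance(
    cfi_less_constrained: float,
    cfi_more_constrained: float,
    rmsea_less_constrained: float,
    rmsea_more_constrained: float,
    cfi_cutoff: float = 0.01,
    rmsea_cutoff: float = 0.015,
) -> dict:
    """Compare two nested invariance models using the change in CFI and RMSEA."""
    # A worse-fitting constrained model shows a drop in CFI and a rise in RMSEA.
    delta_cfi = cfi_less_constrained - cfi_more_constrained
    delta_rmsea = rmsea_more_constrained - rmsea_less_constrained
    return {
        "delta_cfi": delta_cfi,
        "delta_rmsea": delta_rmsea,
        "cfi_ok": delta_cfi < cfi_cutoff,
        "rmsea_ok": delta_rmsea < rmsea_cutoff,
    }


# Hypothetical fit values for a metric model versus a scalar model
small_change = fit_change_supports_invariance(0.980, 0.975, 0.040, 0.045)
large_change = fit_change_supports_invariance(0.980, 0.960, 0.040, 0.060)

print("small change:", small_change["cfi_ok"], small_change["rmsea_ok"])
print("large change:", large_change["cfi_ok"], large_change["rmsea_ok"])
```

The helper reports the two criteria separately because, as the single-level analysis with adjusted standard errors shows, different criteria do not always agree.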
Hsu et al. (2015) conducted a simulation study to assess the sensitivity of RMSEA, CFI, Tucker–Lewis index (TLI), SRMR-W, and SRMR-B to model misspecification in multilevel CFA and reported that SRMR-B was the only fit index sensitive to model misspecification at the between level. Given their findings, evaluating between-level-specific fit indices, including SRMR-B, is important for multilevel CFA with a focal interest in the between-level model, as demonstrated in this study. However, Hsu, Lin, Kwok, Acosta, and Willson (2016) showed that the performance of between-level-specific indices depended on the ICC. When the ICC was low (about .09), they did not recommend SRMR-B and RMSEA-B for model evaluation. Instead, because Hsu et al.’s (2016) study evidenced reasonable performance of CFI-B and TLI-B (≥ .95), we evaluated these fit indices for the between-level models and observed that both deteriorated notably when scalar invariance was imposed (CFI-B = .928, down from 1.00; TLI-B = .930, down from 1.117), which raises a concern about possible scalar noninvariance between students and teachers. To date, the criteria to evaluate and compare measurement invariance models in multilevel CFA have not been well established. Methodological efforts to develop reliable criteria based on between-level-specific fit indices are needed. In particular, these between-level fit criteria are essential to evaluate multilevel CFA for cluster constructs such as shared cluster constructs (e.g., leadership, classroom climate).
In this study, we demonstrated measurement invariance testing between students and teachers given classroom invariance of the student measure. However, a systematic review of 72 multilevel CFA studies (Kim et al., 2016) revealed that only a small number of studies tested cross-level invariance, which is required for metric invariance across clusters, and that only two of those studies confirmed cross-level invariance. Given their report, cluster noninvariance or cluster bias (differences in student response patterns across classrooms) may not be unusual, which calls into question the feasibility of testing student-teacher invariance. As suggested for MI testing in general, researchers may establish partial cluster invariance and then proceed to test measurement invariance between students and teachers. However, the meaning of Level-2 constructs under cluster noninvariance (including different factor structures across levels, such as one factor at Level 1 and two factors at Level 2) and the treatment of such constructs in subsequent analyses need further discussion and investigation from both substantive and methodological perspectives.
Measurement invariance between students and teachers was also investigated with four alternative models in this study. The single-level approach ignoring the hierarchical data structure reached a different level of MI (configural invariance) from that reached by the proposed model (scalar invariance). Previous studies (e.g., French & Finch, 2010; Kim et al., 2012) demonstrated that single-level MI tests with multilevel data that do not take data dependency into account resulted in high Type I error (i.e., high false detection of noninvariance). For both conceptual and statistical reasons, this approach is not recommended. The other three approaches reached the same conclusion of scalar invariance, consistent with the conclusion of the proposed model, which is not surprising given the cross-level measurement equivalence of promoting social interaction. The cross-level equivalence of measurement models is a fundamental underlying assumption of a single-level approach, whether disaggregating to Level 1 or aggregating to Level 2 (Zyphur et al., 2008). When this assumption is satisfied, MI tests at Level 1 with disaggregated data may perform reasonably in the detection of noninvariance if standard errors are properly adjusted, as demonstrated in Kim et al. (2012) with within-level groups (e.g., gender). However, in this study the groups being compared are at different levels, and the teacher construct of promoting social interaction is uninterpretable at the student level. Thus, even though this disaggregating approach may yield statistically reasonable MI solutions under certain circumstances, it is not conceptually appropriate for a construct conceptualized at Level 2.
Because the construct of interest is at the between level, aggregating student scores and testing MI at the teacher level with a single-level model, or testing MI at the between level with a saturated within model, seems appealing. The aggregated single-level approach has been adopted in organizational climate and contextual studies (Hofmann, 2002; Lüdtke et al., 2008), but its disadvantages are also well documented in the literature. Croon and van Veldhoven (2007) pointed out that the aggregated single-level approach would yield unbiased parameter estimates only if there is no within-group variability (or asymptotically unbiased estimates as group size increases). Lüdtke et al. (2008) compared the aggregated approach with the multilevel approach (called the multilevel latent covariate model) in estimating contextual effects and showed that the multilevel approach performed better. In the current study, we observed that within-group agreement was not high, with only one item meeting the minimal ICC2 cutoff for reliable classroom means. In this situation, the proposed multilevel CFA approach is more compelling because it takes into account the unreliability of group means at the between level (Lüdtke et al., 2008), although we did not observe any difference between the two approaches in the MI testing results, with both endorsing scalar invariance.
The multilevel CFA with a saturated within model also yielded results identical to those of the proposed model in MI testing between students and teachers. This approach could be very attractive to applied researchers because the within model does not need to be specified and the model fit is evaluated only at the between level. However, the factor structure of student perceptions of promoting social interaction is not investigated at the within level, the tenability of the within model is not tested, and consequently cluster invariance, including cross-level measurement equivalence, cannot be assessed and is not explicit as it is in the proposed model. If the sole purpose is between-level MI testing, this approach is a potential alternative to the proposed model and offers some practical advantages.
For the factor mean comparison between students and teachers, all models that endorsed scalar invariance showed a statistically significant mean difference. However, the estimated parameters and the associated standard errors vary to some degree across models, as shown in Table 2. Because the estimated effect size was very large, the discrepancies in the estimated factor mean difference and its standard error across approaches did not lead to different conclusions about student-teacher agreement. However, the aggregated approach yielded the smallest factor mean difference, possibly because of the unreliability of group means when student ratings were averaged; Lüdtke et al. (2008) consistently observed underestimation of contextual effects with the aggregated approach in their simulation study. We also observed apparently large heterogeneity of factor variance between students and teachers for the disaggregated single-level approaches, with notably small teacher variability, which again raises a concern about the legitimacy of the disaggregated analysis with teacher data. Finally, a future Monte Carlo simulation study is called for to compare the proposed method with potential alternatives under various research conditions, in terms of MI testing and the accuracy and efficiency of the parameter estimates, to fully establish the benefits of one model over another.
Conclusion
In studies investigating congruence between two groups, we advocate MI testing to ensure that the two groups interpret and respond to test items in the same way. Because the groups compared for MI are often at different levels of multilevel data with one group nested within the other, we proposed and demonstrated a multilevel CFA model that allows MI testing between cross-level groups in multilevel data. This study is expected to promote MI testing between groups in congruence studies in different fields (e.g., agreement between student and teacher perceptions in education). The proposed model can also be applied to any measurement invariance testing research with cross-level groups, that is, groups at different levels of multilevel data. Along with the demonstration, we discussed some methodological issues in implementing the proposed model and evaluating the model fit of multilevel CFA, which call for future research.
Appendix
Data: file is data_sturev.txt;
Variable:
  names are TeacherID StudentID Y1-Y4 T1-T4;
  missing are all(999);
  usevariables are Y1-Y4 T1-T4;
  between = T1-T4;
  cluster is TeacherID;
Analysis: type = twolevel;
Model:
  %within%
  FW by Y1
    Y2(1)
    Y3(2)
    Y4(3);
  %between%
  ! cross-level factor loading invariance for metric invariance of student data
  FB by Y1
    Y2(1)
    Y3(2)
    Y4(3);
  ! residual variances constrained to be zero for scalar invariance of student data
  Y1-Y4@0;
  ! factor loading was fixed at one for the referent item
  ! to identify the covariance structure
  FT by T1-T4;
  ! the correlation between teachers' and students' perceptions
  ! of promoting social interaction was estimated
  FT with FB;
  ! residual covariances were not estimated, because residual variances
  ! of student items were constrained to be zero
  !Y1-Y4 pwith T1-T4;
  [Y1](i1);
  [Y2-Y4];
  ! intercepts of the referent item were constrained equal
  ! between teachers and students
  [T1](i1);
  [T2-T4];
  ! the mean of the student level-2 factor was fixed at zero
  [FB@0 FT];
Output: sampstat stdyx residual modindices(all) cinterval;
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
