Abstract
Adapted from the Early Childhood Environment Rating Scale–Revised, the Chinese Early Childhood Program Rating Scale (CECPRS) is a culturally comparable measure for assessing the quality of early childhood education and care programs in the Chinese cultural/social contexts. In this study, 176 kindergarten classrooms were rated with CECPRS on eight dimensions of program quality. Multivariate generalizability theory (MG-theory) was applied to evaluate the reliability of CECPRS ratings. Empirical findings indicate that CECPRS ratings showed excellent reliability, both on the subscales and for the composite. Based on the findings of the D-study for different measurement protocol scenarios, for achieving a good balance between reliability and cost/efficiency, we recommend using two independent raters for each kindergarten classroom. Furthermore, the application of MG-theory in this context broke new ground in the assessment of child care program quality, as the MG-theory provides research and practical possibilities not readily available under the traditional reliability methods.
Governments in many countries place a high priority on the quality of Early Childhood Education and Care (ECEC) programs. For this purpose, early childhood researchers are actively seeking psychometrically sound (e.g., measurement reliability, validity) methods and approaches to measure the quality of ECEC programs (Fenech, 2011). In the United States, research on ECEC quality and child development outcomes has been providing guidance for policy formation and classroom practice (e.g., National Institute of Child Health and Human Development [NICHD] Early Child Care Research Network [ECCRN], 1999). In such research, the Early Childhood Environment Rating Scale–Revised (ECERS-R; Harms, Clifford, & Cryer, 1998/2005) has been the main tool for measuring process quality of ECEC services.
Quality Evaluation for ECEC in China
In China, ECEC government agencies and researchers have also been working to develop a system (generally known as the “Kindergarten Quality Rating System,” or KQRS) that can guide them in defining, measuring, and improving the quality of current ECEC program services. In 2010, the State Council of China (the executive branch of Chinese government) called on the ECEC community to establish an effective KQRS to address the pressing issue of preschool education quality and its measurement. Unfortunately, existing research provides little information in this regard (Hu & Li, 2012). In fact, policy makers have primarily relied on ECEC regulations, rather than on empirical research evidence and findings, to guide their decisions on quality assessment of ECEC programs (Hu & Li, 2012; Pan, Liu, & Lau, 2010).
One major problem is that there has been little information or empirical evidence about the psychometric quality of the evaluation criteria that most KQRS currently use in practice. In one of the few available studies, Pan et al. (2010) found that the KQRS used in Beijing (China’s capital, with a population close to 20 million) could not differentiate those kindergartens categorized as “superior” from those categorized as “good” by another more commonly used holistic quality categorization system. Other scholars have criticized the current rating systems’ overemphasis on structural features, such as kindergarten administration, and their inadequate attention to educational activities (Dai & Liu, 2003). Hu, Li, and Zhao (2013) used the ECERS-R to evaluate the validity of the KQRS used in Zhejiang province (a province on China’s east coast, with a population of more than 50 million and a relatively more developed economy among the provinces in China), and they had similar findings about the Zhejiang KQRS’s lack of focus on measuring process quality. Early childhood education scholars in China have recommended the ECERS-R as the most suitable tool for adoption and adaptation for Chinese contexts (e.g., Hu & Zhu, 2009).
Adaptation Process of the ECERS-R
The ECERS-R (Harms et al., 1998/2005) has been the most frequently utilized observational tool for measuring the quality of ECEC in empirical research (Fenech, 2011) and in the quality rating and improvement system (QRIS) in the United States (Tout et al., 2010). The ECERS-R has been translated into more than 14 languages, and a rich body of research literature has been accumulated regarding its psychometric quality (e.g., reliability and validity) in international contexts. While most countries have fully adopted the ECERS-R without any adaptations, scholars (e.g., Tobin, 2005) have pointed out that simple translation of the instrument into another language would be grossly inadequate; instead, researchers should undertake a systematic approach to study and understand the cultural congruence between the concepts of quality measured in the ECERS-R and the sociocultural and policy contexts where the adapted tool will be used. Only through such careful and systematic explorations can the researchers gain sufficient understanding about how to adapt the tool for a new cultural and social environment.
Considerations of Cultural Differences and Content Validity
To contextualize the ECERS-R for Chinese ECEC, Hu et al. (2013) explored the quality of 105 classrooms in Zhejiang based on the ECERS-R. More specifically, Hu et al., using both the ECERS-R and Zhejiang’s current KQRS, examined the degrees of congruence between the two evaluation systems on various quality concepts. They identified some underlying cultural and contextual reasons for the differences found in the concepts of quality underlying the two evaluation systems. Hu et al. also identified some culturally unique quality concepts that reflect the strengths/values of Chinese ECEC, such as collectivism and related teaching methods (i.e., whole-group instruction), to be incorporated into the adapted ECERS-R. As a result, numerous revisions and additions have been made to the scoring criteria and to the examples to improve the cultural and social relevance of the tool.
Moreover, Hu et al. (2013), based on their review and analysis of the international literature regarding the psychometric properties of the ECERS-R, made efforts to incorporate important findings in the adaptation of the ECERS-R for Chinese cultural and social contexts. For example, two large international studies have suggested that the ECERS-R is not sensitive in differentiating programs in the upper range of the quality spectrum (e.g., from excellent to good), but is more sensitive in differentiating programs in the lower range of quality (e.g., those scored below the mean; M. C. Lambert et al., 2008; Mathers, Linskey, Seddon, & Sylva, 2007). Mathers et al. (2007) further suggested additional assessment items for in-depth adult interactions in curricula activities, similar to the conclusions by Sylva et al. (2006), whose research led to the ECERS-Extension to academic-related skills in different curricular domains in England. Therefore, in our adaptation research work, the research team worked on strengthening the evaluation questions and scoring criteria most appropriate for ECEC program quality in Chinese cultural and social contexts. Furthermore, the research team solicited and carefully examined ECERS-R raters’ and expert reviewers’ perceptions of the cultural appropriateness of both adopted items and adapted items based on the ECERS-R. Based on extensive research and empirical findings, the research team developed the Chinese Early Childhood Program Rating Scale (CECPRS), an adapted version of the ECERS-R. Hu et al. presented more details of this adaptation work.
Considerations of Psychometric Properties
In general, empirical studies have supported the psychometric quality of the ECERS-R (e.g., reliability and validity) both in the United States (e.g., Burchinal, Howes, & Kontos, 2002; R. Lambert, Abbott-Shim, & McCarty, 2002) and in international contexts (e.g., Mathers et al., 2007; Rentzou, 2010). However, the findings concerning the number of dimensions of the ECERS-R have been mixed, with the results comprising a one-dimensional global quality (e.g., Perlman, Zellman, & Le, 2004), a two-factor model (Cassidy, Hestenes, Hegde, Hestenes, & Mims, 2005), and a three-factor model (Gordon, Fujimoto, Kaestner, Korenman, & Abner, 2012).
Using Item Response Theory (IRT), Gordon et al. (2012) identified that the ECERS-R’s “stop scoring” response process could potentially cause a “rating category disordering.” Another main cause of the disordering, according to Gordon et al., is related to “the mixing of different aspects of quality within the indicators of single items” (p. 11). The scoring process’s requirement for numerous subjective raters and its reliance on teachers’ reports of information to finish scoring have also contributed to the disordering issues. Gordon et al. thus concluded that the ECERS-R has measurement flaws that should be avoided by future researchers when developing tools for quality evaluation.
To adapt ECERS-R to a tool (i.e., CECPRS) for assessing Chinese ECEC program quality, the researchers have made several adaptations to avoid the “rating category disorder.” First, we required that raters score all indicators, treating the CECPRS as a checklist, rather than a scale (Gordon et al., 2012). Second, in order to differentiate specific concepts of quality within an item, we created subitems for specific aspects of quality in relation to the four indicators of a single item (Gordon et al., 2012). As an example, Table 1 shows the adapted version of the item “Indoor space” in CECPRS.
Table 1. Item 1: Indoor Space.
The addition of assessment subitems (shown in Table 1) has significantly increased the logical organization of the assessment questions under each item; however, this has led to the reorganization of the content of each item. Finally, these two additional features (i.e., to score all indicators, and the inclusion of the subitems) have allowed us to calculate scores for each subitem under each item.
Purpose of This Study
For CECPRS, an instrument adapted from ECERS-R for the Chinese cultural and social contexts, it is imperative that the psychometric characteristics (e.g., reliability and validity) of using CECPRS be systematically evaluated. Research work on CECPRS has been ongoing. Up to now, the psychometric research work for CECPRS has included the following: correlational analyses among subscales of CECPRS, interrater reliability, internal consistency reliability (e.g., Cronbach’s coefficient α), exploratory factor analysis, and confirmatory factor analysis (Li, Pan, & Hu, 2013a).
The initial empirical findings about the psychometric quality of the use of CECPRS have been very promising. For example, Cronbach’s coefficient alphas are typically around .90 for subscales and about .95 for the total scale. The factor analysis work suggested four substantively meaningful factors (quality dimensions): (a) physical environment and provisions, (b) curricula structure and child-initiated activities, (c) teacher-led group teaching, and (d) supervision and interactions. Interrater reliability estimates (in the form of kappa, κ) at the item rating level suggested substantial (κ = .60-.80) to outstanding (κ > .80) interrater agreement for almost all the items (Li et al., 2013a). In addition, initial research findings also show that the CECPRS score is statistically associated with child developmental outcomes (e.g., movement, cognition, language, etc.; Li, Pan, & Hu, 2013b), suggesting good criterion-related validity evidence (part of construct validity evidence) for CECPRS. As part of the programmatic research efforts related to CECPRS, this study was conducted to extend the research (Li et al., 2013a, 2013b) on the psychometric quality of CECPRS use. More specifically, this study examines the measurement reliability of CECPRS under the framework of generalizability theory (G-theory). It should be noted that the psychometric research described above did not involve the same data as used in this study.
Application of G-Theory for CECPRS
G-theory (Brennan, 2010; Cronbach, Gleser, Nanda, & Rajaratnam, 1972) has emerged as a comprehensive framework for measurement reliability. G-theory subsumes other forms of reliability approaches (e.g., internal consistency reliability, interrater reliability, intraclass correlation, etc.), and provides a comprehensive and unifying framework (Fan & Sun, 2013) for assessing measurement reliability, especially for complex measurement situations.
Relative versus absolute decisions
Classical test theory focuses on norm-referenced score interpretation, that is, reliability is concerned with the consistency of the relative standings of the individuals, but not with the consistency of the actual scores. Within the G-theory framework, this type of interpretation is called “relative decision.” Criterion-referenced score interpretation is concerned with both the consistency of the relative standings of the individuals and the consistency of the actual scores (i.e., possible score change). This criterion-referenced perspective is called “absolute decision” in G-theory. Two types of generalizability (reliability) coefficients correspond to these two types of decisions: the “relative decision” generalizability coefficient (Eρ²) and the “absolute decision” generalizability coefficient (φ).
Researchers are usually familiar with the interrater reliability coefficient, which is a “relative decision” reliability estimate. In using CECPRS, however, the measurement interest is not only in the consistency of the relative standings of the programs, but also in the consistency of the actual rating scores across raters, because these scores represent the quality differences across the programs. Because of this, the appropriate reliability coefficient is the “absolute decision” generalizability coefficient φ.
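The distinction between the two coefficients can be stated compactly. The block below is a sketch in the standard G-theory notation (e.g., Brennan, 2010), which is an assumption here because the source text does not spell out the symbols: σ²(τ) is the universe-score (“true score”) variance, σ²(δ) is the relative error variance, and σ²(Δ) is the absolute error variance.

```latex
\begin{align*}
E\rho^2 &= \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)}
  && \text{(relative decision)} \\
\phi &= \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)}
  && \text{(absolute decision)}
\end{align*}
```

Because σ²(Δ) contains every error source that enters σ²(δ) plus any main effects of the facets (e.g., overall rater leniency), σ²(Δ) is at least as large as σ²(δ), so φ can never exceed Eρ² for the same data.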
Planning for optimal measurement protocol
Conventional reliability approaches are typically post hoc, that is, measurement reliability is computed after the fact. G-theory, however, can be used proactively in planning for a better measurement protocol. A G-theory analysis may have two stages: the G-study and the D-study. The G-study serves as a “pilot” study that provides information (e.g., variance components for different sources) for planning a future measurement study. In the D-study, the information from the G-study is used for planning an “optimal” measurement protocol so that the best possible reliability can be achieved while balancing other factors (e.g., cost and effort). This flexibility and forecasting capability are not generally available in conventional reliability approaches (Fan & Sun, 2013; Shavelson & Webb, 1991).
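The forecasting role of the D-study can be illustrated with a minimal Python sketch. It assumes a single-facet design with raters nested under the object of measurement, and it uses made-up variance components; the function names and the numbers are hypothetical and are not taken from this study or from mGENOVA.

```python
def phi_coefficient(var_object, var_error, n_raters):
    """Absolute-decision coefficient when the rating is averaged over n_raters.

    var_object: variance component for the object of measurement.
    var_error:  variance component for raters nested under the object.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return var_object / (var_object + var_error / n_raters)


def min_raters_for_target(var_object, var_error, target, max_raters=50):
    """Smallest number of raters whose projected phi reaches the target.

    Returns None if the target is not reached within max_raters.
    """
    for n in range(1, max_raters + 1):
        if phi_coefficient(var_object, var_error, n) >= target:
            return n
    return None


# Hypothetical G-study estimates (illustrative only, not values from this study).
VAR_OBJECT = 0.85
VAR_ERROR = 0.15

for n in range(1, 6):
    print(f"n_raters = {n}: phi = {phi_coefficient(VAR_OBJECT, VAR_ERROR, n):.3f}")

print("raters needed for phi >= .90:",
      min_raters_for_target(VAR_OBJECT, VAR_ERROR, 0.90))
```

The G-study supplies the two variance components once; the D-study then answers planning questions, such as how many raters a target reliability would require, without collecting new data.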
The application of G-theory has advanced from the univariate G-theory method to the multivariate G-theory (MG-theory) method. MG-theory application is appropriate for multidimensional and complicated measurement situations. MG-theory application offers methodological advantages: The analysis and the estimation process take into account not only the variances (variance components), but also the covariance structure, of the dimensions. Reliabilities of all dimensions are estimated simultaneously, rather than each dimension in isolation (univariate G-theory).
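To make the role of the covariance structure concrete, the sketch below computes a composite φ as a weighted quadratic form over a universe-score covariance matrix and an error covariance matrix. Everything in it is an assumption for illustration: three subscales instead of eight, equal weights, raters who rate all subscales (so error covariances are nonzero), and invented matrices. It is not the mGENOVA computation used in this study.

```python
def quadratic_form(weights, matrix):
    """Return w' M w for a weight list and a square matrix (list of lists)."""
    size = len(weights)
    return sum(weights[i] * matrix[i][j] * weights[j]
               for i in range(size) for j in range(size))


def composite_phi(weights, cov_object, cov_error, n_raters):
    """Absolute-decision coefficient of a weighted composite of subscales."""
    universe = quadratic_form(weights, cov_object)
    error = quadratic_form(weights, cov_error) / n_raters
    return universe / (universe + error)


# Hypothetical (co)variance components for three subscales (illustrative only).
COV_OBJECT = [[0.80, 0.55, 0.50],
              [0.55, 0.70, 0.45],
              [0.50, 0.45, 0.60]]
COV_ERROR = [[0.12, 0.03, 0.02],
             [0.03, 0.15, 0.04],
             [0.02, 0.04, 0.18]]
WEIGHTS = [1 / 3, 1 / 3, 1 / 3]  # equal weighting is an assumption
N_RATERS = 2

# Subscale-level phi uses only the diagonal elements.
subscale_phis = [
    COV_OBJECT[k][k] / (COV_OBJECT[k][k] + COV_ERROR[k][k] / N_RATERS)
    for k in range(3)
]
composite = composite_phi(WEIGHTS, COV_OBJECT, COV_ERROR, N_RATERS)

print("subscale phi:", [round(p, 3) for p in subscale_phis])
print("composite phi:", round(composite, 3))
```

The off-diagonal terms are what a univariate analysis of each subscale in isolation would ignore; they enter the composite through both the numerator and the error term.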
Evidently, evaluation of ECEC quality is a multidimensional measurement process. Conventional reliability methods cannot handle multiple dimensions simultaneously. In addition, the conventional approach to interrater reliability cannot handle the situation where different kindergarten classrooms may be rated by different (teams of) raters (i.e., raters nested under classroom), which can be a common practice in the evaluation of ECEC programs. Furthermore, as indicated in the previous discussion, the G-study and D-study procedures provide flexibility in exploring an “optimal” measurement protocol. Based on these considerations, MG-theory is a more appropriate approach for reliability analysis when conducting a multidimensional evaluation involving multiple raters (Clauser, Harik, & Margolis, 2006; Yang, Chang, & Ma, 2004).
Method
Participants
This study took place in a large province on China’s east coast (i.e., Zhejiang province). In the study, 91 kindergartens (in China, a kindergarten is typically a school with multiple, or even many, classrooms for children of the same or different ages) from six municipalities were included, with two municipalities representing each of the three levels of economic development (advanced, average, below average), as categorized by the Zhejiang provincial government. The selection of the 91 kindergartens was stratified on the basis of current provincial government ratings of quality (high, moderate, and low quality), location (urban, county, township, and village), and funding source (public and private). Finally, from the 91 kindergartens, a sample of 176 kindergarten classrooms was randomly selected for inclusion in this study. The age make-up of the classrooms was as follows: 45 classrooms for 3- to 4-year-olds, 51 classrooms for 4- to 5-year-olds, 74 classrooms for 5- to 6-year-olds, and the remaining ones for mixed-age children.
Measures
The CECPRS described previously was used to measure the program quality of the kindergarten classrooms. The CECPRS uses the same 7-point Likert-type scale as the ECERS-R to rate ECEC program quality in Chinese social and cultural contexts, with 1 representing inadequate quality and 7 representing excellent quality. The CECPRS is made up of eight subscales: (a) Space and Furnishings, (b) Personal Care Routines, (c) Curriculum Planning and Implementation, (d) Whole-Group Instruction, (e) Activities, (f) Language-Reasoning, (g) Guidance and Interactions, and (h) Parents and Staff. The CECPRS has 51 items, 177 subitems, and a total of 700 indicators. More details about CECPRS can be found in Li et al. (2013a, 2013b).
Design
Using CECPRS, two raters (r) observed and rated each of the 176 classrooms (c). The raters were graduate students majoring in early childhood education, and they had received rigorous training in using CECPRS for rating kindergarten classroom quality. The training consisted of 2 days of lecture and 4 days of field practice. In training practice, these raters achieved an interrater reliability coefficient of .85 before they were allowed to rate real kindergarten classrooms. In this measurement situation, rater is the “facet,” that is, a potential measurement error source, as raters may be inconsistent from rating one classroom to another (rater-classroom interaction effect), and raters may also differ in terms of their general tendency of being lenient or stringent (rater effect). Because CECPRS has eight subscales (dimensions), all of which were rated by raters nested in classrooms, the measurement design of this study is a multidimensional nested design with the single facet of raters nested under classroom (i.e., r:c).
Having a nested design is typically not ideal, and should be avoided if possible, as such a design may “leave important questions unanswered” (Shavelson & Webb, 1991, p. 52). More specifically, a nested design makes it impossible to estimate all variance components separately. Whether a nested design is usable in a G-theory study depends on the purpose of the study, and on which type of generalizability coefficient (relative decision, absolute decision) is needed or desired. Fortunately, in this study, the nested design was usable, as it did not compromise the purpose of our study. Because we were interested in obtaining the G-theory absolute decision φ coefficient, the confounding introduced by the nested design would not affect the computation of this coefficient, as discussed later.
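Why the confounding leaves φ intact can be seen from the standard single-facet decompositions (e.g., Shavelson & Webb, 1991). The notation below is an assumption, since the source text does not give it explicitly: σ²(r) is the rater main effect, σ²(cr,e) is the classroom-by-rater interaction confounded with residual error, and n_r is the number of raters per classroom.

```latex
\begin{align*}
\text{crossed } c \times r:\quad
  \sigma^2(\Delta) &= \frac{\sigma^2(r)}{n_r} + \frac{\sigma^2(cr,e)}{n_r} \\
\text{nested } r{:}c:\quad
  \sigma^2(r{:}c) &= \sigma^2(r) + \sigma^2(cr,e) \\
\Rightarrow\quad
  \sigma^2(\Delta) &= \frac{\sigma^2(r{:}c)}{n_r}
\end{align*}
```

The nested design yields only the sum σ²(r) + σ²(cr,e), but the absolute error variance needs exactly that sum, so φ can still be computed. What is lost is the ability to say how much of the error is due to general rater leniency or stringency and how much is due to rater-classroom inconsistency.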
Results and Discussion
The G-Study Results
As discussed previously, G-theory applications may include a G-study and a D-study. As the first step, the G-study includes design, data collection, and estimation of the relevant variance components under the design conditions (Shavelson & Webb, 1991). Once the variance components of all the relevant sources have been estimated, they can be used in the D-study to plan for future measurement protocols. In this study, mGENOVA (Brennan, 2001) was used for conducting all G-theory analyses.
Table 2 presents the G-study results for all variance components (the diagonal elements) and covariance components (i.e., covariation among the subscales) of the eight dimensions of CECPRS. Each variance component represents the estimated “true score” variance across the classrooms, $\sigma^2(c)$, on the corresponding subscale.
Table 2. Estimated G-Study Variance and Covariance Components.
Note. The diagonal elements are variance components; the lower diagonal elements are covariances; the upper diagonal elements are correlations.
The correlation coefficients among the subscales were at least .65, suggesting that the eight dimensions are sufficiently related to each other to form the foundation for a composite measure. The variance component for the nested rater effect (
The D-Study Results
Once the variance components from the G-study results are available, these can then be used in the D-study to investigate how a better measurement protocol can be designed (Brennan, 2010; Fan & Sun, 2013; Shavelson & Webb, 1991). For example, the measurement protocol can be modified by adding or reducing the number of raters, increasing or decreasing the number of items, and so on. The impact of these measurement protocol modifications on measurement reliability is then evaluated so that the “optimal” measurement protocol can be planned. In the following, we discuss the results from (a) the original measurement protocol, and (b) the new measurement protocol by modifying the number of raters under the nested design.
D-Study Results Under the Original Measurement Protocol
Table 3 presents the results of the D-study under the original measurement protocol used in our G-study, that is, each classroom was rated by two raters, and the raters were nested under classroom. In the application of CECPRS, the ratings on each dimension represent kindergarten classroom quality levels, and a higher rating score represents better classroom quality than a lower rating score. Because of this, it is insufficient to consider only the consistency of relative rankings across raters; the consistency of actual rating score values across the raters is also relevant. With these considerations, the generalizability coefficient for absolute decision (φ) is the appropriate form of reliability coefficient:
$$\phi = \frac{\sigma^2(c)}{\sigma^2(c) + \sigma^2(r{:}c)/n_r}$$
Table 3. D-Study Results Under Original Measurement Protocol.
Note. S/N = signal-to-noise ratio.
In the formula above, as discussed hereinbefore, $\sigma^2(c)$ is the variance component for classrooms (the “object of measurement”), $\sigma^2(r{:}c)$ is the variance component for raters nested under classroom, and $n_r$ is the number of raters per classroom.
Each φ coefficient in Table 3 is the ratio of “true score” variance (i.e., the variance component for the “object of measurement,” $\sigma^2(c)$) to the sum of that variance and the absolute error variance, $\sigma^2(r{:}c)/n_r$.
D-Study Results Under New Measurement Protocols
Evaluation of ECEC program quality is a very arduous, time-consuming, and complicated process. To explore a better measurement protocol, we considered a series of “what if?” scenarios, that is, how the number of raters would affect the measurement reliability of CECPRS applications. This process allows us to balance measurement considerations (e.g., reliability) and practical considerations (e.g., cost/efficiency of using different numbers of raters).
In our “what if?” scenarios, we considered 1 to 5 (i.e., $n_r$ = 1, 2, 3, 4, or 5) raters for each classroom. These scenarios of using different numbers of raters in the D-study allow us to estimate the impact of changing this measurement condition on the reliability of CECPRS applications. Figure 1 graphically presents the results of these scenarios for examining the impact of using different numbers of raters on the measurement reliabilities (φ) of CECPRS’s eight subscales and the overall CECPRS composite. Figure 1 shows that the generalizability coefficients of all subscales were above .8 even with one trained rater, suggesting that, in CECPRS applications, reliability for ECEC program quality ratings is generally good, but there is some variation across the subscales.

Figure 1. φ coefficient curves of eight subscales and the composite (dashed curve) of CECPRS for using different numbers of raters ($n_r$).
More relevant and interesting is the pattern in Figure 1. As theoretically expected, reliability estimates increase as the number of raters increases. The more interesting information in Figure 1, however, is that the largest increase (i.e., the steepest section of the curves) occurred between one rater and two raters. After the two-rater scenario, the increase in the φ coefficient value gradually flattens out, resulting in a situation of “diminishing returns” for further increasing the number of raters. More specifically, of the total improvement in reliability from using one rater to using five raters, the step from one rater to two raters accounts for about 60%. After this step, the percentages of improvement are 20% (two to three raters), 10% (three to four raters), and 6% (four to five raters) of the total improvement, respectively. This pattern suggests that, in CECPRS applications, using two trained raters could be considered the “optimal” measurement protocol for balancing the consideration for reliability (i.e., reaching a reasonable level of reliability) and the consideration for cost/efficiency. Although using more raters may further increase measurement reliability, the “diminishing returns” pattern suggests that it may not be worth the extra cost of having more than two raters for each ECEC classroom.
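The “diminishing returns” pattern is a direct consequence of dividing the error variance by the number of raters. The Python sketch below shows one way to compute each step’s share of the total one-to-five-rater improvement. The variance components are hypothetical placeholders chosen only so that the one-rater φ is .85; they are not the estimates behind Figure 1, and the function names are illustrative.

```python
def phi_coefficient(var_object, var_error, n_raters):
    """Absolute-decision coefficient for raters nested under the object."""
    return var_object / (var_object + var_error / n_raters)


def improvement_shares(var_object, var_error, max_raters=5):
    """Share of the total gain (1 rater -> max_raters) contributed by each step."""
    phis = [phi_coefficient(var_object, var_error, n)
            for n in range(1, max_raters + 1)]
    total_gain = phis[-1] - phis[0]
    return [(phis[i + 1] - phis[i]) / total_gain for i in range(len(phis) - 1)]


# Hypothetical variance components (illustrative only, not from Table 2).
shares = improvement_shares(var_object=0.85, var_error=0.15)

for step, share in enumerate(shares, start=1):
    print(f"{step} -> {step + 1} raters: {share:.1%} of total improvement")
```

Under these assumed values, the first added rater contributes well over half of the total gain and every later step contributes less than the one before it, which is the same qualitative shape as the pattern described above.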
Conclusion
For assessing the quality of early childhood programs in the cultural and social contexts of China, the CECPRS was developed, primarily as an adaptation of the widely known and used ECERS-R. Initial findings concerning the psychometric quality of CECPRS have been very promising. Multivariate generalizability theory (MG-theory) was used to examine the measurement reliability of CECPRS.
There are several notable findings from the MG-theory analyses. First, different subscales (e.g., Space and Furnishings, Language-Reasoning) showed some variation in terms of the variance component for the “object of measurement,” suggesting that the kindergarten classrooms differed more on some aspects of quality (e.g., Space and Furnishings) than on others (e.g., Language-Reasoning). Substantively, some dimensions may be easier for the raters to observe and evaluate than others, resulting in different degrees of measurement reliability. For example, Subscale 1 (Space and Furnishings) measures the physical environment, such as indoor or outdoor spaces and equipment, which are relatively easy to observe and evaluate. This may explain the higher reliability estimate of this subscale than, say, Subscale 6 (Language-Reasoning), which may be harder for the raters to observe and evaluate, thus resulting in a lower reliability estimate.
Second, the D-study findings revealed a pattern of “diminishing returns,” suggesting that the optimal number of raters for each classroom could be two. Using more than two raters would increase the cost of measurement, while only resulting in minimal increase of measurement reliability.
In conclusion, the empirical findings in this study, together with those in previous studies (Li et al., 2013a, 2013b), indicate that, for measuring program quality of ECEC in the Chinese cultural/social contexts, the CECPRS has demonstrated good psychometric characteristics as related to measurement reliability. Moreover, based on the findings of the MG-theory analyses presented above, we recommend using two trained raters per classroom to achieve a reasonable balance between the consideration for good measurement reliability and the consideration for cost/efficiency. Furthermore, the application of MG-theory in evaluating a measurement instrument for ECEC program quality is a breakthrough from the traditional methods, as MG-theory allows a researcher to consider possibilities not readily available under the traditional reliability methods.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research and preparation of this manuscript have been supported by the following grants: (a) A Multidimensional and Integrated View Approach to Chinese Preschool Evaluation from the National Social Science Foundation for Young Scholars of China (Grant CHA110131); (b) A National Mechanism and Policy Study to Ensure Every Age-Qualifying Child a Basic and Quality Preschool Education (Grant AHA110004); (c) A Psychometric Study of Measures for Assessing the Quality of Preschool Education from the National Ministry of Education for Young Scholars of China (Grant 13YJCZH011).
