Measurement Quality of the Chinese Early Childhood Program Rating Scale

Abstract

Adapted from the Early Childhood Environment Rating Scale–Revised, the Chinese Early Childhood Program Rating Scale (CECPRS) is a culturally comparable measure for assessing the quality of early childhood education and care programs in the Chinese cultural/social contexts. In this study, 176 kindergarten classrooms were rated with CECPRS on eight dimensions of program quality. Multivariate generalizability theory (MG-theory) was applied to evaluate the reliability of CECPRS ratings. Empirical findings indicate that CECPRS ratings showed excellent reliability, both on the subscales and for the composite. Based on the findings of the D-study for different measurement protocol scenarios, for achieving a good balance between reliability and cost/efficiency, we recommend using two independent raters for each kindergarten classroom. Furthermore, the application of MG-theory in this context broke new ground in the assessment of child care program quality, as the MG-theory provides research and practical possibilities not readily available under the traditional reliability methods.

Keywords

reliability, multivariate generalizability theory (MG-theory), program quality, early childhood education and care

Governments in many countries place a high priority on the quality of Early Childhood Education and Care (ECEC) programs. Accordingly, early childhood researchers are actively seeking psychometrically sound (e.g., reliable and valid) methods and approaches to measure the quality of ECEC programs (Fenech, 2011). In the United States, research on ECEC quality and child development outcomes has been providing guidance for policy formation and classroom practice (e.g., National Institute of Child Health and Human Development [NICHD] Early Child Care Research Network [ECCRN], 1999). In such research, the Early Childhood Environment Rating Scale–Revised (ECERS-R; Harms, Clifford, & Cryer, 1998/2005) has been the main tool for measuring process quality of ECEC services.

Quality Evaluation for ECEC in China

In China, ECEC government agencies and researchers have also been working to develop a system (generally known as the “Kindergarten Quality Rating System,” or KQRS) that can guide them in defining, measuring, and improving the quality of current ECEC program services. In 2010, the State Council of China (the executive branch of the Chinese government) called on the ECEC community to establish an effective KQRS to address the pressing issue of preschool education quality and its measurement. Unfortunately, existing research provides little information in this regard (Hu & Li, 2012). In fact, policy makers have primarily relied on ECEC regulations, rather than on empirical research evidence, to guide their decisions on quality assessment of ECEC programs (Hu & Li, 2012; Pan, Liu, & Lau, 2010).

One major problem is that there has been little empirical evidence for the psychometric quality of the evaluation criteria that most KQRS currently use in practice. Among the limited number of studies, Pan et al. (2010) found that the KQRS used in Beijing (China’s capital, with a population close to 20 million) could not differentiate kindergartens categorized as “superior” from those categorized as “good” by another, more commonly used holistic quality categorization system. Other scholars have criticized the current rating systems’ overemphasis on structural features, such as kindergarten administration, and their inadequate attention to educational activities (Dai & Liu, 2003). Hu, Li, and Zhao (2013) used the ECERS-R to evaluate the validity of the KQRS used in Zhejiang province (a province on China’s east coast, with a population of more than 50 million and a relatively developed economy among the provinces in China), and they reached a similar conclusion: Zhejiang’s KQRS lacks focus on measuring process quality. Early childhood education scholars in China have recommended the ECERS-R as the most suitable tool for adoption and adaptation for Chinese contexts (e.g., Hu & Zhu, 2009).

Adaptation Process of the ECERS-R

The ECERS-R (Harms et al., 1998/2005) has been the most frequently utilized observational tool for measuring the quality of ECEC in empirical research (Fenech, 2011) and in the quality rating and improvement system (QRIS) in the United States (Tout et al., 2010). The ECERS-R has been translated into more than 14 languages, and a rich body of research literature has been accumulated regarding its psychometric quality (e.g., reliability and validity) in international contexts. While most countries have fully adopted the ECERS-R without any adaptations, scholars (e.g., Tobin, 2005) have pointed out that simple translation of the instrument into another language would be grossly inadequate; instead, researchers should undertake a systematic approach to study and understand the cultural congruence between the concepts of quality measured in the ECERS-R and the sociocultural and policy contexts where the adapted tool will be used. Only through such careful and systematic explorations can the researchers gain sufficient understanding about how to adapt the tool for a new cultural and social environment.

Considerations of Cultural Differences and Content Validity

To contextualize the ECERS-R for Chinese ECEC, Hu et al. (2013) explored the quality of 105 classrooms in Zhejiang based on the ECERS-R. More specifically, Hu et al., using both ECERS-R and the current Zhejiang’s KQRS system, examined the degrees of congruence between the two evaluation systems (i.e., ECERS-R and Zhejiang’s KQRS) on various quality concepts. They identified some underlying cultural and contextual reasons for differences found in the concepts of quality underlying the two evaluation systems. Hu et al. also identified some culturally unique quality concepts that reflect the strengths/values of Chinese ECEC, such as collectivism and related teaching methods (i.e., whole-group instruction), to be incorporated into the adapted ECERS-R. As a result, numerous revisions and additions have been made to the scoring criteria and to the examples to improve the cultural and social relevance of the tool.

Moreover, Hu et al. (2013), based on their review and analysis of the international literature regarding the psychometric properties of the ECERS-R, made an effort to incorporate important findings into the adaptation of ECERS-R for Chinese cultural and social contexts. For example, two large international studies have suggested that the ECERS-R is not sensitive in differentiating programs in the upper range of the quality spectrum (e.g., from excellent to good), but more sensitive in differentiating programs in the lower range (e.g., those scored below the mean; M. C. Lambert et al., 2008; Mathers, Linskey, Seddon, & Sylva, 2007). Mathers et al. (2007) further suggested additional assessment items for in-depth adult interactions in curricular activities, similar to the conclusions by Sylva et al. (2006), whose research led to the ECERS-Extension to academic-related skills in different curricular domains in England. Therefore, in our adaptation research work, the research team worked on strengthening the evaluation questions and scoring criteria most appropriate for ECEC program quality in Chinese cultural and social contexts. Furthermore, the research team solicited and carefully examined ECERS-R raters’ and expert reviewers’ perceptions of the cultural appropriateness of both adopted items and adapted items based on the ECERS-R. Based on extensive research and empirical findings, the research team developed the Chinese Early Childhood Program Rating Scale (CECPRS), an adapted version of the ECERS-R. Hu et al. presented more details of this adaptation work.

Considerations of Psychometric Properties

In general, empirical studies have supported the psychometric quality of the use of ECERS-R (e.g., reliability and validity) both in the United States (e.g., Burchinal, Howes, & Kontos, 2002; R. Lambert, Abbott-Shim, & McCarty, 2002), and in the international contexts (e.g., Mathers et al., 2007; Rentzou, 2010). However, the findings concerning the number of dimensions of ECERS-R have been mixed, with the results comprising a one-dimensional global quality (e.g., Perlman, Zellman, & Le, 2004), a two-factor model (Cassidy, Hestenes, Hegde, Hestenes, & Mims, 2005), and a three-factor model (i.e., Gordon, Fujimoto, Kaestner, Korenman, & Abner, 2012).

Using Item Response Theory (IRT), Gordon et al. (2012) identified that ECERS-R’s “stop scoring” response process could potentially cause a “rating category disordering.” Another main cause of the disordering, based on Gordon et al., is related to “the mixing of different aspects of quality within the indicators of single items” (p. 11). The fact that the scoring process requires numerous subjective raters and its reliance on teachers’ report of information to finish scoring has also contributed to the disordering issues. Gordon et al. thus concluded that ECERS-R has measurement flaws that should be avoided by future researchers when developing tools for quality evaluation.

To adapt ECERS-R to a tool (i.e., CECPRS) for assessing Chinese ECEC program quality, the researchers have made several adaptations to avoid the “rating category disorder.” First, we required that raters score all indicators, treating the CECPRS as a checklist, rather than a scale (Gordon et al., 2012). Second, in order to differentiate specific concepts of quality within an item, we created subitems for specific aspects of quality in relation to the four indicators of a single item (Gordon et al., 2012). As an example, Table 1 shows the adapted version of the item “Indoor space” in CECPRS.

Table 1.

Item 1: Indoor Space.

	Score
	Inadequate		Minimal		Good		Excellent
Subitem	1	2	3	4	5	6	7
1.1. Space and classroom set-up	1.1.1. No playing areas in the classroom; space is crowded (e.g., chairs and desks occupy the whole classroom without free space); severely insufficient space per child.		1.1.3. Enough indoor space for children, adults, and furnishings; there is free space for children’s free play, or there is an area which can be changed into a free play place for children.		1.1.5. Ample indoor space with independent areas for collective teaching, children’s play, and napping; children and adults can move around freely (e.g., tables and desks do not block the way).		1.1.7. There is an area especially designed for children to take naps. Extra specialized activity rooms are available for children in the observed class.
1.2. Infrastructure	1.2.1. Space lacks adequate lights, ventilation, temperature control or sound-absorbing materials.		1.2.3. Adequate lights, ventilation, temperature control with a heater or AC and sound-absorbing materials; no noises.		1.2.5. Ventilation is good and can be controlled; some natural lighting; temperature control with a heater or AC.		1.2.7. Natural lighting can be controlled; multiple sound-absorbing materials like soft wall wrapper, carpet, rug, and other safe stuff.
1.3. Safety and decoration	1.3.1. Space in extremely unsafe and poor conditions (e.g., cracks on the walls or the roof; unsafe floors; the classroom is evaluated as dangerous building by the government); space in poor condition.		1.3.3. No safety problems; space is reasonably maintained with basic conditions.		1.3.5. Space is properly decorated with safe and environment friendly materials; space’s condition can meet children’s needs of being healthy and engaging in activities.		1.3.7. Space is nicely decorated, kept in good condition with periodic inspection.
1.4. Cleaning and maintenance	1.4.1. Space poorly maintained (e.g., floors left sticky or dirty; trashcans overflowing; mosquito and bug problems).		1.4.3. Adequate cleaning and maintenance with few mosquito and bug problems.		1.4.5. Space is well cleaned and maintained. No mosquito or bug problems.		1.4.7. Use professional cleaning services regularly.

The addition of assessment subitems (shown in Table 1) has significantly increased the logical organization of the assessment questions under each item; however, this has led to the reorganization of the content of each item. Finally, these two additional features (i.e., to score all indicators, and the inclusion of the subitems) have allowed us to calculate scores for each subitem under each item.

Purpose of This Study

For CECPRS, an instrument adapted from ECERS-R for the Chinese cultural and social contexts, it is imperative that the psychometric characteristics (e.g., reliability and validity) of using CECPRS be systematically evaluated. Research work on CECPRS has been ongoing. Up to now, the psychometric research work for CECPRS has included the following: correlational analyses among subscales of CECPRS, interrater reliability, internal consistency reliability (e.g., Cronbach’s coefficient α), exploratory factor analysis, and confirmatory factor analysis (Li, Pan, & Hu, 2013a).

The initial empirical findings about the psychometric quality of the use of CECPRS have been very promising. For example, Cronbach’s coefficient alphas are typically around .90 for subscales and about .95 for the total scale. The factor analysis work suggested four substantively meaningful factors (quality dimensions): (a) physical environment and provisions, (b) curricula structure and child-initiated activities, (c) teacher-led group teaching, and (d) supervision and interactions. Interrater reliability estimates (in the form of kappa, κ) at the item rating level suggested substantial (κ = .60-.80) to outstanding (κ > .80) interrater agreement for almost all the items (Li et al., 2013a). In addition, initial research findings also show that the CECPRS score is statistically associated with child developmental outcomes (e.g., movement, cognition, language; Li, Pan, & Hu, 2013b), suggesting good criterion-related validity evidence (part of construct validity evidence) for CECPRS. As part of the programmatic research efforts related to CECPRS, this study was conducted to extend the research (Li et al., 2013a, 2013b) on the psychometric quality of CECPRS use. More specifically, this study examines the measurement reliability of CECPRS under the framework of generalizability theory (G-theory). It should be noted that the psychometric research described above did not involve the same data as used in this study.

Application of G-Theory for CECPRS

G-theory (Brennan, 2010; Cronbach, Gleser, Nanda, & Rajaratnam, 1972) has emerged as a comprehensive framework for measurement reliability. G-theory subsumes other forms of reliability approaches (e.g., internal consistency reliability, interrater reliability, intraclass correlation, etc.), and provides a comprehensive and unifying framework (Fan & Sun, 2013) for assessing measurement reliability, especially for complex measurement situations.

Relative versus absolute decisions

Classical test theory focuses on norm-referenced score interpretation; that is, reliability is concerned with the consistency of the relative standings of individuals, but not with the consistency of the actual scores. Within the G-theory framework, this type of interpretation is called “relative decision.” Criterion-referenced score interpretation is concerned with both the consistency of the relative standings of individuals and the consistency of actual scores (i.e., possible score change). This criterion-referenced perspective is called “absolute decision” in G-theory. Two types of generalizability (reliability) coefficients correspond to these two types of decisions: the “relative decision” generalizability coefficient $ρ^{2}$ and the “absolute decision” generalizability coefficient φ (Brennan, 2010; Shavelson & Webb, 1991).
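As a hedged illustration of the distinction (not part of the analyses reported in this study), consider a fully crossed one-facet design in which every rater rates every classroom. Under that assumed design, and using the same variance-component symbols that appear later in this article, the two coefficients are conventionally written as:

```latex
ρ^{2} = \frac{σ_{C}^{2}}{σ_{C}^{2} + σ_{r c, e}^{2} / n_{r}},
\qquad
φ = \frac{σ_{C}^{2}}{σ_{C}^{2} + σ_{r}^{2} / n_{r} + σ_{r c, e}^{2} / n_{r}}
```

The rater main effect $σ_{r}^{2}$ (overall leniency or stringency) enters only the absolute error term, so φ can never exceed $ρ^{2}$ for the same data.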

Researchers are usually familiar with the interrater reliability coefficient, which is a “relative decision” reliability estimate. In using CECPRS, however, the measurement interest is not only about the consistency of relative standings of the programs, but also about the consistency of the actual rating scores across raters, because these scores represent the quality differences across the programs. Because of this, the appropriate reliability coefficient should be “absolute decision” generalizability coefficient φ.

Planning for optimal measurement protocol

Conventional reliability approaches are typically post hoc; that is, measurement reliability is computed after the fact. G-theory, however, can be used proactively in planning for a better measurement protocol. A G-theory analysis may have two stages: a G-study and a D-study. The G-study serves as a “pilot” study that provides information (e.g., variance components for different sources) for planning a future measurement study. In the D-study, the information from the G-study is used to plan an “optimal” measurement protocol so that the best possible reliability can be achieved while balancing other factors (e.g., cost and effort). This flexibility and forecasting capability are not generally available in conventional reliability approaches (Fan & Sun, 2013; Shavelson & Webb, 1991).

The application of G-theory has advanced from the univariate G-theory method to the multivariate G-theory (MG-theory) method. MG-theory application is appropriate for multidimensional and complicated measurement situations. MG-theory application offers methodological advantages: The analysis and the estimation process take into account not only the variances (variance components), but also the covariance structure, of the dimensions. Reliabilities of all dimensions are estimated simultaneously, rather than each dimension in isolation (univariate G-theory).

Evidently, evaluation of ECEC quality is a multidimensional measurement process. Conventional reliability methods do not have the capacity of handling multiple dimensions simultaneously. In addition, conventional approach of interrater reliability cannot handle the situation where different kindergarten classrooms may be rated by different (teams of) raters (i.e., raters nested under classroom), which can be a common practice in the evaluation of ECEC programs. Furthermore, as indicated in previous discussion, the G-study and D-study procedures provide the flexibility in exploring “optimal” measurement protocol. Based on these considerations, MG-theory is a more appropriate approach for reliability analysis when conducting a multidimensional evaluation involving multiple raters (Clauser, Harik, & Margolis, 2006; Yang, Chang, & Ma, 2004).

Method

Participants

This study took place in a large province on China’s east coast (i.e., Zhejiang province). In the study, 91 kindergartens (in China, a kindergarten is typically a school with multiple, or even many, classrooms for children of the same or different ages) from six municipalities were included, with each pair of municipalities representing one of the three levels of economic development (advanced, average, below average), as categorized by the Zhejiang provincial government. The selection of the 91 kindergartens was stratified on the basis of current provincial government ratings of quality (high, moderate, and low quality), location (urban, county, township, and village), and funding source (public and private). Finally, from the 91 kindergartens, a sample of 176 kindergarten classrooms was randomly selected for inclusion in this study. The age make-up of the classrooms was as follows: 45 classrooms for 3- to 4-year-olds, 51 classrooms for 4- to 5-year-olds, 74 classrooms for 5- to 6-year-olds, and the remaining 6 classrooms for mixed-age children.

Measures

The CECPRS described previously was used to measure the program quality of the kindergarten classrooms. The CECPRS uses the same 7-point Likert-type scale as ECERS-R to rate ECEC program quality in Chinese social and cultural contexts, with 1 representing inadequate quality and 7 representing excellent quality. The CECPRS is made up of eight subscales: (a) Space and Furnishing, (b) Personal Care Routines, (c) Curriculum Planning and Implementation, (d) Whole-Group Instruction, (e) Activities, (f) Language-Reasoning, (g) Guidance and Interactions, and (h) Parents and Staff. The CECPRS has 51 items, 177 subitems, and a total of 700 indicators. More details about CECPRS can be found in Li et al. (2013a, 2013b).

Design

Using CECPRS, two raters (r) observed and rated each of the 176 classrooms (c). The raters were graduate students majoring in early childhood education, and they had received rigorous training in using CECPRS for rating kindergarten classroom quality. The training consisted of 2 days of lecture and 4 days of field practice. Specifically, in training practice, these raters achieved interrater reliability coefficient of .85 before they were allowed to rate real kindergarten classrooms. In this measurement situation, rater is the “facet,” that is, a potential measurement error source, as raters may be inconsistent from rating one classroom to another (rater-classroom interaction effect), and raters may also differ in terms of their general tendency of being lenient or stringent (rater effect). Because CECPRS has eight subscales (dimensions), all of which were rated by raters nested in classrooms, the measurement design of this study is a multidimensional nested design with the single facet of raters nested under classroom (i.e., r:c).

Having a nested design is typically not ideal, and should be avoided if possible, as such a design may “leave important questions unanswered” (Shavelson & Webb, 1991, p. 52). More specifically, nested design makes it impossible to estimate all variance components. Whether a nested design is usable in a G-theory study depends on the purpose of the study, and on which type of generalizability coefficient (relative decision, absolute decision) is needed or desired in a study. Fortunately, in this study, the nested design was usable, as it did not compromise the purpose of our study. Because we were interested in obtaining the G-theory absolute decision φ coefficient, the confounding introduced by the nested design would not affect the computation of this coefficient, as discussed later.

Results and Discussion

The G-Study Results

As discussed previously, G-theory applications may include a G-study and a D-study. As the first step, the G-study includes design, data collection, and estimation of the relevant variance components under the design conditions (Shavelson & Webb, 1991). Once the variance components of all the relevant sources have been estimated, they can be used in the D-study to plan for future measurement protocols. In this study, mGENOVA (Brennan, 2001) was used for conducting all G-theory analyses.
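The estimation step can be illustrated with a deliberately simplified, univariate sketch. It is not the mGENOVA procedure used in this study; it only shows how, for one subscale in a balanced raters-nested-in-classrooms design, the two variance components could be recovered from the usual one-way random-effects ANOVA mean squares. The function name and the simulated scores are hypothetical.

```python
import random


def nested_variance_components(scores):
    """Estimate (var_c, var_rc) for a balanced r:c design.

    `scores` is a list of classrooms, each a list of ratings given by
    that classroom's own raters (same number of raters per classroom).
    """
    n_c = len(scores)
    n_r = len(scores[0])
    class_means = [sum(row) / n_r for row in scores]
    grand_mean = sum(class_means) / n_c
    # Mean square between classrooms.
    ms_c = n_r * sum((m - grand_mean) ** 2 for m in class_means) / (n_c - 1)
    # Mean square for raters within classrooms.
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(scores, class_means) for x in row
    )
    ms_rc = ss_within / (n_c * (n_r - 1))
    # Expected mean squares: E[MS_c] = var_rc + n_r * var_c, E[MS_rc] = var_rc.
    return (ms_c - ms_rc) / n_r, ms_rc


# Hypothetical data: 2,000 classrooms, 2 raters each,
# true classroom SD = 1.0 and rater-within-classroom SD = 0.3.
rng = random.Random(0)
simulated = []
for _ in range(2000):
    quality = rng.gauss(4.0, 1.0)
    simulated.append([quality + rng.gauss(0.0, 0.3) for _ in range(2)])

var_c, var_rc = nested_variance_components(simulated)
print(f"var_c = {var_c:.3f}, var_rc = {var_rc:.3f}")
```

With these simulation settings the estimates should land near 1.0 and 0.09, the squared standard deviations used to generate the data. The multivariate analysis additionally estimates covariance components between subscales, which this sketch omits.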

Table 2 presents the G-study results for all variance components (the diagonal elements) and covariance components (i.e., covariation among the subscales) of the eight dimensions of CECPRS. Each variance component represents the estimated “true score” variance across the classrooms ( $σ_{C}^{2}$ ) on that specific dimension of ECEC program quality (Fan & Sun, 2013; Shavelson & Webb, 1991). Based on the results, the variance component of the first subscale (i.e., Space and Furnishing) was the largest, followed by the second subscale (i.e., Personal Care Routines) and the eighth subscale (i.e., Parents and Staff). The variance component of the sixth subscale (Language-Reasoning) was the lowest. Such information suggests that, relatively speaking, the kindergarten classrooms differed most on the dimension of Space and Furnishing, and they are most similar on the dimension of Language-Reasoning.

Table 2.

Estimated G-Study Variance and Covariance Components.

	Subscale
Source	One	Two	Three	Four	Five	Six	Seven	Eight
c	1.2062	0.8819	0.8004	0.6522	0.8951	0.8041	0.7239	0.8701
	1.0296	1.1298	0.8555	0.7941	0.8585	0.8645	0.8428	0.9102
	0.8619	0.8917	0.9613	0.7657	0.8508	0.8336	0.8026	0.8003
	0.6538	0.7704	0.6852	0.8329	0.6811	0.8320	0.8481	0.7577
	0.8945	0.8303	0.7590	0.5656	0.8279	0.8145	0.7209	0.8544
	0.7293	0.7588	0.6749	0.6271	0.6120	0.6819	0.8597	0.8555
	0.7572	0.8533	0.7495	0.7372	0.6247	0.6761	0.9070	0.8282
	0.9712	0.9832	0.7975	0.7028	0.7900	0.7180	0.8016	1.0327
r:c	0.0811	0.4725	0.2227	0.3296	0.4517	0.3642	0.1901	0.1609
	0.0475	0.1246	0.2016	0.2122	0.2482	0.2930	0.3200	0.3410
	0.0221	0.0248	0.1214	0.3516	0.2797	0.3734	0.2114	0.1703
	0.0351	0.0280	0.0458	0.1398	0.2719	0.4437	0.4032	0.1200
	0.0367	0.0250	0.0278	0.0290	0.0814	0.3766	0.2296	−0.0905
	0.0362	0.0361	0.0454	0.0579	0.0375	0.1218	0.3654	0.1610
	0.0219	0.0457	0.0298	0.0610	0.0265	0.0516	0.1637	0.1985
	0.0190	0.0499	0.0246	0.0186	−0.0107	0.0233	0.0333	0.1719

Note. The diagonal elements are variance components; the lower diagonal elements are covariances; the upper diagonal elements are correlations.

The correlation coefficients among the subscales for the classroom effect were at least .65, suggesting that the eight dimensions are sufficiently related to each other to form the foundation for a composite measure. The variance component for the nested rater effect ( $σ_{r : c}^{2}$ for raters nested under classrooms) theoretically consists of two parts: the variance component for the rater effect ( $σ_{r}^{2}$ ), and the variance component for the interaction effect between rater and classroom, which is itself confounded with the residual ( $σ_{r c, e}^{2}$ ). The nested design in the G-study (i.e., rater nested under classroom) makes these two parts confounded and inseparable.
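The relation between the lower-triangle covariances and the upper-triangle correlations in Table 2 can be checked with a short sketch. The helper name is hypothetical; the inputs are the Table 2 entries for Subscales One and Two.

```python
import math


def correlation(cov_ij, var_i, var_j):
    """Convert a covariance component into a correlation."""
    return cov_ij / math.sqrt(var_i * var_j)


# Classroom (c) source: cov(One, Two) = 1.0296, variances 1.2062 and 1.1298.
r_c = correlation(1.0296, 1.2062, 1.1298)

# Rater-within-classroom (r:c) source: cov = 0.0475, variances 0.0811 and 0.1246.
r_rc = correlation(0.0475, 0.0811, 0.1246)

print(f"c: {r_c:.4f}  r:c: {r_rc:.4f}")
```

Both values agree with the tabled correlations (.8819 and .4725) to about three decimal places; small differences in the last digit are expected because the tabled components are themselves rounded.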

The D-Study Results

Once the variance components from the G-study results are available, these can then be used in the D-study to investigate how a better measurement protocol can be designed (Brennan, 2010; Fan & Sun, 2013; Shavelson & Webb, 1991). For example, the measurement protocol can be modified by adding or reducing the number of raters, increasing or decreasing the number of items, and so on. The impact of these measurement protocol modifications on measurement reliability is then evaluated so that the “optimal” measurement protocol can be planned. In the following, we discuss the results from (a) the original measurement protocol, and (b) the new measurement protocol by modifying the number of raters under the nested design.

D-Study Results Under the Original Measurement Protocol

Table 3 presents the results of the D-study under the original measurement protocol used in our G-study; that is, each classroom was rated by two raters, and the raters were nested under classroom. In the application of CECPRS, the ratings on each dimension represent kindergarten classroom quality levels, and a higher rating score represents better classroom quality than a lower rating score. Because of this, it is insufficient to consider only the consistency of relative rankings across raters; the consistency of actual rating score values across the raters is also relevant. With these considerations, the generalizability coefficient for absolute decision (φ) is the appropriate form of reliability coefficient:

$$ φ = \frac{σ_{C}^{2}}{σ_{C}^{2} + \left( \frac{σ_{r}^{2}}{n_{r}} + \frac{σ_{r c, e}^{2}}{n_{r}} \right)} = \frac{σ_{C}^{2}}{σ_{C}^{2} + σ_{r : c}^{2} / n_{r}} . $$

Table 3.

D-Study Results Under Original Measurement Protocol.

	Subscales
	One	Two	Three	Four	Five	Six	Seven	Eight
True score variance ( $σ_{C}^{2}$ )	1.2062	1.1298	0.9613	0.8329	0.8279	0.6819	0.9070	1.0328
Error variance (absolute decision) ( $σ_{r : c}^{2} / n_{r}$ )	0.0406	0.0623	0.0607	0.0699	0.0407	0.0609	0.0819	0.0860
φ	0.9674	0.9477	0.9405	0.9225	0.9531	0.9180	0.9172	0.9231
S/N (absolute decision)	29.7252	18.1238	15.8281	11.9099	20.3221	11.1972	11.0823	12.0117

Note. S/N = signal-to-noise ratio.

In the formula above, as discussed earlier, $σ_{r}^{2}$ is the variance component for the rater effect, and $σ_{r c, e}^{2}$ is the variance component for the rater-classroom interaction effect plus residual. Due to the nested design, however, it is not possible to separate the rater effect from the rater-classroom interaction effect; as a result, these two effects and the residual were confounded under $σ_{r : c}^{2}$ . For our purpose of obtaining the absolute decision generalizability coefficient φ, this confounding did not pose any problem, because both terms (the rater effect $σ_{r}^{2}$ and the rater-classroom interaction effect plus residual $σ_{r c, e}^{2}$ ) enter the absolute error variance and are jointly captured by $σ_{r : c}^{2}$ .
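The formula can be applied directly to the diagonal entries of Table 2. The following is a minimal sketch with hypothetical function and variable names; it is not the mGENOVA computation itself, and it works from rounded table values, so the last digits may differ slightly from Table 3.

```python
def phi(var_c, var_rc, n_r):
    """Absolute-decision coefficient for the nested r:c design."""
    return var_c / (var_c + var_rc / n_r)


def signal_to_noise(var_c, var_rc, n_r):
    """Ratio of universe-score variance to absolute error variance."""
    return var_c / (var_rc / n_r)


# Diagonal elements of Table 2, Subscales One through Eight.
VAR_C = [1.2062, 1.1298, 0.9613, 0.8329, 0.8279, 0.6819, 0.9070, 1.0327]
VAR_RC = [0.0811, 0.1246, 0.1214, 0.1398, 0.0814, 0.1218, 0.1637, 0.1719]

N_RATERS = 2  # the original protocol: two raters per classroom

for k, (vc, vrc) in enumerate(zip(VAR_C, VAR_RC), start=1):
    print(
        f"Subscale {k}: error = {vrc / N_RATERS:.4f}, "
        f"phi = {phi(vc, vrc, N_RATERS):.4f}, "
        f"S/N = {signal_to_noise(vc, vrc, N_RATERS):.2f}"
    )
```

Because S/N = φ / (1 − φ), the two rows of Table 3 carry the same information on different scales.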

Each φ coefficient in Table 3 is the ratio of “true score” variance (i.e., variance component for the “object of measurement”: $σ_{C}^{2}$ ) to the “total score” variance, which is the sum of the “true score” variance and the error variance for the given number of raters (i.e., $σ_{C}^{2} + σ_{r : c}^{2} / n_{r}$ ). The φ coefficient is the estimated reliability coefficient for each scale for criterion-referenced interpretation (i.e., “absolute decision”). The greater the φ value, the higher the precision of the measurement (Brennan, 2010; Fan & Sun, 2013). From Table 3, we can see that, when two raters were used for each classroom, the lowest reliability estimate among the subscales was .9172 (Subscale 7: Guidance and Interactions), while the reliability for Subscale 1 (Space and Furnishing) was the highest (φ = .9674). The very small absolute error variances for all the subscales suggest good consistency of ratings on the subscales. Furthermore, the composite φ was .9724 (not shown in Table 3), indicating a very high level of measurement reliability for CECPRS as a whole. The signal-to-noise (S/N) ratio is the ratio of the “true score” variance ( $σ_{C}^{2}$ ) to the error variance ( $σ_{r : c}^{2} / n_{r}$ ). For example, the S/N for Subscale 3 was 15.8281, which means the “true score” variance was nearly 16 times as large as the error variance.
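The composite coefficient depends on how the eight subscales are weighted, which is not stated here. As a sketch under the assumption of equal weights (an assumption of this example, not a documented setting of the analysis), the composite universe-score and error variances are the weighted sums of all variance and covariance components in Table 2:

```python
# Lower triangles (diagonal included) of the Table 2 matrices, row by row.
C_LOWER = [
    [1.2062],
    [1.0296, 1.1298],
    [0.8619, 0.8917, 0.9613],
    [0.6538, 0.7704, 0.6852, 0.8329],
    [0.8945, 0.8303, 0.7590, 0.5656, 0.8279],
    [0.7293, 0.7588, 0.6749, 0.6271, 0.6120, 0.6819],
    [0.7572, 0.8533, 0.7495, 0.7372, 0.6247, 0.6761, 0.9070],
    [0.9712, 0.9832, 0.7975, 0.7028, 0.7900, 0.7180, 0.8016, 1.0327],
]
RC_LOWER = [
    [0.0811],
    [0.0475, 0.1246],
    [0.0221, 0.0248, 0.1214],
    [0.0351, 0.0280, 0.0458, 0.1398],
    [0.0367, 0.0250, 0.0278, 0.0290, 0.0814],
    [0.0362, 0.0361, 0.0454, 0.0579, 0.0375, 0.1218],
    [0.0219, 0.0457, 0.0298, 0.0610, 0.0265, 0.0516, 0.1637],
    [0.0190, 0.0499, 0.0246, 0.0186, -0.0107, 0.0233, 0.0333, 0.1719],
]


def weighted_variance(lower, weights):
    """Compute w' S w for a symmetric matrix S given by its lower triangle."""
    total = 0.0
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            term = weights[i] * weights[j] * value
            # Off-diagonal entries appear twice in the full symmetric matrix.
            total += term if i == j else 2.0 * term
    return total


def composite_phi(c_lower, rc_lower, weights, n_r):
    """Absolute-decision coefficient for a weighted composite score."""
    var_c = weighted_variance(c_lower, weights)
    var_error = weighted_variance(rc_lower, weights) / n_r
    return var_c / (var_c + var_error)


equal_weights = [1.0 / 8] * 8  # assumed weighting, for illustration only
composite = composite_phi(C_LOWER, RC_LOWER, equal_weights, n_r=2)
print(f"composite phi (equal weights, 2 raters): {composite:.4f}")
```

Under this assumption the result falls very close to the reported composite of .9724. The composite exceeds every subscale coefficient because the classroom covariances are large while the rater-within-classroom covariances are small, so error largely averages out across subscales.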

D-Study Results Under New Measurement Protocols

Evaluation of ECEC program quality is a very arduous, time-consuming, and complicated process. To explore for better measurement protocol, we considered a series of “what if?” scenarios, that is, how the number of raters would affect the measurement reliability of CECPRS applications. This process allows us to balance measurement considerations (e.g., reliability) and practical considerations (e.g., cost/efficiency for using different number of raters).

In our “what if?” scenarios, we considered 1 to 5 (i.e., n_r = 1, 2, 3, 4, or 5) raters for each classroom. These “what if” scenarios of using different numbers of raters in the D-study allow us to estimate the impact of changing this measurement condition on the reliability of CECPRS applications. Figure 1 graphically presents the results of these “what if” scenarios for examining the impact of using different number of raters on measurement reliabilities (φ) of CECPRS’s eight subscales and the overall CECPRS composite. It is shown in Figure 1 that the generalizability coefficients of all subscales were above .8 even for using one trained rater, suggesting that, in CECPRS applications, reliability for ECEC program quality ratings is generally good, but there is some variation across the subscales.

Figure 1.

φ coefficient curves of eight subscales and the composite (dashed curve) of CECPRS for using different number of raters (n_r).

More relevant and interesting is the pattern in Figure 1. As theoretically expected, reliability estimates increase as the number of raters increases. The more interesting information in Figure 1, however, is that the largest increase (i.e., the steepest section of the curves) occurred between one rater and two raters. After the two-rater scenario, the increase in the φ coefficient gradually flattens out, resulting in a situation of “diminishing returns” for further increasing the number of raters. More specifically, of the total improvement in reliability from using one rater to using five raters, the step from one rater to two raters accounts for about 60%. After this step, the shares of the total improvement are approximately 20% (two to three raters), 10% (three to four raters), and 6% (four to five raters), respectively. This pattern suggests that, in CECPRS applications, using two trained raters could be considered the “optimal” measurement protocol for balancing the consideration for reliability (i.e., reaching a reasonable level of reliability) and the consideration for cost/efficiency. Although using more raters may further increase measurement reliability, the “diminishing returns” pattern suggests that it may not be worth the extra cost of having more than two raters for each ECEC classroom.
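The “what if?” calculation behind this pattern is simple to reproduce for a single subscale. The sketch below uses the Table 2 variance components for Subscale One and hypothetical function names; it illustrates the D-study logic and does not reproduce Figure 1.

```python
def phi(var_c, var_rc, n_r):
    """Absolute-decision coefficient for n_r raters nested in classrooms."""
    return var_c / (var_c + var_rc / n_r)


def improvement_shares(var_c, var_rc, max_raters=5):
    """Return phi for 1..max_raters and each step's share of the total gain."""
    phis = [phi(var_c, var_rc, n) for n in range(1, max_raters + 1)]
    total_gain = phis[-1] - phis[0]
    shares = [(b - a) / total_gain for a, b in zip(phis, phis[1:])]
    return phis, shares


# Subscale One (Table 2): classroom variance 1.2062, r:c variance 0.0811.
phis, shares = improvement_shares(1.2062, 0.0811)

for n, value in enumerate(phis, start=1):
    print(f"n_r = {n}: phi = {value:.4f}")
for step, share in enumerate(shares, start=1):
    print(f"{step} -> {step + 1} raters: {100 * share:.1f}% of total gain")
```

For this subscale, the first step accounts for roughly three fifths of the total gain and each later step contributes progressively less, consistent with the rounded percentages reported above.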

Conclusion

The CECPRS was developed to assess the quality of early childhood programs in the cultural and social contexts of China, and it is primarily an adaptation of the widely known and used ECERS-R. Initial findings concerning the psychometric quality of CECPRS have been very promising. In this study, multivariate generalizability theory (MG-theory) was used to examine the measurement reliability of CECPRS.

There are several notable findings from the MG-theory analyses. First, different subscales (e.g., Space and Furnishings, Language-Reasoning) showed some variation in the variance component for the “object of measurement,” suggesting that the kindergarten classrooms differed more on some aspects of quality (e.g., Space and Furnishings) than on others (e.g., Language-Reasoning). Substantively, some dimensions may be easier for the raters to observe and evaluate than others, resulting in different degrees of measurement reliability. For example, Subscale 1 (Space and Furnishings) measures the physical environment, such as indoor or outdoor spaces and equipment, which is relatively easy to observe and evaluate. This may explain why the reliability estimate of this subscale is higher than that of, say, Subscale 6 (Language-Reasoning), which may be harder for the raters to observe and evaluate and thus yields a lower reliability estimate.

Second, the D-study findings revealed a pattern of “diminishing returns,” suggesting that the optimal number of raters for each classroom could be two. Using more than two raters would increase the cost of measurement while yielding only a minimal increase in measurement reliability.

In conclusion, the empirical findings in this study, together with those in previous studies (Li et al., 2013a, 2013b), indicate that, for measuring program quality of ECEC in the Chinese cultural/social contexts, the CECPRS has demonstrated good psychometric characteristics with respect to measurement reliability. Moreover, based on the findings of the MG-theory analyses presented above, we recommend using two trained raters per classroom to achieve a reasonable balance between good measurement reliability and cost/efficiency. Furthermore, the application of MG-theory in evaluating measurement instruments for ECEC program quality is a breakthrough from the traditional methods, as MG-theory allows a researcher to consider possibilities not readily available under the traditional reliability methods.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research and preparation of this manuscript have been supported by the following grants: (a) A Multidimensional and Integrated View Approach to Chinese Preschool Evaluation from the National Social Science Foundation for Young Scholars of China (Grant CHA110131); (b) A National Mechanism and Policy Study to Ensure Every Age-Qualifying Child a Basic and Quality Preschool Education (Grant AHA110004); (c) A Psychometric Study of Measures for Assessing the Quality of Preschool Education from the National Ministry of Education for Young Scholars of China (Grant 13YJCZH011).

References

Brennan, R. L. (2001). Manual for mGENOVA. Iowa City: Iowa Testing Programs, University of Iowa.

Brennan, R. L. (2010). Generalizability theory. New York, NY: Springer.

Burchinal, Howes, & Kontos (2002). Structural predictors of child care quality in child care homes. Early Childhood Research Quarterly, 17, 87-105.

Cassidy, D. J., Hestenes, L. L., Hegde, Hestenes, & Mims (2005). Measurement of quality in preschool child care classrooms: An exploratory and confirmatory factor analysis of the early childhood environment rating scale-revised. Early Childhood Research Quarterly, 20, 345-360.

Clauser, B. E., Harik, & Margolis, M. J. (2006). A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. Journal of Educational Measurement, 43, 173-191.

Cronbach, L. J., Gleser, G. C., Nanda, & Rajaratnam (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: John Wiley.

Dai, S. X., & Liu (2003). 我国现行幼托机构教育质量评价工具研究 [Current evaluation tools for preschool quality rating in China]. Studies in Early Childhood Education, 7-8, 39-41.

Fan & Sun (2013). Generalizability theory as the unified reliability framework in adolescence research. Journal of Early Adolescence. Advance online publication. doi:10.1177/0272431613482044

Fenech (2011). An analysis of the conceptualization of “quality” in early childhood education and care empirical research: Promoting “blind spot” as foci for future research. Contemporary Issues in Early Childhood, 12, 102-117.

Gordon, R. A., Fujimoto, Kaestner, Korenman, & Abner (2012, April). An assessment of the validity of the ECERS–R with implications for measures of child care quality and relations to child development. Developmental Psychology. Advance online publication. doi:10.1037/a0027899

Harms, Clifford, R. M., & Cryer (2005). Early childhood environment rating scale (Rev. ed.). New York, NY: Teachers College Press. (Original work published 1998)

(2012). The quality rating system of Chinese preschool education: Prospects and challenges. Childhood Education, 88, 14-22.

Zhao (2013). Exploring cultural differences in two quality measures: The early childhood environment rating scale-revised and kindergarten quality rating system. Manuscript submitted for publication.

Zhu (2009). 美国《幼儿学习环境评量表》及在中国的初步应用 [The early childhood environment rating scale and its initial use in China]. Journal of Preschool Education, 11, 47-51.

Lambert, M. C., Williams, S. G., Morrison, J. W., Samms-Vaughan, M. E., Mayfield, W. A., & Thomberg, K. R. (2008). Are the indicators for the language and reasoning subscale of the early childhood environment rating scale-revised psychometrically appropriate for Caribbean classrooms? International Journal of Early Years Education, 16, 41-60.

Lambert, Abbott-Shim, & McCarty (2002). The relationship between classroom quality and ratings of the social functioning of Head Start children. Early Child Development and Care, 172, 231-245.

Pan (2013a). Chinese Early Childhood Program Rating Scale: A reliability and validity study. Manuscript submitted for publication.

Pan (2013b). Early childhood education quality in China and its associations with child outcomes. Manuscript submitted for publication.

Mathers, Linskey, Seddon, & Sylva (2007). Using quality rating scales for professional development: Experiences from the UK. International Journal of Early Years Education, 15, 261-274.

NICHD Early Child Care Research Network. (1999). Child outcomes when child care center classes meet recommended standards for quality. American Journal of Public Health, 89, 1072-1077.

Pan, Liu, & Lau (2010). Evaluation of the kindergarten quality rating system in Beijing. Early Education and Development, 2, 186-204.

Perlman & Zellman, G. L. (2004). Examining the psychometric properties of Early Childhood Environment Rating Scale-Revised (ECERS-R). Early Childhood Research Quarterly, 19, 398-412.

Rentzou (2010). Using the ACEI global guidelines assessment to evaluate the quality of early child care in Greek settings. Early Childhood Education Journal, 38, 75-80.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Sylva, Siraj-Blatchford, Taggart, Sammons, Melhuish, Elliot, & Totsika (2006). Capturing quality in early childhood through environmental rating scales. Early Childhood Research Quarterly, 21, 76-92.

Tobin (2005). Quality in early childhood education: An anthropologist’s perspective. Early Education and Development, 16, 421-434.

Tout, Starr, Soli, Moodie, Kirby, & Boller (2010). Compendium of quality rating systems and evaluations. Prepared for Office of Planning, Research and Evaluation, Administration for Children and Families, Department of Health and Human Services, Washington, DC.

Yang, Z. M., & Chang, S. Y. (2004). Multivariate generalizability analysis of the Chinese college entrance comprehensive examination. Acta Psychologica Sinica, 36, 195-200.