Using Teacher Ratings to Track the Growth and Development of Young Children Using the Teaching Strategies GOLD ® Assessment System

Abstract

An important consideration in determining the validity of an observational assessment measure for young children is the variability attributed to the child versus that ascribed to the assessor or to some other factor such as classroom context. The Teaching Strategies GOLD^® assessment system was used to elicit teacher ratings of a national sample of 21,592 children (age 12-59 months). Teacher ratings of child development and learning were associated in expected directions with both child demographic characteristics and classroom composition variables. Children with disabilities started behind their typically developing peers and grew slower, girls showed an advantage in some areas over boys, and English language learners (ELLs) were rated lower at the beginning of the year and showed some faster rates of growth than their native English-speaking peers.

Keywords

teacher rating scales, developmental assessment, child development

To ensure that all children are evaluated fairly, regardless of culture, language, or disabilities, assessment measures should be appropriate (National Association for the Education of Young Children & National Association of Early Childhood Specialists in State Departments of Education, 2003); reliable and valid (Snow & Van Hemel, 2008), including empirically valid (Hirsh-Pasek, Kochanoff, Newcombe, & de Villiers, 2005); and used for their intended purposes (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). The present study adds to the existing literature on teacher-based observational assessment, specifically authentic assessment, by examining the validity of the Teaching Strategies GOLD^® (Heroman, Burts, Berke, & Bickart, 2010).

The Teaching Strategies GOLD^® is a seamless, authentic assessment measure designed to be used by teachers to evaluate the development and learning of children, from birth through kindergarten, including English language learners (ELLs) and children with disabilities. The measure differs in several ways from other authentic assessment measures (e.g., Meisels, Bickel, Nicholson, Xue, & Atkins-Burnett, 2001; Meisels, Wen, & Beachy-Quick, 2010; Moreno & Klute, 2011; Schweinhart, McNair, Barnes, & Larner, 1993), including the range of ages and domains measured and item-level scale points and behavioral anchors and the size, diversity, geographic locations, and program types included in the samples used in validation studies. Teachers regularly gather information related to 38 research-based objectives through observations, conversations with children and families, artifacts, and so forth. Assessment information is used to plan appropriate experiences; individualize instruction; monitor child progress, including determining when more specific evaluation is needed; and communicate child progress to families and other stakeholders. The objectives are operationalized into rating scale items structured as follows: Social-Emotional, 8 items (understanding, regulating, and expressing emotions; building relationships; and interacting appropriately); physical, 5 items (gross-motor development and fine-motor strength and coordination); language, 8 items (understanding and using language to communicate or express thoughts and needs); cognitive, 10 items (approaches to learning; memory; classification; and the use of symbols to represent objects, events, or persons); literacy, 12 items (phonological awareness; alphabet, print, and book knowledge; comprehension; and emergent writing skills); and mathematics, 7 items (number concepts and operations, spatial relationships and shapes, measurement and comparison, and pattern knowledge). 
Teachers summarize information at three checkpoints (fall, winter, and spring) using paper or online versions of the instrument (see www.teachingstrategies.com for additional information).

Teacher and Classroom Context

Teacher-based observational assessment, including authentic assessment, is more subjective than standardized measures (Cabell, Justice, Zucker, & Kilday, 2009), and teachers’ appraisals do not always align with those of outside evaluators or parents (e.g., Dinnebeil et al., 2013; Sims & Lonigan, 2012). Some researchers suggest that teacher evaluations may be influenced by factors that threaten validity (e.g., Waterman, McDermott, Fantuzzo, & Gadsden, 2012). Mashburn and Henry (2004) noted, “At least half of the variance in kindergarten teachers’ ratings remained unexplained by the child’s observed skills and abilities” (p. 29). Preconceived ideas about children, especially subgroups (e.g., Bennett, Gottesman, Rock, & Cerullo, 1993; Burchinal et al., 2011); perceptions of value differences between teachers and families (Hauser-Cram, Sirin, & Stipek, 2003); and education, specialized training, or training and/or experience using the assessment instrument (e.g., Mashburn & Henry, 2004; Meisels et al., 2010) are associated with teachers’ evaluations. Measures embedded in daily instruction that use various methods to gather child information over time can provide more complete information (Cabell et al., 2009) about the strengths, functional competencies, and needs of children than other measures. Several studies indicate that teachers used authentic assessment measures to accurately assess the children in their classrooms (e.g., Meisels et al., 2001; Moreno & Klute, 2011).

Classroom context can also influence child assessment outcomes (e.g., Mashburn, Hamre, Downer, & Pianta, 2006). In a study of teacher judgments of preschoolers’ math skills, approximately 40% of the variance was attributed to systematic differences between classrooms rather than to the child (Kilday, Kinzie, Mashburn, & Whittaker, 2012). In classrooms with high percentages of infants and toddlers, teachers may be less accurate in their assessments than in classrooms with older children (Meisels et al., 2010). Teachers in low-socioeconomic-status (SES) and low-achieving contexts tend to underestimate students’ abilities (Ready & Wright, 2011), and more behavioral problems and fewer prosocial behaviors were noted by teachers in low-income classrooms (Phillips & Lonigan, 2010). Behavior (Dinnebeil et al., 2013) and academic skills can also vary according to the percentage of children with special needs in the classroom (Gallagher & Lambert, 2006).

Child Demographic Characteristics

Differing results are reported regarding gender differences during early childhood. Mashburn and Henry (2004) noted that preschool and kindergarten boys typically develop slower than girls. Girls seem to have an advantage in early language and literacy, and the gender gap may widen over time (Ready, LoGerfo, Burkam, & Lee, 2005). Although boys were rated lower than girls on literacy skills, only half of the differences were explained by actual between-group differences (Ready & Wright, 2011). A female advantage is also reported for self-regulation (Matthews, Ponitz, & Morrison, 2009) and for social competence and behavior (Walker, 2004). Boys are likely to be rated as displaying more problematic behaviors than girls (Graves & Howes, 2011; Jerome, Hamre, & Pianta, 2009). Gender gaps in mathematics were found favoring kindergarten boys at the upper end of the achievement distribution and for Hispanic girls over boys at the bottom of the distribution (Penner & Paret, 2008; Robinson & Lubienski, 2011). Other studies indicate a male advantage for spatial skills (e.g., Gibbs, 2010; Levine, Huttenlocher, Taylor, & Langrock, 1999). Some studies suggest that gender gaps do not emerge until after kindergarten, and several researchers report no gender differences in mathematics (Klein, Adi-Japha, & Hakak-Benizri, 2010), general knowledge or early literacy achievement (Matthews et al., 2009), and task engagement and behavioral conflict (Vitiello, Booren, Downer, & Williford, 2012).

The academic achievement of ELLs across all grade levels is generally lower than that of White, English-speaking students (Downer et al., 2012). Spanish-speaking ELLs have some of the lowest mathematics and reading skills (Reardon & Galindo, 2009), whereas Asian ELLs have higher scores (Yesil-Dagli, 2011). Penner and Paret (2008) reported a mathematics advantage for Asian boys in the top of the achievement distribution. In another study, teachers underestimated the literacy skills of Asian ELLs (Ready & Wright, 2011). The English skills of Hispanic ELLs were underestimated at the beginning of the year, but the perceived disadvantage disappeared by spring.

Children with disabilities tend to be rated lower by teachers and parents on positive social functioning measures and score lower on emergent literacy development (Gallagher & Lambert, 2006). They show higher rates of language delays in preschool than children without disabilities (Goldstein, 2004). Communication patterns can influence assessment ratings (e.g., Dinnebeil et al., 2013); children with early speech and language impairments may be ignored by peers and respond less frequently to peer initiations than typically developing children (Hadley & Rice, 1991).

Research Aims

The overall purpose of this study was to offer evidence for the validity of the measure being evaluated. To do so, we sought to demonstrate that teacher ratings using the measure indicate differences in the expected directions between subgroups of children with known differences on demographic variables (disability status, ELLs, etc.). Similarly, we attempted to demonstrate that classroom average ratings vary in the expected directions based on the demographic composition of the classrooms. We also sought to demonstrate that teacher ratings using the measure could be used to track the growth and development of children. It is important to note that test validity based on known group differences (DeVellis, 2003) is not the same thing as biased ratings. Evidence for item and test bias is not based on differences between subgroups of children where differences would be expected. Rather, item and test bias have specific statistical definitions (Clauser & Mazor, 1998) and are based on research findings of differential item or test functioning (DIF or DTF) that indicate subgroups of children receive different ratings after underlying ability has been controlled for. For example, if two children have the same underlying ability on the construct of interest, belong to two different subgroups (e.g., native English speakers and ELLs), and receive different ratings on an item, then bias may exist. Item bias occurs when the presence of differential item or test functioning in fact reflects construct-irrelevant variance in performance. The reader is referred to Kim, Lambert, and Burts (in press) for a study that demonstrated that teacher ratings using this measure do not suffer from bias based on child disability status, ethnicity, or ELL status.

In addition, variance decomposition within a multilevel modeling context was used to examine how much of the variability in the ratings of child developmental progress is found between raters (teachers). Specifically, we sought to address the following research questions:

Research Question 1: What child characteristics are associated with teacher ratings of child growth, development, and learning?

Research Question 2: What classroom composition characteristics are associated with teacher ratings of child growth, development, and learning?

Research Question 3: How much of the variability in ratings of child developmental progress is between raters in a model that controls for child and classroom characteristics?

Method

Participants

A total of 111,059 children were rated by 8,042 teachers using the Teaching Strategies GOLD^® for the fall 2010 checkpoint. These children received educational services in 735 different programs at 3,792 different Head Start, private childcare, and school-based sites located in all regions and states of the United States. The population of children rated using the Teaching Strategies GOLD^® spanned the entire age range for which the assessment is intended. Teachers rated an average of 13.8 children. The teachers collected information about the race and ethnicity of each child and entered this information into the online system.

A growth-norm sample of 21,592 was created by sampling from the total population of children rated across three time points (fall, winter, and spring) using the measure. The sample was selected, stratifying by ethnicity and region, from among all children who were rated during all three rating periods using the online version of the assessment measure during academic year 2010-2011. These children ranged in age from 12 to 59 months at the time of the fall assessment. There were not sufficient data in the population to include a representative sample of children who were younger than 12 months or older than 59 months at the time of the fall assessment. This sample came from 40 different states and from the District of Columbia. The children were from the Northeastern (7.5%), Midwestern (54.7%), Southeastern (21.2%), and Western (16.6%) regions of the United States. The sample was similar to the 2010 U.S. Census Bureau population statistics of preschool-aged children with respect to gender (male 51.2%, female 48.8%). White children were represented in approximately their national proportion (52.1% in the Census Bureau’s estimate, 50.9% in the norm sample). African American children were overrepresented (13.6% in the Census Bureau’s estimate, 21.9% in the norm sample). Native American or Alaskan Native children comprised 2.5% of the norm sample, and Asian or Pacific Islander children comprised 3.0% of the norm sample. Multiracial children and children of all other ethnic subgroups were closely represented in the overall proportion (8.9% in the Census Bureau’s estimate, 8.7% in the norm sample). Teachers reported unknown racial identity for 13.0% of the children in the norm sample.

Approximately one quarter of the children were identified as Hispanic (25.5% in the Census Bureau’s estimate, 25.7% in the norm sample). The primary language spoken in the home was English for 76.1%, Spanish for 17.5%, and 63 other languages for the remaining 6.4% of the children. Children with an Individualized Family Service Plan (IFSP) or Individualized Education Plan (IEP) comprised 11.9% of the norm sample. Table 1 includes a summary of these child characteristics for the entire sample and based on the classroom averages.

Table 1.

Descriptive Statistics for Child and Classroom Characteristics.

	Age in Months	Disability Status	Gender (Boy)	Hispanic ELL	Other ELL
Children (n = 21,592)
Mean	49.524	11.9%	51.2%	17.5%	6.4%
SD	7.234	32.4%	50.0%	38.0%	24.6%

	Class Mean Age in Months	Class Percentage Disability Status	Class Percentage Boys	Class Percentage Hispanic ELL	Class Percentage Other ELL

Classrooms (n = 1,526)
Mean	49.179	14.6%	52.2%	15.7%	6.4%
SD	5.974	27.6%	15.5%	24.6%	15.8%

Measure

Development of the Teaching Strategies GOLD^® occurred over several years and incorporated feedback from teachers, administrators, consultants, and professional-development personnel; state early learning standards; and current research and professional literature, including literature identifying the knowledge, skills, and behaviors most predictive of school success. A study of the instrument with a subsample of infants through children aged 2 years (Kim & Smith, 2010) indicated high internal consistency reliability (α = .95-.99) and moderately high Rasch reliability statistics (person separation = 9.42, item separation = 19.20, person reliability = .99, item reliability = 1.00). Several other studies indicate generally strong overall psychometric properties of the instrument (Kim, Lambert, & Burts, in press).

Teachers rate child skills, knowledge, and behaviors along a 10-point progression of development and learning from “Not Yet” (Level 0) to Level 9 (beyond kindergarten expectations). “Indicator levels” (i.e., 2, 4, 6, and 8) include examples of what evidence may look like with majority and subgroups of children. “In-between levels” allow for additional steps in the progression. They do not include examples and are used to indicate that the child’s skills for that item are emerging but are not fully established. Teachers also enter into the online system the basic demographic information about each child and family that was used in this study. These teacher-reported values are also used to create classroom profiles of child demographics.

Teacher training on the instrument occurred over 2 days and included an overview of the measure and an exploration of the objectives and child progressions from birth through kindergarten. Teachers watched video segments, participated in large-group discussions, evaluated a portfolio, completed family conference forms, and practiced uploading documentation samples and entering observation notes and progress checkpoints online. Interrater reliability of the measure (Kim, Lambert, & Burts, in press) was established by examining the correlations between the ratings of a master trainer and the ratings of teachers using the measure. The correlations were all above .90 with one exception, which was above .80.

Scale scores were created for each developmental domain using interval-level Rasch rating scale ability estimates. The ability estimates were then rescaled to conform to a distribution with a mean of 500 and standard deviation of 100. Values 3 or more standard deviations below the mean were given a value of 200, and values 3 or more standard deviations above the mean were given a value of 800. The scale score of 500 was considered normative for children 36 months of age, as these children are in the middle of the intended age range for use of the measure. A validation study of the Rasch-scaled developmental score suggests that teachers can make valid ratings of the developmental progress of children across the intended age range (Lambert, Kim, Taylor, & McGee, 2010). Rasch item and person reliabilities from these analyses, along with item and person separation indexes (Bond & Fox, 2007), were all very favorable for all scale scores across all three time points (Lambert et al., 2010). The Cronbach’s α reliability statistics obtained from this sample data were as follows: Social-Emotional (fall = .947, winter = .951, spring = .958), Physical (fall = .909, winter = .920, spring = .933), Language (fall = .957, winter = .960, spring = .965), Cognitive (fall = .961, winter = .965, spring = .972), Literacy (fall = .952, winter = .956, spring = .964), and Mathematics (fall = .937, winter = .940, spring = .951).
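
The rescaling step described above can be sketched as follows. This is a minimal illustration and not the authors' scoring code: the function name `to_scale_score` and the arguments `ref_mean` and `ref_sd` are assumptions, because the text does not report how the Rasch ability estimates (in logits) were anchored so that 500 corresponds to children 36 months of age.

```python
def to_scale_score(theta, ref_mean, ref_sd):
    """Rescale a Rasch ability estimate (logits) to a mean-500, SD-100 metric.

    ref_mean and ref_sd are hypothetical reference values for the logit
    distribution; the source does not report them.
    """
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    z = (theta - ref_mean) / ref_sd          # standardize against the reference
    score = 500.0 + 100.0 * z                # mean 500, SD 100
    # Values 3 or more SDs below or above the mean are set to 200 or 800.
    return min(800.0, max(200.0, score))
```

Under these assumptions, an ability estimate 1.5 reference standard deviations above the reference mean maps to 650, and any estimate beyond 3 standard deviations is held at 200 or 800.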

Analysis

A special case of multilevel modeling, three-level growth-curve modeling, was used to address the research questions. Separate models were created using each scale score as the dependent variable. The HLM software package (Raudenbush, Bryk, & Congdon, 2004) was used for all analyses. The Level 1 models represented growth over time within child and included one predictor variable, month of the academic year centered on the winter assessment. Centering at the winter assessment was used to obtain more stable and robust estimates of intercept and growth rate. The resulting models therefore included an intercept term that represented each child’s estimated status at the winter assessment and a slope that estimated growth rate. The Level 2 models included child demographic variables. Child age was included as age in months at the time of the fall assessment. Disability status was coded 0 for typically developing children and 1 for children with an IEP or IFSP. Gender was coded 0 for females and 1 for males. ELL status was included as two dummy-coded variables representing Hispanic ELLs and all other ELLs. Native English speakers were accounted for as the baseline condition. Age in months was entered as a group-mean-centered predictor, and all other independent variables were entered uncentered. Classroom mean age, proportion with an IEP or IFSP, proportion of boys, and proportion in the two ELL categories were entered as grand-mean-centered predictors in the Level 3 models.
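
One way to write the three-level growth model described above is sketched below. The notation is an assumption introduced for illustration and does not appear in the source. The sketch also assumes that the child-level coefficients do not vary randomly across classrooms, because the random-effects structure is not reported. Here t indexes time points, i children, and j classrooms; W_1 through W_5 stand for classroom mean age and the classroom proportions with an IEP or IFSP, of boys, of Hispanic ELLs, and of other ELLs.

```latex
\begin{align*}
\text{Level 1 (time within child):}\quad
  Y_{tij} &= \pi_{0ij} + \pi_{1ij}\,(\mathrm{Month}_{tij} - \mathrm{Month}_{\mathrm{winter}}) + e_{tij} \\
\text{Level 2 (child within classroom):}\quad
  \pi_{pij} &= \beta_{p0j} + \beta_{p1}\,(\mathrm{Age}_{ij} - \overline{\mathrm{Age}}_{j})
      + \beta_{p2}\,\mathrm{Disability}_{ij} + \beta_{p3}\,\mathrm{Boy}_{ij} \\
      &\quad + \beta_{p4}\,\mathrm{HispanicELL}_{ij} + \beta_{p5}\,\mathrm{OtherELL}_{ij} + r_{pij},
      \qquad p = 0, 1 \\
\text{Level 3 (classroom):}\quad
  \beta_{p0j} &= \gamma_{p00} + \sum_{k=1}^{5} \gamma_{p0k}\,(W_{kj} - \overline{W}_{k}) + u_{p0j},
      \qquad p = 0, 1
\end{align*}
```

In this notation, p = 0 refers to winter status and p = 1 to growth rate, so each scale score has one set of child-level and classroom-level coefficients for winter status and another for growth.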

Results

Table 2 includes the variance decomposition estimates from the unconditional models, that is, models that contain no predictors. These results indicate that approximately one third of the variance in the scale scores was found between time points for the same child, another third between children within the same classroom, and another third between classrooms. In this application, each classroom had its own teacher or rater, so this value is also the proportion of the variance between raters, prior to accounting for classroom characteristics of the children.

Table 2.

Variance Decomposition.

	Proportion of Total Variance
	Level 1	Level 2	Level 3
Scale Score	Between Time Points Within Children	Between Children Within Classrooms	Between Classrooms
Social-Emotional	.337	.356	.307
Physical	.351	.292	.357
Language	.303	.410	.287
Cognitive	.350	.331	.319
Literacy	.326	.346	.329
Mathematics	.325	.356	.319
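
The proportions in Table 2 can be read as shares of the total variance in the unconditional model. A sketch of that calculation follows, using symbols that are assumptions introduced here and are not taken from the source: σ² for variance between time points within children, τ_π for variance between children within classrooms, and τ_β for variance between classrooms.

```latex
\[
\rho_{1} = \frac{\sigma^{2}}{\sigma^{2} + \tau_{\pi} + \tau_{\beta}}, \qquad
\rho_{2} = \frac{\tau_{\pi}}{\sigma^{2} + \tau_{\pi} + \tau_{\beta}}, \qquad
\rho_{3} = \frac{\tau_{\beta}}{\sigma^{2} + \tau_{\pi} + \tau_{\beta}}, \qquad
\rho_{1} + \rho_{2} + \rho_{3} = 1
\]
```

For the Social-Emotional scale, for example, the three shares in Table 2 are .337, .356, and .307, which sum to 1.000.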

Table 3 includes the model’s estimated coefficients for the child demographic variables for each scale score, for both the winter status and growth models. These results address the first research question. Child age in months, as expected, was a statistically significant predictor of both winter status and growth rate for all scale scores. These results indicate that children were rated about 5 points higher for every additional month of age. These coefficients ranged from 4.432 points per month for the Physical scale to 5.656 for the Cognitive scale. The growth rate models indicate that we would expect children to grow about 0.10 points per month faster for every additional month of age. These coefficients ranged from 0.064 for the Social-Emotional scale to 0.187 for the Language scale score.

Table 3.

Results of Level 2 Models: Child Characteristics Associated With Initial Status and Growth.

Scale Score	Model	Age in Months	Disability Status	Gender (Boy)	Hispanic ELL	Other ELL
Social-Emotional	Winter	4.912***	−29.024***	−15.338***	−6.748***	−9.266***
Social-Emotional	Growth	0.064***	−0.581*	−0.435**	0.367*	0.130
Physical	Winter	4.432***	−22.518***	−7.353***	−2.432*	−0.693
Physical	Growth	0.098***	−0.812***	−0.290**	0.345*	0.039
Language	Winter	5.400***	−43.129***	−13.092***	−31.435***	−28.575***
Language	Growth	0.187***	−1.484***	−0.631***	−0.284	−0.107
Cognitive	Winter	5.656***	−31.846***	−12.953***	−14.339***	−10.853***
Cognitive	Growth	0.162***	−1.609***	−0.579***	0.279	0.162
Literacy	Winter	4.766***	−25.143***	−11.436***	−19.550***	−9.266***
Literacy	Growth	0.069***	−1.043***	−0.221**	−0.182	0.529*
Mathematics	Winter	4.607***	−26.314***	−6.447***	−18.572***	−8.718***
Mathematics	Growth	0.102***	−0.934***	−0.115	−0.329*	0.546*

Note: *p < .05. **p < .01. ***p < .001.

Child disability status was also a significant predictor of both winter status and growth rate and in the expected directions. Children with disabilities, based on these model results, can be expected to be rated lower and grow slower than typically developing children. Winter status coefficients ranged from −43.129 for Language to −22.518 for Physical. Growth rate coefficients ranged from −1.609 for the Cognitive scale to −.581 for the Social-Emotional scale. Boys were rated significantly lower than girls on all scale scores, from −15.338 for Social-Emotional to −6.447 for Mathematics, and grew significantly slower than girls on every scale except Mathematics, from −.221 for Literacy to −.631 for Language; the Mathematics growth coefficient (−.115) was not statistically significant.

Hispanic ELL children were rated significantly lower on all scale scores, from −31.435 for Language to −2.432 for Physical. Teachers rated their growth as significantly higher than that of native English-speaking children for two of the scale scores: .367 for Social-Emotional and .345 for Physical. Teachers rated their growth as significantly lower than that of native English-speaking children for Mathematics (−.329). Non-Hispanic ELL children were rated significantly lower compared with native English-speaking children for all of the scales except Physical, from −28.575 for Language to −8.718 for Mathematics; the Physical coefficient (−.693) was not statistically significant. However, these children were rated as growing faster than native English-speaking children on the Literacy (.529) and Mathematics (.546) scale scores.
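
To illustrate how the Table 3 coefficients combine, the sketch below computes the expected difference in Language winter status relative to a reference child (a typically developing girl who is a native English speaker and is at her classroom's mean age). The coefficients are copied from Table 3; the function name and argument names are assumptions made for this example, and the calculation ignores classroom-level effects and residual variation.

```python
# Winter-status coefficients for the Language scale, copied from Table 3.
LANGUAGE_WINTER = {
    "age_months": 5.400,      # per month above the classroom mean age
    "disability": -43.129,    # 1 = IEP or IFSP, 0 = typically developing
    "boy": -13.092,           # 1 = male, 0 = female
    "hispanic_ell": -31.435,  # 1 = Hispanic ELL, 0 = otherwise
    "other_ell": -28.575,     # 1 = other ELL, 0 = otherwise
}


def expected_difference(coefs, age_dev=0.0, disability=0, boy=0,
                        hispanic_ell=0, other_ell=0):
    """Expected scale-score difference from the reference child.

    age_dev is the child's age minus the classroom mean age, in months
    (age was group-mean centered in the Level 2 models).
    """
    return (coefs["age_months"] * age_dev
            + coefs["disability"] * disability
            + coefs["boy"] * boy
            + coefs["hispanic_ell"] * hispanic_ell
            + coefs["other_ell"] * other_ell)


# A boy with an IEP who is 3 months older than his classroom's mean age:
# 3 * 5.400 - 43.129 - 13.092 = -40.021 scale-score points.
example = expected_difference(LANGUAGE_WINTER, age_dev=3, disability=1, boy=1)
```

Read this way, the age advantage of about 16 points for a child 3 months older than classmates offsets only part of the combined disability and gender coefficients on the Language scale.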

Table 4 includes the coefficients for the Level 3 classroom composition variables addressing the second research question. These models estimate the average winter status for each scale score, ranging from 594.927 for Physical to 615.816 for Cognitive. As expected, class mean age was significantly associated with class average winter status ratings. These coefficients ranged from 4.210 for the Physical scale to 5.438 for the Cognitive scale. The classroom proportion of children with disabilities was significantly associated with class mean winter status for four of the scale scores (Social-Emotional, Language, Cognitive, and Literacy), indicating that classrooms with higher proportions of these children would be expected to be rated, on average, lower than classrooms with lower proportions. The proportion of boys in the classroom was not related to the classroom winter status ratings for any of the scale scores. The classroom proportion of Hispanic ELL children was significantly associated with class average winter status, in the negative direction (lower ratings), for all scale scores except Literacy. The opposite pattern was found for non-Hispanic ELL children: higher proportions of these children were significantly associated with higher average winter status ratings for all scale scores except Literacy and Mathematics.

Table 4.

Results of Level 3 Models: Classroom Characteristics Associated With Class Mean Initial Status.

Scale Score	Intercept	Class Mean Age in Months	Class Proportion Disability Status	Class Proportion Boys	Class Proportion Hispanic ELL	Class Proportion Other ELL
Social-Emotional
Winter	608.631***	4.765***	−15.520***	3.945	−15.737***	14.576**
Growth	16.687***	0.220***	−3.248***	1.316	3.239***	−2.026*
Physical
Winter	594.927***	4.210***	−7.283	7.814	−18.979***	12.409*
Growth	14.810***	0.156***	−3.072***	−0.010	3.044***	−0.581
Language
Winter	613.780***	4.685***	−18.004***	2.194	−11.736**	21.715***
Growth	16.589***	0.262***	−3.384***	−0.113	3.313***	−1.371
Cognitive
Winter	615.816***	5.438***	−17.181***	2.686	−13.748**	12.912*
Growth	18.901***	0.309***	−3.559***	−0.431	3.599***	−1.400
Literacy
Winter	606.525***	5.172***	−7.764*	4.483	−1.489	10.146
Growth	15.232***	0.243***	−3.286***	0.247	2.502***	−0.636
Mathematics
Winter	603.637***	4.834***	−7.260	−3.811	−8.459*	4.623
Growth	14.693***	0.224***	−3.318***	0.659	3.181***	−0.338

Note: *p < .05. **p < .01. ***p < .001.

These models estimated the average monthly growth rates to be significant for each scale score, ranging from 14.693 for Mathematics to 18.901 for Cognitive. As expected, class mean age was significantly associated with growth rates for all scale scores, ranging from .156 for Physical to .309 for Cognitive. The classroom proportion of children with disabilities was significantly associated with lower class mean growth for all scale scores, ranging from −3.072 for Physical to −3.559 for Cognitive, indicating that classrooms with higher proportions of these children would be expected to grow, on average, slower than classrooms with lower proportions. The proportion of boys in the classroom was not related to the classroom average growth for any of the scale scores. The classroom proportion of Hispanic ELL children was associated with significantly higher average growth rates for all scale scores, ranging from 2.502 for Literacy to 3.599 for Cognitive. The opposite pattern was found for non-Hispanic ELL children: higher proportions of these children were significantly associated with lower average growth rates for the Social-Emotional scale (−2.026).
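
The classroom-level coefficients can be read in a similar way. The sketch below uses the Social-Emotional winter-status row of Table 4 and assumes that the grand-mean centering constants equal the classroom means in Table 1, which the text does not state explicitly. The function and dictionary names are hypothetical, and the calculation includes coefficients that were not statistically significant, such as the proportion of boys.

```python
# Classroom means from Table 1, assumed here to be the centering constants.
GRAND_MEANS = {
    "age": 49.179,          # class mean age in months
    "disability": 0.146,    # class proportion with an IEP or IFSP
    "boys": 0.522,          # class proportion of boys
    "hispanic_ell": 0.157,  # class proportion of Hispanic ELLs
    "other_ell": 0.064,     # class proportion of other ELLs
}

# Social-Emotional winter-status coefficients, copied from Table 4.
SE_WINTER = {
    "intercept": 608.631,
    "age": 4.765,
    "disability": -15.520,
    "boys": 3.945,
    "hispanic_ell": -15.737,
    "other_ell": 14.576,
}


def class_expected_status(coefs, profile):
    """Expected class mean winter status for a classroom profile."""
    total = coefs["intercept"]
    for name, grand_mean in GRAND_MEANS.items():
        # Each predictor enters as a deviation from its grand mean.
        total += coefs[name] * (profile[name] - grand_mean)
    return total


average_class = dict(GRAND_MEANS)
# Same classroom, but with 10 percentage points more children with disabilities.
higher_disability_class = dict(GRAND_MEANS, disability=0.246)
shift = (class_expected_status(SE_WINTER, higher_disability_class)
         - class_expected_status(SE_WINTER, average_class))
```

Because the predictors are proportions, a coefficient of −15.520 corresponds to a shift of about −1.55 scale-score points for a 10-percentage-point increase in the classroom share of children with disabilities.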

Table 5 includes the proportion of variance accounted for by the predictors in each model by scale score. Within the HLM context, estimates of variance accounted for can be made by observing the reduction in the residual variance in the model after the inclusion of the predictor variables. The time of year of the assessment, as a predictor of linear growth in the Level 1 models, was associated with approximately 80% of the variability in scores within child. Child demographic characteristics were associated with approximately 30% of the variance in child scores within classrooms. The classroom characteristics included in the models were associated with approximately 40% of the variance between classrooms in average winter status.

Table 5.

Variance Accounted for by Model Predictors.

	Proportion of Level Specific Variance Accounted for by Model Predictors
Scale Score	Level 1 Linear Growth Pattern	Level 2 Child Characteristics	Level 3 Classroom Characteristics
Social-Emotional	.812	.299	.382
Physical	.778	.340	.296
Language	.801	.327	.443
Cognitive	.830	.352	.404
Literacy	.836	.344	.463
Mathematics	.824	.324	.457
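
The entries in Table 5 are proportional reductions in residual variance. A generic form of that calculation is sketched below; the notation is an assumption introduced here, and the text does not specify which comparison model served as the baseline at each level.

```latex
\[
R^{2}_{\ell} \;=\;
\frac{\hat{\sigma}^{2}_{\ell,\,\text{baseline}} - \hat{\sigma}^{2}_{\ell,\,\text{conditional}}}
     {\hat{\sigma}^{2}_{\ell,\,\text{baseline}}},
\qquad \ell = 1, 2, 3
\]
```

Here the baseline term is the level-ℓ residual variance before the predictors at that level are added, and the conditional term is the residual variance after they are added. A value of .812 for Level 1 on the Social-Emotional scale, for example, means the Level 1 residual variance fell by about 81% once the linear growth term was included.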

To address the third research question, we conceptualized the predictor variables in the model, time of the school year (fall, winter, and spring), demographic characteristics of the children, and the demographic composition of the children within classroom, as expected sources of variance in the scale scores given the focus of the teacher ratings and the expected variability in their ratings of growth, development, and learning. We calculated the proportion of the variance in the scale scores that could be considered as possibly due to rater effects, that is, differences in how teachers use the measure to rate the children in their own classrooms, by examining the proportion of the total variance in the scale scores that was comprised of residual variance in the Level 3 models after controlling for all predictors. This quantity is calculated for each scale score by dividing the Level 3 residual variance by the estimated total variance from the unconditional model. The Level 3 residual variance term represents the between-rater variance in the scale scores that is not accounted for by the model predictors that represent differences between classrooms in demographic composition. It is important to note that within the HLM models that we used, adding the child characteristics to the models controlled for within-rater variance, not between-rater variance. Classroom characteristics were controlled for in an effort to account for the between-rater variance that is due to the inevitable differences between classrooms in aggregate child demographics. Given that the results of the models clearly demonstrate that teacher ratings using this measure are able to distinguish between subgroups of children in the expected directions, it is reasonable to expect that differing classroom demographic profiles would therefore result in between-rater variance that is not due to rater effects but actual differences between classrooms.

These residual variance terms, expressed as proportions of the total variance in scale scores, were as follows: Social-Emotional .190, Physical .252, Language .160, Cognitive .190, Literacy .177, and Mathematics .173. These values indicate that between approximately 16% and 25% of the variance in scale scores is accounted for by unmeasured differences between classrooms and teachers, including rater effects. Similarly, these results suggest that between approximately 75% and 84% of the variance in the scale scores is associated with either the predictor variables in the model or unmeasured child characteristics.
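
The residual proportions reported above can be approximately recovered from Tables 2 and 5. The sketch below assumes that the Level 3 proportions in Table 5 are expressed relative to the unconditional Level 3 variance, so that the residual share equals the Level 3 share of total variance multiplied by one minus the proportion explained. The dictionary and function names are hypothetical, and small discrepancies reflect rounding in the published tables.

```python
# Level 3 (between-classroom) share of total variance, from Table 2.
LEVEL3_SHARE = {
    "Social-Emotional": 0.307, "Physical": 0.357, "Language": 0.287,
    "Cognitive": 0.319, "Literacy": 0.329, "Mathematics": 0.319,
}

# Proportion of Level 3 variance accounted for by classroom predictors, from Table 5.
LEVEL3_EXPLAINED = {
    "Social-Emotional": 0.382, "Physical": 0.296, "Language": 0.443,
    "Cognitive": 0.404, "Literacy": 0.463, "Mathematics": 0.457,
}

# Residual Level 3 proportions of total variance as reported in the text.
REPORTED = {
    "Social-Emotional": 0.190, "Physical": 0.252, "Language": 0.160,
    "Cognitive": 0.190, "Literacy": 0.177, "Mathematics": 0.173,
}


def residual_between_rater_share(scale):
    """Level 3 residual variance as a share of total unconditional variance."""
    return LEVEL3_SHARE[scale] * (1.0 - LEVEL3_EXPLAINED[scale])


for scale, reported in REPORTED.items():
    approx = residual_between_rater_share(scale)
    print(f"{scale}: computed {approx:.3f}, reported {reported:.3f}")
```

For the Social-Emotional scale, for example, .307 × (1 − .382) is about .190, matching the reported value.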

Discussion and Implications

The Teaching Strategies GOLD^® adds unique contributions to current authentic assessment measures through its design and validation processes. With any new assessment tool it is crucial to explore its psychometrics (Snow & Van Hemel, 2008). The present study provides further support for the measure’s validity and usefulness (Kim & Smith, 2010; Kim, Lambert, & Burts, in press). Specifically, the instrument showed sensitivity to age differences and to growth over time. As expected, older children had higher scores at all checkpoints than younger children. Supporting other research, children with disabilities started behind their nondisabled peers and grew slower over the year (Gallagher & Lambert, 2006; Goldstein, 2004). Similar to other studies, girls showed some advantages over boys (e.g., Matthews et al., 2009; Walker, 2004). Boys began lower and grew somewhat slower than girls (Ready et al., 2005) in all areas except mathematics.

Corroborating other studies (e.g., Downer et al., 2012; Reardon & Galindo, 2009; Yesil-Dagli, 2011), ELLs generally were rated lower at the beginning of the year than English-speaking peers and in some cases grew at faster rates than non-ELL peers. As children gained English skills and teachers became familiar with the children and their families, teachers may have become more accurate in their ratings. Hispanic ELLs’ growth in mathematics was lower than that of English-speaking peers, whereas non-Hispanic ELLs grew faster in mathematics and literacy. This finding is not especially surprising in that some studies indicate that Hispanic ELLs have some of the lowest mathematics skills of any group (Reardon & Galindo, 2009). Mathematics is the hardest form of language for children to learn (Ginsburg, Lee, & Boyd, 2008). Hispanic children often come from families with less parent–child linguistic engagement and lower-SES backgrounds than White and Asian American children (Garcia & Jensen, 2009), factors shown to influence literacy and mathematics skills.

Some authorities question whether teacher reports represent actual child differences or other factors such as teacher variability or classroom context (e.g., Gallagher & Lambert, 2006; Ready & Wright, 2011; Waterman et al., 2012). Mashburn and Henry (2004) note that it is common for teachers’ global ratings of young children’s skills to have very high unexplained variance (as much as 50%). In the present study, error variance ranged from 16% to 25%, considerably lower than reported in some studies (e.g., Kilday et al., 2012). Teacher-based observational assessment is more subjective than standardized measures (Cabell et al., 2009) and has the possibility for greater variability (Kilday et al., 2012). Appropriate training on any teacher observational measure is essential (Dinnebeil et al., 2013) and can increase teachers’ awareness of the influence that their perceptions (e.g., Bennett et al., 1993; Burchinal et al., 2011) and classroom contexts (e.g., Gallagher & Lambert, 2006; Meisels et al., 2010; Ready & Wright, 2011) have on child appraisals. Assessment measures that are embedded in daily instruction and that use various sources and methods to gather child information over time provide more complete information (Cabell et al., 2009) than other measures. Furthermore, the research-based objectives, multiple examples, additional scale points, behavioral anchors along the developmental progressions, and well-developed teacher training may have helped the Teaching Strategies GOLD^® address issues related to teacher-based ratings found in other studies.

Limitations and Directions for Future Research

It is important to note that the Level 2 models included only those child characteristics that were available to the researchers. Future research may benefit from the inclusion of a richer set of child and family characteristics. For example, parent education level and income, family SES, and the exact nature of special needs that lead to the disability status of the children were not available to the researchers. It is also important to note that the Level 3 models did not include teacher characteristics such as years of experience, educational level, and hours of training related to assessment issues in general and the Teaching Strategies GOLD^® system in particular. It is possible that some of the variance between classrooms in these analyses was associated with other unmeasured factors about the teachers and differences in classrooms or centers that are not accounted for by the demographic composition variables included in the models. For example, it is likely that centers and program sites vary in the amount and quality of training, supervision, and ongoing support that teachers receive related to assessment issues. Future research that focuses on a more thorough examination of the decomposition of the variance in ratings could build on the findings of this study with a formal generalizability study to examine interrater reliability.

Footnotes

Authors’ Note

This article is based on some of the same datasets and analyses contained in a previously released technical report entitled Technical Manual for the Teaching Strategies GOLD^® Assessment System.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Partial funding for this research was provided by Teaching Strategies, LLC. Views expressed are those of the authors and do not necessarily reflect those of the funding agency.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (AERA/APA/NCME). (1999). Standards for educational and psychological testing. Washington, DC: Author.

Bennett, R. E., Gottesman, R. L., Rock, D. A., & Cerullo (1993). Influence of behavioral perceptions and gender on teachers’ judgments of students’ academic skill. Journal of Educational Psychology, 85, 347-356.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Burchinal, McCartney, Steinberg, Crosnoe, Friedman, S. L., McLoyd, & NICHD Early Child Care Research Network. (2011). Examining the Black–White achievement gap among low-income children using the NICHD study of early child care and youth development. Child Development, 82, 1404-1420.

Cabell, S. Q., Justice, L. M., Zucker, T. A., & Kilday, C. R. (2009). Validity of teacher report for assessing the emergent literacy skills of at-risk preschoolers. Language, Speech, and Hearing Services in Schools, 40, 161-173.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.

Dinnebeil, L. A., Sawyer, B. E., Logan, Dynia, J. M., Cancio, & Justice, L. M. (2013). Influences on the congruence between parents’ and teachers’ ratings of young children’s social skills and problem behaviors. Early Childhood Research Quarterly, 28, 144-152. doi:10.1016/j.ecresq.2012.03.03.001

DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: SAGE.

Downer, J. T., Lopez, M. L., Grimm, K. J., Hamagami, Pianta, R. C., & Howes (2012). Observations of teacher–child interactions in classrooms serving Latinos and dual language learners: Applicability of the Classroom Assessment Scoring System in diverse settings. Early Childhood Research Quarterly, 27, 21-32.

Gallagher, P. A., & Lambert, R. G. (2006). Classroom quality, concentration of children with special needs, and child outcomes in Head Start. Exceptional Children, 73(1), 31-51.

Garcia & Jensen (2009). Early educational opportunities for children of Hispanic origins. SRCD Policy Report, 23, 3-19.

Gibbs, B. G. (2010). Reversing fortunes or content change? Gender gaps in math-related skill throughout childhood. Social Science Research, 39, 540-569.

Ginsburg, H. P., Lee, J. S., & Boyd, J. S. (2008). Mathematics education for young children: What it is and how to promote it. Social Policy Report, 22(1), 1-22.

Goldstein (2004). Helping young children with special needs develop vocabulary. Early Childhood Education Journal, 32, 1-43.

Graves, S. L., & Howes (2011). Ethnic differences in social-emotional development in preschool: The impact of teacher child relationships and classroom quality. School Psychology Quarterly, 26, 202-214.

Hadley, P. A., & Rice, M. L. (1991). Conversational responsiveness of speech- and language-impaired preschoolers. Journal of Speech and Hearing Research, 34, 1308-1317.

Hauser-Cram, Sirin, S. R., & Stipek (2003). When teachers’ and parents’ values differ: Teachers’ ratings of academic competence in children from low-income families. Journal of Educational Psychology, 95, 813-820.

Heroman, Burts, D. C., Berke, & Bickart, T. S. (2010). The Creative Curriculum for preschool—Volume 5, Objectives for development & learning: Birth through kindergarten. Washington, DC: Teaching Strategies, LLC.

Hirsh-Pasek, Kochanoff, Newcombe, N. S., & de Villiers (2005). Using scientific knowledge to inform preschool assessment: Making the case for “empirical validity.” Social Policy Report, 14(1), 1-19.

Jerome, E. M., Hamre, B. K., & Pianta, R. C. (2009). Teacher-child relationships from kindergarten to sixth grade: Early childhood predictors of teacher-perceived conflict and closeness. Social Development, 18, 915-945.

Kilday, C. R., Kinzie, M. B., Mashburn, A. J., & Whittaker, J. V. (2012). Accuracy of teacher judgments of preschoolers’ math skills. Journal of Psychoeducational Assessment, 30, 148-159.

Kim & Smith, J. D. (2010). Evaluation of two observational assessment systems for children’s development and learning. NHSA Dialog, 13, 253-267.

Kim, D. H., Lambert, R. G., & Burts, D. C. (in press). Evidence of the validity of Teaching Strategies GOLD® Assessment tool for English language learners and children with disabilities. Early Education and Development.

Klein, P. S., Adi-Japha, & Hakak-Benizri (2010). Mathematical thinking of kindergarten boys and girls: Similar achievement, different contributing processes. Educational Studies in Mathematics, 73, 233-246.

Lambert, R. G., Kim, Taylor, & McGee, J. R. (2010). Technical Manual for the Teaching Strategies GOLD™ Assessment System (CEMETR-2010-06). Retrieved from University of North Carolina Charlotte, Center for Educational Measurement and Evaluation website: https://education.uncc.edu/ceme/ceme-technical-reports

Levine, S. C., Huttenlocher, Taylor, & Langrock (1999). Early sex differences in spatial skill. Developmental Psychology, 35, 940-949.

Mashburn, A. J., Hamre, B. K., Downer, J. T., & Pianta, R. C. (2006). Teacher and classroom characteristics associated with teachers’ ratings of prekindergartners’ relationships and behaviors. Journal of Psychoeducational Assessment, 24, 367-380.

Mashburn, A. J., & Henry, G. T. (2004). Assessing school readiness: Validity and bias in preschool and kindergarten teachers’ ratings. Educational Measurement: Issues and Practice, 23(4), 16-30.

Matthews, J. S., Ponitz, C. C., & Morrison, F. J. (2009). Early gender differences in self-regulation and academic achievement. Journal of Educational Psychology, 101, 689-704.

Meisels, S. J., Bickel, D. D., Nicholson, Xue, & Atkins-Burnett (2001). Trusting teachers’ judgments: A validity study of a curriculum-embedded performance assessment in kindergarten to grade 3. American Educational Research Journal, 38(1), 73-95.

Meisels, S. J., Wen, & Beachy-Quick (2010). Authentic assessment for infants and toddlers: Exploring the reliability and validity of the Ounce Scale. Applied Developmental Science, 14, 55-71.

Moreno, A. J., & Klute, M. M. (2011). Infant-toddler teachers can successfully employ authentic assessment: The Learning Through Relating system. Early Childhood Research Quarterly, 26, 484-496.

National Association for the Education of Young Children and National Association of Early Childhood Specialists in State Departments of Education (NAEYC and NAECS/SDE). (2003). Early childhood curriculum, assessment, and program evaluation: Building an effective, accountable system in programs for children birth through age 8. Joint position statement. Retrieved from http://www.naeyc.org/dap

Penner, A. M., & Paret (2008). Gender differences in mathematics achievement: Exploring the early grades and the extremes. Social Science Research, 37, 239-253.

Phillips, B. M., & Lonigan, C. J. (2010). Child and informant influences on behavioral ratings of preschool children. Psychology in the Schools, 47, 374-390.

Raudenbush, S. W., Bryk, A. S., & Congdon (2004). HLM 6 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

Ready, D. D., LoGerfo, L. F., Burkam, D. T., & Lee, V. E. (2005). Explaining girls’ advantage in kindergarten literacy learning: Do classroom behaviors make a difference? The Elementary School Journal, 106(1), 21-38.

Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in teachers’ perceptions of young children’s cognitive abilities: The role of child background and classroom context. American Educational Research Journal, 48, 335-360.

Reardon, S. F., & Galindo (2009). The Hispanic–White achievement gap in math and reading in the elementary grades. American Educational Research Journal, 46, 853-891.

Robinson, J. P., & Lubienski, S. T. (2011). The development of gender gaps in mathematics and reading during elementary and middle school: Examining direct cognitive assessments and teacher ratings. American Educational Research Journal, 48, 268-302.

Schweinhart, L. J., McNair, Barnes, & Larner (1993). Observing young children in action to assess their development: The High/Scope Child Observation Record Study. Educational and Psychological Measurement, 53, 445-455.

Sims, D. M., & Lonigan, C. J. (2012). Multi-method assessment of ADHD characteristics in preschool children: Relations between measures. Early Childhood Research Quarterly, 27, 329-337. doi:10.1016/j.ecresq.2011.08.004

Snow, C. E., & Van Hemel, S. B. (Eds.). (2008). Early childhood assessment: Why, what, and how? National Research Council of the National Academies. Washington, DC: National Academies Press. Retrieved from http://www.nap.edu/catalog/12446.html

Vitiello, V. E., Booren, L. M., Downer, J. T., & Williford, A. P. (2012). Variation in children’s classroom engagement throughout a day in preschool: Relations to classroom and child factors. Early Childhood Research Quarterly, 27, 210-220.

Walker (2004). Teacher reports of social behavior and peer acceptance in early childhood: Sex and social status differences. Child Study Journal, 34(1), 13-28.

Waterman, McDermott, P. A., Fantuzzo, J. W., & Gadsden, V. L. (2012). The matter of assessor variance in early childhood education—Or whose score is it anyway? Early Childhood Research Quarterly, 27, 46-54.

Yesil-Dagli (2011). Predicting ELL students’ beginning first grade English oral reading fluency from initial kindergarten vocabulary, letter naming, and phonological awareness skills. Early Childhood Research Quarterly, 26, 15-29.