Abstract
An important consideration in determining the validity of an observational assessment measure for young children is the variability attributed to the child versus that ascribed to the assessor or to some other factor such as classroom context. The Teaching Strategies GOLD® assessment system was used to elicit teacher ratings of a national sample of 21,592 children (age 12-59 months). Teacher ratings of child development and learning were associated in expected directions with both child demographic characteristics and classroom composition variables. Children with disabilities started behind their typically developing peers and grew more slowly, girls showed an advantage in some areas over boys, and English language learners (ELLs) were rated lower at the beginning of the year and showed some faster rates of growth than their native English-speaking peers.
To ensure that all children are evaluated fairly, regardless of culture, language, or disabilities, assessment measures should be appropriate (National Association for the Education of Young Children & National Association of Early Childhood Specialists in State Departments of Education, 2003); reliable and valid (Snow & Van Hemel, 2008), including empirically valid (Hirsh-Pasek, Kochanoff, Newcombe, & de Villiers, 2005); and used for their intended purposes (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). The present study adds to the existing literature on teacher-based observational assessment, specifically authentic assessment, by examining the validity of the Teaching Strategies GOLD® (Heroman, Burts, Berke, & Bickart, 2010).
The Teaching Strategies GOLD® is a seamless, authentic assessment measure designed to be used by teachers to evaluate the development and learning of children, from birth through kindergarten, including English language learners (ELLs) and children with disabilities. The measure differs in several ways from other authentic assessment measures (e.g., Meisels, Bickel, Nicholson, Xue, & Atkins-Burnett, 2001; Meisels, Wen, & Beachy-Quick, 2010; Moreno & Klute, 2011; Schweinhart, McNair, Barnes, & Larner, 1993), including the range of ages and domains measured and item-level scale points and behavioral anchors and the size, diversity, geographic locations, and program types included in the samples used in validation studies. Teachers regularly gather information related to 38 research-based objectives through observations, conversations with children and families, artifacts, and so forth. Assessment information is used to plan appropriate experiences; individualize instruction; monitor child progress, including determining when more specific evaluation is needed; and communicate child progress to families and other stakeholders. The objectives are operationalized into rating scale items structured as follows: Social-Emotional, 8 items (understanding, regulating, and expressing emotions; building relationships; and interacting appropriately); physical, 5 items (gross-motor development and fine-motor strength and coordination); language, 8 items (understanding and using language to communicate or express thoughts and needs); cognitive, 10 items (approaches to learning; memory; classification; and the use of symbols to represent objects, events, or persons); literacy, 12 items (phonological awareness; alphabet, print, and book knowledge; comprehension; and emergent writing skills); and mathematics, 7 items (number concepts and operations, spatial relationships and shapes, measurement and comparison, and pattern knowledge). 
Teachers summarize information at three checkpoints (fall, winter, and spring) using paper or online versions of the instrument (see www.teachingstrategies.com for additional information).
Teacher and Classroom Context
Teacher-based observational assessment, including authentic assessment, is more subjective than standardized measures (Cabell, Justice, Zucker, & Kilday, 2009), and teachers’ appraisals do not always align with those of outside evaluators or parents (e.g., Dinnebeil et al., 2013; Sims & Lonigan, 2012). Some researchers suggest that teacher evaluations may be influenced by factors that threaten validity (e.g., Waterman, McDermott, Fantuzzo, & Gadsden, 2012). Mashburn and Henry (2004) noted, “At least half of the variance in kindergarten teachers’ ratings remained unexplained by the child’s observed skills and abilities” (p. 29). Preconceived ideas about children, especially subgroups (e.g., Bennett, Gottesman, Rock, & Cerullo, 1993; Burchinal et al., 2011); perceptions of value differences between teachers and families (Hauser-Cram, Sirin, & Stipek, 2003); and education, specialized training, or training and/or experience using the assessment instrument (e.g., Mashburn & Henry, 2004; Meisels et al., 2010) are associated with teachers’ evaluations. Measures embedded in daily instruction that use various methods to gather child information over time can provide more complete information (Cabell et al., 2009) about the strengths, functional competencies, and needs of children than other measures. Several studies indicate that teachers used authentic assessment measures to accurately assess the children in their classrooms (e.g., Meisels et al., 2001; Moreno & Klute, 2011).
Classroom context can also influence child assessment outcomes (e.g., Mashburn, Hamre, Downer, & Pianta, 2006). In a study of teacher judgments of preschoolers’ math skills, approximately 40% of the variance was attributed to systematic differences between classrooms rather than to the child (Kilday, Kinzie, Mashburn, & Whittaker, 2012). In classrooms with high percentages of infants and toddlers, teachers may be less accurate in their assessments than in classrooms with older children (Meisels et al., 2010). Teachers in low-socioeconomic-status (SES) and low-achieving contexts tend to underestimate students’ abilities (Ready & Wright, 2011), and more behavioral problems and fewer prosocial behaviors were noted by teachers in low-income classrooms (Phillips & Lonigan, 2010). Behavior (Dinnebeil et al., 2013) and academic skills can also vary according to the percentage of children with special needs in the classroom (Gallagher & Lambert, 2006).
Child Demographic Characteristics
Differing results are reported regarding gender differences during early childhood. Mashburn and Henry (2004) noted that preschool and kindergarten boys typically develop slower than girls. Girls seem to have an advantage in early language and literacy, and the gender gap may widen over time (Ready, LoGerfo, Burkam, & Lee, 2005). Although boys were rated lower than girls on literacy skills, only half of the differences were explained by actual between-group differences (Ready & Wright, 2011). A female advantage is also reported for self-regulation (Matthews, Ponitz, & Morrison, 2009) and for social competence and behavior (Walker, 2004). Boys are likely to be rated as displaying more problematic behaviors than girls (Graves & Howes, 2011; Jerome, Hamre, & Pianta, 2009). Gender gaps in mathematics were found favoring kindergarten boys at the upper end of the achievement distribution and for Hispanic girls over boys at the bottom of the distribution (Penner & Paret, 2008; Robinson & Lubienski, 2011). Other studies indicate a male advantage for spatial skills (e.g., Gibbs, 2010; Levine, Huttenlocher, Taylor, & Langrock, 1999). Some studies suggest that gender gaps do not emerge until after kindergarten, and several researchers report no gender differences in mathematics (Klein, Adi-Japha, & Hakak-Benizri, 2010), general knowledge or early literacy achievement (Matthews et al., 2009), and task engagement and behavioral conflict (Vitiello, Booren, Downer, & Williford, 2012).
The academic achievement of ELLs across all grade levels is generally lower than that of White, English-speaking students (Downer et al., 2012). Spanish-speaking ELLs have some of the lowest mathematics and reading skills (Reardon & Galindo, 2009), whereas Asian ELLs have higher scores (Yesil-Dagli, 2011). Penner and Paret (2008) reported a mathematics advantage for Asian boys in the top of the achievement distribution. In another study, teachers underestimated the literacy skills of Asian ELLs (Ready & Wright, 2011). The English skills of Hispanic ELLs were underestimated at the beginning of the year, but the perceived disadvantage disappeared by spring.
Children with disabilities tend to be rated lower by teachers and parents on positive social functioning measures and score lower on emergent literacy development (Gallagher & Lambert, 2006). They show higher rates of language delays in preschool than children without disabilities (Goldstein, 2004). Communication patterns can influence assessment ratings (e.g., Dinnebeil et al., 2013); children with early speech and language impairments may be ignored by peers and respond less frequently to peer initiations than typically developing children (Hadley & Rice, 1991).
Research Aims
The overall purpose of this study was to offer evidence for the validity of the measure being evaluated. To do so, we sought to demonstrate that teacher ratings using the measure indicate differences in the expected directions between subgroups of children with known differences on demographic variables (disability status, ELLs, etc.). Similarly, we attempted to demonstrate that classroom average ratings vary in the expected directions based on the demographic composition of the classrooms. We also sought to demonstrate that teacher ratings using the measure could be used to track the growth and development of children. It is important to note that test validity based on known group differences (DeVellis, 2003) is not the same thing as biased ratings. Evidence for item and test bias is not based on differences between subgroups of children where differences would be expected. Rather, item and test bias have specific statistical definitions (Clauser & Mazor, 1998) and are based on research findings of differential item or test functioning (DIF or DTF) that indicate subgroups of children receive different ratings after underlying ability has been controlled for. For example, if two children have the same underlying ability on the construct of interest, belong to two different subgroups (e.g., native English speakers and ELLs), and receive different ratings on an item, then bias may exist. Item bias occurs when the presence of differential item or test functioning in fact reflects construct-irrelevant variance in performance. The reader is referred to Kim, Lambert, and Burts (in press) for a study demonstrating that teacher ratings using this measure do not suffer from bias based on child disability status, ethnicity, or ELL status.
In addition, variance decomposition within a multilevel modeling context was used to examine how much of the variability in the ratings of child developmental progress is found between raters (teachers). Specifically, we addressed the following research questions:
Research Question 1: What child characteristics are associated with teacher ratings of child growth, development, and learning?
Research Question 2: What classroom composition characteristics are associated with teacher ratings of child growth, development, and learning?
Research Question 3: How much of the variability in ratings of child developmental progress is between raters in a model that controls for child and classroom characteristics?
Method
Participants
A total of 111,059 children were rated by 8,042 teachers using the Teaching Strategies GOLD® for the fall 2010 checkpoint. These children received educational services in 735 different programs at 3,792 different Head Start, private childcare, and school-based sites located in all regions and states of the United States. The population of children rated using the Teaching Strategies GOLD® spanned the entire age range for which the assessment is intended. Teachers rated an average of 13.8 children. The teachers collected information about the race and ethnicity of each child and entered this information into the online system.
A growth-norm sample of 21,592 was created by sampling from the total population of children rated across three time points (fall, winter, and spring) using the measure. The sample was selected, stratifying by ethnicity and region, from among all children who were rated during all three rating periods using the online version of the assessment measure during academic year 2010-2011. These children ranged in age from 12 to 59 months at the time of the fall assessment. There were not sufficient data in the population to include a representative sample of children who were younger than 12 months and older than 59 months at the time of the fall assessment. This sample came from 40 different states and from the District of Columbia. The children were from the Northeastern (7.5%), Midwestern (54.7%), Southeastern (21.2%), and Western (16.6%) regions of the United States. The sample was similar to the 2010 U.S. Census Bureau population statistics of preschool-aged children with respect to gender (male 51.2%, female 48.8%). White children were represented in approximately their national proportion (52.1% in the Census Bureau’s estimate, 50.9% in the norm sample). African American children were overrepresented (13.6% in the Census Bureau’s estimate, 21.9% in the norm sample). Native American or Alaskan Native children comprised 2.5% of the norm sample, and Asian or Pacific Islander comprised 3.0% of the norm sample. Multiracial children and children of all other ethnic subgroups were closely represented in the overall proportion (8.9% in the Census Bureau’s estimate, 8.7% in the norm sample). Teachers reported unknown racial identity for 13.0% of the children in the norm sample.
Approximately one quarter of the children were identified as Hispanic (25.5% in the Census Bureau’s estimate, 25.7% in the norm sample). The primary language spoken in the home was English for 76.1%, Spanish for 17.5%, and 63 other languages for the remaining 6.4% of the children. Children with an Individual Family Service Plan (IFSP) or Individualized Education Plan (IEP) comprised 11.9% of the norm sample. Table 1 includes a summary of these child characteristics for the entire sample and based on the classroom averages.
Table 1. Descriptive Statistics for Child and Classroom Characteristics.
Measure
Development of the Teaching Strategies GOLD® occurred over several years and incorporated feedback from teachers, administrators, consultants, and professional-development personnel; state early learning standards; and current research and professional literature, including literature identifying the knowledge, skills, and behaviors most predictive of school success. A study of the instrument with a subsample of infants through children aged 2 years (Kim & Smith, 2010) indicated high internal consistency reliability (α = .95-.99) and moderately high Rasch reliability statistics (person separation = 9.42, item separation = 19.20, person reliability = .99, item reliability = 1.00). Several other studies indicate generally strong overall psychometric properties of the instrument (Kim, Lambert, & Burts, in press).
Teachers rate child skills, knowledge, and behaviors along a 10-point progression of development and learning from “Not Yet” (Level 0) to Level 9 (beyond kindergarten expectations). “Indicator levels” (i.e., 2, 4, 6, and 8) include examples of what evidence may look like with majority and subgroups of children. “In-between levels” allow for additional steps in the progression. They do not include examples and are used to indicate that the child’s skills for that item are emerging but are not fully established. Teachers also enter into the online system the basic demographic information about each child and family that was used in this study. These teacher-reported values are also used to create classroom profiles of child demographics.
Teacher training on the instrument occurred over 2 days and included an overview of the measure and an exploration of the objectives and child progressions birth through kindergarten. Teachers watched video segments, participated in large-group discussions, evaluated a portfolio, completed family conference forms, and practiced uploading documentation samples and entering observation notes and progress checkpoints online. Interrater reliability of the measure (Kim, Lambert, & Burts, in press) was established by examining the correlations between the ratings of a master trainer and the ratings of teachers using the measure. The correlations were all above .90 with one exception, which was above .80.
Scale scores were created for each developmental domain using interval-level Rasch rating scale ability estimates. The ability estimates were then rescaled to conform to a distribution with a mean of 500 and standard deviation of 100. Values 3 or more standard deviations below the mean were given a value of 200, and values 3 or more standard deviations above the mean were given a value of 800. The scale score of 500 was considered normative for children 36 months of age, as these children are in the middle of the intended age range for use of the measure. A validation study of the Rasch-scaled developmental score suggests that teachers can make valid ratings of the developmental progress of children across the intended age range (Lambert, Kim, Taylor, & McGee, 2010). Rasch item and person reliabilities from these analyses, along with item and person separation indexes (Bond & Fox, 2007), were all very favorable for all scale scores across all three time points (Lambert et al., 2010). The Cronbach’s α reliability statistics obtained from this sample data were as follows: Social-Emotional (fall = .947, winter = .951, spring = .958), Physical (fall = .909, winter = .920, spring = .933), Language (fall = .957, winter = .960, spring = .965), Cognitive (fall = .961, winter = .965, spring = .972), Literacy (fall = .952, winter = .956, spring = .964), and Mathematics (fall = .937, winter = .940, spring = .951).
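The rescaling and truncation rule described above can be illustrated with a minimal sketch. It assumes a simple linear transformation of a Rasch ability estimate (in logits) relative to a reference mean and standard deviation; the function name, its parameters, and the choice of reference values are hypothetical and are not taken from the article, which anchors the score of 500 to children 36 months of age.

```python
def to_scale_score(logit, ref_mean, ref_sd,
                   target_mean=500.0, target_sd=100.0, cap_sd=3.0):
    """Convert one Rasch ability estimate to a truncated scale score.

    ref_mean and ref_sd describe the reference distribution of ability
    estimates (an assumption of this sketch, supplied by the caller).
    """
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    # Standardize against the reference distribution.
    z = (logit - ref_mean) / ref_sd
    # Truncate at 3 SDs, which yields 200 and 800 with the defaults.
    z = max(-cap_sd, min(cap_sd, z))
    return target_mean + target_sd * z


if __name__ == "__main__":
    # Hypothetical logit values, for illustration only.
    for logit in (-10.0, 0.0, 1.5, 6.0):
        print(logit, to_scale_score(logit, ref_mean=0.0, ref_sd=1.5))
```

Truncating on the standardized value rather than on the final score keeps the cap tied to the 3-standard-deviation rule even if a different target mean or standard deviation is supplied.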
Analysis
A special case of multilevel modeling, three-level growth-curve modeling, was used to address the research questions. Separate models were created using each scale score as the dependent variable. The HLM software package (Raudenbush, Bryk, & Congdon, 2004) was used for all analyses. The Level 1 models represented growth over time within child and included one predictor variable, month of the academic year centered on the winter assessment. Centering at the winter assessment was used to obtain more stable and robust estimates of intercept and growth rate. The resulting models therefore included an intercept term that represented each child’s estimated status at the winter assessment and a slope that estimated growth rate. The Level 2 models included child demographic variables. Child age was included as age in months at the time of the fall assessment. Disability status was coded 0 for typically developing children and 1 for children with an IEP or IFSP. Gender was coded 0 for females and 1 for males. ELL status was included as two dummy-coded variables representing Hispanic ELLs and all other ELLs. Native English speakers were accounted for as the baseline condition. Age in months was entered as a group-mean-centered predictor, and all other independent variables were entered uncentered. Classroom mean age, proportion with an IEP or IFSP, proportion of boys, and proportion in the two ELL categories were entered as grand-mean-centered predictors in the Level 3 models.
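One way to write out the three-level growth model described above is sketched below, for time point t, child i, and classroom j. The notation is ours and not the article's. The random effects at Levels 2 and 3, and the treatment of the child-level coefficients as fixed across classrooms, are assumptions, because the article does not report the random-effects structure. The index p = 0 denotes winter status and p = 1 denotes growth rate. The five classroom predictors W are mean age, proportion with an IEP or IFSP, proportion of boys, proportion of Hispanic ELLs, and proportion of other ELLs.

```latex
\begin{align*}
% Level 1: time points within child, month centered on the winter assessment
Y_{tij} &= \pi_{0ij} + \pi_{1ij}\,(M_{tij} - M_{\text{winter}}) + e_{tij} \\
% Level 2: children within classroom, age group-mean centered, dummies uncentered
\pi_{pij} &= \beta_{p0j}
  + \beta_{p1j}\,(\text{Age}_{ij} - \overline{\text{Age}}_{j})
  + \beta_{p2j}\,\text{IEP}_{ij}
  + \beta_{p3j}\,\text{Male}_{ij} \\
 &\quad + \beta_{p4j}\,\text{HispELL}_{ij}
  + \beta_{p5j}\,\text{OtherELL}_{ij}
  + r_{pij}, \qquad p = 0, 1 \\
% Level 3: classrooms, composition predictors grand-mean centered
\beta_{p0j} &= \gamma_{p00}
  + \sum_{q=1}^{5} \gamma_{p0q}\,(W_{qj} - \overline{W}_{q})
  + u_{p0j} \\
% Assumption of this sketch: child-level coefficients do not vary across classrooms
\beta_{pkj} &= \gamma_{pk0}, \qquad k = 1, \dots, 5
\end{align*}
```

Under this specification, the winter-status intercept is the expected score at the winter checkpoint for a female, typically developing, native English-speaking child of classroom-average age in a classroom of average composition.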
Results
Table 2 includes the variance decomposition estimates from the unconditional models, that is, models that contain no predictors. These results indicate that approximately one third of the variance in the scale scores was found between time points for the same child, another third between children within the same classroom, and another third between classrooms. In this application, each classroom had its own teacher or rater, so this value is also the proportion of the variance between raters, prior to accounting for classroom characteristics of the children.
Table 2. Variance Decomposition.
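The variance decomposition from an unconditional three-level model reduces to dividing each variance component by their sum. The sketch below assumes the three components have already been estimated; the function name and the example values are hypothetical and are not the estimates reported in Table 2.

```python
def variance_proportions(within_child, between_children, between_classrooms):
    """Return each variance component as a share of the total variance."""
    components = {
        "within_child": within_child,              # Level 1: between time points
        "between_children": between_children,      # Level 2: within classroom
        "between_classrooms": between_classrooms,  # Level 3: between raters
    }
    if any(value < 0 for value in components.values()):
        raise ValueError("variance components cannot be negative")
    total = sum(components.values())
    if total == 0:
        raise ValueError("total variance must be positive")
    return {name: value / total for name, value in components.items()}


if __name__ == "__main__":
    # Hypothetical, equal components chosen to mirror the "one third each" pattern.
    shares = variance_proportions(3000.0, 3000.0, 3000.0)
    for name, share in shares.items():
        print(f"{name}: {share:.3f}")
```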
Table 3 includes the model’s estimated coefficients for the child demographic variables for each scale score, for both the winter status and growth models. These results address the first research question. Child age in months, as expected, was a statistically significant predictor of both winter status and growth rate for all scale scores. These results indicate that children were rated about 5 points higher for every additional month of age. These coefficients ranged from 4.432 points per month for the Physical scale to 5.656 for the Cognitive scale. The growth rate models indicate that we would expect children to grow about 0.10 points per month faster for every additional month of age. These coefficients ranged from 0.064 for the Social-Emotional scale to 0.187 for the Language scale score.
Table 3. Results of Level 2 Models: Child Characteristics Associated With Initial Status and Growth.
Note: *p < .05. **p < .01. ***p < .001.
Child disability status was also a significant predictor of both winter status and growth rate, in the expected directions. Based on these model results, children with disabilities can be expected to be rated lower and to grow more slowly than typically developing children. Winter status coefficients ranged from −43.129 for Language to −22.518 for Physical. Growth rate coefficients ranged from −1.609 for the Cognitive scale to −.581 for the Social-Emotional scale. Boys were rated significantly lower than girls on all scale scores, with coefficients ranging from −15.338 for Social-Emotional to −6.447 for Mathematics. Boys were also rated as growing more slowly than girls on every scale except Mathematics; the growth coefficients ranged from −.631 for Language to −.115 for Mathematics.
Hispanic ELL children were rated significantly lower on all scale scores, from −31.435 for Language to −2.432 for Physical. Teachers rated their growth as significantly higher than that of native English-speaking children for two of the scale scores: .367 for Social-Emotional and .345 for Physical. Teachers rated their growth as significantly lower than that of native English-speaking children for Mathematics (−.329). Non-Hispanic ELL children were rated significantly lower than native English-speaking children on all of the scales except Physical; the coefficients ranged from −28.575 for Language to −.693 for Physical. However, these children were rated as growing faster than native English-speaking children on the Literacy (.529) and Mathematics (.546) scale scores.
Table 4 includes the coefficients for the Level 3 classroom composition variables addressing the second research question. These models estimate the average winter status for each scale score, ranging from 594.927 for Physical to 615.816 for Cognitive. As expected, class mean age was significantly associated with class average winter status ratings. These coefficients ranged from 4.210 for the Physical scale to 5.438 for the Cognitive scale. The classroom proportion of children with disabilities was significantly associated with class mean winter status for four of the scale scores (Social-Emotional, Language, Cognitive, and Literacy), indicating that classrooms with higher proportions of these children would be expected to be rated, on average, lower than classrooms with lower proportions. The proportion of boys in the classroom was not related to the classroom winter status ratings for any of the scale scores. The classroom proportion of Hispanic ELL children was significantly associated with class average winter status, in the negative direction (lower ratings), for all scale scores except Literacy. The opposite pattern emerged for non-Hispanic ELL children: higher proportions of these children were significantly associated with higher average winter status ratings for all scale scores except Literacy and Mathematics.
Table 4. Results of Level 3 Models: Classroom Characteristics Associated With Class Mean Initial Status.
Note: *p < .05. **p < .01. ***p < .001.
These models estimated the average monthly growth rates to be significant for each scale score, ranging from 14.693 for Mathematics to 18.901 for Cognitive. As expected, class mean age was significantly associated with growth rates for all scale scores, ranging from .156 for Physical to .309 for Cognitive. The classroom proportion of children with disabilities was significantly associated with lower class mean growth for all scale scores, ranging from −3.072 for Physical to −3.559 for Cognitive, indicating that classrooms with higher proportions of these children would be expected to grow, on average, more slowly than classrooms with lower proportions. The proportion of boys in the classroom was not related to the classroom average growth for any of the scale scores. The classroom proportion of Hispanic ELL children was associated with significantly higher average growth rates for all scale scores, ranging from 2.502 for Literacy to 3.599 for Cognitive. The opposite pattern emerged for non-Hispanic ELL children: higher proportions of these children were significantly associated with lower average growth rates for the Social-Emotional scale (−2.026).
Table 5 includes the proportion of variance accounted for by the predictors in each model by scale score. Within the HLM context, estimates of variance accounted for can be made by observing the reduction in the residual variance in the model after the inclusion of the predictor variables. The time of year of the assessment, as a predictor of linear growth in the Level 1 models, was associated with approximately 80% of the variability in scores within child. Child demographic characteristics were associated with approximately 30% of the variance in child scores within classrooms. The classroom characteristics included in the models were associated with approximately 40% of the variance between classrooms in average winter status.
Table 5. Variance Accounted for by Model Predictors.
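The "reduction in residual variance" estimate described above is commonly computed as the proportional drop in a variance component between the unconditional model and the model with predictors. The sketch below assumes that definition; the function name and the example variances are hypothetical and are not values from Table 5.

```python
def proportion_variance_explained(unconditional_variance, conditional_variance):
    """Proportional reduction in a variance component after adding predictors."""
    if unconditional_variance <= 0:
        raise ValueError("unconditional variance must be positive")
    if conditional_variance < 0:
        raise ValueError("conditional variance cannot be negative")
    return (unconditional_variance - conditional_variance) / unconditional_variance


if __name__ == "__main__":
    # Hypothetical between-classroom variances before and after adding
    # the classroom composition predictors.
    explained = proportion_variance_explained(3000.0, 1800.0)
    print(f"Proportion of between-classroom variance explained: {explained:.2f}")
```

The same function applies at each level: within-child variance for the time predictor, within-classroom variance for the child characteristics, and between-classroom variance for the classroom composition variables. The estimate can be negative if a variance component increases after predictors are added, which is a known property of this statistic in multilevel models.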
To address the third research question, we conceptualized the predictor variables in the model (time of the school year [fall, winter, and spring], demographic characteristics of the children, and the demographic composition of the children within each classroom) as expected sources of variance in the scale scores, given the focus of the teacher ratings and the expected variability in their ratings of growth, development, and learning. We calculated the proportion of the variance in the scale scores that could be considered as possibly due to rater effects, that is, differences in how teachers use the measure to rate the children in their own classrooms, by examining the proportion of the total variance in the scale scores that consisted of residual variance in the Level 3 models after controlling for all predictors. This quantity is calculated for each scale score by dividing the Level 3 residual variance by the estimated total variance from the unconditional model. The Level 3 residual variance term represents the between-rater variance in the scale scores that is not accounted for by the model predictors that represent differences between classrooms in demographic composition. It is important to note that within the HLM models that we used, adding the child characteristics to the models controlled for within-rater variance, not between-rater variance. Classroom characteristics were controlled for in an effort to account for the between-rater variance that is due to the inevitable differences between classrooms in aggregate child demographics. Given that the results of the models clearly demonstrate that teacher ratings using this measure are able to distinguish between subgroups of children in the expected directions, it is reasonable to expect that differing classroom demographic profiles would therefore result in between-rater variance that is due not to rater effects but to actual differences between classrooms.
These residual variance terms, expressed as proportions of the total variance in scale scores, were as follows: Social-Emotional .190, Physical .252, Language .160, Cognitive .190, Literacy .177, and Mathematics .173. These values indicate that between approximately 16% and 25% of the variance in scale scores is accounted for by unmeasured differences between classrooms and teachers, including rater effects. Similarly, these results suggest that between approximately 75% and 84% of the variance in the scale scores is associated with either the predictor variables in the model or unmeasured child characteristics.
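The calculation behind these proportions, and the ranges drawn from them, can be sketched as follows. The six proportions are the values reported above; the function names and the variances in the `rater_share` example are hypothetical illustrations of the stated rule (Level 3 residual variance divided by the total variance from the unconditional model).

```python
# Proportions of total variance left as Level 3 residual variance, as reported above.
REPORTED_RATER_SHARES = {
    "Social-Emotional": 0.190,
    "Physical": 0.252,
    "Language": 0.160,
    "Cognitive": 0.190,
    "Literacy": 0.177,
    "Mathematics": 0.173,
}


def rater_share(level3_residual_variance, total_unconditional_variance):
    """Share of total variance that remains between raters after all predictors."""
    if total_unconditional_variance <= 0:
        raise ValueError("total variance must be positive")
    return level3_residual_variance / total_unconditional_variance


def summarize_shares(shares):
    """Smallest and largest rater shares, plus the complementary shares."""
    low = min(shares.values())
    high = max(shares.values())
    return {
        "rater_low": low,
        "rater_high": high,
        "other_low": 1.0 - high,   # variance tied to predictors or child differences
        "other_high": 1.0 - low,
    }


if __name__ == "__main__":
    # Hypothetical variances, used only to show the division.
    print(f"example share: {rater_share(1800.0, 9000.0):.3f}")
    for label, value in summarize_shares(REPORTED_RATER_SHARES).items():
        print(f"{label}: {100 * value:.1f}%")
```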
Discussion and Implications
The Teaching Strategies GOLD® adds unique contributions to current authentic assessment measures through its design and validation processes. With any new assessment tool it is crucial to explore its psychometrics (Snow & Van Hemel, 2008). The present study provides further support for the measure’s validity and usefulness (Kim & Smith, 2010; Kim, Lambert, & Burts, in press). Specifically, the instrument showed sensitivity to age differences and to growth over time. As expected, older children had higher scores at all checkpoints than younger children. Supporting other research, children with disabilities started behind their nondisabled peers and grew slower over the year (Gallagher & Lambert, 2006; Goldstein, 2004). Similar to other studies, girls showed some advantages over boys (e.g., Matthews et al., 2009; Walker, 2004). Boys began lower and grew somewhat slower than girls (Ready et al., 2005) in all areas except mathematics.
Corroborating other studies (e.g., Downer et al., 2012; Reardon & Galindo, 2009; Yesil-Dagli, 2011), ELLs generally were rated lower at the beginning of the year than English-speaking peers and in some cases grew at faster rates than non-ELL peers. As children gained English skills and teachers became familiar with the children and their families, teachers may have become more accurate in their ratings. Hispanic ELLs’ growth in mathematics was lower than that of English-speaking peers, whereas non-Hispanic ELLs grew faster in mathematics and literacy. This finding is not especially surprising in that some studies indicate that Hispanic ELLs have some of the lowest mathematics skills of any group (Reardon & Galindo, 2009). Mathematics is the hardest form of language for children to learn (Ginsburg, Lee, & Boyd, 2008). Hispanic children often come from families with less parent–child linguistic engagement and lower-SES backgrounds than White and Asian American children (Garcia & Jensen, 2009), factors shown to influence literacy and mathematics skills.
Some authorities question whether teacher reports represent actual child differences or other factors such as teacher variability or classroom context (e.g., Gallagher & Lambert, 2006; Ready & Wright, 2011; Waterman et al., 2012). Mashburn and Henry (2004) note it is common for teachers’ global ratings of young children’s skills to have very high unexplained variance (as much as 50%). In the present study, error variance ranged from 16% to 25%, considerably lower than reported in some studies (e.g., Kilday et al., 2012). Teacher-based observational assessment is more subjective than standardized measures (Cabell et al., 2009) and has the possibility for greater variability (Kilday et al., 2012). Appropriate training on any teacher observational measure is essential (Dinnebeil et al., 2013) and can facilitate teachers’ awareness of the influence their perceptions (e.g., Bennett et al., 1993; Burchinal et al., 2011) and classroom contexts (e.g., Gallagher & Lambert, 2006; Meisels et al., 2010; Ready & Wright, 2011) have on child appraisals. Assessment measures that are embedded in daily instruction and that use various sources and methods to gather child information over time provide more complete information (Cabell et al., 2009) than other measures. Furthermore, the research-based objectives, multiple examples, additional scale points, behavioral anchors along the developmental progressions, and well-developed teacher training may have helped the Teaching Strategies GOLD® address issues related to teacher-based ratings found in other studies.
Limitations and Directions for Future Research
It is important to note that the Level 2 models included only those child characteristics that were available to the researchers. Future research may benefit from the inclusion of a richer set of child and family characteristics. For example, parent education level and income, family SES, and the exact nature of special needs that lead to the disability status of the children were not available to the researchers. It is also important to note that the Level 3 models did not include teacher characteristics such as years of experience, educational level, and hours of training related to assessment issues in general and the Teaching Strategies GOLD® system in particular. It is possible that some of the variance between classrooms in these analyses was associated with other unmeasured factors about the teachers and differences in classrooms or centers that are not accounted for by the demographic composition variables included in the models. For example, it is likely that centers and program sites vary in the amount and quality of training, supervision, and ongoing support that teachers receive related to assessment issues. Future research that focuses on a more thorough examination of the decomposition of the variance in ratings could build on the findings of this study with a formal generalizability study to examine interrater reliability.
Footnotes
Authors’ Note
This article is based on some of the same datasets and analyses contained in a previously released technical report entitled Technical Manual for the Teaching Strategies GOLD® Assessment System.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Partial funding for this research was provided by Teaching Strategies, LLC. Views expressed are those of the authors and do not necessarily reflect those of the funding agency.
