Abstract

The special issue presents four empirical articles that address the use of observation as a research methodology in schools and with early adolescents. The growing use of systematic, empirically tested observational frameworks in school-based research is crucial for increasing the replicability and generalizability of findings across settings. That said, observations are often mistakenly assumed to be the “gold standard” assessment, without more nuanced discussions about the best uses and potential misuses of observational methodology. In the following commentary, we use the four articles as springboards from which to discuss strengths and drawbacks of observational methodology. We address issues related to rigor and replicability, matching the choice of the observational system to the phenomenon under study, considering early adolescents’ developmental needs in the choice of system, and the implications of averaging observations across time points. Throughout, we explore the implications for applied use in educational settings. We conclude with suggestions for future directions as the field increasingly utilizes systematic observation of early adolescents in school-based research.
Rigor and Replicability
A commonly referenced benefit of observational research methodology is that it may provide a less biased lens to capture early adolescents’ behaviors. Relative to teacher ratings, which may be subject to teachers’ social desirability biases, positive or negative halo effects about students, or idiosyncratic interpretation of questions or scales, observational systems are usually implemented by independent coders who attend trainings and undergo coding practice (Ostrov & Hard, 2013). Typically, coders are then tested on the reliability of their coding. For instance, two coders might independently observe the same student at the same time and rate that student’s behaviors, after which ratings are examined for inter-rater reliability across coders. Although two coders can come to agree with each other while both being incorrect (i.e., they begin to see things the same way as each other, but that way is incorrect, a phenomenon known as coder drift), this may be prevented via training, periodic joint coding, and reliability checks (Ostrov & Hard, 2013).
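As an illustration of the reliability check described above, the following minimal sketch computes Cohen's kappa, one common chance-corrected index of agreement between two coders. The codes and the ten-interval data are hypothetical, and kappa is only one of several indices a study might report; it is not drawn from any of the four articles.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must code the same non-empty set of intervals.")
    n = len(coder_a)
    # Proportion of intervals on which the two coders gave the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal code frequencies.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same ten time samples.
coder_a = ["solitary", "group", "group", "solitary", "group",
           "group", "solitary", "group", "group", "group"]
coder_b = ["solitary", "group", "solitary", "solitary", "group",
           "group", "group", "group", "group", "group"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Note that kappa indexes agreement between coders, not accuracy. Two coders who have drifted in the same direction would still obtain a high value, which is why periodic joint coding against a reference standard remains important.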
We note that it is challenging (and costly) to train independent observers to perceive behavior and settings in similar ways. It requires methodical training procedures and manuals with detailed descriptions of behaviors and accompanying codes. In addition, it requires personnel who are ideally impartial and not the classroom teacher to serve as observers. For this reason, observational systems of student behaviors are often more costly than alternatives such as rating scales, and therefore may be impractical for some school districts to enact.
A clear strength of codified training and detailed manuals is the transportability of observational systems (and training procedures) across settings (Downer et al., 2011). This can result in a big leap forward in scientific understanding of phenomena. With large and diverse samples, researchers can ascertain the soundness of a framework across a range of settings. Studies in the current issue make such contributions: Oh, Osgood, and Smith (2015) show that the factor structure of two observational systems was similar across 44 afterschool programs set in varying locations (urban vs. suburban) and enrolling diverse groups of youth (e.g., majority affluent vs. majority low income). Similarly, Hafen, Hamre, Allen, Bell, Gitomer, and Pianta (2015) demonstrate that the factor structure of the Classroom Assessment Scoring System–Secondary (CLASS-S) observational system holds across 1,482 classrooms with varying content matter (e.g., English, math). This helps increase confidence in applying these observational frameworks with diverse populations in diverse settings. Yet, as authors in the current issue acknowledge, replication of results is needed, with the consideration of establishing predictive validity of observational system codes to valued child and setting-level outcomes. Once this is established, it will be incumbent upon scholars to collaborate and share frameworks, thereby accumulating bodies of evidence to inform theory about early adolescents and their educational settings.
On the other hand, claims that observations are “less prone to bias” need to be tempered with the recognition that observational systems are grounded in beliefs and assumptions (Downer et al., 2011). What behavior is coded as positive, or considered to be normative versus deviant, reflects the values of the individuals who created (and chose to adopt) that particular observational system. Training simply encourages coders to consistently, and reliably, adopt those particular beliefs and assumptions as they view behavior. Observational systems are also affected by the social context in which the observations are occurring. For instance, in some settings, gendered norms for behavior may be tightly enforced, such that male students may be expected to play in groups and ostracized if they do not do so (as suggested by Coplan, Ooi, & Rose-Krasnor, 2015). But, in another setting, there may be leeway for non-conforming behavior whereby boys do not have to conform to the expected patterns of play at recess. As we disseminate observational systems across diverse contexts, it is important to be aware that we might fail to recognize the extent to which the values of the researchers or the unique setting in which the system was created have influenced the system.
Matching to the Studied Phenomenon
Observational systems vary in the degree to which they focus on discrete individual behaviors, sequences of interactions among individuals, or patterns of interactions aggregated at the setting level (Stuhlman, Hamre, Downer, & Pianta, 2014). There are strengths and weaknesses to each focus. Ultimately, it behooves researchers and educators to select an observational system that matches the phenomenon of interest.
The range in scope of observational systems is reflected in the four studies in the current issue. The Student Interaction in Specific Settings (SISS) focuses on the number of students’ discrete rule-violating behaviors in a 5-minute span within a preselected 10-foot by 20-foot physical space (Cash, Bradshaw, & Leaf, 2015). This coding system isolates narrow behavioral indicators. The Play Observation Scale, however, integrates a slightly wider range of behaviors in a single code (Coplan et al., 2015). For instance, after watching a child in a series of 30-second time samples, an observer using the Play Observation Scale codes “solitary play” if the child both plays alone at a distance from peers and pays little or no attention to the group. Other coding systems are more global in nature; observers detect patterns of interactions to characterize a milieu. Oh et al. (2015) and Hafen et al. (2015) employ these kinds of setting-level observational systems in their use of the Promising Practices Rating Scale, the Caregiver Interaction Scale, and the CLASS-S.
It may be easier to attain inter-rater reliability the more that a coding system focuses on simple discrete behaviors (Ostrov & Hard, 2013). On the other hand, global coding systems that reflect complex transactions among students and/or teachers may better capture the climate of a classroom or processes with generalizability to broader outcomes such as student adjustment or achievement (Grossman et al., 2010). Ultimately, however, the choice of the observational coding system should match the scope of the studied phenomenon. Existing research and theory should guide the process in determining whether patterns of interaction or instances of discrete behavior are most pertinent to the research question. For example, researchers interested in adults’ differential treatment of children in the same classroom would not select the CLASS-S as their observation tool. The CLASS-S has been validated as a global indicator of the quality of interactions among all students and teachers and is not designed to detect how clusters of students within the same classroom may experience the teacher differently (Weinstein, 2008).
In addition, observational systems vary in the type of behavior (or process) they are meant to assess. This should also be a consideration when deciding whether to use observation as the methodology of choice. Even though observational systems are often considered the “gold standard,” their utility depends on the phenomena being studied (Stuhlman et al., 2014). Coplan et al. (2015) suggest that educators can be trained to use the Play Observation Scale as a screener for identifying students who may need further assessment of their socio-emotional functioning because of their withdrawn social behaviors. There is clear value in early detection of problematic peer relationships. That said, an important question for future research is whether it might be more efficient and cost-effective to implement school-wide screening for internalizing difficulties using a self-report instrument or a teacher report.
The shortcomings of self-report, however, are more glaring when it comes to students exhibiting externalizing difficulties. These students tend to see themselves in an artificially positive light relative to the perceptions of others, a phenomenon known as positive illusory bias (Owens, Goldfine, Evangelista, Hoza, & Kaiser, 2007). As such, observation (or teacher report) may be necessary to more accurately capture behavior. Nonetheless, it may still be difficult for either teachers or observers to detect covert aggressive acts, or relationally aggressive acts, among peers (as opposed to overt physical aggression; Cornell, Sheras, & Cole, 2006). These problematic aggressive social behaviors may be best detected through methods in which peers report one another’s behaviors (e.g., peer sociometrics).
Developmental Sensitivity
Hafen et al. (2015) raise considerations about observational systems’ attunement to the developmental stage of the observed students. The secondary school version of CLASS directs observers to note the degree to which students have autonomy in the classroom, including their opportunities for leadership. The inclusion of this indicator reflects the autonomy needs related to healthy adolescent development. This underscores the importance of closely critiquing observational systems for their developmental attunement. We cannot assume that a system developed to observe behavior at the preschool level is applicable to the high school level.
Developmental considerations can also guide what kinds of behaviors should be observed at which age. Coplan et al. (2015) convincingly argue that social participation in the schoolyard among early adolescents is worthy of study. They note that as children grow older, norms around the expected types of social interactions become increasingly narrow. Thus, early adolescents who are atypical in their patterns of social interaction may, in fact, be more distressed or more rejected by peers than are younger students. An implication is that we need to carefully consider when the investment in observational measurement (given the costs) is most worthwhile, and developmental considerations of the observed students may inform this decision.
Averaging Across Time Points
Observational systems can call for observing behavior across multiple time points and then taking the average of these observations as the final score (Cash & Pianta, 2014). An underlying assumption of this methodology is that each observation carries equal weight in affecting the final score. Another assumption is that any variability within a participant across time points represents measurement error.
Both these assumptions merit testing (Cash & Pianta, 2014; Curby, Brock, & Hamre, 2013). Regarding the assessment of the quality of peer interactions, one fleeting, highly aggressive act that occurs in less than 1 minute might be the most salient interaction of the day for that student and have ramifications for his or her well-being. The averaging, then, falsely presumes that this interaction carries equal weight relative to all the other intervals of equal time in which nothing salient occurred.
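The concern about equal weighting can be made concrete with a small sketch. The ratings below are entirely hypothetical (0 = no aggression, 5 = severe aggression, one rating per 1-minute interval), and the summary statistics are illustrative choices rather than features of any published observational system.

```python
from statistics import mean

# Hypothetical ratings for ten 1-minute intervals (0 = none, 5 = severe).
student_a = [0, 0, 0, 0, 5, 0, 0, 0, 0, 0]  # one fleeting, highly aggressive act
student_b = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # repeated mild incidents


def summarize(ratings):
    """Return the averaged score alongside summaries that averaging hides."""
    return {
        "mean": mean(ratings),
        "peak": max(ratings),
        "intervals_with_incident": sum(r > 0 for r in ratings),
    }


for name, ratings in [("Student A", student_a), ("Student B", student_b)]:
    print(name, summarize(ratings))
```

Both hypothetical students receive the same averaged score, yet their experiences differ sharply: one has a single severe incident and the other has several mild ones. Reporting a peak or a count alongside the mean is one simple way to keep that difference visible.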
Oh et al. (2015) considered the different sources of variation in ratings of afterschool settings. Although ratings between different caregiving staff in afterschool programs, not surprisingly, accounted for much of this variability, day-to-day fluctuations of ratings within the same caregiving staff also represented a substantial proportion of variability. Such day-to-day fluctuations may not be measurement error but rather a real phenomenon in afterschool programs and classrooms (Patrick & Mantzicopoulos, 2014). Day-to-day fluctuations may also be meaningful for valued outcomes. For instance, students who experience their caregiving staff or teachers as consistently neutral may have different adjustment than students who experience them as dramatically positive one day but negative the next.
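A simplified, descriptive sketch of separating between-staff from within-staff (day-to-day) variability follows. The data are hypothetical and the split is a plain decomposition of variance for a balanced design; it is not the analysis reported by Oh et al. (2015), and a formal study would more likely use a multilevel or generalizability model.

```python
from statistics import mean, pvariance

# Hypothetical quality ratings (1-7) for three staff members on four days each.
ratings_by_staff = {
    "staff_1": [4, 4, 4, 4],  # consistently neutral
    "staff_2": [6, 2, 6, 2],  # positive one day, negative the next
    "staff_3": [5, 6, 5, 6],
}

# Each staff member's average rating across days.
staff_means = {staff: mean(days) for staff, days in ratings_by_staff.items()}

# Between-staff component: variance of the staff averages.
between_staff = pvariance(list(staff_means.values()))

# Within-staff component: average day-to-day variance within each staff member.
# With an equal number of days per staff member, the two components sum to
# the total variance of all ratings.
within_staff = mean([pvariance(days) for days in ratings_by_staff.values()])

within_share = within_staff / (between_staff + within_staff)
print(f"Between-staff variance: {between_staff:.2f}")
print(f"Within-staff (day-to-day) variance: {within_staff:.2f}")
print(f"Share of variance that is day-to-day: {within_share:.0%}")
```

In this illustration, staff_1 and staff_2 have identical averages, so a score averaged across days would treat them as equivalent even though students would experience them very differently.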
Relevance for Educators
Behavioral observations have potential to offer clear data with practical uses for educators (Pianta & Hamre, 2009). Cash et al. (2015) found that their observations of rule violations were significantly and negatively related to staff reports of positive behavior supports in non-classroom settings. When there were more systems in place to manage behavior (as reported by staff surveys) in non-classroom areas, fewer verbal violations were observed. The findings could provide convincing evidence to staff that “active supervision” in non-classroom settings and “scheduling of student movement” are useful interventions in reducing problematic behavior among students.
Systematic observations can demonstrate how classrooms and afterschool settings “work.” They open up the “black box” and identify the “mechanisms of action” that help or hinder students (Pianta & Hamre, 2009). The behavioral anchoring of observational systems is particularly useful in concretely identifying adult behaviors that are linked to positive student outcomes (Grossman et al., 2010). This can facilitate skill development in staff. Some programs, such as My Teaching Partner, have already integrated systematic observation into their coaching and training models (Hafen et al., 2015).
In this era of accountability, teacher effectiveness evaluation systems typically integrate classroom observations of instruction (Ho & Kane, 2013; Ohio Department of Education, 2011). However, a concern is that observations are being used as a “gold standard” for assessment with few training or support mechanisms in place for teachers (Pianta & Hamre, 2009). In other words, results from observations are being used punitively and not as integral to a process of training and professional development.
Another concern is the challenging, labor-intensive, and resource-demanding nature of high-quality observation (Mashburn, Downer, Rivers, Brackett, & Martinez, 2013). The studies in this issue demonstrated the need for rigorous psychometric testing of observational systems before they are employed at scale. Given the increasing popularity of observational systems as integral to teacher assessment, it will be key to develop efficient ways to effectively train observers. In addition, it is crucial to document the extent to which observations add incremental prediction to important outcomes, above and beyond the utility of rating scale measures, which are typically easier and less costly to administer.
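One way to document incremental prediction is to compare the variance in an outcome explained by a rating scale alone with the variance explained once an observational score is added. The sketch below uses the standard closed-form expression for R-squared with two predictors; the variable names and data are hypothetical, and a real analysis would also need significance tests, covariates, and attention to nesting of students within classrooms.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def incremental_r2(outcome, baseline, added):
    """Return (R2 baseline only, R2 with both predictors, gain in R2)."""
    r_yb = pearson(outcome, baseline)
    r_ya = pearson(outcome, added)
    r_ba = pearson(baseline, added)
    if abs(r_ba) == 1.0:
        raise ValueError("Predictors are perfectly collinear.")
    r2_baseline = r_yb ** 2
    # Closed-form R2 for a linear regression with two predictors.
    r2_full = (r_yb ** 2 + r_ya ** 2 - 2 * r_yb * r_ya * r_ba) / (1 - r_ba ** 2)
    return r2_baseline, r2_full, r2_full - r2_baseline


# Hypothetical scores for eight students.
teacher_rating = [3, 4, 2, 5, 3, 4, 2, 5]
observation = [2, 4, 3, 5, 2, 3, 3, 4]
adjustment = [3, 5, 2, 6, 2, 4, 3, 5]

r2_base, r2_full, gain = incremental_r2(adjustment, teacher_rating, observation)
print(f"R2 rating scale only: {r2_base:.2f}")
print(f"R2 rating scale plus observation: {r2_full:.2f}")
print(f"Incremental R2 from observation: {gain:.2f}")
```

A small gain would suggest that the observational score adds little beyond the less costly rating scale for that outcome, whereas a larger gain would help justify the added expense.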
Summary
Behavioral observation is an assessment technique with much relevance for use in schools and with early adolescents. The benefits of observational coding systems include the potential to reduce self-report (or teacher-report) bias, to test frameworks across diverse contexts, and ultimately, to inform educational theory and practice. However, it is also recommended that researchers and educators consider the values and beliefs underlying the observational system, match their choice of observational system to the phenomenon under study and the developmental needs of the target population being observed, and question the assumption that observations should be averaged across time points. Another challenge is for researchers to report when findings across sources and methods diverge (e.g., teacher ratings, self-report ratings, and observations yield different pictures). Instead of being seen as a shortcoming, such divergence may open the doors to more complex understandings of phenomena.
