Abstract
Direct assessments of instructional practice (e.g., classroom observations) are necessary to identify and eliminate opportunity gaps in students’ learning of mathematics. This study examined 114 middle school mathematics classrooms in four instructionally focused urban districts. Results from the Instructional Quality Assessment identified high percentages of lessons featuring cognitively challenging tasks, but declines in cognitive challenge during implementation and discussions. Overall instructional quality exceeded results from studies with nationally representative samples and paralleled results of studies of instructionally focused urban middle schools. Significant differences existed between districts, favoring the district with veteran teachers, long-term use of Standards-based curricula, and professional development initiatives.
In more than a decade of educational policy advocating standardized testing as the primary means of improving mathematics teaching and learning, students in U.S. classrooms continue to post substandard performance on mathematical assessments at state, national, and international levels (National Research Council, 2011). As evidence, only 30% of U.S. eighth-grade students scored high or advanced in mathematics on the 2011 Trends in International Mathematics and Science Study (TIMSS), compared with a minimum of 61% of students at high or advanced levels in the five top-performing countries (Mullis, Martin, Foy, & Arora, 2012). In the 2011 National Assessment of Educational Progress (NAEP), only 35% of the national sample of eighth graders demonstrated mathematical proficiency (National Center for Education Statistics [NCES], 2011a), with only Massachusetts having more than half (51%) of eighth-grade students proficient in mathematics. Substandard performance on mathematical achievement tests is even more pronounced in urban and rural schools, schools serving large populations of students with limited English proficiency, and schools in areas with high poverty (U.S. Department of Education, NCES, 2006). In the NAEP 2011 Trial Urban District Assessment, which analyzed data from a subset of 21 urban districts participating in the 2011 NAEP, an average of 26% of eighth-grade students demonstrated mathematical proficiency (NCES, 2011b).
Across this same time period, results from educational research consistently indicate that the most significant factors associated with students’ mathematical achievement are pedagogical (Boaler & Staples, 2008; Hiebert et al., 2003; Stein & Lane, 1996). Differences in the implementation of curricula and other resources, between schools and between teachers within the same school, provide different opportunities for learning that subsequently generate differences in student achievement. Even when high-quality resources are present, student learning is mainly affected by how resources or curricula are implemented in the classroom. Understanding how to improve students’ opportunities to learn mathematics thus requires direct assessments, based on observations and artifacts of teaching, of what teachers and students are doing in classrooms in the process of teaching and learning mathematics. This is particularly important in urban districts, where a deep understanding of students’ opportunities to learn mathematics is essential for identifying strengths of the system (i.e., classroom practices that appear to be supporting students’ learning) and pathways for improvement (i.e., classroom practices that might be changed to enhance students’ learning).
Toward this purpose, mathematics education research consistently identifies a set of instructional practices that appear to support students’ learning of mathematics with understanding, collectively called “ambitious mathematics instruction” (Franke, Kazemi, & Battey, 2007). Research connecting ambitious mathematics instruction to student achievement has identified key instructional components, such as cognitively challenging tasks (i.e., tasks that engage students in making sense of mathematics; Hiebert et al., 2003; Stein & Lane, 1996; Tarr, Reys, Reys, Chavez, Shih, & Osterlind, 2008) and mathematical discussions (Boaler & Staples, 2008), and has delineated specific ways teachers enact or implement these practices successfully (e.g., McClain, 2002; Stein, Engle, Smith, & Hughes, 2008). Nationally commissioned reports (e.g., Kilpatrick, Swafford, & Findell, 2001) and standards from the National Council of Teachers of Mathematics (NCTM; 2000) and the Common Core State Standards in Mathematics (National Governors Association, 2010) advocate ambitious mathematics instruction. Mathematics curricula designed to support such instruction (see Kilpatrick, 2003) are now widely available from commercial publishers. Hence, current research, standards, and curricula can equip districts to implement ambitious instruction, and students’ opportunities to learn mathematics can be assessed by identifying a set of well-defined instructional practices through direct observations of teaching.
In this investigation, we utilize classroom observations to examine middle school mathematics instruction in four large urban school districts. The districts were participating in the Middle School Mathematics and the Institutional Setting of Teaching (MIST) project, which investigated how differences in school and district settings influence mathematics teachers’ instructional practices and students’ mathematics achievement over a 4-year period (Cobb & Smith, 2008). Each district was committed to significant educational reforms in middle school mathematics, driven by the goal of enhancing students’ learning and understanding of mathematics. Because of their intention to increase students’ scores on standardized tests by improving classroom instruction, rather than (and often antithetical to) concentrated efforts to “teach to the test” (Le, Lockwood, Stecher, Hamilton, & Martinez, 2009), we refer to these districts as “instructionally focused.” They faced challenges typical of large urban districts (e.g., large percentages of families in poverty, high rates of student and teacher turnover), but were atypical in their approach to improving mathematics teaching and learning.
We explore the following research questions using data from the first year (2007-2008) of the project:
In the next section, we describe how direct assessments of instruction are necessary for understanding students’ opportunities to learn mathematics, and we summarize previous studies that utilized classroom observations to identify ambitious mathematics instruction.
Background
The NCTM (2000) Equity Principle states, “. . . all students need access each year they are in school to a coherent, challenging mathematics curriculum that is taught by competent and well-supported mathematics teachers.” However, differences in instructional quality between school districts in the United States with different demographic and socioeconomic conditions are well documented, as “the opportunity gap in students’ access to qualified teachers between students of high and low socioeconomic status (SES) was among the largest in the world” (Akiba, LeTendre, & Scribner, 2007, p. 369): Students in low-SES categories (e.g., qualifying for free/reduced lunch) and ethnic minority groups (e.g., Black, Hispanic) are more likely than their high-SES, non-minority peers to (a) have novice teachers; (b) have uncertified or out-of-field teachers, particularly in mathematics; and (c) attend schools with high teacher instability.
Instructional quality can also vary greatly within a school, as different teachers create dramatically different learning environments for students. Disparities in students’ opportunities to learn are intensified by highly qualified and experienced teachers often selecting or being assigned to teach advanced mathematics classes, resulting in unequal rates of academic growth for students depending on which teachers and level of mathematics classes they are assigned (McCaffrey, Lockwood, Koretz, & Hamilton, 2003; Rowan, Correnti, & Miller, 2002). Combining low-quality teaching in remedial mathematics classes with disproportionate numbers of children from minority, poor, or English-learning subgroups assigned to such classes, the cycle of substandard performance is reinforced and perpetuated. Underserved populations remain underserved, and children who need the most mathematical support and the best mathematics instruction do not receive it.
The prevalence of and reliance on product-oriented accountability (i.e., student achievement scores and/or gains in scores over time) have limited the development and use of a process-oriented system focused on students’ opportunities to learn and capable of characterizing the teaching and learning that occurs in classrooms, schools, or districts. Comparisons of students’ achievement disaggregated across race, socioeconomic status, English proficiency, or other demographic factors do not provide parallel comparisons of students’ opportunities to learn within the classroom or school settings (Lipman, 2004). In other words, what children are able to achieve is not benchmarked against what they have the opportunity to achieve. By elevating the importance of students’ opportunities to learn, differences among subgroups and schools may be easier to explain and eradicate by shifting the focus to “the conditions of learning as well as the outcomes” (Gutstein et al., 2005, p. 93; emphasis in original). Milner (2010) suggested, “(a)s an explanation of disparate outcomes, opportunity is multifaceted, complicated, process-oriented, and much more nuanced than achievement” (p. 7, emphasis added). Hence, direct assessments of students’ opportunities to learn mathematics, through observations or artifacts of teaching, capable of capturing the activities in which teachers and students engage during mathematics instruction, are needed to deeply examine and improve students’ outcomes in learning mathematics (Pianta & Hamre, 2009; Stein & Matsumura, 2008).
In this study, we utilized classroom observations to identify students’ opportunities to learn mathematics in large urban districts. Consistent with Perry (2013), we consider opportunity to learn specific to mathematics teaching and learning, defined by (a) the nature of mathematics instructional tasks and (b) how tasks are implemented during instruction, including opportunities for mathematical discussions. We conceptualize opportunity gaps, specific to students’ learning of mathematics, as differences in opportunities to learn mathematics created or perpetuated by the choice of mathematics instructional tasks and nature of task implementation and discussion (i.e., differences due to the presence or absence of ambitious mathematics instruction). In the next section, we describe components of ambitious mathematics instruction and justify why this framework provides important indicators of students’ opportunities to learn mathematics.
Ambitious Mathematics Instruction
The conceptualization of students’ opportunities to learn mathematics by considering instructional tasks, task implementation, and discussion is informed by research originating with the Quantitative Understanding: Amplifying Student Achievement and Reasoning (QUASAR) project (Silver & Stein, 1996). Stein, Smith, Henningsen, and Silver (2009) defined a mathematical task as a mathematical problem or set of problems that address a related mathematical idea or context, and they distinguish between cognitively challenging (“high-level”) tasks and rote, procedural (“low-level”) tasks. Cognitively challenging tasks provide students opportunities to engage in problem solving, thinking and reasoning, and/or developing an understanding of mathematical ideas, procedures, and formulas (Stein, Grover, & Henningsen, 1996). Rote or procedural tasks engage students in reproducing or practicing facts, procedures, or computations without connection to meaning or understanding. Figure 1 provides an example of a high-level and low-level task to engage students in finding the solution to a system of linear equations, represented by the point of intersection on the graph of the equations. In Figure 1a, the task engages students in problem solving, mathematical modeling (of the parking garage costs), and decision making. The task does not suggest a solution strategy, and students could solve the task using tables, graphs, equations, or reasoning about the context. The task in Figure 1b provides students only procedural practice in graphing linear equations and identifying the point of intersection. We are not suggesting that students do not need to memorize mathematical facts or practice mathematical procedures. We assert, however, that students need greater opportunities to explore and understand mathematics by engaging in cognitively challenging mathematical work and thinking, and that this type of work provides greater access, interest, and opportunity to learn.

Figure 1. Tasks with different levels of cognitive challenge.
Task implementation refers to ways in which tasks are enacted by teachers and students during mathematics lessons (i.e., how teachers support students’ work on mathematical tasks and how students actually engage with the mathematics). In the Mathematical Task Framework, Stein and colleagues (1996) described how task challenge can change from (a) the task as it appears in print, (b) the task as set up or introduced by the teacher, and (c) the task as implemented by the teacher and students during the lesson. In ambitious mathematics instruction, teachers introduce (or “launch”) a cognitively challenging task and maintain the challenge during implementation by (a) supporting students to engage with (or “explore”) the task and (b) orchestrating whole-group discussions where students share mathematical work and thinking, justify claims, make connections between mathematical ideas, and “summarize” the mathematical goals of the lesson (McClain, 2002; NCTM, 2000; Stein et al., 2008).
Ambitious mathematics curricula containing cognitively challenging tasks, such as the Connected Mathematics Project 2 (CMP2; Lappan, Fey, Fitzgerald, Friel, & Philips, 2006) middle school curriculum used in three of the four districts in this study, have been shown to increase student performance on problem-solving assessments and minimize achievement gaps, while maintaining students’ performance on basic skills and computational assessments (Post et al., 2008; Reys, Reys, Lapan, & Holliday, 2003; Ridgeway, Zawojewski, Hoover, & Lambdin, 2003; Riordan & Noyce, 2001; Schoenfeld, 2002; Thompson & Senk, 2001). Among teachers using ambitious curricula, student achievement is highest in classrooms where students experience consistent opportunities to engage in high-level thinking and reasoning during mathematics instruction. Higher performing students in the United States (e.g., Boaler & Staples, 2008; Stein & Lane, 1996; Tarr et al., 2008) and internationally (e.g., Hiebert et al., 2003) have teachers who sustain students’ engagement in cognitively challenging work. Schoenfeld (2002) identified significantly higher achievement among students having teachers rated as “high-implementers” of ambitious elementary and middle school mathematics curricula than students having teachers rated as “low-implementers” in a large urban school district with ambitious goals for mathematics instruction. Specific aspects of ambitious instruction (e.g., setting high expectations, valuing students’ efforts, maintaining cognitive challenge, and fostering mathematical inquiry and discussion) appear to affect student achievement and minimize achievement gaps regardless of the type of curriculum in place (Boaler & Staples, 2008; Tarr et al., 2008).
Hence, studies over the past decade relating mathematics teachers’ instructional practices to student achievement invariably determine that teaching matters. Although ambitious teaching has been associated with improved test scores, often the nature and depth of students’ learning cannot be captured on current standardized achievement tests. Scholars have identified the shortcomings of standardized tests as measures of students’ mathematical learning (Kilpatrick, 2003; National Mathematics Advisory Panel, 2008) and as measures of teaching quality (Le et al., 2009; McCaffrey et al., 2003). Standardized tests designed to assess a greater depth of mathematical understanding (i.e., tests developed for the Common Core State Standards initiative), containing test items beyond memorization and procedures, will be more likely to capture students’ learning in ambitious instructional settings. Even so, test scores neither provide data about the aspects of instruction that supported or inhibited students’ learning and subsequent test performance, nor can they identify pathways for instructional improvement or disparities in students’ opportunities to learn mathematics. Efforts to minimize achievement gaps should grow from efforts to minimize opportunity gaps (Flores, 2007), and identifying differences in students’ opportunities to engage in ambitious mathematics instruction is a promising step toward this goal.
We acknowledge that several factors beyond ambitious teaching affect students’ opportunities to learn mathematics in urban schools, including (a) the percentage of non-certified mathematics teachers in schools serving African American and low-income students (Jackson & Wilson, 2012); (b) teachers’ perceptions of students’ mathematical abilities, such as deficit, color-blind, or meritocratic mind-sets (Jackson & Wilson, 2012; Martin, 2007; Milner, 2010); or (c) students’ mathematical identities, and how they see themselves (or have been positioned to see themselves) as learners and doers of mathematics (Boaler & Staples, 2008). Ambitious instructional practices can, however, provide a specific framework from which teachers can begin to hold students to higher expectations, provide mathematical work that is engaging and relevant, and develop students’ identities as capable mathematicians.
Classroom Observation Studies Identifying Ambitious Mathematics Instruction
Several studies have utilized observations of teaching to assess ambitious mathematics instruction in U.S. classrooms. The TIMSS 1999 Video Study (Hiebert et al., 2003) and the Inside the Classroom Study (Weiss, Pasley, Smith, Banilower, & Heck, 2003) examined nationally representative samples of school districts. Both studies identified a dearth of opportunities for U.S. students to engage in cognitively challenging work in mathematics classrooms. The TIMSS study observed 100 eighth-grade U.S. mathematics classrooms. Although 15% of instructional tasks could provide opportunities for conceptual understanding, less than 1% of tasks were implemented in ways that supported students’ development of mathematical concepts (Hiebert et al., 2003). Similarly, the Inside the Classroom Study rated only 15% of 364 observed lessons (in K-12 mathematics and science) as high quality based on the criteria of intellectual rigor, teacher questioning for conceptual understanding, and students’ opportunities for sense-making (Weiss et al., 2003).
Recently, the Measures of Effective Teaching (MET) Project (Kane & Staiger, 2012) conducted a large-scale study exploring the use of classroom observations, student surveys, and student achievement data to produce a robust measure of teaching effectiveness. In 2009-2010, researchers analyzed 1,000 mathematics lessons in Grades 4 to 8 from public schools across the country. According to the research report (Kane & Staiger, 2012), “scores are highest for competencies related to creating an orderly environment and lowest for those associated with the most complex aspects of instruction” (p. 8). Observed lessons frequently demonstrated content alignment and mathematical accuracy, but infrequently demonstrated ambitious instructional practices, such as student participation in reasoning or investigation, problem-based approaches, and teachers’ questioning strategies.
Other studies specifically examined districts utilizing Standards-based middle school mathematics curricula (e.g., Connected Mathematics Project [CMP]) and/or engaging teachers in professional development. The QUASAR Project (Silver & Stein, 1996) provided professional development to middle school mathematics teachers from five urban districts with economically disadvantaged student populations. Many of these teachers were utilizing pilot versions of current Standards-based curricula. Based on a representative sample of 144 observations from 1990 through 1993, with teachers observed for three 3-day cycles yearly, (a) 74% of observed lessons featured cognitively challenging tasks, (b) 31% of observed lessons provided evidence of students engaging in cognitively challenging mathematical work and thinking throughout the lesson, and (c) 50% of observations included discussions where students provided mathematical explanations and justifications (Stein et al., 1996).
The Middle School Mathematics Study observed 33 middle school mathematics teachers in 10 districts, with 2 districts classified as urban by the research team (i.e., serving a city with a population greater than 100,000 people). Researchers compared instructional practices and student achievement between teachers using Standards-based curricula (and receiving professional development specifically around using the curricula) and teachers using traditional curricula (and not receiving curriculum-specific professional development). Two observations per teacher indicated that 70% of teachers using Standards-based curricula maintained learning environments in which (a) lessons promoted conceptual understanding, (b) lessons supported the exploration of multiple perspectives and strategies, (c) students made mathematical conjectures, (d) students explained their responses or strategies, and (e) teachers used and built upon students’ contributions (Tarr et al., 2008). The study does not indicate whether results were consistent across districts or whether differences existed in the urban districts. Consistent with QUASAR results, teachers with access to Standards-based curricula and professional development enacted far greater percentages of high-quality lessons than teachers in national and large-scale samples.
In 2005-2006, the Instructional Leadership Study (Quint, Akey, Rappaport, & Willner, 2007) conducted observations in 49 elementary schools in three urban districts with content-focused professional development ranging from 1 to 5 years. Results from observations of 132 third-grade mathematics lessons identified 65.1% with low overall quality and 14.4% with moderate to high quality when considering instructional tasks, task implementation, and discussion. Results are not disaggregated by teachers’ years of professional development support or by the use of Standards-based curricula. Overall, these results appear more consistent with national samples than with QUASAR or the Middle School Mathematics Study.
Table 1 provides a summary of characteristics of the highlighted classroom observation studies and the current investigation, listed chronologically according to year(s) of classroom observations. We draw on the results of these studies to situate our findings regarding students’ opportunities to learn mathematics, as evidenced by the presence of ambitious mathematics instruction, in urban districts. Next, we describe the methodology in this investigation.
Table 1. Summary of Characteristics of Classroom Observation Studies.
Note. QUASAR = Quantitative Understanding: Amplifying Student Achievement and Reasoning; TIMSS = Trends in International Mathematics and Science Study; MET = Measures of Effective Teaching.
NA indicates that the portion of schools using Standards-based curricula or providing professional development cannot be determined from the description of the study.
Method
Data for this investigation come from the initial year of a 4-year study investigating what it takes to improve middle-grades mathematics teaching at the scale of four large urban districts. Each year (2007-2011), the MIST project collected several types of data to test and refine hypotheses and conjectures about district and school organizational arrangements, social relations, and material resources that might support mathematics teachers’ development of ambitious instructional practices at scale (Cobb & Smith, 2008). This report describes the nature of mathematics instruction during the first year of the project (2007-2008): specifically, (a) the rigor of instructional tasks and task implementation, (b) students’ opportunities to engage in mathematical discussions, (c) differences in students’ opportunities to learn mathematics between districts, and (d) comparisons to previous classroom observation studies that assessed ambitious mathematics instruction.
Sample
Table 2 provides student demographic information for the four study districts. All four districts serve a significant number of non-White students, and more than half of the students in each district receive free or reduced-price lunches. Although typical of large urban districts in the challenges they face, including serving large numbers of traditionally low-performing students in mathematics, having limited resources, under-prepared teachers, and high teacher turnover (Darling-Hammond, 2000), these districts are atypical in their instructionally focused approach to increasing student achievement in middle school mathematics. All four districts share the vision of ambitious mathematics instruction and intend to improve student achievement in middle school mathematics by supporting teachers’ development of ambitious instructional practices (as opposed to teaching to the test).
Table 2. Student Demographic Information for Districts A, B, C, and D.
Note. LEP = Limited English Proficient. To protect the anonymity of the districts, the number of students is rounded to the nearest 5,000 and percentages are rounded to the nearest 5%.
The project team purposefully selected a sample of schools from each district with the goal of choosing schools representative of the district as a whole, while selecting approximately 30 teachers per district. Given the variation in school size, the sample of schools ranged from 6 in District C to 10 in District A. Within each school, we created a randomly ordered list of mainstream mathematics teachers. We then offered study participation in that order and recruited two to five middle school mathematics teachers per school. Given the voluntary nature of the study, the first set of randomly selected teachers agreed to participate in some schools, whereas in other schools we had to ask multiple teachers to find the desired number of willing participants. Because of our sampling approach, it is possible that the sample of teachers is not perfectly representative of the teaching population in each district, though it is also likely that our sample does not differ dramatically from that population. Table 3 contains demographic information for participating teachers, by district.
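The within-school recruitment procedure described above can be illustrated with a minimal sketch. It is not the project team's actual procedure or software: the function name `recruit_teachers`, the teacher labels, and the set of willing teachers are hypothetical, and the sketch assumes only what the text states (a randomly ordered list, offers made in that order, and recruitment stopping once the desired number of teachers has agreed).

```python
import random


def recruit_teachers(roster, willing, target, seed=None):
    """Offer participation in random order until `target` teachers accept.

    roster  -- all mainstream mathematics teachers in one school (hypothetical labels)
    willing -- teachers who would agree if asked (unknown in practice; assumed here)
    target  -- desired number of participants (two to five per school in the text)
    """
    rng = random.Random(seed)
    order = list(roster)
    rng.shuffle(order)  # the randomly ordered list of teachers

    recruited = []
    for teacher in order:
        if len(recruited) == target:
            break  # desired number reached; no further offers are made
        if teacher in willing:
            recruited.append(teacher)
    return recruited


# Hypothetical school with eight teachers, four of whom would agree to participate.
roster = [f"Teacher {i}" for i in range(1, 9)]
willing = {"Teacher 2", "Teacher 3", "Teacher 5", "Teacher 8"}
participants = recruit_teachers(roster, willing, target=3, seed=1)
```

Because teachers who decline are skipped, the recruited group is a random-order sample of the willing teachers only, which is why the sample may not be perfectly representative of the teaching population.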
Table 3. Demographic Information for Participating Teachers.
Note. The number does not represent the full sample from the district. We do not have demographic information for three teachers (two in District B and one in District D) who participated in the study.
As shown in Table 3, teachers in District A are significantly more experienced than teachers in the other three districts (p < .05). Another critical difference between districts (not reflected in Table 3) is the curriculum: District C is the only district in our study that had not adopted CMP2 as its primary curriculum. Instead, the primary curriculum was procedural in nature, and teachers were expected to supplement with CMP2. Furthermore, Districts B and D were in their first year of implementation of CMP2, whereas District A had a long-standing commitment to high-quality curriculum (including CMP2) and professional development initiatives.
Measuring Ambitious Mathematics Instruction
Data on teachers’ instructional practices and students’ opportunities to learn mathematics were collected using the Instructional Quality Assessment (IQA) Lesson Observation rubrics (Boston, 2012) for Academic Rigor (AR) and Accountable Talk (AT). The constructs measured by the IQA rubrics align with the ambitious curricular and instructional reform efforts undertaken by districts in this study. Boston (2012) provided a thorough conceptual foundation for the IQA rubrics, described briefly here.
Academic Rigor (AR)
Stein and colleagues’ (1996) Mathematical Tasks Framework and Levels of Cognitive Demand served as the main conceptual framework for the AR rubrics:
Task Potential assesses the level of cognitive demand of the main instructional task (i.e., the task that occupied the most instructional time in the lesson). This dimension is rated by considering the level of thinking required to produce a complete and thorough response that satisfies the stated demands of the task.
Implementation assesses the teacher’s implementation of and students’ engagement with the instructional task. While Task Potential assesses the level of rigorous thinking that the task has the potential to elicit from students, Implementation assesses the level of rigorous thinking in which students actually engaged. The score for this dimension is holistic, reflecting the highest level of engagement of the majority of students during individual or small-group work on the task and during any discussion following students’ work on the task.
(Rigor of the) Discussion assesses the level of cognitive processes evident in the discussion following students’ work on the task (i.e., whether students show their work and/or explain their thinking about important mathematical content). This dimension provides an overall, holistic rating of the discussion on how the talk advances students’ understanding of the mathematical content. The discussion contributes to the Implementation score, and also receives a separate score for Discussion.
Each dimension of AR is rated on a scale of 0 to 4 (0 indicates the construct was absent) that represents a continuum of low to high levels of rigor and reflects discrete categories of cognitive demand. As summarized in Figure 2, the descriptors for each score level are relatively consistent across dimensions, though the referent changes from mathematical tasks (Task Potential), to the cognitive processes evident throughout the lesson (Implementation), to the cognitive processes evident in the discussion (Discussion). This rating scheme facilitates comparisons across dimensions and fosters a strong qualitative idea of what each score level “looks like” in an actual classroom situation.
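As an illustration of how the three AR dimensions relate, the following minimal sketch stores one lesson's ratings and compares Task Potential with Implementation. It is not part of the IQA toolkit: the class name `AcademicRigorScores`, the method `challenge_change`, and the example ratings are hypothetical, and only the 0 to 4 scale is taken from the text.

```python
from dataclasses import dataclass

AR_MIN, AR_MAX = 0, 4  # scale stated in the text; 0 means the construct was absent


@dataclass(frozen=True)
class AcademicRigorScores:
    """Ratings for one observed lesson on the three AR dimensions."""
    task_potential: int
    implementation: int
    discussion: int

    def __post_init__(self):
        # Reject ratings outside the 0-4 scale.
        for name in ("task_potential", "implementation", "discussion"):
            value = getattr(self, name)
            if not isinstance(value, int) or not AR_MIN <= value <= AR_MAX:
                raise ValueError(f"{name} must be an integer from 0 to 4, got {value!r}")

    def challenge_change(self):
        """Implementation minus Task Potential; a negative value means the
        level of challenge declined between the task and its enactment."""
        return self.implementation - self.task_potential


# Hypothetical lesson: a task rated 4 that was implemented at level 2.
lesson = AcademicRigorScores(task_potential=4, implementation=2, discussion=2)
print(lesson.challenge_change())  # -2
```

Because the score descriptors are relatively consistent across dimensions, a simple difference of this kind is interpretable as a change in cognitive demand rather than a change of scale.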

Figure 2. Summary of IQA score levels.
Accountable Talk (AT)
AT (Resnick & Hall, 2001) upholds the standards of the discipline of mathematics for accuracy, evidence for claims, and reasoning behind ideas and conjectures, while also responding to, developing, and advancing the knowledge, ideas, and claims of all students in the classroom (i.e., talk that is accountable to the discipline and the learning community).
In the IQA rubrics, AT is measured through linking and press:
“Linking” describes talk that is accountable to the learning community: revoicing (O’Connor & Michaels, 1996); prompting students to extend, analyze, or critique the mathematical work and thinking of others (Cazden, 2001; Cobb, Boufi, McClain, & Whitenack, 1997; McClain, 2002); and students’ connections and comparisons to the work or ideas of others. Raters consider whether the teacher makes explicit talk moves to support students in connecting ideas and positions to build coherence in the discussion (Teacher’s Linking), and whether students’ contributions explicitly link to and build on each other (Students’ Linking).
“Press” describes talk moves accountable to the discipline: teachers’ prompting students to explain their thinking, validate the accuracy of their computations, and justify their claims (Boaler & Staples, 2008). Teacher’s Press and Students’ Providing assess teachers’ press for accurate knowledge, thorough explanations of ideas, and appropriate justification for claims in classroom talk, and students’ efforts to provide accurate knowledge and evidence to support their claims, present arguments, and draw conclusions.
The AT rubrics are rated solely on the whole-group discussion following students’ work on the task(s), and not on any talk that occurs during the introduction of the task or as students work (individually or in small groups) on the task itself. The frequency and quality of talk moves for each score level are held consistent across the Linking and Press rubrics. A score of 0 indicates that no discussion occurred. Score levels 1 and 2 reflect low-quality AT: the absence of a talk move or type of student response, weak or minimal attempts to make the talk move by teachers, or weak or minimal attempts to link ideas together or justify their knowledge and claims by students. Score levels 3 and 4 represent consistent, high-quality AT moves and student responses. The consistency of high-quality scores (3 or 4) versus low-quality scores (1 or 2) within the AT rubrics and between the AT and AR rubrics enhances the interpretive value of the IQA rubric results.
Data Collection and Analysis
From January through March 2008, videographers recorded 2 days of instruction (consecutively, when possible, to account for lessons that might extend beyond 1 day) for each participating teacher. Teachers were asked to include a problem-solving activity and a related whole-group discussion in the observed lessons. To be clear, the goal of the video-recordings was not to capture the nature of teachers’ everyday practice, but rather to assess the extent to which a teacher might enact the particular kind of instruction articulated by district leaders as the goal of the instructional reform. Given the directions to include a problem-solving lesson and whole-group discussion, it is reasonable to consider what was video-recorded as teachers’ best shot at enacting ambitious instructional practices.
The video-recorded lessons were coded using the IQA Lesson Observations rubrics for AR and AT. The IQA rubrics were tested for reliability and validity by the project team (Matsumura, Garnier, Slater, & Boston, 2008) and external researchers (Quint et al., 2007). Coders were trained to use the IQA rubrics reliably. Before actual coding began, coders achieved 80% reliability on previously coded videos, chosen to represent the variety of anomalies that coders would encounter. Each coder was then randomly assigned a list of teachers. The set of two class-days for each teacher was coded chronologically, given that lessons from the first day might continue into the second day (resulting in one set of scores for the spanning lesson). Over the course of the coding period, one set of teacher scores for each coder was randomly checked for reliability once every 2 weeks to account for rater drift, which resulted in double-coding of approximately 15% of the lessons. When differences in scores occurred on the double-coded lessons, coders reached a consensus through discussion. The overall percent exact-point agreement in initial coding was 71.3%, with an average kappa score of 0.49. Consensus scores were used in all analyses.
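As an illustration of the two reliability indices reported above, the following minimal sketch computes percent exact-point agreement and an unweighted Cohen's kappa for one pair of coders. The score lists are invented for demonstration and are not data from this study, and the use of the unweighted form of kappa is an assumption; the article does not state which variant was used.

```python
from collections import Counter


def percent_exact_agreement(rater_a, rater_b):
    """Proportion of lessons on which both coders assigned the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_exact_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the two coders' marginal proportions,
    # summed over score levels.
    p_expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical rubric scores (0 to 4) from two coders on ten lessons.
coder_1 = [3, 2, 2, 4, 1, 2, 3, 3, 2, 0]
coder_2 = [3, 2, 3, 4, 1, 2, 2, 3, 2, 1]

agreement = percent_exact_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
print(f"exact agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```

Because kappa discounts the agreement expected by chance, it is lower than the raw agreement rate whenever any agreement is expected by chance, which is consistent with the gap between the 71.3% agreement and the 0.49 kappa reported here.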
We analyzed one set of IQA rubric scores per teacher. For teachers with two complete sets of rubric scores (e.g., teachers who completed two entire lessons over the 2 days of videotaping), we consistently selected for analysis the highest set of scores over the 2 days of instruction. Recall that we perceive the video-recorded lessons as teachers’ best shot at enacting ambitious instruction, because we did not record teachers frequently enough to capture typical classroom practice. Our decision for choosing the highest set of IQA rubric scores for each teacher is consistent with this perception. For example, in cases when teachers only had whole-class discussions on one of the two days of instruction, we selected the set of scores for the lesson involving a whole-class discussion. When teachers’ lessons spanned both days of videotaping, we assumed this 2-day lesson was the teacher’s best effort (especially given the extended time to enact the lesson).
Consistent with our intent to characterize the nature of instruction across four large urban school districts striving for ambitious mathematics instruction, we present descriptive statistics for each district, analyze differences in students’ opportunities to learn mathematics between districts, and compare our results with previous research. By providing information about distributions and standard deviations, we provide an indication of within-district variation. With our emphasis on district trends, however, we focus the analyses on district means and distributions, comparisons between districts, and district-level comparisons to previous research. With score levels 0 to 4 representing a scale of increasing quality and also distinct categories of performance, mean scores are provided to support interpretations of rubric results within a district, and non-parametric tests provide comparisons between districts. We describe particular tests used with corresponding results in the following section.
Results
Given our interest in the nature of middle-grades mathematics instruction at the scale of four large, urban school districts, we examined scores on the rubrics for approximately 30 teachers in each district for the 2007-2008 school year. Table 4 provides district means, standard deviations, and score frequencies for each rubric. We draw on data in Table 4 to characterize and compare students’ opportunities to learn mathematics across the four districts.
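The kinds of summaries reported in Table 4 can be sketched as follows for a single rubric in a single district. The scores are hypothetical, the function name is illustrative, and the use of the sample standard deviation is an assumption, since the article does not specify which form was computed.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_rubric(scores):
    """Mean, SD, score frequencies, and percent of lessons scoring 3 or 4."""
    counts = Counter(scores)
    # Report every score level from 0 to 4, including levels that never occur.
    frequencies = {level: counts.get(level, 0) for level in range(5)}
    high_level = frequencies[3] + frequencies[4]
    return {
        "n": len(scores),
        "mean": mean(scores),
        "sd": stdev(scores),  # sample SD (assumption)
        "frequencies": frequencies,
        "percent_high": 100 * high_level / len(scores),
    }


# Hypothetical rubric scores for eight lessons in one district.
example_scores = [3, 2, 4, 3, 2, 1, 3, 2]
summary = summarize_rubric(example_scores)
print(summary)
```

Reporting both the mean and the share of lessons at levels 3 or 4 reflects the dual reading of the scale described earlier: as a continuum of quality and as a set of distinct categories.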
Table 4. Quality of Observed Mathematics Instruction by District: Academic Rigor and Accountable Talk Rubrics, 2007-2008 School Year.
AR: Instructional Tasks, Task Implementation, and Discussion
First, we highlight findings pertaining to Research Question 1 and the mathematical rigor of instructional tasks, task implementation, and whole-group discussions.
Tasks
In Districts A, B, and D, Task Potential means were 3.14, 3.17, and 3.18, respectively. Task Potential means above 3 indicate the use of cognitively challenging tasks during the majority of observed lessons, also evident in the percent of lessons scoring a 3 or 4 for Task Potential in Districts A, B, and D (85.7%, 82.1%, 72.8%). More than half of observed lessons in District C (62.1%) featured high-level instructional tasks, though District C posted the lowest task mean (2.66) and the lowest percent of lessons (3.5%) with instructional tasks scoring a 4 for Task Potential (i.e., cognitively challenging tasks that explicitly required students to provide, explain, or illustrate mathematical thinking and reasoning), with Districts A, B, and D at 28.6%, 35.7%, and 45.5%, respectively.
Implementation
Implementation means in each district fell below a score of 3 (2.67, 2.32, 2.03, 2.58). District A exhibited high-level instruction in the majority of observed lessons, with 53.6% of lessons scoring 3 or 4 in Implementation. Districts B and D exhibited high-level Implementation in 32.1% and 45.4% (respectively) of observed lessons, whereas high-level Implementation occurred in only 3.5% of observed lessons in District C. These data indicate that, with the exception of District A, more than half of observed lessons did not engage students in high-level thinking and reasoning. Instead, as suggested by the percentage of observed lessons scoring a 2 in Implementation for all districts in the study, instruction typically focused on procedures without connections to meaning and understanding. Very few lessons received Implementation scores of 4: four lessons (14.3%) in District A, one (3.5%) in District C, and four (12.1%) in District D.
While Task Potential provides information about the potential rigor of the mathematical activity in the classroom, Implementation characterizes the actual rigor of mathematical activity during the lesson. A Wilcoxon Signed-Rank test suggests that for all four districts, the mean for Implementation is significantly lower than the mean for Task Potential, indicating a decline in cognitive challenge between students’ opportunities for and actual engagement in thinking and reasoning during the observed lessons overall (z = 7.67; p < .001) and within each district (z = 3.35, z = 4.19, z = 3.75, and z = 3.96, for Districts A, B, C, and D, respectively; p < .001).
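The paired comparison described above can be sketched with a Wilcoxon signed-rank test using the normal approximation with a tie correction. This is a minimal illustration under stated assumptions: the paired scores are invented, the function names are hypothetical, no continuity correction is applied, and the software and settings used in the study are not reported, so the output is not expected to reproduce the published z values.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def wilcoxon_signed_rank(first, second):
    """Return (z, two-sided p) for paired scores, dropping zero differences."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    # Variance of W+ with the standard correction for tied absolute differences.
    tie_sizes = Counter(abs(d) for d in diffs).values()
    var_w = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical Task Potential and Implementation scores for the same ten lessons.
task_potential = [3, 4, 3, 3, 2, 4, 3, 3, 4, 2]
implementation = [2, 3, 3, 2, 2, 2, 3, 2, 3, 1]

z_stat, p_value = wilcoxon_signed_rank(task_potential, implementation)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

In this sketch, a positive z indicates that Task Potential scores tend to exceed Implementation scores for the same lesson, matching the direction of the decline reported above.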
Rigor of the discussion
The majority of observed lessons in all districts exhibit low-quality mathematical discussions. District A was the highest among the districts, with a mean of 2.21 and 42.8% of Discussion scores at 3 or 4. Districts B, C, and D posted mean scores below 2, with low percentages of high-level discussions (14.3%, 7.0%, 27.3%). Only three observed lessons scored 4 for Discussion, two in District A (7.1%) and one in District C (3.5%). At the low end of the scale, scores of 0 or 1 (indicating no discussion or discussion characterized by one-word responses) occurred in more than half (58.6%) of lessons in District C, but less than one third of lessons in the other districts (21.4%, 32.1%, 30.4%).
AT: Students’ Opportunities to Engage in Mathematical Discourse
Given the generally low quality of whole-group discussions, it is not surprising that, with respect to Research Question 2, opportunities for rich mathematical discussions were equally sparse. Across all districts, the AT constructs Linking and Press occurred infrequently. This suggests that even when teachers conduct whole-group discussions, students are rarely given opportunities to connect to each other’s mathematical work and thinking or to offer rich mathematical explanations and justifications.
Teacher and student linking
Linking occurred rarely during the observed lessons. Means for Teacher Linking were near or below 2 in each district (2.04, 1.79, 1.48, 1.67), and means for Student Linking were near or below 1 (1.25, 1.04, 0.79, 0.97). District A again posted the highest scores, with high-level (scores of 3 or 4) Teacher Linking and Student Linking occurring in 17.9% and 7.2% of observed lessons, respectively. High-level Teacher Linking was demonstrated infrequently in Districts B, C, and D (7.1%, 3.5%, 9.0%). No lessons in Districts B, C, or D exhibited high-level Student Linking. Lessons receiving a 4 were limited to two occurrences of Teacher Linking (one [3.6%] in District A and one [3.0%] in District D) and one occurrence of Student Linking (District A, 3.6%).
Teacher press and student providing
Instances of teachers pressing for students’ reasoning and justification and of students providing valid reasons and justifications also occurred infrequently during the observed lessons. Only District A posted a mean score for Teacher Press above 2 (2.32, 1.93, 1.55, 1.73), whereas all districts’ mean scores for Student Providing fell below 2 (1.93, 1.79, 1.52, 1.58). In District A, high-quality Teacher Press occurred in 42.8% of observed lessons, approximately twice as often as other districts (21.5%, 20.7%, 18.2%). High-quality Student Providing occurred in 28.5% of observed lessons in District A, followed by 18.2% in District D, 10.3% in District C, and 7.1% in District B. Score level 4 was achieved for Teacher Press in eight lessons overall: three (10.7%) in District A, one (3.6%) in District B, and four (12.1%) in District D. Four lessons reached a 4 in Student Providing: two each in Districts A (7.1%) and B (7.2%).
Differences Between Districts
Table 4 provides district means and standard deviations. Statistically significant differences between districts (p < .05; see Figure 3), identified using Wilcoxon Rank-Sum tests (also called the Mann-Whitney two-sample statistic), occurred when (a) District A outscored District C on all rubrics except Student Providing and (b) District C fell significantly lower than Districts B and D on Task Potential, Discussion, and Implementation. No significant differences existed between Districts A and B or between B and D, and Districts A and D differed significantly only on Teacher Press.
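A between-district comparison of the kind described above can be sketched with the Mann-Whitney U statistic, which is equivalent to the Wilcoxon rank-sum test. The example below uses the normal approximation with a tie correction. The two score lists are invented, the function name is hypothetical, and the article does not report which software or correction options were used, so this is an illustration of the technique and not a reproduction of the analysis.

```python
import math
from collections import Counter


def mann_whitney(group_x, group_y):
    """Return (U, z, two-sided p) comparing two independent groups of scores."""
    n_x, n_y = len(group_x), len(group_y)
    n = n_x + n_y
    # U counts the pairs in which the group_x score exceeds the group_y score,
    # with tied pairs contributing one half.
    u = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in group_x
        for y in group_y
    )
    # Tie-corrected variance of U under the null hypothesis.
    tie_sizes = Counter(list(group_x) + list(group_y)).values()
    tie_term = sum(t**3 - t for t in tie_sizes)
    variance = n_x * n_y / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n_x * n_y / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p


# Hypothetical rubric scores for lessons in two districts.
district_x = [3, 2, 3, 4, 2, 3]
district_y = [2, 1, 2, 2, 3, 1]

u_stat, z_stat, p_value = mann_whitney(district_x, district_y)
print(f"U = {u_stat}, z = {z_stat:.2f}, p = {p_value:.4f}")
```

With only five score levels, ties are pervasive in rubric data, so the tie correction matters. The normal approximation is rough for samples as small as those in this sketch and is better suited to groups of roughly 30 lessons, as in the districts studied here.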

Figure 3. Comparisons of district mean scores in 2007-2008.
Across rubrics, no significant differences occurred between districts on Student Providing, and no significant differences were found between Districts B, C, and D on any AT rubric. Significant differences occurred mainly on AR rubrics, with District C significantly lower than other districts on all AR rubrics.
Comparisons With Previous Classroom Observation Studies
Table 1 enables comparisons between the results of this study (of four large, instructionally focused urban districts), and previous results from nationally representative or large-scale samples of districts (e.g., TIMSS, Inside the Classroom Study, MET Project) and from instructionally focused urban districts (e.g., QUASAR, Middle School Mathematics Study, Instructional Leadership Study). First, lessons in this study generally exhibited higher levels of instructional quality than nationally representative samples of districts in the TIMSS and Inside the Classroom studies, where no more than 15% of observed lessons demonstrated high-quality tasks and/or implementation. For districts in this study, percentages of lesson observations featuring cognitively challenging instructional tasks (rated 3 or 4) ranged from 62.1% to 85.7%. Only District C had fewer than 15% of Implementations considered high-level, and Districts B and C had fewer than 15% of Discussions rated highly. While the MET study did not provide exact percentages of ambitious instructional practices, researchers “rarely found highly accomplished practice . . . associated with the intent to teach students higher-order thinking skills” (Kane & Staiger, 2012, p. 10). In the current study, though we rarely identified high-quality discussions, notable percentages (32.1%-53.6%) of lesson implementation in three districts engaged students in higher-order thinking skills.
Several aspects of observed instruction in this study were consistent with results from studies of instructionally focused urban middle schools with professional development and/or Standards-based mathematics curricula (e.g., QUASAR, Middle School Mathematics Study). The Middle School Mathematics Study identified 70% of lessons with overall high quality. QUASAR researchers identified 74% of lessons with high-level tasks, 31% of lessons with high-level implementation, and 50% of lessons with high-level discussions. All districts in our study posted comparable percentages (62.1%-85.7%) of high-level tasks in the observed lessons. Districts A, B, and D exceeded QUASAR in percentage of high-level implementations (53.6%, 32.1%, 45.4%). Percentages of high-level discussions in District A (42.8%) approached the percentage observed in QUASAR (50%), with other districts falling below 25%. District C, dissimilar to other districts in this study in the lack of a Standards-based curriculum, differed considerably from QUASAR and the Middle School Mathematics Study in high-level implementation (3.5%) and discussions (7%).
The Instructional Leadership Study (Quint et al., 2007) utilized the same rubrics as this investigation. Means on the AR rubrics across all observations were 2.26 for Task Potential, 2.10 for Implementation, and 1.76 for Discussion. In this investigation, observed lessons demonstrate the same pattern in mean scores as the Instructional Leadership Study, with the highest mean occurring for Task Potential and consistently lower means in Implementation and Discussion. Districts A, B, and D posted higher means than districts in the Instructional Leadership Study on all AR rubrics (except Discussion in District B). Both studies identified similarly low means on the AT rubrics.
Although all districts in the current investigation were aiming for ambitious mathematics instruction, the observed lessons indicate significant variation in teachers’ enactment of such instruction. In the discussion that follows, we describe what the results indicate about students’ opportunities to learn mathematics in urban middle school classrooms.
Discussion
Ambitious mathematics instruction provides opportunities for students to learn mathematics with understanding, and has been shown to decrease achievement gaps (Boaler & Staples, 2008; Schoenfeld, 2002). Hence, identifying components of ambitious mathematics instruction can provide a means for identifying differences, or opportunity gaps, in how mathematics is taught and learned in different districts, schools, and classrooms. Only by attending to these opportunity gaps can we begin to eradicate achievement gaps (Flores, 2007; Lipman, 2004).
In this discussion, we use our results to characterize middle school mathematics instruction in four instructionally focused urban districts. Consistent with the larger goals of the project, we hypothesize how differences in instructional practices and students’ opportunities to learn mathematics connect to aspects of the institutional setting (namely, availability of Standards-based mathematics curricula and professional development opportunities), and we suggest pathways for improvement. We also situate our findings within prior research to assess the progress of ambitious mathematics instruction over time.
Characterizing Mathematics Instruction in Instructionally Focused Urban Districts
Potential of the task
The majority of observed lessons featured cognitively challenging instructional tasks, and with the exception of District C (3.5%), notable percentages of tasks in Districts A (28.6%), B (35.7%), and D (45.5%) explicitly required students to provide, explain, or illustrate their mathematical thinking and reasoning (i.e., Task Potential score of 4). By posing instructional tasks with high cognitive demands in the majority of observed lessons, teachers in each district provided students opportunities for mathematical learning and understanding. In the QUASAR study, the consistent presence of high-level instructional tasks, regardless of the level of implementation, resulted in moderate gains in student achievement (Stein & Lane, 1996).
Across several studies, including TIMSS (Hiebert et al., 2003), QUASAR (Stein & Lane, 1996), and studies using the IQA rubrics (Boston, 2012; Boston & Smith, 2009; Quint et al., 2007), tasks with low cognitive demands are rarely implemented in ways that result in high-level thinking and reasoning. In the majority of observed lessons, Task Potential sets the ceiling for Implementation, and in fact, for all discussion-based rubrics as well. Tasks with low cognitive demands simply do not provide fodder for teachers to engage students in thinking, reasoning, or mathematical discourse throughout the enactment of the lesson. If opportunities for high-level thinking and reasoning are not embedded in instructional tasks, these opportunities rarely materialize during mathematics lessons. This finding, robust in its consistency across several studies, suggests that Standards-based curricula and/or high-level instructional tasks are a necessary condition for ambitious mathematics instruction. Cognitively challenging tasks can support positive mathematical identities by positioning students as learners and doers of mathematics, setting high expectations, providing multiple access points, and encouraging multiple solution strategies—features of the instructional setting noted as particularly important for the success of African American and low-income students in urban schools (Boaler & Staples, 2008). As part of a framework for considering students’ opportunity to learn mathematics, using cognitively challenging instructional tasks can provide a concrete first step for teachers to elicit and recognize students’ mathematical abilities and perhaps begin to move beyond deficit, innate-ability, or meritocratic mind-sets.
Task implementation
In all districts, Implementation means were significantly lower than Task Potential means, indicating that students’ actual engagement in thinking and reasoning during the observed lessons did not reflect the opportunities for high-level cognitive processes embedded in instructional tasks. This decline suggests that students are not fully benefitting from opportunities for mathematical learning in Standards-based curricula or high-level tasks. Empirical research from more than a decade indicates that the highest learning gains occur in classrooms where students consistently engage in high-level thinking and reasoning (e.g., Schoenfeld, 2002; Stein & Lane, 1996; Tarr et al., 2008). Hence, opportunity gaps and achievement gaps could be affected to a greater extent if high-level cognitive demands were consistently maintained during implementation. High-level Implementation involves holding students accountable for the mathematical work and thinking in the task and providing students the right amount of support to maintain students’ engagement (without taking over the mathematical work and thinking; Henningsen & Stein, 1997). Hence, teachers’ instructional moves to maintain high-level demands during implementation can promote positive mathematical identities, establish trust, and communicate high expectations. These classroom practices are identified as particularly important for fostering the success of African American and low-income students in urban schools (Boaler & Staples, 2008; Milner, 2010).
Significant differences in Implementation also existed between districts. At the extremes, more than half of observed lessons in District A (15/28; 53.6%) received high-level Implementation scores compared with only one lesson in District C (1/29; 3.5%), providing students in each district with distinctly different opportunities to learn mathematics. What aspects of the institutional setting may have affected differences in implementation between districts? Research has identified many challenges in maintaining high-level demands in mathematics classrooms where students and teachers are accustomed to rote procedures and memorization (rather than exploration, thinking, and reasoning; Henningsen & Stein, 1997). More experienced teachers, long-term use of CMP, and professional development initiatives may have affected implementation in District A. Studies of Standards-based curricula identify improvements in teachers’ implementation and in student achievement over time, with significant increases typically occurring in and beyond the second year of use (Bray, 2005; Post et al., 2008; Reys et al., 2003), and generally associate “a longer implementation in the school . . . with a greater score advantage for students” (Riordan & Noyce, 2001, p. 383). These findings suggest the importance of persevering with Standards-based curricula and maintaining an instructional focus at the administrative level, as districts frequently discard or replace initiatives that do not yield immediate results.
Research also indicates the necessity and value of professional development initiatives in implementing Standards-based mathematics curricula (Senk & Thompson, 2003). Teachers may have neither experienced ambitious instruction as learners of mathematics, nor explored or practiced this type of pedagogy during preservice teacher training or field-based experiences (Franke et al., 2007). Professional development specifically aimed at enhancing teachers’ ability to enact ambitious instruction and maintain the demands of cognitively challenging instructional tasks has proven effective and enduring (Boston & Smith, 2009; 2011), and is readily accessible in professional development materials (e.g., Smith, Silver, & Stein, 2005; Stein et al., 2009).
Mathematical discussions
The majority of discussions in all districts consisted of students demonstrating procedures or providing brief responses to teachers’ questions (i.e., Discussion scores of 1 or 2), with few occurrences of the AT constructs of Linking and Press and extremes occurring again between Districts A and C. Notably, almost half (12/28; 42.8%) of discussions in District A were characterized by explanations of students’ thinking and reasoning (i.e., Discussion scores of 3 or 4) and high-level Teacher Press. Similar elements of the school setting hypothesized for differences in districts’ performance on the Implementation rubric can be posited for differences on the rubrics assessing classroom discourse. Teachers in District A had more experience and training in implementing CMP2 and the components of ambitious instruction it entails, including orchestrating whole-group discussions. Hence, similar arguments can also be made for the value of ongoing professional development opportunities, even within District A, to support teachers to engage students in mathematical discourse.
Comparisons With Prior Research
As shown in Table 1, studies utilizing national samples exhibit a low occurrence of ambitious mathematics instruction, even with the passing of a decade between TIMSS in 1999 and MET in 2009-2010. In contrast, when comparing districts with ambitious mathematics curricula and professional development initiatives over a similar span of time, QUASAR (1990-1993) and the Middle School Mathematics Study (2003-2004) identified far greater percentages of lessons exhibiting ambitious instructional practices. This finding was not replicated in the Instructional Leadership Study, though the longevity of professional development in each district was unclear. In our study, similar distinctions in ambitious mathematics instruction appear between District A, Districts B and D, and District C. Similar results across District A, QUASAR, and the Middle School Mathematics Study suggest that Standards-based curriculum and professional development opportunities are necessary conditions for enacting ambitious mathematics instruction.
Given our sample of four large urban districts with ambitious goals for mathematics instruction, atypical of many urban districts in their instructional focus, it is reasonable to assume that the instructional patterns in our results are as rigorous as or more rigorous than what might be found in other large urban districts across the United States. This suggests that the majority of students in urban districts have few opportunities to engage in high-level thinking and reasoning in mathematics, and indicates the need for additional work in providing richer opportunities to learn mathematics for students in urban schools.
Conclusion: Implications for Minimizing Opportunity Gaps
In this investigation, we assessed middle school mathematics instruction in four large urban districts participating in a long-term project seeking to identify how school and district settings affect mathematics teachers’ instructional practices and students’ learning. We proposed, consistent with recent work by Perry (2013), that cognitively challenging tasks and high-level task implementation provide a useful framework for considering students’ opportunities to learn mathematics.
Several hypotheses follow from our work. First, Standards-based curricula and/or cognitively challenging instructional tasks appear to be necessary conditions for supporting higher levels of AR. Task levels set the ceiling for the level of implementation and for all discussion-based rubrics. Hence, districts and classrooms lacking high-level instructional tasks in mathematics offer students far different opportunities to learn mathematics than classrooms and districts utilizing such tasks. Second, teachers need support to (a) maintain students’ opportunities for thinking, reasoning, and problem solving throughout lesson implementation, and (b) orchestrate high-quality whole-class discussions that include AT moves. Results from District A and results of other studies of instructionally focused urban middle schools suggest that professional development experiences may equip teachers to achieve high levels of implementation and discussion. Third, although transience of reform initiatives and of teachers frequently plagues urban districts, longevity of use of Standards-based curricula and teachers’ experience with such curricula appear to support the successful enactment of ambitious mathematics instruction. Hence, urban districts need to remain instructionally focused, even in the absence of immediate gains in achievement test scores.
More broadly, our work indicates how direct assessments of instructional quality, based on observations and artifacts of teaching, might equip urban districts to (a) monitor reform efforts, including curricular implementation or professional development; (b) identify differences in students’ opportunities to learn mathematics; and (c) suggest pathways for providing rich mathematical learning experiences capable of reducing opportunity gaps and achievement gaps.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The empirical work reported in this paper has been supported by the National Science Foundation under grants DRL-0830029 and ESI-0554535. Anne Wilhelm’s contributions to the article were supported by the Institute of Education Sciences (IES) pre-doctoral research training program, grant number R305B080025. The opinions expressed do not necessarily reflect the views of either Foundation or IES. We would like to thank Paul Cobb, Thomas Smith, and Glenn Colby for their contributions to this work.
