Abstract
Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on report cards. This review of over 100 years of research on grading considers five types of studies: (a) early studies of the reliability of grades, (b) quantitative studies of the composition of K–12 report card grades, (c) survey and interview studies of teachers’ perceptions of grades, (d) studies of standards-based grading, and (e) grading in higher education. Early 20th-century studies generally condemned teachers’ grades as unreliable. More recent studies of the relationships of grades to tested achievement and survey studies of teachers’ grading practices and beliefs suggest that grades assess a multidimensional construct containing both cognitive and noncognitive factors reflecting what teachers value in student work. Implications for future research and for grading practices are discussed.
Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on student report cards. Grades or marks, as they were referred to in the first half of the 20th century, were the focus of some of the earliest educational research. Grading research history parallels the history of educational research more generally, with studies becoming both more rigorous and sophisticated over time. Grading is important to study because of the centrality of grades in the educational experience of all students. Grades are widely perceived to be what students “earn” for their achievement (Brookhart, 1993, p. 139), and have pervasive influence on students and schooling (Pattison, Grodsky, & Muller, 2013). Furthermore, grades predict important future educational consequences, such as dropping out of school (Bowers, 2010a; Bowers & Sprott, 2012; Bowers, Sprott, & Taff, 2013), applying and being admitted to college, and college success (Atkinson & Geiser, 2009; Bowers, 2010a; Thorsen & Cliffordson, 2012). Grades are especially predictive of academic success in more open admissions higher education institutions (Sawyer, 2013).
Purpose of This Review, and Research Question
This review synthesizes findings from five types of grading studies: (a) early studies of the reliability of grades on student work, (b) quantitative studies of the composition of K–12 report card grades and related educational outcomes, (c) survey and interview studies of teachers’ perceptions of grades and grading practices, (d) studies of standards-based grading (SBG) and the relationship between students’ report card grades and large-scale accountability assessments, and (e) grading in higher education. The central question underlying all of these studies is, “What do grades mean?” In essence, this is a validity question (Kane, 2006; Messick, 1989). It concerns whether evidence supports the intended meaning and use of grades as an educational measure. To date, several reviews have given partial answers to that question, but none of these reviews synthesize 100 years of research from five types of studies. The purpose of this review is to provide a more comprehensive and complete answer to the research question, “What do grades mean?”
Background
The earliest research on grading concerned mostly the reliability of grades teachers assigned to students’ work. The earliest investigation of which the authors are aware was published in the Journal of the Royal Statistical Society. Edgeworth (1888) applied the “theory of errors” (p. 600) based on normal curve theory to the case of grading examinations. He described three different sources of error: (a) chance; (b) personal differences among graders regarding the whole exam (severity or leniency and speed) and individual items on the exam, now referred to as task variation; and (c) “taking his [the examinee’s] answers as representative of his proficiency” (p. 614), now referred to as generalizing to the domain. In parsing these sources of error, Edgeworth went beyond simple chance variation in grades to treat grades as subject to multiple sources of variation or error. This nuanced view, which was quite advanced for its time, remains useful today. Edgeworth pointed out the educational consequences of unreliability in grading, especially in awarding diplomas, honors and other qualifications to students. He used this point to build an argument for improving reliability. Today, the existence of unintended adverse consequences is also an argument for improving validity (Messick, 1989).
During the 19th century, student progress reports were presented to parents orally by the teacher during a visit to a student’s home, with little standardization of content. Oral reports were eventually abandoned in favor of written narrative descriptions of how students were performing in certain skills like penmanship, reading, or arithmetic (Guskey & Bailey, 2001). In the 20th century, high school student populations became so diverse and subject area instruction so specific that high schools sought a way to manage the increasing demands and complexity of evaluating student progress (Guskey & Bailey, 2001). Although elementary schools maintained narrative descriptions, high schools increasingly favored percentage grades because the completion of narrative descriptions was viewed as time-consuming and lacking cost-effectiveness (Farr, 2000). One could argue that this move to percentage grades eliminated the specific communication of what students knew and could do.
Reviews by Crooks (1933), A. Z. Smith and Dobbin (1960), and Kirschenbaum, Napier, and Simon (1971) debated whether grading should be norm-referenced or criterion-referenced (i.e., based on clearly defined standards for student learning). Although high schools tended to stay with norm-referenced grades to accommodate the need for ranking students for college admissions, some elementary school educators transitioned to what was eventually called mastery learning and then standards-based education. Based on studies of grading reliability (F. J. Kelly, 1914; Rugg, 1918), in the 1920s, teachers began to adopt grading systems with fewer and broader categories (e.g., the A–F scale). Still, variation in grading practices persisted. Hill (1935) found variability in the frequency of grade reports, ranging from 2 to 12 times per year, and a wide array of grade reporting practices. Of 443 schools studied, 8% employed descriptive grading, 9% percentage grading, 31% percentage-equivalent categorical grading, 54% categorical grading that was not percentage-equivalent, and 2% “gave a general rating on some basis such as ‘degree to which the pupil is working to capacity’” (Hill, 1935, p. 119). By the 1940s, more than 80% of U.S. schools had adopted the A–F grading scale. A–F has remained the most commonly used scale to the present day. Current grading reforms move in the direction of SBG, a relatively new and increasingly common practice (Grindberg, 2014) in which grades are based on standards for achievement. In SBG, work habits and other nonachievement factors are reported separately from achievement (Guskey & Bailey, 2010).
Method
Literature searches for each of the five types of studies were conducted by different groups of coauthors, using the same general strategy: (a) a keyword search of electronic databases, (b) review of abstracts against criteria for the type of study, (c) a full read of studies that met criteria, and (d) a snowball search using the references from qualified studies. All searches were limited to articles published in English. To identify studies of grading reliability, electronic searches using the terms “teachers’ marks (or marking)” and “teachers’ grades (or grading)” were conducted in the following databases: ERIC, the Journal of Educational Measurement, Educational Measurement: Issues and Practice, ProQuest’s Periodicals Index Online, and the Journal of Educational Research. The criterion for inclusion was that the research addressed individual pieces of student work (usually examinations), not composite report card grades. Sixteen empirical studies were found (Table 1).
Table 1. Early studies of the reliability of grades
To identify studies of grades and related educational outcomes, search terms included “(grades OR marks) AND (model* OR relationship OR correlation OR association OR factor).” Databases searched included JSTOR, ERIC, and Educational Full Text Wilson Web. Criteria for inclusion were that the study (a) examined the relationship of K–12 grades to schooling outcomes, (b) used quantitative methods, and (c) examined data from actual student assessments rather than teacher perspectives on grading. Forty-one empirical studies were identified (Tables 2, 3, and 4).
Table 2. Studies of the relation of K–12 report card grades and tested achievement
Table 3. Studies of K–12 report card grades as multidimensional measures of academic knowledge, engagement, and persistence
Table 4. Studies of grades as predictors of educational outcomes
For studies of K–12 teachers’ perspectives about grading and grading practices, the search terms used were “grade(s),” “grading,” and “marking” with “teacher perceptions,” “teacher practices,” and “teacher attitudes.” Databases searched included ERIC, Education Research Complete, Dissertation Abstracts, and Google Scholar. Criteria for inclusion were that the study topic was K–12 teachers’ perceptions of grading and grading practices and that the study was published since 1994 (the date of Brookhart’s previous review). Thirty-five empirical studies were found (31 are presented in Table 5, and four that investigated SBG are in Table 6).
Table 5. Studies of teachers’ grading practices and perceptions
Table 6. Studies of standards-based grading
The search for studies of SBG used the search terms “standards” and (“grades” or “reports”) and “education.” Databases searched included PsycINFO, PsycARTICLES, ERIC, and Education Source. The criterion for inclusion was that articles needed to address SBG. Eight empirical studies were identified (Table 6).
For studies of grading in higher education, search terms included “grades” or “grading,” combined with “university,” “college,” and “higher education” in the title. Databases searched included EBSCO Education Research Complete, ERIC, and ProQuest (Education Journals). The inclusion criterion was that the study investigated grading practices in higher education. University websites in 12 different countries were also consulted to allow for international comparisons. Fourteen empirical studies were found (Table 7).
Table 7. Studies of grading in higher education
Results
Grading Reliability
Table 1 displays the results of studies on the reliability of teachers’ grades. The main finding was that great variation exists in the grades teachers assign to students’ work (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Hulten, 1925; F. J. Kelly, 1914; Lauterbach, 1928; Rugg, 1918; Silberstein, 1922; Sims, 1933; Starch, 1913, 1915; Starch & Elliott, 1912, 1913a, 1913b). Three studies (Bolton, 1927; Jacoby, 1910; Shriner, 1930) argued against this conclusion, however, contending that teacher variability in grading was not as great as commonly suggested.
As the work of Edgeworth (1888) previewed, these studies identified several sources of the variability in grading. Starch (1913), for example, determined that three major factors produced an average probable error of 5.4 on a 100-point scale across instructors and schools. Specifically, “differences due to the pure inability to distinguish between closely allied degrees of merit” (p. 630) contributed 2.2 points, “differences in the relative values placed by different teachers upon various elements in a paper, including content and form” (p. 630) contributed 2.1 points, and “differences among the standards of different teachers” (p. 630) contributed 1.0 point. Although investigated, “differences among the standards of different schools” (p. 630) contributed practically nothing toward the total (p. 632).
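To make the scale of these figures concrete, the following is a minimal sketch, not part of Starch's analysis. It assumes normally distributed grading errors, under which a probable error (the half-width of the interval containing the central 50% of errors) equals about 0.6745 standard deviations. The helper name `probable_error_to_sd` is hypothetical, and the simple sum of components mirrors how the contributions are presented above.

```python
from statistics import NormalDist

# Assumption: grading errors are normally distributed. Under normality the
# probable error (PE) is the 75th-percentile z value times sigma (~0.6745).
PE_TO_SIGMA = NormalDist().inv_cdf(0.75)


def probable_error_to_sd(pe: float) -> float:
    """Convert a probable error to the equivalent standard deviation."""
    return pe / PE_TO_SIGMA


# Component contributions (points on a 100-point scale) as quoted from
# Starch (1913) in the paragraph above.
starch_components = {
    "inability to distinguish closely allied degrees of merit": 2.2,
    "relative values placed on elements of a paper": 2.1,
    "standards of different teachers": 1.0,
}

component_sum = sum(starch_components.values())
total_pe = 5.4  # average probable error reported in the text
sd_equivalent = probable_error_to_sd(total_pe)

print(f"Sum of listed components: {component_sum:.1f} points")
print(f"Reported total probable error: {total_pe:.1f} points")
print(f"Approximate SD under normality: {sd_equivalent:.1f} points")
```

Under these assumptions, the three listed components account for nearly all of the reported total, and a probable error of 5.4 corresponds to a standard deviation of roughly 8 points on the 100-point scale.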
Other studies listed in Table 1 identify these and other sources of grading variability. Differences in grading criteria, or lack of criteria, were found to be a prominent source of variability in grades (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Silberstein, 1922), akin to Starch’s (1913) difference in the relative values teachers place on various elements in a paper. Teacher severity or leniency was found to be another source of variability in grades (Shriner, 1930; Silberstein, 1922; Sims, 1933), similar to Starch’s differences in teachers’ standards. Differences in student work quality were associated with variability in grades, but the findings were inconsistent. Bolton (1927), for example, found greater grading variability for poorer papers. Similarly, Jacoby (1910) interpreted his high agreement as a result of the high quality of the papers in his sample. Eells (1930), however, found greater grading consistency in the poorer papers. Lauterbach (1928) found more grading variability for typewritten compositions than for handwritten versions of the same work. Finally, between-teacher error was a central factor in all of the studies in Table 1. Studies by Eells (1930) and Hulten (1925) demonstrated within-teacher error, as well.
Given a probable error of around 5 on a 100-point scale, Starch (1913) recommended the use of a 9-point scale (i.e., A+, A−, B+, B−, C+, C−, D+, D−, and F) and later tested the improvement in reliability gained by moving to a 5-point scale based on the normal distribution (Starch, 1915). His and other studies contributed to the movement in the early 20th century away from a 100-point scale. The ABCDF letter grade scale became more common and remains the most prevalent grading scale in schools in the United States today.
Grades and Related Educational Outcomes
Quantitative studies of grades and related educational outcomes moved the focus of research on grades from questions of reliability to questions of validity. Three types of studies investigated the meaning of grades in this way. The oldest line of research (Table 2) looked at the relationship between grades and scores on standardized tests of intelligence or achievement. Today, those studies would be seen as seeking concurrent evidence for validity under the assumption that graded achievement should be the same as tested achievement (Brookhart, 2015). As the 20th century progressed, researchers added noncognitive variables to these studies, describing grades as multidimensional measures of academic knowledge, engagement, and persistence (Table 3). A third group of more recent studies looked at the relationship between grades and other educational outcomes, for example, dropping out of school or future success in school (Table 4). These studies offer predictive evidence for validity under the assumption that grades measure school success.
Correlation of Grades and Other Assessments
Table 2 describes studies that investigated the relationship between grades (usually grade point average [GPA]) and standardized test scores in an effort to understand the composition of the grades and marks that teachers assign to K–12 students. Despite the enduring perception that the correlation between grades and standardized test scores is strong (Allen, 2005; Duckworth, Quinn, & Tsukayama, 2012; Stanley & Baines, 2004), this correlation is and always has been relatively modest, in the .5 range. As Willingham, Pollack, and Lewis (2002) noted, “Understanding these characteristics of grades is important for the valid use of test scores as well as grade averages because, in practice, the two measures are often intimately connected . . . [there is a] tendency to assume that a grade average and a test score are, in some sense, mutual surrogates; that is, measuring much the same thing, even in the face of obvious differences” (p. 2).
Research on the relationship between grades and standardized assessment results is marked by two major eras: early 20th-century studies and late 20th into 21st century studies. Unzicker (1925) found that average grades across subjects correlated .47 with intelligence test scores. C. C. Ross and Hooks (1930) reviewed 20 studies conducted from 1920 through 1929 on report card grades and intelligence test scores in elementary school as predictors of junior high and high school grades. Results showed that the correlations between grades in seventh grade and intelligence test scores ranged from .38 to .44. C. C. Ross and Hooks concluded, “Data from this and other studies indicate that the grade school record affords a more reliable or consistent basis of prediction than any other available, the correlations in three widely-scattered school systems showing remarkable stability; and that without question the grade school record of the pupil is the most usable or practical of all bases for prediction, being available wherever cumulative records are kept, without cost and with a minimum expenditure of time and effort” (p. 195).
Subsequent studies moved from correlating grades and intelligence test scores to correlating grades with standardized achievement results (Carter, 1952, r = .52; Moore, 1939, r = .61). McCandless, Roberts, and Starnes (1972) found a smaller correlation (r = .31) after accounting for socioeconomic status, ethnicity, and gender. Although the sample selection procedures and methods used in these early investigations are problematic by current standards, they represent a clear desire on the part of researchers to understand what teacher-assigned grades represent in comparison to other known standardized assessments. In other words, their focus was criterion validity (C. C. Ross & Hooks, 1930).
Investigations from the late 20th century and into the 21st century replicated earlier studies but included larger, more representative samples and used more current standardized tests and methods (Brennan, Kim, Wenz-Gross, & Siperstein, 2001; Woodruff & Ziomek, 2004). Brennan et al. (2001), for example, compared reading scores from the Massachusetts Comprehensive Assessment System state test to grades in mathematics, English, and science and found correlations ranging from .54 to .59. Similarly, using GPA and 2003 TerraNova Second Edition/California Achievement Tests, Duckworth and Seligman (2006) found a correlation of .66. Subsequently, Duckworth et al. (2012) compared standardized reading and mathematics test scores with GPA and found correlations between .62 and .66.
Woodruff and Ziomek (2004) compared GPA and ACT composite scores for all high school students who took the ACT college entrance exam between 1991 and 2003. They found moderate but consistent correlations ranging from .56 to .58 over the years for average GPA and composite ACT scores, from .54 to .57 for mathematics grades and ACT scores, and from .45 to .50 in English. Student GPAs were self-reported, however. Pattison et al. (2013) examined four decades of achievement data on tens of thousands of students using national databases to compare high school GPA to reading and mathematics standardized tests. The authors found GPA correlations consistent with past research, ranging from .52 to .64 in mathematics and from .46 to .54 in reading comprehension.
Although some variability exists across years and subjects, correlations have remained moderate but remarkably consistent in studies based on large, nationally representative data sets. Across 100 years of research, teacher-assigned grades typically correlate about .5 with standardized measures of achievement. In other words, because the squared correlation gives the proportion of shared variance, about 25% of the variation in grades teachers assign is attributable to a trait measured by standardized tests (Bowers, 2011). The remaining 75% is attributable to something else. As Swineford (1947) noted in a study on grading in middle and high school, “The data clearly show that marks assigned by teachers in this school are reliable measures of something [italics added] but there is apparently a lack of agreement on just what that something should be” (p. 517). A correlation of .5 is neither very weak, which counters arguments that grades are completely subjective measures of academic knowledge, nor very strong, which refutes arguments that grades are a strong measure of fundamental academic knowledge. Moreover, these correlations have remained consistent despite large shifts in the educational system, especially in relation to accountability and standardized testing (Bowers, 2011; Linn, 1982).
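The step from a correlation of about .5 to 25% of variance can be checked with a short sketch. This is an illustration only, assuming the usual interpretation of a squared Pearson correlation as the proportion of shared variance; the function name `shared_variance` is hypothetical, and the correlations are values quoted earlier in this section.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their Pearson r."""
    return r ** 2


# Correlations between grades and test scores quoted in this section.
reported_correlations = {
    "Typical value across the literature": 0.50,
    "Unzicker (1925)": 0.47,
    "Carter (1952)": 0.52,
    "Moore (1939)": 0.61,
    "Duckworth and Seligman (2006)": 0.66,
}

for study, r in reported_correlations.items():
    shared = shared_variance(r)
    print(f"{study}: r = {r:.2f}, shared = {shared:.0%}, "
          f"not shared = {1 - shared:.0%}")
```

Even the largest correlation listed here (.66) leaves more than half of the variance in grades unaccounted for by the test.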
Grades as Multidimensional Measures of Academic Knowledge, Engagement, and Persistence
Investigations of the composition of K–12 report card grades consistently find them to be multidimensional, comprising at minimum academic knowledge, substantive engagement, and persistence. Table 3 presents studies of grades and other measures, including many noncognitive variables. In the earliest study of this type, Sobel (1936) found that students with high grades and low test scores had outstanding penmanship, attendance, punctuality, and effort marks, and their teachers rated them high in industry, perseverance, dependability, cooperation, and ambition. Similarly, Miner (1967) factor-analyzed longitudinal data for a sample of students, including their grades in 1st, 3rd, 6th, 9th, and 12th grades; achievement tests in 5th, 6th, and 9th grades; and citizenship grades in 1st, 3rd, and 6th grades. She identified a three-factor solution: (a) objective achievement as measured through standardized assessments, (b) early classroom citizenship (a behavior factor), and (c) high school achievement as measured through grades, demonstrating that behavior and two types of achievement could be identified as separate factors.
Farkas, Grobe, Sheehan, and Shuan (1990) showed that student work habits were the strongest noncognitive predictors of grades. They noted, “Most striking is the powerful effect of student work habits upon course grades . . . teacher judgments of student non-cognitive characteristics are powerful determinants of course grades, even when student cognitive performance is controlled” (p. 140). Likewise, Willingham et al. (2002), using large national databases, found a moderate relationship between grades and tests as well as strong positive relationships between grades and student motivation, engagement, completion of work assigned, and persistence. Relying on a theory of a conative factor of schooling—focusing on student interest, volition, and self-regulation (Snow, 1989)—the authors suggested that grades provide a useful assessment of both conative and cognitive student factors (Willingham et al., 2002).
S. Kelly (2008) countered a criticism of the conative factor theory of grades, namely that teachers may award grades based on students appearing engaged and going through the motions (i.e., a procedural form of engagement) as opposed to more substantive engagement involving legitimate effort and participation that leads to increased learning. He found positive and significant effects of students’ substantive engagement on subsequent grades but no relationship with procedural engagement, noting, “This finding suggests that most teachers successfully use grades to reward achievement-oriented behavior and promote a widespread growth in achievement” (p. 45). S. Kelly also argued that misperceptions that teachers do not distinguish between apparent and substantive engagement lend mistaken support to the use of high-stakes tests as inherently more “objective” (p. 46) than teacher assessments.
Recent studies have expanded on this work, applying sophisticated methodologies. Bowers (2009, 2011) used multidimensional scaling to examine the relationship between grades and standardized test scores in each semester in high school in both core subjects (mathematics, English, science, and social studies) and noncore subjects (foreign/non-English languages, art, and physical education). Bowers (2011) found evidence for a three-factor structure: (a) a cognitive factor that describes the relationship between tests and core subject grades, (b) a conative and engagement factor between core subject grades and noncore subject grades (termed a “Success at School Factor, SSF,” p. 154), and (c) a factor that described the difference between grades in art and physical education. He also showed that teachers’ assessment of students’ ability to negotiate the social processes of schooling represents much of the variance in grades that is unrelated to test scores. These results point to the importance of substantive engagement and persistence (S. Kelly, 2008; Willingham et al., 2002) as factors that help students in both core and noncore subjects. Subsequently, Duckworth et al. (2012) used structural equation modeling for 510 New York City fifth through eighth graders to show that engagement and persistence are mediated through teacher evaluations of student conduct and homework completion.
Casillas et al. (2012) examined the interrelationship among grades, standardized assessment scores, and a range of psychosocial characteristics and behavior. Twenty-five percent of the explained variance in GPAs was attributable to the standardized assessments; the rest was predicted by a combination of prior grades (30%), psychosocial factors (23%), behavioral indicators (10%), demographics (9%), and school factors (3%). Academic discipline and commitment to school (i.e., the degree to which the student is hard working, conscientious, and effortful) had the strongest relationship to GPA.
A set of recent studies focused on the Swedish national context (Cliffordson, 2008; Klapp Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, 2009; Thorsen, 2014; Thorsen & Cliffordson, 2012), which is interesting because report cards are uniform throughout the country and require teachers to grade students using the same performance level scoring system used by the national exam. Klapp Lekholm and Cliffordson (2008) showed that grades consisted of two major factors: a cognitive achievement factor and a noncognitive “common grade dimension” (p. 188). In a follow-up study, Klapp Lekholm and Cliffordson (2009) reanalyzed the same data, examining the relationships between multiple student and school characteristics and both the cognitive and noncognitive achievement factors. For the cognitive achievement factor of grades, student self-perception of competence, self-efficacy, coping strategies, and subject-specific interest were most important. In contrast, the most important student variables for the noncognitive factor were motivation and a general interest in school. These structural equation modeling results were replicated across three full population-level cohorts in Sweden representing all 99,085 9th grade students in 2003, 105,697 students in 2004, and 108,753 in 2005 (Thorsen & Cliffordson, 2012), as well as in comparison to both norm-referenced and criterion-referenced grading systems, examining 3,855 students in Sweden (Thorsen, 2014). Klapp Lekholm and Cliffordson (2009) wrote, “The relation between general interest or motivation and the common grade dimension seems to recognize that students who are motivated often possess both specific and general goals and approach new phenomena with the goal of understanding them, which is a student characteristic awarded in grades” (p. 19).
These findings, similar to those of S. Kelly (2008), Bowers (2009, 2011), and Casillas et al. (2012), support the idea that substantive engagement is an important component of grades that is distinct from the skills measured by standardized tests. A validity argument that expects grades and standardized tests to correlate highly therefore may not be sound because the construct of school achievement is not fully defined by standardized test scores. Tested achievement represents one dimension of the results of schooling, privileging “individual cognition, pure mentation, symbol manipulation, and generalized learning” (Resnick, 1987, pp. 13–15).
Grades as Predictors of Educational Outcomes
Table 4 presents studies of grades as predictors of educational outcomes. Teacher-assigned grades are known to predict graduation from high school (Bowers, 2014), as well as transition from high school to college (Atkinson & Geiser, 2009; Cliffordson, 2008). Satisfactory grades historically have been used as one of the means to grant students a high school diploma (Rumberger, 2011). Studies from the second half of the 20th century and into the 21st century, however, have focused on using grades from early grade levels to predict student graduation rate or risk of dropping out of school (Gleason & Dynarski, 2002; Pallas, 1989).
Early studies in this domain (Fitzsimmons, Cheever, Leonard, & Macunovich, 1969; Lloyd, 1974, 1978; Voss, Wendling, & Elliott, 1966) identified teacher-assigned grades as one of the strongest predictors of student risk for failing to graduate from high school. Subsequent studies included other variables such as absence and misbehavior and found that grades remained a strong predictor (Barrington & Hendricks, 1989; Cairns, Cairns, & Neckerman, 1989; Ekstrom, Goertz, Pollack, & Rock, 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Hargis, 1990; Morris, Ehren, & Lenz, 1991; Rumberger, 1987; Troob, 1985). More recent research using a life course perspective showed that low or failing grades have a cumulative effect over a student’s time in school and contribute to the eventual decision to leave (Alexander, Entwisle, & Kabbani, 2001; Jimerson, Egeland, Sroufe, & Carlson, 2000; Pallas, 2003; Roderick & Camburn, 1999).
Other research in this area considered grades in two ways: the influence of low grades (Ds and Fs) on dropping out, and the relationship of a continuous scale of grades (e.g., GPA) to at-risk status and eventual graduation or dropping out. Three examples are particularly notable. Allensworth and colleagues have shown that failing a core subject in ninth grade is highly correlated with dropping out of school, and places a student offtrack for graduation (Allensworth, 2013; Allensworth & Easton, 2005, 2007). Such failure also compromises the transition from middle school to high school (Allensworth, Gwynne, Moore, & de la Torre, 2014). Balfanz, Herzog, and MacIver (2007) showed a strong relationship between failing core courses in sixth grade and dropping out. Focusing on modeling conditional risk, Bowers (2010b) found the strongest predictor of dropping out after grade retention was having D and F grades.
Few studies, however, have focused on grades as the sole predictor of graduation or dropping out. Most studies examine longitudinal grade patterns, using either data-mining techniques such as cluster analysis of all K–12 course grades (Bowers, 2010a) or mixture modeling techniques to identify growth patterns or decline in GPA in early high school (Bowers & Sprott, 2012). A recent review of the studies on the accuracy of dropout predictors showed that along with the Allensworth Chicago on-track indicator (Allensworth & Easton, 2007), longitudinal GPA trajectories were among the most accurate predictors identified (Bowers et al., 2013).
Teachers’ Perceptions of Grading and Grading Practices
Systematic investigations of teachers’ grading practices and perceptions about grading began to be published in the 1980s and were summarized in Brookhart’s (1994) review of 19 empirical studies of teachers’ grading practices, opinions, and beliefs. Five themes were supported. First, teachers use measures of achievement, primarily tests, as major determinants of grades. Second, teachers believe it is important to grade fairly. Views of fairness included using multiple sources of information, incorporating effort, and making it clear to students what is assessed and how they will be graded. This finding suggests teachers consider school achievement to include the work students do in school, not just the final outcome. Third, in 12 of the studies, teachers included noncognitive factors in grades, including ability, effort, improvement, completion of work, and, to a small extent, other student behaviors. Fourth, grading practices are not consistent across teachers, with respect to either the purpose or the extent to which noncognitive factors are considered, reflecting differences in teachers’ beliefs and values. Finally, grading practices vary by grade level.
Secondary teachers emphasize achievement products such as tests whereas elementary teachers use informal evidence of learning along with achievement and performance assessments. Brookhart’s (1994) review demonstrated an upswing in interest in investigating grading practices during this period, in which performance-based and portfolio classroom assessment was emphasized and reports of the unreliability of teachers’ subjective judgments about student work also increased. The findings were in accord with policymakers’ increasing distrust of teachers’ judgments about student achievement.
Teachers’ Reported Grading Practices
Empirical studies of teachers’ grading practices over the past 20 years have mainly used surveys to document how teachers use both cognitive and noncognitive evidence, primarily effort, and their own professional judgment in determining grades. As Table 5 shows, most studies published since Brookhart’s (1994) review document that teachers in different subjects and grade levels use “hodgepodge” grading (Brookhart, 1991, p. 36), combining achievement, effort, behavior, improvement, and attitudes (Adrian, 2012; Bailey, 2012; Cizek, Fitzgerald, & Rachor, 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Frary, Cross, & Weber, 1993; Grimes, 2010; Guskey, 2002, 2009a; Imperial, 2011; Liu, 2008b; Llosa, 2008; McMillan, 2001; McMillan & Lawson, 2001; McMillan, Myran, & Workman, 2002; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg, Meckbach, & Redelius, 2014; Troug & Friedman, 1996; Yesbeck, 2011). Teachers often make grading decisions with little school or district guidance.
Teachers distinguish among nonachievement factors in grading. They view “academic enablers” (McMillan, 2001, p. 25), including effort, ability, work habits, attention, and participation, differently from other nonachievement factors, such as student personality and behavior. McMillan (2001), consistent with earlier research, found that academic performance and academic enablers were by far most important in determining grades. These findings have been replicated (Duncan & Noonan, 2007; McMillan et al., 2002). In a qualitative study, McMillan and Nash (2000) found that teaching philosophy and judgments about what is best for students’ motivation and learning contribute to variability of grading practices, suggesting that an emphasis on effort, in particular, influences these outcomes. Randall and Engelhard (2010) found that teacher beliefs about what best supports students are important factors in grading, especially using noncognitive factors for borderline grades, as Sun and Cheng (2013) also found with a sample of Chinese secondary teachers. These studies suggest that part of the reason for the multidimensional nature of grading reported in the previous section is that teachers’ conceptions of academic achievement include behavior that supports and promotes academic achievement, and that teachers evaluate these behaviors as well as academic content in determining grades. These studies also showed significant variation among teachers within the same school. That is, the weight that different teachers give to separate factors can vary a great deal within a single elementary or secondary school (Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2009a; Henke, Chen, Goldman, Rollefson, & Gruber, 1999; Troug & Friedman, 1996; Webster, 2011).
Teacher Perceptions About Grading
Compared to the number of studies about teachers’ grading practices, relatively few studies focus directly on perceptual constructs such as importance, meaning, value, attitudes, and beliefs. Several studies used Brookhart’s (1994) suggestion that Messick’s (1989) construct validity framework is a reasonable approach for investigating perceptions. This framework focuses on both the interpretation of the construct (what grading means) and the implications and consequences of grading (the effect it has on students). Sun and Cheng (2013) used this conceptual framework to analyze teachers’ comments about their grading and the extent to which values and consequences were considered. The results showed that teachers interpreted good grades as a reward for accomplished work, based on both effort and quality, student attitude toward achievement as reflected by homework completion, and progress in learning. Teachers indicated the need for fairness and accuracy, not just accomplishment, saying that grades are fairer if they are lowered for lack of effort or participation and that grading needs to be strict for high achievers. Teachers also considered consequences of grading decisions for students’ future success and feelings of competence.
Fairness in an individual sense is a theme in several studies of teacher perceptions of grades (Bonner & Chen, 2009; Grimes, 2010; Hay & Macdonald, 2008; Kunnath, 2016; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney, Simon, & Charland, 2011). Teachers perceive grades to have value according to what they can do for individual students. Many teachers use their understanding of individual student circumstances, their instructional experience, and perceptions of equity, consistency, accuracy, and fairness to make professional judgments, instead of relying solely on a grading algorithm. These claims suggest that grading practices may vary within a single classroom, just as they do among teachers, and that this variation is viewed, at least by some teachers, as a needed element of accurate, fair grading, not as a problem. In a case study of one high school mathematics teacher in Canada, M. Simon et al. (2010) reported that standardized grading policy often conflicted with professional judgment and had a significant impact on determining students’ final grades.
Some researchers (Liu, 2008a; Liu, O’Connell, & McCoach, 2006; Wiley, 2011) have developed scales to assess teachers’ beliefs and attitudes about grading, including items that load on importance, usefulness, effort, ability, grading habits, and perceived self-efficacy of the grading process. These studies have corroborated the survey and interview findings about teachers’ beliefs in using both cognitive and noncognitive factors in grading. Guskey (2009a) found differences between elementary and secondary teachers in their perspectives about purposes of grading. Elementary teachers were more likely to view grading as a process of communication with students and parents and to differentiate grades for individual students. Secondary teachers believed that grading served a classroom control and management function, emphasizing student behavior and completion of work.
In short, findings from the limited number of studies on teacher perceptions of grading are largely consistent with findings from grading practice surveys. Some studies have successfully explored the basis for practices and show that teachers view grading as a means to have fair, individualized, positive impacts on students’ learning and motivation and, to a lesser extent, classroom control. Together, the research on grading practices and perceptions suggests the following four clear and enduring findings. First, teachers idiosyncratically use a multitude of achievement and nonachievement factors in their grading practices to improve learning and motivation as well as document academic performance. Second, student effort is a key element in grading. Third, teachers advocate for students by helping them achieve high grades. Finally, teacher judgment is an essential part of fair and accurate grading.
Standards-Based Grading
SBG recommendations emphasize communicating student progress in relation to grade-level standards (e.g., adding fractions, computing area) that describe performance using ordered categories (e.g., below basic, basic, proficient, advanced) and involve separate reporting of work habits and behavior (Brookhart, 2011; Guskey, 2009b; Guskey & Bailey, 2001, 2010; Marzano & Heflebower, 2011; McMillan, 2009; Melograno, 2007; Mohnsen, 2013; O’Connor, 2009; Scriffiny, 2008; Shippy, Washer, & Perrin, 2013; Wiggins, 1994). SBG is differentiated from standardized grading, which provides teachers with uniform grading procedures in an attempt to improve consistency in grading methods, and from mastery grading, which expresses student performance on a variety of skills using a binary mastered/not mastered scale (Guskey & Bailey, 2001). Some also assert that SBG can provide exceptionally high-quality information to parents, teachers, and students and, therefore, has the potential to bring about instructional improvements and larger educational reforms. Others urge caution. Cizek (2000), for example, warned that SBG may be no better than other reporting formats and subject to the same misinterpretations as other grading scales.
Literature on SBG implementation recommendations is extensive, but empirical studies are few. Studies of SBG to date have focused mostly on the implementation of SBG reforms and the relationship of SBG to state achievement tests designed to measure the same or similar standards. One study investigated student, teacher, and parent perceptions of SBG. Table 6 presents these studies.
Implementation of SBG
Schools, districts, and teachers have experienced difficulties in implementing SBG (Clarridge & Whitaker, 1994; Cox, 2011; Hay & Macdonald, 2008; McMunn, Schenck, & McColskey, 2003; M. Simon et al., 2010; Tierney et al., 2011). The understanding and support of teachers, parents, and students are key to successful implementation of SBG practices, especially grading on standards and separating achievement grades from learning skills (academic enablers). Although many teachers report that they support such grading reforms, they also report using practices that mix effort, improvement, or motivation with academic achievement (Cox, 2011; Hay & Macdonald, 2008; McMunn et al., 2003). Teachers also vary in implementing SBG practices (Cox, 2011), especially in using common assessments, following minimum grading policies, accepting work late with no penalty, and allowing students to retest and replace poor scores with retest scores.
The previous section summarized two studies of grading practices in Ontario, Canada, which adopted SBG province-wide and required teachers to grade students on specific topics within each content area using percentage grades. M. Simon et al. (2010) identified tensions between provincial grading policies and one teacher’s practice. Tierney et al. (2011) found that few teachers were aware of and applying provincial SBG policies. These findings are consistent with McMunn et al.’s (2003) findings, which showed that changes in grading practice do not necessarily follow after changes in grading policy.
SBG as a Communication Tool
Swan, Guskey, and Jung (2014; see also Guskey, Swan, & Jung, 2010) found that parents, teachers, and students preferred SBG over traditional report cards, with teachers considering adopting SBG having the most favorable attitudes. Teachers implementing SBG reported that it took longer to record the detailed information included in the SBG report cards but felt the additional time was worthwhile because SBGs yielded higher-quality information. An earlier informal report by Guskey (2004) found, however, that many parents attempted to interpret nearly all labels (e.g., below basic, basic, proficient, advanced) in terms of letter grades. It may be that a decade of increasing familiarity with SBG has changed perceptions of the meaning and usefulness of SBG.
Relationship of SBGs to High-Stakes Test Scores
One might expect consistency between SBGs and standards-based assessment scores because they purport to measure the same standards. Eight papers examined this consistency (Howley, Kusimo, & Parrott, 1999; Klapp Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, 2009; J. A. Ross & Kostuch, 2011; Thorsen & Cliffordson, 2012; Welsh & D’Agostino, 2009; Welsh, D’Agostino, & Kaniskan, 2013). All yielded essentially the same results: SBGs and high-stakes, standards-based assessment scores were only moderately related. Howley et al. (1999) found that 50% of the variance in GPA could be explained by standards-based assessment scores, and the magnitude of the relationship varied by school. Interview data revealed that even in SBG settings, some teachers included noncognitive factors (e.g., attendance and participation) in grades. This finding may explain the modest relationship, at least in part.
Welsh and D’Agostino (2009) and Welsh et al. (2013) developed an Appraisal Scale that gauged teachers’ efforts to assess and grade students on standards attainment. This 10-item measure focused on the alignment of assessments with standards and on the use of a clear, standards attainment–focused grading method. They found small to moderate correlations between this measure and grade–test score convergence. That is, the standards-based grades of teachers who used criterion-referenced achievement information were more related to standards-based assessments than were the grades of teachers who did not follow this practice. Welsh and D’Agostino (2009) and Welsh et al. (2013) found that SBG–test score relationships were larger in writing and mathematics than in reading. In addition, although teachers assigned lower grades than test scores in mathematics, grades were higher than test scores in reading and writing. J. A. Ross and Kostuch (2011) also found stronger SBG–test correlations in mathematics than in reading or writing, and grades tended to be higher than test scores, with the exception of writing scores at some grade levels.
Grading in Higher Education
Grades in higher education differ markedly among countries. As a case in point, four dramatic differences exist between the United States and New Zealand. First, grading practices are much more centralized in New Zealand, where grading is fairly consistent across universities and highly consistent within universities. Second, the New Zealand grading scale starts with a passing score of 50%, and 80% and above yields an A. Third, essay testing is more prevalent than multiple-choice testing in New Zealand. Fourth, grade distributions are reviewed and grades of individual instructors are considered each semester at departmental-level meetings. These practices are, at best, rarities in higher education in the United States.
An examination of 35 country and university websites paints a broad picture of the diversity in grading practices. Many countries use a system like that in New Zealand, in which 50 or 51 is the minimal passing score, and 80 and above (sometimes 90 and above) is required for an “A.” Many countries also offer an “E” grade, which is sometimes a passing score and other times indicates a failure less egregious than an “F.” If 50% is considered passing, then skepticism toward multiple-choice testing (where there is often a 1 in 4 chance of a correct guess) becomes understandable. In the Netherlands, a 1 (lowest) to 10 (highest) system is used, with Grades 1 to 3 and 9 and 10 rarely awarded, leaving a 5-point grading system for most students (Nuffic, 2013). In the European Union, differences between countries are so substantial that the European Credit Transfer and Accumulation System was created (European Commission, 2009).
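The arithmetic behind this skepticism can be sketched briefly. The example below is an illustration only, not an analysis drawn from any of the sources reviewed: it assumes every item has four options, that a student guesses blindly and independently on every item, and that the pass mark is 50%. The function name and the test lengths are hypothetical.

```python
from math import ceil, comb


def prob_pass_by_guessing(n_items: int, n_options: int = 4, pass_mark: float = 0.5) -> float:
    """Probability of reaching the pass mark by blind guessing on every item.

    Assumes independent items and an equal chance of picking each option,
    so the number of correct answers follows a binomial distribution.
    """
    p = 1 / n_options                      # chance of guessing one item correctly
    needed = ceil(pass_mark * n_items)     # fewest correct answers that still pass
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )


if __name__ == "__main__":
    # Hypothetical test lengths, chosen only to show the trend.
    for n in (10, 20, 40):
        print(f"{n} items: P(pass by guessing) = {prob_pass_by_guessing(n):.4f}")
```

Under these assumptions the expected score from guessing alone is 25%, half of the pass mark, and on a 10-item test a blind guesser passes roughly 8% of the time. The chance shrinks as tests get longer, so the concern is sharpest for short multiple-choice tests.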
Grading in higher education varies within countries, as well. In the United States, it is typically seen as a matter of academic freedom and not a fit subject for external intervention. Indeed, in an analysis of the American Association of Collegiate Registrars and Admissions Officers survey of grading practices in higher education in the United States, Collins and Nickel (1974) reported, “There are as many different types of grading systems as there are institutions” (p. 3). The 2004 version of the same survey suggested, however, a somewhat more settled situation in recent years (Brumfield, 2005). Grading in higher education shares many issues of grade meaning with the K–12 context, which have been addressed above. Two unique issues for grade meaning remain: grading and student course evaluations, and historical changes in expected grade distributions. Table 7 presents studies in these areas.
Grades and Student Course Evaluations
Students in higher education routinely evaluate the quality of their course experiences and their instructors’ teaching. The relationship between course grades and course evaluations has been of interest for at least 40 years (Abrami, Dickens, Perry, & Leventhal, 1980; Holmes, 1972) and is a subquestion in the general research about student evaluations of courses (e.g., Centra, 1993; Marsh, 1984, 1987; McKeachie, 1979; Spooren, Brockx, & Mortelmans, 2013). The hypothesis is straightforward: Students will give higher course evaluations to faculty who are lenient graders. This grade-leniency theory (Love & Kotchen, 2010; McKenzie, 1975) has long been lamented, particularly by faculty who perceive themselves as rigorous graders and do not enjoy favorable student evaluations. This assumption is so prevalent that it is close to accepted as settled science (Ginexi, 2003; Marsh, 1987; Salmons, 1993). Ginexi (2003) posited that the relationship between anticipated grades and course evaluation ratings could be a function of cognitive dissonance (between the student’s self-image and an anticipated low grade) or of revenge theory (retribution for an anticipated low grade). Although Maurer (2006) argued that revenge theory is popular among faculty receiving low course evaluations, neither his study nor an earlier study by Kasten and Young (1983) found support for it. These authors therefore argued for the cognitive dissonance model, in which attributing a perceived lack of success to poor teaching is an intrapersonal face-saving device.
A critical look at the literature presents an alternative argument. First, the relationship between anticipated grades and course evaluation ratings is moderate at best. Meta-analytic work (Centra & Creech, 1976; Feldman, 1997) suggests correlations between .10 and .30, or that anticipated grades account for less than 10% of the variance in course evaluations. It therefore appears that anticipated grades have little influence on student evaluations. Second, the relationship between anticipated grades and course evaluations could simply reflect an honest assessment of students’ opinions of instruction, which varies according to the students’ experiences of the course (J. K. Smith & Smith, 2009). Students who like the instructional approach may be expected to do better than students who do not. Students exposed to exceptionally good teaching might be expected to do well in the course and to rate the instruction highly (and vice versa for poor instruction). Although face-saving or revenge might occur, a fair amount of honest and accurate appraisal of the quality of teaching might be reflected in the observed correlations.
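The step from the reported correlations to the variance figure follows from the standard relationship between a Pearson correlation r and the proportion of shared variance r². As a worked check (the notation is added here for illustration and does not come from the cited meta-analyses):

```latex
r \in [.10,\ .30] \;\Longrightarrow\; r^{2} \in [.01,\ .09],
\qquad\text{so even at the upper bound}\quad r^{2} = (.30)^{2} = .09 < .10 .
```

That is, anticipated grades would account for between about 1% and 9% of the variance in course evaluations across this range.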
Historical Changes in Expectations for Grade Distributions
The roots of grading in higher education can be traced back hundreds of years. In the 16th century, Cambridge University developed a three-tier grading system with 25% of the grades at the top, 50% in the middle, and 25% at the bottom (Winter, 1993). Working from European models, American universities invented systems for ranking and categorizing students based both on academic performance and on progress, conduct, attentiveness, interest, effort, and regular attendance at class and chapel (Cureton, 1971; Rugg, 1918; Schneider & Hutt, 2014). Grades were ubiquitous at all levels of education at the turn of the 20th century but were idiosyncratically determined (Schneider & Hutt, 2014), as described earlier.
To resolve inconsistencies, educators turned to the new science of statistics, and a concomitant passion for measuring and ranking human characteristics (Pearson, 1930). Inspired by the work of his cousin, Charles Darwin, Francis Galton pioneered the field of psychometrics, extending his efforts to rank one’s fitness to produce high-quality offspring on an A to D scale (Galton & Galton, 1998). Educators began to debate how normal curve theory and other scientific advances should be applied to grading. As with K–12 education, the consensus was that the 0 to 100 marking system led to an unjustified implication of precision, and that the normal curve would allow for transformation of student ranks into A–F or other categories (Rugg, 1918).
Meyer (1908) argued for grade categories as follows: excellent (3% of students), superior (22%), medium (50%), inferior (22%), and failure (3%). He argued that a student picked at random is as likely to be of medium ability as not. Interestingly, Meyer’s terms for the middle three grades (superior, medium, and inferior) are norm-referenced, whereas the two extreme grades (excellent and failure) are criterion-referenced. Roughly a decade later, Nicolson (1917) found that 36 out of 64 colleges were using a 5-point scale for grading, typically A, B, C, D, and F. The questions debated at the time were more over the details of such systems as opposed to the overall approach. As Rugg (1918) stated,
Now the term inherited capacity practically defines itself. By it we mean the “start in life;” the sum total of nervous possibilities which the infant has at birth and to which, therefore, nothing that the individual himself can do will contribute in any way whatsoever. (p. 706)
Rugg (1918) went on to say that educational conditions interact with inherited capacity, resulting in what he called “ability-to-do” (p. 706). He recommended that teachers base marks on observations of students’ performance that reflect those abilities, and that grades should form a normal distribution. This approach reduces grading to determining the number of grading divisions and the number of students who should fall into each category. Thus, there is a shift from a decentralized and fundamentally haphazard approach to assigning grades to one that is based on “scientific” (p. 701) principles. Furthermore, Rugg argued that letter grades were preferable to percentage grades as they more accurately represented the level of precision that was possible.
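What it means to reduce grading to a number of divisions and a share of students in each can be sketched in a few lines. The example below uses Meyer’s (1908) five categories and percentages as reported above; everything else is an assumption made for illustration, including the function names, the mid-rank percentile rule, and the reading of the shares as slices of a standard normal curve. It is not a reconstruction of any procedure Meyer or Rugg specified.

```python
from statistics import NormalDist

# Meyer's (1908) proposed shares, ordered from lowest to highest category.
MEYER_SHARES = [
    ("failure", 0.03),
    ("inferior", 0.22),
    ("medium", 0.50),
    ("superior", 0.22),
    ("excellent", 0.03),
]


def z_cutoffs(shares=MEYER_SHARES):
    """z-scores separating adjacent categories IF the shares are cut from a
    standard normal curve (an illustrative assumption, not Meyer's own claim)."""
    cuts, cumulative = [], 0.0
    for _, share in shares[:-1]:
        cumulative += share
        cuts.append(NormalDist().inv_cdf(cumulative))
    return cuts


def grade_by_rank(scores, shares=MEYER_SHARES):
    """Assign categories purely from rank order, ignoring what the scores mean."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])   # indices, lowest score first
    labels = [shares[-1][0]] * n                         # default to the top category
    for position, i in enumerate(order):
        percentile = (position + 0.5) / n                # mid-rank percentile
        cumulative = 0.0
        for label, share in shares[:-1]:
            cumulative += share
            if percentile <= cumulative:
                labels[i] = label
                break
    return labels


if __name__ == "__main__":
    print([round(z, 2) for z in z_cutoffs()])
    labels = grade_by_rank(list(range(100)))             # 100 hypothetical students
    print({name: labels.count(name) for name, _ in MEYER_SHARES})
```

Note that the categories depend only on a student’s position in the class: the same raw performance earns a different grade in a stronger or weaker cohort, which is the norm-referenced logic that criterion-referenced approaches later rejected.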
Another interesting aspect of Rugg’s (1918) and Meyer’s (1908) work is the notion that grades should simply be a method of ranking students, and not necessarily used for making decisions about achievement. Although Meyer argued that 3% should fail a typical course (and he feared that people would see this as too lenient), he was less certain about what to do with the “inferior” group, stating that grades should solely represent a student’s rank in the class. In hindsight, these approaches seem reductionist at best. Although the notion of grading “on the curve” remained popular through at least the early 1960s, a categorical (A–F) approach to assigning grades was implemented. This system tended to mask the continued practice of ensuring that neither too many As nor too many Fs were handed out (Guskey, 2000; Kulick & Wright, 2008). The normal curve was the “silent partner” of the grading system.
In the United States in the 1960s, a confluence of technical and societal events led to dramatic changes in perspectives about grading. These were criterion-referenced testing (Glaser, 1963), mastery learning and mastery testing (Bloom, 1971; Mayo, 1970), the Civil Rights movement, and the war in Vietnam. Glaser (1963) brought forth the innovative idea that sense should be made out of test performance by “referencing” performance not to a norming group but rather to the domain whence the test came; students’ performance should not be based on the performance of their peers. The proper referent, according to Glaser, was the level of mastery on the subject matter being assessed. Working from Carroll’s (1963) model of school learning, Bloom (1971) developed the underlying argument for mastery learning theory: that achievement in any course (and by extension, the grade received) should be a function of the quality of teaching, the perseverance of the student, and the time allowed for the student to master the material (Guskey, 1985).
The work of Bloom (1971) and Glaser (1963) did not single-handedly change how grading took place in higher education, but ideas about teaching and learning partially inspired by this work led to a substantial rethinking of the proper aims of education. Bring into this mix a national reexamination of status and equity, and the time was ripe for a humanistic and social reassessment of grading and learning in general. The final ingredient in the mix was the war in Vietnam. The United States had its first conscription since World War II, and as the war grew increasingly unpopular, pressure grew on professors not to fail students and thereby make them subject to the draft. The effect of the draft on grading practices in higher education is unmistakable (Rojstaczer & Healy, 2012). The proportion of A and B grades rose dramatically during the years of the draft; the proportion of D and F grades fell concomitantly.
Grades have risen again dramatically in the past 25 years. Rojstaczer and Healy (2012) argued that the increase resulted from a shift toward viewing students as consumers, or even customers, and away from viewing them as needing discipline. Others have contended that faculty inflate grades to vie for good course ratings (the grade-leniency theory; Love & Kotchen, 2010). Or perhaps students are higher achieving than they were and deserve better grades.
Discussion
This review shows that over the past 100 years, teacher-assigned grades have been maligned by researchers and psychometricians alike as subjective and unreliable measures of student academic achievement (Allen, 2005; Banker, 1927; Carter, 1952; Evans, 1976; Hargis, 1990; Kirschenbaum et al., 1971; Quann, 1983; S. B. Simon & Bellanca, 1976). However, others have noted that grades are a useful indicator of numerous factors that matter to students, teachers, parents, schools, and communities (Bisesi, Farr, Greene, & Haydel, 2000; Folzer-Napier, 1976; Linn, 1982). Over the past 100 years, research has attempted to identify the different components of grades in order to inform educational decision making (Bowers, 2009; Parsons, 1959). Interestingly, although standardized assessment scores have been shown to have low criterion validity for overall schooling outcomes (e.g., high school graduation and admission to postsecondary institutions), grades consistently predict K–12 educational persistence, completion, and transition from high school to college (Atkinson & Geiser, 2009; Bowers et al., 2013).
One hundred years of quantitative studies of the composition of K–12 report card grades demonstrate that teacher-assigned grades represent both the cognitive knowledge measured in standardized assessment scores and, to a smaller extent, noncognitive factors such as substantive engagement, persistence, and positive school behaviors (e.g., Bowers, 2009, 2011; Farkas et al., 1990; Klapp Lekholm & Cliffordson, 2008, 2009; Miner, 1967; Willingham et al., 2002). Grades are useful in predicting and identifying students who may face challenges in either the academic component of schooling or in the sociobehavioral domain (e.g., Allensworth, 2013; Allensworth & Easton, 2007; Allensworth et al., 2014; Atkinson & Geiser, 2009; Bowers, 2014).
The conclusion is that grades typically represent a mixture of multiple factors that teachers value. Teachers recognize the important role of effort in achievement and motivation (Aronson, 2008; Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2002, 2009a; Imperial, 2011; S. Kelly, 2008; Liu, 2008b; McMillan, 2001; McMillan et al., 2002; McMillan & Lawson, 2001; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg et al., 2014; Troug & Friedman, 1996; Yesbeck, 2011). They differentiate academic enablers (McMillan, 2001, p. 25) like effort, ability, improvement, work habits, attention, and participation, which they endorse as relevant to grading, from other student characteristics like gender, socioeconomic status, or personality, which they do not endorse as relevant to grading.
This quality of graded achievement as a multidimensional measure of success in school may be what makes grades better predictors of future success in school than tested achievement (Atkinson & Geiser, 2009; Barrington & Hendricks, 1989; Bowers, 2014; Cairns et al., 1989; Cliffordson, 2008; Ekstrom et al., 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Fitzsimmons et al., 1969; Hargis, 1990; Lloyd, 1974, 1978; Morris et al., 1991; Rumberger, 1987; Troob, 1985; Voss et al., 1966), especially given known limitations of achievement testing (Nichols & Berliner, 2007; Polikoff, Porter, & Smithson, 2011). In the search for assessments of noncognitive factors that predict educational outcomes (Heckman & Rubinstein, 2001; Levin, 2013), grades appear to be useful. Current theories postulate that both cognitive and noncognitive skills are important to acquire and build over the course of life. Although noncognitive skills may help students develop cognitive skills, the reverse is not true (Cunha & Heckman, 2008).
Teachers’ values are a major component in this multidimensional interpretation of grades. Besides academic enablers, two other important teacher values work to make graded achievement different from tested achievement. One is the value that teachers place on being fair to students (Bonner, 2016; Bonner & Chen, 2009; Brookhart, 1994; Grimes, 2010; Hay & Macdonald, 2008; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney et al., 2011). In their concept of fairness, most teachers believe that students who try should not fail, whether or not they learn. Related to this concept is teachers’ wish to help all or most students be successful (Bonner, 2016; Brookhart, 1994).
Grades, therefore, must be considered multidimensional measures that reflect mostly achievement of classroom learning intentions and also, to a lesser degree, students’ efforts at getting there. Grades are not unidimensional measures of pure achievement, as has been assumed in the past (e.g., Carter, 1952; McCandless et al., 1972; Moore, 1939; C. C. Ross & Hooks, 1930) or recommended in the present (e.g., Brookhart, 2009, 2011; Guskey, 2000; Guskey & Bailey, 2010; Marzano & Heflebower, 2011; O’Connor, 2009; Scriffiny, 2008). Although measurement experts and professional developers may wish grades were unadulterated measures of what students have learned and are able to do, strong evidence indicates that they are not.
For those who wish grades could be a more focused measure of achievement of intended instructional outcomes, future research needs to cast a broader net. The value teachers attach to effort and other academic enablers in grades and their insistence that grades should be fair point to instructional and societal issues that are well beyond the scope of grading. Why, for example, do some students who sincerely try to learn what they are taught not achieve the intended learning outcomes? Two important possibilities include intended learning outcomes that are developmentally inappropriate for these students (e.g., these students lack readiness or prior instruction in the domain), and poorly designed lessons that do not make clear what students are expected to learn, do not instruct students in appropriate ways, and do not arrange learning activities and formative assessments in ways that help students learn well.
Research focusing solely on grades typically misses antecedent causes. Future research should make these connections. For example, does more of the variance in grades reflect achievement in classes where lessons are high-quality and appropriate for students? Is a negatively skewed grade distribution, where most students achieve and very few fail, effective for the purposes of certifying achievement, communicating with students and parents, passing students to the next grade, or predicting future educational success? Do changes in instructional design lead to changes in grading practices, in grade distributions, and in the usefulness of grades as predictors of future educational success?
This review suggests that most teachers’ grades do not yield a pure achievement measure but are rather a multidimensional measure dependent on both what the students learn and how they behave in the classroom. This conclusion, however, does not excuse low-quality grading practices or suggest there is no room for improvement. One hundred years of grading research have generally confirmed large variation among teachers in the validity and reliability of grades, both in the meaning of grades and in the accuracy of reporting. Early research found great variation among teachers when asked to grade the same examination or paper. Many of these early studies communicated a “what’s wrong with teachers” undertone that today would likely be seen as researcher bias.
Early researchers attributed variation in teachers’ grades to one or more of the following sources: criteria (Ashbaugh, 1924; Brimi, 2011; Healy, 1935; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913a,b), students’ work quality (Bolton, 1927; Healy, 1935; Jacoby, 1910; Lauterbach, 1928; Shriner, 1930; Sims, 1933), teacher severity/leniency (Shriner, 1930; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913b), task (Silberstein, 1922; Starch & Elliott, 1913a), scale (Ashbaugh, 1924; Sims, 1933; Starch, 1913, 1915), and teacher error (Brimi, 2011; Eells, 1930; Hulten, 1925; Lauterbach, 1928; Silberstein, 1922; Starch & Elliott, 1912, 1913a,b). Starch (1913; Starch & Elliott, 1913b) found that teacher error and emphasizing different criteria were the two largest sources of variation.
Regarding sources of error, J. K. Smith (2003) suggested reconceptualizing reliability for grades as a matter of sufficiency of information for making the grade assignment. This recommendation is consistent with the fact that as grades are aggregated from individual pieces of work to report card or course grades and GPAs, reliability increases. The reliability of overall college grade-point average is estimated at .93 (Beatty, Walmsley, Sackett, Kuncel, & Koch, 2015).
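One classical way to illustrate why aggregation raises reliability is the Spearman-Brown formula, which projects the reliability of a composite of k parallel components from the reliability of a single component. The sketch below is an illustration only: it assumes strictly parallel components, the single-assignment reliability of .30 is a hypothetical value, and the function name is invented for this example. It is not the method used by Beatty et al. (2015) to obtain their estimate.

```python
def spearman_brown(single_reliability: float, n_components: int) -> float:
    """Projected reliability of a composite of n parallel components.

    Spearman-Brown formula: rho_k = k * rho / (1 + (k - 1) * rho).
    Assumes every component measures the same construct equally well.
    """
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("single_reliability must lie in [0, 1]")
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    r, k = single_reliability, n_components
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # Hypothetical reliability of one graded assignment (assumption, not a
    # figure from the studies reviewed here).
    one_assignment = 0.30
    # Composites of increasing size, e.g., a unit, a course, many courses.
    for k in (1, 5, 10, 40):
        print(f"{k:>2} components: {spearman_brown(one_assignment, k):.2f}")
```

Under these assumptions the projected reliability rises steadily as more graded components are combined, which matches the pattern described above for report card grades, course grades, and GPAs.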
In most studies investigating teachers’ grading reliability, teachers were sent examination papers without specific grading criteria and simply asked to assign grades. Today, this lack of clear grading criteria would be seen as a shortcoming in the assessment process. Most of these studies thus confounded teachers’ inability to judge student work consistently with random error, treating both as teacher error. Rater training offers a modern solution to this situation. Research has shown that with training on established criteria, individuals can judge examinees’ work more accurately and reliably (Myford, 2012). Unfortunately, most teachers and professors today are not well trained, typically grade alone, and rarely seek help from colleagues to check the reliability of their grading. Thus, working toward clearer criteria, collaborating among teachers, and involving students in the development of grading criteria appear to be promising approaches to enhancing grading reliability.
Considering criteria as a source of variation in teachers’ grading has implications for grade meaning and validity. The attributes on which grading decisions are based function as the constructs the grades are intended to measure. To the extent teachers include factors that do not indicate achievement in the domain they intend to measure (e.g., when grades include consideration of format and surface-level features of an assignment), grades do not give students, parents, or other educators accurate information about learning. Furthermore, to the extent teachers do not appropriately interpret student work as evidence of learning, the intended meaning of the grade is also compromised. There is evidence that even teachers who explicitly decide to grade solely on achievement of learning standards sometimes mix in effort, improvement, and other academic enablers when determining grades (Cox, 2011; Hay & Macdonald, 2008; McMunn et al., 2003).
Future research in this area should seek ways to help teachers improve the criteria they use to grade, their skill at identifying levels of quality on the criteria, and their ability to effectively merge these assessment skills and instructional skills. When students are taught the criteria by which to judge high-quality work and are assessed by those same criteria, grade meaning is enhanced. Even if grades remain multidimensional measures of success in school, the dimensions on which grades are based should be defensible goals of schooling and should match students’ opportunities to learn.
No research agenda will ever entirely eliminate teacher variation in grading. Nevertheless, the authors of this review have suggested several ways forward. Investigating grading in the larger context of instruction and assessment will help focus research on important sources and causes of invalid or unreliable grading decisions. Investigating ways to differentiate instruction more effectively, routinely, and easily will reduce teachers’ feelings of pressure to pass students who may try but do not reach an expected level of achievement. Investigating the multidimensional construct of “success in school” will acknowledge that grades measure something significant that is not measured by achievement tests. Investigating ways to help teachers develop skills in writing or selecting and then communicating criteria, and recognizing these criteria in students’ work, will improve the quality of grading. All of these seem reachable goals to achieve before the next century of grading research. All will assuredly contribute to enhancing the validity, reliability, and fairness of grading.
Footnotes
Note
Contributing authors worked equally and are listed in alphabetical order after the two project leaders.
Authors
SUSAN M. BROOKHART, PhD, is an independent educational consultant and an adjunct faculty member at Duquesne University, Pittsburgh, PA 15282.
THOMAS R. GUSKEY, PhD, is a professor of education at the University of Kentucky, Lexington, KY 40506.
ALEX J. BOWERS, PhD, is an associate professor of education leadership at Teachers College, Columbia University, New York, NY 10027.
JAMES H. MCMILLAN, PhD, is interim associate dean for academic affairs and professor of education at Virginia Commonwealth University, Richmond, VA 23284.
JEFFREY K. SMITH, PhD, is a professor of education at the University of Otago in Dunedin, New Zealand.
LISA F. SMITH, PhD, is a professor and dean of education at the University of Otago in Dunedin, New Zealand.
MICHAEL T. STEVENS is a graduate student in the School of Education at the University of California, Davis, CA 95616.
MEGAN E. WELSH, PhD, is an assistant professor in educational assessment and measurement at the University of California, Davis, CA 95616.
