Incentivizing Multiple Revisions Improves Student Writing Without Increasing Instructor Workload

Abstract

Previous research has shown that when students are required to submit a draft and a revision of their writing, large proportions of students do not improve across drafts. We implemented a writing assignment in which students were permitted to submit up to four optional drafts. To encourage substantive revisions, students were awarded additional points if they received all points on the grading rubric. Based on the grades of the instructors, 31% of students eventually earned perfect scores in this assignment, compared to 13% in a typical single revision assignment. Permitting students to submit up to four optional drafts resulted in nearly the same amount of grading for the instructor as requiring students to submit two drafts.

Keywords

writing instruction, revision, motivation

A subset of undergraduate psychology courses, often called “writing-intensive” courses, place an additional, secondary emphasis on writing instruction. In such courses, writing assignments often are designed so that students submit a draft of writing to be graded, and they receive editorial feedback. Then students are permitted to revise and resubmit that writing to be graded again. Stellmack, Keenan, Sandidge, Sippl, and Konheim-Kalkstein (2012) examined the effectiveness of this review–revise–resubmit procedure in the context of an assignment in which students wrote an American Psychological Association (APA)-style introduction in an introductory research methods class. They found that when the first and second drafts were graded by blind graders (who had no knowledge of the authors’ identities or whether a draft was a first or second draft), only 57% of the students’ scores increased across drafts. They also found that the average increase in scores across drafts ranged from 4% to 7% across several experiments. Comparable improvement was seen regardless of whether initial feedback was given to the author of the paper by a student peer or the lab instructor, or when the students critiqued their own writing. More recently, Greenberg (2015) analyzed scores obtained by students on first and second drafts of a complete APA-style paper. Each of Greenberg’s students graded a fellow student’s paper using the grading rubric that would be used by the instructor to grade their papers. Students then revised and resubmitted their own papers. Greenberg found an increase in scores across drafts of about 5% but only 60.5% of students’ scores increased across drafts, similar to the results of Stellmack et al. (2012).

Stellmack et al. (2012) went on to compare the scores assigned by blind graders to those assigned by the lab instructors who provided feedback on the first drafts. A much higher percentage (82%) of students’ scores increased across drafts based on the scores of the nonblinded lab instructors. Overall, the results of the research described above suggest that the review–revise–resubmit procedure does not lead to a substantial increase in writing quality or writing ability in an objective sense but rather that the procedure serves to give students experience in satisfying a particular reviewer.

Although the increases in scores reported by Stellmack et al. (2012) and Greenberg (2015) were statistically significant, we felt that they were disappointingly small. Of more interest to us was the fact that when students were required to revise and resubmit their writing, only about 60% of those students showed an increase in score across drafts, whereas 40% showed no change or a decrease in score across drafts when judged by graders other than the instructor. Our goal was to design a writing assignment that increases the percentage of students who show improvement from the first to final draft in the judgment of blinded graders.

One problem in evaluating the effectiveness of a particular writing instruction activity is the inherently subjective nature of evaluating student writing. It is frequently observed that interrater agreement among graders of student writing (measured as the proportion of times that graders agree with one another) generally is not particularly high. For example, Stellmack, Konheim-Kalkstein, Manor, Massey, and Schmitz (2009) found low interrater agreement (agreement between reviewers in .37 of the scores they assigned) for graders who developed and refined a grading rubric over several months. Newell, Dahm, and Newell (2002) also reported comparably low interrater agreement (.47 proportion of agreement measured in the same way as Stellmack, Konheim-Kalkstein, Manor, Massey, & Schmitz, 2009) in the grading of student writing with a rubric.¹ Indeed, subjectivity and low interrater agreement in evaluating scientific writing are implicitly acknowledged in the peer-review process when an editor seeks reviews from multiple reviewers. If interrater agreement was expected to be particularly high, comments from only one reviewer would be necessary. Low interrater agreement in the peer-review process can be seen in the differences between comments provided by different peer reviewers and, in the extreme case, when one reviewer recommends publication and another recommends rejection. Granted, one benefit of having multiple peer reviewers is to tap into different points of view, but by acknowledging that there are likely to be diverse opinions regarding a single sample of writing, one is saying, in effect, that there is likely to be low interrater agreement. In other words, those reviewers on their own would be likely to identify different positive and negative aspects of the manuscript under review.

These difficulties in objectively evaluating writing underscore the need to focus on the “reviewer satisfaction” aspect of the review–revise–resubmit procedure rather than any role the procedure might have in improving student writing per se. This perspective transforms the goal of writing instruction from improving writing in any objective sense to that of adapting one’s writing in order to reach consensus among one’s peers and to make the content of one’s work available to the public, which are the overarching goals that scientific writing subserves.

With this perspective in mind, we made some modifications to a writing assignment that incorporated the review–revise–resubmit procedure to more closely simulate the real-world situation of submitting a manuscript for publication. A manuscript submitted to a journal does not receive a grade but rather it is judged to be acceptable or unacceptable. When a manuscript is judged to be unacceptable (and assuming that it is not rejected), the manuscript must be revised until the reviewers and editor are satisfied and the paper becomes acceptable. In our assignment, each draft that students submitted was assigned a score on a continuous scale. Students received feedback on their writing in the form of the score and as editorial comments made by the grader. Students could submit up to four drafts of the paper, responding to feedback from previous drafts in each subsequent submission, and each draft was graded by the same grader. If students received a perfect score (as defined below) on any draft, they received additional points for the assignment. Although all drafts were assigned a score, each student received only one grade for the assignment, which was the highest score obtained on any draft. In other words, only one draft score counted toward each student’s final grade in the course. The intention was to motivate the students to attempt to perfect their writing in response to their reviewer’s (grader’s) comments. We believe that this assignment more closely approaches the real-world peer-review situation in that authors are permitted to submit multiple revisions and there are additional rewards associated with completely satisfying the reviewers. Admittedly, the assignment is not identical to the real-world peer-review process in all respects; for example, manuscripts submitted for publication typically are not scored with a rubric. It was necessary to grade papers with a rubric in the context of the class in which the assignment was given so that students had an explicit statement of the criteria that would be used to grade their papers and so that a numerical score could be assigned to students who did not perfect their papers.

In this study, a “perfect” paper is one that was assigned all of the points available on the grading rubric by the instructor. Note that this is not the same as saying that the student met all of the criteria of the rubric but rather that the student met those criteria in the opinion of the grader. In this way, and in acknowledgment of the subjective nature of grading writing, the grader is as much a part of the measurement instrument as the printed rubric. Strictly speaking, the students’ goal is not simply to obtain all the points on the rubric but, more precisely, it is to produce a piece of writing that will convince the particular grader to assign all the points on the rubric (which is true of any assignment that involves subjectively evaluated criteria).

To summarize the logic of our approach to the writing assignment we are describing here: When students revise and resubmit their writing, evaluation of that writing generally shows rather low interrater agreement. As a result, it is difficult to provide evidence that permitting students to revise and resubmit their writing leads to improvement in their writing in any objective sense. Previous research (e.g., Stellmack, Keenan, Sandidge, Sippl, & Konheim-Kalkstein, 2012) has shown that writing seems to improve most in a revision when it is judged by the person who gave feedback on the first draft. Therefore, we can reasonably conclude that the true effect of the procedure is to train students to satisfy a particular grader/reviewer, which is also a necessary skill in submitting and revising a manuscript for publication. With these ideas in mind, we attempted to structure a writing assignment in a way that would more closely approximate the peer-review and publication process than submitting a paper twice and having it graded twice. We provided opportunities for students to submit up to four drafts of their writing, and we provided additional rewards to students who perfected their papers in the opinion of the grader.

Our goal in this paper is to evaluate the outcomes and effectiveness of the assignment. To do so, we chose to analyze the writing of students whose papers were eventually judged to be perfect by the lab instructor who graded the papers in the context of the course. Blind to the lab instructor’s marks, we graded first and final drafts of those students. We designed the study in this way because we wanted to know how much improvement in writing would be shown by those students who chose to persist in revising and resubmitting their papers to the point that they were eventually judged to be perfect by their reviewer/grader. We compared the amount of improvement to that of a random sample of students from a previous semester who were required to turn in first and second drafts for grades. (Although we focused on the students who eventually perfected their papers, we also report summary data from all students.) As additional measures of the effectiveness of the writing assignment, we also analyzed how many students took advantage of the opportunity to submit multiple drafts and how successful they were in perfecting their papers.

Method

Participants

We analyzed the papers of students who were enrolled in an introductory research methods course at the University of Minnesota in the spring of 2012. The course had a total of 285 students, of whom 68% were female; 92% were of junior or senior standing and 8% were sophomores; 83% were declared psychology majors and the remainder were nonpsychology majors or had not declared a major. Of the 48 students who received perfect scores on their final drafts of the writing assignment, we randomly chose 34 students and blindly graded the first and final drafts submitted by each.

Because the assignment permitted students to submit up to four drafts, we wanted to assess the burden of grading on the instructors as well as the relationship between number of drafts and proportion of students who perfected their papers. However, information regarding the number of drafts submitted by each student was not available for the spring semester of 2012. Therefore, we examined the number of drafts submitted by students who were enrolled in the research methods course in the spring semester of 2013 and who were given the same writing assignment structured in the same way. In that semester, there were 194 students enrolled in the course, 69% of whom were female; 85% were juniors or seniors and 13% were sophomores; 78% were declared psychology majors and the remainder were nonpsychology majors or had not declared a major. Thus, the profiles of the students in terms of gender and year of study were similar in the two semesters that were analyzed.

Design

In the research methods course, students formed groups of approximately three to five students. Each group designed and performed an experiment on a topic of their choice. In one writing assignment, students were instructed to independently write the introduction section of a manuscript that described their experiment in APA format (American Psychological Association, 2010). Students were allowed to submit one draft each week for four weeks, beginning the week after the assignment was given. Students were required to submit only one draft by the final turn-in date (four weeks after the assignment was given). In this way, students were able to choose whether to submit any additional drafts during the first three weeks of the assignment. In contrast, in the writing assignment analyzed by Stellmack et al. (2012), students were required to submit both a first draft and a revision of their papers, both of which were graded for course credit. Students received instruction in APA-style writing through a series of lectures and in-class exercises.

The assignment was graded out of 24 points using the most recent version of the rubric developed and described by Stellmack et al. (2009; the rubric is available online at http://www.psych.umn.edu/psylabs/acoustic/rubrics.htm). Although up to four drafts were scored for each student, each student’s final grade for the assignment was the highest score obtained on any draft that was submitted during the 4-week assignment period. Furthermore, if students received all 24 rubric points on any draft of their papers, they received four additional points, for a total of 28 points. Thus, possible scores on the assignment were 28, 23, 22, 21, and so on. The additional points were intended to encourage the students to persist in attempting to perfect their papers in response to feedback from the grader.
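The scoring rule described above can be written out as a short sketch. The function name and the list-of-scores input are hypothetical conveniences for illustration; only the 24-point rubric, the four-point bonus, the four-draft limit, and the highest-score rule come from the assignment description.

```python
RUBRIC_MAX = 24     # points available on the rubric (from the text)
PERFECT_BONUS = 4   # additional points for earning all rubric points
MAX_DRAFTS = 4      # one required draft plus up to three optional drafts


def assignment_grade(draft_scores):
    """Return the recorded grade for one student's rubric scores.

    `draft_scores` is a hypothetical list holding one rubric score per
    submitted draft, in submission order.
    """
    if not 1 <= len(draft_scores) <= MAX_DRAFTS:
        raise ValueError("expected between one and four draft scores")
    if any(not 0 <= score <= RUBRIC_MAX for score in draft_scores):
        raise ValueError("rubric scores must lie between 0 and 24")
    best = max(draft_scores)  # only the highest-scoring draft counts
    # A perfect rubric score always earns the bonus, so 24 never appears
    # as a recorded grade: possible grades are 28, 23, 22, 21, and so on.
    return best + PERFECT_BONUS if best == RUBRIC_MAX else best


print(assignment_grade([17, 21, 24]))  # student who perfected the third draft
print(assignment_grade([20, 23]))      # best draft fell one point short
```

Because the grade is the maximum over drafts, a weaker later draft can never lower a student's grade, which is why resubmitting carries no risk under this scheme.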

Procedure

The research methods class was structured such that there was a single large lecture section and a number of smaller lab sections (14 in spring 2012 and 12 in spring 2013). Each student registered for the lecture section and one of the lab sections. Writing assignments were given in the lab sections. Each lab section was conducted by a different graduate teaching assistant (the lab instructor) who graded and provided feedback on the student writing. The teaching assistants were instructed in the use of the grading rubric and graded a sample introduction with the rubric prior to grading student papers. In training the teaching assistants, the instructor of the course (the first author of this paper) discussed the grading of the sample introduction to encourage consistent use of the rubric across lab sections.

After the spring semester of 2012, we randomly chose 34 students from the 48 who eventually received perfect scores on the writing assignment, with the restriction that the chosen students had to have submitted at least two drafts. We removed all identifying information from the first and final drafts submitted by each of those students. There was no explicit indication on any paper as to whether it was a first or final draft.

The students’ papers were distributed among four graders (the four authors of this paper, who will be identified as Graders 1, 2, 3, and 4) such that each grader received both the first and final drafts of 17 students (a total of 34 papers) in a random order. Graders 1 and 4 graded the same 34 papers, and Graders 2 and 3 graded the remaining papers. Grader 4 was a former student in the undergraduate research methods class and a former graduate lab instructor for the course. Therefore, she was familiar with the use of the rubric prior to the study, but she graded several sample introductions and discussed the application of the rubric with the first author prior to grading papers for this study to enhance interrater agreement. The remaining authors had extensive experience in grading student writing in previous writing research.

Results

As measures of interrater agreement among the blind graders, we computed the correlation between the total scores (out of 24 points) of all papers graded by each pair of graders. For Graders 1 and 4: Pearson r(32) = .71, p < .001; Spearman rank order r_S(34) = .73, p < .001. For Graders 2 and 3: Pearson r(32) = .65, p < .001; Spearman rank order r_S(34) = .57, p < .001. These correlations measuring interrater agreement are slightly higher than in Stellmack et al. (2009) and Stellmack et al. (2012) and are comparable to those reported by Greenberg (2012). Because there may be a correlation between scores of different graders even when the graders do not assign the same absolute scores, we also computed the proportion of times that graders assigned the same scores as each other, as did Stellmack et al. (2009). Each blind grader graded 34 papers, and for each paper, a score was assigned in each of eight categories, so each grader assigned a total of 272 scores. Graders 1 and 4 assigned the same score in 136/272 (.50) cases, and Graders 2 and 3 assigned the same score in 137/272 (.50) cases. These measures of agreement were also slightly higher than in Stellmack et al. (2009).
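To illustrate why both measures are reported, the following sketch computes the proportion of exact agreement over category scores and the Pearson correlation over paper totals. The scores are invented for illustration and are not data from this study; the 0–3 range per category is an assumption inferred from eight categories summing to 24 points.

```python
from math import sqrt


def exact_agreement(papers_a, papers_b):
    """Proportion of category scores on which two graders agree exactly."""
    pairs = [
        (a, b)
        for paper_a, paper_b in zip(papers_a, papers_b)
        for a, b in zip(paper_a, paper_b)
    ]
    return sum(a == b for a, b in pairs) / len(pairs)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: three papers, eight rubric categories each.
grader_a = [[3, 2, 2, 1, 3, 2, 1, 2], [1, 1, 2, 0, 2, 1, 1, 1], [3, 3, 2, 2, 3, 3, 2, 3]]
grader_b = [[3, 1, 2, 2, 3, 2, 2, 2], [1, 2, 2, 1, 1, 1, 1, 0], [2, 3, 3, 2, 3, 3, 2, 2]]

agreement = exact_agreement(grader_a, grader_b)
totals_a = [sum(paper) for paper in grader_a]
totals_b = [sum(paper) for paper in grader_b]
print(f"exact agreement: {agreement:.2f}")
print(f"Pearson r on totals: {pearson_r(totals_a, totals_b):.2f}")
```

In this invented example the graders rank the papers almost identically, so the correlation of totals is high, yet they assign the same category score only a little more than half the time. That is the distinction the two measures are meant to capture.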

Correlations between the scores of each blind grader and the corresponding scores of the lab instructors were much lower. The Pearson r values ranged from −.05 to .27 and the Spearman rank order r_S values ranged from −.12 to .33. None of these correlations was significantly different from 0, with p values ranging from .29 to .86. These correlations were computed using only scores on first drafts because lab instructors assigned perfect scores to all of the second drafts in our sample.

Because we chose papers of students who eventually received perfect scores from their lab instructors, all 34 students showed an increase in scores across drafts in terms of the scores assigned by the lab instructors (i.e., the score on every final draft was a perfect 24). For the grades assigned by the blind graders, we evaluated the change in score across drafts for each pair of drafts graded by each blind grader. In other words, 68 changes in score were considered because each student’s papers were graded by two different graders. There was a significant increase from the mean first draft score (M = 14.63, SD = 4.14) to the mean final draft score (M = 17.29, SD = 3.51), t(67) = 6.29, p < .001, d = 0.76. The mean increase between first and final draft scores of 2.66 points out of 24 represents a change of about 11%. For the 68 pairs of scores, 49 (72%) showed an increase in score across drafts, 16 (24%) decreased across drafts, and 3 (4%) were unchanged. These proportions were compared with the data of Stellmack et al. (2012), who graded first and second (final) drafts of a random sample of all students (not only students who received perfect scores on their second drafts). For 96 pairs of scores, Stellmack et al. (2012) found that 55 (57%) showed an increase across drafts, 30 (31%) showed a decrease, and 11 (12%) were unchanged. Despite the large differences between the proportions for the present data and for those of Stellmack et al. (2012), the differences were not significant, χ²(2) = 4.53, p = .104.
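The comparison of the two sets of proportions can be checked with a Pearson chi-square test of independence on the counts given above. The sketch below is one way to do that in plain Python and is not necessarily the software the analysis was run in; the function name is hypothetical, and the closed-form p value used here holds only for 2 degrees of freedom.

```python
from math import exp


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Columns: increased, decreased, unchanged (counts from the text).
counts = [
    [49, 16, 3],   # present study, 68 pairs of scores
    [55, 30, 11],  # Stellmack et al. (2012), 96 pairs of scores
]
statistic, df = chi_square_independence(counts)
# For df = 2 only, the chi-square upper-tail probability is exp(-x / 2).
p_value = exp(-statistic / 2)
print(f"chi-square({df}) = {statistic:.2f}, p = {p_value:.3f}")
```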

As students were required to submit only one draft and additional drafts were optional, we wanted to evaluate how many students submitted additional drafts and how likely those students were to perfect their papers (where “perfection” is defined in terms of the grades assigned by the lab instructors). We did not have these data from the spring semester of 2012, so we analyzed the data for the 194 students who were registered for the course in the spring semester of 2013. Table 1 shows the number of students who submitted one, two, three, or four drafts and who ultimately did or did not receive perfect scores. As shown in the table, 60 (31%) of 194 students eventually received a perfect score for the assignment. No students who submitted only one draft received a perfect score. Students had a greater than 50% chance of obtaining a perfect score only if they submitted three or four drafts. In comparison, in Experiment 2 of Stellmack et al. (2012), in which students were required to submit two drafts of an APA-style introduction, only 20 (13%) of the 157 students in the class ultimately obtained a perfect score on the final draft. The difference between our sample and that of Stellmack et al. (2012) in terms of the proportion of students who eventually received a perfect score is significant, χ²(1) = 16.31, p < .001.
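For a 2 × 2 table such as this one, the chi-square statistic has a compact shortcut form. The sketch below applies it to the counts in the text; it assumes no continuity correction, which the text does not specify but which reproduces the reported value. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Rows: present assignment (194 students), Stellmack et al. (2012) class (157).
# Columns: perfect score, no perfect score.
statistic = chi_square_2x2(60, 194 - 60, 20, 157 - 20)
# For df = 1, the upper-tail probability equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(statistic / 2))
print(f"chi-square(1) = {statistic:.2f}, p = {p_value:.5f}")
```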

Table 1.

Number of Students Who Perfected or Did Not Perfect Their Papers and the Number of Drafts Submitted by Each.

	Number of Drafts Submitted				Total
	1	2	3	4	Total
Eventually perfected paper	0	21	33	6	60
Did not perfect paper	64	43	25	2	134
Total	64	64	58	8	194
Proportion who perfected paper	0.00	0.33	0.57	0.75	0.31
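The bottom row of Table 1 follows directly from the counts in the rows above it. A minimal sketch of that arithmetic, using hypothetical variable names:

```python
# Counts from Table 1, keyed by number of drafts submitted.
perfected = {1: 0, 2: 21, 3: 33, 4: 6}
not_perfected = {1: 64, 2: 43, 3: 25, 4: 2}

proportions = {}
for drafts in sorted(perfected):
    submitted = perfected[drafts] + not_perfected[drafts]
    proportions[drafts] = perfected[drafts] / submitted
    print(f"{drafts} draft(s): {perfected[drafts]}/{submitted} = {proportions[drafts]:.2f}")

total_students = sum(perfected.values()) + sum(not_perfected.values())
overall = sum(perfected.values()) / total_students
print(f"overall: {sum(perfected.values())}/{total_students} = {overall:.2f}")
```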

As an additional measure of the success of the activity, of the 130 students who submitted more than one draft in spring of 2013 (in which up to three additional drafts were optional), 129 of the students’ scores increased from the first to the final draft (based on the lab instructors’ grades). For the 130 students who submitted more than one draft, there was an increase in mean score from first to final draft: the mean score on the first draft was 16.60 (SD = 3.90) and the mean score on the final draft was 22.37 (SD = 2.28). The mean increase between first and final drafts, 5.77 (24%) out of 24, was statistically significant, t(129) = 19.39, p < .001, d = 1.70. A large effect size was computed in spite of the ceiling effect exhibited in the final draft scores. In comparison, for the data of Stellmack et al. (2012), the mean change in score from first to second (final) draft was only about 8% based on the instructors’ grades, and Greenberg (2015) reported a mean change in score of about 5%. In both of the latter cases, students were required to submit two drafts.
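The text does not state how the effect sizes were computed. One common convention for paired designs, d = t / √n, reproduces both reported values from the reported t statistics, so the sketch below uses it as an assumption. It also converts the mean gains to percentages of the 24 rubric points. The function names are hypothetical.

```python
from math import sqrt


def paired_d_from_t(t_value, n_pairs):
    """Effect size for a paired t test, assuming d = t / sqrt(n)."""
    return t_value / sqrt(n_pairs)


def percent_of_rubric(points, rubric_max=24):
    """Express a point gain as a percentage of the rubric total."""
    return 100 * points / rubric_max


# Instructor-graded drafts, spring 2013: t(129) = 19.39, so n = 130 students.
d_instructor = paired_d_from_t(19.39, 130)
gain_instructor = 22.37 - 16.60  # mean final draft minus mean first draft

# Blind-graded drafts, spring 2012: t(67) = 6.29, so n = 68 pairs of scores.
d_blind = paired_d_from_t(6.29, 68)

print(f"instructor grades: d = {d_instructor:.2f}, "
      f"gain = {gain_instructor:.2f} points ({percent_of_rubric(gain_instructor):.0f}%)")
print(f"blind grades: d = {d_blind:.2f}, "
      f"gain = 2.66 points ({percent_of_rubric(2.66):.0f}%)")
```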

The assignment did not place a great burden on the lab instructors who graded the papers because in any given week, they did not receive a draft from every student in their lab. In particular, in the first 2 weeks of the 4-week assignment period, there were very few submissions. Of the 194 students in the spring semester of 2013, 18 students (9%) submitted drafts in Week 1, 72 students (37%) submitted drafts in Week 2, 125 students (64%) submitted drafts in Week 3, and 179 students (92%) submitted drafts in Week 4. Some students had already perfected their papers in earlier weeks or chose not to submit additional drafts, so they did not submit papers in the final week. Thirty-three percent of all students submitted only one draft. Had each of the 194 students been required to submit two drafts, the lab instructors would have had to grade 388 papers. In our assignment, 394 papers were graded overall. Thus, the total amount of grading was nearly the same as it would have been if students were required to submit two drafts.
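The workload comparison is simple arithmetic on the weekly submission counts. A short sketch using the figures from the paragraph above, with hypothetical variable names:

```python
ENROLLED = 194  # students in spring 2013

# Drafts received in each week of the 4-week assignment period.
weekly_submissions = {1: 18, 2: 72, 3: 125, 4: 179}

for week, count in weekly_submissions.items():
    print(f"Week {week}: {count} drafts ({100 * count / ENROLLED:.0f}% of students)")

graded_optional = sum(weekly_submissions.values())  # drafts actually graded
graded_required = 2 * ENROLLED                      # two required drafts each
print(f"optional-draft design: {graded_optional} papers")
print(f"two-required-drafts design: {graded_required} papers")
print(f"difference: {graded_optional - graded_required} papers "
      f"({100 * (graded_optional / graded_required - 1):.1f}% more)")
```

The point of the comparison is that the grading is spread unevenly across weeks and concentrated late, rather than that the total is smaller.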

Discussion

As in some previous research in which blind graders graded student writing (e.g., Stellmack et al., 2012), we found rather low interrater agreement between the blind graders and the lab instructors. For all of the students whose papers we examined, the lab instructors, who knew the authors’ identities and the draft numbers, judged that the writing improved between the first and final drafts and that all of the final drafts were “perfect.” In contrast, the blind graders judged that only 72% of the students improved their papers across drafts and no final drafts were judged to be perfect. The students whose papers we examined were successful in addressing the weaknesses identified by their instructors, but independent graders did not always see the same weaknesses and thus did not see the same improvement across drafts. This result replicates the findings of Stellmack et al. (2012) and underscores the subjective nature of evaluating writing and its consequences. When students revise and resubmit their writing, the writing does not improve in an objective sense because there is a lack of interrater agreement in evaluating writing. Rather, students’ revisions are directed toward satisfying a particular grader and the review–revise–resubmit procedure rewards their ability to do so. We attempted to design a writing assignment that would emphasize the “satisfying the reviewer” aspect of the procedure. The results showed that a substantial proportion of students who chose to submit three or four drafts eventually perfected their papers in the judgment of the grader and, on average, students who chose to revise their papers showed a much larger increase in scores than students who were required to submit two drafts in a different semester.

We also attempted to structure the writing assignment in a way such that students would receive disproportionately large benefits if they substantially revised their writing across drafts so that students would be motivated to attempt to do so. When students are required to submit two drafts of their writing (unlike in the present assignment) and both drafts are graded for course credit, there may be little motivation for the students to put much effort into the revision. For example, suppose that the student is satisfied with the grade earned on the first draft and that the amount of effort required to improve the paper would likely result in little additional improvement in grade. The student might choose to simply resubmit the paper with few modifications and accept a grade for the revision that is similar to the grade on the original submission. This strategy wastes the instructor’s time in terms of grading the revisions. In contrast, when students benefit only by improving their writing across drafts, as in the present assignment, then presumably a student is more likely to submit a revision only if he or she believes it is likely to lead to an improved grade. Consistent with these ideas, Covic and Jones (2008) reported that 17% (9/54) of their students were satisfied with their grades on the first draft of a writing assignment and they chose not to submit an optional second draft, which might have improved their grades. Likewise, our writing assignment allowed students to choose whether or not to submit additional drafts, thereby relieving the grader of the burden of grading drafts that did not incorporate any substantial improvements.

At the same time, allowing students to submit up to four drafts rather than only two gave a much larger proportion of students the chance to successfully complete our simulated review process. Of the 60 students who eventually perfected their papers in the spring semester of 2013, only 21 (35%) did so in two drafts. The remaining 39 students (65%) who perfected their papers required three or four drafts to do so. Additional motivation for students to revise their papers came from the fact that a “perfect” paper earned 28 points while the next possible point total was only 23, which meant that students could get an A on the paper only by completely satisfying the reviewer. Failing to do so could result only in a maximum grade of B. This further emphasized to students the importance of learning to respond successfully to a reviewer. As an additional measure of the effectiveness of this assignment, as noted in the results, a higher proportion of students (31%) eventually produced “perfect” papers in the present assignment than in a situation in which students were required to submit two drafts (13%). Another indicator of the effectiveness of the present assignment is the fact that the mean change in score was 24% of the possible points compared to mean changes of only 5–8% for classes that were required to submit two drafts.

The review–revise–resubmit procedure seems to contribute to the development of a student’s ability to respond to and satisfy a reviewer rather than improving the student’s writing in an objective sense (Stellmack et al., 2012). The implementation of the review–revise–resubmit procedure as described here requires students to submit only one draft of the writing but allows students to submit up to three additional optional drafts. In addition, the grading scheme is set up so that students benefit only by improving their writing in the eyes of their grader across drafts, presumably by responding to the comments made by the grader on earlier drafts. However, students can benefit substantially by eventually completely satisfying their grader. The data presented here suggest that when implemented in this way, a large proportion of students have an opportunity to demonstrate their ability to edit their writing to satisfy a reviewer, with little change in the grading burden of the instructor.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Note

References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Covic, T., & Jones, M. K. (2008). Is the essay resubmission option a formative or a summative assessment and does it matter as long as the grades improve? Assessment & Evaluation in Higher Education, 33, 75–85. doi:10.1080/02602930601122928

Greenberg, K. P. (2012). A reliable and valid weighted scoring instrument for use in grading APA-style empirical research reports. Teaching of Psychology, 39, 17–23. doi:10.1177/0098628311430643

Greenberg, K. P. (2015). Rubric use in formative assessment: A detailed behavioral rubric helps students improve their scientific writing skills. Teaching of Psychology, 42, 211–217. doi:10.1177/0098628315587618

Newell, J. A., Dahm, K. D., & Newell, H. L. (2002). Rubric development and interrater reliability issues in assessing learning outcomes. Chemical Engineering Education, 36, 212–215.

Stellmack, M. A., Keenan, N. K., Sandidge, R. R., Sippl, A. L., & Konheim-Kalkstein, Y. L. (2012). Review, revise, and resubmit: The effects of self-critique, peer review, and instructor feedback on student writing. Teaching of Psychology, 39, 235–244. doi:10.1177/0098628312456589

Stellmack, M. A., Konheim-Kalkstein, Y. L., Manor, J. E., Massey, A. R., & Schmitz, J. A. P. (2009). An assessment of reliability and validity of a rubric for APA-style introductions. Teaching of Psychology, 36, 102–107. doi:10.1080/00986280902739776

Thaler, N., Kazemi, E., & Huscher, C. (2009). Developing a rubric to assess student learning outcomes using a class assignment. Teaching of Psychology, 36, 113–116. doi:10.1080/00986280902739305