Abstract
This meta-analysis examined whether students writing about content material in science, social studies, and mathematics facilitated learning (k = 56 experiments). Studies in this review were true or quasi-experiments (with pretests), written in English, and conducted with students in Grades 1 to 12 in which the writing-to-learn activity was part of instruction. Studies were not included if the control condition used writing to support learning (except when treatment students spent more time engaging in writing-to-learn activities), study attrition exceeded 20%, instructional time and content coverage differed between treatment and control conditions, pretest scores approached ceiling levels, letter grades were the learning outcome, or students attended a special school for students with disabilities. As predicted, writing about content reliably enhanced learning (effect size = 0.30). It was equally effective at improving learning in science, social studies, and mathematics as well as the learning of elementary, middle, and high school students. Writing-to-learn effects were not moderated by the features of writing activities, instruction, or assessment. Furthermore, variability in obtained effects was not related to features of study quality. Directions for future research and implications for practice are provided.
Writing is critical to success in almost all aspects of society today, including school, work, and home (Freedman et al., 2016; National Commission on Writing, 2003, 2004). We use writing to communicate, record information, persuade, create imaginary worlds, express feelings, entertain, heal psychological wounds, chronicle experiences, explore the meaning of events and situations, and accomplish various tasks at work (Graham, 2018). At school, many teachers use writing to support students’ learning of content material (Gillespie et al., 2014; Ray et al., 2016). Teachers often have students do a variety of writing activities, including writing reports, crafting written narratives to illustrate ideas, defending viewpoints and predictions via written arguments, and writing to show how to use information or solve a problem. Other activities teachers use to engage students in learning include writing in journals or learning logs to explore ideas about what is learned and to self-reflect on their understandings, providing answers to document-based questions, developing written explanations, taking notes when reading or listening to lectures, composing summaries and descriptions, creating outlines, and completing graphic organizers.
Starting in the 1960s and continuing today, prominent writing educators claimed that writing promotes learning (e.g., Britton et al., 1975; Galbraith & Baaijen, 2018). The early sources for this claim involved testimonials from professional writers (Paris Reviews; Plimpton, 1989), cross-cultural comparisons (e.g., Goody & Watt, 1963), theory (e.g., Britton, 1982), and anecdotal evidence (Klein & Boscolo, 2016). Despite the speculative nature of the effects of writing on learning, educators in English and the humanities in the 1960s and 1970s began to encourage students to use writing as a tool for learning in all content areas (e.g., Britton, 1982). Initially, this took the form of writing across the curriculum where students did specific types of writing (e.g., expressive, free writing, journaling, and argumentative writing) to promote learning in multiple disciplines (e.g., Langer & Applebee, 1987). This view of using writing-to-learn shifted, as evidence began to accumulate that different disciplines relied on using writing to support learning, and that specific types of writing (e.g., argumentation) can take different disciplinary forms (Bazerman, 2009; Hyland, 2004; Smagorinsky & Mayer, 2014).
The current meta-analysis examined if school-aged students’ writing about content material improved their learning. Consistent with a writing within discipline approach (Nelson, 2001), we examined the impact of writing activities on students’ learning in science, social studies, and mathematics. While writing was traditionally the domain of the language arts (Klein & Boscolo, 2016), there has been a concerted effort in recent years to promote using writing as a learning activity in each of these disciplines (e.g., Hacker et al., 2019; Kiuhara et al., 2019; MacArthur, 2014; Wallace et al., 2004). Additionally, Miller et al. (2018) reported that 90% of all writing-to-learn studies, at least at the secondary level, involved science, social studies, and mathematics. Applebee and Langer (2011) further reported that students wrote more in these three disciplines collectively than they did in their English language arts classes. Given the emphasis placed on writing-to-learn in compulsory education, it is important to determine if such instruction facilitates students’ learning in these disciplines.
As Klein and Boscolo (2016) noted, writing-to-learn is not easy to define. Writing and learning have been conceptualized in many different ways, both are used in academic and nonacademic contexts, and learning is viewed as central to writing, but writing is not a necessary condition for learning. In the context of this meta-analysis, we defined writing-to-learn as the use of writing as a vehicle for strengthening, extending, and deepening students’ knowledge (cf. Graham & Perin, 2007). Given the purpose of this meta-analysis, we selectively narrowed our focus to the use of writing to promote academic learning. To count as a writing-to-learn activity, writing had to be purposefully assigned to promote academic learning. A variety of different types of writing met this standard, including using writing to summarize information, compare and contrast ideas, connect new and old information, describe one or more processes, explain how something works, create a story or poem to illustrate or extend ideas, construct analogies, and build an argument. It also included taking notes about content material being learned or using writing to complete graphic organizers/mind maps to represent the conceptual or structural relationships of content information. All of these types of writing activities require that students think and make decisions about content material.
To be considered a writing-to-learn activity, students had to produce written words by hand or digital means. Writing numbers to complete a mathematics problem did not count as writing-to-learn, but writing an explanation for how to solve the problem did. Drawing a map, diagram, or picture was not counted as writing-to-learn unless students created text with the created visual. Furthermore, writing-to-learn activities could be completed by an individual or done collectively, as long as all students were involved in producing text. Finally, the audience for a writing activity did not have to extend beyond the students but could involve others (e.g., peers).
Theoretical Support for Writing-to-Learn
There are multiple theoretical positions for how writing about content supports learning. These include cognitive and social-cultural perspectives (Klein & Boscolo, 2016).
Cognitive Theories
One of the earliest theories championing the view that writing inherently elicits learning was offered by Britton et al. (1975), who placed special emphasis on the role of language in learning, arguing that language facilitates learning and that writing was a form of language (a derivative of expressive speech). Britton (1982) claimed that writers do not know exactly what they will say when they begin to convert an idea into written text, and that the semantics and syntax of language shape this process, resulting in new learning about an idea at the “point of utterance.” According to Britton, writing about content material most likely influences learning when it involves an authentic and personal act of meaning-making. He placed special emphasis on expressive writing. This form of writing was assumed to be closest to the spoken word, making it a useful tool for learning for students at all developmental levels.
A more recent theoretical model of how writing inherently elicits learning was offered by Galbraith and Baaijen (2018), who described a knowledge-constituting process which evokes cognitive operations and structures that operate below the level of conscious thought. This model draws on connectionist principles of information processing. A writer’s semantic knowledge is stored in long-term memory “as a fixed set of connection strengths between units within a distributed architecture . . .” (pp. 241–242). It is assumed that the writer’s understanding of a concept is implicit and not directly retrievable. It is only through the act of composing text that the writer discovers or has access to this knowledge. More specifically, the writer must synthesize these implicit connections in memory until an initial fusion of understanding is obtained. This initial understanding of knowledge may undergo additional modification and synthesis as the writer evaluates or considers other ideas or additional text is created. The extent to which writing leads to new learning depends on the extent to which the content produced by the writer in this way differs from content already held in the writer’s episodic memory. When this differs, the writer presumably experiences a new development in understanding.
Galbraith and Baaijen (2018) proposed a second and effortful process by which writing facilitates learning. This knowledge-transforming process occurs when writing ideas and content retrieved through episodic memory, external sources, or produced through the knowledge-constituting processes (described above) are subjected to evaluation and manipulation in working memory in order to satisfy the learner’s goals. Galbraith and Baaijen claimed that this process operates best when the ideas under consideration are represented in a fixed form (e.g., notes, a sentence in text) so that working memory is not overtaxed and the writer can concentrate on evaluating how well the target information addresses their learning goals and contributes to how the overall text is structured. This process transforms the writer’s understanding of information by creating a more organized representation in memory.
Somewhat similarly, Silva and Limongi (2019) proposed that writing about content material facilitates learning by consolidating information in long-term memory. This consolidation is achieved through rehearsal of information, elaboration of it, or both. To illustrate, the evaluation of whether information from long-term memory or external sources satisfies a learner’s goals requires active maintenance of these goals and information in working memory. This can be achieved only through rehearsal. Likewise, the evaluation and manipulation of this information in working memory are likely to involve elaboration. Even if the ideas held in working memory are not transformed through elaboration, rehearsal can lead to learning as when the writer acquires new information from external sources or becomes more familiar and facile with ideas drawn from long-term memory.
Other cognitive theories of how writing affects learning emphasize the agency of the learner, indicating that the goals and strategies students apply when writing about information lead to learning (Klein & Boscolo, 2016). When writing about content material, students must make decisions about how to handle new information and execute strategies to satisfy their goals (Graham & Hebert, 2011). Such strategies include but are not limited to the following: identifying which information is most important and needs to be presented in text (fostering explicitness), organizing ideas into a coherent written whole (establishing specific relationships among ideas), deciding how best to present content material in writing (fostering a personal involvement with the information), transforming ideas into written sentences (thinking about what these ideas mean to create text that satisfies the writer’s intentions), and reexamining, evaluating, or revising the text generated (leading to additional reflection and retention of information). The deliberate use of any of these strategies, whether they involve the pursuit of rhetorical goals (i.e., backward search strategies) or the revising of text iteratively to resolve contradictions (i.e., forward search strategies), presumably results in specific learning effects (Klein, 1999).
Specific effects of writing on learning were also championed by Bangert-Drowns et al. (2004). They argued that particular writing tasks promote learners’ use of cognitive learning strategies. For instance, writing can serve as a rehearsal strategy, such as when learners record in a journal the most important things learned, increasing both content exposure and time on task. Similarly, writing a report about a particular subject can evoke the use of organization and elaboration learning strategies, as learners build a structure for the text, link new learning to current understandings, and synthesize and elaborate on their ideas. Specific writing activities may further encourage comprehension monitoring and general cognitive awareness. For example, this can happen when the writing activity prompts students to reconsider old ideas in light of new information. As noted by Langer and Applebee (1987), however, different kinds of writing-to-learn activities promote different kinds of thinking and lead to differences in what is learned.
Klein (1999) identified an additional way in which genre can influence learning, indicating learning can occur when writers “use genre structures to organize relationships among elements of text, and thereby among elements of knowledge” (p. 203). Genres are particular kinds of text, which are commonly distinguished by purpose as well as text structure, and they are embodied through genre elements that form specific relationships with each other. Examples of genres used to promote learning are argumentation, informational writing, and narrative. Writing scholars (Langer & Applebee, 1987) contend that as writers generate and specify relationships among content appropriate to each genre element (e.g., evidence, claims, and warrants in argumentation), they acquire new learning as they construct corresponding relations among their own knowledge.
Social-Cultural Theories
The explanations presented so far on how writing supports learning focused on cognitive mechanisms. Learning effects from writing can also be approached from a sociocultural view. Accordingly, writing is a tool that learners use in a goal-directed fashion to mediate meaning construction, thought, and activity (Smagorinsky, 1995). Both learning and writing are considered a product of a system that includes people, community goals, social practices, typified actions, tools, and a collective history (Graham, 2018). Members of a specific system use writing to accomplish community goals (e.g., writing-to-learn), applying sanctioned social practices and actions to do so. Theoretically, the effects of writing on learning vary depending on the purposes of the writer’s community. Writing does not automatically enable a learner to construct new meaning. Instead, it represents the potential to do so (Wertsch, 1991).
Accordingly, social context is pivotal in writing-to-learn (Langer & Applebee, 1987). In social-cultural theories, the effects of writing on learning are attributed to social, institutional, cultural, historical, and political factors (Graham, 2018). For example, social context can exert its effects by supporting or not supporting the use of specific instructional procedures, like writing-to-learn (Klein, 1999). Writing may be a useful tool for promoting learning in a class where writing and writing for learning are sanctioned and valued, but it is not effective in a class where this is not the case. As a result, the effects of writing on learning vary according to the purposes, values, and history of the institutions in which writing is applied as a tool for learning. Its effects are also expected to vary depending on learners’ goals, values, identity, expectations, and beliefs.
Social-cultural theories further emphasize that writing becomes embedded and particular to the context in which it is learned (Smagorinsky, 1995). Consequently, a writing-to-learn activity that becomes a typical and routinized tool for accomplishing goal-directed learning in one classroom may not transfer readily to another classroom. The success of a writing-to-learn activity then depends on the instructional environment in which it is implemented, as a complex set of values, social interactions, and typified instructional practices moderate its impact.
Theoretical Implications
Both cognitive and social-cultural theories provide multiple rationales for how writing can facilitate learning. Cognitively, writing elicits implicit and explicit use of cognitive operations and structures that can facilitate learning. Socially, writing provides members of a community with a tool for learning that can be used in a goal-directed fashion to foster thinking, meaning-making, and action. These theories provide support for the proposition, tested in this meta-analysis, that writing about content facilitates learning for school-aged students.
These theories also raise questions about the conditions under which writing-to-learn is effective. This led to the second purpose of this meta-analysis: the identification of moderator variables that are reliably related to the magnitude of effects obtained with writing-to-learn activities. Such an analysis provides a more nuanced view of the effects of writing on learning. Theoretically derived moderators in this review included content area, grade level, features of the writing-to-learn activities, features of assessment, and features of instruction. We indicate why such moderators are theoretically important below. Unfortunately, we were unable to examine if the social, cultural, institutional, political, and historical factors so important to social-cultural theory also accounted for variability in study effects. The available writing-to-learn investigations provided little, if any, documentation about these factors (Klein, 1999).
Content Area
According to a social-cultural approach, writing may be a useful tool for promoting learning in classes where writing and writing for learning are sanctioned and appreciated, but it is less effective or not effective at all in classes where this is not the case (Smagorinsky, 1995). This view was particularly important in the context of the current meta-analysis, as the value and emphasis that teachers place on writing-to-learn activities vary by discipline according to findings from recent national surveys. Social studies teachers are more likely to use writing to promote learning, followed by science and then mathematics teachers (Gillespie et al., 2014; Ray et al., 2016). Consequently, the potentially positive effects of writing-to-learn may be most pronounced in content areas where writing is more common and valued.
Grade Level
Theories that emphasize that the effects of writing on learning are a product of learners’ effort and capabilities (e.g., Klein & Boscolo, 2016) raise issues about the effectiveness of writing-to-learn for different students. The types of mental activities that students use when writing-to-learn, such as goal setting, elaboration, synthesis, evaluation, transformation, and reflection, require considerable effort and are not always applied successfully by students (Graham & Harris, 2000). Younger students may approach writing in a different way than older more experienced students, minimizing the role of such operations because they have not yet mastered them or they demand too many cognitive resources. Additionally, younger students are still mastering basic writing skills, such as spelling, handwriting, and sentence construction (Graham & Harris, 2000), which may impede their use of writing as a tool for learning. As a result, older and more experienced students may benefit more from using a writing-to-learn activity than younger and less experienced students.
Features of Writing-to-Learn Activities and Assessment
Theories that promote the view that particular writing activities (Bangert-Drowns et al., 2004; Langer & Applebee, 1987) or different genres (Klein, 1999) promote learning in specific ways also draw attention to possible variability in the effectiveness of writing-to-learn activities. The impact of a writing-to-learn activity depends on the thinking and processing that result as a consequence of using it. Some writing tasks like notetaking may result in a minimal level of information processing, whereas creating a written argument may lead to a deeper level of processing. A writing task may also be designed to promote a specific kind of learning, such as comprehension of factual information versus critically analyzing concepts or ideas. Moreover, writing tasks that involve metacognitive monitoring may yield greater effects than ones that do not involve such assessments, as metacognition promotes active monitoring, analysis, evaluating, and adaptive learning (Hacker, 2018). Thus, different kinds of writing-to-learn activities may be more or less effective.
The same theories cited above (e.g., Langer & Applebee, 1987) suggest that the outcomes in writing-to-learn studies may differ depending on the means used to assess learning. Because different writing activities can lead students to think about information in different ways, resulting in different types of learning, what is assessed, how it is assessed, and the alignment between assessment and instruction can all affect the impact of writing on learning, resulting in variability in effects. For instance, larger writing-to-learn effects are likely when the assessment is tightly aligned to information taught versus assessments that measure related material not taught. Similarly, assessments that measure the type(s) of knowledge the writing activity is designed to promote (e.g., comprehension/comprehension) may yield different effects than assessments that are not as well matched to the learning goals of the writing activity (e.g., comprehension/application or synthesis/comprehension). Hence, how learning is measured presumably accounts for variance in writing-to-learn effects.
Features of Instruction
Variability in writing-to-learn effects is likely a consequence of the instructional conditions under which they are applied. From a social-cultural perspective (Langer & Applebee, 1987), the impact of writing on learning may vary depending on how well these activities are embedded in the fabric of the classroom or how well teachers are prepared to implement them. For example, a writing activity should produce larger effects if it is a common and valued part of the instructional regime (Smagorinsky, 1995), as when a teacher frequently and explicitly uses writing as a tool for learning versus a teacher or researcher who implements such practices infrequently. Similarly, writing-to-learn activities may be even more effective in classes where teachers are provided with professional development and ongoing support.
From a cognitive perspective (Galbraith & Baaijen, 2018; Silva & Limongi, 2019), instructional supports that help students use writing-to-learn activities effectively should moderate their effects. These models assume that cognitive resources are limited, and that the effects of writing on learning are enhanced when they do not overtax working memory. Thus, teaching students how to apply writing-to-learn activities through instruction or facilitating their mastery through frequent use should enhance the effects of writing on learning.
Empirical Support for Writing-to-Learn
While there are multiple theoretical explanations for how writing about content supports learning, the earliest reviews of the literature (Ackerman, 1993; Applebee, 1984; Smagorinsky, 1995) concluded that writing has the “potential” to influence learning. These descriptive reviews consistently noted that confounds in the studies reviewed were many, and only about one half of the studies reviewed resulted in improved learning.
Two meta-analyses were conducted during the past decade to determine the impact of writing on learning. In the first, Bangert-Drowns et al. (2004) conducted a systematic search of the published and gray literature (electronic, journal, and reference searches ending in 1999) to identify true and quasi-experiments conducted with school-aged and college students who wrote about content in the context of academic settings. Studies were included only if students in the treatment condition wrote more than students in the control condition.
Bangert-Drowns et al. (2004) identified 48 studies. Pertinent to the meta-analysis reported here, 40 of these investigations involved science, social studies, or mathematics (other studies concentrated on content areas such as literature, business, and nursing); 25 studies focused on college students; and 23 studies were conducted with elementary or secondary students. The overall finding from their review provided evidence that writing about content material promoted learning, yielding a small but statistically significant average weighted effect size (ES) of 0.17. Negative effects, however, were obtained in one out of every four studies, and there was considerable variability in the obtained effects (ranging from −0.77 to 1.48). Moderator analyses found that increased treatment length and writing tasks that included metacognitive prompts predicted larger effects, but longer writing assignments and the use of writing-to-learn in the middle school grades (Grades 6–8) predicted lower effects. Given the potential situational effects of writing-to-learn described in the previous section, it was surprising that other contextual variables such as content area (mathematics, social studies, science, and other), type of writing (tasks that included personal writing vs. tasks without personal writing), and writing context (in or outside of classroom) did not reliably account for variability in effects. This was also the case for methodological variables such as random assignment, publication (e.g., dissertation vs. other; year of publication), and instructor (researcher involved or not involved).
Graham and Perin (2007) conducted a second meta-analysis examining the effects of writing-to-learn using true and quasi-experiments conducted with students in Grades 4 to 12 (the meta-analysis was commissioned by a private foundation specifically for these grades). Graham and Perin searched the published and gray literature (conducting electronic, journal, and reference searches through 2006). They identified 26 studies. Only three of these studies were not included in Bangert-Drowns et al. (2004). Not surprisingly, the results of the two meta-analyses were very similar. Graham and Perin reported a statistically significant average weighted ES of 0.23 for writing-to-learn, with 25% of investigations producing a negative effect (ES ranged from −0.77 to 1.48). No moderator analyses were conducted.
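To make the notion of an "average weighted ES" concrete, the following is a minimal sketch of one common weighting scheme, in which each study's ES is weighted by the inverse of its sampling variance. Neither prior review's exact weighting procedure is described here, so the scheme, the function name, and the input values are illustrative assumptions rather than a reproduction of either analysis.

```python
def weighted_mean_es(effect_sizes, variances):
    """Inverse-variance weighted mean effect size (assumed weighting scheme).

    effect_sizes: one ES per study.
    variances: the sampling variance of each ES, in the same order.
    """
    if len(effect_sizes) != len(variances):
        raise ValueError("effect_sizes and variances must have equal length")
    # More precise studies (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    weighted_sum = sum(w * es for w, es in zip(weights, effect_sizes))
    return weighted_sum / sum(weights)


# Hypothetical values: the second study is more precise, so it dominates.
example = weighted_mean_es([0.10, 0.50], [0.04, 0.01])
```

Under this scheme a small, imprecise study with an extreme ES shifts the average far less than it would in a simple unweighted mean.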
Despite the strengths of these two prior meta-analyses, there are several issues that raise questions about the conclusions drawn from them. It is possible that the obtained ESs of 0.17 and 0.23 from Bangert-Drowns et al. (2004) and Graham and Perin (2007), respectively, were inaccurate. Both reviews included quasi-experiments that did not control for pretest differences. This included close to three out of every four studies in the earlier meta-analysis and more than one half of the investigations in the later one. Moreover, these previous reviews did not winsorize ESs that were outliers, apply a correction to ESs for small sample size, correct clustering effects of ESs from quasi-experiments, or examine possible publication bias in the studies reviewed. Because the obtained effects in these prior reviews were relatively small, it is possible that correction of these issues could have led to a different and a less positive outcome.
Interpretations of the findings from Bangert-Drowns et al. (2004) and Graham and Perin (2007) are further blurred by three issues. First, these reviews did not evaluate if measures of learning were reliable and free of ceiling effects, teacher effects were controlled, or treatment fidelity was established. If issues involving these and other aspects of study quality (e.g., design, pretest-equivalence) were problematic and related to variability in effects, this would raise concerns about the confidence that can be placed in the findings from these meta-analyses.
Second, Bangert-Drowns et al. (2004) indicated they did not include in their review any studies with fatal flaws, citing as examples studies with substantial attrition, significant differences between treatment and control students, and confounds with other factors that might explain treatment effects (it was not clear, however, if they coded for time and content differences between conditions). They were correct to eliminate such studies, but unfortunately they neither operationally defined what was meant by fatal flaws nor established if decisions made about individual studies were reliable (i.e., coding for reliability was not established for any aspect of their review). These issues were also evident in Graham and Perin (2007), as they did not operationally define how they eliminated flawed studies or indicate if their decisions were reliable. These issues further weaken the confidence that can be placed in the obtained findings from these two meta-analyses.
Third, conclusions drawn by meta-analysts are constrained by the number of studies available, which presented a particular challenge for the conclusion that writing had a negative effect on learning in middle schools (Bangert-Drowns et al., 2004). This conclusion was based on just six studies. We are not faulting Bangert-Drowns and his colleagues for drawing this conclusion, but as more studies become available through new research, it is possible that the obtained effect of writing on learning in the middle schools may change, as outcomes based on small sample sizes are less reliable. The addition of new writing-to-learn studies may also affect the magnitude of effects for writing to learn across all studies as well as specific moderators involving type of content to be learned, characteristics of the writing activity, or the features of the methods used to assess learning.
Current Meta-Analysis
Purpose
Because of the theoretical promise of writing-to-learn, the issues surrounding previous meta-analyses, and the accumulation of new studies (two thirds of the studies in this meta-analysis were new), we conducted a meta-analysis examining the effects of writing on learning. We limited this analysis to writing about science, social studies, and mathematics in Grades 1 to 12, as most of the writing-to-learn research at these grades involves these disciplines (Miller et al., 2018). We did not include studies with college students, as they are fundamentally different from elementary and secondary students and are also a more select group of learners (Rivard, 1994). Similar to previous meta-analyses, only true and quasi-experiments were examined.
We further addressed multiple issues evident in the two prior meta-analyses regarding methodological design. Quasi-experiments had to include pretest and posttest measures of learning. Treatment and control conditions had to devote equal instructional time to learning and teach the same basic content material. Learning measures were excluded from any analyses if pretest scores approached ceiling levels (it is more difficult to demonstrate learning effects when this occurs). Grades as a measure of learning were also excluded, as they can reflect more than just academic growth. All variables were operationalized and reliability of coding was established. ESs were adjusted for small sample size, outliers were winsorized, clustering effects for quasi-experiments were corrected, and possible publication bias was examined.
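Two of the adjustments named above can be illustrated with a short sketch. The small-sample adjustment shown is the widely used Hedges correction to a standardized mean difference, and winsorizing is shown as pulling extreme ESs back to chosen bounds. The specific formulas and cutoffs applied in this meta-analysis are not stated in this section, so the function names, the bounds, and the example values below are illustrative assumptions only.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with a small-sample correction.

    Assumes the common Hedges approximation: d * (1 - 3 / (4 * df - 1)),
    where df = n_t + n_c - 2.
    """
    df = n_t + n_c - 2
    # Pool the treatment and control standard deviations.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # The correction shrinks d slightly; the shrinkage is larger for small samples.
    return d * (1 - 3 / (4 * df - 1))


def winsorize(effect_sizes, lower, upper):
    """Replace ESs outside [lower, upper] with the nearest bound.

    The bounds are supplied by the caller; they are hypothetical here.
    """
    return [min(max(es, lower), upper) for es in effect_sizes]


# Hypothetical study: 20 students per condition, means 10 vs. 8, SDs of 4.
g = hedges_g(10.0, 8.0, 4.0, 4.0, 20, 20)

# Hypothetical ES distribution with one outlier at each tail.
trimmed = winsorize([-2.0, 0.1, 0.3, 3.0], lower=-1.0, upper=1.5)
```

Winsorizing keeps every study in the analysis while limiting the influence of extreme values, which is why it is often preferred to simply dropping outliers.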
Like the two previous meta-analyses, we examined both published and unpublished studies. As described above, a number of criteria were instituted to ensure that all of the included investigations were methodologically sound. We further evaluated each study on the following nine quality indicators: research design, reliability of measures, pretest equivalence, posttest ceiling effects, posttest floor effects, teacher effects controlled, multiple classes in each condition, problems with attrition, and treatment fidelity. This allowed us to determine if study quality was related to variability in obtained effects, and to draw methodological recommendations for future research (this was not done in the previous meta-analyses). Moderator analyses also involved characteristics of participants, writing activities, assessments, and treatment/control comparisons, which included a broader array of variables than were included in Bangert-Drowns et al. (2004).
Research Questions and Predictions
We asked the following research questions for students in Grades 1 to 12:
We anticipated an affirmative answer to each research question. We expected that writing would enhance learning in the studies reviewed (RQ1), basing this prediction on the multiple theoretical explanations for writing-to-learn effects presented earlier. This prediction was further consistent with findings from Bangert-Drowns et al. (2004) and Graham and Perin (2007).
It was further expected that writing-to-learn effects would be moderated by content area (RQ2). We predicted writing would result in significantly different effects in science, social studies, and mathematics, and that effects would be larger in disciplines where writing is more valued and sanctioned (Smagorinsky, 1995). This was based on findings that teachers in different disciplines place more or less emphasis on writing (Gillespie et al., 2014).
We anticipated that the learning gains of students who were in earlier grades would be smaller than those made by students in later grades (RQ3). Students in earlier grades may not have acquired needed content knowledge or writing skills, making it less likely they could effectively use writing to support learning. We further expected that features of the writing-to-learn activities would be related to variability in ESs (RQ3). Different types of writing (e.g., argumentation, informational) contain different structures for organizing relationships among knowledge (Klein, 1999); the depth of processing promoted by a writing activity through analysis and interpretation should create more opportunities than merely recording information; metacognitive prompting is more likely to encourage learners to evaluate and modify their understandings (Hacker, 2018); and the highest level of learning a writing activity was designed to promote (e.g., knowledge, comprehension, application, analysis, synthesis, or evaluation; i.e., Bloom et al., 1956) should influence what is learned.
It was expected that features of instruction (RQ3) would be associated with writing-to-learn effects. Longer treatments or ones that involve more writing should result in stronger effects than shorter treatments or ones that involve less writing, as learners have more opportunities to become proficient in using the writing activities. We also anticipated that professional development as well as teaching students to use writing activities would produce larger effects, as teachers receiving such preparation would be better prepared to implement these activities and students better prepared to apply them, respectively. We further assumed that larger effects would be obtained when teachers implemented instruction when compared to researchers. Teachers know their students better than researchers, and teacher-led instruction signals that writing-to-learn activities are sanctioned and valued.
We predicted that features of assessments would also be correlated with obtained ESs (RQ3). Different types of measures (e.g., multiple-choice or essay tests) as well as measures that assess different levels of knowledge (as specified by Bloom et al., 1956) focus on distinct aspects of learning and should yield different effects. Measures created by a researcher should result in higher effects than measures designed by developers of standardized assessments, as researchers often align their assessments closely to the goals of the experiment and standardized measures by design do not. Alignment between assessments and content material should also create variable effects, as measures more closely aligned to content are likely to produce larger effects. Last, the match between the type of knowledge assessed and the type of knowledge that the writing activities are designed to promote should result in variability in effects. For instance, a study in which the assessment is exactly matched to the learning goals of the writing activity should produce a larger effect than a study in which the assessment measures a higher level of knowledge than the writing activity promotes.
We further expected that variability in effects would be related to features of the experimental methods used to assess the effects of writing-to-learn. We anticipated that type of treatment and control comparisons (RQ3) would account for variability in effects, as writing effects are likely to be stronger in studies where writing is compared to no writing versus studies in which more and less writing are compared. We also anticipated that measures of study quality would account for variability in effects (RQ3). Indicators of study quality such as study design, reliability of measures, and ceiling/floor effects are likely to contribute to variability in the effects of writing-to-learn. To illustrate, quasi-experiments may produce larger effects than true experiments as they are less tightly controlled. Variability of effects may be greater in studies where measures are not reliable versus studies that employ reliable measures, as unreliable measures increase the likelihood of error. Studies where measures exhibit ceiling/floor effects may exhibit less variability in effects than studies without such problems, as such measurement problems can restrict the amount of growth observed.
Method
Inclusion/Exclusion Criteria
Each study had to meet nine criteria to be included in this meta-analysis: (1) involved students in Grades 1 to 12; (2) tested if writing enhanced students’ learning in science, social studies, or mathematics; (3) applied a true- or quasi-experimental design to test the effects of writing on learning (quasi-experiments had to include a comparable pretest assessment for each outcome of interest); (4) included at least one measure assessing student learning (letter grades were not considered a viable assessment); (5) incorporated the writing-to-learn activity as part of classroom instruction; (6) controlled for the amount of time students in the treatment and control conditions spent learning content material; (7) focused on teaching the same content knowledge to both treatment and control students; (8) reported the statistics necessary to compute a weighted ES (or data were obtained from the authors); and (9) were written in English.
Studies were excluded for four reasons: (1) the control condition used writing to support learning in science, social studies, and mathematics (one exception was made: Treatment students did considerably more writing to support learning than controls); (2) the intervention was conducted in special schools for students with disabilities; (3) the attrition rate at the end of the experiment was 20% or greater (attrition greater than 20% can bias study findings; Dumville et al., 2006); and (4) pretest scores had a high ceiling effect (i.e., a mean score less than a full SD below the highest possible score). We did not include studies with kindergarten students, as they do little writing at school, and they are still learning the most rudimentary writing skills (Puranik et al., 2014).
The goal of this meta-analysis was to isolate the effects of writing on learning in science, social studies, and mathematics and not to compare different writing-to-learn activities. The inclusion and exclusion criteria were designed to achieve this objective, as included experiments involved students in the treatment condition using writing-to-learn activities, with the students in the control condition learning similar content and doing either no writing or engaging in considerably fewer writing-to-learn activities than treatment students. These criteria addressed possible confounds due to a lack of a control condition, pretest differences in quasi-experiments, differences in the content and the amount of time spent learning by treatment and control students, ceiling effects on learning measures at pretest, and high attrition rates.
Search Strategies
Electronic Searches
We conducted a series of electronic searches in multiple databases. The search dates were January 1, 1998, to March 30, 2017. We did not conduct electronic searches earlier than 1998 because Bangert-Drowns et al. (2004) had already conducted a thorough search of multiple databases before 1998 involving ERIC, Education Index, Dissertation Abstracts, and Psychological Abstracts. All relevant studies from their meta-analysis that met inclusion/exclusion criteria were included in our review.
The electronic searches we conducted involved the following databases: Academic Search Premier; Professional Development Collection; Web of Science; Psychological and Behavioral Science Collection; Education Full Text and ERIC; Scopus; J-Stor for journal articles, book chapters, or conference papers; as well as Dissertation and Thesis Full Text for master’s theses and doctoral dissertations. Key search terms included the following combination of terms: (a) (writing OR “writing-to-learn”) AND (“social studies”), (history), (geography), (civics) AND (elementary), (“middle school”), (“junior high school”), (“high school”), (secondary); (b) (writing OR “writing-to-learn”) AND (science), (biology), (physics), (chemistry) AND (elementary), (“middle school”), (“junior high school”), (“high school”), (secondary); and (c) (writing OR “writing-to-learn”) AND (math*), (algebra), (geometry), (calculus) AND (elementary), (“middle school”), (“junior high school”), (“high school”), (secondary). These searches yielded 31,348 potentially relevant listings for consideration.
Previous Reviews
Relevant studies were also identified by examining papers included in three prior meta-analyses. Two prior writing-to-learn meta-analyses described earlier (Bangert-Drowns et al., 2004; Graham & Perin, 2007) resulted in 22 suitable studies. We also examined a third meta-analysis that focused on the impact of writing on reading (Graham & Hebert, 2011). Graham and Hebert included true and quasi-experiments examining if (1) writing about material read enhances students’ comprehension of said text, (2) increasing how often students write improves reading comprehension, and (3) providing writing instruction results in better reading. We thought it possible that some of the 65 publications in their review in which students wrote about text read might also include learning measures in science, social studies, or mathematics.
Journal Searches
We conducted hand searches of journals most likely to publish writing-to-learn studies in Grades 1 to 12. For each journal, this review started with the first volume. The journals reviewed included Journal of Writing Research, Written Communication, Research in the Teaching of English, Reading and Writing: An Interdisciplinary Journal, Journal of Educational Psychology, American Educational Research Journal, Contemporary Educational Psychology, Journal of Educational Research, International Journal of Educational Research, and Learning and Instruction. These searches identified 47 publications that were potentially suitable for this review.
Author Contacts
Researchers who had published papers on writing-to-learn in Grades 1 to 12 in books, articles, or anthologies or made presentations on writing-to-learn at international writing research conferences were contacted. We asked them to identify references of published or unpublished studies that would be relevant for our review as well as to forward contact information of other writing-to-learn researchers they thought we should contact. In total, we contacted 85 scholars who identified 45 possible publications to include in our review.
Reference Lists
The reference lists of the studies included in this meta-analysis were checked to identify other possible studies not identified via the other four search processes. This did not yield any new publications.
Selection of Studies for Review
After eliminating duplicate listings from the five search strategies, 31,456 separate listings were identified. Ninety-eight of the 206 publications identified by examining prior reviews, searching relevant journals, and contacting the study authors were duplicate listings.
A three-step process was applied to determine which studies to include in this meta-analysis. One, the second author with the assistance of the last author and two graduate research assistants read the title and abstract of the 31,348 listings identified through the electronic searches and the 108 separate listings located through prior reviews, journal searches, and study author contacts. This involved looking for evidence in the title and abstract that the paper presented an empirical study involving students in Grades 1 to 12 (Criterion 1) who engaged in writing activities to learn about science, social studies, or mathematics (Criterion 2). Five hundred and four publications met these criteria. Interrater agreement for this phase of the selection process was 99%, with disagreements on 315 listings. The first author, who had 40 years of experience conducting literacy research, resolved these differences after discussion with the second author. When disagreements arose, the full paper was obtained and examined to help the two authors rectify these disagreements. Sixty percent of these disagreements involved whether content learning was assessed, whereas 30% of the disagreements centered on the degree to which writing-to-learn was part of both treatment and control conditions.
The second phase of the selection process examined the full text of the remaining 504 publications (84% of these texts were from the electronic searches). The distribution among these studies was 43% in science, 21% in social studies, and 36% in mathematics. The second and last authors read each document to determine if inclusion criteria were met and no exclusion criteria applied (agreement was 95%). Two inclusion criteria (6 and 7) were not examined in this phase; they were evaluated in Phase 3. As with Phase 1, all disagreements were resolved via discussion between the first and second authors. Our analysis at this point yielded 56 publications that included 59 experiments (three publications yielded two experiments each; Boscolo & Mason, 2001; Leopold & Leutner, 2012; Willey, 1988). The reasons for excluding the 448 publications were the following: (1) 56% of studies did not meet the criterion for design (k = 252; Criterion 3), (2) 34% of studies did not include a learning measure (k = 151; Criterion 4), and (3) the remaining 10% of the studies (k = 45) did not involve instruction (Criterion 5), did not include a writing-to-learn activity (Criterion 2), or did not report the needed statistics (Criterion 8).
The third phase of the selection process involved the first and second authors independently examining the 59 identified experiments (in 56 documents) to determine if any study violated inclusion Criterion 6 (instructional time was equivalent for treatment and control conditions) or Criterion 7 (instruction for treatment and control focused on the same content). There was one disagreement that was resolved through discussion. One study was eliminated as students in the treatment condition received more instruction than the control condition before the pretest (Criterion 6). Two other studies were eliminated as there was uncertainty concerning whether treatment and control students were taught the same content material (Criterion 7). A total of 56 studies in 53 documents met all criteria.
It is important to make clear the relationship between this meta-analysis and the two prior ones examining writing-to-learn. From the Bangert-Drowns et al. (2004) meta-analysis, we identified 20 investigations from the 48 included in their review that met all of our inclusion/exclusion criteria. The Graham and Perin (2007) meta-analysis (Grades 4–12) yielded two additional experiments that were suitable for our meta-analysis (i.e., Boscolo & Mason, 2001; Hand et al., 2004).
Coding
Descriptors
Each experiment was coded for the following descriptors: (1) content area (science, social studies, and mathematics), (2) publication type (i.e., journal article, dissertation/thesis, conference presentation, book chapter, or other), (3) where the treatment was delivered (i.e., regular classroom, self-contained setting, after school, or other), (4) student type (i.e., full range of a classroom, average students, high achieving students, English language learners, at-risk learners, students with disabilities, or other), (5) number of participants in the experiment, and (6) grade level (i.e., elementary [Grade 1–5 or 6 depending on the school], middle school [Grade 6–8], or high school [Grade 9–12]).
Writing Activities
Writing activities used in an experiment to promote learning were coded in five ways. One, the genre of the writing activity was coded as follows: (1) composing informational text that emphasized understanding and communication about subject matter content (e.g., summarizing information, presenting information in reports, connecting new and old information, comparing and contrasting ideas, describing processes, explaining why or how a process operates, or creating analogies), (2) building an argument (e.g., bringing evidence together to support a claim or hypothesis or stating an opinion about subject matter material), (3) producing a narrative (e.g., creating a story or poem to illustrate or expand subject-matter content or creating a word problem to illustrate an example of a specific mathematics principle), and (4) creating a graphical representation of content (e.g., using notetaking, graphic organizers, or mind maps as a tool for remembering, understanding, or analyzing content, but with little or no construction of new connected text). If a study used a combination of activities (e.g., journal writing and graphical representations), type of writing was coded as mixed.
Two, writing activities were coded as promoting (1) analysis and interpretation or (2) mostly recording of information. This provided an index of the level of processing (stronger vs. weaker) the writing activity was designed to promote. Analysis and interpretation involved the act of analyzing content material to understand its nature, as well as essential features, and interpreting this information through explanation, reframing, or making new connections. This included writing activities as diverse as explaining why and how a process operated, keeping a journal in science class to identify what concepts still need to be learned about photosynthesis, creating example word problems to illustrate a mathematical function, and building an argument to support a hypothesis such as causes of climate change. In contrast, the main purpose of recording information was to set down in writing content information to highlight or summarize what was most important, including what was learned. Examples of recording information included writing activities such as notetaking, outlining, summarizing, filling out graphic organizers, or logging information in journals for later retrieval. The examples above were not automatically coded as recording of information. For example, a journal entry where students just recounted what they learned was coded as recording information. If the journal entry also required students to identify what they still needed to know, it was coded as analysis and interpretation, as it required analyzing what was learned to draw interpretations about what was not learned but still needed to be acquired.
Three, writing activities were coded for metacognitive prompting. Types of writing that involved metacognitive prompting encouraged students to use writing to reflect on (1) the learning process (e.g., setting goals, or reflecting on strategic processes); (2) the current understanding of content material including revising or reconstructing understandings, identifying patterns in information, and making predictions; (3) the identification of failures and successes in learning including accomplishments; and (4) the motivational or affective responses to content material that might inhibit or facilitate learning. The definition we adopted was based on Bangert-Drowns et al.’s (2004) definition of metacognitive prompting but expanded to include writing treatments involving argumentation where students were asked to think about multiple sides of a position or make a prediction that involved such considerations.
Four, writing activities were coded using the Bloom et al. (1956) cognitive (i.e., knowledge-based) taxonomy. The cognitive taxonomy includes knowledge (recognizing and remembering facts, terms, and basic concepts), comprehension (understanding of facts and ideas), application (using acquired knowledge to solve problems), analysis (breaking information into component parts, explaining how these parts are related, drawing generalizations related to these parts and relationships, and providing evidence to support generalizations), synthesis (combining facts, ideas, parts, and generalizations together to form a larger or whole structure or pattern), and evaluation (presenting, judging, and defending ideas or opinions about information using internal or external criteria or both). The writing activity used in a study was examined to determine the highest level of learning it was designed to promote. In all but one case, the writing activity promoted the highest level identified as well as the lower levels of the taxonomy. In Kramarski and Mevarech (2003), students engaged in all levels of the taxonomy except application.
Five, we recorded the specific types of writing activities students did to support learning. Examples include notetaking, responding to prompts, creating questions, describing, persuading, outlining, lab report, graphic organizer, applying content learned in one situation to another (writing a letter to a peer), and journal writing.
Features of Instruction
We coded five features of instruction provided to treatment students or teachers. This included the number of days students received the treatment and were engaged in writing during the treatment. It also included determining if students were taught how to apply the target writing-to-learn activities, if treatment teachers were provided with professional development, and if teachers or researchers delivered the treatment.
Assessments
Assessments to measure learning outcomes were coded for four features: (1) highest level of the Bloom et al. (1956) cognitive taxonomy assessed (i.e., knowledge, comprehension, application, analysis, synthesis, and evaluation), (2) type of assessment used to measure content learning (i.e., multiple choice, open-ended, criterion-referenced [mostly assessments from textbooks], norm-referenced, combination of assessment types, or other), (3) who designed the content learning measure (i.e., researcher, commercial material developer, or publisher of a standardized assessment), and (4) the degree of alignment between the assessment items and the content taught. The degree of alignment included two levels: proximal alignment (test items were tightly aligned with content taught) and distal alignment (test items measured content beyond what was directly taught). Examples of proximally aligned measures included unit tests or researcher-designed tests that directly measured the material taught, whereas examples of distally aligned measures included norm-referenced or standardized tests that measured performance in a content area broadly or open-ended response in which the student’s response could go beyond the material specifically taught. Again, using Bloom’s taxonomy to classify the highest level of knowledge measured, we further coded whether the highest level of knowledge measured by the assessment matched the highest level of knowledge the writing activity was designed to promote. If these matched, it was coded as the same. If the assessment measured a lower level of knowledge than the writing activity was designed to promote, it was coded as lower. If the assessment measured a higher level of knowledge than the writing activity was designed to promote, it was coded as higher.
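The match coding described above amounts to comparing two ordinal positions on the Bloom et al. (1956) taxonomy. The following is a minimal sketch of that comparison; the names BLOOM_LEVELS and code_match are hypothetical and introduced only for illustration.

```python
# Bloom et al. (1956) cognitive levels, ordered from lowest to highest.
BLOOM_LEVELS = [
    "knowledge", "comprehension", "application",
    "analysis", "synthesis", "evaluation",
]

def code_match(assessment_level: str, writing_level: str) -> str:
    """Code the assessment's highest level relative to the writing activity's.

    Returns "same", "lower", or "higher", mirroring the three codes
    described in the text. Function name is an illustrative assumption.
    """
    a = BLOOM_LEVELS.index(assessment_level)
    w = BLOOM_LEVELS.index(writing_level)
    if a == w:
        return "same"
    return "lower" if a < w else "higher"
```

For instance, a multiple-choice test targeting knowledge paired with a writing activity designed to promote analysis would be coded as lower.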
Treatment/Control Comparisons
We coded two aspects of the treatment and control comparison in each study. This involved categorizing studies as (1) treatment students wrote-to-learn and control students did not (i.e., some writing vs. no writing) or (2) treatment students did more writing-to-learn than control students (i.e., more writing vs. less writing).
Study Quality
Each experiment was scored for nine indicators of study quality. A score of 1.0 was assigned if the quality indicator was met; otherwise a score of 0 was assigned. The indicators were (1) high-quality design (true-experimental), (2) not an N of 1 study (more than one group or class in each condition), (3) teacher effects were controlled (i.e., teachers randomly assigned to condition or taught in each condition), (4) attrition was not greater than 10%, (5) differential attrition was not evident (i.e., attrition between treatment and control did not differ by more than 5%), (6) pretest scores were equivalent (i.e., the mean scores for each condition did not differ by more than the smallest SD at pretest; true experiments without pretests automatically met this criterion), (7) no floor or ceiling concerns at pretest or posttest (i.e., when the standard deviation for a measure was added or subtracted from the mean score for the assessment, it did not exceed the lowest or highest possible score for the measure for any condition), (8) reliable learning measures (i.e., reliability coefficients of .70 or higher were reported), and (9) treatment fidelity established (i.e., evidence for treatment fidelity provided). These quality indicators were used in other meta-analyses of intervention literacy research (e.g., Graham & Harris, 2018; Slavin et al., 2008) and are important elements in conducting high-quality intervention research (see Graham & Harris, 2014).
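The floor/ceiling rule in indicator 7 can be stated as a simple check on one condition's mean and SD. The sketch below is illustrative only; the function names and the example score range are assumptions, not part of the coding manual.

```python
def has_floor_or_ceiling_concern(mean: float, sd: float,
                                 min_score: float, max_score: float) -> bool:
    """True if mean - SD falls below the lowest possible score (floor)
    or mean + SD rises above the highest possible score (ceiling)."""
    return (mean - sd) < min_score or (mean + sd) > max_score

def indicator_7_score(conditions, min_score: float, max_score: float) -> float:
    """Score 1.0 only if no condition shows a floor or ceiling concern.

    `conditions` is a list of (mean, sd) pairs, one per condition and
    testing occasion (hypothetical input layout).
    """
    flagged = any(
        has_floor_or_ceiling_concern(m, s, min_score, max_score)
        for m, s in conditions
    )
    return 0.0 if flagged else 1.0
```

On a hypothetical 20-point test, a condition with a mean of 18 and an SD of 3 would be flagged because 18 + 3 exceeds 20.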
The second and third authors coded all studies. Interrater agreement was 95.6%. After discussion with the second author, the first author resolved all disagreements.
Analyses
Basic Procedures
ESs were computed by subtracting the posttest mean for the control condition from the posttest mean for the writing-to-learn treatment condition, and then dividing the difference by the pooled SD of the two conditions at posttest. Pretest differences between the treatment and control conditions were accounted for by subtracting the mean pretest score of each condition from the respective mean posttest score. No pretest adjustments were made for true experiments that did not report pretest scores.
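The computation described above can be sketched as follows. The text specifies only that the pooled posttest SD was used; the degrees-of-freedom-weighted pooling formula below is the conventional one and is an assumption here, as are the function names.

```python
import math

def pooled_sd(sd_t: float, n_t: int, sd_c: float, n_c: int) -> float:
    """Conventional pooled SD of two groups (assumed formula)."""
    return math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )

def effect_size(post_t, post_c, sd_t, sd_c, n_t, n_c, pre_t=None, pre_c=None):
    """Standardized mean difference at posttest.

    If pretest means are supplied, each condition's pretest mean is
    subtracted from its posttest mean before taking the difference.
    If they are omitted (e.g., a true experiment without pretests),
    no adjustment is made.
    """
    gain_t = post_t - (pre_t if pre_t is not None else 0.0)
    gain_c = post_c - (pre_c if pre_c is not None else 0.0)
    return (gain_t - gain_c) / pooled_sd(sd_t, n_t, sd_c, n_c)
```

With hypothetical posttest means of 75 and 70, pretest means of 52 and 50, and posttest SDs of 10 in both conditions, the unadjusted difference of 5 points shrinks to an adjusted difference of 3 points before standardizing.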
Before calculating some ESs, it was necessary to average the performance of two or more groups in each condition using a procedure recommended by Nouri and Greenberg (Cortina & Nouri, 2000). To illustrate, some experiments (e.g., Akkus et al., 2007; Faber et al., 2000) reported statistics separately for students based on differences in achievement level, making it necessary to calculate an overall mean and SD for all participants in each condition using the Nouri–Greenberg procedure. It was also necessary in some cases to estimate missing SDs from the statistics reported by the study authors (e.g., Lodholz, 1980).
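As an illustration of combining subgroup statistics into a single mean and SD for a condition, the sketch below uses the standard identity for merging group summaries (within-group plus between-group sums of squares). Whether this matches the Nouri–Greenberg procedure in every detail is an assumption; Cortina and Nouri (2000) give the authoritative description. The function name is hypothetical.

```python
import math

def combine_groups(groups):
    """Merge subgroup summaries into one overall mean and SD.

    `groups` is a list of (n, mean, sd) tuples, e.g., one tuple per
    achievement level within a condition.
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Within-group sum of squares plus between-group sum of squares.
    ss = sum((n - 1) * sd ** 2 + n * (m - grand_mean) ** 2
             for n, m, sd in groups)
    return total_n, grand_mean, math.sqrt(ss / (total_n - 1))
```

Note that the combined SD is larger than the subgroup SDs whenever the subgroup means differ, because between-group variation is added back in.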
All quasi-experiments included in this meta-analysis assigned schools, classes, or groups of students to conditions (e.g., Bell & Bell, 1985) but examined student-level effects. To adjust for these clustering effects, we applied procedures recommended by Hedges (2007). This involved the use of an imputed intraclass correlation (ICC) estimate. Previous research reported average ICCs between .10 and .25 for educational achievement outcomes (Hedges & Hedberg, 2007; Stockford, 2009). Using these parameters as a guideline, we used the midpoint (.175) between these estimates as the ICC in this meta-analysis.
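To show how an imputed ICC enters such an adjustment, the sketch below applies one commonly cited form of the Hedges (2007) correction for effect sizes computed from student-level SDs in clustered designs with roughly equal cluster sizes. The text does not state which variant was applied, so the formula choice, the function name, and the example cluster sizes are all assumptions.

```python
import math

# Midpoint of the ICC range reported in the text (.10 to .25).
IMPUTED_ICC = (0.10 + 0.25) / 2

def cluster_adjusted_es(d: float, avg_cluster_size: float,
                        total_n: int, icc: float = IMPUTED_ICC) -> float:
    """Shrink a student-level ES to account for clustering.

    Assumed form: d * sqrt(1 - 2 * (n - 1) * icc / (N - 2)), where n is
    the average cluster size and N is the total number of students.
    """
    factor = 1.0 - (2.0 * (avg_cluster_size - 1) * icc) / (total_n - 2)
    return d * math.sqrt(factor)
```

Under this form the adjustment leaves the ES unchanged when the ICC is zero and shrinks it more as clusters become larger relative to the total sample.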
One experiment in this review randomly assigned classes to conditions and used summary statistics based on class means (i.e., Lodholz, 1980). The summary statistics of all other experiments were based on the scores of individual students. Using class means to estimate ESs is incommensurate with ESs based on student-level variance. We corrected this problem using a procedure recommended by Hedges (2007), applying the .175 ICC described above.
With one exception, effects were calculated from measures administered immediately following treatment. An experiment by Davis (1990), however, did not include the means needed to calculate an ES at posttest but did include the needed data for an assessment delivered 2 weeks later. We used these means along with the SDs reported at posttest (these were the only SDs available) to calculate an ES. Furthermore, we were not able to obtain SDs at posttest for an experiment by Wells (1986) and used pretest SDs as the best available approximation for these missing data. Finally, all ESs were adjusted for small sample size bias (i.e., Hedges, 2007).
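The small-sample adjustment mentioned above is not spelled out in the text. A minimal sketch, assuming the widely used approximate correction factor J = 1 - 3 / (4df - 1) with df = n_t + n_c - 2, is given below; the function name is hypothetical.

```python
def small_sample_corrected_es(d: float, n_t: int, n_c: int) -> float:
    """Apply the approximate small-sample correction to a standardized
    mean difference (assumed formula: J = 1 - 3 / (4 * df - 1))."""
    df = n_t + n_c - 2
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    return j * d
```

The correction is noticeable only for small studies: with 10 students per condition the ES is reduced by about 4%, whereas with 100 per condition the reduction is well under 1%.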
Statistical Analysis of Effect Sizes
Average Weighted Effect Size
Comprehensive Meta-Analysis (Version 3; Borenstein et al., 2014) was used to conduct all analyses. We employed a weighted random-effects model (weighted by multiplying each ES by its inverse variance). For all research questions, average weighted ESs, corresponding confidence intervals, level of statistical significance, and two measures of variability were computed. I² described the ratio of true heterogeneity to total variance, whereas τ² represented the between-study heterogeneity component used for estimating inverse variance weights (Tanner-Smith & Tipton, 2014).
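For readers who want to see how these quantities relate, the following sketch computes a random-effects average, its 95% confidence interval, τ², and I² from a list of ESs and their variances. It assumes the DerSimonian-Laird method-of-moments estimator for τ²; the text does not state which estimator the software applied, and the function name is hypothetical.

```python
import math

def random_effects_summary(effects, variances):
    """Inverse-variance random-effects average with tau^2 and I^2
    (DerSimonian-Laird estimator assumed)."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    # Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (y - fixed_mean) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "mean": mean,
        "se": se,
        "ci95": (mean - 1.96 * se, mean + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
        "q": q,
    }
```

When τ² is zero the random-effects weights reduce to the fixed-effect inverse-variance weights; as τ² grows, the weights become more nearly equal across studies.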
To avoid inflating sample size, only one ES per study was used to compute an average weighted ES (Gleser & Olkin, 2009). There was one exception to this. Boscolo and Mason (2001) contained three conditions: writing-to-learn in science, writing-to-learn in social studies, and a control condition. We treated the writing-to-learn in science comparison with control as one study, and the writing-to-learn in social studies comparison with control as a second study. As noted earlier, Willey (1988) and Leopold and Leutner (2012) each included two experiments.
If an experiment included multiple measures of learning, the ESs were aggregated to provide a single estimate of the effect for a study. This was done to avoid inflating sample size. For example, if a study contained a multiple-choice measure, a norm-referenced test, and an open-ended assessment, then ESs for each measure were computed and averaged.
Outliers
Before computing an average weighted ES to examine the impact of writing on learning, we examined if any single ES was exerting undue influence in terms of number of participants and magnitude of effect. An outlier was defined as a score falling more than 3 times the interquartile range above the 75th percentile or below the 25th percentile of the distribution of scores (Tukey, 1977). We identified and adjusted accordingly four outliers for number of participants (i.e., Brodney, 1993; Chen et al., 2013; Lodholz, 1980; Merchie & Van Keer, 2016) and one outlier for magnitude of effect (i.e., Caukin, 2010).
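The outlier rule can be sketched as below. Two details are assumptions: the quartile interpolation method (several conventions exist and they give slightly different fences) and the adjustment itself, which is implemented here as pulling each outlier back to the nearest fence. The function names are hypothetical.

```python
import statistics

def tukey_fences(values, k: float = 3.0):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize_to_fences(values, k: float = 3.0):
    """Replace values beyond the fences with the fence value."""
    low, high = tukey_fences(values, k)
    return [min(max(v, low), high) for v in values]
```

The same function would apply to either distribution examined in the text: sample sizes or ES magnitudes.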
Moderator Analyses
We conducted a series of preplanned moderator analyses to determine if heterogeneity in obtained effects was related to identifiable differences between studies. Planned comparisons included examining if variability in ESs was related to content area (science, social studies, and mathematics), grade level (elementary, middle school, and high school), features of the writing-to-learn activity (type of writing, analysis and interpretation vs. recording, metacognitive prompting present or not present, and highest level of knowledge promoted by writing activities), features of instruction (number of treatment days, number of days treatment students wrote, students taught how to apply writing-to-learn activities, professional development provided to teachers, who delivered instruction), features of assessment (highest level of knowledge assessed, type of measure, who created the assessment, alignment between assessment/content, and match between highest level of knowledge assessed and promoted by the writing activity), type of treatment/control comparison (writing vs. no writing or more writing vs. less writing), and indices of study quality (peer review, experimental design, attrition, reliability of measures, ceiling/floor effects, pretest equivalence, teacher effects, N-of-1 problem). More specific information on each metaregression is provided in the Results section.
Publication Bias
Two methods were applied to determine if publication bias existed in the experiments used in this meta-analysis: (1) creating a funnel plot of precision and examining it using Duval and Tweedie’s (2000) trim and fill procedure and (2) conducting a Begg and Mazumdar rank correlation test (this procedure tests the interdependence of variance and ESs; Begg & Mazumdar, 1994).
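The rank correlation test can be sketched as follows. This is a simplified illustration, not the procedure as implemented in the software the authors used: it standardizes each ES against the fixed-effect pooled estimate, computes Kendall's tau between the standardized ESs and their variances, and uses a normal approximation without tie or continuity corrections. The function name and data are hypothetical.

```python
import math


def begg_mazumdar(effects, variances):
    """Return (Kendall's tau, two-sided p) for ES-variance dependence.

    Simplifying assumptions: no tie correction, no continuity correction.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    pooled = sum(wi * es for wi, es in zip(w, effects)) / sum_w
    # Variance of each deviation from the pooled estimate.
    dev_var = [v - 1.0 / sum_w for v in variances]
    z = [(es - pooled) / math.sqrt(dv) for es, dv in zip(effects, dev_var)]
    # Count concordant and discordant pairs (Kendall's tau-a).
    score = 0
    for i in range(k):
        for j in range(i + 1, k):
            product = (z[i] - z[j]) * (variances[i] - variances[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    tau = score / (k * (k - 1) / 2)
    # Normal approximation to the null distribution of the score.
    z_stat = score / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return tau, p_value


# Made-up data in which less precise studies report larger effects.
tau, p = begg_mazumdar([0.1, 0.2, 0.3, 0.4, 0.5], [0.01, 0.02, 0.03, 0.04, 0.05])
print(tau, p)
```

A positive, statistically significant tau would suggest that smaller (higher-variance) studies tend to report larger effects, which is the asymmetry a funnel plot is meant to reveal.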
Results
Characteristics of Studies
Context
The studies in this review mostly took place in general education classrooms. This was the case in 54 experiments (96%). Two other experiments took place in a summer school program (Peterson, 2003) and a college laboratory school (Konopak et al., 1990). Most of the studies were conducted in the United States, but 15 of them (27%) occurred in other countries, including four in Germany, three in Italy, and one each in Canada, Belgium, Taiwan, Turkey, Spain, Israel, Malaysia, and Lebanon.
Content Areas
Characteristics and findings for the experiments in this meta-analysis are presented in Table 1. Twenty-six experiments (46% of all studies) tested the effects of writing on science; 21 experiments (38% of studies) assessed the impact of writing on mathematics; eight experiments (14% of studies) examined the influence of writing on social studies; and one experiment (2% of studies) investigated the effects of writing in both science and social studies.
Table 1. Characteristics and overall effect sizes for writing-to-learn treatments
Note. N = number of students in study; Tx = treatment; Wr = writing; metacog = metacognitive; SS = social studies; SC = science; M = mathematics; FR = full range; SD = students with disabilities; EL = English learner; LA = low achieving; AV = average students; NS = not specified; CD = cannot determine; J = journal writing; A = analysis and interpretation; I = informative writing; R = mostly recording information; G = graphical representations; AR = argumentative writing; MIX = multiple types of writing; N = narrative.
Journal article.
Participants’ Grade Level and Characteristics
The 56 experiments included 6,235 students in Grades 1 to 11. Nineteen experiments (34%) were conducted with elementary grade children, 18 experiments (32%) with middle school students, 18 experiments (32%) with youngsters in high school, and one experiment with both middle and high school students (Akkus et al., 2007). Students in 44 experiments (79%) represented the full range of ability in general education classes. Four experiments (7%) included just average students (Ayres, 1993; Bigelow, 1992; V. M. Johnson, 1998; Tsai, 1995), and one experiment (2%) involved only stronger students (Caukin, 2010). Five experiments (9%) included students who were identified as having language, learning, or motivational challenges. These included students who were identified as English language learners (Rosa, 1999); experienced low performance in mathematics (Jitendra et al., 2013), reading (Morphy, 2013), or science (Akkus et al., 2007); or expressed low interest in social studies (Faber et al., 2000). Akkus et al. (2007) also included high-performing science students, whereas Faber et al. (2000) included students with high interest in social studies. Finally, two experiments did not provide enough information to determine students’ capabilities (Peterson, 2003; Wäschle et al., 2015).
Writing-to-Learn Instruction
In 19 of the 56 experiments (34%), writing-to-learn involved writing informational text. This took many forms, including but not limited to summarizing information, explaining how to carry out an activity, creating a report, identifying what was learned, reflecting on what information still needed to be learned, connecting new information to old information, personalizing information learned, and sharing information about a topic on the web. Informative writing was used to promote learning in 50%, 35%, and 29% of social studies, science, and mathematics studies, respectively.
Eighteen experiments (32%) involved treatment students writing in a journal. This included diaries and learning logs. Journal writing was most common in experiments focused on learning mathematics (52%), followed by social studies (38%) and science (15%).
Argumentative writing was applied in seven of the experiments (13%). Six of these seven experiments involved science where students typically put forth their predictions, claims, and evidence in writing. Argumentation was also applied in one experiment in mathematics (Cross, 2008) in which students constructed written arguments for solving mathematical problems.
Students engaged in narrative writing in three experiments (5%) to create story problems in mathematics (i.e., Rosa, 1999) or to write a narrative about content taught in science (i.e., Pillsbury, 2008). Writing in eight experiments (14%) involved creating a graphical representation of key concepts and content through outlining, completing graphic organizers or mind maps, or making notes. This approach was applied in 25% of studies focusing on social studies, in 15% of the studies focusing on science, in one mathematics study (Jitendra et al., 2013), and in one study (Merchie & Van Keer, 2016) that used writing to influence both science and social studies learning. Last, two science experiments (i.e., Ayres, 1993; Jang, 2010) used a combination of writing procedures to promote learning including journals and other writing activities (e.g., graphic organizers, written analogies, and written summaries).
The majority of the writing-to-learn methods tested in the 56 experiments involved analysis and interpretation. Analysis and interpretation procedures were applied in 39 experiments (70%). This included writing to compare and analyze two or more new ideas, compare and contrast new information and previously learned content, offer and defend a hypothesis, create an argument for solving a problem and address a counterclaim, determine what still needs to be learned, reflect on the learning process, use new ideas for generating a new interpretation, and explain how two new ideas link together.
In 16 experiments (29%), the writing activity involved recording information. This included using writing to summarize ideas, reiterate the steps used to complete a task, take notes, complete graphic organizers, identify what was already known about a topic, and list what was learned. For one mathematics experiment (i.e., Greer, 2010), it was not possible to determine if the informational writing activities applied fostered analysis and interpretation or recording of information as the procedures were not adequately specified.
More than one half of the experiments (55%) involved metacognitive prompting. Examples of metacognitive prompting included writing activities in which students made judgments about what they still needed to know, reflected on the processes they were using to learn, made predictions, and reconstructed understandings. However, there was not enough information in Greer (2010) to determine if this was the case.
Forty-four of the writing treatments (79%) involved two or more writing activities, with 26 of these 44 experiments applying three or more writing activities (59%). Experiments that applied a single writing activity were quite varied, and included writing activities such as notetaking, summarizing, letter writing, creating word story problems, and using graphic organizers. The most common writing activities applied across all experiments were metacognitive reflection (30 studies; 55%), writing answers to questions about content material (28 studies; 50%), creating written descriptions/explanations about content (28 studies; 50%), and writing a summary (21 studies; 38%).
When we examined the writing activity in each study in terms of the Bloom et al. (1956) cognitive taxonomy, evaluation was the highest level promoted in 30% of the investigations (35%, 30%, and 25% of studies in science, mathematics, and social studies promoted evaluation of content information, respectively). This was followed by comprehension promoted in 27% of studies (31%, 25%, and 20% of studies in science, social studies, and mathematics, respectively), synthesis in 18% of studies (23% and 20% of studies in science and mathematics, respectively), analysis in 11% of studies (25%, 15%, and 4% of studies in social studies, mathematics, and science, respectively), application in 7% of studies (15% and 4% of studies in mathematics and science, respectively), and knowledge in 5% of studies (25% and 4% of studies in social studies and science, respectively). We were unable to draw any conclusions about Greer (2010), as the writing activities were poorly described.
Features of Instruction for Students and Teachers
The length of treatment ranged from 1 to 120 days, averaging 36.26 days per study (SD = 28.69). It was not possible to determine treatment length in two studies (see Table 1). In 84% of studies with data on this variable, treatment length was 10 days or longer with three studies involving a single day of treatment.
The number of treatment days during which students wrote ranged from 1 to 120, averaging 28.55 days per study (SD = 26.93). We were unable to determine how many days students wrote in five studies (see Table 1). Students wrote for 10 days or more in 75% of studies. Four studies involved just a single day of writing. Across studies where treatment length and number of days writing were available, students averaged writing on 79% of the available treatment days. In 50% of studies, students wrote on every treatment day. There were only six studies in which students wrote on less than 50% of treatment days.
In slightly more than one half of the studies (55%), students in the treatment condition were taught how to apply writing-to-learn activities. This occurred in 58%, 52%, and 50% of the studies in science, mathematics, and social studies, respectively. Professional development was provided in 38% of studies. This occurred in 57%, 25%, and 23% of mathematics, social studies, and science studies, respectively. Teachers implemented instruction in 79% of studies (researchers provided instruction in the other studies).
Assessing the Effects of Writing-to-Learn
In terms of Bloom et al.’s (1956) cognitive taxonomy, the level of knowledge assessed in the studies in this review focused primarily on comprehension (43%), application (30%), and knowledge (13%). Measures assessing comprehension and knowledge predominated in social studies (75%), with the remaining measures assessing application. Somewhat similarly, 69% of measures in science studies involved comprehension and knowledge, with measures of application, synthesis, and evaluation accounting for 12%, 12%, and 8% of the remaining assessments, respectively. Application was the most common level of knowledge assessed in mathematics studies (60%), followed by comprehension (25%), knowledge (5%), synthesis (5%), and evaluation (5%). Furthermore, the match between highest level of knowledge assessed and promoted via Bloom’s taxonomy showed that in 59% of the studies the assessments measured a lower level of knowledge than the writing activities were designed to promote. There was an exact match in 31% of studies. In 9% of studies, assessments were at a higher knowledge level than the writing activities promoted.
The impact of writing on learning was assessed with a variety of measures. Multiple-choice measures were applied in 15 studies (27%), open-ended response measures in 13 studies (23%), norm-referenced measures in nine studies (16% with all of these involving mathematics), criterion-referenced measures in eight studies (14% with most of these involving textbook assessments), and a mix of assessments in 10 studies (18%). The majority of assessments (63%) administered in the 56 studies were designed by researchers, with 19% of assessments developed by publishers of standardized assessments (e.g., norm-referenced tests) and another 18% created by publishers of commercial materials (e.g., unit tests in textbooks). Eighty percent of the time assessments were closely aligned with the content taught, but one fifth of the assessments measured learning more broadly.
Treatment and Control Comparisons
The treatment and control comparison mostly involved students in the treatment condition writing about content and students in the control condition not writing at all. This was the contrast in 41 experiments (73%). The remaining 15 experiments (27%) involved students in the treatment condition writing more about the content being learned than students in the control condition. We provide two examples below to illustrate how the writing between treatment and controls differed in these 15 experiments.
One, in Boscolo and Mason (2001), students in the treatment and control conditions engaged in the same kind of writing in three sessions (write everything you know about the spices and metals in the 15th century; observe the reproduction of Columbus’s landing and write all the information you can; and write your perceptions about your work on this unit), whereas treatment students engaged in further writing in six additional sessions (e.g., writing explanations, comments, hypotheses, analyses, interpretations, reflections on content material). Two, in Gillespie Rouse et al. (2017), students in the control condition wrote about what they liked about science class and the work they were doing, whereas the writing of students in the treatment condition focused directly on what they were learning (e.g., “What is happening with the balance and weights? Are you noticing any patterns? What makes a balance beam balance? What makes it tilt right or left?”).
Quality of Studies
Table 1 includes a quality score for each experiment. This score represents the sum of all criteria met, except for treatment fidelity. We did not include treatment fidelity in this sum, as it occurred in just three experiments (5%). In 31 of the experiments (55%), a true-experimental design was used to assess the effects of writing on learning. Attrition greater than 10% was not an issue in 47 experiments (84%), and differential attrition did not occur in 42 experiments (75%). In 45 experiments (80%) there was no meaningful difference at pretest between treatment and control conditions. Ceiling and floor problems were not evident in 30 experiments (54%). Treatment and control conditions each included more than two instructional groups in 30 experiments (54%). Reliability of measures was established in 29 experiments (52%). Teacher effects were controlled in 29 experiments (52%). Also, 43% of studies were journal articles that were peer reviewed, and the remaining studies were dissertations or theses.
The Impact of Writing on Learning
Writing about content improved students’ learning, as we obtained a statistically significant average weighted ES of 0.30 (see Table 2). While a positive ES was obtained in 82% of the investigations (see Table 1), there was considerable variability in effects. The I² statistic showed that 71% of variance resulted from between-study factors. Publication bias did not appear to be an issue in this omnibus analysis. The trim and fill method imputed no ESs to the left or the right of the funnel plot. Furthermore, the correlation between study effects and variance in effects was not statistically significant (p for the Begg and Mazumdar rank correlation test = .15).
Table 2. Overall analysis of weighted effect sizes and confidence levels
Note. K = number of effect sizes; g = Hedges’s g; PD = professional development.
Content Area Differences
Writing about content improved students’ learning in science, social studies, and mathematics. Average weighted ESs for these three areas were 0.30, 0.33, and 0.32, respectively. Each of these effects was reliably greater than no effect, and there was considerable variability in ESs in each subject area, except social studies (see Table 2). A metaregression, however, did not find that discipline moderated writing-to-learn effects, as this predictor did not account for a statistically significant proportion of variance, Q = .04, df = 2, p = .98. Merchie and Van Keer (2016) was not included in this analysis as it involved a science/social studies treatment.
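The p value attached to a moderator test of this kind can be checked by referring Q to a chi-square distribution with the stated degrees of freedom. The sketch below does this with the standard library only; the helper name is hypothetical, and the example values are Q statistics reported in this Results section.

```python
import math


def chi2_upper_tail(q, df):
    """Upper-tail probability of a chi-square statistic for integer df >= 1."""
    half = q / 2.0
    if df % 2 == 1:
        p = math.erfc(math.sqrt(half))  # closed form for df = 1
        a = 0.5
    else:
        p = math.exp(-half)  # closed form for df = 2
        a = 1.0
    # Raise df by 2 per step using the incomplete-gamma recurrence
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while 2 * a < df:
        p += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1.0
    return p


# (Q, df) pairs as reported for content area, type of writing,
# treatment/control comparison, and study quality.
for q, df in [(0.04, 2), (1.42, 3), (3.69, 1), (11.38, 7)]:
    print(q, df, round(chi2_upper_tail(q, df), 3))
```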
Grade-Level Differences
We conducted a metaregression to determine if writing-to-learn effects were moderated by grade level (i.e., elementary, middle school, and high school). Writing about content did improve students’ learning at all three grade levels. Average weighted ESs for elementary, middle school, and high school students were 0.29, 0.30, and 0.30, respectively. Effects at each of these grade levels were statistically significant, and there was considerable variability in ESs at the middle and high school levels but not at the elementary grades (see Table 2). The overall effects of writing-to-learn were not moderated by grade level. This predictor did not account for a statistically significant proportion of variance in the metaregression, Q = .08, df = 2, p = .96. Akkus et al. (2007) was not included in this analysis as it did not contain separate data for middle and high school students.
Differences Related to Features of the Writing Activities
Three metaregressions were conducted to determine if features of writing-to-learn treatments were related to variability of effects. Two metaregressions included one predictor. This included an analysis examining type of writing (informational, journal, argument, or graphical writing), and another examining the highest level of the Bloom et al. (1956) taxonomy that the writing activity promoted (knowledge/comprehension, analysis/synthesis, and evaluation). The other metaregression included two predictors: (1) writing promoted analysis and interpretation versus recording information and (2) metacognitive prompt included or not included. We were unable to apply all four predictors in a single analysis or even two analyses as we would have had to drop too many studies. For type of writing, there were not enough studies examining narrative writing (k = 3) or a mixture of writing activities (k = 2) to include them in the analyses. For highest level of the Bloom et al. taxonomy, we combined several levels to increase our power (knowledge [k = 3] and comprehension [k = 15]; analysis and synthesis [k = 5 and k = 10, respectively]), and there were not enough studies with writing activities categorized as application (k = 5) to include them in an analysis. We could include Greer (2010) only in the analysis on type of writing, as the writing activities were not described well enough to include it in the other two metaregressions.
The average weighted ESs for types of writing were argumentative = 0.42, informational = 0.34, journal = 0.33, and graphical = 0.19. Each of these effects was statistically significant (and evidenced high to moderate heterogeneity of ESs), except for graphical writing, which did not yield an effect greater than zero (see Table 2). However, the overall effects of writing-to-learn were not moderated by type of writing. This predictor did not account for a statistically significant proportion of variance, Q = 1.42, df = 3, p = .70.
The average weighted ES for writing-to-learn treatments that involved analysis and interpretation of content material was 0.36; it was 0.18 for writing that involved recording information. The ESs for both analysis and interpretation and recording information were reliably greater than no effect (see Table 2). When the writing-to-learn treatment involved metacognitive prompting, the average weighted ES was 0.40; it was 0.15 when the treatment did not include such prompting. The ES for metacognitive prompting was reliably greater than no effect; this was not the case when such prompting was not included. These two predictors evidenced high to moderate variability. They did not moderate the effects of writing-to-learn, Q = 5.58, df = 2, p = .06.
The average weighted ES for writing activities that promoted knowledge/comprehension, analysis/synthesis, and evaluation was 0.16, 0.32, and 0.44, respectively. The ES for each level of this predictor was reliably greater than no effect (see Table 2), but there was high to moderate variability for each variable. The overall effect of writing-to-learn was not moderated by the highest level of knowledge prompted by the writing activity/activities, Q = 3.31, df = 2, p = .19.
Differences Related to Features of Instruction
We conducted two metaregressions to determine if features of instruction moderated writing-to-learn effects. For the first metaregression, we examined the collective impact of the number of treatment days (M = 36.26) and the number of days treatment students wrote (M = 28.55). We were not able to include five studies in this analysis as it was not possible to determine number of days for treatment, writing, or both (see Table 1). Effects of writing-to-learn were not moderated by number of treatment and writing days, Q = 0.64, df = 2, p = .73.
The second metaregression examined if writing-to-learn effects were moderated by three predictors: treatment students taught or not taught how to apply writing activities, treatment teachers did or did not receive professional development, and teachers or researchers delivered the treatment. The average weighted ESs for writing activities taught (ES = 0.29) and not taught (ES = 0.32), professional development provided (ES = 0.30) or not provided (ES = 0.31), and teachers (ES = 0.28) or researchers (ES = 0.44) delivered instruction were all statistically significant (see Table 2). Considerable variability was evident for all predictors. The effects of writing-to-learn were not moderated by these three predictors, Q = 0.17, df = 3, p = .98.
Differences Related to Features of the Assessments
Five metaregressions were conducted to determine if features of the assessments used to gauge the effects of learning in the studies reviewed were related to variability of effects. Each metaregression included one predictor. It was necessary to conduct a separate analysis for the first metaregression as we had to eliminate eight studies: type of knowledge assessed in Bell and Bell (1985) could not be determined, and studies involving assessing synthesis (k = 4) and evaluation (k = 3) of knowledge were too few to include in the analysis. For the other four metaregressions, collinearity problems occurred when more than one predictor was included. This mostly happened because of the overlap in categories across predictors (e.g., norm-referenced, standardized, and less proximally aligned assessments).
For the first metaregression, average weighted ESs for assessing knowledge, comprehension, and application were 0.24, 0.30, and 0.45, respectively. The ESs for assessing comprehension and application of knowledge were reliably greater than no effect (see Table 2), but the ES for assessing knowledge was not. There was considerable variability in ESs for comprehension and application. The overall effect of writing-to-learn was not moderated by the highest level of knowledge assessed, Q = 3.31, df = 2, p = .19.
For the second metaregression, the average weighted ESs for types of assessment were the following: criterion-referenced measure = 0.39; open-ended response format = 0.38; norm-referenced = 0.31; multiple-choice = 0.10; and mixed-assessment format = 0.18. Each of these effects was statistically significant, except for multiple-choice and mixed-assessment formats (see Table 2). Norm-referenced measures were the only type of assessment that did not demonstrate a high or moderate degree of variability in ESs. The overall effects of writing-to-learn, however, were not moderated by type of assessment, Q = 4.70, df = 4, p = .32. We did not include four studies in this analysis as they did not clearly indicate the type of tests (Bell & Bell, 1985; Greer, 2010; Idris, 2009) or used an assessment other than the ones tested (Gillespie Rouse et al., 2017). We could not use all assessments in Parson (2013) and Pillsbury (2008) for the same reasons.
For the third metaregression, the average weighted ESs for writing-to-learn treatments that involved researcher-designed tests, tests from textbooks, and standardized assessments designed by test developers were 0.30, 0.24, and 0.31, respectively. Each of these effects was reliably greater than zero (see Table 2). High to moderate variability of ESs was evident for researcher-designed tests and standardized tests. The overall effects of writing-to-learn were not moderated by who designed the assessments, Q = 0.48, df = 2, p = .78. Jitendra et al. (2013) included researcher-designed and standardized assessments from test developers.
In the fourth metaregression, the average weighted ES of 0.29 for assessments that were more proximally aligned to content taught was statistically significant, as was the effect of 0.31 for assessments that were less proximally aligned to taught content (see Table 2). There was considerable variability in ESs only for more proximally aligned measures. The effects of writing-to-learn were not moderated by test alignment, Q = 0.09, df = 1, p = .77.
For the fifth metaregression, the average weighted ES of 0.32 for assessments and writing activities that were matched in terms of knowledge level measured and promoted was statistically significant. This was also the case for the average weighted ES of 0.34 for studies where the assessment measured a lower level of knowledge than the highest level of knowledge the writing activity was designed to promote. There was considerable variability in ESs only for studies where a lower level of knowledge was assessed (see Table 2). The effects of writing-to-learn were not moderated by the match between the highest level of knowledge assessed and highest level of knowledge promoted by the writing activity, Q = 0.14, df = 1, p = .71. We did not include studies in this analysis where the assessment measured a higher level of knowledge than the writing activity promoted, as there were only five of these investigations.
Differences Related to Treatment/Control Comparisons
Writing about content was effective when writing-to-learn treatments were compared to control conditions where students did not write or wrote significantly less than treatment students. The average weighted ES of 0.25 for the writing versus no writing comparison was reliably greater than no effect as was the ES of 0.49 for the more writing versus less writing comparison (see Table 2). There was considerable variability in the ESs for both of these treatment/control comparisons. The metaregression did not find that writing-to-learn effects were moderated by type of treatment/control comparison, Q = 3.69, df = 1, p = .055.
Differences Related to Study Quality
We conducted two metaregressions to determine if study quality moderated writing-to-learn effects. For the first metaregression, we examined the collective impact of the following quality indicators: study design (true experiment vs. quasi-experiment), reliability of measures (reliable vs. not reliable), ceiling/floor concerns for measures (no concerns vs. concerns), pretest equivalence (equivalent vs. not equivalent), N-of-1 studies (multiple groups per condition vs. only one group for one condition or more), teacher effects (effects controlled vs. effects not controlled), and overall attrition (overall attrition was >10% vs. <10%). We were unable to include differential attrition in this analysis due to collinearity problems with overall attrition. As can be seen in Table 2, the average weighted ESs for each level of the seven predictors and differential attrition were positive and reliably greater than no effect, except when attrition was >10% (ES = 0.05) or differential attrition was a problem (ES = 0.11). There was considerable variability in ESs. Collectively, study quality as indicated by the seven predictors identified above did not moderate writing-to-learn effects, Q = 11.38, df = 7, p = .12.
For the second metaregression, we examined the relation of peer review to magnitude of effects. Less than one half of the investigations (k = 24) included in this review underwent peer review. As a result, it was possible that the obtained writing-to-learn effects were moderated by differences in investigations that were and were not peer reviewed. The average weighted ES of 0.34 for peer-reviewed studies was statistically significant, as was the effect of 0.27 for studies that did not undergo peer review (see Table 2). Both types of studies evidenced considerable variability in ESs. The effects of writing-to-learn were not moderated by whether a study underwent peer review or not, Q = 0.38, df = 1, p = .54. Furthermore, we found that there was no statistical difference (F = 1.47, df = 1, 54, p = .23) in the overall quality score for studies that were peer reviewed (M = 4.79, SD = 1.81) or not peer reviewed (M = 4.25, SD = 1.52).
Discussion
In this meta-analysis, we examined the effects of writing on learning for Grades 1 to 12 students. We anticipated that writing about science, social studies, or mathematics content would facilitate learning, as both cognitive and social-cultural theories provide multiple explanations for why such effects are likely (see Galbraith & Baaijen, 2018; Klein, 1999; Klein & Boscolo; Silva & Limongi, 2019; Smagorinsky, 1995). Because different writing-to-learn activities can promote different types of thinking (Langer & Applebee, 1987) and the effects of writing can vary depending on context (Graham, 2018), we further anticipated there would be considerable variability in the obtained effects. We conducted multiple moderator analyses to determine if variability in effects was related to content area, grade level, features of the writing-to-learn activities, features of assessment, features of instruction, and quality of research methodology.
Writing Improves Learning in Science, Social Studies, and Mathematics
As predicted, writing about content in science, social studies, and mathematics enhanced learning. In 56 investigations across the three content areas, a statistically significant average weighted ES of 0.30 was obtained. The average weighted ESs for science (0.30), social studies (0.33), and mathematics (0.32) were virtually identical. While additional writing-to-learn research is needed in each of these content areas (science contained the largest number of studies at 26), this is especially the case for social studies. Only eight studies isolated the effects of writing-to-learn in this content area. The lack of studies in social studies in relation to science and mathematics is in direct contrast to classroom realities, where social studies teachers are more likely to use writing as a tool for learning than teachers in the other two content areas (Gillespie et al., 2014; Ray et al., 2016). It is possible that this imbalance is a result of how social studies researchers conduct instructional research. As Hicks et al. (2012) indicated, there is very little focus in social studies research on testing the effectiveness of specific strategies. Whatever the reasons for the lack of writing-to-learn investigations in social studies, this is an area that must be a priority in the future.
It is also important to note that the overall ES for writing-to-learn in this meta-analysis exceeded the average weighted ES of 0.17 obtained by Bangert-Drowns et al. (2004) with school-aged and college students and the average weighted ES of 0.23 reported by Graham and Perin (2007) with students in Grades 4 to 12. The effects for science were relatively equivalent in this meta-analysis and the one by Bangert-Drowns et al. (0.30 vs. 0.32). However, our analysis obtained larger effects for social studies (0.33 vs. 0.13) and mathematics (0.32 vs. 0.24). Effects for the three content areas were not provided by Graham and Perin.
The current meta-analysis provided a more reliable and valid estimate of the effects of writing-to-learn in these three content areas than previous meta-analyses, as it included more studies and used more stringent controls than the two previous reviews. In contrast to the prior meta-analyses, we directly established that equal instructional time and content exposure were provided to treatment and control students. Quasi-experiments were included only if the authors reported a pretest learning measure and grades were not used as a measure of learning. We also operationalized and coded studies for reliability. In addition, we applied sophisticated data analysis procedures, including winsorizing outliers, adjusting quasi-experiments for clustering effects, correcting for small sample size bias, and examining possible publication bias.
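Two of the procedures named above can be illustrated with a minimal sketch. The code below is not the authors' analysis script. It assumes the common Hedges approximation for the small-sample correction and a simple clamp-to-bounds form of winsorizing. The function names, the cutoff values, and the example numbers are hypothetical and are not taken from the studies reviewed.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with a small-sample correction.

    Assumption: the correction factor is the usual approximation
    J = 1 - 3 / (4 * df - 1), with df = n_t + n_c - 2.
    """
    df = n_t + n_c - 2
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / pooled_sd      # uncorrected effect size
    correction = 1 - 3 / (4 * df - 1)      # shrinks d slightly when samples are small
    return d * correction


def winsorize(values, lower, upper):
    """Pull any value outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical example: 10 students per condition, one-SD raw difference.
g = hedges_g(mean_t=10, mean_c=8, sd_t=2, sd_c=2, n_t=10, n_c=10)

# Hypothetical cutoffs; a real analysis would derive them from the ES distribution.
trimmed = winsorize([-1.2, 0.3, 2.5], lower=-0.5, upper=1.0)
```

Winsorizing keeps every study in the data set while limiting the influence of extreme values, which is why it is often preferred to dropping outliers.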
This is not to say, however, that writing about science, social studies, or mathematics always enhanced learning in the studies reviewed here. A negative effect was obtained in 18% of our studies. The variables coded in this review offered limited insight, though, into why these studies were not successful. A post hoc comparison of studies with negative and positive effects revealed that these investigations were similar on most of the coded variables. The only variables that should have theoretically enhanced writing-to-learn effects and were at least 15% less common in studies with negative outcomes were professional development (21% less common), number of days writing activities were applied (21% less common), metacognitive prompting (17% less common), and writing activities that promoted higher levels of learning (activities that promoted evaluation were 25% less common). These differences were not pervasive enough to make any of these variables a singular negative catalyst across studies. It is possible that one or more of these variables did exert a strong negative influence in a specific investigation, or they may have operated in tandem to produce negative effects across multiple studies.
In order to better understand why writing-to-learn studies do or do not produce positive learning effects, future research needs to (1) better describe studies, (2) expand the variables examined, and (3) systematically examine the relationship between writing-to-learn outcomes and variables that potentially moderate these effects. We provide specific examples below.
Two possible reasons why some studies produced negative results are that the target writing activities were poorly constructed or poorly applied. In the current review, it was not possible for us to evaluate either of these variables, as researchers did not consistently provide enough information to determine all of the mental operations writing activities were designed to provoke (e.g., rehearsal, organization, elaboration, transformation, goal setting) or how students actually used them while learning. Rich descriptions of these and other variables (e.g., instructors, participants, context, content to be learned, and assessment procedures) are needed to better understand why a writing-to-learn study is or is not successful.
We were also not able to examine all of the variables we considered important to the success of writing as a learning activity, as they simply were not studied by researchers. For instance, it is reasonable to expect that students’ capability as writers might influence the success of a specific writing activity, particularly when the writing task is complex. Even when learners are capable of applying a writing activity, though, they may fail to do so if they do not value writing or learning (Holliday et al., 1994). Measures of such attributes were virtually nonexistent in the 56 studies reviewed here. We encourage researchers to expand the variables in their studies to include additional ones that are theoretically important to the success of writing-to-learn.
Researchers should also take a more active approach to examining if promising variables predict the success of writing-to-learn activities. For example, girls are generally better writers than boys, and students from more affluent families tend to write better than ones from poorer families (Graham, 2018). Researchers can determine if these variables are related to learning outcomes by making gender and socioeconomic status independent variables in studies. Other potentially promising variables such as teachers’ motivations or their views about the acceptability of writing activities can be examined by correlating teachers’ scores on such measures with how much students learn. Such analyses were rarely conducted in the studies reviewed here.
The Factors That Create Variability in Writing-to-Learn Effects Are Uncertain
While writing generally enhanced learning in this meta-analysis, there was considerable variability in effects (ESs ranged from 1.67 to −0.74). Over 70% of the variability in effects across studies was due to factors other than chance. Because we anticipated that writing-to-learn effects would be variable, we constructed a number of theoretically derived moderator analyses. Despite our prediction that the moderators tested would account for excess variability in observed outcomes, this was not the case. There were no statistically significant differences by content area (science, social studies, and mathematics), grade (elementary, middle, and high school), features of writing activities (i.e., types of writing, promotion of analysis and interpretation, inclusion of metacognitive prompting, and highest level of learning promoted), features of instruction (professional development, teaching writing-to-learn activities, instructor, number of days of instruction, and number of days writing activities were applied), features of assessments (highest level of knowledge assessed, type of test, who designed the assessment, alignment of assessment with content, and the match between the level of knowledge assessed and level of knowledge the writing activity was designed to promote), or study quality.
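The statement that a given share of variability is due to factors other than chance is typically based on a heterogeneity index such as I². The sketch below is an illustration only: it assumes inverse-variance weights and the standard Q and I² formulas, which may differ from the exact model used in this review. The function name and the input values are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return the weighted mean ES, Cochran's Q, and I-squared (in percent).

    Assumption: each study is weighted by the inverse of its sampling variance.
    """
    weights = [1 / v for v in variances]
    total_weight = sum(weights)
    mean_es = sum(w * e for w, e in zip(weights, effects)) / total_weight

    # Q sums the weighted squared deviations of each ES from the weighted mean.
    q = sum(w * (e - mean_es) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I-squared is the share of Q that exceeds what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return mean_es, q, i_squared


# Hypothetical two-study example with equal precision.
mean_es, q, i_squared = heterogeneity([0.0, 1.0], [0.1, 0.1])
```

An I² near zero would suggest that the observed spread in effects is about what sampling error alone would produce, whereas larger values motivate the kind of moderator analyses described above.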
Findings from the moderator analyses reported above were not completely consistent with Bangert-Drowns et al. (2004). They found that grade level was related to variability of effects, with middle school studies producing smaller effects than elementary, high school, and college studies. Differences in outcomes in the two meta-analyses are likely due to the addition of new middle school studies in this review that produced more positive effects. They further found that writing activities that prompted metacognition were more effective than ones that did not. It is not readily clear why similar effects were not obtained in the current review. This was probably not due to the inclusion of college students in Bangert-Drowns et al., as the average effects for college (0.47; k = 5) and school-aged students (0.40; k = 12) were similar.
We also thought that it was possible that specific design features of the experiments in the 56 studies reviewed would account for excess variability in effects. This was also not the case, as there were no statistically significant differences between measures of study quality (peer review, type of experimental design, reliability of measures, ceiling/floor effects for measures, pretest equivalence, teacher effects controlled, level of attrition, and more than one instructional group in each condition) or the type of treatment/control comparison (writing vs. no writing; more writing vs. less writing).
While our findings from the moderator analyses are consistent with the proposition that the effects of writing-to-learn are generally constant across different disciplines, grade level of students, writing activities, instructional procedures, types of outcomes, and design features of studies, the effects of writing on learning are more situational than these findings imply for three reasons. One, as noted earlier, not all studies produced a positive effect and the reasons why this was the case were not clear. Two, our moderator analyses involved 56 studies. It is possible that we did not have enough statistical power to detect differences that did exist. For instance, the variable that produced the largest effect, argumentative writing (0.42), and the one that resulted in the smallest effect, attrition greater than 10% (0.05), were based on just seven and nine studies, respectively. The impact of argumentative writing was more than double that of at least one of the other types of writing, graphical representation, included in its moderator analysis, whereas the effect for attrition greater than 10% was 7 times smaller than that of its contrast, attrition less than 10%. As additional studies accumulate over time, it is possible that future meta-analyses with more statistical power will be more successful in identifying factors that moderate writing-to-learn effects.
As more studies become available, it should also be possible to examine the interactions between potential moderators. For example, argumentative writing may be differentially effective depending on students’ grade. There were not enough studies in this review to examine such interactions. Instead we had to concentrate on main effects, analyzing grade and type of writing separately. Consequently, it is possible that the moderator variables examined in this review may account for variability in writing-to-learn effects but only in terms of their relationship with other variables. This proposition should be tested in future meta-analyses.
Three, and most important, the substantial variability in the effects obtained in this meta-analysis provided the strongest evidence that writing-to-learn effects are not constant across all situations. If we are to identify the sources of this variability, we need to isolate and test additional theoretically derived moderators. As indicated in the previous section, this requires that not only must researchers provide richer descriptions of their studies so that additional moderators can be coded in future meta-analyses, but they must also purposefully expand the moderators included in their investigations. We illustrate a broad array of features that may moderate writing-to-learn effects by drawing on a model proposed by Graham (2018). This is not the only model of writing that could be used for this purpose, but it is the only model available that conjointly describes the social and cognitive aspects of writing.
In the current meta-analysis, it was especially difficult to identify contextual variables that might moderate writing-to-learn effects. As Klein (1999) noted, context is poorly described in such studies. Graham’s (2018) model provided a useful organizing structure for more fully describing context and identifying new moderators, as it specified seven contextual features that shape and constrain how writing, including writing-to-learn, operates in a writing community: (1) purposes for writing (goals, values, norms, audiences, and motivations for writing as well as the stances and identities writing fosters), (2) community members (teachers, mentors, writers, collaborators, and readers as well as their roles, responsibilities, participation, familiarity with community practices, and the value and power within the community), (3) writing tools (tools for writing including paper and pencil and digital tools), (4) writing actions (sanctioned and typical practices used to accomplish writing purposes), (5) written products (text completed and in production, plans, source and content material, and modeled text), (6) physical and social environment (physical and digital locales, relations among community members, and social practices for writing), and (7) the collective history of the writing community (past and current events that shape how writing is enacted in the community, which are shaped by broader social-cultural, political, institutional, and historical forces).
While it is highly unlikely that researchers will describe all of the contextual features described above, this model provides a blueprint for better contextualizing writing-to-learn studies. It also draws attention to multiple features of context that may moderate writing-to-learn effects. We encourage researchers to not only describe such features in future studies (especially purposes for writing, content to be learned, and common practices used to promote learning and writing) but also increasingly test the effects of these moderators in their investigations.
As noted in the previous section, studies in the current meta-analysis provided little to no information about what students actually did when applying writing-to-learn activities, nor did they devote much attention to identifying cognitive factors that affected these mental processes. Graham’s (2018) model described the cognitive activities students use as they write, including writing-to-learn (Graham, 2019). According to this model, students engage in a variety of activities while writing-to-learn, including making decisions about what to do and learn as well as how much effort to commit. These decisions are driven by students’ beliefs (e.g., value and utility of the writing activity and content to be learned, perceptions of competence as a writer and learner, and identity as a writer and learner), which in turn fuel effort. This provides the impetus for drawing on resources, including content knowledge, from long-term memory as well as content presented in class, as writers use the control processes of attention, working memory, and executive control to initiate, direct, and sustain writing processes (conceptualization, ideation, translation, transcription, and reconceptualization) that engage students in thinking that facilitates learning (e.g., rehearsing, elaborating, evaluating, culling, connecting, and organizing ideas). These actions are further affected by psychological, physical, and biological factors including emotions, personality traits, and physical states.
As with context, Graham’s (2018) model provided a useful description of cognitive factors that are likely to moderate writing-to-learn effects, including students’ goals, beliefs about writing and learning, current knowledge of content, functioning of control mechanisms (including students’ success in using the thinking processes evoked by writing activities), writing capabilities, emotional reactions to writing and learning, personality traits, and physical state (e.g., stress). Again, we do not expect that researchers will address the moderating effects of all of these factors in every future study, but some of them should become routine fixtures in these investigations (e.g., prior knowledge, writing ability, and description of the cognitive activities writing was designed to evoke), and all of them should increasingly become more common in future research. These same variables (goals, knowledge and beliefs, cognitive processing, writing capabilities, emotions, personality traits, and physical states) also influence teachers’ actions (Graham & Harris, 2018) and provide additional candidates for moderating variables.
Implications for Theory
Cognitive theories of writing-to-learn contend that writing elicits both implicit and explicit use of cognitive operations and structures that facilitate learning (e.g., Galbraith & Baaijen, 2018; Klein, 1999; Klein & Boscolo, 2016). Evidence from the current meta-analysis provides support for the proposition that explicitly controlled cognitive processes promote learning. All studies reviewed used writing activities that prompted students to use cognitive operations to facilitate learning. To illustrate, 77% of studies encouraged students to apply writing activities involving analysis and interpretation, metacognitive prompting, or both. In the remaining studies, students used writing to identify important information, to organize the information, or both. We cannot draw any conclusions about the specific cognitive mechanisms that facilitated learning or the impact of genre on writing-to-learn, though, as none of our moderator analyses were statistically significant and writing activities and their use were not described consistently in sufficient detail to permit a finer-grained analysis of specific mental operations (e.g., rehearsal, elaboration, comparisons, organization). Nor can we draw specific conclusions about implicit processing of information, as the studies reviewed did not purposefully assess such learning.
The current review also provides support for a social-cultural view of writing-to-learn, demonstrating that writing is a tool students can use in a goal-directed fashion to successfully construct meaning in content classes. Our findings also support the social-cultural tenet that writing effects are not automatic (Smagorinsky, 1995) but vary. Positive effects were not obtained in 18% of studies reviewed here, and there was considerable variability in effects across studies. Evidence that writing-to-learn effects are more likely in classrooms where writing is valued and sanctioned was not substantiated, as we did not find that larger effects were obtained in studies where writing was more commonly used to promote learning.
Implications for Practice
This meta-analysis provides current evidence that science, social studies, and mathematics teachers can reasonably expect that asking their students to write about content material will enhance learning. Writing-to-learn improved students’ comprehension and application of content knowledge, including the learning of material that was not the immediate focus of instruction (as assessed by more distal measures of learning).
Based on the findings from this review and how writing-to-learn treatments were instantiated, we provide the following observations for practice. One, writing is a useful learning tool across grades, and should be applied accordingly. Two, students can apply writing in multiple ways to improve content learning. This includes using writing to promote knowledge/comprehension, analysis/synthesis, and evaluation of content through informational, argumentative, and journal writing that involves metacognitive reflection but may or may not involve analysis and interpretation. Three, writing-to-learn activities can be profitably integrated into science, social studies, and mathematics classes as a common and frequent element of instruction. In the studies reviewed here, students used writing to promote learning an average of 30 days per study. Four, the incorporation of writing into content learning is not always effective, and some features of writing activities may require special scrutiny. While types of writing (informational, argumentation, journal, and graphical representation) and metacognitive prompting (present/not present) did not moderate writing-to-learn effects, graphical representation and no metacognitive prompting failed to produce statistically significant effects.
As with any research-supported practice, caution must be exercised when teachers implement writing-to-learn practices in their classes. First, we recommend that teachers carefully match writing activities with their goals for promoting content learning. It is important that students have the prerequisite writing skills needed to use such writing activities to support learning, which may require that teachers provide instruction on how to use the writing activity (Cross, 2008). Second, teachers should not assume that writing activities that were effective in the research studies reviewed here will automatically be effective in their classrooms. The conditions that exist in a research study and a teacher’s class will not be identical. When using writing-to-learn activities, we encourage teachers to monitor if writing activities achieve the desired goals, and to make necessary adjustments if this is not the case. Moreover, the effective use of writing-to-learn assumes that teachers are adequately prepared to use these procedures. Many teachers indicate they are not prepared to do so (Ray et al., 2016).
Limitations and Additional Research Recommendations
Our meta-analysis located only 56 studies that met our inclusion/exclusion criteria. It is notable that no studies examined writing-to-learn in Grade 12. The number of studies that used narrative or argumentative writing tasks for learning was limited. Only three studies involved digital writing tools, and it was rare for researchers to examine if the same writing activity was effective across multiple studies. Additional high-quality intervention research is needed to draw a fuller and more nuanced picture of the effects of writing on learning.
The findings from our review must further be tempered by the fact that we were unable to draw any conclusions about the effects of writing on students’ learning over time. Maintenance data on learning effects were virtually nonexistent, and this must be rectified in future studies.
With the exception of grade level, we were not able to draw any conclusion about the effects of writing-to-learn on different types of students. This was because researchers inconsistently described participant characteristics, and they rarely provided outcome data for different groups. This must be rectified in future studies. At a minimum, findings should be reported by grade, gender, race, socioeconomic status, prior knowledge, and writing skills.
While we conducted a comprehensive search for studies, it is impossible to locate every pertinent investigation. We do not believe that missing studies were a significant problem, though, as our post hoc analytic strategies (i.e., trim and fill and the Begg and Mazumdar rank correlation test) did not suggest that bias in the studies reviewed was evident.
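The logic of a rank correlation test for publication bias can be sketched as follows. This is an illustrative simplification, not the procedure as run in this review: it assumes the usual formulation in which each effect is standardized around the inverse-variance weighted mean and then rank-correlated with its sampling variance using Kendall's tau, and it omits tie handling and the significance test. The function name and the data are hypothetical.

```python
import math


def begg_rank_correlation(effects, variances):
    """Kendall's tau between standardized effects and their variances.

    Assumptions: inverse-variance weighted mean, no tied values, and no
    p-value computation (only the correlation itself is returned).
    """
    total_weight = sum(1 / v for v in variances)
    mean_es = sum(e / v for e, v in zip(effects, variances)) / total_weight

    # Standardize each deviation by the variance of (effect - weighted mean).
    standardized = [
        (e - mean_es) / math.sqrt(v - 1 / total_weight)
        for e, v in zip(effects, variances)
    ]

    # Count concordant minus discordant pairs across all study pairs.
    k = len(effects)
    score = 0
    for i in range(k):
        for j in range(i + 1, k):
            product = (standardized[i] - standardized[j]) * (variances[i] - variances[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (k * (k - 1) / 2)


# Hypothetical data in which less precise studies report larger effects.
tau = begg_rank_correlation([0.1, 0.3, 0.6, 0.9], [0.01, 0.02, 0.05, 0.10])
```

A tau close to zero is consistent with no small-study bias, whereas a strong positive value, as in this contrived example, would indicate that less precise studies tend to report larger effects.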
Finally, while formal scientific review (peer review vs. no peer review) and study quality (e.g., study design, attrition) did not moderate obtained effects in this review, researchers need to ensure that future studies are even more sound than the studies reviewed here, and that all procedures are adequately described. This includes better addressing teacher effects and reliability of measures as well as decreasing/eliminating N-of-1 investigations and administering treatment fidelity measures in every investigation.
Authors
STEVE GRAHAM is the Warner Professor in the Division of Educational Leadership and Innovation in Mary Lou Fulton Teachers College, Arizona State University, H. B. Farmer Education Building, 1050 S Forest Mall, Tempe, AZ 85281; email:
SHARLENE A. KIUHARA is an assistant professor in the Department of Special Education at the University of Utah, 1721 Campus Center Drive, SAEC 2275, Salt Lake City, UT 84112; email:
MEADE MACKAY is a doctoral student at the University of Utah, 1721 Campus Center Drive, SAEC 2275, Salt Lake City, UT 84112-9253; email:
