Abstract
The continuation task, a new form of reading-writing integrated task in which test-takers read an incomplete story and then write the continuation and ending of the story, has been increasingly used in writing assessment, especially in China. However, language-test developers’ understanding of the effects of important task-related factors on test-takers’ performance with regard to this task is still in its infancy. In this study we investigate the effect of prompt type on English as a foreign language (EFL) learners’ writing performance and writing strategy use in a continuation task. Four groups of Chinese EFL learners performed a continuation task with four different prompts and filled out a writing strategy questionnaire. The participants’ continuations were scored holistically and textually analyzed using a range of fluency, grammatical accuracy, lexical complexity, syntactic complexity, cohesion, and source-use features. Prompt type significantly affected the participants’ overall continuation writing scores, syntactic complexity, cohesion, and source-use features. It also significantly affected the participants’ monitoring strategy. We discuss how continuation-task conditions, such as providing opening sentences or key words (or both) for test-takers to use, will affect how the test-takers orient themselves to the writing task and, concomitantly, may affect performance outcomes.
Introduction
In recent years, integrated reading-writing tasks have been increasingly used in large-scale, high-stakes language tests due to their high degree of authenticity, potential for content-bias reduction, and positive washback (Cho, Rijman, & Novak, 2013; Plakans, 2015; Shin & Ewert, 2014). In the context of China, the continuation task, a new form of an integrated reading-writing task that requires learners to read and continue an incomplete story, is increasingly favored in language pedagogy and assessment. In 2016, the continuation task was adopted for the first time in the National Matriculation English Test (Zhejiang Province version) (hereafter NMET-ZJ), one of the largest-scale and most impactful standardized tests in China (see Cheng & Qi, 2006, for a description, and Zhao, 2016, for an overview of the test’s impact). The increased attention to this task is based on research evidence of its facilitative effect on language learning (Wang, 2012, 2015). Nevertheless, our understanding of the effects of key task-related factors on test-takers’ performance in this task is still in its infancy. Previous studies on task-related variables in the continuation task focused on the properties of the source text, such as the perceived level of its interest (Xue, 2013) and its degree of linguistic complexity (Peng, Wang, & Lu, 2020). The effect of prompt type, a key consideration in writing task design, has not yet been systematically examined. With the current study, we fill this research gap by investigating the effect of prompt type on test-takers’ writing performance and writing strategy use in the continuation task. In the rest of this section, we first discuss the theoretical basis of the continuation task and empirical research supporting its language learning potential and its validity and reliability as a writing assessment instrument. We then review previous research on the effect of writing prompts on writing performance. 
We present our research questions at the end of this section.
The continuation task
Amidst continued efforts to develop valid writing tasks to assess language learners’ writing ability, integrated reading-writing tasks that require test-takers to integrate information from source materials in producing their own texts (Plakans, 2015) have gained increasing popularity. A new form of integrated reading-writing task is the continuation task, which requires learners to read an incomplete story and continue it in a logical and coherent way (Wang & Wang, 2015). This task was initially theorized as an activity with substantial language learning potential by Wang (2011, 2016). Based on the Interactive Alignment Model (Pickering & Garrod, 2004), which assumes that successful communication in first language dialogues is achieved by interaction and alignment between interlocutors, Wang (2011, 2016) extended the concept of alignment to the interaction between second language (L2) learners and reading materials. He argued that the provision of an open-ended source text in the continuation task creates a contextual and linguistic basis for learners to complete the writing task and promotes learning by facilitating learner interaction and alignment with the source text as well as the coupling of comprehension with production. Empirical studies reported that the continuation task can help reduce form-based errors (Wang & Wang, 2015), generate more gains in language accuracy and complexity than topic-based writing tasks (Jiang, 2015; Jiang & Chen, 2015), stimulate learner imagination, reduce anxiety, and foster a sense of writing achievement (Zhang, 2016).
Along with its learning potential, the continuation task has been seen to be suitable for evaluating test-takers’ integrated reading-writing ability in language testing (Liu & Chen, 2016; Wang & Qi, 2013). Wang and Qi (2013) systematically evaluated the continuation task’s rating scale validity, scoring reliability, task difficulty, and consensus validity in large-scale, high-stakes, L2 proficiency testing contexts using a multi-facet Rasch model and other statistical methods. Their results indicated that the task constituted a valid way to measure test-takers’ integrated reading-writing ability and that it reliably discriminated test-takers’ L2 proficiency.
Informed by the theoretical basis of the continuation task and empirical support for its learning potential and its validity and reliability as a writing assessment instrument, the task was for the first time implemented in NMET-ZJ in 2016 (see Appendix A, Prompt 4) (Liu & Chen, 2016; Wang, 2016). Specifically, it required test-takers to read a 334-word incomplete story about Jane’s unpleasant camping experience and then complete the story in two additional paragraphs using the opening sentences provided for the paragraphs, at least five key words underlined in the source text, and at least 150 words. In accordance with Bachman and Palmer’s (1996) framework of language task characteristics, the task instructions and input text constitute the input of the task, which test-takers were expected to process and respond to.
Several studies examined the effects of task-related variables on test-takers’ writing performance in the continuation task, suggesting that more interesting (Xue, 2013) and less complex (Peng et al., 2020) input texts and instructions explicitly encouraging alignment with the input text (Xin, 2017; Yuan, 2013) lead to higher writing accuracy and/or stronger alignment with the input text. However, the effect of prompt type has not yet been systematically examined. Furthermore, previous research on the continuation task has focused on college-level learners and assessment contexts (e.g., Peng et al., 2020; Yuan, 2013), while the adoption of this task in NMET-ZJ calls for attention to high school learners and assessment contexts. With the current study we aim to fill these research gaps.
Writing prompt
Broadly speaking, a writing prompt refers to any stimulus provided by a writing task for test-takers to respond to in a writing assessment (Kroll & Reid, 1994). In a narrower sense, it is conceived as a specific set of task instructions or requirements given to test-takers (Weigle, 2002). In the current study, we use this term in its narrower sense.
The writing prompt has been considered a critical task-related variable that may affect test-takers’ writing performance, along with other variables such as general task type (e.g., integrated vs. independent writing), genre (e.g., narrative vs. argumentative writing), and topic (e.g., a general vs. specific topic) (He & Shi, 2012). Theoretically, these variables were posited to elicit differential writing performance from either a task complexity (e.g., Kormos, 2011; Yang, Lu, & Weigle, 2015) or communicative functional perspective (e.g., Biber, Gray, & Staples, 2016). Research from the task complexity perspective generally sought to support either Robinson’s (2001) cognition hypothesis, which hypothesized that a more complex task would elicit more complex and accurate language, or Skehan’s (1998) limited attentional capacity model, which suggested the opposite as a result of learners’ limited attention capacity. Research from the communicative functional approach correlated linguistic variation among different task types and genres with the different communicative functions of the discourses they elicited (Biber & Conrad, 2009).
The task demands of a prompt may affect test-takers’ orientation to the task and their allocation of attentional resources during the writing process, which in turn may affect their writing performance (e.g., Brossell, 1983; Robinson, 2001). In Robinson’s (2001) cognition hypothesis, the number of elements involved in the task requirements is an important resource-directing factor that can be manipulated to increase or decrease the cognitive demands of the task, and subsequently impact test-takers’ task performance. Robinson (2001) argued that task prompts involving more elements may consume more attentional, memory and reasoning resources, resulting in increased accuracy and production complexity. Furthermore, increased task complexity may lead to more interaction with the task environment and negotiation for meaning, and subsequently greater noticing and integration of the input in production (Robinson, 2001). Meanwhile, Brossell (1983) found that a prompt with moderate information-load elicited higher quality essays than one with complete or no specification. He hypothesized that moderate-load prompts helped test-takers focus more than low-load prompts, while high-load prompts may overload test-takers and deplete their attention.
In addition to requiring test-takers to develop the story logically and coherently, the prompt of the continuation task in NMET-ZJ included two extra elements that would increase test-takers’ processing load. Test-takers needed to allocate attentional resources to frame the story based on the opening sentence of each paragraph and to purposefully use the target key words underlined in the source text. In light of the claims of the cognition hypothesis, we aimed to gain insights into the potential effects of prompts with differential cognitive demands on test-takers’ writing performance. To this end, we included four prompts involving none, one, or both of the two additional elements discussed above in the current study (see Appendix A).
Previous studies of the effect of prompt type have examined differences in test-takers’ writing scores and the textual features of their responses. Way, Joiner, and Seaman (2000) investigated the effects of three prompts (bare, vocabulary, and prose model) on the writing performance of students across three proficiency levels. They found that the prose model prompt produced the highest mean writing scores and the bare prompt the lowest. Bahrebar and Darabad (2013) reported similar results in their examination of the effects of the same three prompt types on the overall writing quality of Iranian intermediate EFL learners. In terms of textual features, it has been reported that a more explicit prompt tended to elicit more fluent and accurate production (Way et al., 2000), lower syntactic complexity (O’Loughlin & Wigglesworth, 2007), and more varied modes of argumentation (He & Sun, 2015). For the continuation task, previous investigations of prompt effect are limited to Yuan’s (2013) and Xin’s (2017) analyses of the effect of explicitness of task instructions on alignment with the source text and target structure use. More research on the prompt effect on test-takers’ writing performance in terms of writing scores and a more comprehensive set of textual features will be useful.
Although not a focus in previous studies of prompt type, test-takers’ writing strategy use has been examined in previous research on other task-related variables in writing assessment. Through think-aloud protocols and retrospective interviews, Plakans (2008) found that L2 writers performing reading-to-write tasks adopted more discourse synthesis strategies, whereas those performing writing-only tasks engaged in more initial planning. Through stimulated recall interviews with test-takers, Chapman (2016) found that such prompt characteristics as domain and response mode may affect test-takers’ processes and strategies for prompt selection, response planning, and response organization. The effect of prompt type on test-takers’ writing strategy use in the continuation task has not yet been examined.
The present study
In this study we examine the effect of prompt type on Chinese high-school EFL learners’ writing performance and writing strategy use in the continuation task. Specifically, we seek to address the following three research questions:
Does prompt type affect test-takers’ overall writing scores in the continuation task?
Does prompt type affect the textual features of test-takers’ responses to the continuation task?
Does prompt type affect test-takers’ writing strategy use in completing the continuation task?
Methodology
Participants
Our participants were 120 12th-grade EFL learners (68 female, 52 male) at a high school in Guangdong Province, China. To eliminate the effect of writing proficiency, we employed homogeneous sampling to select learners with comparable writing proficiency based on their performance in the English writing task in the most recent midterm exam at the school. The midterm exam was designed by local testing experts to be a mock test for the upcoming NMET (Guangdong Province version) (hereafter NMET-GD). The task required learners to write a letter to the editor of a weekly newspaper to offer positive comments and suggestions for improvement. This writing task aligned with the type of writing task commonly used in NMET-GD. The learner responses were rated by their English teachers using the NMET rating scale. We selected the 120 participants, all of whom scored 12–13 out of a total of 15 points on this pre-writing test, because they represented the largest sub-group of the learners. We randomly divided these participants into four groups (30 per group). A one-way ANOVA confirmed that there were no significant between-group differences in the writing scores (F(3, 116) = .72, p > .05).
Materials
We adapted the continuation task of NMET-ZJ 2016 into four versions, each containing one of the following prompts (see Table 1 and Appendix A) and each administered to one group: a bare prompt (Group 1), a framed prompt requiring test-takers to continue the story based on the opening sentence provided in each paragraph (Group 2), a vocabulary prompt requiring test-takers to use at least five underlined key words in the source text (Group 3), and a framed vocabulary prompt requiring test-takers to continue the story using the opening sentence provided in each paragraph and at least five underlined key words in the source text (Group 4). All words in the passage were among the list of words expected to be mastered by 12th-grade learners in China, with the exception of one, which was glossed in Chinese. We explained the benefits of the continuation task for language learning as well as its adoption in NMET-ZJ to the participants prior to the study and provided scores and written feedback to those participants who requested them.
Table 1. Four versions of the continuation task used in the study.
Procedure
The four groups completed the continuation task with one of the four prompts and subsequently filled out the writing strategy questionnaire in four classrooms at the high school simultaneously. The continuation task was administered by English teachers at the school, who had received training from the researchers, under conditions simulating those of authentic formal tests. Each participant received a packet containing a task sheet, an answer sheet, scratch paper, and a writing strategy questionnaire. The teacher in each classroom read the task instructions on the task sheet, ensured that all students understood them, and wrote down the time limit (40 minutes) on the blackboard. Upon finishing the continuation task, the participants filled out the questionnaire in five minutes before leaving the classroom.
Measures
One researcher and an English teacher experienced in rating NMET writing scored the participants’ continuations following the adapted five-level holistic scoring rubric for the continuation task in NMET-ZJ 2016, with a total score of 25 (see Appendix B). The rubric contained four primary criteria: (1) connection to the main ideas of the source text and fulfillment of the task requirements; (2) story development and richness of content; (3) grammatical structures and vocabulary; and (4) structure and coherence. The raters first went through the prompts, reviewed the rating scale, and analyzed writing samples exemplifying different score levels (Wang & Zhang, 2017). They then each scored a few continuations independently and discussed their differences. Subsequently, they independently scored each of the remaining continuations. Inter-rater reliability, assessed using Pearson’s correlation analysis, was high (r = .88, p < .01). When the two scores were no more than five points apart, the average score was taken as the final score. Otherwise, another English teacher experienced in rating NMET writing provided an additional score and the final score was the average of the third score and the original score closer to it.
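The score-resolution rule described above can be sketched as a small function. This is only an illustrative sketch: the function name `final_score`, the parameter `max_gap`, and the example ratings are hypothetical, and the text does not say how a tie (a third score equally distant from both original scores) was handled.

```python
def final_score(score_a, score_b, third_score=None, max_gap=5.0):
    """Resolve two holistic ratings into one final score.

    If the two ratings differ by no more than max_gap points, return their mean.
    Otherwise a third rating is required, and the result is the mean of the
    third rating and whichever original rating lies closer to it.
    """
    if abs(score_a - score_b) <= max_gap:
        return (score_a + score_b) / 2
    if third_score is None:
        raise ValueError("ratings differ by more than max_gap; a third rating is needed")
    # Pick the original rating nearest to the third rating.
    # Tie-breaking is an assumption here: min() keeps score_a on equal distances.
    closer = min((score_a, score_b), key=lambda s: abs(s - third_score))
    return (third_score + closer) / 2


# Hypothetical ratings on the 25-point scale
print(final_score(16, 19))      # ratings within five points: simple average
print(final_score(10, 18, 16))  # ratings eight points apart: third rating used
```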
We analyzed the continuations using 21 features of fluency, grammatical accuracy, lexical complexity, syntactic complexity, cohesion, and source use (see Table 2). We chose these features because they have been reported to correlate with L2 writing quality (Chapman, 2016; Cumming et al., 2005; Gebril & Plakans, 2013; McNamara, Graesser, McCarthy, & Cai, 2014) and because they were judged to be meaningfully related to the evaluation criteria in the rating scale.
Table 2. Summary of measures used to analyze test-takers’ continuations.
Given that the writing task was timed, we measured writing fluency as the total number of words in each continuation (Gebril & Plakans, 2013). We obtained word counts through Coh-Metrix (McNamara et al., 2014).
Grammatical accuracy is understood as the ability to be free from grammatical errors while using language to communicate (Plakans, Gebril, & Bilki, 2016). Consistent determination and classification of grammatical errors in L2 production have been shown to be challenging (e.g., Cumming et al., 2005). We opted for a simple holistic measure of grammatical accuracy following Cumming et al. (2005) and Gebril and Plakans (2013). This measure uses a three-point scale to characterize the grammatical accuracy of a writing sample, with 1 indicating many errors (e.g., over three per T-unit, often affecting comprehensibility), 2 some errors (two to three per T-unit; comprehensibility largely unaffected), and 3 few or no errors (comprehensibility unaffected). Grammatical accuracy scoring was independently performed by the same two raters, with an inter-rater reliability index (measured using Pearson’s correlation) of .84 (p < .01).
Lexical complexity refers to the variation and sophistication of the words in a text (Lu, 2012). Based on previous research on lexical complexity (McNamara et al., 2014; Riazi, 2016) and the narrative nature of the continuation writing task, we adopted the following four measures from Coh-Metrix: (1) Measure of Textual Lexical Diversity (MTLD), a measure of lexical diversity found not to be affected by text length; (2) incidence of content words (i.e., number of nouns, adverbs, adjectives, and main verbs per 1000 words); (3) concreteness of content words, a measure of the extent to which the content words are concrete or abstract; and (4) imageability of content words, a measure of the ease of constructing mental images for the content words.
Syntactic complexity, that is, the degree of sophistication and variation of the structures produced, has been operationalized in many ways (Lu, 2017). We adopted five indices incorporated in Coh-Metrix: (1) mean sentence length; (2) number of words before the main verb; (3) number of modifiers per noun phrase; (4) passive voice density (i.e., number of agentless passive forms per 1000 words); and (5) syntactic similarity, a measure of the extent to which adjacent sentences in a sample have similar structures. A higher value in the first four measures is associated with a higher degree of syntactic sophistication, whereas a higher degree of syntactic similarity is associated with a lower degree of syntactic variation (McNamara et al., 2014).
Cohesion features are explicit characteristics in a text that help create cohesive links between ideas and clauses (McNamara et al., 2014). We assessed the cohesion of the continuations using six incidence scores and three Latent Semantic Analysis (LSA) indices in Coh-Metrix. The incidence scores were for all connectives, causal connectives (e.g., because), logical connectives (e.g., if), adversative and contrastive connectives (e.g., although), temporal connectives (e.g., when), and additive connectives (e.g., moreover). The LSA indices were LSA similarity between adjacent paragraphs, LSA similarity between adjacent sentences, and LSA Given-New, which estimates the proportion of new information in each sentence. The LSA indices range from 0 to 1, with a higher value associated with greater cohesion (McNamara et al., 2014; Riazi, 2016).
Source use refers to the extent to which a continuation aligns with the source text, operationalized as the proportion of the top 20 most frequent four-word sequences in the continuation that were source-oriented (i.e., they also appeared in the source text) (Wang & Wang, 2015). Specifically, following Wang and Wang (2015), we first identified the top 20 most frequent four-word sequences in the continuations in each group using AntConc 3.5.7 (Anthony, 2018), then determined which of the 20 sequences were source-oriented, and finally calculated the source-use ratio as the ratio of the token frequency of the source-oriented sequences to the token frequency of the top 20 most frequent four-word sequences in each group.
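The source-use ratio described above can be approximated with the following sketch. It is not a reproduction of the AntConc procedure: the function names are hypothetical, tokenization is simplified to lowercasing and whitespace splitting (the AntConc settings are not given in the text), and ties at the top-20 cutoff are resolved by whatever order `Counter.most_common` returns.

```python
from collections import Counter


def four_grams(text):
    """Return all four-word sequences in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)]


def source_use_ratio(continuations, source_text, top_n=20):
    """Share of top-n four-gram tokens in a group that also occur in the source."""
    counts = Counter()
    for text in continuations:
        counts.update(four_grams(text))
    top = counts.most_common(top_n)
    total_tokens = sum(freq for _, freq in top)
    if total_tokens == 0:
        return 0.0
    source_sequences = set(four_grams(source_text))
    # Token frequency of the top sequences that also appear in the source text
    shared_tokens = sum(freq for seq, freq in top if seq in source_sequences)
    return shared_tokens / total_tokens


# Toy data, invented purely for illustration
source = "jane walked into the forest alone"
group = ["jane walked into the dark", "jane walked into the cave"]
print(source_use_ratio(group, source))
```

In the toy data, the only four-word sequence shared with the source occurs twice among four top-sequence tokens, so the ratio is 0.5.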
We designed a writing strategy questionnaire to collect information on test-takers’ writing strategy use in completing the continuation task. The design was informed theoretically by the reading-to-write composing model (Plakans, 2008) and the composing process framework (Zhang & Zhou, 2014). We drew items from Yang’s (2014) Summarization Strategy Inventory and Yang and Plakans’ (2012) Strategy Inventory for Integrated Writing and considered input from three English writing teachers and eight 12th-grade students at the high school. The initial questionnaire contained 26 items. We then solicited expert judgment from a panel of one expert writing assessment researcher and five graduate students specializing in language testing. This led to several item adjustments and rephrasing of some of the instructions and items to improve clarity.
We piloted the questionnaire with 93 12th-grade students to confirm its construct validity and reliability. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy was .83 and Bartlett’s Test of Sphericity was significant (p < .05), indicating that the data satisfied the criteria for exploratory factor analysis (EFA). We used Principal Components Analysis as the method of extraction and Varimax with Kaiser Normalization as the method of rotation to explore the factor structure of the questionnaire, with the criterion of significant factor loadings set at .60 and the minimum eigenvalue set at 1.0. The EFA yielded six factors with eigenvalues greater than 1.0, accounting for 72.551% of the total variance. Seven items with factor loadings less than .60 or cross loadings greater than .40 were dropped. The final questionnaire contained 19 items that asked learners to indicate their frequency of using various strategies before, during, or after the continuation writing task on a five-point scale (5 = always, 4 = often, 3 = sometimes, 2 = seldom, 1 = never) (see Table 3). The following six latent writing strategy scales were represented: (1) evaluating (i.e., reexamining source text understanding, reconsidering task requirements, and revising the writing plan; Yang & Plakans, 2012), (2) connecting (i.e., integrating ideas from the source text with prior knowledge to develop the story; Plakans, 2008), (3) monitoring (i.e., checking and revising language use in the written text; Zhang & Zhou, 2014), (4) planning (i.e., setting writing goals and outlining the structure of the written text; Plakans, 2008), (5) source use (i.e., borrowing or paraphrasing key words and expressions from the source text; Yang & Plakans, 2012), and (6) organizing (i.e., using rhetorical and textual knowledge to construct a coherent and structured text; Plakans, 2009).
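The item-retention rule used in the pilot (keep an item only if its primary loading reaches .60 and no cross loading exceeds .40) can be illustrated as follows. This is a sketch under assumptions: the function name and the loadings are invented for illustration, and the reading of "cross loading" as the second-largest absolute loading is our interpretation rather than something the text specifies.

```python
def retain_item(loadings, primary_min=0.60, cross_max=0.40):
    """Decide whether a questionnaire item survives the loading criteria.

    loadings: the item's rotated loadings on each factor.
    The largest absolute loading is treated as the primary loading and the
    second largest as the strongest cross loading (an assumed interpretation).
    """
    ranked = sorted((abs(value) for value in loadings), reverse=True)
    primary = ranked[0]
    strongest_cross = ranked[1] if len(ranked) > 1 else 0.0
    return primary >= primary_min and strongest_cross <= cross_max


# Hypothetical loadings for three items across three factors
print(retain_item([0.72, 0.10, 0.05]))  # clear primary loading, kept
print(retain_item([0.55, 0.20, 0.08]))  # primary loading below .60, dropped
print(retain_item([0.65, 0.45, 0.02]))  # cross loading above .40, dropped
```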
Reliability analysis using Cronbach’s alpha showed that the questionnaire (alpha = .87) and most scales were reliable (.91 for Evaluating, .62 for Connecting, .81 for Monitoring, .49 for Planning, .69 for Source use, and .65 for Organizing). The three scales with fewer than three indicators (i.e., Planning, Source use, and Organizing) were dropped from further analysis.
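For readers unfamiliar with the reliability index reported above, Cronbach’s alpha can be computed from item-level responses as in the sketch below. The function name and the toy responses are hypothetical, and the sketch uses the standard formula with sample variances rather than the output of any particular statistics package.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a scale.

    responses: list of rows, one per respondent, each a list of item scores.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Invented five-point responses from five respondents to a three-item scale
toy = [
    [5, 4, 5],
    [3, 3, 4],
    [2, 2, 1],
    [4, 5, 4],
    [1, 2, 2],
]
print(round(cronbach_alpha(toy), 2))
```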
Table 3. Summary of the composites for the writing strategy use variables.
Data analysis
A quantitative analysis was employed to gauge the prompt effect on the participants’ writing performance and writing strategy use in the continuation task. To address Research Question 1, a one-way ANOVA was conducted to determine whether there were significant between-group differences in the writing scores of the four groups’ continuations. To address Research Question 2, a one-way MANOVA was first run to determine whether there were significant between-group differences in the textual features of the four groups’ continuations. Upon confirmation of the existence of such differences, a set of follow-up one-way ANOVAs was then performed to determine whether between-group differences existed in each feature. To address Research Question 3, a one-way MANOVA was first carried out to determine whether there were significant between-group differences in the following three writing strategy scales: Evaluating, Connecting, and Monitoring. Upon confirmation of the existence of such differences, three follow-up one-way ANOVAs were then run to determine whether between-group differences existed for each of the three scales. Additionally, for all variables that violated the assumption of homogeneity of variance, as indicated by the Levene test, a non-parametric Kruskal-Wallis test was also performed, followed by pairwise comparisons with the Games-Howell post hoc test if significant between-group differences were found. For all other variables, the LSD post hoc test was used for pairwise comparisons if the one-way ANOVA revealed significant between-group differences. Given that a total of 23 follow-up ANOVAs or Kruskal-Wallis tests were run, the alpha value was adjusted to .002.
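Two of the computations in this analysis plan can be sketched briefly: the F statistic of a one-way ANOVA and the adjusted alpha level. The sketch is illustrative only. The function names and toy data are hypothetical, and treating the adjustment as a Bonferroni-style division of .05 by the 23 tests is our assumption; it is consistent with the reported value of .002, but the text does not name the correction.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on lists of scores."""
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]
    # Between-group sum of squares: group size times squared mean deviation
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: squared deviations from each group's own mean
    ss_within = sum(
        (score - mean) ** 2
        for group, mean in zip(groups, group_means)
        for score in group
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def adjusted_alpha(alpha, n_tests):
    """Bonferroni-style adjustment (assumed): divide alpha by the number of tests."""
    return alpha / n_tests


toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # invented scores
print(one_way_anova_f(toy_groups))
print(round(adjusted_alpha(0.05, 23), 3))  # 0.002
```

With four groups of 30 participants, the same function would return degrees of freedom of 3 and 116, matching those reported for the follow-up ANOVAs.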
Results
Research Question 1: Overall writing scores
In terms of the overall writing scores, Group 4 (M = 17.41) and Group 1 (M = 13.78) obtained the highest and lowest mean score, respectively (see Figure 1). The Levene test indicated homogeneity of variance among the four groups (p = .15). As shown in Table 4, a one-way ANOVA revealed significant between-group differences in the mean scores (F(3, 116) = 4.92, p = .00, η² = .11). Pairwise comparisons using the LSD post hoc test indicated that Group 1 had a significantly lower mean score than Groups 2 (p = .02), 3 (p = .01), and 4 (p = .00).
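As a check on the reported effect size, eta squared (η²) for a one-way ANOVA can be recovered from the F statistic and its degrees of freedom, because η² = SS_between / SS_total = (F × df_between) / (F × df_between + df_within). The helper name below is hypothetical; the input values are those reported above.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    effect = f_value * df_between
    return effect / (effect + df_within)


# Values reported for the overall writing scores: F(3, 116) = 4.92
print(round(eta_squared_from_f(4.92, 3, 116), 2))  # 0.11
```

The result agrees with the reported effect size of .11.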

Box-and-whisker plot for overall writing scores for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: one for P2 (6), two for P3 (6.5 and 8), and two for P4 (7 and 9.5). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.
Table 4. ANOVA results for overall writing scores.
Note: P1, P2, P3, and P4 denote the bare, framed, vocabulary, and framed vocabulary prompt, respectively. CI = confidence interval; LL = lower limit; UL = upper limit.
Research Question 2: Textual features
A one-way MANOVA revealed significant between-group differences in the textual features considered (Wilks’s Λ = .24, F(60, 290) = 2.95, p = .00, η² = .38). A set of one-way ANOVAs was then run to determine whether between-group differences existed for each feature. The Levene test showed that MTLD, mean sentence length, number of words before the main verb, and LSA Given-New violated the assumption of homogeneity of variance. Kruskal-Wallis tests were thus run on these variables as well. Table 5 summarizes the ANOVA results and includes the results of the Kruskal-Wallis tests where appropriate. None of the fluency (p = .72), grammatical accuracy (p = .89), and lexical complexity (p = .15–.34) features showed significant between-group differences.
Table 5. ANOVA results for the textual features.
Note: P1, P2, P3, and P4 denote the bare, framed, vocabulary, and framed vocabulary prompt, respectively.
Significant between-group differences were found in four syntactic complexity measures. For mean sentence length (F(3, 116) = 17.64, p = .00, η² = .31) and number of modifiers per noun phrase (F(3, 116) = 7.30, p = .00, η² = .16), Group 3 and Group 2 showed the highest and lowest value, respectively (see Figures 2 and 4). LSD post hoc tests revealed that Group 2 produced significantly shorter sentences and fewer modifiers per noun phrase than all three other groups. For number of words before the main verb (F(3, 116) = 5.35, p = .00, η² = .12), Group 3 (M = 3.65) and Group 2 (M = 2.42) also showed the highest and lowest value, respectively (see Figure 3); Games-Howell post hoc tests indicated that Group 2 had a significantly lower value in this measure than Groups 3 and 4. For syntactic similarity of adjacent sentences (F(3, 116) = 11.12, p = .00, η² = .22), Groups 1 and 3 showed the lowest value (M = .12), while Group 2 showed the highest value (M = .17) (see Figure 5). In this case, a higher degree of syntactic similarity corresponds to a lower degree of syntactic variation. LSD post hoc tests indicated that Group 2 had a significantly higher degree of syntactic similarity than all other groups; Group 4 (M = .14) also had a significantly higher degree of syntactic similarity than Groups 1 and 3.

Box-and-whisker plot for sentence length for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: one for P1 (36) and one for P3 (36.4). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.

Box-and-whisker plot for the number of words before the main verbs for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: one for P1 (6.7) and one for P2 (4.3). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.

Figure 4. Box-and-whisker plot for the number of modifiers per noun phrase for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: two for P3 (.92 and .33). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.

Figure 5. Box-and-whisker plot for syntactic similarity of adjacent sentences for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. Error bars represent the range. The bottom whisker extends downward from the first quartile to the minimum. The top whisker extends upward from the third quartile to the maximum.
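The box-and-whisker conventions described in the figure captions can be illustrated with a short sketch. The captions do not state how outliers were identified or how quartiles were interpolated, so the 1.5 × IQR rule and the exclusive quartile method used below are assumptions, the function name is hypothetical, and the sample values are invented rather than taken from the study.

```python
import statistics


def box_plot_summary(values):
    """Return the quantities a box-and-whisker plot displays for one group."""
    data = sorted(values)
    # Assumption: quartiles use the exclusive interpolation method.
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    # Assumption: points beyond 1.5 x IQR from the box count as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(data),
        # Whiskers reach the most extreme values that are not outliers.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Invented mean sentence lengths for one hypothetical group.
example = [11.2, 12.5, 13.1, 13.8, 14.0, 14.6, 15.3, 16.1, 17.0, 36.0]
summary = box_plot_summary(example)
print(summary)
```

In this invented sample, the value 36.0 lies above the upper fence and would be drawn as a dot, so the top whisker stops at 17.0 rather than at the overall maximum.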
Two of the eight cohesion features showed significant between-group differences, namely, additive connectives (F(3, 116) = 5.44, p = .00, η² = .12) and LSA similarity between adjacent paragraphs (F(3, 116) = 33.58, p = .00, η² = .46) (see Figures 6 and 7). Games-Howell and LSD post hoc tests revealed that Groups 1 and 3 had significantly lower values in both of these cohesion features than Groups 2 and 4. These results suggest that the continuations produced by Groups 2 and 4 tended to be more cohesive than those produced by Groups 1 and 3.

Figure 6. Box-and-whisker plot for additive connectives for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: two for P3 (123.97 and 7.20). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.

Figure 7. Box-and-whisker plot for LSA similarity between adjacent paragraphs for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. The dots represent outliers: one for P1 (.67) and one for P2 (.14). Error bars represent the range, excluding any outliers. The bottom whisker extends downward from the first quartile to the minimum, excluding any outliers. The top whisker extends upward from the third quartile to the maximum, excluding any outliers.
Table 6 presents the descriptive statistics of the four groups’ use of source-oriented four-word sequences. Groups 3 and 4 used more types and tokens of source-oriented sequences and had higher source-use ratios than Groups 1 and 2. This indicates that the participants working on Prompts 3 and 4 aligned more closely with the source text in their use of the top 20 most frequent four-word sequences than those working on Prompts 1 and 2.
Table 6. Descriptive statistics of the four groups’ use of source-oriented four-word sequences.
Note: P1, P2, P3, and P4 denote the bare, framed, vocabulary, and framed vocabulary prompt, respectively.
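To make the notions of types, tokens, and source-use ratio concrete, the sketch below counts the four-word sequences in a continuation that also occur in the source text. It is an illustration under stated assumptions, not the procedure used in the study: the ratio is assumed here to be shared tokens divided by all four-word tokens in the continuation, sequences are allowed to cross sentence boundaries, the restriction to the top 20 most frequent sequences is omitted, and the function names and example texts are invented.

```python
import re
from collections import Counter


def four_word_sequences(text):
    """Return all contiguous four-word sequences in a text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)]


def source_use(source_text, continuation):
    """Count continuation four-word sequences that also occur in the source text."""
    source_set = set(four_word_sequences(source_text))
    continuation_counts = Counter(four_word_sequences(continuation))
    shared = {seq: n for seq, n in continuation_counts.items() if seq in source_set}
    total_tokens = sum(continuation_counts.values())
    shared_tokens = sum(shared.values())
    return {
        "types": len(shared),      # distinct shared sequences
        "tokens": shared_tokens,   # all occurrences of shared sequences
        # Assumed definition: shared tokens over all four-word tokens.
        "ratio": shared_tokens / total_tokens if total_tokens else 0.0,
    }


# Invented texts for illustration only.
source = "Jane walked along the lake. She was lost in the forest."
continuation = "Jane was lost in the forest until a helicopter appeared."
result = source_use(source, continuation)
print(result)
```

Here the continuation shares two sequences with the source ("was lost in the" and "lost in the forest") out of seven four-word sequences in total, so a writer who reuses more source phrasing would obtain a higher ratio.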
Research Question 3: Writing strategy use
A one-way MANOVA revealed significant between-group differences in the following three writing strategy scales: Evaluating, Connecting, and Monitoring (Wilks’s Λ = .83, F(9, 278) = 2.52, p = .01, η² = .86). Three follow-up one-way ANOVAs were then run to determine whether between-group differences existed for each of these scales. Because the three scales violated the assumption of homogeneity of variance according to Levene’s test, Kruskal-Wallis tests were also run on them to confirm the ANOVA results. As shown in Table 7, significant between-group differences existed in the Monitoring scale only (F(3, 116) = 4.28, p = .00, η² = .10) (see Figure 8). Group 2 (M = 12.37) and Group 3 (M = 10.07) showed the highest and lowest tendency to use monitoring strategies during task completion, respectively. Games-Howell post hoc tests revealed that Group 2 was more inclined to check their language use than Group 3.
Table 7. ANOVA results for the writing strategy scales.
Note: P1, P2, P3, and P4 denote the bare, framed, vocabulary, and framed vocabulary prompt, respectively.

Figure 8. Box-and-whisker plot for monitoring strategy use for different prompt types. The bottom and top of the box represent the first and third quartiles, the line in the box represents the median, and the x in the box represents the mean. Error bars represent the range. The bottom whisker extends downward from the first quartile to the minimum. The top whisker extends upward from the third quartile to the maximum.
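The Kruskal-Wallis test mentioned above as a check on the ANOVA results is rank based and does not rely on equal variances across groups. The following minimal sketch computes its H statistic with midranks and the standard tie correction. The monitoring scores are invented for illustration, the function name is hypothetical, and the comparison against 7.815 (the chi-square critical value at α = .05 with 3 degrees of freedom, appropriate for four groups) is a large-sample approximation, not the study’s actual computation.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with midranks and the tie correction.

    Not defined when every pooled value is identical (the correction is zero).
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    ranks, tie_term, i = {}, 0, 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j that they occupy.
        ranks[pooled[i]] = (i + 1 + j) / 2
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction


# Invented monitoring scores for four hypothetical prompt groups.
p1 = [12, 11, 13, 12, 14]
p2 = [13, 14, 12, 15, 13]
p3 = [9, 10, 11, 10, 9]
p4 = [10, 11, 12, 10, 11]

h_value = kruskal_wallis_h(p1, p2, p3, p4)
CRITICAL_CHI2_DF3 = 7.815  # chi-square critical value, alpha = .05, df = 3
print(f"H = {h_value:.2f}; exceeds critical value: {h_value > CRITICAL_CHI2_DF3}")
```

When the rank-based test and the ANOVA point to the same conclusion, as reported for the Monitoring scale, the finding is less likely to be an artifact of unequal variances.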
Discussion
Prompt effect on overall writing quality
Our results indicated a significant prompt effect on the participants’ overall writing quality in the continuation task, as evidenced by the significantly lower overall scores elicited by the bare prompt than by the other prompts with extra task elements. Additionally, the general trend was that prompts with more task elements yielded higher average scores at the group level. These results corroborate previous findings that writing prompts with higher cognitive demands may produce higher-quality essays than bare prompts (Bahrebar & Darabad, 2013; Robinson, 2001; Way et al., 2000). With respect to the use of the continuation task in high-stakes testing contexts, our results provide empirical support for Liu and Chen’s (2016) recommendation that more specific task prompts, such as those that provide an opening sentence for each paragraph and require the use of particular key words, be considered so that learners can demonstrate their writing ability more fully.
Prompt effect on the textual features of test-takers’ continuations
Our results revealed significant effects of prompt type on multiple syntactic complexity, cohesion, and source-use features in the participants’ continuations, but not on the fluency, grammatical accuracy, and lexical complexity features considered.
In terms of syntactic complexity and cohesion, the framed prompt elicited significantly lower syntactic sophistication and variation than the other prompts, and both the framed prompt and the framed vocabulary prompt elicited significantly lower syntactic variation but higher cohesion than the bare prompt and the vocabulary prompt. O’Loughlin and Wigglesworth (2007) reported that prompts with less information tended to elicit more complex language. Our finding that the bare prompt elicited higher syntactic complexity than the framed prompt and the framed vocabulary prompt partially supports theirs. Our results also suggest that the provision of opening sentences elicited lower syntactic complexity but higher cohesion. This finding may be accounted for by Skehan and Foster’s (2001) Limited Attentional Capacity Model, which posits that, because human attention is limited in capacity, individuals must prioritize the allocation of their attentional resources across tasks with different cognitive demands, resulting in trade-off effects in different areas of performance. When opening sentences for the paragraphs are provided, test-takers must devote attentional resources to ensuring that they develop their story in a way that logically connects each part to the opening sentence of each paragraph, resulting in higher cohesion. Consequently, fewer attentional resources can be allocated to the complexity of language, resulting in lower syntactic complexity.
The results on the source-use ratio indicated that the prompts with extra elements elicited a higher degree of alignment with the source text than the bare prompt. These results are consistent with Yuan’s (2013) finding that explicit prompts may enhance linguistic alignment in L2 writing. Additionally, the provision of required key words resulted in particularly high source-use ratios. As Wang (2015) and Xiao (2013) noted, L2 writers who referred back to the source text more frequently tended to align more with the source text. The provision of required key words may have prompted the test-takers to refer back to the source text more often, leading to an increased level of alignment.
The four prompts yielded continuations with comparable fluency and grammatical accuracy. These results differ from Way et al.’s (2000) finding that the vocabulary prompt elicited more fluent and accurate writing samples than the bare prompt. This difference may have arisen from the difference in the proficiency level of the participants in the two studies. Whereas Way et al.’s (2000) participants were novice foreign language learners, our participants were at the upper-intermediate level, as evidenced in the scores (12 or 13 out of 15) they earned in the pre-writing test.
The four prompts also elicited comparable levels of lexical complexity. This was somewhat surprising, as the provision of required key words was initially expected to affect test-takers’ lexical choice. A possible explanation for this non-significant difference may lie in the nature of the underlined words. Chosen in accordance with Labov’s (1972) narrative discourse schema, these words served to orientate the test-takers to the key components of the story, including characters (e.g., Jane), environmental background (e.g., lake), objects (e.g., helicopter), action (e.g., climbed), and emotion (e.g., joy). It remains to be seen whether the provision of required words that are less essential to the development of the story and/or that are more complex may have a stronger effect on the lexical complexity of test-takers’ continuations.
Although the relationship between textual features and overall scores is beyond the scope of the current study, we noted that continuations with higher levels of fluency, lexical complexity, cohesion, and source use tended to obtain higher overall scores, whereas those with higher levels of grammatical accuracy and syntactic complexity did not. The trend observed for fluency may be explained by the timed nature of the task, while that for cohesion and source use may be explained by the emphasis on cohesive organization and alignment with the source text in the construct underlying the continuation task. The different trends observed for lexical and syntactic features may be attributed to the different weights assigned to them in the rating process. However, these trends and the relative contribution of the textual features to the overall scores warrant further research.
Prompt effect on writing strategy use
The questionnaire results suggested a significant impact of the prompts on test-takers’ monitoring strategy use. The bare prompt and the framed prompt elicited higher monitoring strategy use than the vocabulary prompt and the framed vocabulary prompt. These results partially corroborate previous findings that with different prompts, L2 writers may orientate to the writing task differently and reallocate their attentional resources to fulfill the task (e.g., Xin, 2017). With the provision of required key words in Groups 3 and 4, test-takers can use these key words directly and may need to pay less attention to the accuracy of at least some aspects of their language use (e.g., spelling).
Conclusion
With this study we sought to understand the potential effects of prompt type, an important task variable, on test-takers’ writing performance and writing strategy use in the continuation task. Our findings revealed that whether the prompt provides opening sentences for the paragraphs, required key words, or both may significantly affect test-takers’ overall writing scores, the syntactic complexity, cohesion, and source-use features of their continuations, and their use of monitoring strategies. Our findings suggest that prompts that include both opening sentences for the paragraphs and required key words are likely to allow test-takers to demonstrate their writing ability more fully than prompts that include neither or only one of these elements. Our findings also provide useful information that can inform L2 writing teachers’ decisions in selecting and designing different writing prompts for the continuation task. Depending on the learning objectives of a specific pedagogical stage, teachers may choose to integrate different types of specific requirements in their prompt design to foster the learners’ use of different types of writing strategies and textual features.
The current study has several limitations, some of which can be addressed in future research. First, the measurement of grammatical accuracy using a simple three-point scale may be strengthened with more fine-grained measures that examine specific types of errors. Second, while we made an effort to ensure the validity and reliability of the writing strategy questionnaire, it is possible that not all learners could reliably recall the writing strategies they used. A more rigorous investigation of the writing process will help us better understand test-takers’ perception of the task requirements specified in different prompts and their decisions to adopt different writing strategies to meet those requirements. Third, we used purposeful sampling to recruit participants at a specific proficiency level. It will be useful to examine whether the findings obtained in this study generalize to test-takers at other proficiency levels. Finally, we did not consider the effect of the interaction between prompt type and topic on test-takers’ writing performance, nor did we systematically examine the relationship between textual features and overall writing scores in the continuation task. Both would constitute useful avenues for future research.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a grant from the National Philosophy and Social Science Fund of China (15BYY080) to the corresponding author.
