Abstract
Reading-to-write (RTW) tasks are becoming increasingly popular and have already been used in several high-stakes English proficiency exams, either replacing or complementing a prompt-based essay test. However, it is still not clear what accounts for successful or unsuccessful performance on an integrated reading–writing task, owing to the hybrid nature of reading and writing skills and to potential rater effects on test score variability. Thus, in this study, data-driven analytic rubrics for the RTW task were first developed. Then, the analytic subscores of 83 college ESL students’ responses to the RTW task were obtained. Correlational analyses were used to explore the relationship between the writing and reading skills engaged in different aspects of the RTW task. A multivariate G-study was also applied to examine the degree of variability attributable to test takers and raters on analytic subscores. The results indicate that a RTW task may tap into both reading and writing abilities, given the relatively high correlations observed among the composite and separate analytic subscores and the independent reading and writing scores. The multivariate G-study results also show that each analytic rating domain could capture the variability in test takers’ proficiency utilized in the RTW task, and that raters assigned scores neither too harshly nor too leniently across the analytic rating domains. However, the results also reveal that person and rater facets contributed to score variability differently in certain analytic categories. This study provides valuable insights into the nature of RTW tasks and has implications for rating rubric development for integrated tasks.
Keywords
Academic writing rarely takes place without recourse to understanding other written or spoken sources. That is, writing in an academic context usually involves source texts to which writers need to respond (Carson, 2001; Horowitz, 1986, 1991; Leki & Carson, 1997; Weigle, 2002). Thus, compared to a traditional impromptu writing task, an integrated reading-to-write (RTW) task better reflects authentic demands placed on students by simulating genuine tasks encountered in real academic settings (Feak & Dobson, 1996; Weigle, 2004). The content validity of the RTW task is also enhanced when it is aligned with the curriculum of language programs in which reading and writing skills are intentionally combined (Cumming, 2013; Grabe, 2003; Wolfersberger, 2013). In addition, source-based writing can help level the playing field by providing relevant content on which to write, thereby controlling for test takers’ uneven topical knowledge, which can bias their writing test scores (Gebril, 2009; Read, 1990; Weigle, 2004). Consequently, reading-to-write tasks have become increasingly common in several high-stakes testing contexts such as in the Test of English as a Foreign Language Internet-based Test (TOEFL iBT), subject A exam in California, the Canadian Academic English Language (CAEL) test, and the Georgia State Test of English Proficiency (GSTEP) (Weigle, 2004).
Such wide use of RTW tasks in high-stakes tests has recently prompted abundant language testing research on integrated skill writing assessment. This research addresses many important issues, including the constructs of varied integrated writing assessments in different contexts (Gebril & Plakans, 2013; Knoch & Sitajalabhorn, 2013; Wolfersberger, 2013), the role of test takers’ proficiency level in their performance on various types of integrated writing assessments (Gebril & Plakans, 2013; Sawaki, Quinlan, & Lee, 2013), and reading processes and outcomes involved in a RTW task (Weigle, Yang, & Montee, 2013). By the same token, L2 researchers have also drawn attention to the RTW task. In 2012, the Journal of Second Language Writing published a special issue on textual appropriation and source use in L2 writing. In particular, Weigle and Parker (2012) dealt with textual borrowing practices and their effect on the overall scores in a RTW task.
Yet, it is still not clear how the construct of integrated reading-writing assessment can be defined and operationalized into the rating criteria (Cumming, 2013; Yu, 2013). Given that “rating scales act as the de facto test construct in a writing assessment” (Knoch, 2011, p. 81), further research on the constructs of RTW assessment at the scoring level has been called for in order to better understand test takers’ performance on a RTW, particularly in relation to their independent reading and writing abilities, and raters’ behaviors across different domains of a RTW task.
Literature review
Most people read and write in combination to integrate information from source texts. When they focus on reading, reading-to-write tasks have been viewed as a learning tool (Trites & McGroarty, 2005) causing the reader/writer to select, evaluate, and use content from the source (Plakans, 2009a). On the other hand, when they view it as an alternative writing task, reading serves as a tool providing topical information, text revision models, and evaluation for writing (Hayes, 1996). No matter how they are viewed, RTW tasks are affected by independent reading or writing ability in terms of how test takers select and edit information from the source text and combine that information into their own texts (Asención, 2008; Plakans, 2009a).
At the process level, reading ability in a RTW task affects the number of notes taken and the degree of their elaboration (Kennedy, 1985), as well as the level of text synthesis made during task completion (Plakans, 2009a). Writing ability embedded in a source-based writing task influences the way learners integrate information from a given source text into their own writings. However, at the product level, based on correlational studies, the relationships between independent reading and writing scores, and integrated reading-to-write test scores are not yet clear. With regard to the relationship between general reading scores and scores of a RTW task, several studies report that there is a strong relationship between overall reading comprehension scores and RTW test scores. For example, Trites and McGroarty (2005) found that their RTW task scores were correlated with Nelson-Denny, TOEFL reading comprehension, and two other measures of basic reading comprehension scores between .68 and .70, indicating that all reading measures were found to be related to the RTW scores. Most recently, Sawaki, Quinlan, and Lee (2013) examined the factor structures of human and automated scores of the TOEFL iBT integrated essay responses, and independent reading and listening comprehension test scores. They found that their content measure of the integrated essays, which underlay learners’ responses to an integrated writing task, turned out to be a correlated yet distinct construct along with two writing factors, Productive Vocabulary and Sentence Conventions. The correlations among these factors ranged from .49 to .72. They also found that their content measure was strongly associated with independent reading and listening skills, ranging from .83 to .87, indicating that a skill of identifying appropriate information from a spoken lecture and a written source text in essay writing is closely related to abilities measured in reading and listening comprehension tests.
On the other hand, other studies have shown that the correlations between general reading comprehension scores and reading-to-write test scores are not that strong. For instance, Asención (2008) reported that there were low positive correlations between reading proficiency scores and the scores of two different types of RTW tasks, a summary (.28) and a response essay (.38) based on the same source text. Similarly, Watanabe (2001) argued that reading scores alone cannot reliably predict scores on a RTW task, as reading proficiency only accounted for 1% or 2% of the total variances in the two different RTW task scores.
As opposed to the mixed results about the relationship between independent reading scores and integrated RTW scores, most of the previous studies show that independent writing test scores and RTW scores are highly correlated (Brown, Hilgers, & Marsella, 1991; Lewkowicz, 1994; Watanabe, 2001). However, test takers did not always seem to perform equally on both tasks. For instance, Gebril (2009) found that the test takers performed slightly better on integrated writing tasks than students did on independent ones in terms of writing ability. He speculated that this may be due to the fact that students could model their writing on the source text, which led to higher scores on organization and development of ideas.
These previous studies reveal the nuanced picture of the construct of RTW tasks, particularly in regard to the interrelationships among RTW task scores and independent reading and writing scores. On the other hand, in most of the previous studies (Asención, 2008; Gebril, 2009, 2010; Gebril & Plakans, 2013), aggregated scores either holistically or analytically acquired have been used to rate RTW task performances prohibiting us from understanding the roles of reading and writing skills each in distinctive dimensions of essay quality in a RTW task at score levels. Most recently, Sawaki, Quinlan and Lee (2013) found that content understanding could be factored out as an important aspect of integrated writing assessments but was highly correlated with reading and listening test scores. Yet, their rubrics did not include other key features of integrated writing such as textual development and organization or the uses of original source texts (Gebril & Plakans, 2013; Knoch & Sitajalabhorn, 2013; Weigle & Parker, 2012), resulting in an under-representation of a broader construct of integrated writing skills. It is thus important to develop analytic rubrics based on actual learner responses to score RTW tasks and examine how each scoring domain is related to test takers’ independent reading and writing abilities.
It should also be noted that the quality of performance on RTW tasks is usually determined by human raters who assign scores. To some degree, raters themselves may account for scoring discrepancies among test takers on a RTW task. Previous research in performance assessment has shown that rater and rater-related interaction effects account for a considerable degree of variance in ratings (Hoyt & Kerns, 1999). The rater variable is one of the main sources of measurement error that needs to be controlled because it often affects score reliability in writing assessments (Gebril, 2009). It has also been found that raters tend to differ strongly in the severity of their scoring in L2 writing performance assessments (Eckes, 2005, 2008; Kondo-Brown, 2002), and such rater bias tends to persist even after rater training (Weigle, 1998). It is particularly important to understand the effect of raters on the scoring of RTW tasks because the complex nature of a RTW task tends to make it difficult for raters to score reliably; raters have to focus on both the source texts and the test takers’ writing while they are rating the RTW task, and they tend to give different importance to various qualities of the RTW task across different proficiency levels (Gebril & Plakans, 2014). However, to date, raters’ behaviors in terms of their severity or leniency, and the effects of these behaviors and of the number of raters on score reliability in different analytic scoring domains of the RTW task, have not yet been sufficiently investigated. In the context of RTW task scoring, a clear picture of the impact of rater factors and test takers’ ability remains elusive. Generalizability theory (G-theory) may provide a statistical means for disentangling these different sources of variance, but to date G-theory has only been used in a few studies of the effects of raters on holistic scoring in RTW tasks (Gebril, 2009, 2010; Lee & Kantor, 2005).
Generalizability theory (G-theory)
G-theory is a measurement model for estimating, via estimated variance components, the effects on test scores of a person’s ability and of systematic and random sources of measurement error. It overcomes the limitation of classical test theory (CTT), which does not disentangle different sources of error variance (Shavelson & Webb, 1991). G-theory enables researchers to capture the effects of two different systematic sources of score variance in a single analysis: the object of measurement and the facet of measurement. The former refers to systematic sources of variance owing to test takers’ differences in their knowledge, ability, or skill (often denoted as person), whereas the latter is related to the error sources of variability, including items, tasks, and raters in performance assessment. It should be noted that the “universe of admissible observations”, which covers all the acceptable measurement conditions in a specific testing context, should be defined first to determine a facet to be investigated (Briesch, Swaminathan, Welsh, & Chafouleas, 2014, p. 20).
The second stage of a G-theory analysis is a decision study (D-study), which is conducted to examine the relative effects of an increasing number of conditions of a facet on the corresponding generalizability or dependability coefficients. A generalizability coefficient (Eρ²), which is estimated based on relative error variance, is analogous to a reliability coefficient in classical test theory, whereas a dependability coefficient (Φ) employs absolute error variance (Brennan, 2001a). The D-study results can thus inform testers of how many tasks or raters are appropriate to reach a certain desirable level of score dependability in a specific testing situation. G-theory can also be expanded to multivariate G-theory when a test consists of multiple domains or subscores as a fixed facet combined with other facets such as tasks or raters (Brennan, 2001a). Multivariate G-theory allows researchers to examine sources of covariance among multiple scores. The approach of using multivariate G-theory to investigate the reliability of, and interrelationships among, analytic categories rated by multiple raters has been applied in previous studies in writing (Lee & Kantor, 2005) and speaking (Lee, 2006) testing contexts.
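To make the distinction between the two coefficients concrete, the following is a minimal sketch for a single-facet crossed persons-by-raters design, using the standard definitions of relative and absolute error variance. The function name and the variance components in the example are hypothetical and are not taken from this study's tables.

```python
def d_study_coefficients(var_p, var_r, var_pr_e, n_raters):
    """Return (generalizability coefficient, phi) for a crossed p x r design.

    var_p    : person (universe-score) variance
    var_r    : rater main-effect variance
    var_pr_e : person-by-rater interaction confounded with residual error
    n_raters : number of raters assumed in the D-study
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    # Relative error ignores the rater main effect (rank-order decisions).
    relative_error = var_pr_e / n_raters
    # Absolute error also counts rater severity (absolute-level decisions).
    absolute_error = (var_r + var_pr_e) / n_raters
    g_coef = var_p / (var_p + relative_error)
    phi = var_p / (var_p + absolute_error)
    return g_coef, phi


# Hypothetical variance components, for illustration only.
for n in (1, 2, 4, 6):
    g_coef, phi = d_study_coefficients(0.80, 0.02, 0.18, n)
    print(n, round(g_coef, 3), round(phi, 3))
```

Because the rater main effect enters only the absolute error term, Φ can never exceed the generalizability coefficient for the same design, and both rise as raters are added.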
There have been a number of previous studies using G-theory to investigate the effects of raters (Lee, Gentile, & Kantor, 2008; Swartz et al., 1999) and of tasks (Lee & Kantor, 2005; Schoonen, 2005) on independent writing test score variances. The results reveal that rater and task effects accounted for relatively small amounts of the total score variance, except for somewhat large interaction effects between persons and raters or tasks, and that increasing the number of raters and tasks enhanced score dependability. In contrast, to date there is little prior research using a G-study approach on RTW test scores except for a few studies by Lee and Kantor (2005) and Gebril (2009, 2010). Their results showed that RTW scores were as reliable as independent writing scores, and that test score dependability increased as the number of tasks and raters increased, although increasing the number of tasks was more efficient than increasing the number of raters.
Nevertheless, as previously mentioned, it is not known whether such findings, which are based on holistic or composite scores, can be applied to analytic rubric domains. Raters may judge a certain aspect of a RTW task less consistently than others, which may be a threat to the reliability of test takers’ obtained scores on a RTW task. If raters are replaceable with any other people with comparable characteristics after proper rater training, the rater effect could be generalizable to other similar testing situations and could suggest how many raters would be needed to achieve a desirable level of dependability. Thus, in the current study, a multivariate G- and D-study was conducted to investigate the degree of variability of test takers’ performance and raters’ severity across multiple analytic domains of a RTW task.
To sum up, it has been hard to pin down what contributes to successful performance on a RTW task due to its hybrid nature. Thus, the present study has set out to develop and use data-driven analytic scoring rubrics to explore the interrelationships among reading and writing abilities engaged in RTW tasks and different components of RTW constructs. This study also aims to examine the rater’s effects on RTW scoring in terms of the relative size of its contribution to score variance. To address these two outstanding issues in RTW task assessment, the following four research questions (RQ) were raised in this study:
RQ1: What are the interrelationships among analytic measures of RTW task scores and reading comprehension test scores and independent writing measures?
RQ2: To what extent do test takers vary on each analytic domain of a RTW task?
RQ3: To what extent do raters vary in terms of severity across different analytic scoring domains of a RTW task?
RQ4: To what extent does an increasing number of raters affect RTW task score dependability of each analytic scoring dimension?
Method
Participants
Our participants included matriculated undergraduate ESL students and experienced teachers of ESL academic learners at a large Midwestern university in the United States. Most of the students scored higher than 500 on the paper-based Test of English as a Foreign Language or 61 on the Internet-based TOEFL (iBT), although some students did not have any standardized English proficiency exam scores since these were not required for admission to this university. The data for this study were from 83 students who took the university’s English Proficiency Exam (EPE) prior to starting their first semester of study. The EPE has multiple-choice reading and listening components, a 35-minute written essay, and a 5-minute oral interview. On the basis of the EPE results, matriculated students may be required to take between one and four eight-week English language development courses over their first year of study or may not be required to take any ESL support courses. A total of six ESL instructors (each with more than five years teaching experience and at least an MA in TESOL and applied linguistics) participated in the development of the analytic assessment tools used in this study as well as in the rating of the RTW task responses.
Independent reading task
The participants’ scores on the reading comprehension (independent reading task) component of the EPE were collected. This component consists of eight different reading passages (two narrative, four expository, and two argumentative texts) with an average length of 303.75 words and an average readability of 12.08 on the Coleman-Liau Index (Coleman & Liau, 1975), a readability measure calculated from the average number of letters and sentences per 100 words in a given text, whose output corresponds to a US grade level. The independent reading task is made up of 40 items measuring diverse subskills of reading comprehension, such as identifying main ideas, locating details, making inferences, and understanding relations within a sentence and across sentences and paragraphs. Fifty-five minutes were allotted for this section. The reliability of the reading comprehension test was acceptable (α = .80).
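As an illustration of how such a readability value is obtained, the sketch below applies the commonly cited Coleman-Liau formula, CLI = 0.0588 L − 0.296 S − 15.8, where L is the number of letters per 100 words and S is the number of sentences per 100 words. The tokenization rules (what counts as a word or a sentence) are simplifying assumptions, and the function names are hypothetical; the tool actually used to score the EPE passages is not specified in this paper.

```python
import re


def coleman_liau_from_counts(letters, words, sentences):
    """Coleman-Liau Index from raw counts of letters, words, and sentences."""
    if words <= 0:
        raise ValueError("text must contain at least one word")
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def coleman_liau_index(text):
    """Coleman-Liau Index of a text, using simple regex-based counting."""
    # Assumption: a word is a run of letters, optionally joined by ' or -.
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    letters = sum(ch.isalpha() for ch in text)
    # Assumption: each run of ., ! or ? ends one sentence.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    return coleman_liau_from_counts(letters, len(words), sentences)


sample = "Reading and writing are closely related skills. Tests often combine them."
print(round(coleman_liau_index(sample), 2))
```

Very short samples such as the one above give unstable values; the index is intended for longer passages such as the roughly 300-word texts described here.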
Independent writing task
In the composition (independent writing task) component of the EPE, students wrote a formal, prompt-based academic essay for 35 minutes. They were allowed to choose one of three possible essay topics. In order to enhance test security, a total of 16 different prompts were used (see Appendix A for the list of prompts). Mean scores obtained from each prompt were compared using a one-way ANOVA to see if students who chose a specific topic performed better than others. The ANOVA results indicated that the prompt effect was minimal, as can be seen from the small F value, F(15, 67) = .83, and the non-significant p-value (p = .65), along with non-significant results in all post-hoc tests for the differences among essay scores across prompts. The essay was holistically evaluated on a scale of 1 to 5 for linguistic control (grammar, vocabulary, spelling, and punctuation) and content development (clarity, thoughtful development and elaboration of ideas) by two raters who were intensively trained on calibration sets. The resulting inter-rater reliability was high (α = .92). Any discrepancies in their ratings were within one point, and the average of the two ratings was reported as the final score.
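The reported degrees of freedom follow directly from the design: 16 prompts give 16 − 1 = 15 between-group degrees of freedom, and 83 essays give 83 − 16 = 67 within-group degrees of freedom. The sketch below shows how the F statistic of a one-way ANOVA is computed from grouped scores; the function name and the data are hypothetical and do not reproduce the study's essay scores.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three prompts, for illustration only.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```

An F value below 1, as reported here, means that the variation between prompt means was smaller than the variation among essays written to the same prompt.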
Reading-to-write task
For the purposes of this study, a 40-minute RTW task was included during one administration period of the EPE. Since the purpose of the task was to more directly measure a typical university integrated task of responding to multiple sources and making an argument, two texts on the same topic and of similar length and readability (Coleman-Liau Index 9.8 and 10.8, respectively) were chosen. Each text expressed a different viewpoint on the influence of violent video games on teens’ behaviors. The two texts were 651 and 627 words in length. The level of difficulty was determined by the experience of the ESL instructors, who had used these texts in classes at the entry level of the matriculated ESL support courses. The task prompt (see Appendix B) was designed taking various concerns into account, the most salient of which was requiring learners to show evidence of some level of comprehension of the two texts while allowing them to position themselves in relation to the two perspectives without requiring them to fully agree or disagree with either perspective. Each RTW task was rated by six different raters. As can be seen in Table 1, inter-rater reliabilities on the analytic scores of the reading-to-write task were quite acceptable, ranging from .85 to .93.
Reliabilities.
Rubric development
A trait-based analytic rubric, developed by a team of six instructors, was first drafted on the basis of previous studies on RTW tasks (Gebril & Plakans, 2009; Weigle, 2004). The six raters tested the initial rubric on 10 RTW student samples representing five holistic levels (a more detailed discussion of the holistic RTW scale development process can be found in Ewert & Shin, 2011), and then engaged in discussion over several sessions leading to a final revised rubric. The revision focused particularly on how to operationalize the role of the texts and the learners’ use of the texts in the final written product. The final rubric provided written descriptors for five levels of five traits: (1) Viewpoint recognition – whether the writer notes the two different viewpoints in the text; (2) Organization – whether the writer integrates own view with others in a cohesive and coherent manner; (3) Development – whether the writer can support the main argument with examples and explanations; (4) Language use – whether the writer uses syntactic and lexical richness and accuracy; and (5) Text engagement (source text use) – whether the writer is balanced in using the two texts, neither overusing nor underusing (see Appendix C for the full written description). Our analytic rubric is similar to Weigle’s (2004) four analytic components: content, organization, language complexity, and language accuracy, which she used to assess students’ argument essays based on two argumentative reading passages on a single theme. However, our analytic rubric includes more reading-related features such as viewpoint recognition and source text use, and rhetorical features further divided into organization and development to better capture students’ RTW task performances.
Data collection and analysis
Raters were asked to read through the source texts three or four times to become familiar with examples and details along with key expressions from the source texts. To minimize any potential halo effect that analytic rubric scoring may introduce (Hamp-Lyons, 1991; Weigle, 2002), each rater was asked to assign scores for one dimension at a time for each writing sample. After the final draft of the analytic rubric was developed, each rater went through multiple rater calibration sessions to make sure they were all familiarized with the common rubric. In the first round, raters read five randomly selected essays and compared their results with those of other raters. In the second round, raters read an additional five essays, and we found that the agreement ratio of their ratings was above 80%.
Three sources of data were collected to answer the research questions: the independent reading and writing scores from the EPE, and the analytic trait-based scores of the RTW samples from the same participants. These data were first analyzed for several correlation analyses using SPSS 20 (2011). Correlations among independent reading and writing scores, and composite RTW scores were calculated. In addition, unlike previous studies which only compare aggregated analytic scores with other language skills, we examined the relationship between independent reading and writing skills, and each analytic dimension of the RTW task. Partial correlation analysis was also used to examine the relationship between two variables after removing the effects of other variables.
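For readers unfamiliar with partial correlation, the sketch below gives the first-order case, in which the association between two variables is adjusted for a single third variable. This is a simplification: the analysis in this study holds several analytic subscores constant at once, which SPSS handles with a higher-order procedure. The function name and the input correlations are hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical zero-order correlations, for illustration only.
print(round(partial_correlation(0.60, 0.50, 0.40), 3))
```

When the control variable is related to both x and y, the partial correlation is typically smaller than the zero-order value, which is why a subscore can correlate significantly with reading scores on its own yet not after the other subscores are partialled out.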
The computer program mGENOVA (Brennan, 2001b) was also employed for the multivariate G-theory analysis to estimate the dependability of the analytic subscores and to calculate universe-score correlations to examine the “true” relationships among different analytic components of the RTW task scores when scores are assumed to reflect a person’s average score across all the possible raters (Shavelson & Webb, 1991). It also reports the phi (Φ) coefficients of the analytic scores as an index of dependability, which refers to the degree of consistency of the absolute level of scores awarded by multiple raters. The phi coefficient seems more appropriate than the generalizability coefficient in this context, in which exemption or placement decisions are based on students’ absolute level of performance rather than their relative standing (Shavelson & Webb, 1991). G-theory can be expanded to multivariate G-theory when a test consists of multiple domains or subscores as a fixed facet with other random facets such as tasks or raters (Brennan, 2001a). In this study, since all six raters (r) were involved in assessing the RTW task performance of all persons (p) for each analytic subscore, a multivariate single-facet design (p• × r•), with raters as a random facet and the analytic dimensions as a fixed facet, was employed. Raters are considered “random” because they could be drawn randomly and are exchangeable with any other raters. Thus, the universe of generalization in this study concerns the rater facet, which “provides information about how closely the ratings conducted by one individual reflect the average rating generated by all of the raters who could have possibly conducted the rating instead” (Briesch, Swaminathan, Welsh, & Chafouleas, 2014, p. 20). Analytic score categories are treated as “fixed” facets given that they are not randomly chosen and are not replaceable with other criteria (Shavelson & Webb, 1991).
The one-facet crossed design (p• × r•) was run since all raters scored all students’ performances on the RTW task on all the analytic rating scales. Subsequently, a D-study was conducted to calculate the corresponding dependability estimates (Φ) for absolute decisions (Shavelson & Webb, 1991) for each analytic domain score of the RTW task under different numbers of raters.
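The univariate core of this analysis can be sketched as follows. This is not the mGENOVA procedure: it estimates variance components for one analytic domain at a time from a fully crossed persons-by-raters score matrix, using the usual expected-mean-squares equations, and it does not estimate the covariance components of the multivariate design. The function name and the scores are hypothetical, and truncating negative estimates at zero is a common convention adopted here as an assumption.

```python
def g_study_p_by_r(scores):
    """Estimate variance components for a fully crossed p x r design.

    scores[p][r] is the score given to person p by rater r.
    Returns a dict with keys "p", "r", and "pr,e".
    """
    n_p, n_r = len(scores), len(scores[0])
    if n_p < 2 or n_r < 2 or any(len(row) != n_r for row in scores):
        raise ValueError("need a complete matrix with >= 2 persons and raters")
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r  # interaction confounded with error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    return {
        "p": max((ms_p - ms_pr) / n_r, 0.0),   # person variance
        "r": max((ms_r - ms_pr) / n_p, 0.0),   # rater severity variance
        "pr,e": max(ms_pr, 0.0),               # residual variance
    }


# Hypothetical ratings: four persons scored by three raters.
ratings = [[2, 3, 2], [3, 3, 4], [4, 5, 4], [5, 5, 5]]
print(g_study_p_by_r(ratings))
```

The three estimates can then be passed to a D-study calculation to project Φ for any number of raters.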
Results
Descriptive statistics
Table 2 reports the descriptive statistics for the independent reading and writing scores, and total and component analytic scores of the RTW task. As can be seen, students wrote more in the independent writing task than in the RTW task, presumably because they had more time for writing in the former, whereas in the RTW task they had to split the time between reading and writing. In the RTW task, students scored lowest on viewpoint recognition but scored relatively higher in the text engagement and language use domains.
Descriptive statistics.
M = mean; SD = standard deviation.
In response to the research questions identified earlier, the following results are presented after each research question.
RQ1: What are the interrelationships among analytic measures of RTW task scores and reading comprehension test scores and independent writing measures?
Table 3 presents the correlations among independent reading and writing scores, and the total analytic RTW task scores. The composite analytic scores are moderately correlated with independent reading scores at .68, and independent writing scores at .65, indicating that performance on the RTW task appears to tap into both reading and writing ability to a similar degree. It should also be noted that there was a moderate correlation between the reading and the writing tests (r = .64), suggesting that good readers tend to be good writers and thus, reading and writing could be overlapping skills, as research, particularly in L1 reading/writing, has shown (Grabe, 2003).
Pearson product–moment correlation coefficients (n = 83).
All correlations are significant at .00.
Table 4 shows zero-order correlations among the five analytic subscores, and the independent reading and writing scores. It reveals that each analytic score was correlated significantly with both reading and writing scores in similar ways. It is notable that the highest correlation was observed for the language use subscore for both reading and writing measures. Table 5 displays partial correlations that were carried out to examine the independent relationships between each analytic subscore, and reading comprehension and writing scores while holding other analytic component scores constant. Note that only viewpoint recognition and language use were significantly correlated with reading comprehension scores, and language use alone was related to writing scores, after the impact of other subscores was partialled out.
RQ2: To what extent do test takers vary on each analytic domain of a reading-to-write task?
Zero-order correlations between analytic components of reading-to-write task scores and independent reading/writing scores.
All correlations are significant at .00.
Partial correlations between analytic components of reading-to-write task scores and independent reading/writing scores.
Statistically significant correlations; each value in parentheses represents p-values.
Table 6 shows estimated G-study variance and covariance components for each analytic dimension, which provides information about the degree of relative contributions of measurement objects (person) and facets (rater and person-by-rater) to the variance of the five analytic RTW scores.
Estimated variance and covariance components for the five analytic scoring domains (p• × r• design).
Note: Viewpoint recognition (R); Text engagement (T); Organization (O); Development (D); Language use (L) Bolded numbers represent variances, upper off-diagonal numbers are correlations, and lower off-diagonal numbers are covariances.
Among the three variance components calculated for each analytic scoring domain, the person variance (p) is the largest in every domain, explaining from 82% to 92% of the total variance of each analytic subscore. This result suggests that the majority of the variance in RTW scores is explained by differences in test takers’ proficiency on the integrated writing task across all analytic rating dimensions. Of the five analytic domains, note that the text engagement domain had the smallest person variance (81.8%), whereas the largest person variance was observed for the development domain (91.9%) followed by the language use domain (89.2%), indicating that more variability among test takers was found for the development and language use subscores than for the text engagement subscore. Table 6 also shows the covariance components and correlations for the person (p) across the five analytic dimensions. Universe score correlations among the analytic subscores were relatively high, ranging from .71 to .92. Among them, the three domains that are more directly related to independent writing skills (organization, development, and language use) were highly correlated with each other, all above .80. The two RTW-specific domains, viewpoint recognition and text engagement, were highly correlated at .90, indicating that there were potentially two major dimensions, namely, writing-related and reading-related ones. Meanwhile, relatively low correlations were found between viewpoint recognition and organization, development, and language use (r = .71, .76, and .81) and between text engagement and organization, development, and language use (r = .78, .80, and .83), also hinting at the writing-related versus reading-related bidimensionality observed in the data.
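The two quantities discussed above can be derived from the estimated components as sketched below: the percentage of total variance attributable to each source, and the universe-score correlation between two domains, which is the person covariance divided by the square root of the product of the two person variances. The function names and input values are hypothetical and are not the entries of Table 6.

```python
import math


def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: 100 * value / total for name, value in components.items()}


def universe_score_correlation(cov_ab, var_a, var_b):
    """Disattenuated correlation between domains a and b for persons."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("person variances must be positive")
    return cov_ab / math.sqrt(var_a * var_b)


# Hypothetical person variances and covariance, for illustration only.
print(percent_of_total({"p": 0.80, "r": 0.02, "pr,e": 0.18}))
print(universe_score_correlation(0.30, 0.25, 0.64))
```

Because rater-related error is removed from both the numerator and the denominator, universe-score correlations are usually higher than the observed correlations between the same subscores.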
RQ3: To what extent do raters vary in terms of severity across different analytic scoring domains of reading-to-write task?
Table 6 also shows that the second largest variance component was the person-by-rater interaction plus undifferentiated error, which accounted for about 7% to 16% of the total variance of each analytic domain score. This result suggests that the relative standing of test takers in text engagement was more likely than in other domains to change across raters or because of unidentified sources of error. The smallest percentage of variation was found for the main effect of raters, ranging from 1% to 3% of the total score variance, indicating that raters in this study were neither too harsh nor too lenient in their ratings on each dimension. However, it should be noted that the largest rater variance was observed for the viewpoint recognition subscore (3%), followed by text engagement (1.8%), which means that raters differed more in severity in reading-related domains than in writing-related ones (organization, development, and language use) on the RTW task. It was also found that the covariance components were small in both the rater and the person-by-rater plus error facets. This suggests that the ratings did not systematically vary in overall severity across different dimensions of the RTW task, and raters did not show identifiable patterns of rating severity across all students.
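As a rough illustration of how the person, rater, and person-by-rater plus error components discussed above can be estimated, the following sketch uses the standard expected-mean-square equations for a univariate, fully crossed p × r design with one score per cell. This is a simplification: the study itself used a multivariate design, and the function names and score matrix here are hypothetical.

```python
def g_study_p_by_r(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores: list of rows, one row per person, one column per rater.
    Returns the person (p), rater (r), and residual (pr,e) components.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Sums of squares for the two main effects and the residual.
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; estimates can be negative
    # in small samples.
    return {
        "p": (ms_p - ms_res) / n_r,
        "r": (ms_r - ms_res) / n_p,
        "pr,e": ms_res,
    }

def percent_of_total(components):
    """Express each component as a percentage of total variance.

    Negative estimates are set to zero first, a common convention.
    """
    clipped = {name: max(value, 0.0) for name, value in components.items()}
    total = sum(clipped.values())
    return {name: 100 * value / total for name, value in clipped.items()}

# Hypothetical ratings: five test takers scored by two raters on one domain.
example_scores = [[4, 5], [3, 3], [2, 3], [5, 5], [1, 2]]
components = g_study_p_by_r(example_scores)
shares = percent_of_total(components)
print(components)
print(shares)
```

A large share for `p` relative to `r` and `pr,e` corresponds to the pattern reported above, in which most score variance reflects differences among test takers rather than among raters.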
RQ4: To what extent does increasing the number of raters affect reading-to-write task score dependability of each analytic scoring dimension?
Table 7 displays the phi coefficients (Φ) estimated from the D-study for the p• × r• design as the number of raters increases. As can be seen in Table 7, score dependability across all analytic subscores and composite scores substantially increases as the number of raters increases. More specifically, the phi coefficients increased most sharply when the number of raters increased from one to two, which is in line with the D-study results of previous studies on integrated reading-to-write tasks (Gebril, 2009, 2010). The impact of the number of raters on score dependability is more clearly captured in Figure 1.
Table 7. Estimated dependability coefficients (Φ) of analytic scoring domains for increasing numbers of raters.
Note: Viewpoint recognition (R); Text engagement (T); Organization (O); Development (D); Language use (L).

Figure 1. Change in dependability coefficients (Φ) of analytic scoring domains in relation to the number of raters.
However, it should be noted that at least four raters are required to obtain dependability coefficients of .80 or above for the text engagement, viewpoint recognition, and organization subscores, which is attributable to the relatively large error variances of these analytic domains, as can be seen in Table 6.
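The D-study logic above can be sketched for a single domain using the standard dependability (phi) formula for a crossed p × r design, Φ = σ²(p) / [σ²(p) + (σ²(r) + σ²(pr,e)) / n_r]. The variance components below are hypothetical and were chosen only to show the shape of the curve; they are not the values in Tables 6 or 7, and the function names are illustrative.

```python
def phi_coefficient(var_p, var_r, var_pr_e, n_raters):
    """Dependability (phi) for absolute decisions in a crossed p x r design."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    # Absolute error variance shrinks as scores are averaged over more raters.
    absolute_error = (var_r + var_pr_e) / n_raters
    return var_p / (var_p + absolute_error)

def min_raters_for(target, var_p, var_r, var_pr_e, max_raters=20):
    """Smallest number of raters whose phi reaches the target, or None."""
    for n in range(1, max_raters + 1):
        if phi_coefficient(var_p, var_r, var_pr_e, n) >= target:
            return n
    return None

# Hypothetical single-rater variance components for one analytic domain.
VAR_P, VAR_R, VAR_PR_E = 0.60, 0.02, 0.38

phis = [phi_coefficient(VAR_P, VAR_R, VAR_PR_E, n) for n in range(1, 7)]
for n, phi in enumerate(phis, start=1):
    print(n, round(phi, 3))
print("raters needed for .80:", min_raters_for(0.80, VAR_P, VAR_R, VAR_PR_E))
```

Because the error term is divided by the number of raters, the gain from each additional rater diminishes, which is consistent with the sharpest increase occurring between one and two raters.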
Discussion
The results of the study contribute to our understanding of the RTW construct and how to assess both the reading and writing qualities on the RTW task. The relatively high correlation coefficients among independent reading and writing scores, and composite and separate analytic RTW test scores, suggest that a RTW task could be used as a measure of both reading and writing abilities. This appears to conflict with previous findings in which integrated writing test scores are weakly correlated with independent reading scores (Asencion, 2008; Watanabe, 2001). The stronger correlations between the RTW task and independent reading scores in this study, however, may be due to the fact that the analytic rubrics used had been developed from actual learner data, and that our analytic rubrics for the RTW task considered test takers’ reading-related ability to integrate their own ideas on the topic with those expressed in the source texts. In particular, we included two distinctive criteria in our analytic rating rubrics: viewpoint recognition and text engagement. The former was intended to capture whether students can recognize contrasting views conveyed across two texts, and take a position on a given topic, which can display the depth of their understanding of the topic addressed in both texts. The latter is related to the appropriate use of source texts in terms of overuse or underuse indicating a lack of reading comprehension (Gebril & Plakans, 2009).
In addition, the significant relationship between the reading/content-related dimension of the analytic rating scales, viewpoint recognition, and independent reading skills was captured in the partial correlations. This result mirrors Sawaki, Quinlan, and Lee’s (2013) finding that the comprehension aspect of integrated writing was strongly related to reading and listening comprehension scores. This also corroborates earlier studies on RTW tasks that suggest reading skills are essential for source-based writing because they can facilitate the integrated writing process and outcomes (Feak & Dobson, 1996; Plakans, 2009b; Plakans & Gebril, 2012). These results suggest that test takers’ reading comprehension abilities can be reflected in the scoring of RTW products as long as the rubrics embody them in the criteria. On the other hand, with regard to independent writing, only the language use component was significantly related to writing scores. This result is in line with previous studies (McNamara, 1990; Perkins, 1980; Sakyi, 2001; Sweedler-Brown, 1993) in which holistic raters tended to pay more attention to linguistic features such as grammar and vocabulary than to rhetorical features in judging the overall quality of independent essay writing. It is notable that language use was also highly correlated with independent reading test scores. This may be because the language use domain was more or less defined as an overall indicator of L2 proficiency relating to grammatical and lexical accuracy.
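For readers less familiar with partial correlations, the following minimal sketch shows the standard first-order formula. It assumes, purely for illustration, that the controlled variable is the independent writing score; the three zero-order correlations are hypothetical and are not results from this study.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical zero-order correlations:
# x = viewpoint recognition subscore, y = independent reading score,
# z = independent writing score (the variable assumed to be controlled).
r_recognition_reading = 0.60
r_recognition_writing = 0.50
r_reading_writing = 0.40

partial = partial_correlation(
    r_recognition_reading, r_recognition_writing, r_reading_writing
)
print(round(partial, 3))
```

The partial correlation removes the variance that both variables share with the controlled score, so a value that remains sizable indicates a relationship not explained by the third variable.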
Although our analytic rating rubric was constructed to reflect distinctive aspects of RTW tasks, it is still not certain that human raters can reliably treat these dimensions as meaningful criteria. To address this issue, we first examined the variability among test takers on the RTW task across the five analytic scoring domains. This is an important issue for integrated tasks in general, as Gebril (2009) observed more variability on integrated tasks than on independent tasks. Gebril and Plakans (2014) reasoned that this large variability of integrated tasks is attributable to raters’ more impressionistic judgment on integrated tasks because raters need to attend to many different aspects of RTW task quality, including source information location, citation mechanics, and quality of source use, at the same time. Our results revealed that most of the variability of test scores across all analytic rating domains was similarly accounted for by differences in test takers’ RTW-related proficiency, whereas the rater effect was minimal and the person-by-rater interaction effect, which includes undifferentiated measurement error, was relatively small in general across all subscores. With only two raters, relatively high score reliability (.80 or above) was achieved for composite scores and many subscores.
However, it should be noted that more variability due to person was found in the G-study on writing-related subscores, including the development and language use domains, than on text engagement. Conversely, the main effect for raters and the interaction between test takers and raters were more noticeable on reading-related subscores (viewpoint recognition and text engagement) than on writing-related ones. Consequently, four or more raters, as opposed to two or more for the development and language use domains, were required to reach score reliability of .80 or above on reading-related subscores and the organization domain. This result suggests that well-trained raters could reliably judge the overall qualities of a RTW task with carefully designed, data-driven analytic scoring rubrics. On the other hand, our study also reveals that the complex nature of RTW tasks makes raters face more challenges when they assign scores to reading-related analytic categories. This result corroborates the recent findings of Gebril and Plakans’s (2014) qualitative study on rater reactions to integrated writing tasks, which shows that raters gave more weight to certain domains over others across different proficiency levels. Their raters attended more to linguistic features and simple citation mechanics at lower levels, whereas organization and the quality and accuracy of source integration started to be recognized from the middle level of the rating scale. This may explain the relatively lower variability by person but higher rater and error variance components observed for organization and source use-related subscores in our study.
On the whole, this study suggests that a RTW task may tap into both reading and writing abilities, given that both composite and separate analytic scores are highly correlated with independent reading and writing scores. Each analytic rating domain could capture the variability in test takers’ proficiency utilized in a RTW task, and raters assigned scores across each analytic rating domain fairly consistently. Thus, students’ actual levels of performance on the RTW task were well represented by scores. These promising results support the use of integrated RTW tasks and data-driven analytic rating rubrics. However, higher reliability of RTW scores would be achievable if raters were more carefully trained to judge test takers’ ability to combine and synthesize the selected information from source texts. In addition, the current analytic rubric for RTW tasks still needs to be refined to better capture the integrated nature of reading and writing ability. Continuing to work on such rubrics in local contexts in which the tasks, texts, and curricular goals are well known may suggest additional criteria for assessing the RTW construct more broadly.
Limitations and future directions
Subsequent research on RTW tasks will need to address the limitations of the current study. It will be necessary to use more challenging reading texts in which the authors’ viewpoints are less obvious or the topic under discussion in the texts is more complex. Students’ responses to more complex texts may reveal a greater role for reading comprehension in the written responses. The topic of video games itself may be too familiar to many young undergraduate students, which might minimize their use of the sources in their writing. It is also notable that our rubrics were developed for this specific RTW task with argumentative texts. Thus, these rubrics may not be applicable to other types of tasks and texts, particularly narrative or expository texts. For example, non-argumentative texts rarely contain specific views on a given topic, and thus, students do not position themselves on the topic. Likewise, whether students noticed the authors’ positions in their writing will not entirely reflect the degree of their understanding of the reading. In addition, although our analytic rubric for this RTW task included new components such as viewpoint recognition and text engagement, it was written in a relatively traditional way for more general application to other argumentative text-based RTW tasks. In the future, it would be useful to see whether the categories, rhetorical features and textual knowledge, mentioned above, would yield similar results. On the other hand, it would also be fruitful to develop more specific text/task-based analytic rubrics for other RTW tasks. This would allow for more detailed descriptors, which tend to better distinguish between different aspects of reading and writing (Knoch, 2009).
Beyond adjustments to the current task, future research will also need to investigate a wider variety of RTW tasks, and RTW tasks for different types of assessment, such as learning outcomes rather than placement, or for a curriculum that is more reading than writing focused. Thus, a future multivariate G-study that includes task as another facet of measurement is highly recommended because previous studies using G-theory on integrated tasks demonstrated that the task and the person-by-task facets contribute more to score variability (Lee & Kantor, 2005; Gebril, 2009). Finally, it should be noted that the sample size of 83 is relatively small for G-studies, which may account for the negative variance components observed in this study (Briesch, Swaminathan, Welsh, & Chafouleas, 2014). The small sample size may thus limit the stability and generalizability of the results of this study, and future research should replicate the current study with larger and more representative samples.
The process of this study also revealed a number of other fruitful avenues for research that could shed light on the nature of RTW tasks and, more generally, assessment by human raters. The development of rating scales and of raters needs to be better understood. The conversations of the raters during the development of the analytic rating scale for RTW tasks might lead to a better understanding of how raters conceive of the reading component of these tasks as well as suggest ways in which the reading comprehension construct might be better represented in rating scales. Conversely, analyzing conversations of raters developing rating scales for an independent writing task in comparison with the RTW task might suggest ways in which the tasks are conceived of differently.
Appendix A: Prompts used for independent writing
Appendix B: Integrated reading-to-write task prompt
NAME _________________________________Student Number _______________
Test Room Number __________________ Date ________________
Read the two accompanying texts. Each one provides a different perspective on violent video games. Write a response to these readings in which you present your own perspective in relation to each of the perspectives found in these texts.
(Both texts are adaptations. The original texts can be found in J. Woodward (Ed.), Popular culture: Opposing viewpoints, Detroit, MI: Greenhaven Press, 2005).
Appendix C: Analytic rating rubrics for reading-to-write task
Viewpoint recognition: How well a student recognizes and understands positions/views in source texts.
5: The essay demonstrates full understanding of the main ideas of the two source texts and expresses own position in explicit relation to them.
NB: At this level, the essay includes two key opposing viewpoints followed by supporting arguments for each differing viewpoint.
4: The essay demonstrates good understanding of the main ideas of the two source texts and expresses own position.
NB: At this level, the essay includes two key opposing viewpoints but is followed by supporting arguments for either one of the viewpoints.
3: The essay demonstrates some understanding of the two source texts, or it does not take a clear position.
NB: There are two types of level 3 recognition essay: (1) the essay includes the two key opposing viewpoints but no supporting arguments from either text, although the writer expresses his or her position; (2) the essay includes both key viewpoints and supporting arguments, but the writer’s own position is not expressed.
2: The essay demonstrates partial understanding of at least one text.
NB: From level 2 and below, whether you express your own positions does not matter. The amount of text understanding will determine the scores between level 2 and level 1.
1: The essay demonstrates little or no understanding of both texts.
Text engagement: How well the writer uses the texts, that is, whether the texts are used in an unbalanced way by overusing (too much) or underusing (too little) them (degree of copying and reference to texts).
“Text engagement” is different from “Recognition” in that it taps into how students used the texts, whereas “Recognition” refers to whether or not they understood the main ideas of the two texts.
5: The essay is completely constructed in students’ own words (mostly paraphrased), and the two source texts are adequately referred to.
4: The essay is mainly constructed in test takers’ own words (some paraphrases but mostly direct quotations), and the two source texts are referred to.
3: The essay is partially constructed in test takers’ own words (some copying here and there), and/or there is little reference to both source texts.
NB: Level 3 is for essays that are mostly written in students’ own words but rarely refer to source texts.
2: The essay heavily relies on source texts (marked by multiple verbatim source uses).
NB: From level 2 and below, the level of reference to the texts does NOT affect the scores, but the amount of copying from source texts will determine the scores between levels 2 and 1.
1: The essay has extensive copying from source texts (marked by extensive verbatim source use usually without appropriate quotation marks).
Organization: How well a student’s own position/view has been organized (between paragraphs) and cohesively presented (between sentences).
5: The essay is clearly and logically organized, using appropriate and effective cohesive devices.
4: The essay is generally well organized including reasonable introduction and conclusion with some suitable use of cohesive devices.
3: Organization can be followed with short introduction and conclusion but with some limited repeated cohesive devices.
2: Organization is hard to follow, with little and ineffective use of cohesive devices.
1: There is no evident organizational plan, and the misuse or omission of cohesive devices is frequent.
Development: How well and to what degree the main argument is supported by examples and explanations.
5: The essay is thoroughly developed using appropriate explanations, exemplifications, and/or details.
4: The essay is generally well developed, but it may contain occasional redundancy, or digression in ideas.
3: The essay shows some development with minimal examples and explanations.
2: The essay has little development of ideas with limited explanations or details.
1: The essay shows no development of ideas with no supported explanations or examples.
Language use: How appropriately the writer uses language in terms of syntactic accuracy and lexical richness.
5: The essay contains few local grammatical and lexical errors. It uses a variety of simple and complex sentence structures, and the range of vocabulary is rich enough for academic writing.
4: The essay has few grammatical/lexical errors, and/or vocabulary may lack academic sophistication.
3: The essay has several grammatical/lexical errors, and/or demonstrates little syntactic or lexical variety.
2: The essay contains multiple errors that may interfere with meaning, vocabulary is limited, and/or sentences are constructed in a simple manner.
1: The essay contains numerous errors that interfere with meaning, vocabulary is quite basic, and most sentences are not properly structured.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
