Abstract
Assessing the content of learners’ compositions is a common practice in second language (L2) writing assessment. However, the construct definition of content in L2 writing assessment potentially underrepresents the target competence in content and language integrated learning (CLIL), which aims to foster not only L2 proficiency but also critical thinking skills and subject knowledge. This study aims to conceptualize the construct of content in CLIL by exploring subject specialists’ perspectives on essays’ content quality in a CLIL context. Eleven researchers of English as a lingua franca (ELF) rated the content quality of research-based argumentative essays on ELF submitted in a CLIL course and produced think-aloud protocols. This study explored some essay features that have not been considered relevant in language assessment but are essential in the CLIL context, including the accuracy of the content, presence and quality of research, and presence of elements required in academic essays. Furthermore, the findings of this study confirmed that the components of content often addressed in language assessment (e.g., elaboration and logicality) are pertinent to writing assessment in CLIL. The manner in which subject specialists construe the content quality of essays on their specialized discipline can deepen the current understanding of content in CLIL.
Introduction
Content, which comprises the writer’s expressed ideas or meanings in written forms (Bae et al., 2016), is an important point of focus in second language (L2) writing. In the assessment of L2 writing, the ideas or thoughts expressed in learners’ compositions are almost always explicitly evaluated along with language use. For example, the influential analytic rating scales designed by Jacobs et al. (1981) and Spandel and Stiggins (1997) entail criteria labeled as Content and Ideas, respectively. Accordingly, assessing learners’ compositions focusing on content quality, including task achievement and idea elaboration, is a common practice in L2 writing assessment (e.g., Carr, 2011; Hyland, 2019).
However, the current conceptualization of content in the L2 writing assessment literature may be inadequate for content and language integrated learning (CLIL), defined as “a dual-focused educational approach in which an additional language is used for the learning and teaching of both content and language” (Coyle et al., 2010, p. 1). CLIL aims to cultivate not only L2 proficiency, but also knowledge of certain academic subjects (e.g., world history and environmental issues) as well as cognition and critical thinking skills (Mehisto & Ting, 2017). Subject knowledge and cognitive skills, in addition to L2 proficiency, form part of essential constructs in assessment in CLIL contexts (Ball et al., 2015; Sato, 2023). To assess these constructs, writing tasks generally require learners to apply specialized subject knowledge (Sato et al., 2021). In this context, the construct definition of content applied in L2 writing assessment potentially underrepresents the target competence in CLIL, because L2 assessment has traditionally regarded learners’ background knowledge as a source of construct-irrelevant variance (Llosa, 2017).
Therefore, conceptualizing content quality by exploring its components is warranted for assessing learners’ compositions to prevent construct underrepresentation in CLIL contexts. To this end, this study investigates the content quality of academic essays from readers’ perspectives. Research on how readers evaluate L2 learners’ compositions has contributed to the understanding of the construct of L2 writing ability (Barkaoui, 2020). This study applies this approach to conceptualizing a construct, exploring the essay features influencing readers’ judgments of content quality. Since CLIL learners are generally required to write essays on a particular subject and expected to demonstrate their subject knowledge, this study investigates subject specialists’ perspectives. The way they construe the content quality of essays in their specialized discipline can deepen the existing understanding of content in CLIL.
The current study was undertaken to develop a new rubric for assessing the content quality of student essays in CLIL. In particular, subject specialists’ perspectives offer valuable implications for assessing essays in language-driven CLIL programs, which emphasize the achievement of linguistic objectives (Ball et al., 2015). In such programs, language teachers, who are not necessarily experts in a particular academic subject, implement content-oriented L2 syllabi and conduct lessons (Macaro, 2018). However, they often prioritize the assessment of the linguistic aspects of compositions, without delving deeply into the analysis of their content or presented ideas (Sato et al., 2021). Empirically developed rubrics are likely to capture the quality of essays valued in the discipline more accurately than those created by language teachers intuitively or derived from the conceptualization of content in L2 writing assessment.
Literature review
The following two subsections will provide an analysis and comparison of the conceptualizations of content in L2 writing assessment and the assessment of content in CLIL and English-medium programs.
Conceptualization of content in L2 writing assessment
The theoretical underpinning of L2 proficiency addresses the importance of content. As meanings in discourse are influenced by various types of knowledge related to L2 use, the communication of meaning and meaning itself are considered to play a prominent role in models of communicative language ability (e.g., Bachman & Palmer, 2010). In particular, Purpura (2017) regarded meaning and meaning conveyance as “the cornerstone of L2 proficiency” (p. 48), since the ability to express meaning is an essential quality of successful communication. For instance, he considered a situation where an L2 is accurately used to summarize a story but the information provided in the summary is not factually accurate. This suggests that the writer may not possess all the necessary competences to communicate successfully. It follows that assessing the summary based solely on linguistic knowledge fails to represent their proficiency, in line with Purpura’s view above. In fact, content has been a criterion commonly included in analytic rating scales for assessing L2 writing proficiency. Table 1 presents representative components of content addressed by rating scales implemented in L2 writing assessment, specifically in a criterion labeled as Content or Ideas. These rating scales are used in both classroom-based and large-scale assessment settings. As shown, content appraised in L2 writing assessment includes a range of components, such as the degree to which the composition fulfills task requirements, provides relevant information, and incorporates well-developed thoughts. Additionally, certain aspects related to the quality of ideas are integral to the construct, such as the complexity and originality of thinking. In argumentative or integrated writing tasks, the supportiveness of opinions and the use of source materials are also considered key components of content.
Table 1. Components of content in analytic rating scales.
Several empirical studies have investigated essay features that reportedly influence raters’ evaluations of overall writing proficiency through think-aloud protocols (TAPs) (Ericsson & Simon, 1993), providing evidence that raters pay attention to the content of essays when evaluating those written by L2 students (e.g., Cumming et al., 2001; Li & He, 2015; Polio & Lim, 2020). In Cumming et al.’s (2001) seminal study, 10 experienced English as a second language (ESL) and English as a foreign language (EFL) instructors rated 60 Test of English as a Foreign Language (TOEFL) essays and verbalized their rating processes. As they were not provided with preexisting rating scales, the essay features discussed in TAPs may have reflected their interpretations of L2 writing proficiency (Zhang & Elder, 2011). The results suggest that the content of essays was a feature to which raters paid attention, including logic, task completion, topic development, relevance, interestingness, originality, creativity, redundancies, and ideas (see also Li & He, 2015; Polio & Lim, 2020). More recent studies have explored other content-related features that raters consider when evaluating learners’ writing proficiency, including the persuasiveness of arguments (Kuiken & Vedder, 2014) and the effectiveness of source use (Gebril & Plakans, 2014). However, these studies could not reveal the perceived importance of content compared with other features, since the frequency of comments on essay features does not precisely reflect the strength of their influence (Sato & McNamara, 2019).
Empirical studies have addressed the relationships among content, linguistic quality, rhetorical organization, and coherence (i.e., logical connections among ideas). Some studies of raters’ evaluations of L2 proficiency have coded content features separately from linguistic features and rhetorical organization (e.g., Cumming et al., 2001; Li & He, 2015; Polio & Lim, 2020), suggesting that raters can clearly distinguish content from how the writer’s ideas are presented. Moreover, Bae et al. (2016) examined the contribution of coherence, originality, grammar, text length, and lexical diversity to the content of L2 students’ written stories, specifically in terms of detailedness, elaboration, sophistication, creativity, interestingness, and logic. Their structural equation modeling revealed that the scores on the content of the stories were influenced by the five textual elements. Accordingly, content quality can be explained by linguistic features, such as lexicogrammatical accuracy and range, as well as coherence. Additionally, Kuiken and Vedder (2017) found medium to strong Pearson’s correlations (range: .544–.938) among four dimensions of essay content: adequacy and relevancy of content, task achievement, comprehensibility, and coherence and cohesion. While content-related features addressed by L2 writing assessment (see Table 1) appear to be related to coherence, they are conceptually different (Kuiken & Vedder, 2017). Moreover, rubrics for L2 writing proficiency evaluate content features separately from coherence (e.g., Hyland, 2019).
In summary, the field of L2 writing assessment has commonly regarded essay content as a part of L2 proficiency (Purpura, 2017), as supported by the findings of empirical studies. Although these studies have shed light on the construct definition of content in L2 writing assessment and contributed to rating scale validation, they addressed ESL/EFL teachers’ or accredited raters’ perspectives on L2 writing ability. Hence, the components of content incorporated into scoring rubrics and explored by previous studies are considered as indicators of L2 writing proficiency itself. The conceptualization of content as solely representing L2 writing proficiency may not be directly applicable to CLIL, as CLIL aims to foster not only L2 proficiency but also subject knowledge and critical thinking skills.
Assessment of content in CLIL and English-medium programs
Macaro (2018) proposed a continuum of L2 English classrooms worldwide, ranging from those with language-dominant objectives to those with content-dominant objectives. CLIL, positioned in the middle of the continuum, is categorized into two types based on the level of emphasis on language and content. Language-driven CLIL predominantly focuses on achieving linguistic objectives within L2 programs, whereas content-driven CLIL aims to teach school subjects as content lessons with a primary focus on accomplishing content objectives (Ball et al., 2015). In contrast to L2 pedagogies, which primarily aim to develop L2 proficiency, subject knowledge mastery is an explicit focus in CLIL, even in language-driven CLIL contexts (Ball et al., 2015). Furthermore, content-driven CLIL, taught by subject teachers, prioritizes academic achievement in subject matters while incorporating L2 learning. CLIL’s counterpart in higher education, Integrating Content and Language in Higher Education, also addresses both L2 and content learning through close collaboration between language teachers and subject experts (Schmidt-Unterberger, 2018). Other content-driven programs, including English-medium instruction and English-medium degree programs, focus almost entirely on subject knowledge mastery, thereby excluding L2 learning from program objectives (Macaro, 2018).
Some CLIL researchers have proposed analytic rating scales that assess the content of writing, specifically designed for specific types of writing tasks (e.g., Ball et al., 2015; Cloud et al., 2000; Dale et al., 2011). For example, Dale et al.’s (2011) study required students to plan an expedition across the Sahara by conducting research about specific areas for travel while considering possible hazards and ways to avoid them. As a final product, they were asked to write a brochure describing the expedition. The scoring rubrics included Content and Research, which focused on the accuracy of the information, thoroughness of the description, and inclusion of relevant research findings. This indicates that in CLIL, the assessment of content is not limited to evaluating L2 proficiency, such as the level of detail in writing, but also encompasses students’ subject knowledge and research skills (Ball et al., 2015; Cloud et al., 2000).
One important limitation of the rubrics proposed in the CLIL literature, which evaluates various aspects of content, is that the definition of content has little empirical basis. Instead, scholars’ intuitions have been the only source of construct definition. Thus, it remains unknown whether specific components of content covered in scoring rubrics pertain to the assessment tasks and accurately represent the construct. Moreover, no study has investigated the construct definition of content in argumentative writing assigned in CLIL classrooms, although argumentation is an important type of task assigned in CLIL and writing for various disciplines (Hirvela, 2017; Sato et al., 2021).
In content-driven programs where subject specialists teach their subject matter, students’ compositions are predominantly evaluated on their content. Rubrics employed in these courses commonly entail multiple specific content-related criteria, without dividing linguistic features into minute elements as frequently seen in L2 writing assessment and language-driven CLIL (e.g., Bean & Melzer, 2021; Bukhari et al., 2021; Gao et al., 2019; Garza et al., 2021; Walvoord & Anderson, 2010). For example, the following nine criteria exist in a rubric used at two U.S. universities to assess first-year students’ compositions, including engagement with external sources, originality of the contention, clarity of the contention, effectiveness of supporting evidence, presentation of the writer’s own idea, organization, source use, language style, use of standard written English, and formatting (Walvoord & Anderson, 2010). Similarly, in yet another American university, a general education academic writing rubric evaluates the quality of the opinion, exigence of the stated issue, quality of supporting arguments, appeal to the intended audience, source quality, source use, and citation (Bean & Melzer, 2021). Thus, the assessment of students’ compositions in higher education has embraced various content-related criteria, including the evaluation of the quality of written ideas, the persuasiveness of supporting evidence, and the effective use of external sources.
The findings of one empirical study align with the components of content as reflected in the writing rubrics used in content-driven courses. O’Hagan (2014) investigated the rating criteria adopted by subject specialists who assessed essays written by attendees of a management course at a major Australian university. The university had an economics and commerce faculty that included a department offering courses in the field of management. Within the faculty, 42.8% of the undergraduate students were international students representing diverse language backgrounds. Ten instructors specializing in management, business, and commerce rated the overall quality of 10 undergraduate students’ argumentative research essays on management. The instructors were told to mark the essays using their regular marking guide typically employed in their courses and to articulate their assessment procedures verbally. Their TAPs included linguistic and content criteria addressed in language assessment. Additionally, the assessors commented on content-related features that other L2 studies had not explored, including the accuracy of the content in relation to the research literature, level of the analysis, inclusion of key definitions, and quality of the research. Because O’Hagan’s (2014) study was conducted in a context where native English speakers and highly proficient students learned a subject, the findings may not be applicable to CLIL contexts. Accordingly, the study’s conceptualization of content does not necessarily represent content quality relevant to pedagogical settings, which aim to develop both L2 proficiency and subject knowledge in L2 learners.
The current study
Given the potential underrepresentation of the current conceptualization of content and lack of empirical studies investigating content in CLIL, this study aims to explore subject specialists’ perspectives on the content quality of essays written by L2 learners in a CLIL context. Specifically, it aims to explore specific content-related features applicable to language-driven CLIL contexts in higher education and presents empirical data indicating whether components of content addressed in the CLIL literature are truly valued in these contexts. This study adopted the approach of conceptualizing a construct based on rater perspectives (e.g., Cumming et al., 2001; Sato & McNamara, 2019) through the analysis of concurrent TAPs (Ericsson & Simon, 1993). Accordingly, the following overarching research question was formulated: What aspects of L2 learners’ argumentative essays reportedly influence subject specialists’ judgments of their content quality?
This study was conducted in the context of a Japanese higher education institution, where language-driven CLIL is implemented in a compulsory English program for first-year undergraduate students majoring in various disciplines (aged 18 and 19 years) in the second semester. Before taking the CLIL course, the students were required to pass an English for Academic Purposes course in the first semester, where they learned academic skills including essay writing. The CLIL course, taught mostly by language teachers, aimed to foster L2 proficiency through the medium of an academic subject or topic with which teachers were familiar. Additionally, essay writing assignments constituted 20% of the students’ overall grades in the course.
Method
Participants (raters)
Eleven researchers with expertise in English as a lingua franca (ELF) (Seidlhofer, 2011) were included as participants in the current study. To recruit them, the researcher contacted 22 scholars in Japan who possessed extensive knowledge of ELF via e-mail. A database on researchers in Japan, researchmap, was used to search for those who specialize in ELF. The researcher selected those who had presented ELF-related studies at conferences or published articles in peer-reviewed journals on ELF. Researchers in Japan were chosen, as data collection was planned during the long vacation periods of most Japanese universities so that busy individuals could participate. Consequently, 12 assistant and associate professors with expertise in ELF from three public and five private Japanese universities, located both in rural and urban areas, participated in this study as raters and received an honorarium for their participation. However, one participant was excluded from the study after the data collection, as their TAP data entailed only unelaborated comments without any specific features reportedly influencing their judgments of content quality, even after training. Table 2 presents the background information of the remaining 11 participants. All participants had English teaching experience at Japanese universities and were, thus, familiar with L2 pedagogies in higher education. Therefore, their perspectives on essay content quality may be influenced by not only their expertise in the discipline but also their experience as L2 teaching professionals at the university level, including the instruction and assessment of L2 academic writing.
Table 2. Raters’ background information.
Research instruments
The researcher collected argumentative essays in order to obtain the participants’ judgments of content quality. The essays were written as assignments by Japanese undergraduate students who attended a CLIL course, titled Global Englishes, and already marked by its instructor. In the course, students, whose English proficiency level was B1 as per the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001), learned about the current issues relating to English language use worldwide and considered the type of English they should learn in the future, utilizing authentic resources (e.g., Jenkins, 2015). A writing assignment with the following prompt was assigned in the course: You have been learning world Englishes.
What do you think about learning non-native speakers’ Englishes? Do you think learners should learn only native speakers’ English, or should we learn more world Englishes? In your essay, you must state your opinion and give two reasons to support your opinion.
In this task, students needed to argue their stance using supporting evidence by applying their knowledge gained from the ELF paradigm, which focuses on the “fluid and flexible kinds of English use that transcend geographical boundaries” (Jenkins, 2015, p. 42). The students were given approximately 1 month to complete their essays. The task difficulty was assumed to be moderate given that the students had already learned how to write academic essays by searching for a text related to a topic and following academic conventions.
The researcher obtained 65 essays and their scores from the CLIL instructor. The scores were based on 11 criteria developed by the instructor to evaluate students’ essays for their course grades. These criteria include the quality of their introduction, thesis statement, topic sentence, supporting sentences, and conclusion, as well as the accuracy of linguistic and formatting aspects. From the pool of essays, 21 essays were selected based on students’ main arguments, reasons for supporting such arguments, and scores given by the instructor. The focus of the selection process was to present the raters with essays featuring two distinct main arguments (whether to learn only native speakers’ English or to learn more Global Englishes), encompassing various details, sources, and reasoning styles with a wide range of writing quality for evaluation. Because only five essays supported learning only native speakers’ English, all five were selected for this study. Regarding those endorsing Global Englishes, supporting evidence for this argument and their scores were examined. Sixteen essays were selected in such a manner that they contained a variety of evidence and received a range of scores. Supplemental Appendix 1 presents the main arguments, two supporting reasons, and scores awarded by the instructor for the selected essays. It is evident that even essays with the same main argument displayed distinct supporting details, and the scores varied widely, from 31 to 49. The mean word count of the 21 essays was 395.6 (SD = 66.01), and many of them cited two to three external resources (see Supplemental Appendix 2 for two example essays).
Data collection
In concurrent TAPs in the current study, the raters were asked to read the essays, evaluate their content quality, and verbalize what they were considering as they were reading through the essays. To guide raters about what they were required to evaluate, the researcher gave the following definition of content adopted from Bae et al. (2016): “Content refers to the ideas or meanings in an essay” (p. 302). This study employed a six-level semantic differential scale from 1 (poor) to 6 (excellent) with unspecified midpoints to elicit the raters’ impressionistic quality judgments. Preestablished rating scales with descriptors were not provided, as this study sought to explore the raters’ perspectives on content quality without the influence of existing resources. Therefore, it was stressed to the raters that the study’s major concerns were the criteria they used rather than the final ratings. Regarding TAPs, the raters were told to verbalize any aspects of essays while assessing content quality, convey their thoughts continuously, and provide details about their thoughts. They were allowed to choose to read the essays either aloud or silently and verbalize their thoughts in either English or Japanese.
The raters were asked to produce all TAPs individually without the researcher’s presence. First, they received the written instructions, which delineated the procedure for rating and producing TAPs, and a sample essay for training through e-mail. The sample essay, chosen from the pool of the essays, was 417 words in length, and its score awarded by the instructor was close to the mean. The raters were asked to read the instructions and rate the quality of the sample essay while producing and audio-recording a TAP. Thereafter, the rating and recorded protocol were sent to the researcher and examined to check whether the raters had adhered to the instructions. While advice was provided to those who failed to elaborate on their thoughts, the raters generally had no difficulty evaluating content quality and generating TAPs. After confirming that the raters had adhered to the written instructions, the researcher sent the raters 21 essays and requested them to submit their ratings and TAPs within 1 month. The order of the essays was randomized for each rater.
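The per-rater randomization of essay order described above can be illustrated with a minimal sketch. The source does not report how the randomization was carried out, so everything here is an assumption for illustration: the function name `essay_order_for_rater`, the seed value, and the numeric essay and rater identifiers are all hypothetical.

```python
import random


def essay_order_for_rater(essay_ids, rater_id, base_seed=2024):
    """Return an independently shuffled copy of essay_ids for one rater.

    Hypothetical helper: seeding with the rater ID makes each rater's
    order reproducible while differing from rater to rater.
    """
    rng = random.Random(base_seed * 1000 + rater_id)
    order = list(essay_ids)  # copy so the master list stays untouched
    rng.shuffle(order)
    return order


# Illustrative identifiers only: 21 essays, 11 raters.
essay_ids = list(range(1, 22))
orders = {rater: essay_order_for_rater(essay_ids, rater) for rater in range(1, 12)}
```

Giving each rater a separately seeded order spreads any sequence effects (e.g., fatigue or contrast with the preceding essay) across essays rather than concentrating them on the same ones.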
After the essay ratings and TAPs had been returned, the researcher conducted and audio recorded approximately 30-minute individual interviews to (a) examine the effect of verbalization on the ratings (Barkaoui, 2011) and (b) elicit supplementary information on the criteria for content quality (Green, 1998). The interview questions are presented in Supplemental Appendix 3.
Data analysis
The 11 raters’ TAPs were scrutinized to explore the essay features that reportedly influenced their judgments of content quality. First, the researcher utilized a data transcription service to transcribe the recorded TAPs using standard orthography. Subsequently, the researcher conducted a meticulous accuracy check on the transcriptions. Afterward, the researcher differentiated between interpretation strategies and judgment strategies based on Cumming et al.’s (2001) framework of rating behavior. While the interpretation strategies included reading the essays aloud, interpreting unclear phrases, and summarizing ideas, the judgment strategies are pertinent to the evaluation of content quality to assign scores. Thus, the former reflected the raters’ understanding of the essay content, while the latter reflected their perspectives on content quality and content aspects to which they had attended. As this study’s focus is on the raters’ evaluative criteria, parts of TAP data irrelevant to evaluation (i.e., those representing the interpretation strategy) were excluded from the analysis. Cumming et al. (2001) further distinguished self-monitoring, ideational, and language foci in each strategy. The self-monitoring focus in judgment strategies involves establishing raters’ own criteria, as well as considering their personal responses and biases. As TAPs with this focus do not directly reflect the essay features that raters attended to while judging content quality, the data indicating raters’ self-monitoring were not analyzed.
The TAPs demonstrating the raters’ judgment strategies without self-monitoring were segmented and analyzed using thematic analysis (Braun & Clarke, 2022). The researcher segmented the verbalized data into units, each of which represented a single essay feature to which raters attended (Green, 1998). For example, the following consecutive comments were divided into three segments, wherein slashes indicate the boundaries of the segments: Yeah, so just some misspelling, instead of world Englishers it’s world Englishes. / And there’s actually no reference to their information about 1.5 billion people who speak English around the world. So I’d like to see a reference being used here. / And the paragraph itself is very short. (Rater 1, Essay 16, Segments 5–7)
Each of these segments represents a single and distinct aspect of the essay: mechanics, presence of citations and research data, and amount of information. Segmentation was iteratively modified depending on the coding of the data.
The TAPs were interpreted and coded adopting a process of thematic analysis known as Process A, which involves identifying themes or patterned meanings from the dataset (Braun & Clarke, 2022). More specifically, the researcher first identified themes and then inductively coded them through multiple readings of the TAP transcripts. Themes were then refined by restructuring, redefining, and renaming (Braun & Clarke, 2022). To confirm intercoder reliability, the TAPs of two randomly selected essays (505 segments) were recoded independently by a PhD student in applied linguistics. The kappa coefficient, indicating the degree of coding agreement, was .74, which is regarded as adequate (Drisko & Maschi, 2016). The disagreements between the two coders did not exhibit coherent patterns. It is possible that these disagreements were influenced by the large number of categories involved in the analysis. The researcher and second coder resolved these disagreements to finalize the segmentation and categorization, resulting in 5,289 segments categorized into 11 themes. Finally, subcategories of each theme were identified to reveal its detailed components (Drisko & Maschi, 2016).
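To make the agreement statistic concrete, the following is a minimal sketch of how a kappa coefficient for two coders can be computed. It assumes Cohen's kappa, which is the usual choice for two coders assigning nominal categories, although the source does not name the variant. The function name and the category labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same segments."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of segments.")
    n = len(coder_a)
    # Observed agreement: share of segments given the same category.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal proportions, summed.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both coders used a single identical category
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six segments (not the study's data).
coder_a = ["logicality", "elaboration", "mechanics", "elaboration", "logicality", "mechanics"]
coder_b = ["logicality", "elaboration", "mechanics", "logicality", "logicality", "mechanics"]
kappa = cohen_kappa(coder_a, coder_b)  # observed 5/6, chance 1/3, kappa 0.75
```

Because kappa subtracts chance agreement, it is lower than the raw agreement rate (here 5/6), and the chance term depends on how the coders' labels are distributed across the available categories.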
The researcher obtained and verified the transcripts of the online interview data using the same procedures as the TAP data. The analysis of the interview data mainly focused on the essay features the raters believed affected the evaluation of content quality. The researcher categorized the mentioned features and analyzed the perceived degree of influence of lexicogrammar on the raters’ judgments. Furthermore, the study investigated the impact of producing TAPs on the rating process and outcome.
Results and discussion
Table 3 presents the 11 categories of essay features identified from the coding of raters’ TAPs and frequency of segments within each category (see also Supplemental Appendix 4 with examples by coded category). The findings indicate that the raters reportedly attended to a wide range of aspects when judging essay content quality. However, not all the themes are necessarily regarded as components of content regardless of their frequency. The following section presents and discusses the findings by focusing on (a) linguistic and presentation features, (b) content-related features addressed in L2 writing, and (c) content-related features often regarded as construct-irrelevant in L2 writing.
Table 3. Main categories of criteria mentioned by the raters.
Linguistic and presentation features
Altogether, 21.2% of raters’ evaluative comments were on linguistic and presentation features, including lexicogrammatical accuracy, mechanics, and reference formatting, while the remaining 78.8% of the comments directly addressed the content quality of the essays. The raters mentioned various linguistic errors while reading the essays aloud and producing TAPs. Nevertheless, 10 raters stated that the influence of such errors was minimal if they did not hamper the comprehensibility of the message. This perspective is reflected in the following verbatim quote from a TAP: Okay, so there’s a lot of small grammatical mistakes and word choice mistakes, but it doesn’t really affect the meaning of the paper. So, I am not going to give negative points for any kind of grammatical errors here. (Rater 1, Essay 7, Segment 34)
A plausible reason for the small influence of linguistic errors in this study is ELF’s essential standpoint, which emphasizes functional effectiveness rather than conformity to native English speakers’ linguistic norms (Seidlhofer, 2011). As the participants were researchers with expertise in ELF, this standpoint might have influenced their judgments, as acknowledged by Raters 1 and 4 in the interviews. Furthermore, it is possible that linguistic and presentation features were regarded separately from content quality, although reading the essays aloud might have drawn attention to these features (Barkaoui, 2011), resulting in the high frequency of comments. In the post-session interview, four raters mentioned that they consciously disregarded formal aspects of essays while rating content quality. Rater 2’s comment particularly suggests that essay content is distinct from how it is presented: Sometimes I found myself a little distracted by citations outside the sentence or something, but there was only a distraction. And so, I wasn’t analyzing these errors. And I was trying to focus on what the student meant. . . . I tried to reward the content, not according to the form it took, according to conventions, but according to the quality of the content, according to the ideas that were being presented. (Rater 2, Interview)
Supplemental Appendix 5 provides supplementary examples of comments offered by the raters during the interview sessions. This finding suggests that the influence of language and mechanical errors on content quality may be limited. However, it is worth noting that among the raters, only Rater 7 penalized linguistic errors that could have been eliminated through careful editing. This observation implies that the absence of thorough proofreading by the writers could potentially impact the evaluation of the essays’ content quality.
Content-related features addressed in L2 writing
The raters addressed logicality, elaboration, task achievement, complexity, and clarity while evaluating the essays’ content quality. These have been regarded as components of content and incorporated into scoring rubrics used in language assessment (see Table 1). This section focuses on the criteria unique to the writing task utilized in this study.
A prominent feature that received attention from raters during the evaluation of essay content quality was logicality, which encompassed elements such as logic and flow (consistency of argument and existence of contradiction and logical leap), supportiveness of evidence or logicalness (Paul & Elder, 2021), and alignment among different essay parts. In particular, in their TAPs, they mentioned whether external resources incorporated into the essays supported students’ claims, the titles indicated the content of the essay, body paragraphs were connected to the thesis statements, and the concluding paragraphs excluded the information not stated previously. During the post-session interviews, 10 raters acknowledged the importance of logicality in their judgments of content quality. For instance, Rater 1 mentioned the saliency of the link between the writers’ opinions and supporting arguments: So I really strongly looked at the way that they linked their opinion and does that reference support their opinion. Sometimes, they find different kinds of references that are opposite. So, I am trying to make sure that that reference that they found actually supports their opinion. That was the biggest area that I was looking for, for my ratings. (Rater 1, Interview)
The TAP and interview data suggest that reasoning displayed in argumentative essays, which involves the use of logic to support claims and build convincing arguments, can be considered as a major aspect of content quality.
The TAPs also included comments on content elaboration, which has been incorporated into many existing scoring rubrics (e.g., Carr, 2011; Lee et al., 2008; UCLES, 2020) and found to be a criterion used by raters to evaluate writing proficiency (e.g., Cumming et al., 2001; Polio & Lim, 2020). A unique feature related to elaboration in the current study is the expansion of cited information and the explanation of how terminology is defined. For example, Rater 10 mentioned a lack of elaboration on the content of research cited to support a writer’s opinion: The essay states that academic presentations conducted by non-native English speakers are less credible than those given by native English speakers. But it doesn’t say who rated the credibility and what criteria were used. (Rater 10, Essay 18, Segment 17)
In L2 writing assessment, elaborated content often means that opinions are supported by reasons and examples (Hyland, 2019). This study, however, suggests that simply providing reasons may not suffice even though external resources and citations are deployed.
The complexity of content, including the depth of ideas and inclusion of multiple perspectives (Bae et al., 2016; Carr, 2011; Paul & Elder, 2021), was also considered as a component of content quality. The raters judged content complexity by appraising whether the students had considered all the essential points related to the prompt. Arguments omitting important points gave them the impression that the students had completed their essays without careful consideration of their position. The raters utilized their ELF knowledge to judge the degree of content complexity. For example, Rater 5 mentioned that Essay 11 failed to address accommodation, an essential ELF concept (Seidlhofer, 2011): The second paragraph is kind of problematic because it really overlooks the idea of accommodation, which is really kind of sad because the Sweeny reference that the student refers to inside of the same paragraph is about accommodating. (Rater 5, Essay 11, Segment 44)
Failing to mention essential points in writing may also reveal a lack of understanding of learned concepts (Ball et al., 2015). Consequently, careful consideration of the theme by applying a variety of background knowledge gained from class may be required to demonstrate the complexity of ideas and subject knowledge.
Content-related features often regarded as construct-irrelevant factors in L2 writing
This study explored features that the existing language assessment literature has rarely addressed: content accuracy, citation and research, and coverage. Although the frequency of comments on these categories was not large, they were perceived as relevant to the judgments of academic essays on a specialized subject.
The accuracy of written ideas, information, assumptions, and quotes was an essay feature reportedly influencing raters’ judgments of content quality. This category was mainly concerned with whether the written facts or assumptions about linguistics and Global Englishes were accurate, and the raters evaluated it using their specialized knowledge of the discipline. For example, Rater 2 responded to the argument that learning Asian people’s English varieties is indispensable to communicate with them: Well, as I’ve remarked in other comments on other essays, it’s possible to communicate with Asian people without learning their English if we accommodate and adapt. That’s what I’d like to say to the student. (Rater 2, Essay 9, Segment 23)
This comment implies that the idea given by the writer is incorrect, as accommodation strategies enable people to communicate without learning the interlocutors’ English varieties (Jenkins, 2015). Furthermore, in addition to the accuracy of the facts and assumptions presented in the essays, the raters mentioned in their TAPs whether writers cited external resources accurately. For instance, Rater 8 addressed a citation in Essay 13: I think Bambose’s article is probably related to World Englishes because it is in the journal called World Englishes. So, I suspect if Bambose really claimed, Only native English is a suitable model for all the English language learners. I don’t think that’s what he said. (Rater 8, Essay 13, Segment 7)
Raters 5, 7, and 8 addressed the accuracy of this citation, claiming that its content is incorrect.
During the post-session interviews, four raters mentioned the importance of content accuracy. In particular, Raters 5 and 7 mentioned that they lowered the content quality ratings if the writers’ opinions appeared to be misled by inaccurate assumptions, facts, or citations. For instance, Rater 5 claimed that the inaccurate interpretation of external resources affected the evaluation of content quality, as a conclusion drawn from inaccurate evidence could be misleading: There were a couple of references that I thought the student read in the wrong way. So, there was a citation to a 2010 Lev-Ari and Keysar paper about non-native speech being less credible. And the entire point of the paper is non-native speech is less credible. That’s what they claimed in their paper. The student read that as evidence for non-native speakers are less intelligible, which is not what they were saying. So, whenever I saw evidence of a misreading of a text, that tended to lower scores as well. (Rater 5, Interview)
However, Rater 7 stated that small misunderstandings do not influence the overall evaluation of content quality if the argument is coherent. This suggests that inaccuracy does not necessarily lead to a negative evaluation of content even though subject experts recognize students’ misconceptions.
Content accuracy has not been incorporated into conventional language assessment, because students’ background knowledge is generally regarded as a construct-irrelevant factor (Llosa, 2017) and language teachers, rather than content experts, normally evaluate compositions. However, it is considered relevant when subject specialists examine the content of essays on the subject in tasks involving stating opinions by synthesizing information from various sources in CLIL contexts (Dale et al., 2011). This perspective is supported by O’Hagan’s (2014) research conducted in an English-medium program, in which experts in management referred to content accuracy, conflicts with the research literature, and overgeneralization in essays on management while evaluating essay quality. However, this aspect may be less important than logicality in argumentative essay tasks in CLIL settings.
Another essay feature unique to this study was related to the citations and research used to support the writers’ arguments. The raters examined whether references and research data were utilized to make the information credible and persuasive. For instance, Rater 4 mentioned that Essay 19’s writing should have included research data demonstrating that non-native English-speaking teachers (NNESTs) will be in great demand: I’d like to have seen a statistic where perhaps the number of NNESTs in teaching advertisements were in higher demand somewhere. A little bit of research done by the writer themselves there, I think, would have uncovered that reality. (Rater 4, Essay 19, Segment 88)
The raters also paid attention to the source of citations when evaluating the content quality of the essays. They negatively perceived citations from Wikipedia or gray literature (Wette, 2021), but positively regarded those from reputable peer-reviewed journals. They also considered the year of publication, noting that resources published more than 20 years prior, unless they are seminal studies, might be considered outdated for supporting the writers’ arguments, as revealed in their TAPs. Furthermore, the quality of cited information and research was evaluated. For instance, Essay 15 contained the statement, “Difference in pronunciation is the most frequent problem and difficult to solve (Jenkins, 2002),” which was used to support the view that learning Global Englishes allows people to grasp pronunciation differences and overcome potential misunderstandings. Rater 5 commented on this citation as follows: And perfect reference for that point as well. Jenkins 2002 was a perfect reference for that point. Good on you. (Rater 5, Essay 15, Segment 27)
As indicated by this TAP excerpt, besides the inclusion of references and research data, the selection of appropriate sources to support opinions can contribute to a favorable assessment of content quality.
In the post-session interviews, seven raters identified citation and research as an influential factor in their judgments of content quality. Rater 1 commented that utilizing outside references that support opinions was an important factor for a high rating of content quality: I don’t think I gave a six. I only gave a five, I think the highest. The reason I gave the five was because those students took the outside references, they read some articles, and they quoted or they paraphrased those articles, and they really utilized the references well. They were able to add it into their details and link it to their opinion. So, I really strongly looked at the way that they linked their opinion. (Rater 1, Interview)
In argumentative essays on a specialized subject, incorporating external sources as supporting evidence thus appears to positively influence raters’ views of the credibility of opinions and of overall content quality.
The inclusion of citations as well as quality of sources and researched information were found relevant and salient to writing tasks in which students conduct research to collect evidence (see Dale et al., 2011; O’Hagan, 2014). Using relevant sources to support and expand ideas is also addressed in essay writing assignments in content-focused programs in higher education (Bean & Melzer, 2021; Walvoord & Anderson, 2010). In academic writing, wherein conducting research and using sources is often expected, writers are advised to extract and incorporate relevant, significant, authoritative, and up-to-date sources (Nosich, 2022; Petelin, 2022; Wette, 2021). Subject specialists, who are also skilled academic writers, may underscore the significance of this practice, considering it as a crucial aspect of assessing the content quality of essays.
Finally, the raters focused on whether the essays covered elements essential to academic essays, such as a thesis statement, an organizational statement, cited references in a reference list, a topic sentence, and concluding remarks. In particular, the lack of a thesis statement, or the claim the whole essay is centered on (Nosich, 2022), was evaluated negatively, as shown in Rater 5’s comment in his TAP: The first paragraph is pretty problematic. No thesis statement. There was an essay prompt, and the first paragraph does not venture an answer to the question in the essay prompts, so already there’s going to be a less than perfect score for this essay. Because it lacks what I consider a pretty clear thesis statement. (Rater 5, Essay 3, Segment 12)
Furthermore, the raters negatively evaluated essays that did not list in-text citations in the reference list or did not cite works included in the list, since this could concern academic honesty, as mentioned by Rater 4. During the interviews, five raters emphasized the importance of covering essential components of academic essays in their judgments of content quality. These writing elements are indispensable to various types of academic writing assignments, not solely in L2 pedagogical settings (e.g., Candlin et al., 2016), and their quality, such as the clarity and specificity of the thesis statement, is appraised in higher education (Bean & Melzer, 2021). The coverage of necessary elements for academic writing can be seen as a component of content quality, as the quality of arguments and credibility of sources, which were found to contribute to content quality, cannot be evaluated if the essay does not include them.
Conclusion
This study addressed the definition of content in writing assessment in language-driven CLIL contexts by exploring 11 subject specialists’ perspectives on the content quality of L2 learners’ argumentative essays on a specific subject. The analysis of the specialists’ TAPs resulted in the identification of 11 essay features that reportedly influenced their judgments of content quality, thereby expanding the construct definition of content within the context of L2 writing assessment. Notably, certain content-related features, such as elaboration, content accuracy, and source quality, were found to be relevant to the assessment of essays across various English-medium programs (e.g., O’Hagan, 2014). The current study suggests that these features are equally imperative constituents of content quality within language-driven CLIL programs.
The conceptualization of content based on this study could be particularly applicable to language-driven CLIL, in which topics related to language or communication are taught to students at B1 level according to the CEFR. More specifically, this study’s findings highlighted the essay features that have not been acknowledged as relevant components of content in language assessment but could be essential in some CLIL contexts. For example, subject specialists’ evaluative judgments are likely influenced by the accuracy of the content, presence and quality of research, and presence of elements required in academic essays. Their assessment can lead to accurate interpretations of competence covered in CLIL, including subject knowledge and cognitive skills (Mehisto & Ting, 2017), derived from research-based argumentative essays on a particular subject. The findings can also provide empirical support for CLIL scholars’ intuitively developed scoring rubrics, which assess content accuracy and the presence of researched information (e.g., Dale et al., 2011). Furthermore, the current study confirmed that the components of content often addressed in language assessment (e.g., elaboration, logicality, and task achievement) are relevant to writing assessment in CLIL. This study also provided detailed definitions of these constructs and presented a comprehensive list of content-related criteria components, as outlined in Supplemental Appendix 4.
The findings revealed the role of linguistic features in the raters’ judgments of content quality. In essence, lexicogrammatical errors did not appear to affect raters’ evaluations of essay content unless they hampered message conveyance. In this sense, at least at a conscious level, content quality and linguistic features may be regarded as independent and separable to some degree. Accordingly, whereas lexicogrammatical accuracy and diversity serve to express content (Bae et al., 2016), these features alone may not guarantee a positive evaluation from subject specialists.
The explored components of content serve as a list of content-related elements relevant to research-based argumentative essays in the language-driven CLIL context. CLIL practitioners are advised not to restrict their focus to conventional components of content concerning L2 writing proficiency (e.g., task achievement or idea development), but rather to assess other criteria relevant to subject knowledge and cognitive skills. They can select essay features depending on what they wish to emphasize through research-based argumentative essay writing assignments. If CLIL teachers are concerned with assessing students’ argumentation skills or the skills to construct convincing arguments to the intended audience (Hirvela, 2017), for example, then the rating scale should include and emphasize logicality and complexity (i.e., breadth of ideas). Furthermore, if CLIL practitioners intend to assess the ability to conduct research independently to extend subject knowledge (Coyle et al., 2010), they need to focus on citation and research as well as logicality, including the quality of research and supportiveness of researched evidence. In that case, it may also be worth examining the accuracy of researched information. However, other criteria (e.g., clarity, coverage, and discourse structure) should not be neglected, as they could be relevant to a range of writing assessment foci. This study’s findings suggest that language teachers who are not experts in the subject need to collaborate or consult with subject specialists to evaluate some aspects of content, such as accuracy and complexity (Schmidt-Unterberger, 2018). However, the outcome of this assessment practice represents the knowledge and skills learned in language-driven CLIL settings more accurately than the assessment of limited aspects of content related to L2 proficiency.
Although this study explored the features that reportedly influenced rater judgments of content quality, the TAPs were unable to explain unspecified general features, demonstrate the strength of each feature, or represent all the features affecting the ratings precisely. As 12.2% of the TAPs were unelaborated, detailed factors in many general comments were not explicated. This category could have been delved into through post-session interviews or stimulated recalls. Moreover, this study’s findings did not precisely indicate the strength of influence of each explored feature. Finally, the TAPs potentially failed to reflect all the features that influenced the raters’ judgments, as TAP data cannot generally demonstrate informants’ unconscious and automated thought processes and can only reveal what raters choose to share (Ericsson & Simon, 1993). The findings thus represent only those features consciously heeded by the raters while rating content quality, potentially excluding other influential essay features related to content. Additionally, four raters admitted that thinking aloud might have affected the way they rated the essays, including severity of judgment, rating process, and heeded features.
Further studies are necessary to investigate subject specialists’ perspectives on the content quality of other types of compositions, such as the summarization of the learned content, description of scientific phenomena, and research reports on a given academic theme (Sato et al., 2021). More importantly, perspectives of experts in other disciplines must be investigated, since subject specialists in less language-related fields (e.g., science and history) are likely to construe content quality differently from the raters in this study. The further conceptualization of content would lead to more precise assessments of essential elements fostered in CLIL.
Supplemental Material
Supplemental material, sj-docx-1-ltj-10.1177_02655322231190058 for Assessing the content quality of essays in content and language integrated learning: Exploring the construct from subject specialists’ perspectives by Takanori Sato in Language Testing
Footnotes
Acknowledgements
The author expresses sincere gratitude to the editor and the anonymous reviewers for their invaluable comments and feedback on the manuscript.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the EIKEN foundation of Japan.
Supplemental material
Supplemental material for this article is available online.
