Abstract
This study explores the use of artificial intelligence (AI) in language learning, focusing on its ability to enhance English as a Foreign Language (EFL) writing. Specifically, it examines the effect of integrating ChatGPT feedback with teacher feedback on the syntactic complexity of Saudi EFL learners’ writing. A quasi-experimental design was employed, involving two intact groups of undergraduate students (n = 35) enrolled in an academic writing course. The 9-week intervention provided the experimental group with integrated ChatGPT and teacher feedback, while the control group received only teacher feedback. Pre- and post-test essays were analyzed using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC), covering 177 indices across global, clausal, and phrasal levels. Results showed that the combined feedback condition did not produce a reliable advantage over teacher feedback alone at the global or clausal levels of syntactic complexity. At the phrasal level, a limited set of noun phrase–related indices revealed post-test differences between groups, suggesting localized, feature-specific development rather than broad syntactic restructuring within a short instructional period. However, these differences did not remain statistically significant after false discovery rate (FDR) correction and are therefore interpreted as exploratory rather than confirmatory. The findings are discussed in relation to previous research on AI-mediated feedback and FL writing development, and pedagogical as well as research implications for the integration of AI tools in EFL writing instruction are outlined.
Keywords
Introduction
Technology-enhanced language learning has exerted a significant influence on the development of FL learners’ proficiency and performance (Fredrick & Craven, 2025). Within this paradigm, the integration of artificial intelligence (AI) tools has markedly reshaped FL writing pedagogy and assessment practices. Among these tools, ChatGPT has attracted growing interest due to its ability to provide immediate, personalized feedback, thereby improving grammar, vocabulary, sentence clarity, and overall writing effectiveness (Deng & Lin, 2022; Han & Li, 2024; Kim & Chon, 2025; Oh & Hsieh, 2025; Shen et al., 2023; Song & Song, 2023; Zhai, 2022). However, concerns have been raised regarding its inconsistency, potential to mislead students, and risks of overreliance (Lingard, 2023). Consequently, educators increasingly advocate for the use of ChatGPT as a supplementary resource rather than a replacement in FL writing pedagogy (Escalante et al., 2023; Kim & Chon, 2025; Wang et al., 2024).
The development of FL writing has been assessed using several measures, such as lexical richness, cohesion, and sentence variety, among which syntactic complexity (SC) stands out as a strong indicator of writing ability. SC is defined as the degree of diversity, elaboration, and sophistication in the grammatical structures employed in language production (Ortega, 2015). Although there is a growing need to examine SC as a multidimensional construct, encompassing both fine-grained and large-grained indices, research in this area remains limited. Generally, most FL studies analyze only one or two complexity indices, often relying on a narrow selection of commonly used measures, such as average unit length and subordination ratios (Bulté & Housen, 2012; Norris & Ortega, 2009). This reductionist approach is also apparent in research focusing on specific SC measures, which often tend to emphasize one or two indices of complexity at the clause-linking or sentence level while overlooking complexity at other syntactic levels, such as phrasal or clausal levels (Bulté & Housen, 2012).
To address this gap, this study investigates the impact of integrating ChatGPT feedback with instructor feedback on the SC of EFL learners’ writing at three levels: global, clausal, and phrasal, using both large- and fine-grained indices. The study aims to clarify the role of combined feedback in syntactic development in FL writing by employing a multidimensional approach and comprehensive computational tools (Kyle, 2016; Lu, 2010). In doing so, it seeks to inform the development of more effective pedagogical approaches for integrating ChatGPT feedback into EFL writing classes. Accordingly, the study addressed the following research questions:
Literature Review
Syntactic Complexity and Writing Quality
SC constitutes a fundamental aspect of language production, reflecting the diversity and sophistication of grammatical structures employed to express meaning and achieve communicative objectives (Ortega, 2015; Zheng & Barrot, 2024). Situated within the broader domain of linguistic complexity, SC has been extensively acknowledged as a significant predictor of both writing development and quality (e.g., Hao et al., 2024; Lu, 2010; Ortega, 2003; Zhang & Lu, 2022), as well as language proficiency (e.g., Y. Li et al., 2022; Lu & Ai, 2015). A variety of indices have been used to quantify SC across various levels (Biber et al., 2011; Kyle & Crossley, 2018; Lu, 2010; Wolfe-Quintero et al., 1998; Zhang & Lu, 2022).
Previous research predominantly employed the mean length indices of clauses (MLC), sentences, and T-units (MLTU) to evaluate SC (Ortega, 2003). Expanding on this foundation, Lu (2010) advanced the field by integrating 11 additional large-grained indices of SC derived from Wolfe-Quintero et al.’s (1998) and Ortega’s (2003) comprehensive synthesis of FL writing research. These 14 measures were subsequently categorized into five dimensions based on the specific syntactic characteristics they represent: (a) length of production indices (e.g., MLTU), (b) subordination (e.g., clauses per T-unit), (c) coordination (e.g., coordinate phrases per clause), (d) sentence complexity (clauses per sentence), and (e) phrasal elaboration (e.g., verb phrases per T-unit). These indices can be systematically analyzed through computational tools designed to automatically annotate learners’ texts for syntactic features, with Lu’s (2010) L2 Syntactic Complexity Analyzer (L2SCA) being among the most widely employed programs in this domain.
Numerous studies have investigated the extent to which large-grained SC indices function as indicators of FL proficiency and writing quality. For example, utilizing a corpus of 1,198 argumentative essays, H.-J. Yoon (2017) analyzed seven large-grained indices through the L2SCA and identified significant proficiency-related differences in MLT, MLC, MLS, noun-phrase complexity (CN/C), and phrasal coordination (CP/C). Similarly, W. Yang et al. (2015) reported that length-based measures, particularly MLS and MLT, exhibited significant correlations with holistic writing scores. Among the various large-grained SC measures, MLTU has consistently been the most frequently applied metric and is recognized as one of the strongest predictors of writing quality. For instance, Ortega (2003) identified MLTU as the sole index common to all six longitudinal FL writing studies reviewed, and Johnson’s (2017) meta-analysis revealed MLTU as one of the two most frequently reported metrics in task-based FL writing research.
Despite the demonstrated significant correlations between large-grained indices and FL writing quality, their validity as comprehensive representations of syntactic constructs has been increasingly questioned. Scholars contend that measures such as MLTU, while informative regarding unit length, do not specify the structural elements (e.g., clauses, phrases, or modifiers) that contribute to this length. Consequently, these indices offer limited insights into the developmental trajectories of learner syntax, for example, whether writers are transitioning from reliance on clausal subordination to greater use of phrasal elaboration (Biber et al., 2011; Kyle & Crossley, 2018). In response, researchers have advocated the adoption of fine-grained indices that capture discrete grammatical configurations, particularly those reflecting clausal subordination (e.g., adverbial, complement, and relative clauses) and nominal modification (e.g., possessive constructions, compound nouns, and adjectival modifiers). These indices can be systematically measured using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC; Kyle, 2016).
Empirical research indicates that fine-grained indices are stronger predictors of FL writing quality than large-grained measures. For example, Zhang and Lu (2022) investigated the comparative predictive efficacy of these indices concerning writing quality. Their findings demonstrated that the implemented fine-grained measures surpassed large-grained indices across both genres. Specifically, fine-grained indices accounted for 31.9% of the variance in quality ratings for application letters, whereas global measures explained only 20.2%. Similarly, for argumentative essays, fine-grained indices explained 30.6%, compared with 15.7% for the large-grained indices. Among the most salient predictors (r > .200) were indices associated with prepositional complexity (e.g., prepositions per clause and prepositions per object of the preposition), noun phrase elaboration (e.g., dependents per nominal and dependents per nominal subject), and modifications through adverbials and adjectives (e.g., adverbial modifiers per clause and adjectival modifiers per nominal). Previous studies have also highlighted the predictive significance of fine-grained phrasal indices. For example, Qian (2022) analyzed a corpus comprising 120 essays authored by Chinese college FL learners and found that fine-grained phrasal complexity indices, rather than clausal or large-grained measures, were the most reliable predictors of overall writing performance.
Recently, SC has been widely conceptualized as a multidimensional construct (Norris & Ortega, 2009), and methodological approaches to its assessment in FL writing research have progressively evolved to incorporate multiple analytical levels, including global, clausal, and phrasal dimensions (Jiang et al., 2019). Large- and fine-grained indices fulfil distinct, yet complementary roles. Large-grained measures are valued for their practicality and their capacity to abstract from learner- and context-specific variability, thereby offering a degree of generalizability that facilitates their potential for broad application across diverse writing contexts (Zhang & Lu, 2022). Conversely, fine-grained indices enable a more nuanced examination of the syntactic structures that manifest at various developmental stages. By pinpointing specific structural patterns, these measures enhance analytical transparency and provide a more comprehensive understanding of how syntactic variation contributes to differences in FL writing quality. Therefore, this study combined large- and fine-grained measures to leverage the advantages of each approach, thereby providing a holistic and detailed analysis of SC in FL writing.
Syntactic Complexity and Feedback
Within the corrective feedback (CF) domain, only a limited number of studies have investigated whether, and in what ways, feedback influences the development of SC in learners’ written production. However, these studies’ findings have been inconclusive. Some studies have demonstrated that students who receive CF exhibit greater SC development. For example, Van Beuningen et al. (2012) reported overall development in SC, while Fazilatfar et al. (2014) revealed that learners exhibited significantly higher MLS scores and an increased dependent clause ratio (DC/C). Similarly, W. Li et al. (2020) observed positive effects on measures of subordination and coordination. Conversely, other studies have documented the adverse effects of CF on SC. For instance, Hartshorn and Evans (2015) conducted a longitudinal 30-week investigation and found that although accuracy improved, SC declined. Likewise, Eckstein and Bell (2021) reported a significant reduction in SC among students receiving CF compared to their peers in the control group. These findings suggest that CF may redirect students’ attention toward accuracy, potentially discouraging the use of more complex syntactic structures. A further line of research has indicated that CF may exert minimal or no impact on SC. For example, Thi et al. (2023) concluded that mere exposure to CF is insufficient to enhance students’ writing complexity.
Recent empirical investigations into the impact of automated and AI-generated feedback on SC in FL writing have yielded a mixed picture. For example, Thi et al. (2023) compared teacher feedback, Grammarly, and a combination of both over the course of a semester, finding no significant development in SC; moreover, there was some evidence that learners simplified their writing while prioritizing accuracy. In contrast, Hou (2024) reported that automated essay scoring (iWrite) facilitated development in global measures such as MLTU, although certain finer-grained indices remained unaffected, and verb phrase use even declined. Similarly, Fan (2023) found no significant benefits of adding automated feedback from Grammarly to teacher feedback for SC, whereas Bagheri Nevisi and Arab (2023) noted that learners receiving computer-generated feedback through Ginger outperformed their peers on SC measures, suggesting that specific automated tools may promote greater variation in sentence structure. Notably, Deygers et al. (2025) indicated that, although the use of ChatGPT had a significant positive effect on two SC indices, particularly MLTU and MLC, these effects were limited in scope and unsustained, as development diminished once students ceased using ChatGPT.
The existing literature indicates that CF, whether provided by teachers or AI-based tools, has yielded mixed and sometimes inconclusive findings regarding the development of SC in FL writing. These inconsistencies suggest that feedback effectiveness may depend less on the mere presence of feedback and more on how learners’ attention is directed to linguistic form during meaning-focused writing and revision (Thi et al., 2023). To account for this variation, the present study adopts a focus-on-form framework (Long, 1991), which posits that language development is facilitated when learners’ attention is selectively and temporarily drawn to linguistic features as they emerge in communicative activities. Within this perspective, feedback serves as a pedagogical mechanism that promotes noticing of form–meaning mismatches and supports restructuring during revision.
Building on the noticing hypothesis, this study argues that different sources of feedback may guide learners’ attention to different, yet complementary, aspects of syntactic form. Teacher feedback tends to be selective and pedagogically focused, often targeting higher-level or discourse-relevant structures and providing explicit explanations aligned with instructional goals (Han & Li, 2024). Such feedback is particularly suited to directing learners’ attention to global and clausal-level SC, including sentence structure, subordination, and cohesion. In contrast, AI-generated feedback is characterized by its immediacy, consistency, and high degree of personalization. AI feedback can repeatedly and systematically highlight localized grammatical and structural issues, allowing learners to notice patterns of form–function mappings across their texts (Guo & Wang, 2024). This type of feedback is especially effective in directing attention to phrasal-level features that may otherwise remain unattended during meaning-oriented writing.
Despite these theoretically complementary affordances, prior research has largely examined AI feedback as a standalone intervention, often comparing it with traditional teacher feedback rather than investigating their combined effects (e.g., Deygers et al., 2025; Hou, 2024). Moreover, previous studies have operationalized SC using diverse and sometimes limited indices, with some focusing exclusively on large-grained measures (e.g., Thi et al., 2023) and others selectively examining specific indices (e.g., Bagheri Nevisi & Arab, 2023). Consequently, it remains unclear how AI feedback, when integrated with teacher feedback, influences SC across multiple levels of linguistic analysis.
To address this gap, the present study investigates the impact of ChatGPT as a complementary feedback tool on SC across global, clausal, and phrasal levels, employing both large- and fine-grained indices. By grounding the integration of teacher and AI feedback in a focus-on-form framework, this study provides a theoretically principled and empirically comprehensive account of how combined feedback may support multidimensional syntactic development in FL writing, thereby contributing to the growing literature on AI-assisted language learning.
Methodology
Research Design
This study utilized a quasi-experimental design with two groups: an intervention group (experimental) and a comparison group (control). This design was chosen to examine the impact of the combined feedback on the SC of EFL learners’ writing within a real-world classroom environment where random assignment was impractical. The study was conducted in three phases: (a) a pre-test administered to both groups to identify participants’ writing proficiency, (b) a 9-week intervention period, and (c) a post-test to evaluate potential learning development.
Context and Participants
The study involved 35 female Saudi EFL students, who were assigned to two intact classes designated as the experimental group (n = 19) and control group (n = 16). A sensitivity analysis for an independent-samples t-test (two-tailed, α = .05) indicated that the present sample size provided approximately 80% power to detect large between-group effects (d ≈ 0.98), whereas statistical power was substantially lower for small-to-moderate effects. Accordingly, non-significant findings should be interpreted with caution, as they may reflect limited power to detect modest instructional effects.
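The sensitivity analysis can be approximated with a few lines of standard-library Python. The sketch below is illustrative only: it assumes a normal approximation to the t distribution, and the function name `min_detectable_d` is hypothetical rather than taken from the study. Because the normal approximation ignores the small-sample correction, it yields a value slightly below the reported d ≈ 0.98, which would follow from an exact noncentral-t calculation.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n1: int, n2: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate the smallest Cohen's d detectable by a two-tailed
    independent-samples test, using a normal approximation (assumption)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile corresponding to the target power
    return (z_alpha + z_power) * sqrt(1 / n1 + 1 / n2)


# Group sizes reported in the study: experimental n = 19, control n = 16.
d_min = min_detectable_d(19, 16)
print(f"Approximate minimum detectable d: {d_min:.2f}")
```

With groups of this size, only large effects clear the threshold, which is why the non-significant results call for cautious interpretation.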
Participants were first-year college students who had completed two semesters, including language skills courses and Writing I. Prior to university enrollment, they had received at least 8 years of English instruction from public schools, spanning elementary, intermediate, and secondary levels. According to departmental placement assessments, their English proficiency ranged from elementary to low-intermediate levels, corresponding to A2-B1 on the CEFR framework. Participants’ ages ranged from 18 to 21 years. Informed consent was obtained from all participants for the use of their performance and classwork data for research purposes. To comply with the ethical standards concerning anonymity and confidentiality, the names of the university and participants have been omitted in this study.
The study was conducted within the English Writing II course, a compulsory module for third-semester English majors at a university in Saudi Arabia. The course’s primary aim is to develop students’ academic writing skills across various genres, emphasizing rhetorical, lexical, and grammatical components. The course spans nine weeks, comprising three hours of teaching per week, divided into a 2-hr session and a 1-hr session. Throughout the course, students completed three major essay assignments, with multiple drafts.
The course instructor also served as the first author and provided teacher feedback to both groups. While this dual role is not uncommon in classroom-based research, it may introduce potential researcher bias or expectancy effects. To mitigate these concerns, several safeguards were implemented. First, outcome measurement relied on automated, tool-based indices (TAASSC) and standardized pre–post writing tasks administered under identical conditions for both groups, reducing the influence of subjective judgment. Second, both groups followed the same curriculum, materials, and instructional schedule, with feedback procedures guided by consistent commenting priorities aligned with course outcomes across sections. The only systematic difference between the groups was the inclusion of ChatGPT feedback in the experimental condition, while both groups received teacher feedback to ensure instructional equity and consistency.
Data Collection
Data were collected in three main stages: a pre-test, a 9-week intervention, and a post-test. At the outset of the course, prior to the intervention, participants from both the experimental and control groups completed a pre-test writing task. This task required composing a 300-word narrative essay within a controlled classroom setting, with a time allocation of 60 min, deemed sufficient for the task. The objective was to evaluate the participants’ baseline writing skills and to establish an initial point of comparison between the two groups. Following the pre-test, the first researcher and instructor facilitated a two-hour orientation workshop aimed at training students in the experimental group on the use of ChatGPT for feedback purposes and exploring its potential application. While some students had prior experience utilizing ChatGPT for feedback, others were introduced to it for the first time. This orientation ensured that all participants in the experimental group were adequately prepared to employ ChatGPT during the intervention stage.
The intervention period spanned nine weeks, during which both groups completed weekly essay writing assignments integrated into their coursework. The instructor implemented a process writing methodology to teach the course. Each essay genre was taught over a 3-week period, resulting in three instructional cycles of nine hours of instruction each. Each cycle concentrated on a specific essay genre and adhered to a consistent instructional process. Microsoft Teams served as the platform for assignment submission, feedback delivery, and lesson dissemination.
Each cycle commenced with prewriting activities in the first week, involving brainstorming and outlining thoughts. This was followed by a 50-min in-class writing session during which students created their first drafts. In the second week, students in the experimental group received a carefully designed prompt to solicit feedback from ChatGPT, ensuring that the feedback was constructive without rewriting the text. In addition to ChatGPT feedback, the instructor also provided feedback to the experimental group. Conversely, the control group received feedback exclusively from the instructor. During the third week, students revised their drafts based on the feedback and subsequently submitted their final versions.
Building on prior research on AI-mediated feedback in FL writing (e.g., Koltovskaia et al., 2024; Yeung, 2025), the present study conceptualized ChatGPT and teacher feedback as serving complementary but differentiated functions within the writing process. Previous research suggests that AI-based feedback is particularly effective when used to provide systematic, immediate feedback on linguistic form, whereas teacher feedback typically focuses on higher-level, pedagogically salient concerns such as sentence structure, coherence, and discourse organization. In line with this distinction, ChatGPT feedback in the present study was deliberately constrained through standardized prompting to address localized grammatical and structural features, while teacher feedback targeted global and clausal-level aspects aligned with instructional goals. Feedback was sequenced such that ChatGPT feedback was received prior to teacher feedback, and students were explicitly instructed to prioritize teacher feedback in cases of discrepancy. This design ensured a clear division of labor between the two feedback sources and minimized potential conflict in students’ revision decisions.
Following the intervention, all participants undertook a post-test, which involved writing a 300-word narrative essay under conditions identical to those of the pre-test (60 min, controlled environment). The purpose was to assess development in students’ writing skills upon completion of the intervention and to facilitate a comparative analysis of the experimental and control groups’ performance. To ensure the validity of development measures, the post-test employed a different essay topic from the pre-test; both topics were carefully constructed to be of equivalent difficulty, thereby ensuring that performance changes reflected genuine development rather than task familiarity. The pre- and post-tests were face-validated by three subject matter experts, whose feedback was incorporated prior to the test administration.
Data Analysis
Statistical Analysis
In this study, pre- and post-test written tasks were analyzed using the TAASSC (Kyle, 2016). To ensure the reliability of the analysis, all spelling errors within the text were corrected prior to processing. The TAASSC provides a comprehensive set of indices comprising 31 fine-grained clausal complexity measures, 132 fine-grained phrasal complexity measures, and a re-implementation of the 14 large-grained SC indices originally developed for the L2SCA (Lu, 2010). Consequently, a total of 177 distinct SC values were generated for each writing sample.
Because TAASSC yields a large number of SC indices, the analyses involved multiple statistical tests. To reduce inflated Type I error, p-values were adjusted using the Benjamini–Hochberg false discovery rate (FDR) procedure within each family (global, clausal, and phrasal). The tables report unadjusted p-values; FDR-adjusted q-values are provided in Supplementary Tables S1–S6, and findings are interpreted as robust only when they remain significant after correction.
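The within-family correction can be sketched as follows. This is a minimal standard-library illustration of the Benjamini–Hochberg step-up adjustment, not the code used in the study; the function name and the p-values are hypothetical and serve only to show the mechanics.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value
    # never exceeds the q-value of any larger p-value (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical unadjusted p-values, corrected separately within each family.
families = {
    "global": [0.21, 0.64, 0.08],
    "clausal": [0.47, 0.12, 0.90, 0.33],
    "phrasal": [0.004, 0.03, 0.2, 0.5, 0.8],
}
adjusted = {name: benjamini_hochberg(ps) for name, ps in families.items()}
for name, qs in adjusted.items():
    print(name, [round(q, 3) for q in qs])
```

Because each p-value is scaled by the number of tests in its family divided by its rank, a nominally significant result in a large family, such as the phrasal one, can readily lose significance after adjustment.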
In addition to statistical significance, effect sizes (Cohen’s d) were computed for between-group comparisons to quantify the magnitude of differences. Where appropriate, 95% confidence intervals were reported to convey estimation uncertainty, allowing interpretation beyond p-values alone.
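As an illustration of the effect-size computation, the sketch below calculates Cohen's d from a pooled standard deviation, together with an approximate 95% confidence interval. It makes two assumptions that the text does not specify: that the interval is placed around d itself, and that a large-sample standard error formula is adequate. The function names and scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def d_confidence_interval(d: float, n_a: int, n_b: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for d (large-sample standard error; assumption)."""
    se = sqrt((n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se


# Hypothetical post-test scores on one index, for illustration only.
experimental = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
d = cohens_d(experimental, control)
low, high = d_confidence_interval(d, len(experimental), len(control))
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With small groups the interval is wide, so a moderate point estimate is compatible with both negligible and large true effects.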
The computational methodology underlying TAASSC is elaborated in detail by Kyle (2016). In line with Lu (2010), the calculation of the 14 large-grained indices began with the generation of a constituency representation for each sentence using the Stanford Parser (Zhang & Lu, 2022). Structural units, including T-units, dependent clauses, and complex nominals, were identified and quantified through Tregex queries, with these counts serving as the basis of the indices.
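To make the final step concrete, the sketch below shows how unit counts translate into ratio-based large-grained indices. It does not reproduce the parser or the Tregex queries; the counts are hypothetical stand-ins for what those queries would return for one essay, and only six of the 14 indices are shown. The index labels follow the L2SCA abbreviations, where MLT corresponds to the measure this article calls MLTU.

```python
from dataclasses import dataclass


@dataclass
class UnitCounts:
    """Counts of structural units for one essay (hypothetical values below)."""
    words: int
    sentences: int
    clauses: int
    t_units: int
    dependent_clauses: int
    complex_nominals: int


def large_grained_ratios(c: UnitCounts) -> dict[str, float]:
    """Derive a subset of the large-grained indices from unit counts."""
    return {
        "MLS": c.words / c.sentences,           # mean length of sentence
        "MLT": c.words / c.t_units,             # mean length of T-unit
        "MLC": c.words / c.clauses,             # mean length of clause
        "C/T": c.clauses / c.t_units,           # clauses per T-unit
        "DC/C": c.dependent_clauses / c.clauses,  # dependent clauses per clause
        "CN/C": c.complex_nominals / c.clauses,   # complex nominals per clause
    }


# Illustrative counts for a single 300-word essay (not study data).
essay = UnitCounts(words=300, sentences=20, clauses=40, t_units=25,
                   dependent_clauses=15, complex_nominals=30)
indices = large_grained_ratios(essay)
for label, value in indices.items():
    print(f"{label}: {value:.3f}")
```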
In contrast, the fine-grained clausal and phrasal measures were derived from dependency parsing: each sentence was processed using the Stanford Neural Network Dependency Parser, after which pertinent linguistic units and dependency relations were extracted using a Python XML parser.
Index Selection
All 177 indices generated by TAASSC were retained at the computation stage. For inferential testing, indices were screened for distributional plausibility (|skewness| and |kurtosis| ≤ 2) to support parametric comparisons. Between-group differences were then tested within each analytic family aligned with the research questions (global, clausal, and phrasal). To address multiplicity, p-values were corrected using the Benjamini–Hochberg false discovery rate procedure within each family, and results were interpreted as robust only when they remained significant after correction.
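The distributional screen can be sketched as below. The text does not state which estimators or software were used, so this example assumes moment-based (population) skewness and excess kurtosis; the function names and sample values are hypothetical.

```python
from statistics import mean


def central_moment(values: list[float], order: int) -> float:
    """Population central moment of the given order."""
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values: list[float]) -> float:
    """Moment-based skewness (assumed estimator)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values: list[float]) -> float:
    """Moment-based kurtosis minus 3, so a normal distribution scores 0 (assumption)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def passes_screen(values: list[float], limit: float = 2.0) -> bool:
    """Keep an index only if |skewness| and |kurtosis| are both within the limit."""
    if central_moment(values, 2) == 0:
        return False  # a constant index has no usable distribution
    return abs(skewness(values)) <= limit and abs(excess_kurtosis(values)) <= limit


# Hypothetical index values across learners, for illustration only.
roughly_symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
single_outlier = [0.0] * 9 + [10.0]
print(passes_screen(roughly_symmetric))
print(passes_screen(single_outlier))
```

Indices that are rarely attested in short learner essays tend to resemble the second case, with most learners scoring zero and a few producing the structure, which is the kind of distribution this screen excludes from parametric testing.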
Results
Across the three analytic levels, global, clausal, and phrasal, the results showed a largely consistent pattern: the experimental group, who received combined feedback, did not demonstrate reliable advantages over the control group, who received teacher feedback only, on global or clausal indices. At the phrasal level, only a limited subset of noun-phrase–related indices showed post-test differences. Full index-level outputs are provided in Supplementary Tables S1–S6.
At the Global Level
At the global level, the experimental and control groups were comparable at pre-test across the large-grained indices. Post-test comparisons similarly indicated no consistent between-group differences on global measures, suggesting that combined feedback did not produce measurable changes in overall global SC indicators over the intervention period.
Table 1 summarizes the post-test between-group results at the global and clausal levels; full index-level outputs are provided in Supplementary Tables S1–S6.
Post-test Between-group Comparisons on Global (Large-Grained) and Selected Clausal (Fine-Grained) SC Indices.
At the Clausal Level
At the clausal level, pre-test results indicated comparability between groups across the analyzed indices. Post-test comparisons did not show a stable between-group advantage attributable to combined feedback, indicating that clause-level restructuring (e.g., subordination- and clause-function–related complexity) was not reliably affected within the study timeframe. Selected clausal indicators are summarized in Table 1, and the complete clausal outputs are reported in Supplementary Table S4.
At the Phrasal Level
At the phrasal level, the two groups again started from comparable baselines at pre-test. In the post-test, most indices did not differ between groups; however, a limited number of noun-phrase–focused measures showed differences, suggesting localized changes in nominal elaboration rather than broad phrasal restructuring. Key phrasal indices and noun-phrase structural measures are reported in Table 2; complete phrasal outputs are provided in Supplementary Table S6.
Post-test Between-Group Comparisons on Key Phrasal Indices, Including Noun-Phrase Structural Measures With Nominal (Uncorrected) Between-Group Differences.
Note. **: Result is significant at the 0.01 level.
Overall, the post-test results demonstrated no statistically significant differences between the two groups after controlling for multiple comparisons within the phrasal family. A small set of noun phrase–focused indices reached nominal significance at the unadjusted level (p < .05); however, these effects did not remain significant after FDR correction. Accordingly, the phrasal findings are interpreted as exploratory and suggestive rather than confirmatory.
Discussion
This study’s primary objective was to investigate whether integrating ChatGPT feedback with teacher feedback would result in enhanced SC among EFL students, compared to teacher feedback alone. The findings indicated that students receiving combined feedback did not significantly outperform those receiving solely teacher feedback at the global, clausal, or phrasal levels of SC. Our findings on the ineffectiveness of feedback on SC are consistent with those reported by Thi et al. (2023), who demonstrated that writing complexity remains unaffected by feedback, irrespective of whether it is delivered by teachers, automated systems, or a combination thereof. This outcome also supports the conclusions of Fan (2023), who found that integration of automated written feedback with teacher feedback did not enhance students’ SC. Similarly, Xu and Zhang (2021) reported that, unlike accuracy and fluency, learners’ SC exhibited no significant development following automated CF. However, the results of this study diverge from those of Deygers et al. (2025) and Bagheri Nevisi and Arab (2023), who observed that learners receiving automated CF attained higher levels of SC compared to their peers. Additionally, the findings are inconsistent with those of Hou (2024), who documented substantial development in global measures of SC as a consequence of automated feedback. These discrepancies may be attributed to differences in the role and affordances of the AI tools employed and their integration within instructional design. In studies such as Deygers et al. (2025) and Bagheri Nevisi and Arab (2023), automated feedback functioned as a primary or relatively unconstrained source of revision support, allowing extensive reformulation and syntactic expansion. Similarly, Hou (2024) employed an automated essay scoring system that provided holistic feedback, potentially encouraging global increases in SC. 
In contrast, the present study positioned ChatGPT as a complementary and constrained feedback tool, with teacher feedback explicitly prioritized. Moreover, the standardized prompting limited ChatGPT’s rewriting and emphasized localized, form-focused revision, which may help explain the absence of significant SC development.
Previous research has shown that CF may redirect learners’ attention toward accuracy, potentially discouraging the use of more complex syntactic structures, as students adopt simpler constructions to minimize the risk of error (Eckstein & Bell, 2021; Hartshorn & Evans, 2015; Truscott, 2007). Eckstein and Bell (2021), for example, argued that FL writers may deliberately employ linguistically simplified structures when accuracy is emphasized, while Hartshorn and Evans (2015) similarly noted that careful monitoring for errors can inhibit SC as learners favor safer, more controlled forms. This tendency may be particularly pronounced in high-stakes instructional contexts such as the Saudi EFL setting examined in the present study (Al-Seghayer, 2022). In this context, students are evaluated on each written assignment, and performance directly contributes to their final course grades. Such assessment practices, combined with a low tolerance for grammatical errors, may encourage learners to prioritize accuracy and error avoidance over syntactic elaboration. As a result, students may prefer to produce simpler but more accurate sentences rather than attempt more complex structures that could jeopardize their scores.
Most importantly, the findings of the present study align with a growing body of research indicating that, although CF does not necessarily promote the development of SC, it also does not lead to structural simplification in learners’ writing (Thi et al., 2023; Xu & Zhang, 2021). In the present study, no significant development in SC was observed; however, there was also no evidence that feedback resulted in less complex syntactic production. This distinction is particularly important in light of earlier studies reporting declines in SC alongside gains in accuracy following CF (e.g., Eckstein & Bell, 2021; Hartshorn & Evans, 2015). Instead, the findings are more consistent with research suggesting that written CF may exert a largely neutral effect on SC, neither enhancing nor suppressing it (Thi & Nikolov, 2023).
The absence of significant effects on SC in the present study may be attributed to a combination of intervention-related, learner-related, and task-related factors. First, the 9-week intervention may have been insufficient to elicit measurable development in syntactic structures, particularly at the clausal and global levels. Previous studies reporting significant development in SC have typically involved longer instructional periods or more intensive exposure to automated feedback (e.g., Bagheri Nevisi & Arab, 2023). Second, learner proficiency likely played an important role. Although participants were English majors, most fell within the A2–B1 proficiency range, which may have constrained their ability to reliably produce and control syntactically complex sentences under timed writing conditions. Research on SC development suggests that more advanced learners are better positioned to deploy a wider range of syntactic resources, whereas lower-proficiency learners tend to rely on simpler structures that place fewer demands on linguistic control. For example, Kyle and Crossley (2018) showed that higher-proficiency FL writers produced more syntactically elaborated language at both large- and fine-grained levels. In contrast, lower-proficiency learners may lack the resources to manage such elaboration under time pressure and therefore favor simpler sentence constructions, which may help explain the absence of significant SC development observed in the present study. An additional explanation relates to task genre. Previous research (e.g., H. J. Yoon & Polio, 2017) indicates that narrative writing typically relies on chronological sequencing and event-based progression, which may limit opportunities for syntactic elaboration. Accordingly, the use of narrative tasks in the present study may have constrained learners’ production of syntactically complex structures, regardless of feedback type.
Despite these results, the study contributed important findings to the practice of FL writing pedagogy and research. The findings suggest that integrating ChatGPT with teacher feedback is instructionally safe with respect to SC, as it neither enhanced nor diminished learners’ syntactic sophistication. Although the combined feedback did not lead to measurable development in SC, it also did not discourage the use of complex structures, indicating that ChatGPT may be employed as a teacher-in-the-loop support tool for localized revision and accuracy-focused feedback without negative structural consequences. However, the absence of complexity development also suggests that feedback alone may be insufficient to promote syntactic development. To foster such development, teachers may need to pair AI-assisted feedback with explicit metacognitive explanation, particularly for lower-proficiency learners (Almutlaq & Alsaleh, 2025). Metacognitive explanation enhances learners’ understanding and uptake of both teacher and AI feedback, potentially improving overall writing quality. For FL writing researchers, the findings indicate that the effectiveness of feedback may be shaped by a range of interacting factors related to both learners and tasks. Variables such as learner proficiency, task genre, and instructional duration appear to mediate how feedback is processed and applied, suggesting that feedback effects cannot be fully understood in isolation. Consequently, future research should adopt more context-sensitive and design-aware approaches, systematically examining how learner characteristics and task features condition the impact of feedback. Such work may help clarify the conditions under which feedback, whether human, AI-based, or combined, supports different dimensions of writing development.
Conclusion
This study examined whether integrating ChatGPT feedback with teacher feedback influences EFL learners’ SC at global, clausal, and phrasal levels. The results did not indicate a reliable advantage of combined feedback over teacher feedback alone within a 9-week instructional window, suggesting that such integration may not be sufficient to promote short-term changes in SC.
Several limitations of the present study arise from methodological considerations. First, the classroom-based sample size was relatively small, which limited statistical power to detect modest effects and increased the risk of both Type I and Type II errors, particularly given the large number of indices examined. Second, the sample consisted exclusively of female students drawn from a single higher education institution in Saudi Arabia, thereby constraining the generalizability of the findings across genders, institutional contexts, and cultural settings. Third, the relatively short duration of the instructional intervention restricts conclusions about longer-term development in SC, including whether effects might emerge or persist beyond the 9-week window.
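The multiple-comparison concern noted in the first limitation can be made concrete with a rough worked calculation. It assumes, purely for illustration, that all 177 TAASSC indices were each tested at α = .05 and that the tests were independent. Neither assumption holds exactly, since the indices are intercorrelated and correction was applied within families.

```latex
\[
\mathrm{FWER} \;=\; 1 - (1 - \alpha)^{m} \;=\; 1 - 0.95^{177} \;\approx\; 0.9999,
\qquad
\mathbb{E}[\text{false positives}] \;=\; m\,\alpha \;=\; 177 \times 0.05 \;=\; 8.85 .
\]
```

Under these simplifying assumptions, roughly nine nominally significant indices would be expected by chance alone even if no true group differences existed, which is why uncorrected effects warrant cautious interpretation.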
Further research is recommended to validate and extend the present findings. Longitudinal studies with larger and more diverse samples across multiple institutions and demographic groups are needed to enhance generalizability and to examine the long-term effects of AI-assisted feedback on SC. Future research should also employ extended instructional periods, incorporate appropriate control or comparison groups, and apply statistical corrections for multiple comparisons to strengthen the robustness of findings.
Supplemental Material
Supplemental material, sj-docx-1-sgo-10.1177_21582440261453413 for The Combined Impact of ChatGPT and Teacher Feedback on the Syntactic Complexity of EFL Learners’ Writing by Eman Alkhalifah and Sana Almutlaq in SAGE Open
Footnotes
Acknowledgements
The authors would like to thank Imam Mohammad Ibn Saud Islamic University (IMSIU) for supporting and funding this project.
Ethical Considerations
This study received ethical approval from the College of Languages and Translation at Imam Mohammad Ibn Saud Islamic University (IMSIU). Informed consent was obtained from all participants through signed consent forms after they were provided with a clear and comprehensive explanation of the study’s objectives, procedures, and content. The research design was carefully developed to minimize any potential risk or harm to participants by ensuring anonymity, voluntary participation, and the unequivocal right to withdraw from the study at any time without penalty or adverse consequences. The researchers maintained a strict commitment to the confidentiality and privacy of all participants’ data and personal information, restricting its use exclusively to scientific research purposes, preventing disclosure to any parties beyond the scope of the study, and ensuring secure storage in accordance with ethical and data-protection standards. Furthermore, the anticipated benefits of the study to the academic community and to society at large were carefully evaluated and determined to outweigh any minimal potential risks, as the findings are expected to contribute to knowledge advancement, inform policy development, and support the improvement of practices related to the focus of the study.
Consent to Participate
All participants provided written informed consent before participation.
Author Contributions
Eman Alkhalifah: Conceptualization, Resources, Methodology, Writing, Reviewing, and Editing.
Sana Almutlaq: Conceptualization, Resources, Methodology, Data Analysis, Writing, Reviewing, and Editing.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU; grant number IMSIU-DDRSP2602).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data set of this study shall be available upon request.
Supplemental Material
Supplemental material for this article is available online.
References
