From 2024 to 2026: The Evolving Advantages and Limitations of Large Language Models in Mathematics Test Preparation

Abstract

This study presents a longitudinal and comparative analysis of large language models (LLMs) in the creation of mathematics exams based on the Revised Bloom’s Taxonomy (RBT). It focuses on three widely used free applications: ChatGPT, Gemini, and Claude. In this study, conducted at the secondary school level, the features expected of written mathematics exam questions (scientific relevance, target relevance, and understandability) were examined in terms of their conformity to the RBT levels. The data were obtained from the LLMs in 2024 and 2026. The results revealed how the LLMs changed over the 2-year period and how they differed in taxonomy levels and question-writing skills. Within the scope of the questions analysed, the advantages and limitations of the LLMs changed over time, and questions targeting the upper levels of the taxonomy in particular improved after the second prompt in both years. The results highlight the potential of LLMs as instructional tools that still require human control and underscore the need for continuous evaluation and validation as these technologies evolve.

Keywords

large language model, mathematics exams, Revised Bloom’s Taxonomy, ChatGPT, Gemini, Claude

Introduction

Among the many technological applications in education, large language models (LLMs) have come into wide use (Haspekian et al., 2024). The ability of AI systems, also known as chatbots or LLMs, to adapt to changing situations makes them an important resource for education both in classrooms and online (Long & Magerko, 2020). Accordingly, AI has reshaped learning and teaching processes by penetrating areas such as instruction, assessment and evaluation, and administrative affairs in education (Chen et al., 2020; Opara et al., 2023). These developments indicate that the impact of AI in education will continue to increase.

The use of AI in education has both advantages and disadvantages, and opinions on the subject vary (Adeshola & Adepoju, 2024). AI can guide learning and significantly enhance learning environments (Rane, 2024). Its capacity for personalised learning and constant availability promotes inclusive education practices (Opara et al., 2023). However, AI also has limitations, such as generating erroneous information and misleading users. Given the growing use of AI in education and the generally positive public perception of this technology, it is important to transform its potential risks into pedagogical advantages (Adeshola & Adepoju, 2024). Consequently, studies that critically examine LLMs, adapt them to educational systems, and highlight their limitations will enhance our understanding of the role of AI in education (Chen et al., 2020).

One application of artificial intelligence (AI) in education is assessment and evaluation. Research indicates that AI is used to evaluate student responses and generate assessment questions (Lo, 2023). In the context of mathematics education, the mathematical accuracy of questions used in assessment and evaluation processes is a key factor in evaluating AI-assisted question development applications (Svičević et al., 2025; Wei, 2024). This view is supported by studies in mathematics education indicating that AI responds adequately to basic-level questions but produces incorrect expressions and misinformation in advanced topics (Urhan et al., 2024; Wardat et al., 2023). The potential of AI to answer questions raises concerns regarding its level of question development competence. Studies evaluating the question development potential of AI across different fields and in mathematics education state that AI can generate questions but has limitations, such as including incorrect information and operating at low cognitive levels (Rycroft-Smith et al., 2024; Segal & Klemer, 2025; Urhan et al., 2024). It has also been noted that AI performs differently across various areas of mathematics, demonstrating greater proficiency in number theory than in geometry (Svičević et al., 2025; Wei, 2024). Nevertheless, it has been emphasised that AI demonstrates superhuman abilities in cognitively demanding tasks (Spreitzer et al., 2024; Svičević et al., 2025). Despite these limitations, it has been suggested that LLMs will reduce the time required for question preparation and that updated versions will yield better results in question development (Ngo, 2024). Examining, with the same prompts in 2024 and 2026, how LLMs change and what potential they show in fulfilling given tasks when developing questions for mathematics exams can therefore provide valuable insights into the evolving aspects of AI and the support it can offer in assessment and measurement.
It is important to reveal the change in LLMs over time by determining the extent to which they are involved in exam design, considering factors such as cognitive level, mathematical accuracy, and question types, and by evaluating their success in this process. This study, therefore, aims to investigate the potential, evolution, and comparison of LLMs for creating mathematics exams in two different learning areas (fractions and angles in triangles) with an interval of approximately 2 years. The exams were structured according to the Revised Bloom’s Taxonomy (RBT), which allows for determining the cognitive level of each question.

Theoretical Framework

The assessment of students’ mathematical competencies is important in mathematics education (Niss, 2003). Attention should be paid to the design of the tools used to assess students’ knowledge and skills (Neubrand, 2018). Although many methods can be used in assessment, written exams are among the most preferred (Omar et al., 2012). Studies examining the characteristics that exam questions should possess bring different features to the forefront. The literature (Bilgeç, 2016; Haladyna, 2004; Stein et al., 2009; Sweller, 1998) highlights factors such as question type, alignment with intended objectives, cognitive load, and scientific accuracy as prominent features. What matters is not only the information that questions assess but also the cognitive level at which that information is elicited. Cognitive load theory, proposed by Sweller (1988), is based on the principle that instructional content presented to students should be cognitively balanced. If the cognitive demands of the presented content exceed the learner’s capacity, learning, retention, and transfer are negatively affected (Plass et al., 2003). Instructional content and assessment tools presented to students should therefore use clear and understandable language, avoid unnecessary complexity, and maintain a balanced cognitive load. Stein et al. (2009) explain the cognitive nature of mathematical questions and emphasise that the level of cognitive demand in mathematical tasks directly affects students’ thinking processes. They state that mathematically superficial or poorly structured tasks are insufficient to activate higher-order cognitive processes, whereas tasks that make mathematical relationships explicit and require reasoning present higher cognitive demands.
Therefore, mathematics questions should be mathematically accurate and scientifically consistent and should include both higher- and lower-level cognitive questions to achieve balance (Stein et al., 2009; Swart, 2010). In this context, Bloom’s Taxonomy (Bloom et al., 1956), which presents cognitive levels in a hierarchical structure, provides a theoretical framework for determining the cognitive levels of mathematics examination questions and ensuring balance among them.

Bloom et al. (1956) developed the Taxonomy of Cognitive Domains, which categorises learning objectives according to cognitive processes. This taxonomy orders cognitive processes from simple to complex along a single dimension within a hierarchical structure. Some of its features were later criticised in light of changing needs, and Anderson and Krathwohl (2001) revised it in line with emerging theoretical and pedagogical needs. Thus, the Revised Bloom’s Taxonomy (RBT) was developed. The RBT considers cognitive processes at six levels—remembering, understanding, applying, analysing, evaluating, and creating—while classifying knowledge as factual, conceptual, procedural, and metacognitive (Krathwohl, 2002). Both the original and revised taxonomies acknowledge that lower cognitive levels require less complex processes than higher levels (Bloom et al., 1956; Krathwohl, 2002).

In curriculum development and instructional content design, Bloom’s Taxonomy is the most widely used and accepted framework for determining the cognitive levels of exam questions (Chang & Chung, 2009; Seaman, 2011). Bloom’s Taxonomy is used extensively in the analysis of exam questions, particularly in mathematics education (Akinboboye & Ayanwale, 2021). Since each step in the taxonomy requires a different level of mental activity, it is recommended that exams administered to students reflect each step of Bloom’s Taxonomy and include diverse question types (Afacan & Nuhoğlu, 2008; Linn & Gronlund, 2000).

Another criterion considered in exam design is question type. Open-ended and single-answer question types (e.g., multiple-choice, matching, and true–false questions) used in mathematics education have distinct strengths and weaknesses (Bilgeç, 2016). Open-ended questions, which can provide broader opportunities for student-centred assessment than multiple-choice questions, require greater effort and meticulousness in scoring and in ensuring validity and reliability (Carroll, 1999; O’Neil & Brown, 1998). On the other hand, although multiple-choice questions are easier to score, their preparation process requires meticulous work and can be challenging (Arıcan, 1996). While open-ended questions more effectively measure higher-level cognitive behaviours, multiple-choice questions offer the advantage of being administered to large populations and standardising assessment results (Friborg & Rosenvinge, 2013; O’Neil & Brown, 1998). In addition, matching and fill-in-the-blank questions are traditional assessment tools commonly used in mathematics education (Bilgeç, 2016). Matching questions are particularly effective in identifying connections between concepts, definitions, and mathematical relationships. However, they mostly measure cognitive skills at the recall and comprehension levels (Haladyna et al., 2002). Fill-in-the-blank questions require students to correctly recall and apply specific concepts, symbols, or operations. However, due to the limited context they provide, such questions may be inadequate for measuring higher-order cognitive processes (Haladyna, 2004). While these traditional question types offer advantages such as practicality and rapid feedback in the assessment process, they should be used carefully and sparingly when evaluating higher-order thinking skills, such as mathematical reasoning and problem solving (Haladyna et al., 2002). 
For this reason, question types such as open-ended or multiple-choice questions should be determined by considering their advantages and disadvantages in line with the purpose being measured (Bilgeç, 2016). Furthermore, to serve their purpose and accurately measure behaviour in mathematics education, questions must be scientifically sound and aligned with the target learning outcomes being assessed (Kilpatrick et al., 2001). Finally, it is important to write questions in clear and concise language so that students can correctly understand them and demonstrate the cognitive skills being measured. Questions containing unclear, linguistically complex, or ambiguous expressions have been shown to negatively affect students’ cognitive performance and weaken the validity of assessment results (Haladyna, 2004). In line with these explanations, mathematics exam questions must be scientifically sound, free from mathematical errors, relevant to the learning outcomes being assessed, and written in clear and understandable language. They should also reflect a balanced distribution of cognitive skills and be structured using appropriate question types.

Mathematics exams are assessment tools designed to measure specific learning outcomes and should be systematically designed in terms of cognitive level and question structure. The literature indicates that teachers most frequently use multiple-choice questions in classroom exams, while open-ended and short-answer questions are used less frequently (Doğan, 2019; Zeren et al., 2023). Furthermore, it has been suggested that exam questions prepared by teachers tend to focus on lower-level cognitive behaviours (Himmah et al., 2019). Similarly, national mathematics exam questions and the mathematics curriculum have been reported to lack a balanced distribution of cognitive levels, with a high concentration at the application and analysis levels (Aydoğdu & Gültekin, 2025). It is also emphasised that mathematics questions in textbooks predominantly assess lower-level cognitive skills (Bal & Yılmaz, 2022; Üredi & Ulum, 2020). In light of these limitations, it is becoming increasingly important to examine the opportunities offered by technology and LLMs for question development and exam preparation in mathematics education. In mathematics education, which naturally and intensively involves technology (Wing, 2006), it is necessary to address the potential contributions of digital tools and AI-based applications to assessment and evaluation processes, as well as the technical and methodological challenges that may arise in these processes (Çelik Görgüt, 2023; Nortvedt & Buchholtz, 2018; Trgalová & Tabach, 2024).

Today’s world is going through a period dominated by uncertainty, volatility, complexity, and ambiguity, where the speed and depth of development in AI technologies challenge the imagination. Educational processes are rapidly evolving from the stage of “AI-Guided Learning,” where students are passive recipients, to “AI-Interactive Learning,” where collaboration with systems occurs, and finally to the most advanced stage, the paradigm of “AI-Driven Personalized Learning,” where the student becomes the leader of their own educational journey (Kyambade et al., 2025; Ouyang & Jiao, 2021). Particularly in mathematics education, while problem posing is recognised as a fundamental process that enhances conceptual understanding and creativity, AI-supported scaffolding strategies have begun to play a critical role in developing this complex skill (Arslan & Güler Selek, 2025; Kim et al., 2026). In this longitudinal process covering the years 2024 to 2026, the immense increase in the mathematical reasoning and problem-generation capacities of LLMs (including the transition process from GPT-3.5 to advanced systems like GPT-4 and o1) has created a “digital ripple” that is changing the nature of education (Fan, 2025; Strielkowski et al., 2025; Walkington, 2025). However, the continued need for expert oversight regarding the accuracy and originality of problems generated by AI (Korkmaz Güler & Yıldız, 2025; Urhan et al., 2024) necessitates that teachers possess a critical perspective, using these tools as a complementary element (Kim et al., 2026; Walkington, 2025). In this context, monitoring the development of mathematical skills within a rapidly changing technological ecosystem is of decisive importance in the transition to intelligent educational systems that place the student at the centre of the process and prioritise human-machine collaboration. Accordingly, the study aims to answer the following research questions:

What are the performances of ChatGPT, Gemini, and Claude in generating mathematics exam questions on fractions and angles in triangles, regarding their strengths and limitations in RBT levels and question criteria (scientific relevance, target appropriateness, and understandability)?

To what extent do the performances of these models differ between their 2024 and 2026 versions across the specified criteria?

In this research, the free versions of the LLMs discussed in the literature were used in 2024 and 2026: ChatGPT-3.5, Gemini, and Claude 3.5 Sonnet in 2024, and ChatGPT, Gemini 3.0, and Claude 4.6 Sonnet in 2026. For the sake of clarity in the text, the LLMs will be referred to in the study using their general names along with the year. ChatGPT is the most widely used LLM, particularly in educational research, and has been studied extensively in areas such as measurement and evaluation, problem-solving, and content creation (Lo, 2023). The Gemini and Claude models have also become established in educational research, with comparative studies examining their performance in reasoning and content creation (Oh et al., 2024; Segal & Klemer, 2025). The growing use of these models in educational contexts has made it necessary to compare how different LLMs perform on similar tasks. In studies comparing LLMs, Lee et al. (2024) compared Gemini Pro with ChatGPT and concluded that Gemini was not very successful in text comprehension; its student evaluation performance was limited and could be improved through simpler expressions and fragmented visuals. Rycroft-Smith and Macey (2024) compared the ability of three different LLMs—Copilot, Claude, and ChatGPT 3.5—to generate questions on the area of a rectangle. While ChatGPT offered more multi-part questions and scoring opportunities, this also increased the probability of the model producing errors. Oh et al. (2024) found that ChatGPT performed best in solving mathematical problems, followed by Claude and Gemini. However, Korkmaz Güler and Yıldız (2025) found that although ChatGPT and Gemini had different strengths in identifying mathematical misconceptions in student answers, both LLMs produced incorrect solutions. 
Numerous studies emphasise that LLMs can produce incorrect solutions and should therefore be used with caution in mathematics education (Çavuş Erdem, 2025, 2026; Segal & Klemer, 2025; Urhan et al., 2024; Wardat et al., 2023). The performance of LLMs has been found to vary depending on the mathematics subject area, with geometry proving to be more challenging (Svičević et al., 2025; Wei, 2024). Taken together, these studies demonstrate the potential of LLMs for solving and generating questions. However, there are currently no studies longitudinally addressing the performance of LLMs in creating mathematics exams across different subject areas. In this context, longitudinally comparing the ability of ChatGPT, Gemini, and Claude to create mathematics exams on fractions and angles in triangles is expected to contribute to the existing literature.

Method

To collect more comprehensive data, the study aimed to develop mathematics exams covering topics from two learning areas in 2024 and 2026. The learning areas of numbers and geometry account for the largest proportion of topics in the mathematics curriculum. One topic was randomly selected from each learning area: “Fractions” and “Angles in Triangles”. The aim of the study is to answer the following question: What are the capabilities and limitations of the LLMs ChatGPT, Gemini, and Claude in creating mathematics exams on fractions and angles in triangles, aligned with the RBT levels and adhering to the required characteristics of exam questions (scientific relevance, target relevance, and understandability), and how have these tools changed over time (2024 and 2026)? The data were collected in May–June 2024 and in February 2026, and exactly the same steps were followed with the LLMs in both years (Figure 1).

Figure 1.

Steps followed in the study.

Data Collection

The selected LLMs were asked to create exams at the 6th-grade level (angles in triangles) and the 5th-grade level (fractions) based on the RBT in both years. The prompts provided were the same for all LLMs. When an error was detected in the analysed questions, the appropriate second prompt shown in Table 1 was applied according to the type of error. The number of prompts was limited to two to minimise interference with the LLMs. The first and second prompts given to the LLMs during exam creation are presented in Table 1.

Table 1.

Prompts Used in the Study.

First prompt	Second prompt (used only for incorrect questions)
First prompt for the mathematics exam on fractions: Prepare a mathematics exam on fractions at the 5th-grade level with questions appropriate to each level of the RBT. The exam you have created should include open-ended, matching, fill-in-the-blank, and multiple-choice questions. First prompt for the mathematics exam on angles in triangles: Prepare a mathematics exam on angles in triangles at the 6th-grade level with questions appropriate to each level of the RBT. The exam you have created should include open-ended, matching, fill-in-the-blank, and multiple-choice questions.	Not suitable for the level: Write a new question suitable for the level.
	The question is not clear; make it understandable.
	Not suitable for the specified learning outcome: Write a new question suitable for the learning outcome.
	Create the question by adding a visual.
	The question is scientifically unsuitable or has no valid solution; write a new question.

As shown in Table 1, the first prompts were the same, whereas the second prompts varied depending on the questions generated. For example, if a problem was detected in the comprehensibility of a question, the second prompt used was “The question is not clear; make it understandable.”
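
The selection of a second prompt by error type can be illustrated with a minimal sketch. The prompt texts are copied from Table 1; the dictionary keys and the function name are hypothetical labels introduced here for illustration and are not part of the study's procedure.

```python
# Second prompts copied from Table 1. The keys are hypothetical labels
# introduced for this sketch only; the study does not name error types this way.
SECOND_PROMPTS = {
    "level": "Not suitable for the level: Write a new question suitable for the level.",
    "clarity": "The question is not clear; make it understandable.",
    "learning_outcome": (
        "Not suitable for the specified learning outcome: "
        "Write a new question suitable for the learning outcome."
    ),
    "visual": "Create the question by adding a visual.",
    "scientific": (
        "The question is scientifically unsuitable or has no valid solution; "
        "write a new question."
    ),
}

MAX_PROMPTS = 2  # the study limited each task to two prompts


def select_second_prompt(error_type):
    """Return the Table 1 follow-up prompt for an error type, or None."""
    if error_type is None:
        return None  # question judged correct: no second prompt required
    return SECOND_PROMPTS[error_type]


print(select_second_prompt("clarity"))
```

Under this reading, a question judged correct after the first prompt receives no follow-up, and no question receives more than one follow-up.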

Data Analysis

Content analysis was used to evaluate the questions generated by the LLMs. The same analysis steps were applied for the 2024 and 2026 data. The questions were evaluated by considering the criteria of scientific relevance, target relevance, and understandability in accordance with the RBT levels. The steps followed in the study are presented in Figure 1.

The questions at each level in the created exams were analysed individually. For example, ChatGPT generated four questions on the angles in triangles from the remember level in the exam. During the analyses, the total number of questions at the remember level was considered to be four, and each question was analysed individually. In the analyses conducted according to the RBT, the features that questions should measure at each level were tabulated (Table 2). Content analysis was conducted by the researchers. The researchers were experts in the field of mathematics education. Coders were fully blinded to LLM identity. All outputs from the three LLMs were randomised and pooled into an anonymous collection, with each output assigned a neutral identification number. The coding was performed using only these anonymised texts and pre-defined objective criteria. The researchers involved in the coding were unaware of which output belonged to which LLM until the evaluation phase was complete. Both evaluators independently coded all questions based on the criteria of RBT alignment, scientific accuracy, clarity, and alignment with learning outcomes. When coding according to the RBT, if a question possessed characteristics of more than one cognitive level, the highest level to which the question primarily directed the response was selected. This rule was explained to the evaluators prior to coding and reinforced with examples. Following independent coding, the agreement between the analyses was calculated using the formula developed by Miles and Huberman (1994) [Reliability = Agreement / (Agreement + Disagreement)], and the agreement rate between the coders was 92%. For the disagreements comprising the remaining 8%, support was obtained from an expert in mathematics education with experience in RBT and assessment and evaluation. 
A detailed discussion process was conducted with the participation of both researchers and the expert, resulting in the resolution of coding disagreements and the achievement of a final consensus. All analyses were reported based on these final agreed-upon codes. Sample analyses conducted within the scope of the criteria are presented in Table 3. In the study, the criteria determined by Krathwohl (2002) were considered while checking compliance with the taxonomy level. These criteria constitute the structure of the cognitive process dimension of the RBT and are presented in Table 2.
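
The Miles and Huberman (1994) agreement formula cited above can be sketched as follows. The two lists of codes are invented toy data used only to show the arithmetic; they are not the study's data, and the function name is hypothetical.

```python
def agreement_rate(codes_a, codes_b):
    """Miles and Huberman (1994): agreement / (agreement + disagreement)."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of items")
    agreements = sum(a == b for a, b in zip(codes_a, codes_b))
    disagreements = len(codes_a) - agreements
    return agreements / (agreements + disagreements)


# Toy example (hypothetical codes for four questions, not study data).
coder_1 = ["suitable", "suitable", "not suitable", "suitable"]
coder_2 = ["suitable", "not suitable", "not suitable", "suitable"]

print(agreement_rate(coder_1, coder_2))  # 3 agreements out of 4 items
```

This index is a simple percentage of agreement; it does not correct for agreement expected by chance.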

Table 2.

Exam Question Evaluation Criteria.

Criteria		Description of the criterion
Scientific Relevance		The problem must be consistent with mathematical rules and relationships, the concepts and symbols used must be correctly defined, and the solution must be mathematically correct.
Target Relevance		The question should be consistent with the relevant subject area and the learning outcomes included in the curriculum.
Understandability		The question should be written in clear, concise, and simple language that students can easily understand, and should not contain any ambiguities in its expressions and presentations.
Suitability for RBT levels	Remember	The problem requires recalling and recognising fundamental concepts, definitions, symbols, or formulas related to the subject.
	Understand	The problem requires the interpretation, explanation, or comparison of information presented verbally, in written form, or visually.
	Apply	The problem requires the application of learned mathematical knowledge, rules, methods, or procedures to a new situation.
	Analyse	The problem requires breaking down the mathematical structure related to the subject into its components and analysing the relationships between those components.
	Evaluate	The problem requires making a judgement by evaluating the mathematical solution, process, or results according to specific criteria.
	Create	The problem requires bringing together mathematical knowledge to create an original and consistent solution, model, or problem.
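
The coding rule described in the Data Analysis section, under which a question showing characteristics of more than one cognitive level is assigned the highest level it primarily targets, can be sketched using the ordering of levels in Table 2. This is a simplified illustration: it assumes the coder has already listed the levels a question primarily targets, and the function name is hypothetical.

```python
# RBT cognitive process levels in ascending order, as listed in Table 2.
RBT_LEVELS = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]


def resolve_level(candidate_levels):
    """Return the highest RBT level among the candidate levels of a question."""
    if not candidate_levels:
        raise ValueError("at least one candidate level is required")
    # list.index gives each level's rank in the ascending hierarchy.
    return max(candidate_levels, key=RBT_LEVELS.index)


print(resolve_level(["Understand", "Apply"]))
```

The judgement of which levels a question primarily targets remains with the human coders; the sketch only captures the tie-break between levels.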

Table 3.

Examples of the Question Analysis Process.

Criteria	Sample question	Analysis criteria	Analysis	Analysis evaluation	Second prompt
Compliance with the RBT level	Question generated for the create level: In a purchase, the amount represented by the fraction 3/5 equals 15 TL. What is the total cost of this purchase?	At the create level, questions should involve planning or producing new products or solutions.	The question is at the apply level, as it requires procedural knowledge of fractions.	Incorrect question	Not suitable for the level: Write a new question suitable for the level.
Target relevance (learning outcomes)	A pizza is divided into eight equal slices. If three-eighths of the pizza are eaten, what fraction of the pizza remains?	Compliance with the subject of fractions	It was observed that the question written for the topic of fractions instead assessed subtraction with natural numbers.	Incorrect question	Not suitable for the specified learning outcome: Write a new question suitable for the learning outcome.
Understandability	The measures of two angles of a triangle are 30° and 60°. Determine the measure of the third angle and identify the type of triangle. Then, draw and name any triangle with these properties.	Clarity of task requirements and compliance with punctuation and spelling rules	It was observed that the question was clear, and spelling and punctuation were appropriate.	Correct question	No prompt required
Scientific relevance	Which of the following triangles is an isosceles triangle? a) Triangle with all sides equal b) Triangle with two equal sides c) Triangle with all angles different d) Triangle with two equal angles	Objective, valid, and correct questions	It was observed that the question had two correct answers. Such questions are not scientifically appropriate, as a triangle with two equal sides also has two equal angles.	Incorrect question	The question is scientifically unsuitable or has no valid solution; write a new question.

Each question generated by the LLMs was coded as suitable or not suitable according to whether it met the criteria listed above.

As shown in Table 2, the features associated with each level were specified, and the questions were analysed in terms of whether they reflected these features. For example, if a question at the understanding level contains features such as interpretation, exemplification, classification, summarising, inference, comparison, or explanation, it was classified as suitable for that level. In addition to the RBT, criteria related to scientific relevance, target relevance, and understandability were established, and each question was analysed individually by the researchers. Sample analyses illustrating the question analysis process are presented in Table 3.

As shown in Table 3, the questions were analysed in detail according to the specified criteria, and the second prompt was provided only for incorrect questions. In order to check the internal validity of the study, the consistency and language of the data collection tool were checked during the data collection process. To test external validity, the data obtained from the LLMs were coded according to the RBT levels; criteria for scientific relevance, target relevance, and understandability were defined; the data analysis stages were described in detail; the findings were reported; and the results were compared with the literature. The researchers were responsible for data collection, data analysis, detailed reporting of the findings, and objective comparison of the results with the literature.

Results

To investigate the capabilities and limitations of ChatGPT, Gemini, and Claude, the questions generated by these LLMs in 2024 and 2026 were analysed using content analysis based on four main criteria: alignment with Revised Bloom’s Taxonomy (RBT) levels, scientific relevance, target relevance, and understandability. The distribution of questions across cognitive levels was examined to determine how well each tool addressed lower- and higher-order thinking skills. Additionally, a longitudinal comparison between the 2024 and 2026 data was conducted to identify changes in the tools’ performance over time. This section presents comparative analyses of the mathematics exam questions generated by ChatGPT, Gemini, and Claude. The first part presents data on angles in triangles; the second, on fractions; and the final part presents a comparative analysis of the LLMs’ identified limitations and advantages, supported by examples. In the study design, each LLM was given up to two prompts to perform a task. Tables 4 and 5 present the success rates of each LLM after the first and second prompts in 2024 and 2026 for angles in triangles and fractions, respectively, and compare them longitudinally. In the subsequent tables, the LLMs’ question-generation performance for the first and second research questions is analysed separately for 2024 (Tables 6 and 7) and 2026 (Tables 8 and 9), using each criterion. Table 10 presents the changes in the strengths and limitations of the LLMs used in the study across the years, within the scope of the research questions. Additionally, the findings are supported by screenshots taken from the LLMs. Thus, the results presented below reveal both the comparative strengths and weaknesses of each LLM and their evolution across the 2-year period. The performance of the LLMs in generating questions related to angles in triangles following the first and second prompts is presented in Table 4.

Table 4.

The Number and Percentage of Questions Generated for the Angles in Triangles Subject in the First and Second Prompts.

Steps of taxonomy	ChatGPT						Gemini						Claude
	Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Percentage of appropriate questions generated after the second prompt across all LLMs (%)
	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026
Remember	4	4	100	100	100	100	4	3	100	100	100	100	2	3	100	100	100	100	100	100
Understand	4	4	100	75	100	100	4	3	100	100	100	100	2	4	100	67	100	100	100	100
Apply	4	4	50	75	100	100	4	3	100	100	100	100	2	3	100	100	100	100	100	100
Analyse	2	4	50	50	100	100	4	3	100	25	100	75	2	3	0	67	50	67	83.3	80.7
Evaluate	2	4	0	25	50	100	2	2	50	75	100	100	2	2	0	100	50	100	66.6	100
Create	2	4	0	50	100	75	4	2	50	33	75	67	2	2	0	100	50	100	75	80.7
Average percentage	18	24	50	62.5	91.6	95.8	22	16	83.3	72.2	95.8	90.3	12	17	50	89	75	94.5	87.4	93.5

Table 5.

The Number and Percentage of Questions Generated for the Fractions Subject in the First and Second Prompts.

Steps of taxonomy	ChatGPT						Gemini						Claude
	Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Total number of questions generated		Percentage of appropriate questions generated with the first prompt (%)		Percentage of appropriate questions generated with the second prompt (%)		Percentage of appropriate questions generated after the second prompt across all LLMs (%)
	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026	2024	2026
Remember	4	4	75	100	100	100	4	4	100	100	100	100	4	5	75	100	100	100	100	100
Understand	4	4	75	75	100	100	4	2	100	50	100	100	4	4	75	100	100	100	100	100
Apply	4	4	75	100	100	100	4	3	100	100	100	100	4	4	100	100	100	100	100	100
Analyse	2	4	50	75	50	100	4	3	50	67	75	100	2	3	50	100	100	100	75	100
Evaluate	2	4	0	75	50	100	2	2	0	67	50	100	2	2	0	100	50	100	50	100
Create	2	4	0	25	100	100	2	2	0	50	50	100	2	2	0	50	100	100	83.3	100
Average percentage	18	24	45.8	75	83.3	100	20	16	58.3	72.3	79.1	100	18	20	50	91.7	91.6	100	84.7	100

Table 6.

Comparative Analysis of the Questions Generated for the Angles in Triangles Subject According to Taxonomy Levels.

Taxonomy levels	ChatGPT						Gemini						Claude
Taxonomy levels	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance
Remember	4	4oe	100	100	100	100	4	4oe	100	100	100	100	2	2fb	100	100	100	100
Understand	4	4oe	100	100	75	100	4	3oe 1mc	100	100	75	100	2	2mc	100	100	100	100
Apply	4	2oe 2mc	100	100	100	100	4	4oe	100	100	100	100	2	2oe	100	100	100	100
Analyse	2	2oe	100	100	100	100	4	4oe	100	100	100	100	2	2m	100	50	50	100
Evaluate	2	2oe	100	50	100	100	2	2oe	100	100	100	100	2	2oe	100	50	50	100
Create	2	2oe	100	100	100	100	4	4oe	100	75	100	100	2	2oe	100	50	100	100
Average percentage	—	—	100	91.6	95.8	100			100	95.8	95.8	100			100	75	83.3	100

Note. oe = open-ended question; mc = multiple-choice question; m = matching question; fb = fill-in-the-blank question.

Table 7.

Comparative Analysis of the Questions Generated for the Fractions Subject According to Taxonomy Levels.

Taxonomy levels	ChatGPT						Gemini						Claude
Taxonomy levels	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance
Remember	4	3oe1mc	100	100	100	100	4	3mc1oe	100	100	100	100	4	2oe2mc	100	100	75	100
Understand	4	4oe	100	100	100	100	4	3fb1m	100	100	100	100	4	2oe2fb	100	100	75	100
Apply	4	3oe1mc	100	100	100	100	4	4oe	100	100	100	100	4	4oe	100	100	100	100
Analyse	2	1oe1mc	100	50	100	100	4	4oe	100	75	75	100	2	2oe	100	100	50	100
Evaluate	2	2oe	100	50	100	100	2	2oe	100	50	100	100	2	1oe1fb	100	50	100	100
Create	2	2oe	100	100	100	100	2	2oe	100	50	100	100	2	2oe	100	100	50	100
Average percentage	—	—	100	83.3	100	100	—	—	100	79.1	95.8	100	—	—	100	91.6	75	100

Note. oe = open-ended question; mc = multiple-choice question; m = matching question; fb = fill-in-the-blank question.

Table 8.

Comparative Analysis of the Questions Generated for the Angles in Triangles Subject According to Taxonomy Levels.

Taxonomy levels	ChatGPT						Gemini 3						Claude
Taxonomy levels	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance
Remember	4	2fb1mc1m	100	100	100	100	3	2fb1tf	100	100	100	100	3	1tf1m1fb	100	100	100	100
Understand	4	1mc2oe1tf	100	100	50	100	3	2tf1oe	100	100	100	100	4	4oe	100	100	100	100
Apply	4	4oe	100	100	75	100	3	3mc	100	100	100	100	3	3oe	100	100	100	100
Analyse	4	4oe	100	100	100	100	3	2oe2m	100	75	100	100	3	1oe2mc	100	67	100	100
Evaluate	4	4oe	100	100	100	100	2	4oe	100	100	100	100	2	1oe1mc	100	100	100	100
Create	4	4oe	100	75	75	100	2	3oe	100	67	100	100	2	2oe	100	100	100	100
Average percentage	—	—	100	95.8	83.3	100	—	—	100	90.3	100	100	—	—	100	94.5	100	100

Note. oe = open-ended question; mc = multiple-choice question; m = matching question; fb = fill-in-the-blank question; tf = true-false question.

Table 9.

Comparative Analysis of the Questions Generated for the Fractions Subject According to Taxonomy Levels.

Taxonomy levels	ChatGPT						Gemini 3						Claude 4.6 Sonnet
Taxonomy levels	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance	Number of questions	Question type	Percentage of scientific relevance	Percentage of compliance with taxonomy levels	Percentage of understandability	Percentage of target relevance
Remember	4	2fb2m	100	100	100	100	4	4fb	100	100	75	100	5	1fb1tf1m2mc	100	100	100	100
Understand	4	2mc2oe	100	100	100	100	2	2m	100	100	50	100	4	1mc2oe1fb	100	100	100	100
Apply	4	4oe	100	100	100	100	3	3mc	100	100	100	100	4	2mc2oe	100	100	100	100
Analyse	4	4oe	100	100	100	100	3	2oe1mc	100	100	100	100	3	2oe1mc	100	100	100	100
Evaluate	4	4oe	100	100	100	100	2	3oe	100	100	67	100	2	2oe	100	100	100	100
Create	4	4oe	100	100	50	100	2	2oe	100	100	100	50	2	2oe	100	100	100	100
Average percentage	—	—	100	100	91.7	100	—	—	100	100	82	91.7	—	—	100	100	100	100

Note. oe = open-ended question; mc = multiple-choice question; m = matching question; fb = fill-in-the-blank question; tf = true-false question.

Table 10.

The Limitations and Advantages of the LLMs Used in the Study.

LLMs	Limitations	Advantages
ChatGPT
2024 version	Preparing questions using the order and names of the original Bloom’s Taxonomy levels instead of the RBT; causing misconceptions by producing incorrect triangle drawings; providing incorrect information about the order of fractions; high similarity among questions prepared at the same level; a tendency to generate predominantly open-ended questions; including multiple-choice questions with no correct answer and reproducing open-ended questions after the second prompt	High understandability of the questions generated
2026 version	High similarity among questions prepared at the same level; a tendency to generate questions that do not contain visuals; a tendency to generate short-text questions at upper-order levels	Generating questions that are appropriate for the taxonomy, scientifically accurate, and aligned with learning objectives; including different question types
Gemini
2024 version	Inability to produce triangle drawings and reliance on verbal descriptions; a tendency to form open-ended questions	Providing triangle images via external links (e.g., Quora), with the drawings being correct; providing YouTube or GeoGebra links when content is deemed insufficient, with the links provided being appropriate; high understandability of the questions generated; indicating the estimated time required to complete the taxonomy-based exam after the first prompt
2026 version	High similarity among questions prepared at the same level; inability to generate questions containing visuals even after the second prompt; a tendency to generate short-text questions at upper-order levels	Generating questions that are appropriate for the taxonomy, scientifically accurate, and aligned with learning objectives; generating extra multiple-choice questions using a Google user content link; capacity to generate diverse question types (open-ended, matching, multiple-choice, fill-in-the-blank)
Claude
2024 version	Failure to produce appropriate triangle drawings (able to draw right and acute triangles but not obtuse triangles); lower level of understandability of the questions compared to the other LLMs; providing incorrect information about fractions	Identifying unclear aspects of its own responses after receiving a prompt and elaborating on those parts by posing and answering follow-up questions; explaining the characteristics of each taxonomy-based question and justifying its suitability for the relevant level without an additional prompt; high capacity to generate diverse question types (open-ended, matching, multiple-choice, fill-in-the-blank)
2026 version	—	High capacity to generate diverse question types (open-ended, matching, multiple-choice, fill-in-the-blank); generating questions that are appropriate for the taxonomy, scientifically accurate, and aligned with learning objectives; providing rich content in generating questions that include visuals; creating an answer key for the exam; creating a complete exam paper including exam duration and instructions

An examination of Table 4 reveals that in the 2024 versions, after the first prompt, the percentage of appropriate questions was higher at lower-order cognitive levels than at upper-order levels. The increase in success at upper-order levels was generally achieved through corrective prompts such as “write a new question appropriate for the level.” During this period, Gemini had the highest rate of generating questions appropriate for the taxonomy, while Claude had the lowest. In the 2026 versions, the LLMs were again more successful at lower-order levels; however, after corrective prompts (“make it appropriate for the level,” “write it in a more comprehensible way,” “rearrange by adding a visual”), a marked improvement was achieved, especially in the quality of upper-order questions. This development was most prominent in Claude, which showed the highest success after the first prompt, while ChatGPT showed the highest success after the corrective prompt. The post-correction performance of the three LLMs in 2026 was broadly similar. Compared to the 2024 versions, the 2026 versions showed an increase of 6.1 percentage points in overall performance after the second prompt (from 87.4% to 93.5%).

The distribution of questions generated by the LLMs after the second prompt according to the criteria will be presented separately for the 2024 and 2026 versions. Table 6 presents the distribution for the 2024 version.

Table 6 shows that there are differences between the 2024 versions of the LLMs. Gemini was found to be the LLM that met the criteria to the greatest extent. The LLMs generally created questions according to taxonomy levels. While Claude was the LLM that generated the greatest variety of question types, all LLMs predominantly produced open-ended questions. While ChatGPT and Gemini were generally successful in generating questions aligned with taxonomy levels after the second prompt, Claude’s percentage decreased at the higher levels. Similarly, limitations were observed in Claude with respect to the understandability criterion. All LLMs were 100% successful in terms of scientific relevance and target relevance. The distribution of questions according to the 2026 versions is presented in Table 8.

Looking at Table 8, it is seen that with the second prompt, the LLMs largely met the criteria determined for the mathematics exam. Overall, Claude was the LLM that best met the criteria. In terms of question diversity, Gemini and ChatGPT showed progress compared to their previous versions; however, for upper-level questions, the LLMs generally generated open-ended questions. Similar to 2024, Claude used different question types. All three LLMs generated questions that were mathematically accurate and aligned with the learning objectives. The LLMs showed progress in creating questions according to taxonomy levels. Regarding the comprehensibility criterion, while ChatGPT’s performance did not show improvement, Gemini and Claude reached 100%.

The percentages of the questions generated for the fractions topic after the first and second prompts are presented in Table 5.

An examination of Table 5 reveals that in the 2024 versions, all LLMs performed better at lower-order levels. In the versions from that period, no appropriate questions were generated for the Evaluate and Create levels with the first prompt, whereas appropriate questions were generated with the second prompt. While Gemini had the highest percentage of appropriate questions with the first prompt, Claude achieved the highest percentage with the second prompt. In the 2026 versions, a marked performance increase was observed in all models at both lower-order and upper-order levels. After the second prompt, the rates of generating questions appropriate for upper-order cognitive levels reached 100%. This development was more pronounced in Claude, indicating that the model achieved a remarkable improvement in its capacity to generate questions appropriate for upper-order levels. The improvement from 2024 to 2026 after the second prompt was larger for the fractions subject (15.3 percentage points, from 84.7% to 100%) than for the angles in triangles subject (6.1 percentage points). The distributions of questions according to each criterion for the 2024 versions are presented in Table 7.
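The year-on-year gains discussed for Tables 4 and 5 can be rechecked from the tables themselves. The following is a minimal sketch, assuming that the overall figure is the unweighted mean of the three per-model averages after the second prompt (an assumption that reproduces the reported 87.4, 93.5, and 84.7 to within rounding); the variable and function names are illustrative only.

```python
# Per-model averages after the second prompt, copied from the
# "Average percentage" rows of Tables 4 and 5 (order: ChatGPT, Gemini, Claude).
SECOND_PROMPT_AVERAGES = {
    "angles in triangles": {2024: [91.6, 95.8, 75.0], 2026: [95.8, 90.3, 94.5]},
    "fractions": {2024: [83.3, 79.1, 91.6], 2026: [100.0, 100.0, 100.0]},
}


def overall(values):
    """Unweighted mean across the three LLMs (assumed aggregation)."""
    return sum(values) / len(values)


def gain_in_points(subject):
    """Change from 2024 to 2026 in percentage points for one subject."""
    by_year = SECOND_PROMPT_AVERAGES[subject]
    return overall(by_year[2026]) - overall(by_year[2024])


for subject in SECOND_PROMPT_AVERAGES:
    print(f"{subject}: {gain_in_points(subject):+.1f} percentage points")
```

Expressing the change in percentage points rather than as a relative percentage keeps the comparison between the two subjects unambiguous.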

Table 7 shows that the highest percentage of questions appropriate for the taxonomy belongs to Claude, while the lowest belongs to Gemini. While open-ended questions were the most frequently used overall, Gemini and Claude generated the widest variety of question types. In Gemini, despite the second prompt, the percentage of questions aligned with taxonomy levels decreased at the higher levels. Some limitations were observed in Claude with respect to the understandability criterion. This issue was infrequent in the other LLMs. No problems were detected in any LLM after the second prompt in terms of scientific relevance and target relevance. The question distribution of the 2026 versions of the LLMs is given in Table 9.

Table 9 shows that all three LLMs achieved complete success in creating questions appropriate for the taxonomy. Claude was the LLM that achieved complete success in every criterion and made significant progress in the comprehensibility criterion. Gemini and ChatGPT have some limitations in producing comprehensible questions. In terms of question diversity, while all three LLMs offered rich content, Claude used more question types. Open-ended questions are prominent at upper-order levels, and the tools developed questions that were scientifically sound and aligned with the learning objectives.

ChatGPT, Gemini, and Claude’s 2024 and 2026 versions had both advantages and limitations. The advantages and limitations identified within the scope of the study are presented in Table 10.

As shown in Table 10, each LLM exhibits distinct advantages and limitations. It is observed that some limitations were resolved in the 2026 versions of the LLMs. The 2026 version of Claude stands out compared to other LLMs with its advantages, such as creating an answer key and preparing exam instructions. Sample cases from the 2024 versions of each LLM are presented. Figure 2 presents an example of a limitation observed in ChatGPT.

Figure 2.

Example screenshot illustrating the limitations of ChatGPT.

As shown in Figure 2, after the first prompt, ChatGPT organised questions according to the original Bloom’s Taxonomy rather than the RBT. Accordingly, the final taxonomy levels are treated as analysis, synthesis, and evaluation, and questions are generated following this sequence.

Figure 3.

Screenshot illustrating the advantages of the Gemini AI Tool.

An example case of the Gemini LLM is presented in Figure 3.

As shown in Figure 3, Gemini could not draw an obtuse triangle; however, it explained step by step how to draw an obtuse triangle and provided links for additional information.

Finally, a screenshot illustrating incorrect triangle drawings produced by Claude is presented in Figure 4.

Figure 4.

Screenshot illustrating the limitations of Claude.

As shown in the figure, Claude identified the image on the left as a right triangle while generating the question. The generated shape clearly has four sides. Although the shape has four sides, it is described as having three angles (30–60–90), which is inconsistent with the properties of a triangle. The model failed to generate a triangle with three sides corresponding to a 30–60–90 right triangle. An attempted example of an obtuse triangle is also shown on the right. The generated shape is not an obtuse triangle but instead consists of an acute triangle combined with a five-sided polygon. Errors related to triangle drawings constitute one of Claude’s limitations.

In the 2026 versions, all LLMs demonstrated overall success in exam creation. However, the ability to generate questions containing visuals was identified as a significant difference among the LLMs. The initial questions generated by ChatGPT and Gemini did not include visual elements. With the second prompt, ChatGPT added some visual elements that were mathematically accurate, although some were not entirely clear. Despite the second prompt, Gemini did not include visual elements. Claude was able to generate error-free questions containing visuals with the first prompt. The visuals generated by ChatGPT and Claude are presented below (Figure 5).

Figure 5.

Screenshot of the visuals generated by ChatGPT and Claude.

Discussion

This study comparatively revealed, using data from 2024 and 2026, the current potential and limitations of existing LLMs in the areas of “fractions” and “angles in triangles” in terms of their suitability for RBT levels and their ability to meet key question characteristics (scientific accuracy, understandability, and alignment with specified learning outcomes). In 2024, all three LLMs demonstrated a high potential to generate appropriate questions at the lower taxonomy levels. However, these percentages decreased at the higher taxonomy levels. A key finding of the study is that LLMs were more successful in preparing questions that measure lower-level cognitive skills (e.g., recall and understanding) but lagged behind in generating questions that assess higher-level cognitive skills (e.g., evaluation and creation). This finding indicates that LLMs still lag behind human expertise in tasks requiring mathematical reasoning, original scenario construction, and multi-step evaluation. It is well established that lower cognitive levels require less complex cognitive processes than higher levels (Bloom et al., 1956; Krathwohl, 2002). Bloom’s lower levels theoretically target pattern-recognition cognitive processes, whereas higher levels target analytical reasoning (Monrad et al., 2021). The findings of the study also showed that this theoretical distinction was directly reflected in LLM performance in 2024. The 2024 data of LLMs were found to be successful in imitating standard question levels and sentence structures frequently encountered in textbooks or online sources. However, they were generally unsuccessful in preparing questions that measure higher-level skills, such as questioning the rationale behind a concept, critiquing a solution, or constructing a new problem scenario. Frieder et al. (2024) observed that ChatGPT was insufficient for writing complex and creative mathematical problems. 
Within the scope of this study, it was determined that this limitation also applied to Gemini and Claude with the 2024 data. Since each level of the taxonomy requires a different degree of mental activity, it is recommended that exams administered to students reflect all levels of Bloom’s Taxonomy and include diverse question types (Afacan & Nuhoğlu, 2008; Linn & Gronlund, 2000). An examination of the 2026 data revealed considerable progress in LLM performance. Similar to the 2024 data, the success in writing questions targeting the lower levels of the taxonomy with the first prompt was high, but decreased when moving to upper levels. However, with the second prompt, all questions prepared by all LLMs for the fractions subject were at the desired taxonomy level, and a high level of success was achieved for the angles in triangles subject. The most important factor underlying this leap is thought to be the evolution of the methods and datasets used in training the LLMs. In the period from 2024 to 2026, the models were likely trained not only with more mathematical problems but also with “chain-of-thought” data explaining problem-solving strategies, proof techniques, and multi-step reasoning processes. Consequently, the models appear to have become capable of modelling not just the answer to a question, but also the logical structure and cognitive steps that constitute that question. In other words, LLMs appear to have moved beyond knowing “what” to ask, learning “how” to design a question that will stimulate thinking. This success observed in 2026 can be considered evidence that LLMs are transforming from mere information storage and repetition tools into tools capable of creating pedagogical content and designing cognitive processes. This situation is promising for the future of educational technologies and AI-supported learning platforms. As Kyambade et al. (2025) stated, this rapid development can be seen as a process of AI approaching the “AI-Driven Personalized Learning” paradigm. However, despite the success achieved with second prompts, using LLMs to generate appropriate questions for each level when preparing mathematics exams on “fractions” and “angles in triangles” still requires teacher supervision and pedagogical filtering. This result has been frequently emphasised in the mathematics education literature (Segal & Klemer, 2025; Urhan et al., 2024). The literature likewise stresses that teachers need to approach LLMs critically and use them as a complementary element (Kim et al., 2026; Walkington, 2025). In other words, LLMs should be seen as a complementary tool rather than a replacement for traditional instruction (Turmuzi et al., 2026). The study results support the literature in this sense.

As individual approaches are important in the question-based categorisation of the RBT (Monrad et al., 2021), the approaches adopted by LLMs are also important. It is an expected result that LLMs, which contain information in raw form, are effective at preparing knowledge-oriented questions without critically examining the information when generating lower-level questions in the first prompt. This situation is also observed in human performance. Studies have shown that teachers’ and educators’ ability to write questions targeting the upper levels of the RBT is lower than their ability to write questions targeting lower levels (Boylu, 2019; Himmah et al., 2019; Özyalçın & Kana, 2020). The results show that although the success of question preparation in the first prompt increased in 2026, LLMs do not surpass human performance in tasks such as preparing questions at the upper levels of the RBT when only the first prompt is used. However, controlling AI-generated questions through appropriate prompts was found to have the potential to save time and resources in the creation of mathematics exams. In addition, this approach can provide insight into potential question types by making the time- and experience-intensive question-writing process more efficient. However, because chat-based LLMs operate based on user-provided commands and generate different responses depending on the language and specificity of those commands (Rospigliosi, 2023), the study’s results may vary according to the prompts used. When a more general instruction was used, particularly in the first prompt (Prompt 1), LLMs were more successful in preparing lower-level questions in both data sets. In contrast, when asked in the second prompt to rewrite questions by indicating their shortcomings (alignment with RBT levels, scientific relevance, understandability, and target relevance), the models were more successful in preparing higher-level questions.
The most effective second prompts were those related to scientific relevance and target relevance. Limiting the process to a maximum of two prompts was intended to minimise researcher intervention and to identify the natural tendencies and basic capabilities of LLMs. By the end of the study, it was shown that increasing the number of interventions or requesting more specialised prompts led to more successful outcomes in preparing mathematics exams at the desired level. At this stage, AI, whose performance improves with human intervention, can be considered to play a supportive role in the exam creation process by facilitating teachers’ work and saving time. Looking at the 2024 and 2026 data, the remaining limitation at upper levels may be addressed through more specific prompts, such as: “Prepare a question suitable for the analysis level in which the student learns that dividing by a fraction is equivalent to multiplying by its reciprocal.” Ünal et al. (2025) found that customised prompts performed better in preparing mathematics questions. This finding is consistent with the results of the present study and highlights the decisive role of prompt engineering in LLM performance. Therefore, it should be noted that more numerous and higher-quality prompts may yield different results and that the present findings are valid within the context of a specific prompt set and a limited number of questions.
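The two-prompt procedure described above can be illustrated with a short sketch. It only assembles prompt text and calls no LLM service. The corrective phrases are those quoted in the Results section; the criterion labels, the function names, and the wording of the first prompt are hypothetical and are not the prompts used in the study.

```python
# Corrective phrases quoted in the Results section. The dictionary keys are
# hypothetical labels chosen for this sketch, not terms from the study.
CORRECTIVE_PHRASES = {
    "taxonomy_level": "make it appropriate for the level",
    "understandability": "write it in a more comprehensible way",
    "visual": "rearrange by adding a visual",
}


def build_first_prompt(topic, level):
    """General first prompt (illustrative wording, not the study's prompt)."""
    return (
        f"Prepare a mathematics exam question on {topic} suitable for the "
        f"{level} level of the Revised Bloom's Taxonomy."
    )


def build_second_prompt(question_number, failed_criteria):
    """Corrective second prompt; None when the question needs no revision.

    Mirrors the study design, in which a second prompt was given only for
    questions judged inappropriate and no further prompts followed.
    """
    if not failed_criteria:
        return None
    unknown = [c for c in failed_criteria if c not in CORRECTIVE_PHRASES]
    if unknown:
        raise ValueError(f"Unknown criteria: {unknown}")
    requests = "; ".join(CORRECTIVE_PHRASES[c] for c in failed_criteria)
    return f"For question {question_number}: {requests}."


print(build_first_prompt("fractions", "Analyse"))
print(build_second_prompt(5, ["taxonomy_level", "visual"]))
```

Keeping the corrective phrases in one table makes it easy to compare more specialised wordings, which is the kind of prompt variation the paragraph above suggests may change the results.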

The study focused on the middle school–level topics of “fractions” and “angles in triangles.” The nature of these topics may have influenced LLM performance. For example, questions on “angles in triangles” typically require visual representations. It has been reported that LLM performance varies depending on the mathematics subject area, with geometry topics posing greater challenges (Svičević et al., 2025; Wei, 2024). The study findings showed that there were very serious limitations in visual preparation in 2024, but with the 2026 data, this limitation decreased significantly for Claude. In the 2026 version, ChatGPT was able to add correct visual elements, although some were not fully comprehensible. Claude’s success in preparing visual geometry questions was particularly noteworthy. This situation recalls the reality that LLMs are creating a “digital ripple” that is changing the nature of education (Fan, 2025; Strielkowski et al., 2025; Walkington, 2025). However, all three LLMs designed questions without using visual elements for the fractions subject. Additionally, the success in exam preparation for the angles in triangles subject was lower than for the fractions subject in both years. The visual input and output capabilities of current LLMs may have limited the quality and originality of the generated questions. The fact that LLMs generally designed questions without visual elements in their 2026 versions as well supports this idea. In contrast, the topic of “fractions” is more abstract and relies on symbolic operations. This suggests that model performance is not independent of the characteristics of the topics being assessed and the currency of the LLM used. Therefore, the findings of this study cannot be generalised to mathematics topics with different content characteristics (e.g., algebra or probability), and LLM performance in creating exams for such topics may vary.

The performance of the three LLMs differed from one another across the two years. In the 2024 data, Gemini showed the highest performance in taxonomy-based question writing for angles in triangles, while Claude showed the highest performance for fractions. Lee et al. (2024) compared Gemini Pro with ChatGPT and concluded that Gemini was not very successful at understanding texts; its student assessment performance was limited and could be enhanced through the use of simple expressions and fragmented visuals. These results are consistent with the findings of the present study. In the 2026 data, Claude performed best in exam preparation for both fractions and angles in triangles. This indicates that, as AI advances rapidly, LLMs are developing at different rates. All three LLMs were successful in terms of target relevance and scientific relevance in the 2024 data, and the 2026 data yielded the same result. However, while Claude’s performance on the comprehensibility criterion was lower than that of the other LLMs in the 2024 data, by 2026 it had closed this gap and generated completely comprehensible questions. It is well established that questions containing unclear, linguistically complex, or ambiguous expressions negatively affect students’ cognitive performance, thereby weakening the validity of measurement results (Haladyna, 2004). These findings align with previous results indicating that the performance of LLMs varies depending on the task and that AI technologies are rapidly changing and developing (Hochmair et al., 2024). Studies comparing LLMs have identified varying strengths and limitations across models. For example, Rycroft-Smith and Macey (2024) compared the ability of three LLMs (Copilot, Claude, and ChatGPT 3.5) to write questions about the area of a rectangle. They found that ChatGPT offered more opportunities for multi-part questions and scoring; however, this also increased the risk of errors. In this sense, the performance differences between LLMs highlighted in previous research were also identified in this study.

Considering the types of questions designed by LLMs, in both the 2024 and 2026 data Claude offered richer content in terms of question variety than the other LLMs; overall, however, all three LLMs tended to prefer open-ended questions at the upper levels of the taxonomy. Tasks such as selecting logical distractors in multiple-choice questions, managing multidimensional matching processes, or avoiding ambiguous expressions in fill-in-the-blank questions while focusing on critical domain-specific information may pose difficulties for LLMs. For example, it is often argued that applying Bloom’s cognitive domains when creating multiple-choice questions results in items that measure higher-order thinking rather than the recall of factual information (Cecilio-Fernandes et al., 2018; Crowe et al., 2008; Monrad et al., 2021). At the same time, the process of preparing multiple-choice questions requires meticulous effort and is highly complex (Arıcan, 1996). In this sense, LLMs’ preference for open-ended questions may be related to the relative ease of the preparation process. Goodman et al. (2024) stated that ChatGPT is more successful in writing open-ended questions; the present findings show that this still holds for the upper-level questions designed in this study. ChatGPT used open-ended questions extensively in both of its versions. Although the other LLMs included different question types in 2026, they were more inclined to prepare open-ended questions at upper levels.

This study determined that LLMs have distinct advantages and limitations relative to one another. In the 2024 data, for example, the respective strengths were Gemini’s ability to enhance comprehensibility by providing appropriate links when content was insufficient, Claude’s relative strength in preparing diverse question types (open-ended, multiple-choice, fill-in-the-blank, and matching), and ChatGPT’s higher level of comprehensibility compared to the other LLMs. In the 2026 data, the limitations of ChatGPT and Gemini were similar to each other, whereas Claude no longer showed the limitations it had in 2024. ChatGPT and Gemini tended to produce short-text questions at upper levels and had limitations in creating questions containing visuals; Claude showed neither limitation. Its success in creating questions containing visuals was a further feature distinguishing Claude from the other LLMs. These advantages and limitations are likely outcomes of differences in the underlying infrastructures used by the LLMs. The change between the data collected in the two years is thought to be due to the evolution of the LLMs’ training datasets. In the period from 2024 to 2026, the models were likely trained not only with more mathematical problems but also with “chain-of-thought” data explaining problem-solving strategies, proof techniques, and multi-step reasoning processes. This is thought to have caused their limitations to decrease and their advantages to increase. This change, particularly in Claude, indicates that LLMs are progressing and developing at different rates. Considering that the capabilities of LLMs are improving rapidly (Koubaa et al., 2023), the results presented in this study should be regarded as reflecting the state of LLM capabilities at the time of data collection. This highlights the need for ongoing evaluation and validation as AI technologies continue to evolve.

Conclusion

The results are critically important for teachers and curriculum developers. A comparison of the 2024 and 2026 data showed that the LLMs developed very rapidly and that the limitations they had in 2024 were largely resolved. Significant progress was observed, particularly in Claude and ChatGPT, in creating mathematics questions containing visual elements. This suggests that LLMs may quickly overcome their other limitations in the coming period. However, the results have not changed the conclusion that teachers should view LLMs as time-saving assistive tools. LLMs can rapidly generate question pools for assessing lower-level cognitive skills; however, questions intended to measure higher-level cognitive skills must be reviewed, edited, and enriched by teachers to ensure pedagogical appropriateness, progressive difficulty, and the promotion of higher-order thinking. Similarly, curriculum designers can treat questions generated by LLMs as a starting point or a source of ideas rather than relying on them directly. In other words, teachers and curriculum developers can view these technologies as powerful assistants rather than as autonomous decision-makers.

LLMs should be further developed so that they can write questions that measure higher-level mathematical thinking processes more effectively with a single prompt. The percentage of questions prepared at the desired level increased with the use of second prompts. This finding shows that improved results can be achieved through targeted interventions, and performance may be improved further through the application of prompt engineering techniques.

Limitations and Future Studies

The study results are limited to data obtained from ChatGPT, Gemini, and Claude (free chat-based LLMs) during May–June 2024 and February 2026, using a specific prompt set and a limited number of questions on the middle school–level topics of “fractions” and “angles in triangles.” It should be noted that the study was conducted to capture the current state and development of LLM performance. In this context, future studies may examine the criteria used by AI language models when generating questions aligned with the RBT in newer model versions. In addition, future research may focus on identifying appropriate prompts for developing mathematical questions.

Footnotes

ORCID iDs

Zühal Gün Şahin

Zeynep Çavuş Erdem

Ethical Considerations

This study did not require institutional ethics committee approval, as it did not involve human or animal participants. However, commercial AI platforms (ChatGPT, Gemini, and Claude) were used in the study. Throughout the research process, all actions were conducted in accordance with the terms of service and data privacy policies of these platforms. Furthermore, all generated content was used solely for academic analysis purposes.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data supporting the conclusions of this paper will be made available by the authors upon reasonable request.

References

Aydoğdu, & Gültekin (2025). An investigation of LGS mathematics exam questions and 8th-grade mathematics curriculum outcomes based on the revised Bloom’s taxonomy. Western Anatolia Journal of Educational Sciences, 16(1), 619–649. https://doi.org/10.51460/baebd.1555329

Adeshola, & Adepoju, A. P. (2024). The opportunities and challenges of ChatGPT in education. Interactive Learning Environments, 32(10), 6159–6172. https://doi.org/10.1080/10494820.2023.2253858

Afacan, Ö., & Nuhoğlu (2008). A comparative analysis between TIMSS-R (1999) questions and LGS (1999) questions in life science. Ahi Evran University Journal of Kırşehir Education Faculty (KEFAD), 9(1), 31–43.

Akinboboye, J. T., & Ayanwale, M. A. (2021). Bloom taxonomy usage and psychometric analysis of classroom teacher made test. African Multidisciplinary Journal of Development, 10(1), 10–21. https://doi.org/10.4038/kjms.v3i2.25

Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). Taxonomy for learning, teaching and assessing: A revision of Bloom’s taxonomy of educational objectives. Allyn & Bacon.

Arıcan (1996). Examining and evaluation in mathematics education [Master's thesis, Marmara University].

Arslan, & Güler Selek, H. K. (2025). Artificial intelligence in mathematics education: ChatGPT’s capabilities and limitations in problem posing. International Journal of Education in Mathematics, Science and Technology (IJEMST), 13(5), 1189–1205. https://doi.org/10.46328/ijemst.5712

Bal, A. P., & Yılmaz (2022). An examination of the questions in secondary school mathematics textbooks according to the revised Bloom taxonomy. Bartın University Journal of Educational Research, 6(2), 94–107. https://izlik.org/JA83FU63DX

Bilgeç, İ. (2016). Inspection of mathematic exams with open ended questions from the perspective of measurement and evaluation [Master's thesis, Marmara University].

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Handbook I: Cognitive domain. New York: David McKay, 483–498.

Boylu (2019). Measurement assessment practices and standardization in teaching Turkish to foreigners [Unpublished doctoral thesis, Çanakkale Onsekiz Mart University].

Cecilio-Fernandes, Kerdijk, Bremers, A. J., Aalders, & Tio, R. A. (2018). Comparison of the level of cognitive processing between case-based items and non-case-based items on the Interuniversity Progress Test of Medicine in the Netherlands. Journal of Educational Evaluation for Health Professions, 15, Article 28. https://doi.org/10.3352/jeehp.2018.15.28

Chang, W. C., & Chung, M. S. (2009, December). Automatic applying Bloom’s taxonomy to classify and analysis the cognition level of English question items. In 2009 Joint Conferences on Pervasive Computing (JCPC) (pp. 727–734). IEEE.

Carroll, W. M. (1999). Using short questions to develop and assess reasoning. In Stiff, L. V., & Curcio, F. R. (Eds.), Developing mathematical reasoning in grades K-12 (pp. 247–255). NCTM.

Chen, & Lin (2020). AI in education: A review. IEEE Access, 8, 75264–75278. https://doi.org/10.1109/ACCESS.2020.2988510

Crowe, Dirks, & Wenderoth, M. P. (2008). Biology in bloom: Implementing Bloom’s taxonomy to enhance student learning in biology. CBE—Life Sciences Education, 7(4), 368–381. https://doi.org/10.1187/cbe.08-05-0024

Çavuş Erdem (2025). Matematik eğitiminde yapay zeka: Çalışmalar ne söylüyor? In İlgün, Ş., Altıntaş, & Erdoğan (Eds.), Matematik ve Fen - Teknoloji Eğitimi Alanında Yenilikçi Tartışmalar III (pp. 127–146). Vizetek Yayınları, Ankara. https://doi.org/10.54637/vizetek.9786253821791

Çavuş Erdem (2026). Artificial intelligence in education from the perspective of students: The example of mathematics classes. Journal of Learning and Teaching in Digital Age, (1), 134–143. https://doi.org/10.53850/joltida.1766039

Çelik Görgüt (2023). AI and mathematic education. In Kuloğlu (Ed.), Education-science-IV (pp. 43–60). Efe Akademi Yayınları.

Doğan (2019). Traditional measurement and evaluation techniques I: Measurement tools requiring response selection. In Doğan (Ed.), Measurement and evaluation in education (pp. 113–138). Pegem Akademi.

Fan (2025). The reconfiguration of human education in an uncertain world. ECNU Review of Education, 8(3), 609–637. https://doi.org/10.1177/20965311241266856

Friborg, & Rosenvinge, J. H. (2013). A comparison of open-ended and closed questions in the prediction of mental health. Quality & Quantity, 47, 1397–1411. https://doi.org/10.1007/s11135-011-9597-8

Frieder, Pinchetti, Griffiths, R. R., Salvatori, Lukasiewicz, Petersen, & Berner (2024). Mathematical capabilities of ChatGPT. Advances in Neural Information Processing Systems, 36. https://doi.org/10.48550/arXiv.2301.13867

Goodman, Handa, Wilson, R. E., & Bradbury, L. U. (2024). Promises and pitfalls: Using an AI chatbot as a tool in 5E lesson planning. Innovations in Science Teacher Education, 9(1), 1–13.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Lawrence Erlbaum.

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–334. https://doi.org/10.1207/S15324818AME1503_5

Haspekian, Kieran, Drijvers, Bråting, & Tabach (2024). Algebra education and digital resources: A long-distance relationship? In Pepin, Gueudet, & Choppin (Eds.), Handbook of digital resources in mathematics education (pp. 439–470). Springer.

Himmah, W. I., Nayazik, & Setyawan (2019, March). Revised Bloom’s taxonomy to analyze the final mathematics examination problems in Junior High School. Journal of Physics: Conference Series, 1188(1), Article 012028.

Hochmair, H. H., Juhász, & Kemp (2024). Correctness comparison of ChatGPT-4, Gemini, Claude-3, and Copilot for spatial tasks. Transactions in GIS, 28, 2219–2231. https://doi.org/10.1111/tgis.13233

Krathwohl, D. R. (2002). A revision of Bloom’s taxonomy: An overview. Theory Into Practice, 41(4), 212–264.

Kilpatrick, Swafford, & Findell (2001). Adding it up: Helping children learn mathematics. National Academy Press.

Kim, Y. R., Park, M. S., & Joung (2026). Exploring the integration of artificial intelligence in math education: Preservice teachers’ experiences and reflections on problem-posing activities with ChatGPT. School Science and Mathematics, 126(1), 9–23. https://doi.org/10.1111/ssm.18336

Korkmaz Güler, & Yıldız (2025). Effectiveness and guidance of artificial intelligence chatbots in diagnosing exponential expression errors. The Journal of Educational Research, 118(6), 547–567. https://doi.org/10.1080/00220671.2025.2510387

Koubaa, Boulila, Ghouti, Alzahem, & Latif (2023). Exploring ChatGPT capabilities and limitations: A survey. IEEE Access, 11, 118698–118721. https://doi.org/10.1109/ACCESS.2023.3326474

Kyambade, Namatovu, & Ssentumbwe, A. M. (2025). Exploring the evolution of artificial intelligence in education: From AI-guided learning to learner-personalized paradigms. Cogent Education, 12(1), Article 2505297. https://doi.org/10.1080/2331186X.2025.2505297

Lee, S. G., Park, Lee, J. Y., Lim, D. S., & Lee, J. H. (2024). Use of ChatGPT in college mathematics education. Mathematical Education, 63(2), 123–138. https://doi.org/10.7468/mathedu.2024.63.2.123

Linn, R. L., & Gronlund (2000). Measurement and assessment in teaching (8th ed.). Columbus, OH: Charles E. Merrill.

C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), Article 410. https://doi.org/10.3390/educsci13040410

Long, & Magerko (2020, April). What is AI literacy? Competencies and design considerations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–16). Association for Computing Machinery. https://doi.org/10.1145/3313831.3376727

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Sage Publications.

Monrad, S. U., Bibler Zaidi, N. L., Grob, K. L., Kurtz, J. B., Tai, A. W., Hortsch, Gruppen, L. D., & Santen, S. A. (2021). What faculty write versus what students see? Perspectives on multiple-choice questions using Bloom’s taxonomy. Medical Teacher, 43(5), 575–582. https://doi.org/10.1080/0142159X.2021.1879376

Neubrand (2018). Conceptualizations of professional knowledge for teachers of mathematics. ZDM, 50(4), 601–612. https://doi.org/10.1007/s11858-017-0906-0

Ngo, Gupta, Perrine, Reddy, Ershadi, & Remick (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Academic Pathology, 11(1), Article 100099. https://doi.org/10.1016/j.acpath.2023.100099

Niss (2003, January). Mathematical competencies and the learning of mathematics: The Danish KOM project. In 3rd Mediterranean Conference on Mathematical Education (pp. 115–124). Hellenic Mathematical Society.

Nortvedt, G. A., & Buchholtz (2018). Assessment in mathematics education: Responding to issues regarding methodology, policy, and equity. ZDM, 50(4), 555–570. https://doi.org/10.1007/s11858-018-0963-z

Yoon, Chung, Cho, Shim, & Kwon, O. N. (2024). Analysis of generative AI’s mathematical problem-solving performance: Focusing on ChatGPT 4, Claude 3 Opus, and Gemini Advanced. Mathematical Education, 63(3), 549–571. https://doi.org/10.7468/mathedu.2024.63.3.549

Omar, Haris, S. S., Hassan, Arshad, Rahmat, Zainal, N. F. A., & Zulkifli (2012). Automated analysis of exam questions according to Bloom’s taxonomy. Procedia-Social and Behavioral Sciences, 59, 297–303. https://doi.org/10.1016/j.sbspro.2012.09.278

O’Neil, H. F., Jr., & Brown, R. S. (1998). Differential effects of question formats in math assessment on metacognition and affect. Applied Measurement in Education, 11(4), 331–351. https://doi.org/10.1207/s15324818ame1104_3

Opara, Mfon-Ette Theresa, & Aduke, T. C. (2023). ChatGPT for teaching, learning and research: Prospects and challenges. Global Academic Journal of Humanities and Social Sciences, 5(2), 33–40. https://ssrn.com/abstract=4375470

Ouyang, & Jiao (2021). Artificial intelligence in education: The three paradigms. Computers and Education: Artificial Intelligence, 2, Article 100020. https://doi.org/10.1016/j.caeai.2021.100020

Özyalçın, K. E., & Kana (2020). An evaluation on the skills of writing sub-text questions of teachers of Turkish as a foreign language. Çukurova University Journal of Turkology Research, 5(2), 488–506. https://doi.org/10.32321/cutad.797711

Plass, J. L., Chun, D. M., Mayer, R. E., & Leutner (2003). Cognitive load in reading a foreign language text with multimedia aids and the influence of verbal and spatial abilities. Computers in Human Behavior, 19(2), 221–243. https://doi.org/10.1016/S0747-5632(02)00015-8

Rane (2024). Enhancing the quality of teaching and learning through Gemini, ChatGPT, and similar generative AI: Challenges, future prospects, and ethical considerations in education. TESOL and Technology Studies, 5(1), 1–6. https://doi.org/10.48185/tts.v5i1.1000

Rospigliosi, P. ‘asher’. (2023). AI in teaching and learning: What questions should we ask of ChatGPT? Interactive Learning Environments, 31(1), 1–3. https://doi.org/10.1080/10494820.2023.2180191

Rycroft-Smith, & Macey (2024). Using AI for question generation in mathematics education: What are the advantages and disadvantages? In Fujita (Ed.), Proceedings of the British Society for Research into Learning Mathematics (BSRLM), 44(1). https://bsrlm.org.uk/publications/proceedings-of-day-conference/ip44-1/

Seaman (2011). Bloom’s taxonomy. Curriculum & Teaching Dialogue, 13(1), 29–43.

Segal, & Klemer (2025). Dialogic interactions between mathematics teachers and GenAI: Multi-environment task design and its contribution to TPACK. International Journal of Mathematical Education in Science and Technology, 1–25. https://doi.org/10.1080/0020739X.2025.2551363

Spreitzer, Straser, Zehetmeier, & Maaß (2024). Mathematical modelling abilities of artificial intelligence tools: The case of ChatGPT. Education Sciences, 14(7), Article 698. https://doi.org/10.3390/educsci14070698

Stein, M. K., Smith, M. S., Henningsen, M. A., & Silver, E. A. (2009). Implementing standards-based mathematics instruction: A casebook for professional development. Teachers College Press.

Strielkowski, Grebennikova, Lisovskiy, Rakhimova, & Vasileva (2025). AI-driven adaptive learning for sustainable educational transformation. Sustainable Development, 33(2), 1921–1947. https://doi.org/10.1002/sd.3221

Swart, A. J. (2010). Evaluation of final examination papers in engineering: A case study using Bloom’s Taxonomy. IEEE Transactions on Education, 53(2), 257–264. https://doi.org/10.1109/TE.2009.2014221

Sweller (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12, 257–285.

Svičević, Milenković, Vučićević, & Stanković (2025). Evaluating the success of AI tools in supporting student performance in Mathematical Kangaroo competition. Computer Applications in Engineering Education, 33(4), Article e70063. https://doi.org/10.1002/cae.70063

Trgalová, & Tabach (2024). Introduction to how digital resources transform content. In Pepin, Gueudet, & Choppin (Eds.), Handbook of digital resources in mathematics education (pp. 313–323). Springer.

Turmuzi, Azmi, & Kertiyani, N. M. I. (2026). ChatGPT in school mathematics education: A systematic review of opportunities, challenges, and pedagogical implications. Teaching and Teacher Education, 170, Article 105286. https://doi.org/10.1016/j.tate.2025.105286

Urhan, Gençaslan, & Dost, Ş. (2024). An argumentation experience regarding concepts of calculus with ChatGPT. Interactive Learning Environments, 32(10), 7186–7211. https://doi.org/10.1080/10494820.2024.2308093

Üredi, & Ulum (2020). Evaluation of unit evaluation questions in primary school mathematics course books according to the revised Bloom’s taxonomy. Mersin University Journal of the Faculty of Education, 16(2), 432–447. https://doi.org/10.17860/mersinefd.693392

Walkington (2025). The implications of generative artificial intelligence for mathematics education. School Science and Mathematics, 1–10. https://doi.org/10.1111/ssm.18356

Wardat, Tashtoush, M. A., AlAli, & Jarrah, A. M. (2023). ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education, 19(7), Article em2286. https://doi.org/10.29333/ejmste/13272

Wei (2024). Evaluating ChatGPT-4 and ChatGPT-4o: Performance insights from NAEP mathematics problem solving. Frontiers in Education, 9, Article 1452570. https://doi.org/10.3389/feduc.2024.1452570

Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35.

Zeren, Köksal, Kozan, Şahinoğlu, Kalkancı, & Ataman (2023). Investigation of the techniques used by teachers in measurement-evaluation processes in secondary schools. Socrates Journal of Interdisciplinary Social Studies, 9(34), 153–163. https://doi.org/10.5281/zenodo.10022353