A question of quality: Conceptualisations of quality in the context of educational test questions

Abstract

In educational contexts, questioning performs a number of functions. These include facilitating learning in the classroom and the recognition of achievement through examinations and other assessments. Good quality questions are important to ensuring that these functions are achieved. This research focused on educational exams and used views from question writers to explore conceptualisations of question quality and features thought to affect question quality. Seven examination question writers from four subjects were shown some example exam questions. Participants were asked to comment on the quality of the question, reflect on performance data, rate question quality and comment more generally on how they define quality in question writing. Three conceptualisations of question quality emerged, two relating to aspects of validity and one relating to pedagogical concerns. Participants varied in which definition dominated their views. Discussions also identified question features thought to affect quality. These are similar to features previously identified as affecting difficulty and fairness by studies analysing student performance.

Keywords

Question quality, tests, examinations, question writing, General Certificate of Secondary Education

Introduction

Questions and their role in the questioning process are an important element of learning discourse (Ernst-Slavit and Pratt, 2017; Heritage and Heritage, 2013; Mehan, 1979; Sinclair and Coulthard, 1975). The power of questions resides in their ability to elicit a response, i.e. they implicate a respondent into returning information to a teacher (or an assessor) from which a decision can be made about the next steps for learning. This is because the questioning process, which may be verbal or written, takes place in an educational context where ‘adjacency pair organisation’ (Schegloff, 2007; Schegloff and Sacks, 1973) dictates that a question raised by a teacher or an assessor necessitates a learner response. This responsibility is an element of the accountability that is inherent to social interaction (Buttny, 1993; Stivers and Rossano, 2010).

Beyond the use of verbal questioning in the classroom, another context where question use is ubiquitous is in classroom and school year-end tests. Questions in formal, high stakes assessments are important to achieving fair assessment of students’ achievements, with results influencing future study and career options. The nature of such assessments also has an influence on classroom practice. High quality questioning in these contexts is important to ensuring that the intended purposes of these questions are fulfilled. But what exactly is ‘quality’ in the context of classroom test and exam questioning? It is important to have a good grasp of what makes good quality questions and, indeed, what is meant by ‘quality’ in the context of question design, such that those involved in questioning students can do so appropriately. This research explored these themes in the context of high stakes examinations, though issues may be similar for written classroom tests.

Test questions are particular kinds of texts, perhaps even constituting specific genres. It has been shown that different kinds of vocabulary are associated with evaluations of high quality writing in different genres (Olinghouse and Wilson, 2013). Thus, it would be reasonable to suggest that different types of vocabulary and other features of text might be associated with quality in the ‘genre’ of educational questions. Interestingly, Olinghouse and Wilson (2013) found that ‘content vocabulary’ (i.e. vocabulary specific to a particular topic) was the most important vocabulary-related predictor of text quality in informative texts (perhaps the closest genre to exam and test questions included in their study). Past work on how features of exam questions affect students has shown how difficult vocabulary, technical terminology and subject-specific meanings can affect the difficulty of questions (Crisp et al., 2012; Pollitt et al., 1985; Ramseier, 1999; Shuard and Rothery, 1984). This suggests that difficult vocabulary should only be used if it is within the syllabus content for the assessment as otherwise it might unfairly compromise student performance. This may be less essential in informal classroom questioning or activities where the teacher can clarify meanings if needed. However, an exam question writer’s (or setter’s¹) choice of vocabulary and specifically ‘technical terms’ may well affect question quality.

Validity is a key principle in assessment, yet a concept that is much debated in terms of exact definition and scope. A prominent definition comes from a seminal book chapter by Messick (1989: 13): ‘Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment.’

According to this definition, validity relates to whether the meanings we assign to assessment results are appropriate and justifiable. Thus, all features and elements of the assessment itself, the procedures used (e.g. for marking), their implementation and guidance on what results mean can influence validity. Validity is most often discussed in the context of formal, high stakes assessments where results have considerable implications but it is also important that evaluations of students based on verbal questioning and classroom tests are appropriately interpreted so as to support further learning. In relation to the design of test questions, the main concern in order for them to contribute to validity is that the questions trigger student performance showing their relevant knowledge, understanding and skills and that marks/ratings/grades can fairly be assigned to reward these. This is one element of the full conceptualisation of validity and could be labelled ‘validity at the question level’. Notions of validity along these lines have been used in past research studies exploring how well exam questions work (e.g. Ahmed and Pollitt, 2007; Crisp, 2011; Crisp et al., 2008) and are similar to older definitions of validity as ‘whether a test really measures what it purports to measure’ (Kelley, 1927). Such definitions are narrower than Messick’s definition but consistent with it as it would be unlikely that any intended use of assessment results would be appropriate if the questions or tasks did not prompt performances relevant to the knowledge, understanding and skills of interest. In discussing the concept of validity in relation to assessment design, Ahmed and Pollitt (2011) propose that ‘the test constructors’ task is to ensure that the questions and mark schemes that they write deliver scores for students that show as accurately as possible how much and how well they have learned. To the extent that the assessment fails to do this, the potential for interpreting the results validly will be threatened’ (Ahmed and Pollitt, 2011: 260). Or, put another way, ‘valid score inferences are not likely to be feasible in the absence of a well-designed assessment’ (Bejar, 2011: 327).

Validity at the question level would seem to be a reasonable way to conceptualise ‘quality’ in questions and question writing. ‘Fairness’ is a related useful term; one would expect good quality questions to allow candidates a fair opportunity to attempt relevant tasks. One might expect the wording and content of tasks to influence their fairness to candidates, and thus validity (at question level). It is of interest to see whether these themes match the conceptualisations of question quality held by those involved in question writing.

A number of studies, some taking validity as a focus, have investigated how students answer exam questions and how the features of questions affect their difficulty and fairness. Some such research has used statistical analysis to look at questions that did not function well and explored why this might be by analysing the kinds of responses given (Crisp, 2011; Fisher-Hoch et al., 1997; Pollitt et al., 1985). Other studies have used different versions of the same task with one or more features changed between the versions (e.g. slightly different wording, a different diagram) and trialled with similar groups of students to investigate any effect of the difference(s) between versions of the same question (Ahmed and Pollitt, 2007; Crisp and Sweiry, 2006; Crisp et al., 2008; Fan et al., 1994). A generally psychological perspective has been used (Pollitt and Ahmed, 1999), focusing on the interaction between the student and the question, such as whether they understand the task as intended and how they attempt to come to a response. Thus, much of this work has used student score data or written responses as the main data, with the researchers inferring what the analysis reveals about the quality of the questions.

Research of this kind has identified a range of question features as affecting validity (at the question level). For example, setting questions in context is a useful mechanism for allowing the assessment of application skills but if a context is novel, complicated or ‘at odds’ with the focus of the task then it can hinder some students for reasons other than subject knowledge and skills (Ahmed and Pollitt, 2007). It has been found that the context used can influence students’ selection of the appropriate operation to complete the task (Fan et al., 1994; Nickson and Green, 1996) and that students from lower socio-economic backgrounds can be disadvantaged by realistic contexts (Cooper and Dunne, 1998, 2000). Language use has also been a theme in such research with concerns that linguistic knowledge and skills may affect performance even where there is little intention to test language (e.g. Mayer et al., 1984) and that ambiguity in question wording affects the difficulty of questions (Fisher-Hoch et al., 1997). The nature and effects of visual resources used in exam papers have been explored (Crisp and Sweiry, 2006). In addition, features such as the ordering and spacing of questions, multiple steps in a task and the need to recall an appropriate strategy have been found to affect difficulty (Fisher-Hoch and Hughes, 1996). Such research has provided many insights and points for guidance on question writing regarding the features of questions that affect difficulty and affect the contribution that a question makes to validity. These features may well affect ‘quality’ too, depending on how quality is conceptualised. Whilst this research has been conducted in the context of exam questions, findings may also be relevant to classroom tests.

Most research studies relating to the features of questions and how questions function have focused on the student and/or the exam paper. Little work on this theme has made use of the expertise of the question writers who create the questions. Exam setters are usually current or retired teachers and thus bring their experience to date of classroom questioning to the role. Through examination work they gain substantial experience of writing exam questions and seeing how students respond to them (as usually they are also involved in marking) and this informs their understanding of how features of questions affect how well the questions measure the knowledge, understanding and skills of interest. Whilst their experiences are likely to inform their own future practices and perhaps those of their colleagues, their expertise has been little used in research as a resource from which to unpick question quality. These people are the professionals in this context, and to a large extent the exam questions that they create appear to be successful at measuring relevant candidate achievement. It would seem to be worth making use of their knowledge and experience to explore concepts of question quality and features thought to make good questions. It appears that only one study (Spalding, 2011) has, to date, used question writers’ views to gather insights on good practice in question writing. By viewing example exam papers, Spalding identified features to discuss with participants. She interviewed two question writers individually and five further question writers took part in a short focus group. Some example General Certificate of Secondary Education (GCSE) exam papers were used as reference during discussions. The findings included views that Arial font is easier to read than Times New Roman; the number of lines of response space for a question should be decided on an item-by-item basis; stimuli in questions need to be interesting, relevant to the question and different to other stimuli on the paper; and written information should be provided where it is needed in a staggered way rather than all at once.

The current study made use of question writers’ views and thus has some similarities with Spalding (2011). However, it varies in that themes were allowed to arise in a bottom-up, rather than top-down, manner (i.e. in the current study the researchers did not identify in advance specific features of questions to ask participants about, but allowed themes to arise naturally by showing the participants some example exam questions and encouraging them to comment on the quality of the questions). In addition, the current study used a selection of specific example questions to prompt evaluation and discussion of question quality rather than asking for general opinions on certain kinds of features of questions.

The current research set out to explore the following research questions:

How do exam question writers conceptualise question quality?

What features of exam questions do question writers feel affect question quality?

Since most relevant research to date has explored the features affecting question difficulty and validity (at the question level) using analysis of student performance, it is of interest to see whether question writers’ views align with such findings, and how conceptualisations of quality relate to validity.

Method

The research participants were seven question setters who had been involved in writing exam papers for several years for either GCSE or International General Certificate of Secondary Education (IGCSE) assessments.² The participants’ areas of expertise were geography (N = 2), maths (N = 3, split across two different syllabuses), biology (N = 1) and physics (N = 1).

For the syllabuses that the setters were usually involved with, eight example questions (or nine in the case of physics) from recent past exam papers were selected in advance for use in this activity. Questions were selected from more than one paper from the syllabus, including from the paper for which the setter usually wrote. Some of the selected questions involved several subparts. Analysis of item level data³ was used when selecting items. Where possible a few items with low correlations (<0.2) between scores on the item and scores on the rest of the test were selected. Low correlations suggest that performance on the item is less well related to performance on the rest of the test, which could indicate a problematic item. Thus, it was anticipated that including a few such questions in the selection for a subject could lead to revealing discussions with the participants. It is acknowledged that this was not a representative sample of questions, due to the small number of questions for each syllabus and the selection method. The questions were labelled Question ID (QID) 1–8 (or 9) for each syllabus. In reporting the results, the maths syllabuses are differentiated as ‘maths syllabus 1’ and ‘maths syllabus 2’ whilst retaining the QID 1–8 labelling for each.
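As an illustration of the selection criterion described above, the correlation between scores on an item and scores on the rest of the test could be computed along the following lines. This is a minimal sketch rather than the analysis actually used in the study: the function names, the data layout (one list of item marks per candidate) and the choice of a Pearson correlation are assumptions made for illustration; only the 0.2 cut-off comes from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(scores):
    """scores: one row per candidate, one mark per item (hypothetical layout).

    For each item, correlate the item marks with each candidate's total
    on all the OTHER items, so the item is not correlated with itself.
    """
    n_items = len(scores[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        correlations.append(pearson(item, rest))
    return correlations


def flag_low_items(scores, threshold=0.2):
    """Indices of items whose item-rest correlation is below the threshold."""
    return [i for i, r in enumerate(item_rest_correlations(scores)) if r < threshold]
```

In this sketch, an item on which every candidate scores the same mark has an undefined correlation and is not flagged.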

Individual data collection sessions were held with each participant, with sessions being audio-recorded and researcher notes made. The activity began with asking the participant how they would define quality in question writing. After this, they were presented with the selected questions one at a time (along with the relevant extract from the mark scheme and any additional resources). For each question, there were a number of stages to the method:

First, participants were invited to comment on the quality of the question, to categorise the question as ‘poor quality’, ‘medium quality’ or ‘high quality’ and to justify their view. (The participants were given the option to identify a question as ‘poor/medium’, or ‘medium/high’ if they appeared to struggle with a decision.)

Second, participants were asked whether they had been involved in the drafting, editing or marking of the paper and, if so, any recollections they had about the question from those experiences.

Next, the participants were presented with item level data analysis for the question (including facility values,⁴ correlations between marks on the specific item(s) and overall marks on the rest of the items, omit rates, item characteristic curves showing the facility values for each quartile of candidates, graphs showing the percentage of candidates scoring each mark). They were asked for any reactions to this information, such as whether the data on student performance were surprising or as they would have expected.

The relevant sections of the senior examiner’s report on how the students performed on the test were then presented and again the participants were encouraged to reflect on this information.

The setters were then asked whether, in hindsight, they would change the question and encouraged to explain possible revisions.
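The item level statistics shown to participants in the third stage can be sketched as follows. This is an illustrative sketch only, not the study's own analysis. It assumes that a facility value is the mean mark expressed as a proportion of the maximum mark, that an omitted response is recorded as None and counted as zero marks, and that candidates are split into quartiles by their total on the rest of the test; all function names are hypothetical.

```python
def _mark(m):
    """Treat an omitted response (None) as zero marks (an assumption)."""
    return 0 if m is None else m


def facility(marks, max_mark):
    """Mean mark on the item as a proportion of the maximum mark."""
    return sum(_mark(m) for m in marks) / (len(marks) * max_mark)


def omit_rate(marks):
    """Proportion of candidates who gave no response to the item."""
    return sum(1 for m in marks if m is None) / len(marks)


def mark_distribution(marks, max_mark):
    """Percentage of candidates scoring each mark from 0 to max_mark."""
    counts = {m: 0 for m in range(max_mark + 1)}
    for m in marks:
        counts[_mark(m)] += 1
    return {m: 100 * c / len(marks) for m, c in counts.items()}


def facility_by_quartile(marks, rest_totals, max_mark):
    """Facility for each quartile of candidates, lowest quartile first.

    Candidates are ranked by their total on the rest of the test
    (an assumption about how quartiles are formed).
    """
    n = len(marks)
    if n < 4:
        raise ValueError("need at least 4 candidates to form quartiles")
    order = sorted(range(n), key=lambda i: rest_totals[i])
    values = []
    for q in range(4):
        group = order[q * n // 4:(q + 1) * n // 4]
        values.append(facility([marks[i] for i in group], max_mark))
    return values
```

Plotting the four quartile values against quartile number gives a simple curve of the kind described: for a well-functioning item the values would be expected to rise from the lowest to the highest quartile.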

After each participant had considered all of the questions, the researcher laid out the questions in front of the participant and asked them to rate the questions on a scale of 1 (low quality) to 5 (high quality), giving reasons for their ratings.

The task lasted approximately one and a half to two hours depending on the participant. The geography questions tended to take longer to discuss than maths and science questions and consequently only five of the questions were considered by one geography setter and six by the other. All eight or nine questions were considered for the other syllabuses.

The responses of the participants were analysed using thematic analysis (Braun and Clarke, 2006).

Results

Conceptualisations of ‘quality’

Three main themes emerged when the question writers were asked how they would define ‘quality’ in question writing. The key theme expressed by most setters was around testing the intended knowledge, understanding and skills, clarity around what is required, lack of ambiguity (students understand the question and what they are being asked to do), allowing students to perform as well as they can and fair assessment of what has been taught. Various additional details were mentioned that relate to this theme:

a degree of simplicity

clarity in expression and instruction

simple phrasing with language not causing a barrier to understanding the task

appropriate use of technical terms (explained if not in syllabus)

appropriate and consistent use of command words (e.g. describe, explain, state)

no ‘twists’

students behaving as expected on the question (no unexpected problems or misinterpretations)

the question does what you wanted it to do

nothing unnecessary/nothing to distract from the objective of the question

resources are clear and accessible, are used in the question (not just decorative) and do not include features that could distract from the intended focus

if set in context, this should be familiar

This theme relates to validity in terms of ensuring that the intended knowledge, understanding and skills are being tested.

A secondary theme mentioned by several setters was that of differentiation. It was noted that good questions should be accessible to all, but differentiate between students who are better and weaker in the subject. Within a whole question there should be parts targeting different grades, with difficulty usually increasing through a whole question such that students of different ability gradually ‘drop off’ as the question goes on. One setter commented that over a whole paper there should be a nice balance of routine and challenging questions. It was sometimes noted during discussions of the item level data that questions where the majority of candidates score either 0 or the maximum mark are not ideal because not all of the available marks are helping to spread the candidates out. Of the questions selected for the past paper tasks, this issue arose mostly in maths. One of the maths setters commented that it can be hard to avoid this problem with some kinds of questions because students tend to either know how to do the task and complete it fully, or not know how to start.
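The pattern noted by setters, where most candidates score either 0 or the maximum mark on a multi-mark item, could be checked with a few lines of code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and reading ‘the majority’ as more than half of candidates is an interpretation rather than a rule given by the participants.

```python
def share_at_extremes(marks, max_mark):
    """Proportion of candidates scoring either 0 or the maximum mark."""
    if max_mark < 2:
        raise ValueError("only meaningful for items worth 2 or more marks")
    if not marks:
        raise ValueError("no marks supplied")
    return sum(1 for m in marks if m == 0 or m == max_mark) / len(marks)


def is_all_or_nothing(marks, max_mark, threshold=0.5):
    """True if more than `threshold` of candidates score 0 or full marks."""
    return share_at_extremes(marks, max_mark) > threshold
```

For example, with marks of 0, 0, 3, 3, 3, 1, 2 and 0 on a three-mark item, six of the eight candidates score 0 or 3, so the item would be flagged as making little use of its intermediate marks.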

For two setters, a further important theme was that good questions require students to go beyond simple recall and understanding. The physics participant described this as ‘real physics’ and this concept was dominant in his conceptualisation of quality in questions. When asked, this participant did recognise the need for knowledge and understanding questions but seemed reluctant to consider that such questions could be ‘good’. His conceptualisation of quality in exam questions appeared to mostly relate to going beyond recall and understanding. This may be because he reportedly sees physics as a conceptual subject with abstract concepts introduced relatively early on. This concept of ‘real physics’ was evident in his comments throughout the discussion of example past paper questions. Another setter (maths) said that they like questions that make students ‘think about it’ – this alludes again to questions that go beyond straightforward knowledge and understanding or applying a standard calculation, and require skills such as problem solving. However, for this setter this was a secondary part of their notion of question quality.

Specific features thought to affect question quality

As described earlier, for each past paper exam question the setters were prompted to consider its quality and to reflect on the item level data analysis and the examiner report. They were then asked whether they would change the question in hindsight and, finally, to rate each question from 1 (low quality) to 5 (high quality).

In an initial stage of the thematic analysis for these data, setters’ comments on each question were summarised and tabulated. An example is shown in Appendix 1. From these comments across all questions involved, features thought to affect the quality of questions could then be drawn out.

The features of good quality questions identified by the participants were as follows:

There should be a logical flow through the question with a logical structure/sequence through parts of the task. One participant expressed this in terms of ‘leading a child down a particular route that allows them to answer the question as best they can’.

Any resources, visuals and/or graphs should be clear. Resources should appropriately match the question and support the direction in which the students’ thinking needs to go. If the graph type used is unusual it should be appropriately explained and exemplified. Information in a resource should not conflict with information in the text.

Setting a question in context can usefully make a question less abstract and more ‘friendly’. The context should be realistic.

Questions should be clear and unambiguous, which may sometimes mean using more words. Clarity is particularly important for complex concepts. Difficult vocabulary that might cause confusion or lead to different interpretations of the question should be avoided. Complex sentence structures should be avoided (particularly if students are likely to not have English as their first language). It should be clear what the students are intended to do and what kinds of answers are expected (through use of appropriate command word and instruction).

The information provided should be appropriate and clear. Information that might limit the scope of students’ answers should be avoided. Enough information should be provided so that students do not go ‘off at a tangent’. Unnecessary introductory text that does not affect the question should be avoided. Information should be provided as it is needed with each question part rather than all at the start. Information should be up to date.

The mark scheme should be clear and provide exemplification. It should also be ‘friendly’ for the examiners.

The number of marks allocated to a question and the marking instructions should be appropriate. The mark scheme should allow follow-through marking⁵ where appropriate or the need for follow-through marking should be avoided by changing the question.

Simple response strategies should be provided where appropriate. For example, if the possible responses are provided by the question, students should be asked to circle or tick their answer rather than rewrite it.

There should be a variety of types of task in the parts of a question/question parts should test a range of skills.

The layout and spacing should be appropriate, with enough space for working and line breaks to help with reading and understanding.

Questions should be accessible, so that all can attempt them (but also challenging). Scaffolding should be used where needed to make the question accessible.

Bold text should be used to emphasise a key word.

The numbers in calculations should be ‘friendly’.

Questions should not include unnecessary demands that are not the main focus of what the question is intended to test.

Some of the most common features are exemplified below with reference to specific questions.

Resources

The clarity of visual resources (e.g. diagrams, graphs, etc.) was noted as important to the quality of the questions in several cases. Any such resources need to be clear, understandable, provide information that does not conflict with that in the text and not give away the answer to the question. The salience of visuals and their potential effect in exam questions has previously been noted and explored (e.g. Crisp and Sweiry, 2006).

An example of a question in the current study for which one of the setters felt the resource affected question quality came from maths syllabus 2 (QID2). This question showed drawings of four containers with their volumes underneath in different measurements and students were asked to put them in order by size (see Figure 1; see also Appendix 1 for tabulated analysis notes on this question). One of the setters commented that the diagrams might draw students’ attention and mislead them as the shapes do not appear to reflect the volumes (an issue potentially exacerbated by the instruction to order by ‘size’ not ‘volume’). However, it was also noted that drawing the shapes in proportion to the volume could give away the answer. Therefore, this setter felt that the question would have been better without the diagram. The other maths syllabus 2 setter involved in the research did not comment on the resource as a potential problem.

Figure 1. Maths syllabus 2 QID2.

The item level data for the question showed that success on this task was fairly low (facility value = 0.23) with over 50% of candidates scoring no marks and less than 5% scoring both marks. Also the correlation between scores on the item and scores on the rest of the test was somewhat low (R_rest = 0.21). This could suggest that the resource did mislead some candidates, that some guessing occurred or simply that some of the volume measures used were not well understood by candidates. For example, the Examiner Report suggests that students were unfamiliar with how cubic centimetres relate to litres, centilitres and millilitres as container ‘C’ was most often in the wrong position in student responses.

Context

Appropriate use of context was noted as influencing question quality. Participants thought that context makes questions less abstract and more ‘friendly’ to candidates, though it was also commented that ideally contexts should be realistic.

Physics QID2 (see Figure 2) provides an example where the question setter involved in the research felt that additional context could have made a question (part b) less abstract. The question deals with an abstract concept relating to cooling and candidate performance was very weak (facility value = 0.02). The setter suggested that adding more context would make the question less abstract, which would be helpful for weaker candidates. For example, the question could give a scenario where two students are each trying to cool some hot liquid and one adds 100 g of ice whilst the other adds 100 g of water (both at 0 °C).

Figure 2. Physics QID2 (part b).

One of the maths past paper questions was noted by the setter to have a slightly unrealistic context. Maths syllabus 1 QID1 (see Figure 3) was a question about probabilities which used as its context the probability of two people completing a triathlon. The setter commented that most competitors in a triathlon will complete the race and thus the context seems ‘odd’. He noted that he would change the context, perhaps to a focus on the probability of completing the race within a certain time (though he acknowledged that this would add extra detail to the question). The correlation between item marks and marks on the rest of the paper was low for part (ai) (R_rest = 0.12) but good for parts (aii) and (aiii). The setter suggested that the low correlation for (ai) could be because this item required a textual explanation rather than just a mathematical answer: if students are less well taught in how to explain an answer, the quality of their responses may be less related to their performance on the rest of the paper. It is difficult to know whether the slightly unrealistic context caused problems for candidates, or to know how the proposed change to the context would have affected the functioning of the question without further research.

Figure 3. Maths syllabus 1 QID1.

The role and effects of context in exam questions have previously been noted and explored (e.g. Ahmed and Pollitt, 2007). Context is thought to be useful to test certain types of skills (e.g. application) and to make abstract ideas more concrete and hence increase accessibility of questions. However, it has also been found that if a context is too complicated or unfamiliar then this can confuse candidates and prevent them from having a fair opportunity to attempt the question.

Clarity of task

A key theme arising in setter comments about past paper questions was clarity and avoiding ambiguity. It was noted that it should be clear and unambiguous what questions are asking, what kind of response is expected and that the language should support this. It was also felt that excessive reading demands should be avoided (unless conciseness compromises understanding); difficult vocabulary should be avoided unless a term is a learning requirement; and care should be taken over sentence structure, particularly for international exams where English is unlikely to be the first language of many candidates.

A relevant example comes from Geography QID2 (see Figure 4). Part (c) asks candidates to suggest two things which could be done to rebuild a community after a disaster. One of the Geography research participants reported that he was not sure what the question ‘was getting at’ until he looked at the mark scheme. The kinds of responses rewarded in the mark scheme were around immediate responses to the disaster such as emergency accommodation and supplies, and rebuilding services such as hospitals. He had expected appropriate responses to be mostly around rebuilding buildings (e.g. homes) and commented that the phrase ‘rebuild a community’ can be interpreted in more than one way and that he would have got the answer wrong.

Figure 4. Geography QID2 Part (c).

Another example is Biology QID6, part (a), which asked candidates to list two features that distinguish bacteria from other groups of organisms (see Figure 5). The biology participant commented that in hindsight wording this question as ‘List two structural features ….’ would have helped clarify to candidates the kind of responses that were expected.

Figure 5.

Biology QID6 Part (a).

Previous advice and research on question writing has noted the importance of clear wording and avoiding ambiguity (e.g. Fisher-Hoch et al., 1997; Pollitt et al., 1985) since misinterpretations will lead to candidates attempting the wrong task.

Appropriate information

In several cases, setters mentioned how the information provided in the question can affect its quality. As described earlier, it was felt that enough information should be provided such that students cannot go ‘off at a tangent’, that the information provided should not limit the scope of students’ answers, that unnecessary introductory text should be avoided, that information should be provided with the question part for which it is needed (rather than necessarily at the start of the question) and that information should be clear and up to date.

For example, in relation to one of the geography questions, one setter commented that some students went ‘off at a tangent’ because they assumed, incorrectly, that Indonesia was a More Economically Developed Country (MEDC). This led to further incorrect assumptions and loss of marks. The setter noted that providing information on which of the countries in the resource were Less Economically Developed Countries and MEDCs would have avoided this issue.

Another example is provided by Physics QID3 (Figure 6). This question was set in the context of a car in motion and dealt with concepts of driving force, speed, time and distance. The physics research participant commented that the speed of the vehicle is provided in part (a) but is not needed until part (b). Part (a) asks students for the value of the driving force. Given that the information states that the speed is constant, the value of the driving force is the same as the resistive force stated in the diagram. The actual speed is not needed. The participant suggested that leaving the speed out of the introduction to part (a) would have led to fewer students assuming they should use this value in some way. (It is possible that the original setter deliberately included the speed as a distractor to test how well students understood that the driving force and resistive force are equal if speed is constant. Arguably, if students are secure in their knowledge then they would not be distracted into using the speed when attempting to answer this question part.)

Figure 6.

Physics QID3 Part (a).

Whilst the appropriate inclusion of information for any question may vary and sometimes be debatable depending on exactly what the setter wishes to assess, how information is presented appears to be an important factor identified by several of the setters.

Logical structure/flow

When considering the past paper questions, several participants commented on the need for a logical flow of themes through the parts. One of the setters commented on how it is helpful if a question is ‘leading a child down a particular route that allows them to answer the question as best they can’. The sequence of the question parts in Geography QID5 (Figure 7) was commented on by one of the setters. He noted that parts (b) and (c) did not seem to flow well and that swapping their order might have been beneficial. He argued that this restructuring would encourage students to think about types of coastal erosion which might assist their thinking for why some California counties have a high value of buildings at risk.

Figure 7.

Geography QID5.

Ratings of question quality

For each question, the setters were asked for their initial view of its quality (poor, medium or high). Then, after seeing additional information and after seeing all the questions, they were asked to give a rating of quality from 1 (low) to 5 (high). Half points on the scale were allowed if setters seemed to need them. These ratings give a rough idea of how good the setters considered each question to be, but asking for categories and ratings was also intended to encourage the setters to comment on features affecting question quality as they justified their decisions. It is interesting to compare ratings where there was more than one setter for a syllabus involved in the research. As can be seen in Table 1, for some questions the ratings are the same or very similar between participants but for others there is more variation. With any rating scale of this kind there are likely to be variations in how judges use the scale; for example, one judge might apply a more compressed scale and avoid using the extremes, or a judge might avoid more negative ratings and be more generous. Nevertheless, with the variations in ratings shown in Table 1 it would seem that setters had different views of how good some of the questions were. It is possible that in some cases one setter identified an issue that the other had not and that if the two participants had been asked to discuss potential issues and then rate the questions individually, the difference between ratings might have been smaller. For example, Maths syllabus 2 QID2 was the question where students were asked to put several containers in order by size and one of the participants had been concerned that the diagram could be misleading. If this possibility had been raised in discussion between them, the participants might have reached a more similar view and given more similar individual ratings.

Table 1.

Ratings of quality for geography and international maths.

Question	Geography Setter 1	Geography Setter 2	International maths Setter 1	International maths Setter 2
QID1	4	2.5	4	2
QID2	1	2	3.5	5
QID3	5	5	3.5	4
QID4	–	–	4	4
QID5	2.5	4	5	2
QID6	2.5	–	5	3
QID7	–	–	5	4.5
QID8	5	4	4	4.5

Despite the possibility for slightly different ways of applying the scale, and for views becoming more similar after discussion between setters, it seems that there are variations in what different setters consider to be a good quality question. This is problematic in the context of wider discussions of the need to ensure high quality questions in examinations, and for research that intends to investigate the nature of quality in questions in order to inform practice. It should be noted, however, that Table 1 represents only a small number of questions each evaluated by only two setters and more data of this kind would provide more robust evidence relating to this issue.
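
To make the comparison in Table 1 more concrete, the gap between the two setters’ ratings can be summarised for each subject. The following is a minimal illustrative sketch in Python, not an analysis carried out in this research. The ratings are transcribed from Table 1, blank cells are treated as missing, and the names RATINGS, paired_gaps and summarise are hypothetical. Using the absolute difference between ratings is an assumption made here for illustration; it is only one simple way of describing disagreement.

```python
from statistics import mean

# Ratings transcribed from Table 1 as (Setter 1, Setter 2).
# None marks a blank cell in the table (assumed here to mean "no rating available").
RATINGS = {
    "Geography": {
        "QID1": (4, 2.5),
        "QID2": (1, 2),
        "QID3": (5, 5),
        "QID4": (None, None),
        "QID5": (2.5, 4),
        "QID6": (2.5, None),
        "QID7": (None, None),
        "QID8": (5, 4),
    },
    "International maths": {
        "QID1": (4, 2),
        "QID2": (3.5, 5),
        "QID3": (3.5, 4),
        "QID4": (4, 4),
        "QID5": (5, 2),
        "QID6": (5, 3),
        "QID7": (5, 4.5),
        "QID8": (4, 4.5),
    },
}


def paired_gaps(ratings):
    """Absolute rating difference per question, for questions rated by both setters."""
    return {
        qid: abs(first - second)
        for qid, (first, second) in ratings.items()
        if first is not None and second is not None
    }


def summarise(ratings):
    """Describe the disagreement between the two setters for one subject."""
    gaps = paired_gaps(ratings)
    return {
        "n_paired": len(gaps),                # questions rated by both setters
        "mean_gap": mean(list(gaps.values())),
        "max_gap": max(gaps.values()),
        "largest": max(gaps, key=gaps.get),   # first question with the largest gap
    }


if __name__ == "__main__":
    for subject, ratings in RATINGS.items():
        print(subject, summarise(ratings))
```

Under these assumptions, the mean gap is 1.0 scale points across the five geography questions rated by both setters and 1.25 across the eight international maths questions, with the largest single gap (3 points) on maths QID5. With so few questions and only two setters per syllabus, such figures are descriptive only.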

Discussion

Good quality questions are important to ensuring that the intended knowledge, understanding and skills are assessed by an exam and to successfully facilitate classroom learning. By utilising the experience of question writers, this research has explored their conceptualisations of exam question quality and their views on how features of questions affect question quality.

In terms of features affecting quality, a range of findings arose such as clarity of resources, appropriate use of context, clarity of information and task, logical flow, appropriate mark allocation, and appropriate layout and spacing. Some of these features echo findings from past research looking at difficulty and fairness of questions. For example, Crisp and Sweiry (2006) also found that diagrams can have the potential to mislead students and Ahmed and Pollitt (2007) have previously provided evidence of the need for contexts in questions to be appropriately designed. Scaffolding of tasks (Hughes et al., 1998), layout (Crisp, 2008) and the nature of mark schemes (Ahmed and Pollitt, 2011) have also been identified and explored in earlier studies. Other features identified were often factors that might generally be assumed to be good practice. For example, ensuring the syllabus requirements are met without repetition of content, and including a variety of types of task would be likely advice, depending on the requirements of the syllabus to which the exam relates. In general, these findings are unsurprising, but it is reassuring that the features that question writers reported to affect quality align with existing research findings, thus providing triangulation of previous work. These findings, including examples of questions with commentary, could be used in training materials for exam question writers and teachers. It seems likely that some of the current (and past) findings regarding exam questions would also apply to written classroom tests where it is important that student performance gives a realistic indication of ability such that teaching can be appropriately targeted to assist students’ learning. Whilst there will be some significantly different issues relating to verbal questioning in the classroom, certain aspects of the current findings are likely to apply, such as the need for clarity of meaning when questioning. The latter aligns with Tofade et al. (2013), who argue that the clarity of wording used in verbal questions influences their effectiveness.

When the exam question writers involved in this research were asked how they would define ‘quality’ in question writing, three conceptualisations of question quality emerged. Good quality questions:

1. Test the intended knowledge, understanding and skills. Questions are clear around what is required, and lack ambiguity. This means that students understand the question and what they are being asked to do, allowing them to perform as well as they can.

2. Differentiate between students who are better and weaker in the subject, whilst being accessible to all.

3. Go beyond simple recall and understanding to assess skills in the subject and ‘make students think’.

The first of these themes was dominant for most of the participants. It is about ensuring that the intended knowledge, understanding and skills are being tested, and thus relates to validity at the level of the question. As discussed earlier, validity relates to the appropriateness of the ways in which assessment results are interpreted and used (Messick, 1989). As such, in order for questions to contribute to validity they need to elicit and measure the knowledge, understanding and skills of interest. This includes a need for the task to be clear to students so that they attempt the intended task even if they are not able to complete it fully.

The second conceptualisation of quality, differentiation between candidates, also relates to validity. For GCSEs and IGCSEs, students receive an overall grade for their performance (across several assessments) where the grades are taken to indicate students’ level of achievement and are used accordingly. For this to be the case the questions need to elicit performances that vary with ability in the relevant subject such that scores and grades differentiate appropriately between candidates. Thus, this conceptualisation of quality relates to validity as scoring that does not capture important qualities of student performance would constitute a threat to validity (Crooks et al., 1996) and the grades cannot reasonably be used in the intended ways if the questions do not differentiate appropriately between candidates. The mention of accessibility in participants’ responses refers again to ensuring that students understand what they are being asked to do, as per the first conceptualisation of quality and thus relates to validity at the question level.

The final conceptualisation of quality expressed by the question writers, that of going beyond recall and understanding, is quite different in nature. It was only described by two of the seven participants, but for one of them this was his main definition of quality. This is interesting given that most GCSE/IGCSE syllabuses set out that a certain proportion of marks on the assessment must assess knowledge and understanding. This would mean these questions are low quality by definition even though such questions must be included. An exam paper that does not contain the intended balance of knowledge, understanding and skills cannot appropriately contribute to assessment outcomes that have the intended meaning. Therefore, this conceptualisation potentially threatens validity (unless there was no intention to assess knowledge and understanding).

There is an interesting contrast here between primary concerns of ‘what’ it is important to assess (concept 3) and ‘how’ it should be assessed in order to achieve validity (concepts 1 and 2). Concept 3 focuses on pedagogical aims for questions to go beyond the basics, and it is unsurprising that question writers have views on pedagogical issues given that most question writers are current or retired teachers. A question setter’s assessment expertise will have developed through their teacher training and then teaching practice where the main focus is likely to have been on how assessment can support student education. This could be the origin of concept 3. There is some evidence of this in the literature on verbal questioning in the classroom, where questions that facilitate learning rather than just recall are considered desirable (Sullivan and Lilburn, 2002; Tofade et al., 2013). Through becoming involved in exam marking and question writing these individuals also become assessment professionals, responsible for ensuring that assessments measure what they are intended to measure, such that results can be interpreted and used as intended. Therefore, their conceptualisations of quality in concepts 1 and 2 are also unsurprising. For most, the technical aims around validity appear to dominate, and the potentially conflicting pedagogy-focused aim is secondary if consciously considered at all. These question writers have assumed the syllabus to be appropriate, and their concerns fall within the bounds of what they have been asked to do. For one question writer, however, the importance of students being able to apply knowledge and understanding and to show skills relevant to the subject, an aim potentially developed during his teaching career, appeared to dominate his views. This echoes theory suggesting that teachers should ask questions requiring higher order cognitive processes to encourage students to apply their learning and think critically (Sullivan and Lilburn, 2002; Wilen, 1991).

Another interesting finding is that participants who saw the same questions often varied considerably when rating question quality. Whilst the way that individuals use rating scales will always vary to an extent and the relevant data are too small scale to be conclusive, this would seem to be a negative finding in terms of the consistency of understandings of quality in questions. What it does emphasise is the importance of involving several experts in the creation and review of an exam paper so that varying views can be utilised in order to ensure questions are as high quality as possible according to a consensus view. It is standard practice for GCSE and IGCSE papers to involve editing and review by several experts.

It should be noted that this was a small-scale study involving seven participants and a maximum of nine exam questions per syllabus. Consequently, some caution should be applied in interpreting the findings as either the setters or the questions might not constitute representative samples. However, the study has strengths in that it involved in-depth consideration of 39 questions (often made up of several question parts) and thus provides a sample of views on a variety of questions. Hence, the current research usefully adds to the body of literature in this area and triangulates past findings from analyses of student performance. Features thought to contribute to exam question quality arose from discussions in a ‘bottom-up’ manner and notions of ‘quality’ in exam questions were explored. The finding that experts may have different conceptualisations of what ‘quality’ means in the context of exam questions and that participants sometimes disagreed on the quality of example questions continues to make the notion of ‘high quality questions’ in exams difficult to define. Further work could usefully explore the conceptualisations of question quality across a wider sample of exam question writers, and whether views on specific questions converge through inter-examiner discussion. The latter would make the existence of different independent views less challenging to the creation of questions as long as a number of experts are involved in the design and editing of a paper. It is encouraging that the kinds of features that different question writers mentioned as affecting quality were not discrepant with each other, and that validity was a key theme. Some of these findings are likely to also apply to other educational questioning such as classroom tests.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Appendix 1: An example of summarised comments for a question

References

Ahmed and Pollitt (2007) Improving the quality of contextualized questions: An experimental investigation of focus. Assessment in Education: Principles, Policy and Practice 14(2): 201–232.

Ahmed and Pollitt (2011) Improving marking quality through a taxonomy of mark schemes. Assessment in Education: Principles, Policy and Practice 18(3): 259–278.

Bejar II (2011) A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy and Practice 18(3): 319–341.

Braun and Clarke (2006) Using thematic analysis in psychology. Qualitative Research in Psychology 3(2): 77–101.

Buttny (1993) Social Accountability in Communication, London: SAGE.

Cooper and Dunne (1998) Anyone for tennis?: Social class differences in children’s difficulties with ‘realistic’ mathematics testing. Sociological Review 46(1): 115–148.

Cooper and Dunne (2000) Assessing Children’s Mathematical Knowledge: Social Class, Sex and Problem Solving, Philadelphia, PA: Open University Press.

Crisp (2008) Improving students’ capacity to show their knowledge, understanding and skills in exams by using combined question and answer papers. Research Papers in Education 23(1): 69–84.

Crisp (2011) Exploring features that affect the difficulty and functioning of science exam questions for all candidates and specifically for those with reading difficulties. Irish Educational Studies 30(3): 323–343.

Crisp, Johnson and Novakovic (2012) The effects of features of examination questions on the performance of students with dyslexia. British Educational Research Journal 38(5): 813–839.

Crisp and Sweiry (2006) Can a picture ruin a thousand words? The effects of visual resources in exam questions. Educational Research 48(2): 139–154.

Crisp, Sweiry, Ahmed, et al. (2008) Tales of the expected: The influence of students’ expectations on question validity and implications for writing exam questions. Educational Research 50(1): 95–115.

Crooks, Kane and Cohen (1996) Threats to the valid use of assessments. Assessment in Education: Principles, Policy and Practice 3(3): 265–286.

Ernst-Slavit and Pratt (2017) Teacher questions: Learning the discourse of science in a linguistically diverse elementary classroom. Linguistics and Education 40: 1–10.

Fan, Mueller and Marini (1994) Solving difference problems: Wording primes co-ordination. Cognition and Instruction 12(4): 355–369.

Fisher-Hoch H and Hughes S (1996) What makes mathematics exam questions difficult? In: British Educational Research Association annual conference, Lancaster, UK, 12–15 September 1996.

Fisher-Hoch H, Hughes S and Bramley T (1997) What makes GCSE exam questions difficult? Outcomes of manipulating difficulty of GCSE questions. In: British Educational Research Association annual conference, York, UK, 11–14 September 1997.

Heritage (2013) Teacher questioning: The epicenter of instruction and assessment. Applied Measurement in Education 26(3): 176–190.

Hughes S, Pollitt A and Ahmed A (1998) The development of a tool for gauging the demands of GCSE and a level exam questions. In: British Educational Research Association annual conference, Belfast, UK, 26–30 August 1998.

Kelley (1927) Interpretation of Educational Measurements, New York: New World Book Co.

Mayer, Larkin and Kadane (1984) A cognitive analysis of mathematical problem solving ability. In: Sternberg (ed.) Advances in the Psychology of Human Intelligence, Hillsdale, NJ: Erlbaum Associates, pp. 231–273.

Mehan (1979) Learning Lessons: Social Organization in the Classroom, Cambridge, MA: Harvard University Press.

Messick (1989) Validity. In: Linn (ed.) Educational Measurement, 3rd ed. New York: American Council on Education/Macmillan, pp. 13–103.

Nickson M and Green S (1996) A study of the effects of context in the assessment of the mathematical learning of 10/11 year olds. In: British Educational Research Association annual conference, Lancaster, UK, 12–15 September 1996.

Olinghouse and Wilson (2013) The relationship between vocabulary and writing quality in three genres. Reading and Writing 26: 45–65.

Pollitt A and Ahmed A (1999) A new model of the question answering process. In: International Association for Educational assessment conference, Bled, Slovenia, 23–28 May 1999.

Pollitt, Entwhistle, Hutchinson, et al. (1985) What Makes Exam Questions Difficult?, Edinburgh: Scottish Academic Press.

Ramseier (1999) Task difficulty and curricular priorities in science: Analysis of typical features of the Swiss performance in TIMMS. Educational Research and Evaluation 5(2): 105–126.

Schegloff (2007) Sequence Organization in Interaction: A Primer in Conversation Analysis Vol. 1, Cambridge: Cambridge University Press.

Schegloff and Sacks (1973) Opening up closings. Semiotica 8(4): 289–327.

Shaurd and Rothery (1984) Children Reading Mathematics, London: John Murray.

Sinclair and Coulthard (1975) Towards an Analysis of Discourse: The English Used by Teachers and Pupils, Oxford: Oxford University Press.

Spalding V (2011) Structuring and formatting examination papers: Examiners’ views of good practice. Report, Assessment and Qualifications Alliance, UK.

Stivers and Rossano (2010) Mobilizing response. Research on Language and Social Interaction 43(1): 3–31.

Sullivan and Lilburn (2002) Good Questions for Maths Teaching: Why Ask Them and What to Ask (K-6), Sausalito, CA: Maths Solutions.

Tofade, Elsner and Haines (2013) Best practice strategies for effective use of questions as a teaching tool. American Journal of Pharmaceutical Education 77(7): Article 155.

Wilen (1991) Questioning Skills, for Teachers, Washington, DC: National Education Association.