Effects of Task Difficulty and Presentation Order in Subjective Usability Measurement

Abstract

The System Usability Scale (SUS) is a tool widely used in industry for measuring the usability of products and systems. Users are often asked to complete two or more tasks of varying complexity before evaluating a product using the SUS. Task order effects may influence the overall usability rating of a product, but previous literature has not examined this issue. To test the effect of task order on the SUS, participants were asked to complete two tasks involving locating specific information on a college website. Participants completed an easy task and a hard task, and the order in which the tasks were presented was randomized. Results showed that when participants completed the easy task first, they rated the overall usability of the website lower than when the hard task was presented first. This suggests that practitioners should be cautious when designing usability studies with tasks of varying difficulty.

Keywords

Test and evaluation methodology; Usability/acceptance measurement and research; Task order; Usability

Introduction

The usability of systems and products receives considerable attention in human factors circles. Usability is the general appropriateness of a product to its purpose and must be viewed in the context in which the product is used, including the intended users, the tasks the users will perform, and the characteristics of the environment in which it is used. The International Organization for Standardization (ISO) defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use” (ISO, 2018).

Accurately measuring the usability of a product, service, or system is important because usability measures are used to help allocate resources to mitigate poor usability and to help designers choose among alternative products. Performance measures are often used to judge a system’s usability. Success rates, errors, and time on task are three commonly used performance measures that map directly onto the ISO metrics of effectiveness and efficiency.

Subjective usability metrics are also important, since they can provide additional insight into how users feel about the product’s use, which maps well to the ISO metric of satisfaction. The System Usability Scale (SUS) is a questionnaire that is widely used in industry and academic research for measuring the subjective usability of a product or system, including software, websites, and hardware. It is a 10-item questionnaire that asks users to rate statements such as “I felt very confident using the system” on a 5-point Likert scale, and it produces scores ranging from 0 to 100 (Brooke, 1996). Higher scores indicate better usability. The SUS was originally developed to be a low-cost, efficient tool to compare subjective assessments of usability, but over time it has been shown to be a highly reliable and valid measurement instrument (Bangor, Kortum & Miller, 2008; Gao et al., 2018; Sauro, 2011). The SUS provides valid results even when slight modifications are made, such as dropping questions (Lewis and Sauro, 2017) or rewording the items so that all have positive phrasing (Kortum et al., 2021).
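To make the 0 to 100 range concrete, the following is a minimal sketch of the conventional scoring procedure for the standard SUS with alternating item wording (Brooke, 1996): odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5. The function name `sus_score` is hypothetical, and the sketch does not apply to the all-positive variant, which would need a different item mapping.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for ten item ratings (1-5), in questionnaire order.

    Assumes the standard SUS wording, where odd-numbered items are positively
    worded and even-numbered items are negatively worded.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # odd item: higher agreement is better
        else:
            total += 5 - rating   # even item: lower agreement is better
    return total * 2.5            # rescale the 0-40 sum to 0-100


# A respondent who answers the midpoint on every item lands at the middle of the scale.
print(sus_score([3] * 10))
```

Each item contributes between 0 and 4 points, so the ten items sum to at most 40 before rescaling.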

Usability tests are frequently designed so that participants test a product across a number of different tasks in order to gain a more accurate picture of its usability. Often, a variety of task difficulties are used, with easier tasks presented first to allow participants to gain familiarity with the product before progressing to more difficult tasks. However, psychological literature has shown that task order effects can impact testing results. Order effects may take the form of either a primacy bias or a recency bias (Bansback et al., 2014; Hassenzahl & Sandweg, 2004). Primacy bias refers to the effect where people weigh information presented first more heavily than information presented last. Conversely, recency bias is the effect where people give more weight to the information received most recently.

Previous research on the SUS has not yet investigated which of these biases may be strongest in the context of usability testing, but other work has shown that task difficulty is reflected in the resulting SUS scores.

For example, a study by Sagar and Saha (2020) directly evaluated the effect of task difficulty on usability scores measured using the SUS. Their findings showed that higher-difficulty tasks resulted in lower SUS scores. The researchers concluded that there is a strong relationship between SUS scores and system effectiveness as measured by task success rate. The strength of this relationship differs across types of tasks, which suggests that having an easy or difficult task first may influence the rating for the second task.

Further evidence for a relationship between task difficulty and SUS scores comes from Drew et al. (2018), who asked participants to think aloud while completing a usability test and the SUS on two different websites. Results showed a moderate positive relationship between SUS scores and perceived success. This highlights the importance of examining the differences between SUS scores for easy and hard tasks, as SUS scores may be influenced by perceived success.

What is not known, however, is whether the order in which participants complete the task set can have an impact on subjective usability assessments, particularly if the tasks vary in difficulty.

This study aims to investigate the potential impact of task order effects on SUS scores by comparing SUS scores from participants who complete a hard task first to those from participants who complete an easy task first. The findings of this study will contribute to our understanding of how task order effects may impact the results of usability tests and will inform best practices for administering the SUS in future usability studies.

Methods

For this study, a total of 419 survey responses were collected from Rice University undergraduate students. Participants were between the ages of 18 and 25 and received partial course credit as compensation for their participation. The participants were 38.5% male and 59.4% female, and 2.1% identified as another gender. They reported spending an average of 5 to 7 hours using a computer in the past month. Of the participants, 77% indicated they were completely confident or fairly confident in their ability to use the internet to find information. The study was approved by the Rice University Institutional Review Board (IRB).

Data were collected remotely through Qualtrics. Participants were able to complete the testing session at a time of their own choosing and did not have a time limit to complete the testing procedures. Participants were asked to complete two tasks that involved locating specific information on one of two college websites. One task was easy and the other was hard, and the order in which the tasks were presented was randomized.

The difficulty of each task was determined by the minimum number of clicks it took to access the target information. For easy tasks, the user would only need to click on two or three links to find the answer, while hard tasks required six to eight clicks. Participants were not allowed to use the search bar to find answers; they had to rely solely on mouse clicks for navigation.

Procedure

Participants began the experiment by reading and agreeing to the IRB-approved consent page. They were then randomly assigned to see one of two college websites, University of Illinois Urbana-Champaign (UIUC) or Tufts University, shown in Figures 1 and 2, respectively. Participants were then randomly assigned to complete one easy task and one hard task. The task order was counterbalanced.

Figure 1.

UIUC Website Homepage.

Figure 2.

Tufts University Website Homepage.

Four tasks were developed: two easy and two hard. This was done to ensure that our results were not limited to specific navigation problems on the websites. Participants completed one of two possible easy tasks. Easy Task 1 was to “find the cost of tuition”. Easy Task 2 was to “find the first and last name of the player who wears the #3 jersey on the men’s baseball team”.

Participants also completed one of two possible hard tasks. Hard Task 1 was to “find a location for single use mask recycling on campus”. Hard Task 2 was to “find a location of a private room that is for use by breast-feeding mothers”. For each task, participants were instructed not to use the search bar or Google to find the answer.

Pilot testing was completed beforehand to ensure that the selected tasks differed appropriately in difficulty and that tasks were of similar difficulty across the two websites.

Immediately upon completion of the first task, participants were prompted to complete the SUS for the first time to evaluate the task they had just seen. On the same website, participants were assigned a second task of a different difficulty to complete, and then immediately prompted to complete the SUS for a second time. Participants were then asked to complete a short demographic survey before being asked to complete the SUS for a third time to evaluate their overall website experience.
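The assignment procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the study used Qualtrics, whose internal randomization is not described here, and the names `WEBSITES`, `EASY_TASKS`, `HARD_TASKS`, and `assign_condition` are hypothetical. The sketch uses simple randomization of website, task version, and task order; it does not enforce equal group sizes.

```python
import random

# Hypothetical labels for the conditions described in the Procedure section.
WEBSITES = ["UIUC", "Tufts"]
EASY_TASKS = ["Easy Task 1", "Easy Task 2"]
HARD_TASKS = ["Hard Task 1", "Hard Task 2"]


def assign_condition(rng):
    """Randomly pick a website, one easy task, one hard task, and their order."""
    website = rng.choice(WEBSITES)
    easy_task = rng.choice(EASY_TASKS)
    hard_task = rng.choice(HARD_TASKS)

    task_order = [easy_task, hard_task]
    rng.shuffle(task_order)  # easy-first or hard-first with equal probability

    return {"website": website, "task_order": task_order}


# Seeded generator so that an assignment can be reproduced when debugging.
rng = random.Random(2024)
print(assign_condition(rng))
```

Every participant still receives exactly one easy and one hard task on a single website; only the version of each task and the order of presentation vary.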

Results

Data from a total of 69 participants were excluded. Thirty-seven responses were excluded because technological limitations made it unclear whether course credit had been awarded, which led participants to complete the survey multiple times. Thirty-one responses were excluded for being incomplete, and one response was removed due to an unrealistic completion time.

In order to understand how the order and combination of the tasks affect the perceived usability of the website, a repeated-measures, multi-factor ANOVA was conducted, along with an examination of Cohen’s f. Results showed that scores on the three SUS administrations differed significantly, F(2, 668) = 17.34, p < .001, f = .23. The SUS scores had a two-way interaction with the order in which participants saw the tasks, F(2, 668) = 19.74, p < .001, f = .24, and with the combination of easy and hard tasks they received, F(6, 668) = 2.81, p = .01, f = .16. The SUS scores had a three-way interaction with the order of tasks and combination, F(6, 668) = 3.01, p = .007, f = .16; with the order of tasks and the school website seen, F(2, 668) = 7.32, p < .001, f = .15; and with the combination of tasks and school, F(6, 668) = 3.04, p = .006, f = .17. Lastly, the SUS scores had a four-way interaction with the order of tasks, combination of tasks, and school, F(6, 668) = 5.80, p < .001, f = .23. A post hoc analysis using the Ryan-Einot-Gabriel-Welsch method revealed no significant differences among the different combinations of tasks in their effect on the SUS.
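For readers who want to check the reported effect sizes, the sketch below converts an F statistic and its degrees of freedom to Cohen’s f through partial eta squared. The text does not state how f was computed, so this standard conversion is an assumption, and the function name `cohens_f` is hypothetical.

```python
import math


def cohens_f(f_stat, df_effect, df_error):
    """Convert an F statistic to Cohen's f via partial eta squared.

    partial eta^2 = (F * df_effect) / (F * df_effect + df_error)
    f             = sqrt(partial eta^2 / (1 - partial eta^2))
    """
    partial_eta_sq = (f_stat * df_effect) / (f_stat * df_effect + df_error)
    return math.sqrt(partial_eta_sq / (1 - partial_eta_sq))


# Main effect of SUS administration and its interaction with task order.
print(round(cohens_f(17.34, 2, 668), 2))
print(round(cohens_f(19.74, 2, 668), 2))
```

Under this assumption the two values round to .23 and .24, which agrees with the effect sizes reported for those tests.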

Table 1 contains the means and standard deviations of participants’ SUS ratings for each task order. Because the ANOVA revealed significant differences between the two task order groups, we further tested this effect on the website SUS scores. Results showed that when participants completed the easy task first, they rated the overall usability of the website lower than when the hard task was presented first, t(347.95) = 2.95, p = .003, d = .31. Additionally, the hard task was rated as more usable when it was presented first than when it was presented second, t(343.54) = 3.71, p = .002, d = .39.

Table 1.

SUS Administration Descriptive Statistics.

Task Order	Task 1 Score, M (SD)	Task 2 Score, M (SD)	Website Score, M (SD)
Easy-Task First	Easy: 69.15 (19.07)	Hard: 39.11 (23.76)	55.20 (20.33)
Hard-Task First	Hard: 47.84 (20.24)	Easy: 70.04 (21.89)	61.42 (19.18)

Note. The SUS received a full range of scores from 0 to 100 in all administrations.
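As a worked check on the website-score comparison, the sketch below computes Cohen’s d from the Table 1 means and standard deviations. Group sizes are not reported per condition, so the pooled standard deviation here assumes equal group sizes; that assumption and the function name `cohens_d_equal_n` are not from the original analysis.

```python
import math


def cohens_d_equal_n(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two independent groups, assuming equal group sizes."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Website SUS score: hard-task-first group versus easy-task-first group (Table 1).
d_website = cohens_d_equal_n(61.42, 19.18, 55.20, 20.33)
print(round(d_website, 2))
```

The result rounds to .31, in line with the effect size reported for the website scores. Unequal group sizes would shift the pooled standard deviation slightly.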

A multi-factor ANOVA of demographic information revealed that participants’ self-reported confidence in their ability to find information on the internet had an effect on their ratings for the overall usability of the website, F(3) = 7.29, p < .001, f = .44. There was an effect of the interaction between the hours participants reported spending online and whether they had seen the website before, F(6) = 3.75, p = .002, f = .45, and an interaction between internet use, age, and current school year, F(5) = 2.80, p = .020, f = .35.

Discussion

These findings suggest that task order has an impact on the perceived usability of websites. While further research is necessary, practitioners should be cautious when designing usability studies with tasks of varying difficulties.

Our results showed that when an easy task was presented first, followed by the hard task, participants rated the overall usability of the website as lower than when the hard task was presented first. In both groups, participants appeared to be mentally averaging their experiences with the individual tasks when they scored their overall experience with the website. When the hard task was presented first, participants produced slightly inflated SUS scores for the overall website, suggesting that these findings are not due to a halo effect from the first task. Participants who saw the hard task second might have been more frustrated by its difficulty relative to the easy task, and hence produced lower SUS scores for both the task itself and the overall website. Although the school website that participants saw interacted significantly with task order and task combination in its effect on SUS scores, there was no main effect of school.

These results suggest that recency bias has a stronger effect than primacy bias on participants' resulting SUS ratings. Hassenzahl and Sandweg (2004) found that summary assessments of perceived usability do not reflect a whole experiential episode, but rather its most recent incidents, hence predicting a recency bias.

Additionally, participants’ self-reported confidence regarding internet use impacted their ratings on the SUS, along with the number of hours they spent online and age, suggesting that experience with the system or product guides its perceived usability. This finding is in line with what one might expect based on previous literature regarding the relationship between experience and perceived usability (Kortum and Johnson, 2013).

There are several limitations in our study that should be considered when interpreting our results. The present study utilized only one type of product, college websites, which limits the generalizability of our findings. The task order effect demonstrated by our results should be replicated with other products, including non-websites.

Our study also only utilized two task difficulties. This may not be reflective of all testing scenarios, which may require the use of more than two difficulties or tasks that are more similar in complexity. We included multiple versions of each task difficulty in an attempt to mitigate this limitation. Future research should aim to replicate this study design with other usability metrics, such as the UMUX and SUPR-Q, to explore the effects of task order on perceived usability with subjective measures other than the SUS. Researchers should also consider adding tasks with more varied difficulty levels instead of just easy or hard.

While this study suggests that there is an impact of task difficulty and presentation order on the resulting subjective usability measures, it unfortunately does not provide evidence about how to best deal with the phenomenon in the real world, where the goal is to make an accurate evaluation of usability. Practitioners should employ good experimental designs that help minimize these kinds of effects, such as randomization of task orders to distribute the variability across conditions. Practitioners might also consider creating between-subjects study designs where tasks of similar difficulty are grouped together in order to minimize the effects demonstrated in this study.

Clearly, further research is needed to explore how to best mitigate the findings described here in a fashion that does not significantly complicate experimental designs or increase data collection time or cost in the field.

Practical Take-Aways

Our findings suggest that presenting easy tasks first and hard tasks second leads to lower overall SUS scores for a product. Presenting users with an easy task first may cause increased frustration on later, more difficult tasks and induce a recency bias in subjective usability assessment responses.

Practitioners should be aware of potential task order effects when designing usability studies. It is important for practitioners to know that using multiple task difficulties has the potential to unintentionally bias participant responses.

Others’ results should be interpreted cautiously if there are large differences in the difficulty of the tasks used to assess the product and those tasks were presented in a non-randomized fashion.

Conclusions

Task order effects influence the overall subjective usability ratings of a product. When participants complete an easy task first, they rate the overall usability of the product lower than when the hard task is presented first. This suggests that practitioners should be cautious when designing usability studies with tasks of varying difficulty and when interpreting results from these kinds of common studies.

References

Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation of the system usability scale. International Journal of Human–Computer Interaction, 24(6), 574-594.

Bansback, N., Li, L. C., Lynd, L., & Bryan, S. (2014). Exploiting order effects to improve the quality of decisions. Patient Education and Counseling, 96(2), 197-203.

Brooke, J. (1996). SUS: A quick and dirty usability scale. In Usability evaluation in industry (pp. 189-194).

Drew, M. R., Falcone, B., & Baccus, W. L. (2018). What does the system usability scale (SUS) measure? Validation using think aloud verbalization and behavioral metrics. In Design, User Experience, and Usability: Theory and Practice: 7th International Conference, DUXU 2018, Held as Part of HCI International 2018, Las Vegas, NV, USA, July 15-20, 2018, Proceedings, Part I (pp. 356-366). Springer International Publishing.

Gao, M., Kortum, P., & Oswald, F. (2018). Psychometric evaluation of the USE (usefulness, satisfaction, and ease of use) questionnaire for reliability and validity. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 62, No. 1, pp. 1414-1418). SAGE Publications.

Hassenzahl, M., & Sandweg, N. (2004). From mental effort to perceived usability: Transforming experiences into summary assessments. In CHI '04 Extended Abstracts on Human Factors in Computing Systems (pp. 1283-1286).

ISO (2018). Ergonomics of Human-System Interaction – Part 11: Guidance on Usability (ISO 9241-11(E)). Geneva, Switzerland.

Kortum, P., Acemyan, C. Z., & Oswald, F. L. (2021). Is it time to go positive? Assessing the positively worded system usability scale (SUS). Human Factors, 63(6), 987-998.

Kortum, P., & Johnson, M. (2013). The relationship between levels of user experience with a product and perceived system usability. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 57, No. 1, pp. 197-201). SAGE Publications.

Lewis, J. R., & Sauro, J. (2017). Can I leave this one out? The effect of dropping an item from the SUS. Journal of Usability Studies, 13(1).

Sagar, K., & Saha, A. (2020). Exploring the effect of tasks difficulty on usability scores of academic websites computed using SUS. In International Conference on Innovative Computing and Communications: Proceedings of ICICC 2019, Volume 1 (pp. 11-19). Springer Singapore.

Sauro, J. (2011). A practical guide to the system usability scale: Background, benchmarks & best practices. Measuring Usability LLC.