Abstract
The concept of reliability is widely recognized across academic disciplines. However, the conventional understanding of reliability varies by discipline, does not adequately address intricate and ever-changing operating environments, and fails to account for the dynamic interactions between humans and artificial intelligence (AI) systems. To address these gaps, this article proposes a new framework for reliability that accounts for performance on both supraordinate and subordinate objectives. Framing reliability in this manner can make the evaluation of systems more precise and bring greater clarity to research on human-machine interactions. This is crucial for product designers and evaluators working to develop systems that meet end-use goals and comply with regulations. Researchers and practitioners alike need to rethink reliability in the context of AI systems.
Introduction
Reliability is a crucial concept across many fields and situations. It refers to the quality or state of being dependable and is used to assess individual or system performance from design through testing. Within Human Factors research, reliability has been linked to team performance (Chavaillaz et al., 2016; Huegli et al., 2020), trust in systems (Hussein et al., 2020a; Lyons et al., 2020), and workload (Katrein, 2015; Metzger & Parasuraman, 2005), among other things. The term is also used in other disciplines, such as engineering (Petritoli et al., 2018; Uday & Marais, 2015), computer science (Morgan et al., 1977; Roshandel, 2004), and philosophy.
Within these academic fields, definitions of reliability fall into two broad categories. The first is outcome-based. Such definitions are typically found in more technical fields such as Engineering (Modarres et al., 2009), Computer Science (Roshandel, 2004), and Human Factors (Hussein et al., 2020b; Onnasch, 2015), but also appear in fields such as Philosophy (Ryan, 2020). They focus on the ability of a system to successfully perform the task it was designed for, over a specified time interval and under specific conditions; that is, on the deviation of performance from the intended result.
These outcome-based definitions contrast with definitions that focus on consistency and repeatability (John, 2012; Shaffer & Kipp, 2014). For example, the field of Psychometrics defines reliable measurement instruments as those which “yield consistent results, both over time (temporal) and across observers” (Köhler et al., 2015). Although this consistency-based definition is typically applied to measurements and scales, its key element remains distinct even when isolated from that context: it emphasizes repeatable results, whereas the outcome-based definitions found in fields such as Engineering emphasize ideal results.
The interest in reliability extends beyond the “halls of academia”. Reliability assessments are often required by the government to ensure the consistent, credible, and effective performance of their products. Governmental agencies such as the National Aeronautics and Space Administration (NASA) and the Department of Defense (DoD) place an emphasis on reliability in both their operational policies and their contract requirements. Governmental agencies typically define reliability in a more outcome-based way, as a measure of the probability that a system will perform without failure under specified conditions (Army Regulation 70-1 Research, Development, and Acquisition Army Acquisition Policy, 2018; DoDD 5000.01, “The Defense Acquisition System”, September 9, 2020, 2003; National Aeronautics and Space Administration, 2017).
While there exists a plethora of definitions of reliability, most work within the fields of artificial intelligence (AI) and assistive automation (AA) defines reliability as the probability that a system will perform a given function, without failure, under specific conditions for a specified period of time (e.g., de Visser et al., 2018; Foroughi et al., 2021; Hussein et al., 2020a; Onnasch, 2015). However, as AI and AA systems become more complex, the applicability of this definition becomes less clear.
Existing definitions of reliability focus on the accomplishment of supraordinate objectives. For example, reliability has been operationalized as the “proportion of correctly indicated critical events (information automation), correctly given diagnoses, suggested decisions, or correctly executed actions (decision automation)” (Onnasch, 2015). While such a contextualization of reliability offers a concrete way to quantify system performance, it fails to account for the dynamic nature of interactions between humans and more advanced systems.
The need to account for this dynamism is self-evident in both research and product evaluations. While there has been considerable research on the effects of system reliability (as measured by outcome) on factors such as trust (Kaltenbach & Dolgov, 2017; Masalonis & Parasuraman, 2003), recent work has shown that outcome-based measures of system performance on supraordinate objectives may be insufficient to explain differences in trust (Tenhundfeld, Davis, et al., 2022). In order to accurately assess the impacts of reliability on user trust, workload, and situation awareness when using more complex systems, there needs to be a greater consideration of subordinate goals in the assessments of reliability. The consideration of these subordinate goals is also essential as product designers and evaluators work to develop systems which meet end-use goals, and to ensure compliance with regulations and requirements.
We therefore present a new framework for consideration of reliability in research- and practitioner-based environments.
New Framework
For any outcome-based assessment of reliability, there exists a supraordinate objective (O) against which success and failure are determined. This supraordinate objective may be explicitly stated or implied. For example: the self-driving vehicle will maintain an accident rate at or below one incident per million miles driven. There are instances of assessment wherein more than one supraordinate objective exists. For instance, in addition to the accident rate above, another supraordinate objective may be to ensure that in instances of a collision, the airbags deploy 100% of the time. We can treat every supraordinate objective as a member of a set.
We can also establish a set of subordinate objectives (o) for every supraordinate objective.
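A minimal sketch of the notation these two statements suggest is given below. The index bounds n and m_i, and the bold symbols for the sets, are assumptions introduced here for illustration rather than notation taken from the original.

```latex
% Set of supraordinate objectives (n is an assumed count)
\mathbf{O} = \{ O_1, O_2, \dots, O_n \}

% Set of subordinate objectives attached to the i-th supraordinate objective
% (m_i is an assumed count that may differ for each O_i)
\mathbf{o}_i = \{ o_{i,1}, o_{i,2}, \dots, o_{i,m_i} \}, \qquad i = 1, \dots, n
```

Under this reading, the self-driving example would have O_1 as the accident-rate objective and O_2 as the airbag-deployment objective, each with its own subordinate set.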
Reliability assessments of supraordinate objectives fail to account for performance differences in subordinate objectives. Additionally, failure in subordinate objectives may not preclude the success of the supraordinate objectives. To further illustrate this point, we turn to a weapon detection task in baggage screening. In this case, our supraordinate objective will be for the system to alert us that there is a weapon present when there is one. However, subordinate to this objective may be several objectives:
1) Scan the bag with X-rays
2) Collect the reflectance of the X-rays
3) Process the signal to identify the presence of weapons
4) Present the outcome of the processing (e.g. weapon/no weapon)
In this example, there are a multitude of ways in which the subordinate objectives might not be met, which are not reflected in the outcome of the supraordinate objective. For instance, the signal processing may result in the identification of a hairdryer as a firearm, cueing a ‘weapon found’ determination, while missing a knife. However, in this example, the supraordinate objective was realized in that a correct determination was made, albeit for the wrong reason. An outcome-based assessment of reliability would not be concerned with ‘how’ the system made its determination, but only the accuracy of the determination.
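The gap between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names, the `Trial` record, and the rule that a trial counts as reliable only when every subordinate step succeeds are all hypothetical choices made here, and the supraordinate objective is simplified to "the alert matches whether a weapon is present".

```python
from dataclasses import dataclass, field

# Hypothetical names mirroring the four subordinate objectives listed above.
STEPS = ("scan", "collect", "process", "present")


@dataclass
class Trial:
    weapon_present: bool
    alert_raised: bool
    steps_met: dict = field(default_factory=dict)  # step name -> bool

    def supraordinate_met(self) -> bool:
        # Assumption: the determination is correct when the alert matches reality.
        return self.alert_raised == self.weapon_present

    def all_subordinate_met(self) -> bool:
        return all(self.steps_met.get(step, False) for step in STEPS)


def outcome_reliability(trials):
    """Share of trials in which the supraordinate objective was met."""
    return sum(t.supraordinate_met() for t in trials) / len(trials)


def hierarchical_reliability(trials):
    """Share of trials in which the supraordinate and all subordinate objectives were met."""
    met = sum(t.supraordinate_met() and t.all_subordinate_met() for t in trials)
    return met / len(trials)


ok = {step: True for step in STEPS}
trials = [
    Trial(True, True, dict(ok)),
    Trial(False, False, dict(ok)),
    # Hairdryer flagged as a firearm while the knife is missed:
    # the alert is correct, but the signal-processing step failed.
    Trial(True, True, {**ok, "process": False}),
    Trial(False, False, dict(ok)),
]

print(outcome_reliability(trials))       # 1.0
print(hierarchical_reliability(trials))  # 0.75
```

The outcome-based score reports a perfect system, while the stricter score exposes the trial in which the right answer was reached for the wrong reason.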
This is an admittedly simplistic example. However, as systems grow more complex and calls for explainable AI (XAI) increase, we are likely to become more aware of systems that satisfy their supraordinate objectives while simultaneously failing to achieve their subordinate objectives. To illustrate, consider a self-driving car that misidentifies a trashcan as another vehicle. Whether the object is identified correctly or incorrectly, the identification will result in a supraordinate goal (i.e., do not crash into the object) being accomplished, but an audit of the technology by either developers or users of a transparent system will show that the subordinate objective of correctly identifying its surroundings was not accomplished. What this should mean from a reliability perspective is both a technical problem and a human problem.
From the technical side, one might imagine that reliability must account for performance on every supraordinate and subordinate objective in order to create a holistic evaluation. On the human side, this raises questions about the comparative importance of supraordinate and subordinate objectives. For constructs like trust, users may be primarily concerned with the supraordinate objective outcomes (Ross et al., 2008; Schwarz et al., 2019). However, research shows that as users gain expertise, their focus on system performance, their interactions with the systems, and their trust in the systems change (Keller & Rice, 2009; Navarro et al., 2021; Niu et al., 2018). These effects may also depend on users’ mental models of the systems, which could shape their understanding of the objective hierarchies (Schraagen et al., 2020; Tenhundfeld, Barr, et al., 2022). With respect to workload, subordinate objective failures may result in greater attentional resources being allocated to supervisory control of the system to ensure that failures at the subordinate level do not manifest as failures at the supraordinate level (Bowden et al., 2021; Brown & Galster, 2004).
We propose that a new definition of reliability is needed in order to account for this hierarchical relationship. As such, we define reliability as the probability that a system will perform without failure and in a manner consistent with supraordinate and subordinate objectives. This new definition allows for the clarification that these complex systems should be assessed with a particular emphasis on the fact that they represent a system-of-systems (Uday & Marais, 2015).
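One possible way to write this definition formally is sketched below. The notation is an assumption introduced here: S(x) denotes the event that objective x is met, O_1 through O_n are the supraordinate objectives, and o_{i,1} through o_{i,m_i} are the subordinate objectives of O_i. The sketch also reads "in a manner consistent with" in its strictest sense, namely that every subordinate objective is met, which is only one of several plausible interpretations.

```latex
% Traditional, outcome-based reliability: only supraordinate objectives count
R_{\text{outcome}} = P\!\left( \bigcap_{i=1}^{n} S(O_i) \right)

% Hierarchical reliability under the strict reading assumed here
R_{\text{hier}} = P\!\left( \bigcap_{i=1}^{n} \Big[ S(O_i) \,\cap\, \bigcap_{j=1}^{m_i} S(o_{i,j}) \Big] \right)

% The hierarchical event is a subset of the outcome-based event, so
R_{\text{hier}} \le R_{\text{outcome}}
```

The inequality captures the point made with the baggage screening example: a system can score well on outcomes alone while scoring lower once subordinate objectives are taken into account.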
Our goal here is not to add to the already long list of qualified reliability definitions, such as mission reliability (Dui et al., 2021), software reliability (Littlewood & Strigini, 2000), and network reliability (Morgan et al., 1977), but rather to prompt a reconceptualization of what the definition of reliability needs to be at its core in order to adapt to the dynamic nature of human-machine interactions.
In order to assess the reliability of a system, simple dichotomous classifications of success or failure may be insufficient.
Severity of Errors
One issue that assessments of reliability run into is that not all failures or errors are the same. As mentioned above, some errors preclude the system from achieving its supraordinate objectives, whereas others are comparatively minor, affecting only a subordinate objective without necessarily impacting the system’s overall performance on supraordinate objectives. This is further complicated by the fact that more complex systems may be able to recognize errors at the subordinate level and correct them before they impact system performance at the supraordinate level.
Turning to fault tree analyses (FTA), designers can anticipate unintended outcomes and establish ways to mitigate them (Ruijters & Stoelinga, 2015). This can be done in a multitude of ways, but one compelling recent example stems from the data verification approaches used in blockchain technologies (Martin, 2020). As AI and AA systems have more access to data, these data verification strategies may be employed to catch errors before they cascade. The importance of such designs has been long understood and contextualized with the “Swiss cheese” model for accident prevention (Wiegmann et al., 2022). A more nuanced assessment of system reliability at both the supra- and subordinate levels will allow for accounting of overall system reliability in these self-correcting systems.
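As a rough illustration of how a fault tree combines lower-level failures into a top-level failure probability, the sketch below uses the standard AND/OR gate formulas for independent basic events. The event names and probabilities are invented for illustration, and the independence assumption will often not hold in real systems.

```python
from math import prod


def or_gate(probs):
    """Output fails if any input fails (inputs assumed independent)."""
    return 1 - prod(1 - p for p in probs)


def and_gate(probs):
    """Output fails only if every input fails (inputs assumed independent)."""
    return prod(probs)


# Hypothetical basic-event failure probabilities.
p_sensor = 0.02       # sensor fails outright
p_classifier = 0.05   # subordinate-level classification error
p_verifier = 0.10     # verification layer fails to catch that error

# A classification error propagates only if the classifier errs
# AND the verification layer misses it.
p_uncaught = and_gate([p_classifier, p_verifier])

# The top-level (supraordinate) failure occurs if either branch fails.
p_top = or_gate([p_sensor, p_uncaught])

print(round(p_uncaught, 4))  # 0.005
print(round(p_top, 4))       # 0.0249
```

In this toy tree the verification layer reduces the contribution of classification errors from 0.05 to 0.005, which is the kind of self-correction that an outcome-only reliability score would not reveal.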
Traditional measures of reliability also fail to account for the timing and type of errors, which research shows can have deleterious effects on the overall performance of the human-machine team (Bahner et al., 2008; Guznov et al., 2016; Rossi et al., 2017; Salem et al., 2015). System errors which occur early in the interactions between human and machine tend to impact factors like trust more than errors which occur later on (Rossi et al., 2017), which can in turn impact workload (Bailey & Scerbo, 2007; de Visser & Parasuraman, 2011), situation awareness (Parasuraman et al., 2008), performance (Sebok & Wickens, 2017), and reliance strategies (Du et al., 2019; Dzindolet et al., 2003).
Similarly, not all errors have the same impact on the human-machine interaction. We can dichotomize types of errors as being either errors of omission or commission (Johnson, 2004). Errors of omission are those in which the system does not do something it was supposed to do. For example, in the baggage screening example above an error of omission would constitute the scanning machine not providing a determination about whether a weapon was detected in a bag. Conversely, errors of commission are those in which a system does something it is not supposed to do. In the case of self-driving vehicles this may manifest in the vehicle straying onto the shoulder. As with the timing, errors of omission and commission do not impact the human-machine team equally (Johnson, 2004).
However, no framework for reliability, to date, can account for the severity, timing, and type of errors discussed here. As a result, two different systems may yield the same overall reliability score under traditional reliability approaches while representing two very different reliability profiles with regard to human-machine interactions.
Given what has been discussed here, we propose that research and further consideration be given to the ways in which reliability assessments can be adapted to meet the needs of systems which are growing in complexity. This requires a reconceptualization of reliability as a construct. This work should focus on how one might build a ‘reliability profile’ for systems which expressly depicts system performance on supra- and subordinate objectives, while also classifying the type, timing, and severity of errors. Doing this will allow system designers, developers, and researchers to better understand the impacts of system performance on human users.
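What such a ‘reliability profile’ might look like can be sketched in code. Everything below is a hypothetical illustration rather than a proposed standard: the `ErrorEvent` fields, the 1 to 5 severity scale, the split of trials into early and late halves, and the choice to count any error-containing trial against the overall score are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    trial: int      # 0-based index of the trial in which the error occurred
    level: str      # "supra" or "sub"
    kind: str       # "omission" or "commission"
    severity: int   # hypothetical scale: 1 (minor) to 5 (critical)


def reliability_profile(n_trials, errors):
    """Summarize errors by level, type, timing, and severity."""
    failed_trials = {e.trial for e in errors}
    half = n_trials / 2
    return {
        "overall": 1 - len(failed_trials) / n_trials,
        "by_level": dict(Counter(e.level for e in errors)),
        "by_kind": dict(Counter(e.kind for e in errors)),
        "early": sum(e.trial < half for e in errors),
        "late": sum(e.trial >= half for e in errors),
        "max_severity": max((e.severity for e in errors), default=0),
    }


# Two hypothetical systems, each with errors in 2 of 10 trials.
system_a = [
    ErrorEvent(1, "sub", "commission", 1),
    ErrorEvent(2, "sub", "omission", 2),
]
system_b = [
    ErrorEvent(8, "supra", "omission", 5),
    ErrorEvent(9, "supra", "commission", 4),
]

profile_a = reliability_profile(10, system_a)
profile_b = reliability_profile(10, system_b)

print(profile_a["overall"] == profile_b["overall"])  # True
print(profile_a == profile_b)                        # False
```

Both systems receive the same overall score, yet one makes early, minor, subordinate-level errors and the other makes late, severe, supraordinate-level errors, a difference that only the fuller profile makes visible.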
Conclusion
Reliability is an important concept across many academic disciplines. However, the current conceptualization of reliability fails to consider the highly complex and dynamic environments in which humans are and will be interacting with AI and AA systems. As such, there needs to be a reconsideration of reliability as a construct which accounts for performance on supra- and subordinate objectives. Doing this will allow for more precision in the evaluation of systems, while also allowing more clarity in the research involving human-machine interactions.
