Abstract
The implementation of automation in many domains has led to well-documented accidents and incidents, resulting from reduced situation awareness that occurs when operators are out-of-loop (OOTL), automation confusion, and automation interaction difficulties. Wickens coined the term lumberjack effect to summarize the finding that while automation works well most of the time in typical or normal situations, the performance problems that occur in novel or unexpected situations also increase the likelihood of catastrophic errors. Skraaning and Jamieson have criticized the lumberjack effect on the basis of a study in which they failed to find it. I show that this claim is unsupported due to a number of methodological limitations and conceptual errors in their study. They also provide a model of automation failure that fails to clearly delineate the many barriers to accidents that are available, instead emphasizing the ways in which automation can fail technically and different types of human error. An alternate automation failure model is presented that provides a broader socio-technical perspective emphasizing the design features, processes, capabilities, organizational policies, and training that support people in improving system safety when automation fails.
Automation-Induced Failures
Severe accidents and safety-compromising incidents are a well-known side effect of automation (Wiener, 1988). These accidents arise from a number of automation problems, largely induced by the design of these systems, that affect people’s ability to oversee and interact with them in order to maintain operational safety. These problems include (a) automation confusion, where operators have trouble understanding what the automation is doing, (b) reduced situation awareness (SA) that results from being out-of-the-loop (OOTL), leading to slow detection of and response to events that the automation cannot handle, and (c) automation interaction difficulties, where people have problems directing the automation, particularly in time-critical situations.
Gawron (2019) documents 26 automation-related aircraft accidents and incidents. In reviewing this list, 27% can be categorized as involving automation confusion, 58% as involving OOTL/SA problems, and 36% as involving automation interaction problems (with some accidents having multiple problems). Contributing to these challenges were (a) highly complex automation logic that did not perform as expected or as appropriate for the situation (61% of cases), (b) mode errors in which the wrong mode was accidentally selected or the system was accidentally disengaged (38% of cases), and (c) erroneous inputs that caused the automation to perform incorrectly for the situation (8% of cases).
Based on this history, Wickens (Onnasch et al., 2014; Sebok & Wickens, 2017) coined the term “lumberjack effect” to refer to the general finding that while automation may work well most of the time in typical or normal situations, the performance problems that occur in novel or unexpected situations also increase the likelihood of catastrophic errors. This generalization of the problem summarizes these real-world findings in which automation-related accidents arise in edge case situations that the automation may not be designed to handle, where it acts inappropriately due to inaccurate inputs, and where it adds extra complexity to the operators’ job. The challenge lies not just in the robustness and appropriateness of the systems’ programming, but also in many deficiencies in the design of the human-automation interaction and resultant deficiencies in SA. While not specific to just the OOTL problem, the lumberjack effect includes OOTL SA deficits within its purview.
Automation Failures
Skraaning and Jamieson (2023) believe that there is a need to better define the concept of automation failure in order to ensure that researchers more clearly understand the nature of the event, particularly in the context of automation-related human performance challenges. I agree with them that it is worthwhile to fully define what is meant by automation failure; however, their proposed taxonomy has several drawbacks. (1) It fails to distinguish causes from effects, falling short of providing either a taxonomy or a model of automation failure. It includes elements and systemic causes of automation failure, characterizing a number of different types of shortcomings that can occur within the automation (e.g., logic deficiencies, programming errors, and malfunctioning hardware) as well as inputs to the automation (such as sensor failures) and combinatorial effects from interactions between automated systems. It also includes a partial list of human-automation interaction problems, and a partial list of human/organizational slips or misconceptions. These categories overlap somewhat, however. For example, inadvertent activation/deactivation of automation is listed as a human/organizational slip (which may be a precipitating input to an accident), but it is also the outcome of poor human-automation interface design. An operator’s incorrect mental model may be due to inadequate training (listed as a human/organizational slip) or due to overly complex and hidden automation (listed as a human-automation interaction breakdown). Yet the fact that the automation logic is highly complex in the first place does not seem to be listed as a failure of the automation design itself. (2) It fails to provide a socio-technical view of error that can assist automation developers in avoiding negative outcomes, instead blaming the operator at the pointy end of the stick for errors (Woods et al., 2017).
For example, the human and organizational slips/misconceptions all cite the operator as performing improperly in some way without identifying these issues as a consequence of automation design features or organizational actions. (3) No evidence is provided that many of the factors in the taxonomy (i.e., different etiologies for automation failure) matter in terms of differential effects on human performance with automation. That is, while I agree that the list of potential automation problems may be a good one, and any or all of them may be at fault in a given accident, there is no evidence at present that the nature of human response to the automation fault will be different based on the 13 categories of failure listed. In the review of automation accidents in the aviation field provided by Gawron (2019), for example, only a subset of these automation problems is observed, and there is no clear difference in the subsequent outcomes.
Prevention of Automation-Induced Performance Failures
As a remedy to these shortcomings, Figure 1 shows an alternate depiction of automation failure, clearly distinguishing inputs from outputs as well as relevant intervention factors. The model depicts a number of factors that mediate between particular automation-related events and negative outcomes of concern. This approach more clearly illustrates that accidents are not a direct result of just an automation failure, but that a series of barriers to error are available (Reason, 2000). In this light, automation accidents are not the sole result of operator error, but rather induced by a series of predicating events and automation features as well as failures to provide tools, capabilities, policies, and training that support people in improving system safety under these conditions. This depiction more clearly emphasizes the specific actions that system developers and organizations can take to maintain operational safety by reducing the occurrence of human performance problems in the face of failure events.
Figure 1. Anatomy of automation failure: Barriers for the prevention of human performance degradation.
OOTL, SA, and Level of Automation—Tempests in Teapots
Skraaning and Jamieson (Jamieson & Skraaning, 2020a, 2020b; Skraaning & Jamieson, 2023) criticize the lumberjack effect on the basis of their inability to find it in a study they conducted in a nuclear power simulation with an automated aid for performing a checklist. They believe the reason for this is that their study was conducted in a high-fidelity simulator with experienced power plant operators (Jamieson & Skraaning, 2018).
Wickens et al. (2020) disputed their criticism of the lumberjack effect on the basis that their findings were limited due to (1) an automation failure that involved a malfunction in an underlying relief valve, not the automation itself, (2) reliance on an unvalidated measure of SA called the Important Parameter Assessment Questionnaire (IPAQ), (3) discounting of participants’ negative ratings of the high level of automation (LOA) conditions, which were consistent with lumberjack predictions and highly significant, (4) a lack of statistical power associated with a small sample size, and (5) a lack of comparison between routine and abnormal conditions.
While I agree with Skraaning and Jamieson (2023) that the automation failure condition in their study was relevant in terms of being representative of many automation failures, disputing point 1, I largely agree with Wickens et al.’s (2020) other four points. There are a number of issues, both methodological and conceptual, that significantly limit the generalizability of the Jamieson and Skraaning (2020a) study.
Methodological Limitations
Small Sample Size
The study involved only eight teams of operators each performing in four automation conditions, providing only eight measures of performance per condition. This is an extremely small number with which to detect differences between conditions. While Skraaning and Jamieson argue that somehow the power of the test is not important, this fails to address a key limitation of their study: perhaps eight measures of an event in each condition is simply insufficient for showing statistical significance between conditions? Another study cited by Skraaning and Jamieson as contradicting the lumberjack effect is Calhoun et al. (2009). Yet that study only had six participants, an issue that its authors credited for the performance measure failing to reach statistical significance even though the trends followed the expected direction of higher LOA yielding slower performance when the automation is unreliable (p = .16). Similarly, Cummings and Mitchell (2007) had only three participants in each LOA condition, yet still found that high automation (management by exception) was worse for performance (p = .07), particularly when there were a high number of replanning events. Skraaning and Jamieson do not provide any trend data.
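The sample size concern can be illustrated with a small simulation. The sketch below is not a reanalysis of any study discussed here: it assumes a simple paired comparison between two conditions, normally distributed difference scores, and hypothetical standardized effect sizes (Cohen's d) chosen only for illustration. The function name `simulated_power` and all parameter values are assumptions.

```python
import random
import statistics

# Two-sided critical t values at alpha = 0.05, keyed by degrees of freedom.
# Only the sample sizes used in this sketch are listed.
T_CRIT = {7: 2.365, 23: 2.069}


def simulated_power(n_teams, effect_size, n_sims=10000, seed=1):
    """Estimate the power of a paired t-test by Monte Carlo simulation.

    effect_size is a hypothetical standardized mean difference (Cohen's d)
    between two automation conditions; it is not taken from any study.
    """
    rng = random.Random(seed)
    t_crit = T_CRIT[n_teams - 1]
    hits = 0
    for _ in range(n_sims):
        # Each team contributes one difference score (condition A minus B).
        diffs = [rng.gauss(effect_size, 1.0) for _ in range(n_teams)]
        mean = statistics.fmean(diffs)
        sd = statistics.stdev(diffs)
        t_stat = mean / (sd / n_teams ** 0.5)
        if abs(t_stat) > t_crit:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0):
        print(f"d={d}: estimated power with 8 teams = {simulated_power(8, d):.2f}")
    print(f"d=0.5: estimated power with 24 teams = {simulated_power(24, 0.5):.2f}")
```

Under these assumptions, a medium-sized effect (d = 0.5) measured on eight teams is detected in well under half of the simulated experiments, which is far below the conventional 0.8 power target. A nonsignificant result from a sample of this size is therefore weak evidence that no effect exists.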
Limits of Testing
The Jamieson and Skraaning study involved eight teams (of three people each) performing in four scenarios. All scenarios involved a fault in which one valve remained open, varying only in terms of which valve was involved and why it failed. After the first fault, it is highly unlikely that the operators were surprised by, or were not expecting, a fault in the remaining scenarios. The authors do not report on any testing of order effects, which could easily have contributed to the lack of statistically significant differences between conditions. With four LOA conditions and only eight teams, attempts at counterbalancing for order effects will have been only partial at best.
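The arithmetic behind the counterbalancing concern can be sketched as follows. The condition labels are placeholders, and the cyclic Latin square is shown only as one common partial scheme; the text above does not say which scheme, if any, the study used.

```python
from itertools import permutations

# Placeholder labels for the four level-of-automation (LOA) conditions.
CONDITIONS = ("LOA1", "LOA2", "LOA3", "LOA4")
N_TEAMS = 8


def all_orders(conditions):
    """Return every possible presentation order of the conditions."""
    return list(permutations(conditions))


def cyclic_latin_square(conditions):
    """Return k orders in which each condition occupies each position once."""
    k = len(conditions)
    return [
        tuple(conditions[(row + col) % k] for col in range(k))
        for row in range(k)
    ]


orders = all_orders(CONDITIONS)
# Fraction of all possible orders that eight teams could cover at most.
coverage = N_TEAMS / len(orders)
square = cyclic_latin_square(CONDITIONS)

if __name__ == "__main__":
    print(f"possible orders: {len(orders)}")
    print(f"maximum share of orders covered by {N_TEAMS} teams: {coverage:.2f}")
    for row in square:
        print(row)
```

Four conditions can be presented in 24 different orders, so eight teams can cover at most one third of them. A Latin square balances the position of each condition using only four orders, but it leaves most orders untested, and the separate assignment of scenarios to conditions would need balancing as well.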
Poor SA Measure
Skraaning and Jamieson relied on IPAQ as a measure of SA. Yet this is an unvalidated measure that has not been used in any published literature. Their only citations for it are two unpublished internal documents that come from an organization that no longer exists and that are not publicly available. This measure, administered at the end of the scenario, asked operators to rate the importance of eight process parameters. As a key limitation, by assessing this information only at the end of the scenario, it is likely to only capture operators’ understanding at that point, well after participants may have figured out what was going on in the scenario (Endsley, 2021). Thus, it does not necessarily capture people’s SA at the time of the anomalous event or any delays in ascertaining what was really happening that could lead to negative outcomes in many domains. The measure therefore may suffer from significant hindsight bias. Further, a high degree of overlap is shown in the 95% confidence intervals on this measure between the four LOA conditions, casting some doubt on the strength of their findings of SA differences between conditions.
Other Relevant Factors
There are several other factors that also could have contributed to the lack of an LOA performance effect in the Jamieson and Skraaning study. The negative effects of high LOAs have been shown to be worse for continuous control tasks and those involving advanced queueing of tasks (Endsley & Kaber, 1999; Kaber et al., 2000), and when other tasks are present that compete for people’s attention (Kaber & Endsley, 2004). The Jamieson and Skraaning study, however, involved a checklist that operators were working through with different levels of automation. This type of task, particularly with no competing tasks, is less likely to fall prey to OOTL problems.
Secondly, it has been shown that display transparency can reduce or eliminate OOTL performance decrements (Bagheri & Jamieson, 2004; Bass et al., 2013; Bean et al., 2011; Dzindolet et al., 2002; Mercado et al., 2016; Selkowitz et al., 2017; Seppelt & Lee, 2007; Stowers et al., 2017) as well as SA problems (Boyce et al., 2015; Chen et al., 2014; Schmitt et al., 2018; Selkowitz et al., 2017). It is entirely possible that Jamieson and Skraaning did a good job in providing automation transparency with their automated checklist display design. While it cannot be stated categorically that either of these factors contributed to the lack of significant differences in failure detection performance between LOAs in their study, neither can they be ruled out with enough certainty to call the lumberjack effect into question.
Conceptual Errors
More than OOTL
Much discussion about the lumberjack effect is built around studies of OOTL. However, it is worth pointing out that automation-related performance problems result from not just OOTL, but also automation confusion and automation interaction errors. Wickens’s description of the lumberjack effect is not limited to OOTL; rather, catastrophic events can happen due to automation problems of varying kinds.
Probabilistic versus Deterministic Models
Much of Skraaning and Jamieson’s argument lies in the presumption that the “burden of proof for a prediction of human performance effects rests on those making the prediction,” and that even a few studies that have findings that contradict that model are enough to cast doubt on it. This viewpoint would only hold true if the lumberjack effect were deterministic rather than probabilistic. In fact, it is plainly obvious that automation in many cases works well much of the time and that people are often able to detect and correct for its shortcomings, or it would never be implemented at all. Rather, the lumberjack effect merely states that the probability of these OOTL events increases with automation. The more detailed research literature provides information on the many factors that affect that probability (e.g., LOA, competing tasks, time on task, task implementation, display transparency, and training) (Endsley, 2017b).
As for the burden of proof, not only does the wide body of research literature on this matter show a preponderance of evidence for the effect of LOA on OOTL from laboratory studies (Onnasch et al., 2014), but a large body of real-world examples (e.g., Gawron, 2019) also shows that these problems are not just the result of the artificialities of the laboratory, but are a significant problem in complex settings with experienced operators.
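The difference between a deterministic and a probabilistic reading can be made concrete with a short calculation. The sketch below assumes that studies are independent and share the same chance of detecting a real effect; both are simplifying assumptions, and the numbers are hypothetical rather than estimates from the literature.

```python
def prob_at_least_one_null(power, n_studies):
    """Probability that at least one of n independent studies misses a real effect.

    power is the assumed per-study probability of detecting the effect.
    """
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    if n_studies < 0:
        raise ValueError("n_studies must be non-negative")
    return 1.0 - power ** n_studies


if __name__ == "__main__":
    # Hypothetical values: ten studies, each with an 80% chance of detection.
    print(f"{prob_at_least_one_null(0.8, 10):.2f}")
```

Even with well-powered studies, a body of ten would be expected to contain at least one null result most of the time under these assumptions. A handful of null findings is therefore what a probabilistic effect predicts, not evidence against it.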
SA and LOA
Some confusion seems to exist on the effects of higher LOAs on SA. The research base includes studies that show increases in SA at higher LOAs (Endsley & Kaber, 1999; Ma & Kaber, 2005); however, most studies show decreases in SA at higher LOAs (Endsley & Kiris, 1995; Franz et al., 2015; Jipp & Ackerman, 2016; Kaber et al., 2000; Manzey et al., 2012; Sethumadhavan, 2009). Understanding this apparent discrepancy relies on looking more closely at the data.
High LOAs can have the advantage of freeing up cognitive resources so that people have more time to take in information, at least initially. But when secondary tasks are introduced, this potential benefit disappears (Kaber & Endsley, 2004; Ma & Kaber, 2005; Weaver & DeLucia, 2022). Engagement decreases and attention to other tasks increases over time, lowering SA on automation-related information (Carsten et al., 2012) and increasing complacency (Wickens et al., 2015). This is more likely with automation that is more reliable, which acts to increase trust (Wickens & Dixon, 2007). SA essentially can become more variable under increased automation, increasing with effort, but decreasing during periods of distraction and higher workload (Endsley, 2017a). Importantly, studies have shown a significant correlation between SA and manual take-over time following an automation problem (Clark et al., 2017; Sethumadhavan, 2009), demonstrating the importance of SA at the time that an event occurs for people’s ability to recover from the automation failure.
Thus, if SA actually did increase in the higher LOA conditions in Skraaning and Jamieson’s studies (which is questionable given their measure of SA), it would not be particularly damning with respect to the lumberjack effect. Rather, it is consistent with what would be expected in a study in which participants were not subject to dual tasking.
Task Complexity
Jamieson and Skraaning’s central rationale for not finding an OOTL effect centers on the fact that their study employed experienced operators in a high-fidelity simulation environment, claiming the lumberjack effect does not extend to complex environments. However, there is not much evidence to back up this claim. Two of the other studies cited as additional evidence by Jamieson and Skraaning also had a very small number of inexperienced participants, with their trends generally supporting the lumberjack effect (Calhoun et al., 2009; Cummings & Mitchell, 2007). The third study employed 20 experienced air traffic controllers using a conflict detection aid (Metzger & Parasuraman, 2005). It also supports the lumberjack effect, showing improved performance with the aid in normal situations (p = .01) and a trend toward better performance under manual control than with the automated aid when the aid was unreliable (p = .14). The authors felt this finding had practical significance in safety-critical environments, particularly since order effects were present and the less reliable condition was always presented last. SA was not measured in that study.
While many experimental research studies that have been done in this area employed either low- or medium-fidelity simulations, the large number of accidents and incidents from highly complex settings in the real world with experienced operators discussed at the beginning of this paper shows that the OOTL event occurs outside of laboratory settings as well.
Conclusions
Skraaning and Jamieson’s “failure to find automation failure” is far more the result of significant methodological issues, conceptual errors, and logical flaws than it is of any shortcomings in the well-documented OOTL phenomenon. There is little reason (in their words) to “throw the baby out with the bath water.” The reviews by Endsley (2017b) and Onnasch et al. (2014) of experimental research on this concept over the past 40 years add significant depth to our understanding of the OOTL phenomenon, which has been repeatedly demonstrated in complex, real-world settings, with sometimes catastrophic effects. Importantly, the lumberjack effect and the human-autonomy system oversight (HASO) model (Endsley, 2017b) demonstrate predictive validity, with a new crop of accidents being found for automobile automation of continuous control functions (Siddiqui & Merrill, 2023) and benefits for automation directed at improving SA (Cicchino, 2018).
This paper shows the robust effects of automation failures on human performance due to automation confusion, SA and OOTL problems, and automation interaction problems, with accidents often stemming from multiple contributing factors. Importantly, there are many interventions that are available to break the accident chain, as outlined in Figure 1. Automation developers and organizations should focus on improving the design of the automation and the human-automation interface, detailed testing of the combined human-automation system in difficult and novel scenarios, providing clear communications regarding system capabilities and limitations, setting appropriate policies, and providing needed training.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
