Abstract
Efforts to optimize automated alarm systems have introduced the use of supplementary graphical proactive monitoring displays (GPMDs) that promote salient deviation detection alongside alarms. This study proposes the incorporation of newer, adaptable automation techniques, which afford operators control over an automated alarm’s reliability, in combination with GPMDs to further optimize these systems. A 2 (choice in reliability) x 2 (supplementary display) case-control match design was used to examine effects on task performance, operator mood, trust in automation, and subjective workload using an open-source version of the NASA Multi-Attribute Task Battery (OpenMATB). Results indicate that experimental participants rated the automation significantly higher on a trust in automation subscale than those in the control group. The scarcity of negative findings indicates it would be worthwhile to continue investigating the proposed system adaptations to determine if there are benefits to be gained from their application in real-world scenarios where risk of failure is especially dangerous.
Background
Allocation of tasks to machine execution, hereafter referred to as automation, is ubiquitous in highly complex, high-risk environments where work is either too dangerous for humans or where humans are less precise or reliable than automation (e.g., nuclear power plants, aviation; Parasuraman, Sheridan & Wickens, 2000). The degree to which automation controls information gathering, decision making, or action selection for a given task is referred to as degree of automation (DOA). DOA can range from absolutely no computer assistance (zero DOA) to not requiring any human control (very high DOA).
Automated alarm systems are one of the most ubiquitous forms of automation at the low end of the DOA continuum and are commonly implemented as a binary decision aid for human operators in complex work environments. Alarm systems remain silent (e.g., no sound, or a green light) during nominal system performance, and activate (e.g., an alarm sound, or a red light) when automation detects deviations from nominal performance (Zirk et al., 2020).
An automated alarm’s reliability is determined in part by its detection threshold, or the criterion value that determines whether an alarm will sound to indicate a system deviation (Parasuraman & Hancock, 1999). Detection thresholds are typically categorized as liberal or conservative. Liberal threshold alarms activate in response to even weak or small system deviations, decreasing the likelihood of missed system deviations (i.e., misses), and increasing the likelihood both of alerting when there is a system deviation (i.e., true alarms) and of alerting when there is no system deviation (i.e., false alarms; FAs). The opposite is true for conservative threshold alarms, which require a greater system deviation to sound an alarm. Detection thresholds can be quantified by their positive predictive value (PPV), or the probability that there is a true critical event or system deviation from nominal performance when an alarm activates. PPV is calculated by dividing the number of true alarms by the total number of alarms emitted (Getty et al., 1995). Alarms with a liberal threshold correspond to a PPV around 50% or below (i.e., at or below chance), and conservative threshold alarms correspond to a PPV of about 80% or higher.
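The trade-off between liberal and conservative thresholds described above can be illustrated with a minimal Python sketch. The sensor readings, the thresholds of 0.4 and 0.6, and the function name are hypothetical values chosen only for illustration; they are not taken from the study.

```python
def count_outcomes(readings, deviation_present, threshold):
    """Tally outcomes for an alarm that fires whenever reading >= threshold."""
    counts = {"true_alarms": 0, "false_alarms": 0, "misses": 0}
    for reading, is_deviation in zip(readings, deviation_present):
        alarm_on = reading >= threshold
        if alarm_on and is_deviation:
            counts["true_alarms"] += 1
        elif alarm_on:
            counts["false_alarms"] += 1
        elif is_deviation:
            counts["misses"] += 1
    return counts


# Hypothetical signal strengths (arbitrary units); the first five events
# are real deviations, the last five are noise.
readings = [0.9, 0.7, 0.6, 0.4, 0.8, 0.5, 0.3, 0.45, 0.2, 0.65]
deviation_present = [True] * 5 + [False] * 5

# A liberal (low) threshold catches every deviation but emits more FAs.
liberal = count_outcomes(readings, deviation_present, threshold=0.4)
# A conservative (high) threshold emits fewer FAs but misses weak deviations.
conservative = count_outcomes(readings, deviation_present, threshold=0.6)

print(liberal)
print(conservative)
```

With these illustrative values, lowering the threshold removes the miss at the cost of two additional false alarms, which is the pattern the paragraph above describes.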
Low reliability alarms would be expected to be preferred in complex work environments because their lower likelihood of misses should prevent critical events from going unnoticed. However, the corresponding increased likelihood of false alarms with low reliability alarms often results in alarm fatigue (Sendelbach & Funk, 2013), in which operators respond to alarms more slowly as the number of FAs increases, so much so that they may even stop responding to alarms entirely (Wickens & Colcombe, 2007), thus negating the potential fail-safe benefits to be gained from using a low reliability alarm. As such, high reliability alarms have often been found to yield the best system performance (Zirk et al., 2020) and fastest operator response times to alarm activation (Bustamante, Bliss & Anderson, 2007) when compared to low reliability alarms.
Graphical Proactive Monitoring Displays (GPMDs)
To optimize automated alarm systems, some studies have investigated the use of GPMDs that accompany the existing alarm systems and are designed to promote salient deviation detection (Burns, 2006). GPMDs have been found to be effective at reducing alarm response time and mental workload while increasing secondary task performance, when compared to text-based or no proactive monitoring (Hwang et al., 2008). These reported benefits of GPMDs could be attributed to their successful application of Ecological Interface Design (EID) principles, with GPMDs clearly showing alarm “trip points,” where important values or thresholds (i.e., normal vs. dangerous values) are clearly illustrated for tasks where nominal system operation is important to monitor. These trip points assist operators in decision making when responding to alarms (Hajdukiewicz & Burns, 2004), and have been found to be related to faster problem detection times (Zinser & Frischenschlager, 1994). Psychologically, the incorporation of these displays can also be expected to reduce the alarm fatigue induced by low reliability alarms (Manzey, Gérard, & Wiczorek, 2014).
Adaptable Automation
Traditional applications of automation use a fixed approach, where automation is set to a predetermined DOA that is maintained throughout system operation for each task. The application of automation is known to have an effect on the operator’s trust in automation, or, “the attitude that an agent will help achieve an individual’s goals in a situation characterized by uncertainty and vulnerability” (Lee & See, 2004, p. 51). A newer, flexible approach to optimizing automation applications is adaptable automation which provides the operator control over any DOA changes made throughout system operation (Oppermann, 1994). Some benefits of adaptable automation include increased operator acceptance of automation and more accurately calibrated trust in automation (Parasuraman & Wickens, 2008), likely due to the increased observability of the automation (Niederée et al., 2012) achieved through operators’ feedback control over the system.
Adaptable automation is relatively understudied, however (Calhoun, 2022), and may introduce a new task that increases operator workload (Kirlik, 1993). During high workload, operators could be so busy that they do not make DOA changes at all or, if they do, might make an inappropriate DOA change (Kaber & Riley, 1999). Given the potentially dangerous consequences of an inappropriate DOA change in high-risk work environments, what is needed is a form of adaptability that yields benefits while maintaining safety by avoiding inappropriate DOA changes.
Recent work examining the effects of giving participants choice over automation reliability indicates both increased system performance and trust in the system, and also decreased workload when compared to a fixed application of automation (Wiczorek & Zirk, 2019). Therefore, these results indicate that adaptability in additional automation characteristics is a safer modality that can yield some of the same benefits of DOA adaptability, for which alarm systems are a fitting testbed.
Although findings separately support the beneficial use of adaptable automation and proactive monitoring, no work has yet been conducted to investigate their combined use.
Human-Automation Interaction
Many studies investigating human-automation interaction consider mainly workload, situation awareness, and system state (nominal or failure), failing to consider the effects of other characteristics of automation and operator wellbeing as they relate to overall system performance. Work in related areas could hopefully begin to elucidate the relationship between DOA and system performance.
Automation and operator wellbeing
A person’s affect has been found to be a powerful factor that influences how operators interact with automation (Merritt, 2011), with positive mood and extraversion positively influencing performance (Niederée et al., 2012). Limited research has investigated the opposite relationship: the influence of automation on mood. Some scholars suggest that system designs that take into consideration a person’s affect could very likely positively influence productivity and operator acceptance of the automation (Norman, Ortony, & Russell, 2003), which makes it important to understand how automation affects mood.
Trust in automation
Trust is an attitude that has a significant influence on how an operator interacts with automation, and the extent to which operators appropriately rely on automation is related to their trust in the system.
Lee and See (2004) propose that trust is developed through a closed-loop process, and development of trust in automation is therefore reliant on the observability of the automation. With low DOA, operators can observe automation even when it is not in use (e.g., when an alarm is silent) and can verify its validity with raw data that is readily available (Wickens, Gempler, & Morphew, 2000), making operators more likely to rely on these earlier stages of automation. With high DOA, however, operators can only observe decision automation once they are relying on it (e.g., automation that only activates and executes a task when a certain criterion is met, like automatically activating a fuel pump when fuel is low). This decreased observability is also often associated with decreased levels of trust (Lee & Moray, 1994) and therefore decreased reliance on automation by the operator. Together, these findings indicate that lower DOAs are associated with increased levels of trust because of their increased observability compared to high DOAs.
The positive relationship between low DOA and operator trust implies there could be benefits to investigating new ways for how automation can be implemented to allow for operators to experience the performance benefits of high DOA but with the affective benefits of low DOA.
The Present Study
This proof-of-concept study aims to correct the present automation-centered approach to automation applications to further optimize automation in highly complex work environments. If the bodies of work surrounding operator wellbeing, trust, and mood are taken into consideration alongside reliability and performance, the body of evidence suggests that a correct combination of automation features and application could yield a system with lower-stage automation that performs as well as a system with late-stage automation. The following hypotheses were proposed:
Hypothesis 1: The incorporation of operator control over alarm threshold will improve performance and operator psychological outcomes (i.e., positive mood, subjective workload, and trust) as compared to their case-control match.
Hypothesis 2: The incorporation of GPMD alongside alarms will improve performance and operator psychological outcomes (i.e., positive mood, subjective workload, and trust) as compared to those with text-based proactive monitoring.
Hypothesis 3: Those with both a graphical proactive monitoring display and control over alarm threshold will show improved performance and operator psychological outcomes (i.e., positive mood, subjective workload, and trust) as compared to those with text-based proactive monitoring and no control over alarm threshold.
Methods
Participants
Undergraduate participants were awarded course credits for participation. To date, the sample consists of 26 undergraduate students with a mean age of 18.53 years (SD=1.02), who were 53.84% male, 76.92% White, and 80.76% Non-Hispanic or Latino.
All participants received 3 course credits for participating in the study. To encourage performance and simulate a high-stakes environment, it was advertised that two participants were eligible to win a $50 Amazon gift card if they achieved one of the two highest scores on the flight simulator. Participants’ scores from their final trial served as the basis for these rewards.
Flight Simulator Tasks
An open-source version of the NASA Multi-Attribute Task Battery (NASA MATB), OpenMATB, was used for the experimental task environment (Cegarra et al., 2020). OpenMATB included the following tasks:
Altitude Monitoring Task
The original system monitoring task was adapted in OpenMATB to fit the needs of the experiment while maintaining the basics of a monitoring task. The task was reworked to become an Altitude Monitoring Task, in which participants monitored and corrected deviations from a safe flying altitude of 30,000 feet, with deviations of more than 5,000 feet in either direction considered unsafe. The following pseudo-random formula was used to simulate the aircraft’s altitude:
Across each 15-minute trial, a total of 12 system deviations required correction by participants.
All participants worked alongside two light-based alarms. Alarms F1 and F2 alerted to altitude deviations 5,000 feet above and below a safe flying altitude, respectively. An alarm would change from green to red to alert to a system deviation. To correct a system deviation, participants pressed the corresponding key on their keyboard (F1 or F2). The Altitude Monitoring Task thus provided participants with automation assistance in completing the task.
Tracking Task
The Tracking Task simulated dynamic control of the tilt of the aircraft using a joystick controller. Participants were instructed to maintain a cursor as close to the center of two axes as possible.
Communications Task
The Communications Task simulated communication with air traffic control. Participants were assigned a call sign at the start of each trial (e.g., ABC123). Throughout the trial, “air traffic control” would make requests over computer speakers. Participants would hear both their own call signs and other call signs intermittently throughout the trial, and were instructed to only respond to requests made to their own call signs.
Resource Management Task
The Resource Management Task simulated fuel management in an aircraft. Participants moved fuel between tanks by toggling fuel pumps on and off using keys 1 through 8 on their keyboard. Fuel pumps could temporarily fail, preventing fuel transfer. Participants were instructed to keep fuel levels in the main tanks (Tanks A & B) between 2500 and 3000 gallons at all times.
Experimental Design
A 2 (Display) x 2 (Choice in Threshold) case-control design was used to investigate effects of adaptable automation and GPMD on operator performance, subjective workload, trust in automation, and mood.
Display Condition
Participants had constant access to a proactive monitoring display to assist them in identifying unsafe deviations in flying altitude. Those in the control group were presented with a numerical display, simulating the form of information a pilot typically has access to (e.g., an altimeter), that showed the aircraft’s deviation from a safe flying altitude in thousands of feet, such that any value outside the range of -5 to 5 would warrant intervention to correct the plane’s altitude.
Those in the experimental group were presented with a GPMD, where the y-axis represented deviation from a safe flying altitude and the x-axis represented time. At the start of the trial, the display would begin to plot the true altitude of the plane, updating every second. The graphical display had two threshold lines that denoted 5,000 feet above and below a safe flying altitude. The plotted altitude line would remain green whenever it was between the two thresholds and turn red when it crossed either threshold.
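To make the two display conditions concrete, the following minimal Python sketch converts an altitude into the deviation value shown on the numerical display and the line color used on the GPMD. The function names and the treatment of a deviation of exactly 5,000 feet as safe are assumptions made for illustration; this is not the OpenMATB implementation.

```python
SAFE_ALTITUDE_FT = 30_000   # safe flying altitude used in the task
THRESHOLD_KFT = 5.0         # threshold lines at +/- 5,000 feet


def deviation_kft(altitude_ft, safe_altitude_ft=SAFE_ALTITUDE_FT):
    """Deviation from safe altitude in thousands of feet (text display value)."""
    return (altitude_ft - safe_altitude_ft) / 1000


def line_color(dev_kft, threshold_kft=THRESHOLD_KFT):
    """GPMD line color: green between the thresholds, red once one is crossed.

    Assumption: a deviation exactly on a threshold line still counts as safe.
    """
    return "green" if -threshold_kft <= dev_kft <= threshold_kft else "red"


# Hypothetical altitude samples, one per display update.
for altitude in (30_000, 34_200, 36_500, 24_000):
    dev = deviation_kft(altitude)
    print(f"{altitude} ft -> {dev:+.1f} ({line_color(dev)})")
```

Both displays carry the same underlying value; the sketch only highlights that the graphical version adds an explicit visual cue when a threshold is crossed.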
Choice in Alarm Threshold
A case-control match design was used within each proactive monitoring condition (GPMD vs. text-based). Participants in the experimental condition had the option at the start of each trial to choose the reliability at which their alarm would operate: 60% or 80%. For each participant who completed all four trials in the experimental condition, a new participant was assigned as their control match. The control match performed their four trials at the same thresholds their experimental counterpart chose, without knowing which threshold the alarm was operating at or whether it changed between trials.
All trials, regardless of the alarm threshold, had a total of 12 actual system deviations that participants were required to respond to. Thresholds varied by the number of false alarms and misses. All 60% threshold trials had a total of 18 alerts, with 11 true alarms, 7 false alarms, and 1 miss. All 80% threshold trials had a total of 9 alerts, with 7 true alarms, 2 false alarms, and 5 misses.
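As a check on the trial composition above, the following minimal Python sketch computes the PPV (true alarms divided by total alerts) and the proportion of actual deviations that triggered an alarm for each threshold. The function and variable names are illustrative assumptions; the counts are those reported in the text.

```python
def alarm_stats(true_alarms, false_alarms, misses):
    """Summarize one trial type from its alarm outcome counts."""
    alerts = true_alarms + false_alarms      # every alarm activation
    deviations = true_alarms + misses        # every actual system deviation
    return {
        "alerts": alerts,
        "deviations": deviations,
        "ppv": true_alarms / alerts,
        "hit_rate": true_alarms / deviations,
    }


threshold_60 = alarm_stats(true_alarms=11, false_alarms=7, misses=1)
threshold_80 = alarm_stats(true_alarms=7, false_alarms=2, misses=5)

for label, stats in (("60%", threshold_60), ("80%", threshold_80)):
    print(f"{label}: alerts={stats['alerts']}, deviations={stats['deviations']}, "
          f"PPV={stats['ppv']:.3f}, hit rate={stats['hit_rate']:.3f}")
```

Under these counts, both trial types contain 12 actual deviations, and the resulting PPVs (about 61% and 78%) approximate the nominal 60% and 80% labels.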
Training
Participants were given as much time as necessary to read through a training manual that contained instructions on how to perform each task. Participants were allowed to ask clarifying questions before beginning a 10-minute training session on the computer to gain familiarity with the system.
Measures
NASA Task Load Index (NASA-TLX)
A shortened, 6-item version of the NASA-TLX was used to measure subjective workload at the end of every 15-min trial.
Trust in Automation
A 13-item measure developed by Lee and See (2004) assessed trust in automation at the end of each trial. The measure consists of 3 subscales: performance, process, and purpose.
Profile of Mood States
A shortened, 29-item version of the Profile of Mood States created by Petrowski and colleagues (2021) was used to assess participant mood on five subscales: vigor, anger, tension, confusion, and fatigue.
Altitude Monitoring Task
Each participant’s score was the number of the 12 actual system deviations per trial that were corrected within 5 sec of the alarm activating.
Tracking Task
Cursor position along the X and Y axes was sampled at 50 Hz. Root mean square error (RMSE) was calculated every second, with both the X and Y coordinates entered into a distance formula. RMSEs for each sec were averaged across each minute, and minute scores were averaged across a 15-minute trial for average RMSE, with lower scores (less error) indicating better performance.
Communications Task
Twenty air traffic control requests were made each trial, 10 of which used the participant’s own call sign and required a response. Points were awarded for correctly responding to requests made to their own call sign and for not responding to other call signs.
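The Tracking Task scoring (per-second RMSE, averaged within each minute, then across the trial) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: cursor samples are expressed as (x, y) offsets from the center of the axes, the distance formula is the Euclidean distance from center, and the function names are hypothetical rather than taken from OpenMATB.

```python
import math
from statistics import mean

SAMPLE_RATE_HZ = 50


def second_rmse(samples):
    """RMSE of cursor distance from center over one second of (x, y) samples."""
    return math.sqrt(mean(x * x + y * y for x, y in samples))


def tracking_trial_rmse(samples, sample_rate=SAMPLE_RATE_HZ, seconds_per_minute=60):
    """Average per-second RMSEs within each minute, then average the minutes."""
    per_second = [
        second_rmse(samples[i:i + sample_rate])
        for i in range(0, len(samples), sample_rate)
    ]
    per_minute = [
        mean(per_second[i:i + seconds_per_minute])
        for i in range(0, len(per_second), seconds_per_minute)
    ]
    return mean(per_minute)


# Hypothetical data: two minutes with the cursor held at offset (3, 4),
# i.e. a constant distance of 5 units from center.
demo_samples = [(3.0, 4.0)] * (SAMPLE_RATE_HZ * 120)
demo_score = tracking_trial_rmse(demo_samples)
print(demo_score)
```

A full 15-minute trial at 50 Hz would supply 45,000 samples to the same function; lower returned values indicate better tracking.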
Resource Management Task
Fuel levels for Tanks A & B were recorded every two seconds. RMSE was calculated every minute; values between 2500 and 3000 were assigned a score of 0, with Tanks A & B levels entered into one formula. RMSE scores were averaged to yield a final trial average, with lower scores indicating better performance.
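The Resource Management scoring can be sketched in Python in a similar way. This sketch assumes that the error for a sample outside the target band is its distance to the nearest band limit and that both tanks are pooled into a single RMSE per minute; the text does not spell out either detail, and the function names are hypothetical.

```python
import math
from statistics import mean


def band_error(level, low=2500.0, high=3000.0):
    """Zero inside the 2500-3000 band, else distance to the nearest limit."""
    if level < low:
        return low - level
    if level > high:
        return level - high
    return 0.0


def fuel_trial_rmse(tank_a, tank_b, samples_per_minute=30):
    """Pool both tanks into one RMSE per minute, then average the minutes.

    Levels are sampled every two seconds, giving 30 samples per minute.
    """
    minute_scores = []
    for start in range(0, len(tank_a), samples_per_minute):
        end = start + samples_per_minute
        window = list(tank_a[start:end]) + list(tank_b[start:end])
        minute_scores.append(math.sqrt(mean(band_error(v) ** 2 for v in window)))
    return mean(minute_scores)


# Hypothetical two-minute record: Tank A runs 100 gallons low for the
# first minute, and both tanks are in range otherwise.
demo_tank_a = [2400.0] * 30 + [2700.0] * 30
demo_tank_b = [2700.0] * 60
demo_fuel_score = fuel_trial_rmse(demo_tank_a, demo_tank_b)
print(round(demo_fuel_score, 2))
```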
Results
Within-Subjects Analyses
Paired t-tests compared the effect of participant control over alarm threshold on performance, subjective workload, mood, and trust in automation for each group of case-control pairs, for all four trials. MATB program malfunctions in Trials 3 and 4 resulted in loss of data from four participants; data from Trials 1 and 2 were used instead. Two participants who lacked a case-control match were also excluded from these analyses. Therefore, a total of 16 participants in 8 case-control matched pairs were used for these within-subjects analyses.
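For readers less familiar with the matched-pairs analysis, the following minimal Python sketch shows how a paired t statistic is formed from the within-pair differences. The scores are made-up placeholder values, not study data, and the function name is hypothetical; in practice a statistics package would also supply the p-value.

```python
import math
from statistics import mean, stdev


def paired_t(experimental, control):
    """Return (t, df) for a paired t-test on matched score lists."""
    if len(experimental) != len(control):
        raise ValueError("each experimental score needs a matched control score")
    diffs = [e - c for e, c in zip(experimental, control)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # SE of the mean difference
    return mean(diffs) / standard_error, n - 1


# Placeholder scores for four hypothetical case-control pairs.
demo_experimental = [5, 6, 7, 8]
demo_control = [4, 4, 6, 5]
demo_t, demo_df = paired_t(demo_experimental, demo_control)
print(round(demo_t, 3), demo_df)
```

With the 8 matched pairs analyzed here, the same calculation would have 7 degrees of freedom.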
Performance Data
No significant differences on the altitude monitoring, resource management, tracking, or communications tasks were observed between case-control matches in either experimental group, across all four trials.
Survey Data
No significant differences in mood, subjective workload, or trust in automation were observed between case-control matches in either experimental group.
Between-Subjects Analyses
A series of independent-samples t-tests was performed to compare the main effect of type of proactive monitoring (graphical vs. text) on performance, subjective workload, mood, and trust in automation. The inclusion criteria are stated above. A total of 8 case-control matched pairs and 2 participants without a case-control match were included.
Independent-samples t-tests were also used to compare the true experimental group (control over alarm threshold and graphical display) to the true control group (no control over alarm threshold, text display). Five true experimental participants (i.e., graphical display and choice in threshold) and one true control participant (i.e., text-based display and no choice in threshold) were included in these analyses.
Performance Data
Performance on the altitude monitoring, resource management, tracking, and communications tasks did not vary significantly by type of proactive monitoring or between the experimental and control groups across all four trials.
Survey Data
Participants in the graphical proactive monitoring condition reported significantly greater subjective workload than those in the text proactive monitoring condition in Trial 1 (graph mean=4.73, graph SD=0.82, text mean=3.79, text SD=0.41, t(13.858)=-3.157, p=0.007, Cohen’s d=0.67) and Trial 2 (graph mean=4.96, graph SD=1.00, text mean=3.64, text SD=1.21, t(16)=-2.525, p=0.01, Cohen’s d=1.10). Subjective workload did not vary significantly between the experimental and control groups.
A significant main effect of proactive monitoring on tension was observed in Trial 2, with participants with the graphical display reporting being significantly more tense (mean=2.08, SD=1.00) than those with the text display (mean=0.85, SD=0.31; t(16)=-2.718, p=.015, Cohen’s d=0.95). No main effect of proactive monitoring on anger, vigor, confusion, or fatigue was observed across all four trials, nor were there any significant differences between experimental and control participants.
Trust in Automation
No significant main effects of type of proactive monitoring on performance, process, or purpose were observed across all four trials. Participants in the experimental group scored the automation significantly higher in purpose (mean=3.53, SD=.76) than those in the control group (mean=1.00, SD=0.00, t(4)=3.014, p=.039, Cohen’s d=.76) in Trial 4 only. No other significant differences in performance, process, or purpose ratings were observed between the experimental and control group across all four trials.
Discussion
Partial support was provided for Hypothesis 3, such that participants with control over alarm reliability and access to a graphical proactive monitoring display rated the automation significantly higher than their control counterparts on the purpose subscale of the Trust in Automation (TIA) measure. This subscale assesses participants’ beliefs about why the automation was created, and specifically whether the automation is being used in line with the designers’ intent. Findings indicate that experimental participants felt they had a better understanding of why the automation was created and that they were using it correctly and in accordance with its intended design. A lack of full support for Hypothesis 3 could be attributed to the limited time participants were able to spend with the simulator. According to the literature, operator trust in automation has been found to be based first on purpose because there is initially only a short history of performance to base further judgements on (Hoc, 2000). But as experience with automation increases, trust can then be built on experiences of dependability and predictability.
Interestingly, the opposite of the hypothesized outcomes was found for mood and subjective workload in relation to use of the GPMD. Participants with access to the GPMD reported significantly greater subjective workload and tension than those with text-based proactive monitoring. These effects were only observed in earlier trials (Trials 1 & 2), suggesting that they could be transient and dependent on the learning curve when operating a new system. Further, participants in this study completed only a 10-minute training session before beginning their first experimental trial. The lack of these same significant effects in later trials (Trials 3 & 4) could be explained by participants becoming more comfortable with the graphical proactive monitoring display over time. Compared to other studies using the MATB simulator (e.g., Singh & Heard, 2022), where participants used the simulator for two sessions lasting about one hour each, the present study gave participants much less time to become accustomed to the graphical proactive monitoring display before asking them to give their best performance.
Although results did not indicate significant improvements in performance, mood, workload, or trust in automation with the graphical proactive monitoring display, it is worth noting that few significant negative outcomes were found. Because of the early nature of work surrounding adaptable automation, lack of significant negative findings here might indicate that negative effects of automation were mitigated by use of the graphical proactive monitoring display. It would therefore appear to be worthwhile and interesting to continue to investigate the proposed system adaptations further in order to determine if there are benefits to be gained from their application in real-world scenarios where risk of failure in complex systems is especially dangerous.
Acknowledgements
This paper was supported by Grant Number T01-OH008610 from CDC–NIOSH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIOSH. For further information, please contact the corresponding author, Theresa Parker, at
