Abstract
This study aims to understand the development of users’ mental models (MMs) over time. We use behavioral data obtained from process tracing to identify key components of MMs and their relative importance. Further, we investigate the stability and predictability of these components as users learn through system interaction. Human-in-the-loop experimentation was deployed in a dynamic geospatial environment and six information attributes were provided to inform participants’ decisions. Partial Least Squares Regression was used to relate behavioral data and decision-making outcomes. We found that top-most performers initially adapt and progressively stabilize toward a suitable model as performance improves. In contrast, low performers lack adaptability and perform poorly. Overall, most participants are consistent with their choices as task familiarity increases. Identifying MMs and the underlying stability and predictability trends within performance groups has implications for improving user experience and curating decision support tools for human-AI teams.
Introduction
Humans are often placed in situations where they must make decisions in complex domains. They are overloaded with information and faced with a large decision space with uncertainty in the environment. One tool decision-makers use for dealing with uncertainty is to construct models that capture their fundamental understanding of the system. Morecroft (1983) was an early investigator of how researchers could model mental models (MMs) in humans by constructing hypothetical inputs and outputs to a system. The MM construct is used by psychologists and engineers to explain cognitive functioning and human system performance (Converse, Cannon-Bowers, & Salas, 1993). It enables people to infer important information, understand phenomena, and make decisions (Johnson-Laird & Nicholas, 1983). Decision-makers’ MMs capture their fundamental understanding of the system by defining their perceptions of the system, process, and related elements (Carley & Palmquist, 1992). Evidence suggests that MMs influence decision-making (DM) by relating strategies and choices to the environment (Barr, Stimpert, & Huff, 1992; Gary & Wood, 2011; Osborne, Stubbart, & Ramaprasad, 2001). It is understood that, in complex environments, decision-makers tend to adopt heuristics that are consistent with their simplified MM of the environment (Gigerenzer & Gaissmaier, 2011; Simon, 1991). MMs are differentiated from other cognitive constructs like schema and decision strategies because they are dynamic representations of systems (Jones, Ross, Lynam, & Perez, 2014) and thus cannot be fully represented by analyzing one-off decisions/outcomes. They guide behaviors and DM (Goldberg, Gustafson, & van der Linden, 2020), and allow decision-makers to adopt strategies for interaction with systems (Rouse & Morris, 1986).
Most documented methods elicit MMs using subjective/introspective methods (Cooke & Rowe, 1994). These methods do not render objectivity and make it difficult to measure the different aspects of MMs. Further, MMs are difficult to assess unobtrusively and the elicitation method can actually alter the MM (Jones et al., 2014). Alternative techniques that use objective methods for MM elicitation are less common and subsequently less validated. Understanding a user’s MM can help identify disparities in information between individuals and misperceptions of the system (Broek, 2018). This information can be implemented in AI-decision support systems to improve DM and reduce cognitive load on the user.
In a previous study, Walsh and Feigh (2022b) proposed an objective method to assess the general model of DM using real-time, observable behavioral data. The classification of decision models based on behavioral data is relevant in most cases because the data can be easily collected in most processes and the ground truth labels are generally unknown. In this study, we build upon Walsh and Feigh (2022b) by temporally extending the set of decision tasks from 10 updates to 30. At each update, the information sources are updated and participants must choose an optimal location based on the information available at that update. This process is carried out 30 times to simulate the progression of a storm. This is done to capture the dynamic nature of the users’ MM, rather than a static model that does not account for the learning/model developmental processes. It also allows us to examine the stability of MMs as users develop a deeper understanding of the system through feedback. We believe that the stability aspects of MMs render opportunities for AI intervention with conclusive decision aids. Thus, we seek to answer the following research questions:
Can we observe the dynamic development of humans’ MM of the task using process tracing in a complex geospatial decision environment?
Do MM components stabilize with task progression? If yes, does this trend render predictability to human behavior as task familiarity increases?
Methodology
In this study, we utilize an empirical evaluation of human DM behavior in a temporally extended version of the geospatial disaster relief environment used by Walsh and Feigh (2022b).
Experimental Task
Participants in our experiment were tasked as disaster relief planners for an artificial environment that simulates the progression of an oncoming storm in a city. The participants must decide where to place resources (e.g., essentials such as food, water, and medicine) throughout the progression of the storm based on information from six key attributes. Of the six, three are static and three are dynamic. Static attributes remain the same throughout, while dynamic ones change with each update. The static attributes are ‘Socioeconomic Status’ (SES), ‘Population Density’, and ‘No-go zones’; the dynamic attributes are ‘Storm’, ‘Flooding’, and ‘Power Outages’. The goal is to choose a location that optimizes the utility of these attributes. As the task progresses, the storm is updated 30 times and participants are responsible for placing the resource at every update.
The user interface consists of two key areas, as shown in Walsh and Feigh (2022b). Participants may choose to review one information attribute at a time. Each attribute, when clicked on, displays heat maps with the best and worst areas indicated by green and black/brown/dark red colors respectively.
The optimal location for the resource at each update is based on equally weighing information from all six attributes. Once the resource location is determined, participants may click submit and proceed to the next update. All dynamic attributes change at each update and this process is repeated 30 times. The participants are scored using an equal weighting strategy: the cue value of a location is the sum of its utility values across attributes, and a min-max normalization is then applied. The utility measure indicates how good the choice of a grid space pixel is. The utility values were scored in 15 color swatches at each pixel, where black/dark red and green areas have associated utilities of 0 and 1 respectively (Walsh & Feigh, 2022b). At the start, participants are informed about how their performance will be measured. After the submission of their decision for each update, feedback in the form of a percentage score is provided. The score (%UtChoice) for each update ranges from 0 to 100, indicating the lowest and highest possible scores respectively.
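The scoring rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the four-cell toy grid, and the choice to normalize over all cells of a single update are assumptions made for illustration only.

```python
def utility_surface(attribute_maps):
    """Sum per-cell utilities across attributes (equal weighting)."""
    n_cells = len(attribute_maps[0])
    return [sum(m[i] for m in attribute_maps) for i in range(n_cells)]


def percent_ut_choice(attribute_maps, chosen_cell):
    """Min-max normalize the summed utility and return a 0-100 score."""
    total = utility_surface(attribute_maps)
    lo, hi = min(total), max(total)
    if hi == lo:  # flat surface: every cell is equally good
        return 100.0
    return 100.0 * (total[chosen_cell] - lo) / (hi - lo)


# Hypothetical update: three attributes over four grid cells,
# each utility already scaled to the range 0 (worst) to 1 (best).
maps = [
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.5, 1.0, 1.0],
]
scores = [percent_ut_choice(maps, c) for c in range(4)]
```

Here the cell that is best on every attribute scores 100 and the cell that is worst on every attribute scores 0; intermediate cells fall in between in proportion to their summed utility.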
Measures
The performance of each individual is determined by %UtChoice which identifies the utility of their decision outcome at each update. Each location on the decision surface corresponds to a utility value that signifies how appropriate it is to place the resource, at that location, based on information from all six attributes. Participants are scored on each task update. The average performance of each participant is the arithmetic mean of %UtChoice they scored across all updates.
The key components in the MM of each task update are identified by examining the behavioral data across a stipulated window/time frame. Unlike Walsh and Feigh (2021), who used all data points to characterize the key MM components of each participant, this work explores a sliding window technique to capture the dynamic variations in users’ MMs as task understanding changes. Window sizes were restricted to factors of 30 so that the 30 updates divide evenly into windows. Thus, the candidate window sizes were 3, 5, 6, etc. Two types of windows were constructed: overlapping and consecutive (Figure 1). In the case of overlapping windows of size 5, window 1 uses data points 1-2-3-4-5, window 2 uses points 2-3-4-5-6, and so on. For consecutive windows, window 1 is classified using data points 1-2-3-4-5, window 2 using 6-7-8-9-10, and so on. Key components are identified for each window.

Figure 1. Types of windows for strategy classification: consecutive and overlapping. Consecutive windows do not use any overlapping data points.
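The two window types can be sketched as index generators over the 30 updates. The function names are hypothetical, and update numbers are 1-indexed to match the description above.

```python
def consecutive_windows(n_updates, size):
    """Non-overlapping windows: 1-5, 6-10, ... (1-indexed update numbers)."""
    return [
        list(range(start, start + size))
        for start in range(1, n_updates + 1, size)
        if start + size - 1 <= n_updates  # drop an incomplete trailing window
    ]


def overlapping_windows(n_updates, size):
    """Windows that slide forward one update at a time: 1-5, 2-6, ..."""
    return [
        list(range(start, start + size))
        for start in range(1, n_updates - size + 2)
    ]


consecutive = consecutive_windows(30, 5)
overlapping = overlapping_windows(30, 5)
```

With 30 updates, a window size of 5 gives 6 consecutive windows and 26 overlapping ones, whereas a window size of 6 gives only 5 consecutive windows.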
The convergence of MMs is defined by measuring the similarity of key components of each window with those of the final window. The similarity between key components of any two windows is calculated using Levenshtein Distances (LD) (Levenshtein, 1965). Lower LD indicates higher similarity between windows of comparison. We use convergence as a metric to quantify the stability of MMs. Further, the similarity of key components evolving over time is measured using Marginal LDs. This metric signifies the consistency of MMs. It is calculated between key components of consecutive windows.
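As a rough sketch of these two measures, assume each window's key components are encoded as a string of single-letter attribute codes in a fixed order (for example `"PDE"`). That encoding, the function names, and the sample sequence are assumptions; the distance itself is the standard dynamic-programming Levenshtein distance.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def convergence(models):
    """LD between each window's model and the final window's model."""
    final = models[-1]
    return [levenshtein(m, final) for m in models]


def marginal(models):
    """LD between the models of each pair of consecutive windows."""
    return [levenshtein(a, b) for a, b in zip(models, models[1:])]


# Hypothetical sequence of key components for one participant, one per window.
models = ["PD", "PDE", "DEN", "PDE", "PDE", "PDE"]
to_final = convergence(models)
step_changes = marginal(models)
```

In this toy sequence the distance to the final model falls to 0 from the fourth window onward, and the marginal distances are 0 once the participant stops changing attributes.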
Participants
Data were collected from 32 participants (47% male, 44% female, and 9% undesignated) and ages ranged from 19-60 with a median age of 34. All participants resided in the United States and were fluent in English. There were no reports of color blindness. The experiment was hosted on an experiment-building platform called Gorilla and recruitment was done using an online crowd-sourcing platform, Prolific. The first presentation of this can be seen in Walsh and Feigh (2021). With an average completion time of 30-40 minutes (study and post-study questionnaire), participants were compensated at $10 USD/hour. The study was approved by the Georgia Institute of Technology IRB Committee.
Identification of Mental Model Components
A previously established approach, Partial Least Squares Regression (PLS-R), was used for strategy identification in Walsh and Feigh (2022a, 2022b). PLS-R is well suited for this task domain because it can be used when there are correlated independent variables. The technique constructs near-independent variables, called “components”, as linear combinations of the original independent variables. PLS-R combines information about the relative importance of each attribute to a participant’s decision choice, and which features from their behavioral data (frequency of mouse clicks and dwell time on each attribute) correlated most strongly with the used resources. The Variable Importance in Projection (VIP) metric was used to estimate the predictive power of each independent variable in explaining the user behavioral data. VIP scores higher than 1 indicate the information sources, i.e., attributes, that were used (1 being a well-established cutoff threshold for PLS-R independent variable reduction, as supported by Akarachantachote, Chadcham, and Saithanu (2014)). For more information on this approach, refer to Walsh and Feigh (2022b).
A mapping between the significant information attributes and decision strategies is not a mere tally of the number of attributes selected. We instead determine if the attribute selection was predictive of the outcome (overall utility and utility of the individual attributes) by examining attribute-wise performances. For example, if a participant repeatedly interacted with the “Power Outages” but their performance on the attribute was chaotic, then PLS-R would not consider “Power Outages” as a significant component in the participant’s MM. Thus, PLS-R informs the relative importance of each attribute to a participant’s decision choice and identifies which of their behavioral features correlate strongly with the used resources.
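To make the VIP cutoff concrete, the sketch below computes VIP scores from an already-fitted PLS model using one common formulation: for predictor j, VIP_j = sqrt(p * sum_a(SS_a * w_aj^2 / ||w_a||^2) / sum_a(SS_a)), where p is the number of predictors, w_a is the weight vector of component a, and SS_a is the response sum of squares explained by component a. The weights and sums of squares here are invented numbers, not values from the study, and the fitting step itself is omitted.

```python
import math


def vip_scores(weights, ss_explained):
    """VIP per predictor.

    weights[a][j]   : weight of predictor j in component a
    ss_explained[a] : response sum of squares explained by component a
    """
    p = len(weights[0])
    total_ss = sum(ss_explained)
    scores = []
    for j in range(p):
        acc = 0.0
        for w_a, ss_a in zip(weights, ss_explained):
            norm_sq = sum(w * w for w in w_a)  # squared length of w_a
            acc += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(p * acc / total_ss))
    return scores


# Hypothetical two-component model over three predictors.
weights = [[0.8, 0.6, 0.0],
           [0.0, 0.6, 0.8]]
ss_explained = [3.0, 1.0]

vip = vip_scores(weights, ss_explained)
selected = [j for j, v in enumerate(vip) if v > 1.0]  # VIP > 1 cutoff
```

A useful property of this formulation is that the squared VIP scores sum to the number of predictors, so a score of 1 corresponds to average importance. That is one way to read the cutoff of 1.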
Results
In this section, we explore the development of individuals’ MMs and determine the stability and predictability trends of MMs amongst participant groupings.
To answer the first research question, we focused on identifying key attributes over a localized window of time to capture variations among participants’ choices. A window size of 3 was not sufficient for PLS-R to make accurate classifications (it led to an overfit model with R² values ≈ 1). Sizes 5 and 6 led to negligible differences in key attributes, but a window size of 5 allowed for an additional classification point for comparison (6 consecutive windows versus 5). Larger window sizes did not yield sufficient classification points for analyses. Thus, a window size of 5 was used for all further analyses presented.
The distribution of key components in MMs of participants, across all updates, is seen in Figure 2. At a glance, we can see that most participants do not incorporate all information sources equitably (6 significant attributes) nor do they make completely irrational/erratic decisions (Unknown/0 significant attributes). Most participants developed MMs that partially consist of the available information sources. The characterizations of key MM components in participants span the entire range of available information attributes, with the highest frequency of models comprising 3 key components/attributes (56%). The next largest group used 2 key attributes (20%). Roughly 69% of models within the 3-component category belonged to ‘PDE’, which was then followed by ‘DEN’ (10%). Models with key attributes ‘PE’, ‘PD’, and ‘DE’ comprised roughly 75% of the 2-component group. From these choices, it is clear that participants relied on the more visually complex attributes, i.e., Power, Population, and SES. No further characterization was done on the basis of variations in weighting between the significant attributes to avoid the ballooning of the number of models. None of the participants incorporated an equal weighting scheme for all six attributes. We speculate that the complexity of the overall task and time constraints for completion impeded the usage of all information sources equally. We also identified 7 instances where participants acted arbitrarily, as we were unable to conclusively identify any significant attributes. These instances stem from an unpredictable relationship between the participant’s behavior and decision outcome at that instance.

Figure 2. Key component identification within the entire population.
To answer the second research question, we classified each participant into one of three categories based on their average performance. The mean performance (%UtChoice) per participant across all updates was 75.56 with a standard deviation of 6.749. Participant groups were, thus, divided into high (M = 85.2709, SD = 6.8021), mid (M = 76.2041, SD = 8.2253), and low (M = 65.233, SD = 10.1479) performance categories (see Figure 3). Individuals whose performance exceeded 1 SD above the overall mean were designated as high performers, while those with scores more than 1 SD below the mean were grouped as low performers. Individuals with scores ranging between high and low performers were mid-performers. 62.6% of participants were mid-performers, followed by 18.7% and 15.6% of participants in the low and high-performance groups respectively. A one-way ANOVA differentiated the three performance groups (F(2,87) = 41.77, p < 0.01). Tukey’s HSD post hoc test revealed that the average performances of the low, mid, and high-performance groups significantly differ from each other with a statistical significance of p < 0.001. Thus, the performance-wise characterization of our population is statistically relevant.

Figure 3. Performance groups based on average utility scores.
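The grouping rule (more than 1 SD above or below the overall mean) can be sketched as follows. The participant identifiers and scores are invented, and using the sample standard deviation rather than the population one is an assumption, since the text does not specify which was used.

```python
from statistics import mean, stdev


def performance_groups(avg_scores):
    """Label each participant high, mid, or low using mean +/- 1 SD cutoffs."""
    values = list(avg_scores.values())
    m = mean(values)
    sd = stdev(values)  # sample SD; an assumption, see lead-in
    groups = {}
    for pid, score in avg_scores.items():
        if score > m + sd:
            groups[pid] = "high"
        elif score < m - sd:
            groups[pid] = "low"
        else:
            groups[pid] = "mid"
    return groups


# Hypothetical average %UtChoice per participant.
avg_scores = {"p1": 60.0, "p2": 74.0, "p3": 75.0, "p4": 76.0, "p5": 90.0}
groups = performance_groups(avg_scores)
```

Because the cutoffs are relative to the sample, most participants will fall in the mid group by construction, which matches the proportions reported above.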
To gauge the stability of MMs with task progression, we investigated if the strategies identified at each window ‘converged’ (see Measures) toward the strategy identified at the final window. Figure 4 shows the trends in LD for each performance group. Among all groups, we observe a downward trend in LDs computed between the final window and the window at each time step. Differences between the performance groups were non-trivial. A one-way ANOVA indicated F(2,72) = 27.77, p < 0.01. Tukey’s HSD post hoc tests revealed statistically significant (p < 0.001) results between the high-low and mid-low pairs, while the differences between the high-mid pair were not significant (p > 0.05). These results point towards differences in adaptability among the three groups, which is strongest among high performers followed by mid and low performers in that order. There also exists a positive correlation between convergence toward the final strategy and the corresponding change in performance, as seen in Table 1. The correlation is strongest among high performers and is moderate among the mid-performers. It is absent in the low-performance group. While the former groups tend to vary key attributes of their MM until they find a stable, well-performing strategy, the low performers utilize the same attributes even though these do not necessarily improve task performance.

Figure 4. Group-wise average LDs between final model and model at each update (*** p < 0.01).
Table 1. Spearman-rank correlation.
*** p < 0.01; ** p < 0.05; * p < 0.1.
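For readers unfamiliar with the statistic reported in Table 1, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values given their average rank. The sketch below is a plain illustration with invented data; it does not reproduce the study's values and omits the significance test.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-window values for one group: reduction in LD to the
# final model, and the corresponding change in %UtChoice.
ld_reduction = [0, 1, 1, 2, 3]
perf_change = [-2.0, 1.5, 3.0, 4.0, 6.5]
rho = spearman(ld_reduction, perf_change)
```

A value of rho near 1 means that windows with larger moves toward the final model also tend to show larger performance gains, which is the pattern the text describes for high performers.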
Next, we examine the predictability and persistence of these trends over a larger number of tasks. Figure 5 shows the distribution of marginal Levenshtein Distances among participants over time. Marginal LDs are LDs calculated between consecutive windows without any overlap in the data points used to classify them. This is done to avoid similarities arising due to artifacts of overlapping key component classifications. The proportion of participants having 0-attribute changes monotonically increases with the progression of tasks and almost doubles by the end of the experimental trials. The overall proportion of participants with 0 or 1-attribute changes also increases in comparison to 2, 3, or 4-attribute changes. This tendency for local convergence was highest among low performers, followed by mid and high performers in that order. We attribute this pattern to high adaptability within the high performers. No significant relationship was observed between the marginal changes in strategies and the corresponding changes in performance.

Figure 5. Distribution of the number of attribute changes in key components of each model.
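A distribution of this kind can be tabulated from per-participant marginal LDs as sketched below. The input matrix is invented and the function name is hypothetical; the marginal LDs are assumed to have been computed already between consecutive, non-overlapping windows.

```python
from collections import Counter


def change_distribution(marginal_lds):
    """Share of participants with each number of attribute changes, per step.

    marginal_lds[p][t] is the LD between windows t and t+1 for participant p.
    """
    n_participants = len(marginal_lds)
    n_steps = len(marginal_lds[0])
    dist = []
    for t in range(n_steps):
        counts = Counter(row[t] for row in marginal_lds)
        dist.append({k: counts[k] / n_participants for k in sorted(counts)})
    return dist


# Hypothetical marginal LDs: four participants, five window transitions.
marginal_lds = [
    [2, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [3, 2, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
dist = change_distribution(marginal_lds)
zero_share = [d.get(0, 0.0) for d in dist]  # share with no attribute change
```

In this toy data the share of participants with zero attribute changes never decreases from one transition to the next, which is the kind of trend the text reports.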
Results from Table 1, in conjunction with Figures 4 and 5, indicate that the highest performers tend to adapt their strategies until they find a suitable one that leads to improved performance. On the other hand, the lowest performers adhere to a moderately performing strategy very early and exhibit tendencies toward risk aversion over reward-seeking.
Discussion
To understand and facilitate team performance by creating reliable decision support systems, it is critical to infer humans’ MMs of the task (Klimoski & Mohammed, 1994) and determine how they evolve over time. Through this experiment, we used data from participants’ information-seeking behavior to observe dynamic changes in key attributes of their MMs. We also ascertained variations in performance outcomes based on individual differences in the evolution of MMs over time.
Our results indicated that decision rules cluster into a relatively small number of distinct strategies, similar to the findings of Gary and Wood (2011). The most effectively used attributes were “Power Outages”, “Population Density”, and “SES”. As found in our previous work (Walsh & Feigh, 2022b), most participants continued using these attributes over the visually simpler “Storm”, “No-Go”, and “Flooding”. The complexity of the information presented and time constraints led most users to implement only a subset of the available information. Our results support previous works (Gary & Wood, 2011; Tomlin, 2021; Walsh & Feigh, 2022b) which find that complexity in MMs is not linked to improved performance, as complexity may hinder accurate task understanding and subsequent performance. Our highest performers were able to use only 2, 3, or 4 information attributes at a time to inform their decisions. They were able to develop efficient MMs of the task and environment through limited interaction. This finding compels us to think about the importance of decision aids and how they can be efficiently designed to condense information for fruitful assimilation. Detailed investigation of the impact of information complexity on physical and cognitive workload is necessary to conclusively formulate design guidelines for decision aids.
The disaster-planning task was designed to simulate a high-stakes decision environment with time playing a crucial role in enforcing task criticality. No think-aloud protocol was used to verify the key MM components as it would risk slowing down participants and disturbing the elicitation process. Further, dwell time on attributes, our behavioral measure used to determine key components, would be manipulated through subjective elicitation. This is supported by Harper and Dorton (2019) who state that indirect elicitation methods, such as task observation, do not allow practitioners to verify individuals’ internal representations of the task.
Characteristics of a DM environment induce the selection of cognitive systems whose properties lie on a continuum. According to Hammond (1993), to sufficiently capture these characteristics decision-makers must provide “testable theories of the environment and make use of well-worked-out precepts of representative design”. Generalization of context-sensitive MMs (Converse et al., 1993; Rouse & Morris, 1986; Scheutz, DeLoach, & Adams, 2017) to other types of tasks, without accounting for task conditions, would yield implicit, oversimplified results that may be inaccurate/retrospective due to “ad-hoc representations of the task environment that cannot be falsified”.
Our second research question investigates the stability of decision strategies and the persistence of developmental trends over time. The highest performers converged toward a single model and their decisions grew more stable with the progression of tasks. They varied their strategies until they found a suitable model that yielded improved performance. Conversely, low performers were risk-averse and did not vary their strategies much. Factors such as low motivation were investigated to explain this observation and ruled out based on observed behavioral patterns. It is likely that lower performers were satisficing at a lower criterion value. High performers were satisfied with a higher score, leading to more variability in performance over a longer period of time, but ultimately leading to a more accurate MM with no increase in complexity. With the progression of tasks, an increased proportion of participants adopted strategies that were similar to each other, indicating an overall rise in predictability. As the average performance among participants improved, individuals showed high adaptability in the initial trials and transitioned towards being more predictable as task familiarity increased. These findings can help curate decision aids and support mechanisms in improving performance based on overall task understanding and performance of users.
Summary
Our goal was to objectively infer the dynamic decision strategies and observe the development of humans’ MMs of a task using process tracing to track behavioral data. We presented the progression of an oncoming storm where participants were tasked as disaster-relief planners and were asked to allocate resources based on six information attributes. We identified key attributes of MMs over stipulated time intervals using their behavior data and decision outcomes. Most individuals deployed a subset of information attributes while informing their decisions and had much less tendency to act arbitrarily or weigh in all information equally. Significant differences also emerged when participants were characterized by their performance. The highest performers adapted their decision strategies until they developed a MM of the task that led to desired outcomes. In contrast, the lowest performers preferred stability over performance improvement. Overall, participants broadly tended to be more consistent with the key attributes used in DM as their task familiarity improved, indicating that they indeed developed MMs of the task. Future work should investigate the impact of MM development on workload metrics and explore its utilization for formulating effective decision aids.
Acknowledgements
This work was supported by the Office of Naval Research Command Decision Making Program under Contract N00014-13-1-0083. The results do not reflect the official position of this agency.
