Abstract
The potential of Artificial Intelligence in assisting human teamwork has yet to be fully realized, despite its success in other domains. To be effective and credible as a team advisor, an AI must be able to infer team dynamics and issue appropriate interventions. This study focuses on AI-mediated human teamwork in a simulated search and rescue (SAR) task, where a team of humans is monitored and guided by an artificial social intelligence (ASI). Six different ASIs are compared against a human baseline to investigate the characteristics and effectiveness of their interventions. When adjusted for initial player competence, ASIs performed on par with the human advisor, although the human advisor was rated as more trustworthy and useful. Additionally, sentiment analysis of the interventions reveals that participants were more likely to comply with interventions carrying negative sentiment, and that such interventions were associated with improved team performance.
Introduction
Artificial agents have been applied in various domains and participated in social interactions with humans (Ayoub, Du, Yang, & Zhou, 2022; Huang, Brusilovsky, Guerra, Koedinger, & Schunn, 2022; Rahdari et al., 2022). In most application scenarios, artificial agents and humans form human-autonomy teams in which synthetic team members play a well-defined role in carrying out taskwork (McNeese, Demir, Chiou, & Cooke, 2021). However, in addition to taskwork, AI can also be applied to facilitate teamwork, especially by augmenting humans’ limited cognitive ability and bounded rationality when collaborating in dynamic environments (Gupta & Woolley, 2021). To issue valid interventions to the team, artificial advisors need to maintain a cognitive model of humans to recognize and attribute their mental states, beliefs, desires, and intentions (Williams, Fiore, & Jentsch, 2022). Instead of modeling humans with black-box models, like neural networks (Li et al., 2022) or model-based reinforcement learning (Gao et al., 2019), Theory of Mind models can be used to reveal more internal information about human decision-making processes (Baker, Jara-Ettinger, Saxe, & Tenenbaum, 2017). We will use the term artificial social intelligence (ASI) to refer to agents that can process social signals from humans. By having a human model, an ASI could provide adaptive recommendations to the team to help them better coordinate, collaborate, and communicate with each other.
The complexity of modeling humans scales up when it involves teamwork. Even for cooperative tasks where the team shares the same goal, every team member has different observations, beliefs, and intentions (Fletcher & Sottilare, 2018). The social interaction between team members leaves room for miscommunication, false memory, and second-order belief, where artificial agents could intervene and provide help. Intelligent Tutoring Systems have been a prime mover in the development of social intelligence in domains like education (VanLehn, 2011) and the military (Fletcher, 2009). While most research focuses on one-to-one tutoring scenarios like issuing adaptive instructions to individual learners, growing attention is being paid to improving human teamwork with artificial advisors. Fletcher and Sottilare (2018) proposed the Generalized Intelligent Framework for Tutoring to provide adaptive instructions for teams based on shared mental models. Gupta and Woolley (2021) modeled human teamwork as a transactive systems framework and pointed out how AI could help with the emergence of collective memory, attention, and reasoning. However, most of this work has not evaluated the proposed methods in empirical studies with real human teams.
There are a few essential considerations when designing effective team interventions for ASIs, for example, the level of explainability and the form of presentation. Previous research in AI-mediated teamwork has shown that the compliance of human teams with the recommendations given by an ASI depends on the humans’ level of trust (Williams et al., 2022). ASIs could become more trustworthy by providing explanations for their interventions, because human users are more likely to trust and rely on advisors if they can easily interpret the given interventions (Lewis, Li, & Sycara, 2021). The wording of interventions can also be tailored to promote the desired result. Covey (2014) found that health messages may have different persuasiveness when presented with different frames, e.g. in terms of the benefits of adopting a recommendation (gain-frame) or the costs of not adopting a recommendation (loss-frame). ASIs could utilize such a framing effect (Tversky & Kahneman, 1985) when issuing recommendations in order to promote compliance of human teams. In addition, sentiment and emotion during conversations might also influence users’ perceived utility of and trust in artificial agents (Stivers & Sidnell, 2012). However, most previous research focuses on identifying sentiment from user inputs during interaction with chatbots (Feine, Morana, & Gnewuch, 2019) or regulating emotion in a group chat between humans (Peng, Kim, & Ma, 2019), with little attention paid to how humans might perceive and react to agent inputs with different emotions.
For artificial advisors to become effective, transparent, and trustworthy when interacting with human teams, they have to be able to process social signals from humans to estimate their mental states and then issue appropriate interventions with adequate explanations for humans to understand (Williams et al., 2022). Inspired by humans’ ability to conduct Theory of Mind inference, we propose an Artificial Social Intelligence (ASI) that can issue explainable instructions based on the observation of humans. We focus on a search and rescue (SAR) team consisting of three humans, each with unique abilities and responsibilities. The proposed agent monitors team behaviors, estimates mental states, and issues interventions in natural language. We compared the effect of six different ASIs with a human advisor as the baseline, in terms of team performance, compliance, and perceived utility and trust. Sentiment analysis is conducted to show how intervention presentation plays a role in AI-mediated teamwork.
Method
Task Scenario
ASIs were tested by Arizona State University (ASU) researchers with human participants in a simulated search and rescue team task (Corral, Tatapudi, Buchanan, Huang, & Cooke, 2021). The scenario is built in the Minecraft environment and represents a structurally damaged office building after an unspecified incident. There are 15 critical victims, who are severely hurt, and 20 regular victims inside the building. Participants were asked to search the building and rescue victims. The original building layout is given to the rescue team, but there might be structural damage including collapses and wall openings. Participants need to remove rubble to clear paths in their search, stabilize victims, and transport victims to designated locations. Victims stabilized and delivered to the correct zones count toward the team points: regular victims are 10 points each and critical victims are 50 points each. Team performance is measured by the sum of point rewards for all stabilized victims that are delivered to the correct zones by all team members. The maximum score a team can achieve is 950, obtained by rescuing all 35 victims in time.
Each of the three participants on a team was assigned one of three roles: Medic, Transporter, or Engineer. Each role has a unique set of capabilities and knowledge.
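The scoring rule above can be checked with a short calculation. The following is a minimal sketch; the function name and the input format (a mapping from victim type to the number rescued) are assumptions made for illustration, while the point values and victim counts come from the task description.

```python
# Point values and victim counts as stated in the task description.
POINTS = {"regular": 10, "critical": 50}
VICTIMS = {"regular": 20, "critical": 15}


def team_score(rescued):
    """Sum points for victims stabilized and delivered to the correct zone.

    `rescued` maps victim type to the number rescued (hypothetical format).
    """
    for kind, count in rescued.items():
        if not 0 <= count <= VICTIMS[kind]:
            raise ValueError(f"invalid count for {kind}: {count}")
    return sum(POINTS[kind] * count for kind, count in rescued.items())


# Rescuing every victim gives 20 * 10 + 15 * 50 points.
print(team_score(VICTIMS))  # 950
```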
Engineer: Can remove rubble with a hammer, and transport victims at a slow speed.
Medic: Can rescue victims, diagnose victim injury types, and transport victims at a medium speed.
Transporter: Can detect victims nearby, and transport victims at a fast speed.
Advisor Conditions
Artificial Social Intelligence
Six different artificial advisors are implemented to issue interventions based on observed team behaviors. The implementation details of each ASI are out of the scope of this paper. We will introduce the general design principles of ASI and then take ASI-1 as an example to briefly illustrate the intervention mechanism.
Each ASI is integrated with the testbed in real time during the experiment, allowing it to receive in-game event messages such as stabilization behaviors, player locations, and objects within the field of view (FOV). Each ASI features its own decision-making module, controlling when interventions should be issued to teams or individuals. Given access to information from all three team members, the ASI has the potential to enhance teamwork in multiple aspects such as encouraging communication, suggesting team strategy, and correcting false beliefs. The intervention rules and learning-based algorithms of the six ASIs were developed separately by multiple universities and institutes based on pilot experiment data. To prevent access to ground truth information and promote teamwork improvement, the ASIs are trained and tested on different mission maps while receiving the same information as participants. The interventions are communicated through text messages displayed on the recipient player’s screen and reinforced through audio cues.
ASI-1 employs a Team Theory of Mind (TTOM) model to estimate the beliefs of each player regarding the contents of each room in the environment. The estimation of belief states is continuously updated based on events that provide evidence of room contents, such as when objects come into view, and a decay parameter is included to reflect the forgetting of information over time. Seventeen distinct interventions were identified based on prior task knowledge and pilot data analysis, and these interventions are triggered when specific belief states are detected. For instance, if the TTOM model detects a discrepancy in belief states between two players regarding the injury types of victims, ASI-1 may issue an intervention to the medic player, encouraging her to share information with other players. A filter module decides whether to send interventions, taking into account the probability distribution of belief states and the history of previous interventions. Interventions are presented using pre-defined language templates, with keywords that can be customized to reflect the recipient and explanation. For example, an intervention might read: ‘[Team], you seem to be neglecting high-value [critical] victims. Rescuing more [critical] victims would likely result in a higher score.’
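To make the idea of a decaying belief state concrete, the sketch below shows one possible way such an update could look. It is not the ASI-1 implementation, whose details are outside the scope of this paper. The class name, the exponential form of the decay, the prior of 0.5, the decay rate, and the discrepancy threshold are all assumptions chosen for illustration.

```python
import math


class RoomBelief:
    """Hypothetical per-player belief that a room holds a given content."""

    def __init__(self, prior=0.5, decay_rate=0.01):
        self.prior = prior            # belief before any evidence (assumed)
        self.decay_rate = decay_rate  # forgetting rate per second (assumed)
        self.value = prior

    def observe(self, present):
        # Direct evidence (object in the field of view) overrides the belief.
        self.value = 1.0 if present else 0.0

    def step(self, dt):
        # Forgetting: relax exponentially back toward the prior.
        gap = self.value - self.prior
        self.value = self.prior + gap * math.exp(-self.decay_rate * dt)


def belief_discrepancy(a, b, threshold=0.3):
    """True when two players' beliefs differ by more than `threshold`."""
    return abs(a.value - b.value) > threshold


medic, engineer = RoomBelief(), RoomBelief()
medic.observe(True)  # the medic sees a victim in the room
medic.step(30)       # 30 seconds pass without new evidence
if belief_discrepancy(medic, engineer):
    print("Medic, consider sharing what you found with your teammates.")
```

In this toy setup the medic's belief is still well above the engineer's after 30 seconds, so the discrepancy check fires; after a long enough time without evidence, the belief returns to the prior and the trigger no longer applies.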
Human Advisor
The human advisor condition is intended to establish a benchmark against which to compare ASI advisors’ effects on teams. The term intervention denotes any action taken by an advisor: issuing advice, asking a question, drawing attention, etc. The human advisor communicates with participants verbally. The human advisor was told to observe the team and advise them concerning teamwork (not taskwork) to help improve the team performance. The advisor was selected based on their expertise concerning the SAR tasks, teamwork knowledge, and communication skill, and not on their Minecraft skill. The human advisor was trained concerning the experimental tasks and provided with guidelines and examples of advice. The human advisor was instructed to use their best judgment about when to speak, what to say, and how to express advice to the team.
Sentiment Analysis
Lexicon-based sentiment analysis is used to examine the intervention contents issued by advisors. It uses a pre-prepared sentiment lexicon to score an intervention text by aggregating the sentiment scores of all the words in the text. We adopt the National Research Council Canada (NRC) Word-Emotion Association Lexicon (Mohammad & Turney, 2013) for our research because of its coverage (i.e. 14,182 words) and wide usage across academia. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing on Mechanical Turk.
The content and frequency of interventions issued by advisors are calculated for each trial in the dataset. We process texts with the WordNet lemmatizer, which groups together the inflected forms of a word so they can be analyzed as a single item (Bird, Klein, & Loper, 2009). The emotion values are aggregated at the trial level with intervention frequency as the weights.
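The lexicon-based scoring and frequency-weighted aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the toy lexicon contains only example words mentioned in this paper rather than the full NRC lexicon, the small lemma table stands in for the WordNet lemmatizer, and normalizing by the total number of interventions is one plausible reading of "frequency as the weights".

```python
import re
from collections import Counter

# Toy lexicon built from example words listed in this paper; the real NRC
# lexicon is far larger and is not reproduced here.
TOY_LEXICON = {
    "good": {"positive"}, "luck": {"positive"}, "job": {"positive"},
    "victim": {"negative"}, "wrong": {"negative"}, "remove": {"negative"},
    "team": {"trust"}, "share": {"trust"}, "suggest": {"trust"},
    "time": {"anticipation"}, "plan": {"anticipation"},
}
# Stand-in for a lemmatizer: maps a few inflected forms to their lemma.
TOY_LEMMAS = {"victims": "victim", "sharing": "share", "plans": "plan"}


def score_text(text):
    """Count lexicon hits per emotion for one intervention text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        lemma = TOY_LEMMAS.get(token, token)
        for emotion in TOY_LEXICON.get(lemma, ()):
            counts[emotion] += 1
    return counts


def trial_emotions(interventions):
    """Frequency-weighted mean emotion counts over one trial.

    `interventions` maps each distinct intervention text to the number of
    times it was issued in the trial (hypothetical input format).
    """
    total = sum(interventions.values())
    if total == 0:
        return {}
    weighted = Counter()
    for text, freq in interventions.items():
        for emotion, hits in score_text(text).items():
            weighted[emotion] += hits * freq
    return {emotion: hits / total for emotion, hits in weighted.items()}


trial = {
    "Team, you seem to be neglecting critical victims.": 3,
    "Good job sharing your plans.": 1,
}
print(trial_emotions(trial))
```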
Data Collection
Participants were recruited by ASU researchers (Freeman, Huang, Wood, & Cauffman, 2021) from the University and social platforms (e.g., Reddit, Discord, etc.). Participants had to be physically located in the US, have reliable internet, and have experience playing Minecraft with a standalone mouse and keyboard. Participants were randomly assigned to one of the 7 advisor conditions (i.e. 6 ASIs or human advisor). Teams of three participants were scheduled to participate in a 2-hour experiment that involves searching for victims and rescuing them in a Minecraft task environment. At the start of the experiment, participants received training materials that introduce the rules of the game and provide some hands-on experience with the game before two 17-minute missions. Participants were paid a total of $35 in the form of an Amazon gift card for a total of 2.5-hour experiments.
In total, 111 teams (i.e. 333 participants) were recruited for the data collection. Their in-game behaviors and verbal communications were recorded into metadata files, including players’ locations, actions, and objects in their field of view. Thirteen teams failed to complete the experiment due to technical difficulties. The final dataset used for later analysis consists of 196 trials from 98 teams. The remaining 98 teams (294 participants) consisted of 215 males, 77 females, and 2 individuals who declared other gender identities or preferred not to respond. The mean age of participants was 22.04 (SD=5.22, ranging from 18 to 49). The most common ethnicities were white/Caucasian (54.2%), Asian (25.8%), and Hispanic or Latino (13%). All participants had at least a high school level education.
Results
Performance
We first compare the effect of advisor conditions on team performance to have a direct measurement of whether advisors issue helpful interventions. Since all ASIs are designed to aid teamwork (e.g. communication and coordination), covariates that influence taskwork (e.g. game experience) need to be controlled. Here we use the results of the competency test during the training session as a measurement of participants’ Minecraft skill level. Analysis of Covariance (ANCOVA) was conducted with advisor condition as the independent variable and team performance as the dependent variable. The completion time (in milliseconds) of the competency test is the covariate to control. Results show that the advisor condition has a significant effect on team performance after controlling for individual competency level, F(6,554) = 3.49, p < .01. We run separate linear regressions for each advisor condition, as shown in Fig 1. The data show that most participants (93.9%) have a test completion time between 1 and 3 min. Within this interval, ASI-1 and the human advisor have the largest intercepts among all advisor conditions, indicating that those two advisors provide the most positive influence on team performance when individual skill levels are controlled. However, pairwise comparisons among advisors did not reach significance.

Fig 1. Linear regression plots of different advisor conditions. The X-axis refers to the competency test completion time, which is assumed to be negatively correlated with participants’ skill levels. The Y-axis refers to the team performance of individual participants.
The effectiveness of advisors can also be measured by participants’ perceived utility and trust. A post-task survey was collected to gather participants’ subjective reports on those two dimensions (Freeman et al., 2021). Fig 2 shows that the human advisor is perceived to be more useful and trustworthy than the ASIs, while the ASIs do not differ significantly from one another. ANOVA and paired t-tests provide similar insights as shown in the figure (utility: F(6,563) = 18.8, p < .001; trust: F(6,563) = 13.9, p < .001).
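For readers unfamiliar with ANCOVA, the sketch below shows how the F statistic for a condition effect can be computed with a single covariate. It is not the analysis code used in this study. It uses the classical sums-of-squares formulation, which compares a reduced model (one intercept and one slope) against a full model (one intercept per condition and a common slope). The function and variable names and the toy data are assumptions for illustration only.

```python
from collections import defaultdict


def _sums(xs, ys):
    """Return (SSx, SSy, SPxy) computed about the means of xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ssx, ssy, spxy


def ancova_f(scores, conditions, covariate):
    """F statistic for a condition effect after adjusting for one covariate."""
    groups = defaultdict(lambda: ([], []))
    for y, g, x in zip(scores, conditions, covariate):
        groups[g][0].append(x)
        groups[g][1].append(y)

    # Reduced model: a single intercept and slope for all observations.
    ssx_t, ssy_t, sp_t = _sums(list(covariate), list(scores))
    rss_reduced = ssy_t - sp_t ** 2 / ssx_t

    # Full model: one intercept per condition, common slope.
    ssx_w = ssy_w = sp_w = 0.0
    for xs, ys in groups.values():
        ssx, ssy, sp = _sums(xs, ys)
        ssx_w, ssy_w, sp_w = ssx_w + ssx, ssy_w + ssy, sp_w + sp
    rss_full = ssy_w - sp_w ** 2 / ssx_w

    df_effect = len(groups) - 1
    df_resid = len(scores) - len(groups) - 1
    f_stat = ((rss_reduced - rss_full) / df_effect) / (rss_full / df_resid)
    return f_stat, df_effect, df_resid


# Toy data: two hypothetical conditions with the same slope but shifted means.
conditions = ["A"] * 4 + ["B"] * 4
covariate = [1, 2, 3, 4, 1, 2, 3, 4]
scores = [2.1, 3.9, 6.2, 7.8, 12.0, 14.1, 15.9, 18.2]
f_stat, df_effect, df_resid = ancova_f(scores, conditions, covariate)
print(df_effect, df_resid)
```

With seven advisor conditions the effect has 6 degrees of freedom, which matches the first degree of freedom in the reported F(6,554).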

Fig 2. Bar plot of participants’ perceived utility of and trust in advisors.
In addition, we annotated whether the participants followed interventions given by the ASI-1 advisor. The criterion of compliance was met when the recipients (either individual players or the whole team) conducted the requested actions within 30 seconds of receiving the intervention. The average compliance rate is 61.7%, and high-performance teams tend to have a higher compliance rate (r = 0.50, p < .001).
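The 30-second compliance criterion can be expressed as a simple rule. The sketch below is illustrative only: the record format (an intervention timestamp paired with the timestamps of matching requested actions) and the function names are assumptions, and it does not describe how the annotation was actually carried out.

```python
COMPLIANCE_WINDOW_S = 30  # window stated in the text


def complied(intervention_time, action_times, window=COMPLIANCE_WINDOW_S):
    """True if a requested action occurred within `window` seconds afterward."""
    return any(0 <= t - intervention_time <= window for t in action_times)


def compliance_rate(records):
    """Fraction of interventions that were followed.

    `records` is a list of (intervention_time, action_times) pairs, with
    times in seconds (hypothetical format).
    """
    if not records:
        return 0.0
    followed = sum(complied(t0, actions) for t0, actions in records)
    return followed / len(records)


records = [
    (100.0, [112.5]),         # followed after 12.5 s
    (300.0, [345.0]),         # too late: 45 s
    (500.0, []),              # never followed
    (640.0, [630.0, 660.0]),  # earlier action ignored; followed after 20 s
]
print(compliance_rate(records))  # 0.5
```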
Sentiment Analysis
The sentiment analysis indicates the weighted average emotional values of ASIs and human advisors, as shown in Fig 3. Firstly, we compare how advisor conditions differ in each emotion dimension to provide a qualitative perspective. For the two sentiments, the human advisor uses the most positive words (e.g. good, luck, job, work) in their interventions compared to other advisor conditions. ASI-1 uses the most negative words (e.g. victim, wrong, remove). In terms of the 8 basic emotions, ASI-5, Human, and ASI-1 use the most trust-related words (e.g. suggest, team, share). ASI-3, ASI-1, and Human use the most anticipation-related words (e.g. time, plan, hope).

Fig 3. A radar plot of the weighted average emotional values of ASIs and human advisors.
Then we calculated the average intervention frequency and length in each advisor condition. Results are shown in Table 1. While the human advisor tends to issue interventions conservatively with a moderate length, ASIs have a wide variety in terms of frequency and level of detail for interventions. For example, ASI-2 issues a lot of concise interventions, and ASI-3 issues selective interventions with rich information.
Table 1. Intervention frequency and length measured by the average number of interventions issued per trial and the average text length per intervention, respectively.
Next, we run a correlation analysis to reveal the relationship between intervention features (i.e. emotion values, frequency, and length) and performance measurements (i.e. team performance, perceived utility and trust, and compliance rate). Table 2 shows a subset of variable pairs that have significant correlations and relatively large coefficients. Negative sentiment has positive correlations with both team performance and compliance rate. Similarly, positive emotion values like joy and surprise are negatively correlated with intervention compliance. These results suggest that interventions with harsh language might be more acceptable and lead to better outcomes in the given task scenario. In addition, average text length and intervention frequency correlate with the subjective satisfaction participants reported in the post-task survey, but in different directions. Longer but less frequent interventions get higher ratings in terms of perceived utility and trust, leading to the hypothesis that frequent interventions might be disruptive and short interventions without enough explanation might be less convincing.
Table 2. Correlation table between intervention features and performance measurements. Correlation coefficients are calculated using Spearman correlation. ***: p<.001, **: p<.01, *: p<.05.
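Spearman correlation is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below illustrates the computation in plain Python. The per-trial values are hypothetical and are not data from this study, and significance testing is omitted.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-trial values, for illustration only.
negative_sentiment = [0.10, 0.25, 0.40, 0.55, 0.70]
compliance = [0.45, 0.50, 0.48, 0.66, 0.72]
print(round(spearman(negative_sentiment, compliance), 2))  # 0.9
```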
Discussion
In this study, we investigated AI-mediated human teamwork in a search and rescue task. Artificial advisors were implemented to facilitate human teamwork by issuing interventions to participants during the mission. To understand how intervention presentation influences team performance and participants’ perceived utility of and trust in the advisor, we conducted sentiment analysis over intervention messages.
In our study, we evaluated the performance of six artificial advisors and compared it to a human advisor, which served as the baseline. Results revealed that, when controlling for competency, the ASI-1 and human advisor conditions were the most effective in improving team performance, although pairwise differences among advisors were not significant. Specifically, ASI-1 was more beneficial for participants with lower skill levels (as indicated by a competency test completion time greater than 2 minutes), while the human advisor was more helpful for participants with higher skill levels (as indicated by a competency test completion time less than 2 minutes). These findings suggest that artificial advisors with dedicated Theory of Mind models can achieve human-level performance in mediating human teamwork in our task scenario.
To gain further insight into the effectiveness of ASI-1, we conducted a qualitative analysis of the interventions issued. Results showed that ASI-1 issued more interventions per trial, but each intervention contained fewer words compared to the human advisor. This difference in the number and length of interventions might explain the varying effects of the advisors across individual competence levels, with the ASI-1 group performing better than the human advisor group at the lower skill level. Our analysis of the intervention contents supports the hypothesis that artificial agents provide concise and context-dependent instructions, which are more helpful for beginners with limited experience, while human advisors offer more high-level strategy recommendations that can be more easily executed by intermediate players. Another key difference between the human and artificial advisors was their interaction channels. The human advisor communicated with participants via audio, while ASIs presented interventions via text messages. The human advisor was also allowed to answer questions during the mission, which enabled them to take advantage of natural language interaction and gain the trust of the participants. Results from the post-task survey indicated that the human advisor was perceived to be more trustworthy and useful compared to the ASIs. In addition, participants indicated a preference for less frequent but longer interventions.
A sentiment analysis was performed to examine the impact of emotion conveyed in intervention text on participants’ compliance and team performance. The results indicated a correlation between negative sentiment in interventions and higher compliance by participants, as well as improved team performance. This relationship may be attributable to two factors. Firstly, some interventions begin by addressing a current issue using negative language, like “you seem to be neglecting critical victims”. This helps participants understand the rationale behind the intervention and increases their willingness to comply. Secondly, interventions presented in a loss frame tend to use negative language to emphasize the consequences of non-compliance, such as “failure to spread out will decrease performance”. Previous research found that loss-framed messages are more persuasive when the perceived risk of the activity is high (Cho & Boster, 2008). Considering the above findings, it may be beneficial to use a loss frame and negative language when issuing interventions in tasks with high pressure and risk.
The current study is limited by the absence of an established causal relationship between the intervention content and team performance. Both human and artificial advisors issue interventions based on observed team behaviors, while the participants modify their behaviors in response to the interventions. To untangle these factors and validate the findings, a randomized controlled trial would be beneficial. Additionally, the lexicon-based sentiment analysis used in this study does not take context into account. Words such as “victim,” “threat,” and “evacuate,” which are prevalent in the scenario, are inherently negative, which could skew the sentiment analysis results. Incorporating machine learning techniques could improve the accuracy of the sentiment analysis by considering context.
