Abstract
We develop a method to assess the three indicators of transactive memory systems (TMS)—specialization, credibility, and coordination—through computer-aided text analysis. First, human coders assessed group transcripts for phrases representative of these indicators. From those phrases, we identified words that occurred frequently to develop a dictionary of TMS indicators. In total, we analyzed 262 groups composed of 1,091 individuals. Both the human-coded and dictionary-based assessments of TMS indicators are significantly related to a popular survey-based assessment of TMS. Our approach could be used to advance understanding of TMS by analyzing it in contexts where administering surveys is not feasible.
A transactive memory system (TMS), a shared understanding of who knows what in a group (Wegner, 1987), is an important predictor of group performance (Ren & Argote, 2011). Though TMS was originally studied only within dyads (Wegner et al., 1985), research on TMS has increased substantially since the concept was applied to the group level (Liang et al., 1995), and its importance has been demonstrated in a number of reviews (DeChurch & Mesmer-Magnus, 2010; Ren & Argote, 2011). Though TMS has been studied in many types of teams, the types have been limited by the available measurement techniques.
Transactive memory systems are typically assessed through surveys of group members (Austin, 2003; Lewis, 2003). The adoption of these standardized techniques for the assessment of TMS has been important in the development of research on the concept. Even so, TMS can be difficult to study in groups where questionnaire distribution and collection are difficult or when teams cannot be interrupted and asked to complete a survey. The main purpose of this research project is to develop an unobtrusive method for assessing the extent to which a TMS has developed, without the need for group members to complete a survey.
The development of an additional, unobtrusive assessment technique for TMS is important for several reasons. Survey measures on organizational teams are intrusive in that members must take time out of their workday to complete the survey. Some individuals may be unwilling, or some may give biased responses due to issues such as researcher demand effects, where participants’ inferences about the purpose of the survey affect their responses (Woodrum, 1984). Because groups contain many individuals and individuals can belong to multiple groups, it is necessary to get a majority of group members to complete the survey and to ensure that group members know precisely which group is being assessed.
In this paper, we provide an overview of the theoretical underpinnings of our unobtrusive measure and how it relates to extant assessments. We then present the development of a human-based coding scheme, the development of a computer-aided text analysis (CATA) measure derived from the human coding, and the use of these to assess TMS across five studies. Our Appendices also provide the code to replicate our work in developing CATA measures for other concepts of interest.
Theory Review
Transactive memory systems (TMS), first proposed by Wegner (1987), are collective systems for encoding, storing, and retrieving information. Focusing on dyads, Wegner (1987) proposed that each individual has a transactive memory consisting of perceptions of the knowledge and expertise of the other individual. This cognitive map of who has what expertise interlocks with others’ transactive memories to form a transactive memory system, often described as a shared understanding of who knows what (Lewis & Herndon, 2011).
Liang et al. (1995) extended the concept of transactive memory to the group level. TMS has been examined in many group contexts, such as ad hoc work teams, MBA project teams, and globally distributed teams (Kanawattanachai & Yoo, 2007; Lewis, 2004; Liang et al., 1995). A TMS develops over time as individuals interact with one another and learn who knows what, which can lead to individuals specializing in certain knowledge categories (Liang et al., 1995). Group members can coordinate effectively because they learn who has what knowledge and how to access it (Liang et al., 1995). TMS improves group performance by allowing individuals to rely on others for knowledge or expertise and implicitly interrelate their actions.
Current Techniques to Measure TMS Indicators
Though a variety of approaches have been used to assess the strength of a TMS, the assessment of the three indicators proposed by Liang et al. (1995)—specialization, credibility, and coordination—is the most common measurement approach. Knowledge-based, meta-knowledge-based, and process-related assessments of TMS have all also been used (Austin, 2003; Hollingshead, 1998; Kush, 2019; Mell et al., 2014). The only systematic comparison of these assessments found that knowledge-based assessments of TMS capture overlapping variance in performance compared with measures of the three indicators of TMS, suggesting that these two means of assessing TMS are getting at the same underlying concept (Kush, 2019). One advantage of the indicator approach to assessing TMS—in comparison to knowledge-based approaches—is that the indicators of TMS are consistent across knowledge domains and thus do not require context-specific information about a group’s work. Indicator-based assessments of TMS, therefore, can be used in many contexts with no modifications (Lewis, 2003, 2004). For simplicity, alignment with extant work, further accumulation of TMS knowledge, and theoretical parsimony, we focus on the three indicators of TMS. An overview of their historical assessments is discussed below.
Early work measuring transactive memory systems was primarily qualitative. For example, Wegner interviewed couples (Wegner et al., 1985), and Liang et al. (1995) coded videos of group interaction. Liang et al. (1995) argued that in groups with a well-developed TMS, members would specialize according to their expertise, trust each other’s expertise, and coordinate their distributed expertise effectively. In a group with a strong TMS, members develop specialization because they can devote more resources to the areas in which they are adept, they learn to trust others’ expertise in other areas, and their interactions become more efficient with better coordination. Thus, through the assessment of these TMS indicators—specialization, credibility, and coordination—the strength of a group’s TMS can be inferred. Liang et al. (1995) measured the indicators by instructing coders to watch videos of group interactions and rate each group on the extent to which its interactions reflected the three indicators. This qualitative work laid the foundation for future survey-based measures.
Lewis’s (2003) widely used survey measure of TMS builds on the same three indicators developed by Liang et al. (1995). Lewis’s (2003) 15-item survey dedicates five questions to each indicator: specialization, credibility, and coordination. Other researchers have also used parts of other established survey measures to substitute for the Lewis scale (see Bachrach et al., 2019 for a review). For example, Kanawattanachai and Yoo (2007) captured the same three indicators of TMS as Lewis (2003) but with different questions. Faraj and Sproull (2000) measured TMS by assessing expertise location, trust (Cook & Wall, 1980; McAllister, 1995), and coordination (based on Weick & Roberts, 1993). Though there is some variety in the specific items and terminology, these studies all assess essentially the same indicators of TMS strength: specialization, trust in expertise, and coordination.
Theoretical Underpinnings of a Text-Based Measure
We propose that a group’s interactions captured in the form of recorded text communications (e.g., instant messages, transcribed audio recordings, emails) are a source of information about the extent to which a group has developed a TMS. That is, the more group communications reflect the three indicators of TMS, the stronger we can infer the group’s TMS to be. For example, a phrase such as “I agree with that” might indicate that group members see each other as credible. If this type of communication occurs at high rates among members, it indicates a high degree of credibility within the group. Given that credibility is an indicator of TMS, the more often phrases reflecting credibility occur, the more certain we can be that a TMS has developed. Thus, our strategy for creating a new measure of TMS based on group communication is to adapt the current understanding of the indicators of group TMS to text communication. Measuring TMS via text communication avoids some of the downsides of survey methods, such as interruption and demand effects. As discussed in the prior section, TMS was originally measured by a qualitative assessment based on observation of behavior; our proposed measure of the TMS indicators builds directly on that original work.
Thus, we propose that groups may use language that suggests the presence of one of the three indicators of TMS, which we term Positive Specialization, Positive Credibility, and Positive Coordination. Positive Specialization statements indicate that a member has knowledge of some area, such as “I figured it out.” These messages would lead members to update their meta-knowledge, indicating that this member has a specific area of expertise. Positive Credibility statements likewise include a positive evaluation of another’s skills or decisions. Positive Coordination statements ensure that work keeps moving within a team and individuals are aware of and not interrupting each other’s tasks—for instance, statements such as “If you feel comfortable, go for it” or “I’ll take the next look.”
Previous work coding TMS from video has generally taken a global approach to assess the extent to which the TMS indicators of specialization, credibility, and coordination were present (e.g., Liang et al., 1995). We take a more micro approach and code each communication. In the process of developing our more fine-grained coding scheme, we encountered phrases that seemed to indicate the lack of a TMS indicator rather than its presence. Thus, we propose that groups can use language that suggests the absence of one of the three indicators of TMS, which we term Negative Specialization, Negative Credibility, and Negative Coordination. Negative Specialization statements, such as “I don’t know what that does,” indicate that an individual does not have expertise in some area. Negative Credibility statements are those in which an individual questions another’s skills or abilities. For example, a phrase such as “I personally disagree with the design choice of two search terms” could indicate that members do not trust each other’s expertise. Negative Coordination statements, such as “someone already did that,” indicate that a group is duplicating tasks and has thus failed to coordinate effectively. More information on the coding categories can be found in Supplemental Appendix A.
Additionally, we propose that some words may occur more often in phrases that are seen as suggestive of the presence of TMS indicators. For example, the following phrases may be coded as Positive Coordination: “Can you do this,” “Can I have control,” and “I think member A can help.” The word “can” occurs in all of these phrases. We may find that the word “can” or the phrase “can you” occurs frequently across phrases that are coded as positive coordination and less frequently in phrases that are coded as having other meanings. In that case, the occurrence of “can” could be used as a proxy for Positive Coordination. Once a list of relevant words or phrases related to each TMS indicator, called a dictionary, is developed, those words or phrases could be counted systematically and automatically in a group, producing an assessment of the extent to which communication indicative (or not indicative) of TMS strength occurs.
Methods
Overview
Our objective was to create an assessment of TMS that could be easily applied to group communication transcripts or logs. This objective narrowed our focus to creating a dictionary of terms related to the indicators of the extent to which a TMS has developed, much like the work from Pennebaker et al., who developed the Linguistic Inquiry and Word Count (LIWC) dictionaries to assess various psychological constructs such as affect and analytical thinking (Boyd et al., 2022; Pennebaker et al., 2015). Though various methods of text analysis exist, our hope was to create a practical assessment that could be easily applied to many contexts and with only limited technical proficiency needed by the user. This, again, led us toward an assessment of TMS indicators, which are not as context-dependent as knowledge-based indicators of TMS.
We used three steps in the development of this unobtrusive measure of TMS indicators. First, based on prior literature, we developed a theoretical model of the relationship of communications to the indicators of a developed TMS: specialization, credibility, and coordination. Second, we developed a coding scheme that was used by trained human coders to assess the occurrence of the types of communications indicative of TMS within transcripts of group interaction. The more communications that groups engage in that were coded as one of the positive indicators of TMS (or the fewer coded as one of the negative indicators of TMS), the more confident we are that the group has developed a TMS. This coding scheme is available in Supplemental Appendix A. Our third step was then to apply an algorithmic procedure to determine which words or phrases were more likely to be used in statements that were related to the indicators of TMS. Once these were identified, then “dictionaries” or lists of words and phrases could be counted within group transcripts to infer the strength of a group’s TMS. This approach of using machine learning trained on data created by human coders to generate a dictionary of terms is a unique contribution of this paper.
We used the standard measure of the indicators of group TMS (Lewis, 2003) to validate both the human-coded and dictionary-based methods. We also determined to what extent these measures predict group performance compared to the Lewis (2003) measure. Through these and additional analyses, we assessed the construct validity of the text-based measure of TMS, using procedures described in Short et al. (2010) for computer-aided text analyses.
Data and Context
Data from four similar but distinct experiments were used to develop and compare these assessments of the indicators of TMS (see Table 1 for a summary of these studies). These studies were designed to test hypotheses different from those in the current paper but are appropriate for our purpose because all studies have similar tasks and all group members communicated via an instant messenger with other group members who were in separate rooms. Thus, we have a record of all communication that occurred among study participants. In Study 1, groups of four individuals performed a collaborative computing task using a graphical programming language. Participants had simultaneous access to the work environment and communicated with one another throughout the experiment through instant messages. As all communications between members occurred through the instant messaging client, a transcript of each person-to-person communication was captured throughout the entire work period. The second study again used groups of four individuals and the same graphical programming language, though with a different task. The third study used a different programming language and groups of three but was otherwise similar to the first two studies. Lastly, a fourth study used the same programming language as Study 1 and Study 2 and also used groups of four, but was shorter in duration.
Table 1. Overview of Data.
Note. See Supplemental Appendix E for more information on these datasets.
For each of the four experiments, a measure of TMS using Lewis’s (2003) scale and a measure of performance, based on the number of errors in completed work, was available. The length of the experiments varied, but all included at least 30 min of direct interaction. See Table 1 for a comparison of the various datasets used in this paper and Table 2 for means and correlations of the main variables. 1 The combined datasets contain 262 groups composed of 1,091 individuals. Analyses were done on two sub-samples of the data. The 120 groups that were coded by humans will be referred to as the “training” data, and the 142 groups that were not coded by humans will be referred to as the “test” data. The 120 groups selected to be reviewed by human coders represent the first 50 groups to participate in Study 1, the first 46 groups in Study 2, and all 24 groups in Study 3. They were not selected based on experimental condition, which was randomly assigned, or performance.
Table 2. Means and Correlations for Studies 1 to 4.
Note. As multiple studies were combined, this correlation table presents pooled correlations using fixed effects and inverse variance weighting.
p < .1. **p < .01. ***p < .001.
An external dataset was also used to test external validity. We obtained data from a separate research team from an experiment that contained 61 groups of three members. The Lewis (2003) measure of TMS and a measure of performance were captured in this study.
Human Coding and Content Analysis
We used the process described in Weingart et al. (2004) to develop the coding scheme drawing on constructs presented in Liang et al. (1995) and ensuring that these constructs reasonably matched those measured in Lewis’s (2003) survey. Because our coding scheme was developed to assess the indicators of TMS within written language, the definitions were slightly expanded to better accommodate how groups communicate. As these are instant messenger communications, the default unit was the individual instant message; however, sometimes an idea spanned multiple messages, or a message contained more than one idea. Before the human coding, each transcript in Study 1 and Study 2 was unitized based on communications that contained an individual idea (Weingart, 2012). For Study 1, two coders—the first author and a research assistant who later also applied the coding scheme to Study 1—separately unitized a set of 20 groups and achieved acceptable reliability (a Guetzkow’s U of .025, indicating that the coders disagreed on the number of units by only 2.5%). The first author then unitized the rest of the transcripts prior to coding. The overall communication frequency was highly correlated and changed only a small amount between the unitized and the non-unitized transcripts. 2 Thus, Study 3 was not unitized prior to coding. 3 As Study 4 was not human-coded, it was not unitized.
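The unitizing reliability statistic reported above can be illustrated with a short sketch. It assumes the commonly cited form of Guetzkow's U, the absolute difference between the two coders' unit counts divided by their sum; the function name and the unit counts below are hypothetical and are not taken from the study data.

```python
def guetzkow_u(units_coder_1: int, units_coder_2: int) -> float:
    """Guetzkow's U: disagreement between two coders on the number of units.

    Assumes U = |O1 - O2| / (O1 + O2), where O1 and O2 are the unit counts
    produced by each coder for the same set of transcripts.
    """
    total = units_coder_1 + units_coder_2
    if total == 0:
        raise ValueError("At least one coder must identify a unit.")
    return abs(units_coder_1 - units_coder_2) / total


# Hypothetical counts: one coder finds 1,025 units, the other 975.
u = guetzkow_u(1025, 975)
print(f"Guetzkow's U = {u:.3f}")
```

Lower values indicate closer agreement on the number of units; a value of 0 means both coders identified the same number.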
The coding scheme for communications was created based on prior theory that groups with a TMS have three indicators of TMS including specialization, trust in each other’s expertise, and good coordination (Liang et al., 1995). We developed the coding scheme in Study 1 and then applied this coding scheme to Study 2 and Study 3, with slight modifications due to nuanced changes in the tasks. The entire coding scheme was composed of 15 different categories, six of which were directly related to indicators of TMS: Specialization (positive and negative), Credibility (positive and negative), and Coordination (positive and negative). The other nine categories (see Supplemental Appendix A for a fuller description) were generated in part to increase reliability by reducing ambiguity about which communications would be coded within the TMS-relevant categories (Weingart et al., 2004). Some of these other categories were exchanges of information unique to the experimental task, evaluations of overall quality of the task, and discussion of the experimental manipulations. The TMS-related phrases that were captured with this coding scheme were later used to develop the dictionaries described in the next section.
As described more fully in the theory, the six primary coding categories were related to the three indicators of TMS, with positive language and negative language captured separately. Positive coordination communications, for example, were communications in which individuals asked others about how they were working or told each other the next steps to take. Negative coordination communications, conversely, mentioned duplication in work or lack of turn-taking. A group’s transcripts were coded in chronological order so that context from the surrounding conversation could be considered in assessing statements. On average, about 51% of communication in the group was coded in one of the six codes directly related to indicators of TMS (44% positive, 7% negative). Groups that communicated with more language that contained positive indicators of TMS were anticipated to have a stronger TMS than groups that communicated fewer such statements. Similarly, groups that communicated with many statements containing language that is negatively indicative of TMS were assumed to have a weaker TMS than groups who only used a few statements that are negatively indicative of TMS. The entire coding scheme with additional examples is available in Supplemental Appendix A. 4
Using NVivo (a software package designed for qualitative coding), each unit of communication was assigned one of the available codes, and if there was ambiguity, the code that best represented the meaning of the communication based on context was applied. The counts of the occurrence of coded phrases were then aggregated to the group level and divided by the number of statements within each group, creating a proportion of each group’s communication that was related to each coding category. Lastly, we z-score-transformed these measures to make analyses comparable (Fetterman et al., 2015; Martindale, 1990). This process was applied to 120 groups across Study 1, Study 2, and Study 3.
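The aggregation and standardization steps described above could be sketched as follows. The data layout (a list of group and code pairs), the code labels, and the choice of the sample standard deviation for the z-score are all assumptions made for illustration; they are not specified in the text.

```python
from collections import Counter
from statistics import mean, stdev


def code_proportions(coded_units, code):
    """Share of each group's coded units that received `code`.

    coded_units: list of (group_id, code_label) pairs, one per unit.
    """
    totals = Counter(group for group, _ in coded_units)
    hits = Counter(group for group, label in coded_units if label == code)
    return {group: hits[group] / totals[group] for group in totals}


def z_scores(values):
    """Standardize values using the sample standard deviation (an assumption)."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        raise ValueError("Cannot standardize values with zero variance.")
    return [(v - mu) / sd for v in values]


# Hypothetical coded units for three groups.
coded_units = [
    ("A", "pos_coord"), ("A", "pos_coord"), ("A", "task_info"), ("A", "neg_coord"),
    ("B", "pos_coord"), ("B", "task_info"), ("B", "task_info"), ("B", "pos_cred"),
    ("C", "task_info"), ("C", "neg_spec"), ("C", "task_info"), ("C", "pos_cred"),
]

props = code_proportions(coded_units, "pos_coord")
standardized = z_scores(list(props.values()))
print(props)
print(standardized)
```

Dividing by each group's own unit count keeps talkative and quiet groups on the same scale before standardizing.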
Dictionary Development
Using the human-coded transcripts, we utilized a simple machine learning process to create dictionaries of words and phrases that commonly appeared in each coding category. This novel approach has the benefit of allowing relevant words to emerge inductively from the deductively coded data, combining both a theory-driven and a data-driven approach. We used this technique to create dictionaries of words and phrases that seem related to the behavioral indicators of TMS: specialization, credibility, and coordination. Thus, the dictionaries contain words that are used significantly more in statements that are coded as having one meaning (such as positive coordination) than in the other coded categories. These dictionaries can then be used as a form of Computer-Aided Text Analysis (CATA) to assess a group’s level of TMS.
Procedurally, we first created a list of all the statements that received a code from any of the coding categories. Then, for each coding category, we created a dictionary using the following steps. We generated a binary variable for each coded unit of communication that equaled one if the statement received the code of interest and zero if it did not. This variable allowed us to compare all the statements with the code with all statements that did not receive that code. We then prepared the list of statements as a “bag of words” dataset. 5 We turned the transcripts into a corpus and then normalized the statements by removing punctuation, special characters, numbers, and context-specific words, making all letters lower case, and stemming all words. 6 We then tokenized the remaining terms so that sequences of two or three words (bigrams and trigrams) could be considered in addition to single words (unigrams). Lastly, we weighted words based on their term frequency-inverse document frequency (TF-IDF) score, which adjusts words’ value based on their frequency of occurrence. Creating a frequency cut-off, adjusting TF-IDF, and many of the other modifications that can be made during corpus preparation had little impact on the final outcomes in our setting.
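A minimal sketch of the corpus preparation steps, using only the Python standard library, is shown here. It is an illustration under stated assumptions rather than the procedure used in the paper: stemming is omitted (no stemmer is available in the standard library), the TF-IDF variant (relative term frequency times the natural log of inverse document frequency) is one of several common definitions, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(statement, drop_words=frozenset()):
    """Lowercase, replace punctuation and digits with spaces, drop listed words."""
    text = re.sub(r"[^a-z\s]", " ", statement.lower())
    return [word for word in text.split() if word not in drop_words]


def ngrams(tokens, max_n=3):
    """Return all unigrams, bigrams, and trigrams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tf_idf(documents):
    """Weight each term by relative frequency times log inverse document frequency."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


statements = ["Can you do this?", "Can I have control"]
documents = [ngrams(normalize(s)) for s in statements]
weights = tf_idf(documents)
print(documents[0])
print(weights[0]["can you"])
```

Under this weighting, a term that appears in every statement (here, "can") receives a weight of zero, which is why the choice of weighting and frequency cut-offs can matter in small corpora.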
We then applied Naïve Bayes, a simple classification algorithm, to estimate conditional probabilities that indicate which words are significant in discriminating between the various codes that were applied to statements (Mladenic & Grobelnik, 1999). This algorithm, for example, identified that the phrase “an idea” is more likely to be in statements that were coded as specialization than to be in statements not coded as specialization. The Naïve Bayes algorithm assumes conditional independence across words and provides estimates of the likelihood that an observation (in our case, a statement) falls into each category (e.g., a statement coded as specialization or coded as something else). The Naïve Bayes classifier assigns each statement to the category with the highest estimated probability. The words or phrases that most strongly tip the Naïve Bayes estimator in one direction or another for each classification can be identified by the conditional probability of them appearing in a statement given the associated statement’s classification (such as positive specialization or not). These conditional probabilities are calculated as follows, using the phrase “an idea” as an example: P(“an idea” appears in statement | statement has positive specialization code) = [Total number of statements with both (1) “an idea” in them and (2) a positive specialization code] / [Total number of statements that have a positive specialization code].
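The conditional probability defined above can be computed directly from coded statements. The sketch below assumes each statement is represented as a set of terms paired with its code label; the function name, labels, and example statements are hypothetical.

```python
def conditional_probability(term, statements, code):
    """P(term appears in statement | statement has `code`).

    statements: list of (set_of_terms, code_label) pairs.
    """
    with_code = [terms for terms, label in statements if label == code]
    if not with_code:
        return 0.0
    return sum(term in terms for terms in with_code) / len(with_code)


# Hypothetical coded statements.
statements = [
    ({"i", "have", "an idea"}, "pos_spec"),
    ({"i", "figured", "it", "out"}, "pos_spec"),
    ({"i", "know", "how"}, "pos_spec"),
    ({"let", "me", "try"}, "pos_spec"),
    ({"i", "agree"}, "pos_cred"),
]

p = conditional_probability("an idea", statements, "pos_spec")
print(p)
```

Here one of the four statements coded as positive specialization contains "an idea", so the estimate is one in four.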
We generated the conditional probabilities described above for each word and short phrase that appeared in any statement. This allowed us to compute the ratio of the conditional probability that the phrase appeared in a statement coded as one category, such as “positive specialization,” to the conditional probability that the phrase appeared in a statement that was not coded as that category. These ratios indicate how common a word is in statements with a “positive specialization” code relative to all the other statements. A ratio above 1 indicates that the word or phrase is relatively more common in statements with a “positive specialization” code than in statements with other codes. We included words or phrases with a ratio of 1 or higher in the dictionary for that category. Note that words or phrases could be included in multiple dictionaries, as this method determines whether terms are statistically more likely to be in one coded category versus the average of all other categories. We followed these steps for each category to arrive at a list of words and phrases for each coded category. Our next decision rule on when to include a word or phrase in a TMS dictionary accounted for its rarity across the statements. A word or phrase appearing infrequently but consistently having been coded in one category may appear to be strongly related to that category simply by chance. We reduced noise by setting thresholds for word or phrase frequency scaled to the prevalence of words and phrases within that coding category. 7
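The ratio rule and the frequency screen can be sketched together. This is an illustration only: the fixed minimum count stands in for the paper's prevalence-scaled thresholds, whose exact form is not given here, and the function name, parameters, and example statements are hypothetical.

```python
import math


def build_dictionary(statements, code, min_ratio=1.0, min_count=2):
    """Select terms that are relatively more common in statements with `code`.

    statements: list of (set_of_terms, code_label) pairs.
    min_count is a hypothetical stand-in for a prevalence-scaled threshold.
    """
    in_code = [terms for terms, label in statements if label == code]
    out_code = [terms for terms, label in statements if label != code]
    vocabulary = set().union(*in_code)
    selected = {}
    for term in vocabulary:
        n_in = sum(term in terms for terms in in_code)
        if n_in < min_count:
            continue  # too rare within the category to be reliable
        n_out = sum(term in terms for terms in out_code)
        p_in = n_in / len(in_code)
        p_out = n_out / len(out_code) if out_code else 0.0
        ratio = math.inf if p_out == 0 else p_in / p_out
        if ratio >= min_ratio:
            selected[term] = ratio
    return selected


# Hypothetical coded statements.
statements = [
    ({"an idea", "i"}, "pos_spec"),
    ({"an idea", "have"}, "pos_spec"),
    ({"i", "know"}, "pos_spec"),
    ({"i", "agree"}, "pos_cred"),
    ({"an idea", "good", "i"}, "pos_cred"),
    ({"can", "you", "i"}, "pos_coord"),
]

pos_spec_dictionary = build_dictionary(statements, "pos_spec")
print(pos_spec_dictionary)
```

In this toy data, "an idea" is kept because it is twice as likely in positive specialization statements as elsewhere, "i" is dropped because it is more common elsewhere, and "have" and "know" are dropped as too rare.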
Thus, we created a list of words for each coded category that reflected the dominant words that reliably occurred within those categories. The resulting lists vary in length across the coding categories, from 45 to 102 words or phrases. As is described in the factor structure section of the paper, we found a two-factor structure with the positive and negative indicators aligning on separate factors. To adapt the dictionary to accommodate this, we aggregated the words across all coding categories that are positively linked to TMS, to create a “Positive TMS” dictionary. We did the same with the negatively-linked words to create a “Negative TMS” dictionary. Words were included in the overall positive dictionary only if they appeared in at least one positive coding category and did not appear in any negative TMS coding categories; the converse rule was used for the negative dictionary. The word “ask,” for example, is associated only with the positive coordination category and no other; thus, it is counted in the positive TMS dictionary. The word “out,” however, is associated both with positive specialization and negative specialization; thus, it is not included in either the overall positive or negative TMS dictionaries. The overall positive and negative TMS dictionaries are therefore more conservative than the individual categories. The positive TMS dictionary contains 141 words and phrases, and the negative TMS dictionary contains 117 words and phrases. See Supplemental Appendix C for a list of all items contained in each dictionary. We use the overall dictionaries in all subsequent analyses.
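The rule for merging the category dictionaries into overall positive and negative dictionaries amounts to two set differences. The sketch below uses small hypothetical term sets (only "ask" and "out" come from the examples in the text) and hypothetical category labels.

```python
def combine_dictionaries(category_terms, positive_codes, negative_codes):
    """Build overall dictionaries, dropping terms found on both sides.

    category_terms: mapping from category label to a set of terms.
    """
    positive = set().union(*(category_terms[c] for c in positive_codes))
    negative = set().union(*(category_terms[c] for c in negative_codes))
    return positive - negative, negative - positive


# Hypothetical category dictionaries.
category_terms = {
    "pos_spec": {"out", "an idea"},
    "pos_cred": {"agree"},
    "pos_coord": {"ask"},
    "neg_spec": {"out", "dont know"},
    "neg_cred": {"disagree"},
    "neg_coord": {"already"},
}

positive_tms, negative_tms = combine_dictionaries(
    category_terms,
    positive_codes=["pos_spec", "pos_cred", "pos_coord"],
    negative_codes=["neg_spec", "neg_cred", "neg_coord"],
)
print(sorted(positive_tms))
print(sorted(negative_tms))
```

As in the text, "ask" lands in the positive dictionary, while "out" is excluded from both because it appears in a positive and a negative category.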
Using LIWC-22 (Boyd et al., 2022), the quanteda package in R 8 (Benoit, 2018), or another standard text analysis package, one could reference these dictionaries to assess the percent of words in a transcript that are related to six categories of language that indicate the extent to which a group has developed a TMS. We used LIWC-22 to count these dictionary values and z-scored these to make the results comparable across dictionaries and with the human-coded variables. An advantage of the dictionaries is that any group transcript can be assessed; thus, the total number of groups that can be included in our analyses is larger than those that were included in the analyses on the human-coded statements alone. We therefore created two independent samples that we used in subsequent analyses. First, we analyzed data on the sample of groups that had statements coded by humans (N = 120). This allows for a direct comparison of the human coding method to the dictionary method in assessing the group’s level of TMS as assessed by Lewis’s survey (2003). Importantly, neither the human-coded nor dictionary-based methods were based on the groups’ scores on Lewis’s (2003) measure; however, we anticipate that all three measures of TMS should be assessing the same underlying indicators of the extent to which a TMS has developed. The dictionary assessment can also be applied to transcripts that were not human-coded. We reserved some groups from two of the prior studies and added a new experiment to create the “test” set (N = 142). All dictionary-based variables were z-score-transformed before being used in analyses, to make them comparable.
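For readers without access to a text analysis package, the dictionary count can be approximated in a few lines. This sketch is not LIWC-22 or quanteda: it assumes simple lowercase tokenization, exact matching of words and multiword phrases, and no stemming or wildcards, so its scores may differ from those produced by either tool. The function name, transcript, and dictionary are hypothetical.

```python
import re


def dictionary_percent(transcript, dictionary):
    """Percent of a transcript's words that match dictionary entries.

    Multiword entries are matched as consecutive tokens; each match counts once.
    """
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    max_len = max((len(entry.split()) for entry in dictionary), default=0)
    hits = 0
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]) in dictionary:
                hits += 1
    return 100.0 * hits / len(tokens)


# Hypothetical transcript and dictionary.
score = dictionary_percent("Can you ask him? I can ask.", {"ask", "can you"})
print(round(score, 2))
```

The resulting percentages can then be standardized across groups before being compared with other measures.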
Lewis’s Measure of TMS
Lewis’s (2003) measure of TMS is composed of three 5-item scales, each measuring one of the indicators of a developed TMS within a group: specialization, credibility, and coordination. We assessed reliability on all cases across all four studies together. Cronbach’s alpha—a measure of internal scale consistency—was high (α = .84). A confirmatory factor analysis was performed to determine whether the three subscales formed three different factors, as the initial paper found. This model explained the data reasonably well (chi-squared = 607.53, p < .001, RMSEA = .08). Because the Lewis (2003) scale is an individual-level measure of a group variable, we must also assess the extent to which within-group agreement exists to justify aggregating the variable. Average rwg(j) was high, .93, as were the ICC values, ICC(1) = .41, ICC(2) = .73, p < .001, suggesting that members in the group agreed on their ratings of TMS (LeBreton & Senter, 2008). The Lewis measure was also z-score-transformed.
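As an illustration of the internal consistency statistic reported above, Cronbach's alpha can be computed from item-level responses with the standard formula, k / (k - 1) multiplied by one minus the ratio of summed item variances to the variance of total scores. The responses below are hypothetical and use three items rather than the 15 items of the Lewis (2003) scale.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item scores.

    responses: list of equal-length lists, one list of item scores per respondent.
    """
    k = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from four people on three items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 5],
    [2, 2, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```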
Group Performance
All studies used a similar measure of group performance, errors on the graphical programming task. Errors were defined as the number of modifications that would need to be made to change the program the group produced into a correctly working program. Typical errors were incorrect settings within aspects of the program, missing parts of the program, or incorrectly used parts. In Study 1, a portion of the groups had their errors assessed by two coders who generally agreed (N = 70, κ = .72, p < .001). In Study 2, all groups had their errors assessed by two coders who also had good agreement (κ = .74, p < .001). Reliability for errors in Study 3 and Study 4 was not available, but errors were calculated consistently with Study 1 and Study 2 and were based on relatively objective criteria.
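The agreement statistic for the two error coders can be illustrated with a short sketch. It assumes that κ refers to unweighted Cohen's kappa computed on the coders' error counts treated as categories, which the text does not state explicitly; the ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_1, ratings_2):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(ratings_1)
    observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical error counts assigned by two coders to eight groups.
coder_1 = [0, 0, 1, 1, 2, 2, 0, 1]
coder_2 = [0, 0, 1, 1, 2, 1, 0, 1]

kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate (seven of eight cases here) for the agreement expected by chance given each coder's distribution of ratings.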
Results
Means and correlations of variables are presented in Table 2. The Lewis TMS measure correlates positively with the positive TMS dictionary and negatively with both the human-coded and dictionary-based negative TMS. Our performance measure, errors, correlates negatively with both the Lewis TMS measure and with the positive TMS dictionary measure, indicating that increases in TMS, as measured by the Lewis scale or the positive TMS dictionaries, are associated with decreases in errors. The errors variable is marginally (p < .1) and positively correlated with the negative TMS dictionary. Thus, the correlations of the text-based measures of TMS with Lewis’s measure of TMS and performance are in the anticipated directions.
The human-coded positive and negative TMS scores are marginally negatively correlated (p < .1), whereas the positive and negative dictionary measures are not correlated with each other. Interestingly, we observe cross-method correlations: the positive human-coded and positive dictionary-based TMS measures are positively correlated, and the positive human-coded measure and the negative dictionary measure are negatively correlated. Similarly, the negative human-coded TMS score correlates negatively with the positive TMS dictionary score and positively with the negative TMS dictionary score. Thus, apart from the lack of a correlation between the positive and negative TMS dictionaries, the positive and negative TMS measures are negatively related, as expected.
Below we present analyses following the advice from Short et al. (2010) in establishing a CATA measure. We first establish the reliability of the human-coding scheme through the comparison of multiple independent raters. We then establish the factor structure of both the human- and machine-coded measures of TMS. We then compare both methods of assessing TMS to the previously established survey measure of TMS (Lewis, 2003) and to TMS’s effects on group performance. We establish discriminant validity through the comparison of the dictionary measure of TMS developed in this paper to other CATA measures. Lastly, we use the dictionary-based measure on an independent dataset to help establish external validity.
Rater Consistency–Reliability for Human Coding
Transcripts were imported into the coding program NVivo, and two coders determined to which category each unit of communication belonged. One of the coders was the first author and the other was a research assistant, both of whom were very familiar with the experimental context. Coders were trained by applying the developed coding scheme on a small number of groups from pilot data, refining the definition of the codes and coming to agreement when there was ambiguity. Once sufficient agreement was reached on the pilot data, the two coders independently applied the coding scheme to a randomly chosen subset of the data from Study 1 (n = 21). Agreement on the coding of transcripts is notably difficult to attain, but the two coders achieved “fair to good” agreement above chance (Cohen’s Kappa = .56; percent agreement = 94.6) by several standards (De Wever et al., 2006; Fleiss, 1971; Krippendorff, 1980; Landis & Koch, 1977; Neuendorf, 2002). The first author then coded a subset of the remaining groups from Study 1, for a total of 50 groups. Subsets of the second (N = 46) and third (N = 24) studies were coded by the first author and a third coder who had been trained in the same method. These coders also reached “fair to good” reliability on a training set (Cohen’s Kappa = .56; percent agreement = 92.8). Agreement levels were on the lower end of acceptable, which is likely due in part to the focus of the unit of analysis on the individual instant message where context can lead to different attributions. The benefit of coding at this level, however, was to create much finer grained codes than if larger pieces of the transcripts were assessed.
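The gap between percent agreement and Cohen’s kappa reported above follows from how kappa is defined: it discounts the agreement that two raters would reach by chance given how often each uses every category. A minimal Python sketch of the standard two-rater formula is given below; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each labeled the same units."""
    n = len(rater_a)
    # Proportion of units on which the two raters chose the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal code frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical codes for four units of communication.
    coder_1 = ["coordination", "coordination", "none", "none"]
    coder_2 = ["coordination", "coordination", "none", "coordination"]
    print(cohens_kappa(coder_1, coder_2))
```

When one code is much more common than the others, chance agreement is high, so a high percent agreement can coexist with a moderate kappa. This is a general property of the statistic rather than a claim about the distribution of codes in these data.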
An average of 51% of the human-coded communications that groups engaged in were related to the 6 TMS indicators assessed (about 44% positive indicators and 7% negative indicators). For the dictionary method, within the training data, 48.9% of the text was related to one of the 6 TMS indicators, with 21.8% of the text appearing only in the positive TMS dictionaries and 12.7% of the text appearing only in the negative TMS dictionaries.
Factor Structure
As we developed the human-coding scheme to match the three indicators of TMS from Lewis (2003), we anticipated that the human-coded and subsequent dictionary-based measures would also have these three dimensions. In neither case, however, did a confirmatory factor analysis (CFA) of a three-factor model (with the positive and negative aspects of each indicator aligning on one factor) fit the data well. This was in part because the three latent variables, one for each of the three indicators of TMS, were very highly correlated, as is frequently the case with the Lewis (2003) survey method (Ren & Argote, 2011). A two-factor solution with a positive and a negative dimension of TMS, however, fits the human-coded data fairly well, N = 120, CFI = 0.948, TLI = 0.844, RMSEA = .083 (p = .220), SRMR = 0.073, and fits the dictionary data to some extent, N = 262, CFI = 0.795, TLI = 0.616, RMSEA = .229 (p < .001), SRMR = 0.094. Thus, the use of the aggregate positive and negative measures of TMS indicators seems better supported than the use of the three indicators aggregated separately. Groups with high positive TMS values have many communications indicating specialization, coordination, and credibility among members, suggesting that they have developed a TMS. Conversely, groups with high negative TMS values have many communications indicating that they lack or are unaware of specialization, that they have difficulty coordinating, and that they disagree with the ideas or expertise in the group.
Convergent Validity
To assess the overall convergent validity of the corpus of statements that were coded, we used OLS regressions including our variables of interest and controls in predicting Lewis’s (2003) measure of TMS; see Table 3, Model 1. As all predictors used in the regressions were z-score-transformed, their relative effects can be understood from the size of their coefficients. The positive TMS variable derived from human coding was not significantly related to Lewis’s (2003) survey measure of TMS (β = .03, p = .821), which was counter to expectations. The negative TMS variable, however, was negatively and significantly related to Lewis’s measure of TMS (β = −.47, p < .001), suggesting that, as expected, negative statements about TMS indicators were associated with lower reported levels of TMS. The η2 effect size for negative TMS in predicting the Lewis measure of TMS was .21, suggesting a large effect.
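To make concrete why z-scored variables yield directly comparable coefficients, the sketch below fits an ordinary least squares model with the standard library only, solving the normal equations by Gaussian elimination. It is an illustration, not the analysis code behind Table 3: the study controls are omitted, no standard errors are computed, and all function names are hypothetical.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize a variable to mean 0 and sample standard deviation 1."""
    center, spread = mean(values), stdev(values)
    return [(v - center) / spread for v in values]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols(outcome, predictors):
    """Return [intercept, b1, b2, ...] minimizing squared error.

    `predictors` is a list of columns, one per predictor.
    """
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve(xtx, xty)
```

Passing `zscore(...)` versions of the outcome and of each predictor to `ols` gives coefficients in standard deviation units, so a coefficient of −.47 can be read as a larger association than one of .03 without reference to the original scales.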
Table 3. Regressions Predicting Lewis’s (2003) Assessment of TMS.
Note. Values in parentheses are standard errors.
*p < .05. **p < .01. ***p < .001.
We then repeated these analyses using the variables created with the dictionaries. We started with the 120 groups that had been used as a training set to create the dictionaries (see Table 3, Model 2). The positive TMS dictionary was positively and significantly related to the Lewis measure of TMS (β = .25, p = .011), and the negative TMS dictionary was significantly and negatively related to the Lewis measure of TMS (β = −.25, p = .013). The η2 effect sizes for positive TMS and negative TMS were both .05, a traditional “medium” effect size. Figure 1 provides a graphical representation of the correlations between these different measures of TMS derived from language and the Lewis (2003) measure. This figure demonstrates a strong overlap between the two methods (human coding and dictionary), which would be expected because the human coding was used to develop the dictionaries.

Figure 1. Comparison of language-based measures of TMS and Lewis’s survey–training data.
In addition to the 120 groups that were coded by humans, we had access to 142 independent cases from the experiments from which the training data were obtained, as well as cases from a different experiment. This “test” set of groups offered us the opportunity to use our dictionaries on an independent set of cases, which provides insight into how well our dictionaries perform when applied to groups different from those used to create the dictionaries. Figure 2 presents a scatterplot of these dictionary assessments and Lewis’s (2003) measure in the test data. Both dictionary variables significantly predict Lewis’s measure of TMS in the anticipated directions: the positive TMS dictionary (β = .20, p = .006) and the negative TMS dictionary (β = −.14, p = .013); see Table 3, Model 3. Both positive TMS (η2 = .05) and negative TMS (η2 = .04) are medium-sized effects. These analyses demonstrate some level of convergent validity, primarily for the dictionary-based assessments, in predicting the Lewis (2003) measure of TMS.

Figure 2. Comparison of language-based measures of TMS and Lewis’s survey–test data.
Predictive Validity
We predicted group performance using both the human-coded and dictionary-based measures as well as the Lewis measure of TMS, including study controls to account for differences between studies. For the training sample (N = 120), neither the human-coded measures nor the dictionary-based measures were related to errors. Lewis’s (2003) measure was a significant negative predictor of errors when it was the only predictor (β = −4.65, p < .001), when the human-coded measures of TMS were added to the model (β = −5.13, p < .001), and when the dictionary-based measures of TMS were added to the model (β = −4.75, p < .001). In each of these models, the η2 effect size for Lewis’s measure was between .16 and .18, suggesting a large effect on performance.
Results for the test sample (N = 142), including study controls, are shown in Table 4. This sample contains independent cases from the experiments from which the training data were obtained, as well as cases from a different experiment. Lewis’s (2003) measure of TMS is a significant negative predictor of errors (β = −2.28, p = .022; see Table 4, Model 1) with an η2 effect size of .04. In Table 4, Model 2, when the aggregate dictionaries are included as predictors of errors, the positive dictionary is negatively related to errors (β = −1.74, p = .043) and has an η2 effect size of .03, suggesting a small effect. If Lewis’s measure of TMS is added (see Model 3), the size and significance levels of the positive TMS dictionary decrease (β = −1.36, p = .118) relative to Model 2 and the size and significance of the Lewis measure also decrease somewhat relative to Model 1 (β = −1.91, p = .063), which suggests that the two variables are capturing some overlapping variance in errors.
Table 4. Regressions Predicting Performance (Errors).
Note. Values in parentheses are standard errors.
p < .1. *p < .05.
We performed F-tests to determine whether one set of variables captures additional variance in the model. Model 3, in which the TMS dictionaries are added to the Lewis (2003) measure in predicting errors, did not fit the data better than Model 1, F(2, 136) = 1.24, p = .294. When Lewis’s measure is added in Model 3, compared to Model 2, it adds only a marginal amount of explanatory power, F(1, 136) = 3.51, p = .063. This pattern of results suggests that the dictionary measures capture variance similar to that captured by Lewis’s (2003) measure in the latent construct that affects group performance, though Lewis’s (2003) measure may be a somewhat better predictor of performance. As we expand on in the Discussion section, administering detailed surveys is not possible in many scenarios, so text-based analysis methods could provide insight in their place.
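The model comparisons above are nested-model F-tests. The Python sketch below shows the standard statistic in two equivalent forms, from residual sums of squares and from R2 values; the function names and the numbers in the tests are hypothetical, and computing the p-value (which requires the F distribution) is left to a statistics package.

```python
def nested_f_from_rss(rss_restricted, rss_full, n_added, df_residual_full):
    """F statistic comparing a restricted model to a full model.

    `n_added` is the number of predictors added in the full model and
    `df_residual_full` is the full model's residual degrees of freedom
    (sample size minus the number of estimated coefficients).
    """
    gain_per_predictor = (rss_restricted - rss_full) / n_added
    error_variance = rss_full / df_residual_full
    return gain_per_predictor / error_variance


def nested_f_from_r2(r2_restricted, r2_full, n_added, df_residual_full):
    """The same statistic written in terms of unadjusted R-squared values."""
    gain_per_predictor = (r2_full - r2_restricted) / n_added
    unexplained_per_df = (1 - r2_full) / df_residual_full
    return gain_per_predictor / unexplained_per_df
```

When a single predictor is added, this F statistic equals the square of that predictor’s t statistic in the full model. This is consistent with the F-test for adding Lewis’s measure (p = .063) matching the p-value reported for the Lewis coefficient in Table 4, Model 3.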
Discriminant Validity
LIWC-22, which was used to calculate the dictionary-based measure of TMS, also has its own internal dictionaries intended to measure aspects of communication such as emotional tone, analytic thinking, clout (defined as leadership or confidence), and authenticity (Boyd et al., 2022). None of these dictionaries is intended to measure TMS, and the concepts they assess are conceptually distinct from TMS. Thus, one test of the discriminant validity of our measure is whether other text-based assessments that are measured similarly but are theoretically different perform differently in predicting TMS.
If the four summary variables from LIWC-22’s internal dictionary (emotional tone, analytic thinking, clout, and authentic) are added to Model 2 in Table 3, which contains the dictionary measures of TMS predicting Lewis’s (2003) measure of TMS using the training sample, these four variables are all at least marginally statistically significant. The significant effect of the positive TMS dictionary in predicting Lewis’s (2003) measure of TMS is still present and increases in strength (β = .37, p = .001), though the negative effect of the negative TMS dictionary decreases to non-significance (β = −.11, p = .355). In the test sample, however, in predicting Lewis’s measure of TMS, none of the summary variables from LIWC are significant, and both positive and negative TMS dictionaries remain significant in the anticipated directions. If the summary dictionaries from LIWC-22 are included in models predicting performance, none of these variables predict errors. These results suggest that, though these LIWC internal dictionaries may capture some overlapping variance in the relationship of the negative TMS dictionary measure to the Lewis (2003) measure of TMS, including the LIWC internal dictionaries does not affect the relationship of the positive dictionary measure to Lewis’s measure of TMS or to group performance.
External Validity–Analysis on Independently Collected Data
Independent Lab Study
A comparison of the training and test datasets suggested that the dictionary assessment is a promising measure of the extent to which a TMS has developed, but we sought additional, independent data to investigate whether our dictionary measure of TMS was externally valid. We obtained another dataset (Gupta & Woolley, 2018), in which participants in groups of three worked within a laboratory setting on a decision-making problem and communicated through a chat client. Individuals divided their attention between two projects, and their partners on those projects were either the same (single-team membership) or different (multi-team membership). The experimenter measured TMS using the Lewis (2003) measure. We assessed transcripts for the 61 groups in the study using the TMS dictionaries, controlling for the experimental manipulations; see Table 5, Models 1 and 2. In predicting the Lewis measure of TMS, the positive TMS dictionary was positive and significant (β = .35, p = .017), and the negative TMS dictionary was not significant (β = −.03, p = .808). The adjusted R2 of the model increased from .17 with just the manipulation controls (Model 1) to .23 once the aggregate dictionaries were included (Model 2), and an F-test suggests that the dictionary measures explained significant additional variance in the Lewis TMS survey, F(2, 56) = 3.32, p = .043.
Table 5. Regressions Predicting TMS and Performance for the Independent Data.
Note. Values in parentheses are standard errors.
p < .1. *p < .05. **p < .01.
Lewis’s measure of TMS was a significant predictor of group performance (profit on a simulation) in the experiment (β = 50.11, p = .003); see Table 5, Model 3. When the positive and negative TMS dictionaries are included in Model 4, the positive dictionary is significant (β = 51.35, p = .008). When the Lewis measure is re-added in Model 5, it remains significant though smaller in magnitude and significance (β = 39.39, p = .024) than in Model 3, and the positive TMS dictionary similarly becomes smaller in magnitude and less significant (β = 37.64, p = .051), suggesting that both the Lewis measure and the TMS dictionary explain some overlapping variance in group performance. The positive dictionary appears to be related to both the Lewis (2003) measure of TMS and to group performance on this independent data.
Discussion
In this paper, we have described a theoretically derived, automated, and unobtrusive assessment of TMS that is significantly correlated with the most frequently used survey measure. With the rise of virtual teaming and the accessibility of transcript data, we recognized the need for TMS measures that could accommodate teams utilizing these new means of collaboration and the measurement opportunities afforded by the communication data the teams provided. The use of group conversations could provide valuable insights into TMS, especially when no other measure is available. Administering existing survey measures of transactive memory systems interrupts a team performing its task, which is not feasible in some contexts. A greater range of groups could be studied if assessments of TMS indicators could be conducted unobtrusively. This work is the first step toward developing a text-based measure of TMS strength based on group communication transcripts.
Our primary contribution in this paper is the creation and validation of a dictionary to measure the extent to which TMS has developed through its indicators. This is a form of computer-aided text analysis (CATA) as described in Short et al. (2010). We followed the Short et al. (2010) recommendations to demonstrate reasonable construct validity by determining the content validity, dimensionality, convergent validity, predictive validity, external validity, and reliability of the measure. Other recent papers in the measurement of group constructs with CATA (see Mathieu et al., 2021) have also followed these recommendations. Our inductive development of a list of relevant words using machine-learning analysis rather than using expert coders was also novel. The leveraging of machine-learning we demonstrate here could allow for easier scaling of CATA for measurements of other group constructs, as the traditional means of creating dictionaries to measure constructs can be costly (see Mathieu et al., 2021).
In the results section, we assessed convergent, predictive, and external validity by predicting the Lewis (2003) TMS measure and group performance with these new measures of TMS. For both the human-coded and the dictionary-based measures of TMS, we found stronger support for aggregating into positive and negative measures than for aggregating within each of the three indicators. In general, we found that the dictionary-based assessment of TMS was related to an existing measure of TMS, the Lewis scale. Further, the Lewis scale (Model 1, Table 4) and the positive TMS measure from the dictionary (Model 2, Table 4) were both significant predictors of performance in the test sample. When these two predictors were included in the same model predicting group performance (see Model 3, Table 4), the magnitude and significance of both predictors declined, suggesting that they explained some common variance. The negative TMS dictionary was not significantly predictive of performance, but controlling for this variable increased the strength of the relationship of positive TMS language to performance and improved model fit.
One plausible explanation why the negative indicators of TMS dictionary were not as related to performance as the positive indicators of TMS dictionary is the importance of context in understanding negative interactions. The content of communications that are negative may be more dependent on the surrounding statements for their negative meaning to be realized than positive statements. For example, the words “hold on one moment” could be used as a means of coordination, but if it is followed or preceded by “stop working,” then both of those phrases suggest some problem with the group’s coordination. The words themselves, however, may not be used in negative interactions enough for the machine learning algorithm to be able to give a negative coordination value to any of those individual words or phrases. Another reason that the negative indicators were not related to performance could be their low occurrence rate, with 6 times fewer negative TMS statements compared to positive TMS statements.
Theoretical Implications
Our results indicate that having more of certain types of communication is indeed indicative of the extent of the development of a TMS. This paper is novel in demonstrating that the specific words included in conversations within a group can be used to assess the group’s level of TMS, and that this assessment can be semi-automated with computers, removing a barrier to unobtrusive assessment of TMS. The development of this measure also provides additional tools to researchers in assessing TMS, a concept that is growing in popularity (Bachrach et al., 2019; Ren & Argote, 2011).
As we previously described, Lewis’s (2003) survey-based measure of TMS was a somewhat better predictor of team performance than the text-based measures in our data. TMS is a cognitive concept, so it is unsurprising that a survey measure that captures the thinking of individuals is a superior assessment of TMS compared to a purely observational measure. That said, in many situations, as we have discussed throughout the paper, survey measures of TMS may not be possible, and our text-based measure could enable the assessment of TMS.
We hope that the techniques we developed for the assessment of the indicators of the development of TMS will expand the study of TMS. Being able to assess TMS within groups without needing respondents to fill out a survey would allow many more types of groups to be assessed and could decrease the cost of research on TMS. Computer-aided text analysis (CATA) such as the Linguistic Inquiry and Word Count, or LIWC (Pennebaker et al., 2015) has been gaining prevalence in assessing psychological characteristics of individuals based on the words they use. Previous research on computerized measures has demonstrated that even complicated phenomena such as organizational culture (Pandey & Pandey, 2017) can be usefully assessed with these types of techniques. Our dictionary of language that is indicative of the extent of the development of TMS can be easily used in LIWC or the quanteda R package (Benoit, 2018).
Limitations
Our approach to assessing TMS based on text is promising but has limitations and room for improvement. This work focused on groups communicating solely in English; thus, the dictionary would likely be most useful for the examination of TMS in other groups communicating in English. The coding scheme, however, would likely be applicable to other contexts or languages with thoughtful translation. If the coding scheme were redeveloped in a different language and applied to group transcripts, a new dictionary could be developed using the code we provide in Supplemental Appendix D. Additionally, groups in our contexts were allowed to communicate only via text message. Thus, our coding scheme or dictionary may not apply to groups whose conversations are verbal and/or who have access to non-verbal cues. In the groups we studied, their primary means of interaction was text communication, which may have increased its relative importance. It is thus possible that the relationship of our text-based measures of TMS to a group’s true TMS or performance may be weaker in face-to-face teams. This would be a fruitful direction for additional investigation.
Though we have attempted to demonstrate external validity through the analysis of an external dataset, it is possible that the coding scheme or dictionaries may be less related to TMS or group performance in other contexts. If the coding scheme is applied to group transcripts in a particular context, however, a dictionary could be created using the techniques demonstrated here. Thus, though the specific dictionary used in this paper might not be applicable in all group settings, the techniques described in this paper should be useful to other researchers. Further research is needed to determine the generalizability of our results to other contexts.
As described above, the human-coded statements for the negative indicators of TMS were significantly related to the Lewis (2003) measure of TMS in the expected direction, whereas the positive human-coded statements were not. Both the positive and the negative indicators of TMS dictionaries were consistently related to Lewis’s (2003) measure in the expected directions. This discrepancy is curious, as the dictionaries were derived from the statements. Thus, we are left with the conclusion that the counts of phrases anticipated to be positively related to TMS are not as strongly related to Lewis’s (2003) measure as the number of words that frequently occur in those phrases. One explanation for this inconsistency is that positive coordination phrases often contain many words, but the coder is forced to code the entire phrase as one meaning. The dictionaries do not have this limitation, in that sentences can contain words from both the positive and negative dictionaries. Thus, the dictionaries may be able to capture complicated TMS language with fewer limitations than human coding.
The lack of a significant relationship between the negative TMS dictionary and performance in Table 4 raises questions of whether our text-based negative TMS dictionary measure should be included in future studies and how researchers should interpret this variable. Empirically, controlling for negative TMS increased the strength of the influence of positive TMS on performance and improved model fit. In addition, the negative TMS predicted the Lewis measure in both the training and test samples. With one exception (see Table 2), the correlations between the positive and negative TMS variables were negative. Thus, our general recommendation would be to continue to include negative TMS until additional research is conducted on whether positive TMS language is a sufficient predictor on its own. Additionally, understanding the relationship of the positive and negative TMS indicators to each other is an important area for future research. If additional researchers accumulate evidence using these measures, we will have a greater understanding of the relationship between positive and negative TMS as well as the relationships between the dictionary-based TMS measures and performance.
The dictionaries, although theoretically derived, are in part a “black box” of words and phrases. These terms were derived from the data; therefore, we do not know specifically why they have predictive ability. We anticipate that some of the words or phrases contained in the individual dictionaries have direct theoretical significance to the category. Words that are not obviously related to one of the TMS indicators may be parts of phrases or language uses that are. Thus, though we are confident that the words contained in the dictionaries are important, it is uncertain why each word is important. This was also one reason we opted not to remove “stop words” (e.g., prepositions, articles, and pronouns): these words could carry relevant meaning that emerged from the human coding even if they are typically less meaningful in ordinary English speech.
All the groups in our sample were newly formed groups. As such, the language they used may be more representative of groups during their initial formation than their ongoing operation. For example, a TMS should lead to better coordination within a group, which may reduce the need for explicit coordination communication. Thus, if a group persists in having high levels of coordination communication, it might have lower levels of TMS than a group that begins with high levels of coordination communication that decrease in amount over time due to their increased TMS. More research should be done on this topic to determine whether the content of communications later in a group’s life relates differently to its TMS than earlier communications.
Methodological limitations could influence various aspects of the reliability of the text-based codes for TMS described in this paper. For example, the inter-rater agreement values for the human coding were lower than desired. This was likely driven in part by the ambiguity in assessing language at the sentence-level, which was a necessary component for the generation of the dictionary. The lower inter-rater agreement could have added additional noise to both the human-coded TMS variables and thus the dictionary variables. Future researchers using human coders may be able to fine-tune the coding scheme and rater training to increase this reliability. In the development of this paper, we avoided using certain machine learning or natural language processing techniques that would limit the external applicability of the measure of TMS developed in this paper. We wanted to limit these types of complications for users of our measure. With the current measure, we have created a dictionary that can easily be applied with off-the-shelf software by researchers or practitioners.
Future Work
This assessment of the indicators of TMS could be used to study and manage groups in new ways. The dictionary-based, unobtrusive measure could be used for diagnostic purposes to catch issues in communication, allowing a manager to intervene before issues become serious. A manager could determine which teams are communicating in a way that indicates high or low TMS. She could then use that information to institute interventions from the literature, such as group training or expertise identification, to improve TMS within those groups. Our results and available dictionaries offer options to managers and may increase the extent to which TMS is considered by practitioners.
If additional researchers validate and build upon our measure, we believe that this type of measure could be extended to assess the dynamics of the development of TMS within a group. TMS are interactive and dynamic, changing as group members acquire knowledge and develop a better understanding of each other’s expertise. Survey measures reflect a static moment in the group’s TMS development and must be administered at multiple points in time if a researcher is interested in changes in a group’s TMS, which might not be feasible or might lead respondents to react negatively to the measures. Using the text-based measures, however, the transcript could be segmented before running the dictionary analysis to provide a sense of whether language related to the positive or negative indicators of TMS is increasing or decreasing over time in the group. This analysis could provide insight into the changes or development of TMS over time within groups or in response to membership or task changes. We believe this is a ripe area for future investigation.
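The segmentation idea described above can be sketched as follows. This is a hypothetical illustration rather than a validated procedure: it splits a chronologically ordered list of messages into a fixed number of consecutive segments and scores each segment with exact single-word matching. The function names, the choice of equal-sized segments, and the example words are all assumptions.

```python
import re


def split_into_segments(messages, n_segments):
    """Split an ordered list of messages into consecutive, near-equal parts."""
    size, extra = divmod(len(messages), n_segments)
    segments, start = [], 0
    for index in range(n_segments):
        end = start + size + (1 if index < extra else 0)
        segments.append(messages[start:end])
        start = end
    return segments


def percent_in_dictionary(messages, dictionary):
    """Percentage of word tokens across `messages` found in `dictionary`."""
    tokens = [t for m in messages for t in re.findall(r"[a-z']+", m.lower())]
    if not tokens:
        return 0.0
    return 100.0 * sum(t in dictionary for t in tokens) / len(tokens)


def dictionary_trajectory(messages, dictionary, n_segments=3):
    """Dictionary score for each consecutive segment of the transcript."""
    return [
        percent_in_dictionary(segment, dictionary)
        for segment in split_into_segments(messages, n_segments)
    ]


if __name__ == "__main__":
    # Hypothetical chat messages and dictionary, for illustration only.
    chat = ["you do wiring", "ok", "ready", "ready ready"]
    print(dictionary_trajectory(chat, {"ready"}, n_segments=2))
```

Segments could equally be defined by elapsed time or by task phase; which choice best reflects TMS development is an open question for the future work described above.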
The method presented here could greatly expand the types of groups that are examined from a TMS perspective, especially groups where conversational data are more available than survey data. Many high-pressure, fast decision-making teams, such as cyber-security, military, or critical care teams, are filmed or audio-recorded for after-action reviews or training purposes (Hu et al., 2016; Mackenzie et al., 2004; Weick, 1990; Zhang et al., 2018). The qualities of a TMS such as high-level specialization, trust in expertise, and implicit coordination between group members are especially important in these settings, but the brief operational period and multi-team-membership nature of these groups may make survey measures of TMS impractical. The cost of video and audio recorders has decreased, and their use has also increased dramatically, especially with increased use of body cameras in many types of groups. Videos could be transcribed—using human or computerized methods—to apply either of our TMS measurement approaches: human-coded or the dictionary-based assessment. Our assessments, especially used in tandem with the rich information available from video, geo-location, social signals, or other multi-modal sources, may provide an accessible means for group researchers to look at both groups’ overall collaboration and the nuances that contribute to the development of TMS.
Conclusion
This paper provides a conceptual-measurement approach to TMS as well as two text analysis methods that are useful in the assessment of TMS. It represents an important first step in assessing TMS unobtrusively. The flexibility of a text-based assessment like the dictionary-based measure could make it possible to assess TMS retrospectively or in types of groups that were previously out of reach for researchers. Our dictionaries provide an easy-to-use assessment of TMS that appears to work as well as traditional survey instruments in predicting the outcomes of TMS. The dictionary of terms is available in Supplemental Appendix C and, in the supplement, as a LIWC dictionary file; R code for easy implementation is provided in Supplemental Appendix D. We hope that this measurement technique will be adopted by other researchers to expand the types of groups studied and to advance our understanding of TMS. We encourage additional researchers to use and build on our measure so that it can be refined and improved and its efficacy in assessing TMS in a wider variety of contexts can be ascertained.
Supplemental Material
Supplemental material for A Text-Based Measure of Transactive Memory System Strength by Jonathan Kush, Brandy Aven and Linda Argote in Small Group Research:
sj-dic-2-sgr-10.1177_10464964231182130
sj-dicx-3-sgr-10.1177_10464964231182130
sj-docx-1-sgr-10.1177_10464964231182130
Footnotes
Acknowledgements
The authors would like to thank Laurie Weingart, Taya Cohen, Kyle Lewis, Dokyun Lee, and Erin Fahrenkopf for their feedback and assistance throughout the development of this paper. The authors also thank their research assistants, Angel Gonzalez, Madeline Mesard, Stephanie Kong, and Karen Kim, for their dedication.
Authors’ Note
Versions of this paper were presented at the 2017 INGroup conference and the 2019 Collective Intelligence conference.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work and the underlying research were supported by the National Science Foundation [Grants 1111750 and 1459963], the Army Research Office [Grant W911NF-16-1-0005], and the Center for Organizational Learning, Innovation and Knowledge at the Tepper School of Business at Carnegie Mellon University.
