Abstract
We present five studies investigating the predictive validity of thin slices of nonverbal behavior (NVB). Predictive validity of thin slices refers to how well behavior slices excerpted from longer video predict other measured variables. Using six NVBs, we compared predictive validity of slices of different lengths with that obtained when coding is based on full-length (5-min) video, investigating the relative predictive validity of 1-min slices as well as of cumulative slices. Results indicate some loss in predictive validity with 1-min slices, but relatively little loss when Slices 1 and 2 were combined for five of the six NVBs. This research establishes an empirical basis on which researchers can decide how much of their recorded corpus needs to be coded for NVB. The results also provide some guidance on effect sizes in power analyses for researchers coding specific behaviors in a thin-slice design.
Researchers commonly analyze the nonverbal behavior (NVB) of participants in recorded social interactions using observers who rate, count, time or otherwise track the occurrence of different behaviors such as smiling or gazing. These behavioral measures are then correlated with other variables to address the particular goals of the given research project. Although many studies demonstrate that rather brief excerpts—“thin slices”—of behavioral coding can be related to other variables of interest, typically researchers resort to guesswork or convenience (such as how long the coding would take, the patience of coders, and financial resources) in deciding how much of their recorded data to code (Fujiwara & Daibo, 2014; Murphy, 2005). The present article begins to fill this gap by analyzing the relative predictive validity of shorter (thinner) versus longer slices of NVB when examining correlations with other social and individual difference variables. Thus, the present research has the potential to provide support for methodological decisions that are made often but without empirical backup. The question, therefore, is “how much is sacrificed by analyzing less than all of the behavioral data in one’s recorded corpus?” The tools of statistical power analysis enable researchers to answer this question with regard to sample size—that is, one can calculate exactly what is lost by reducing sample size—but as yet there is little evidence to help researchers estimate what is lost by abbreviating the quantity of recorded material analyzed per participant. Therefore, researchers analyzing thin slices of behavior cannot evaluate the consequences of having made this methodological choice.
Across various domains, many studies employ thin slices, and many of these demonstrate thin-slice predictive validity, where behavioral thin slices predict other variables (which, in the present article, we call outcome variables). Thin-slice predictive validity in this context is measured by correlating the thin-slice behavior with another study variable. If that correlation is statistically significant or of interesting magnitude, it lends validity to the use of the thin-slice method. For example, in a classic study, Milmoe, Rosenthal, Blane, Chafetz, and Wolf (1967) audio recorded physician–patient interactions and then had judges code 1.5 min of physicians’ voices for anger and anxiety (among other voice qualities). When the thin-slice voice ratings were correlated with treatment effectiveness, physician anxiety rated from the thin slice significantly predicted treatment adherence. In later work, 30-s slices of teachers’ classroom instruction were sufficient to accurately predict end-of-semester teacher evaluations (Ambady & Rosenthal, 1993). Murphy (2007) coded target behaviors such as speaking time and looking at the interaction partner from 1-min slices and found significant correlations with targets’ measured intelligence. All of the above studies, and many others, indicate that behavioral thin slices can predict other study variables at statistically significant levels.
Furthermore, thin slices are used in many standard measures of emotion recognition; for example, the Diagnostic Analysis of Nonverbal Accuracy–Adult Prosody (DANVA2-AP; Baum & Nowicki, 1998) features 2-s adult-voice excerpts and the Geneva Emotion Recognition Test (GERT; Schlegel, Grandjean, & Scherer, 2014) features short audio and video clips (<20 s per clip) of actors’ emotion expressions. The use of thin slices in such measures suggests that short excerpts of expressive behavior are psychometrically useful in measuring social constructs such as emotion recognition.
Historically, early perspectives regarding person perception processes recognized the link between expressive behavior and underlying emotional states or personality traits, although “thin slices” were not necessarily involved. Allport and Vernon (1933) noted the “diagnostic significance of expression (i.e., whether an idiosyncrasy in manner is a valid indicator of some personal complex, prejudice, or interest)” (p. 6; original italics), suggesting that expressive behavior can validly foretell personality dimensions. Later, investigations into expectancy effects and nonverbal communication acknowledged expressive behavior as providing valid information about social variables such as personality traits or emotions (e.g., Burns & Beier, 1973; Christensen & Rosenthal, 1982). As Ambady and Rosenthal (1992) noted, “A great deal of information is communicated even in fleeting glimpses of expressive behavior” (p. 256). The thin-slice method evolved from such traditions of examining expressive behavior in relation to person perception, eventually leading to systematic investigations about how much information (i.e., slice length) is needed to form accurate impressions (e.g., Ambady & Rosenthal, 1992; Borkenau, Mauer, Riemann, Spinath, & Angleitner, 2004; Carney, Colvin, & Hall, 2007; Hall, Andrzejewski, Murphy, Schmid Mast, & Feinstein, 2008).
The thin-slice methodology is grounded in the direct measurement of behavior, which is a substantial component of understanding and predicting personality, in addition to other psychological domains (Furr, 2009; Leikas, Lönnqvist, & Verkasalo, 2012). Indeed, one could argue that examining behavior is a fundamental tenet of psychology as a whole. Recent authors have pleaded for direct behavioral observation as an ideal method for social and personality psychologists (Agnew, Carlston, Graziano, & Kelly, 2010; Back & Egloff, 2009; Baumeister, Vohs, & Funder, 2007; Furr, 2009). Yet, as noted by others, these data can be difficult to collect and involve tremendous time and resources (Back & Vazire, 2015; Baumeister, 2016). Such constraints may even be hindering the pursuit of high quality research (Carcone et al., 2015). Thus, the thin-slice methodology may be one avenue to ease the burden of directly measuring behavior.
Despite its popular use, relatively little systematic research has investigated the reliability and validity of the thin-slice methodology itself, including how long or which slices should be used. Ostensibly, the validity of a thin slice (i.e., the relationship between a slice of behavior and an outcome variable) may depend on the length of the slice itself, as a longer slice may increase the quantity and reliability of information in the slice. This raises the question: Do briefer slices have as much predictive validity as longer slices? Ambady and Rosenthal (1992), who coined the term “thin slice” in their meta-analysis, found that interpersonal outcomes could be predicted from brief excerpts (⩽5 min) of expressive behavior and that the duration of the slice (ranging from <30 s to 300 s across the studies analyzed) was not a meaningful moderator of predictive validity. Hall, Roter, Blanch, and Frankel (2009) conducted a comparative analysis of thin-slice predictive validity in a study of medical students interacting with an actor-patient. Observers’ ratings of rapport during the first minute were often as predictive of other variables (e.g., assessments of the medical student’s respectfulness) as were ratings of that minute plus two more minutes taken from later in the interaction. Such findings illustrate that predictive validity did not necessarily become stronger with longer slice lengths.
In the current article, we present a series of studies examining the predictive validity of thin slices where all comparisons of slices of different lengths were done within studies. This within-study comparison was a feature of the Hall et al. (2009) study, but not of the Ambady and Rosenthal (1992) meta-analysis, where length of slice was a between-studies variable, leaving open the possibility that other study attributes could obscure an accurate understanding of the impact of slice length on predictive validity.
As researchers who study person perception and NVB, we were interested in establishing more efficient ways to code NVB, as such coding is often complex and laborious (e.g., Harrigan, 2013). But efficiency alone is not a strong argument for choosing a method. Therefore, the present within-studies analysis of the impact of slice length on predictive validity was undertaken. Researchers often have little guidance on which or how much behavior to select (e.g., Caperton, Atkins, & Imel, 2018; Hirschmann, Kastner-Koller, Deimann, Schmelzer, & Pietschnig, 2018; Lausberg & Sloetjes, 2016; Murphy et al., 2015). Previously, we demonstrated that thin slices of specific NVBs (most notably, interpersonal gazing) can reliably represent and predict longer behavioral streams (Murphy, 2005; Murphy et al., 2015). Across five behaviors, results suggested that slices can be used interchangeably to represent a total behavior from an interaction, and it may not be necessary to code the entire amount of a particular behavior to adequately represent the relative amount of behavior in an interaction (though there were notable exceptions for the behavior of speaking time; Murphy et al., 2015). We also illustrated how researchers could employ the Spearman–Brown Prophecy Formula to estimate how many slices added together would be needed to reach a desired reliability level (Brown, 1910; Murphy et al., 2015; Spearman, 1910). Ultimately, we suggested that slices after the first minute and before the last minute of an interaction were likely reliable and valid selections (a conclusion also reached by Carney et al., 2007, in the context of accurate judgments of personality).
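To illustrate the Spearman–Brown step mentioned above, the following minimal sketch computes the predicted reliability of k combined slices and, conversely, the number of slices needed to reach a target reliability. The function names and example values (a single-slice reliability of .50 and a target of .80) are hypothetical and are not taken from the studies; the formula assumes that the slices behave like parallel measurements.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Predicted reliability when k comparable slices are combined."""
    return k * r_single / (1 + (k - 1) * r_single)


def slices_needed(r_single: float, r_target: float) -> float:
    """Number of comparable slices needed to reach r_target (may be fractional)."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))


# Hypothetical values for illustration only.
single_slice_reliability = 0.50
target_reliability = 0.80

k_required = slices_needed(single_slice_reliability, target_reliability)
predicted = spearman_brown(single_slice_reliability, k_required)
```

In practice the required number of slices would be rounded up to a whole number, because a fraction of a slice cannot be coded.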
In the present article, we extend the previous work as well as remedy the thin-slice validity gap by systematically analyzing the predictive validity of thin slices across five studies with respect to six NVBs and 33 unique outcome variables, some of which were measured in more than one study. Beyond the practical issues of behavioral coding, investigating the predictive validity of thin slices has implications for whether, or how well, behavioral thin slices represent personality or any other outcome construct, a topic that has received relatively little systematic investigation, despite the wide range of domains in which thin slice predictive validity is assumed.
Below, the five studies are described, and after that their predictive validity results are reported. The present article addresses comparative predictive validity in two ways: individual slice validity and cumulative validity. For assessing individual slice validity, within each study the correlation between a slice and a given outcome variable was compared with the correlation between the whole behavior of interaction (which was 5 min) and the same outcome variable. Comparing how 1-min slices correlate with the outcome variable to how 5-min Totals correlate with the same outcome variable would indicate whether a 1-min slice is sufficient to obtain the same predictive validity as the full 5 min of behavioral coding. Cumulative validity was assessed by comparing correlations between cumulative slice lengths and the outcome variable against the correlation between whole behavior of interaction (Total) and the outcome variable. These results would indicate the relative predictive validity of 2-min, 3-min, and 4-min cumulative slices, compared with the 5-min Total. Because the questions of interest concern only within-studies comparisons, for present purposes the differences between studies in populations and tasks were not considered.
Description of Current Studies
We used five preexisting datasets involving videotaped zero-acquaintance social interactions. In each of the five studies, 5 min of video recorded behavior was coded in successive 1-min slices as well as combined in progressively longer cumulative slices and also as a 5-min Total. Across the studies, six NVBs were measured: gaze, gesture, nod, self-touch, smile, and speaking time; however, not every study measured every behavior (see details below). Within each study, depending on its own original goals, a number of other variables had been measured (outcome variables). These outcome variables were correlated with the NVB measurements. We use the term outcome variable simply to denote variables that are not one of the NVBs, without any implication that such variables are causal outcomes or even temporally consequent to the NVB measurement. The outcome variables were selected according to decision rules outlined in the following section, and they represent varied content and methodologies (self-reports following interactions, personality scales, objective measures of emotion recognition accuracy, dyadic partner reports, sociodemographics, and coding or rating by observers).
In each dataset, only one target person per dyad was selected for analysis to maintain independence. Specifics regarding reliability and behavioral coding are described in the Method of each study below. In all studies, only the target was visible during coding, with one-half of the screen covered to conceal the interaction partner. Coding was conducted with no audio (except for speaking time). On occasion, a target could not be coded for a specific behavior (e.g., the target’s hat obscured their eyes for gaze coding); this accounts for slightly varying Ns between behaviors within a given study. Table 1 provides details of which behaviors and variables were measured in each study. Studies 1 and 2 analyzed targets’ behavior in a laboratory dyadic interaction. Studies 3 and 4 involved targets’ behavior in a structured job interview. Study 5 was based on same-gender dyads where targets role-played roommate conflict discussions. To base results on a common set of targets, listwise deletion for the NVB was used throughout (i.e., a target was not included for a given behavior unless all of the slices were coded for that target). The methods of Studies 1-5 are described below, followed by the collective results across studies.
Table 1. Predictive Validity of Averaged 1-Min Slices of NVB Versus 5-Min Total NVB, for Individual Outcome Variables.
Note. Range refers to the range of correlations across the five slices. All 5-min Total correlations listed in this table were significant at p ⩽ .05 (two-tailed; † signifies .050 < p < .054). (A complete list of measured variables and corresponding 5-min Total correlations by behavior and study appears in Supplementary Table 1.) Correlations are absolute values (see text for explanation). IPT = Interpersonal Perception Task (Costanzo & Archer, 1989); OR = observer rated; PR = partner rated; SR = self-rated; IRI = Interpersonal Reactivity Index (Davis, 1983); DUTCH = Dutch Test for Conflict Handling (De Dreu, Evers, Beersma, Kluwer, & Nauta, 2001); NEO PI-R = NEO Personality Inventory–Revised (Costa & McCrae, 1992); NVB = nonverbal behavior.
Method
Study 1
Target sample and procedures
Unacquainted college student targets were videotaped for 5 min in dyads while engaged in a social interaction; dyads were told they could discuss any topic. Participants were seated at an approximately 45° angle to the camera, facing their interaction partner. For present purposes, only the right-seated target from each dyad was coded to maintain independence of behavior. The coded sample comprised 84 targets (58 women, 26 men; Mage = 18.93, SD = 1.12). Forty-five targets (54%) reported their ethnicity as Caucasian, 14 (17%) reported Hispanic, and the remainder reported Asian, African American, or Other ethnicity, or did not report. Following the interaction, participants were separated and completed a series of measures; those that were used in the present article are listed in Table 1. Further detail on the study can be found in Murphy, Schmid Mast, and Hall (2016).
Behavioral coding and intercoder reliability
Each target was coded for five NVBs: gaze, gestures, nods, self-touch, and smiles. All behaviors were coded as frequency counts except for gaze, which was coded as duration. Four female coders each individually coded a single behavior; coding for gaze was split between two female coders. For each behavior, intercoder reliability was established between the assigned coder and one other coder by correlating their coding for 10 targets’ 1-min slices across the 5-min interaction. (Coders were assigned one behavior, and each served as the reliability checker on one other behavior.) Intercoder reliability was acceptable: gaze: r = .92, gesture: r = .84, nod: r = .70, self-touch: r = .81, and smile: r = .73. During coding, coders watched each interaction in 1-min segments, pausing the tape at the end of each segment to record their data before continuing to the next segment.
Study 2
Target sample and procedures
College student targets (N = 112; 47 males, 65 females) were videotaped for 5 min in dyads while engaged in a series of interactive tasks with a partner; only the right-seated target from each dyad was coded to maintain independence of behavior. Participants were assigned various topics to discuss (e.g., getting acquainted, campus life) and were videotaped at an approximately 45° angle to the camera, facing their interaction partner. Following the interaction, participants were separated and completed a series of measures and ratings. Those that were used in the present article are shown in Table 1. Further detail on the original study can be found in Schmid Mast and Hall (2006, Study 2).
Behavioral coding and intercoder reliability
Each target’s interaction was coded for six NVBs; gaze and speaking time were coded as durations, whereas gesture, nod, self-touch, and smile were coded as frequency counts. Intercoder reliability was acceptable: gaze: r = .97, gesture: r = .83, nod: r = .79, self-touch: r = .67, smile: r = .76, and speaking time: r = .88. Coders watched each interaction in 30-s segments, pausing the tape at the end of each segment to record their coding before continuing to the next segment. In the present article, for consistency across studies, 30-s slices were combined into 1-min slices. See Murphy et al. (2015, Study 1) for further details on coding and reliability procedures.
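The combination of 30-s slices into 1-min slices can be sketched as follows. This is an illustrative assumption about the procedure: adjacent pairs of 30-s values are summed, which is appropriate for both frequency counts and durations. The function name and the example counts are hypothetical.

```python
def combine_to_minutes(half_minute_values):
    """Sum adjacent pairs of 30-s values to form 1-min slices."""
    if len(half_minute_values) % 2 != 0:
        raise ValueError("Expected an even number of 30-s slices.")
    return [
        first + second
        for first, second in zip(half_minute_values[0::2], half_minute_values[1::2])
    ]


# Hypothetical nod counts for ten 30-s slices of one target (5 min in total).
nods_30s = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
nods_1min = combine_to_minutes(nods_30s)
```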
Study 3
Target sample and procedures
College student targets (N = 62; 45 females, 17 males; Mage = 24 years) in the role of an applicant were audio and video recorded during a job interview. The recruiter (not visible in the video frame) always asked the same questions (e.g., motivation for the job, short self-presentation). The job to which the targets were applying was a job similar to sales in that targets had to go on the street and recruit people for a laboratory study. The average length of the job interview was 11 min (range: 6-19 min).
Before the job interview, targets filled in personality scales and, afterward, filled in self-ratings. In addition, five professional recruiters evaluated the recorded job interview videos in terms of hireability on a scale from 0% to 100% (M = 59.73, SD = 19.15; see Frauendorfer, Schmid Mast, Nguyen, & Gatica-Perez, 2014). Variables used in the present article are shown in Table 1.
Behavioral coding and intercoder reliability
Four nonverbal cues were coded: gaze, nod, smile, and speaking time. Gazing was defined as time looking at the recruiter, coded by one external coder. Smiling was coded on a general impression scale from 1 (not smiling at all) to 5 (very much smiling), also by one external coder. Interjudge reliability was assessed with a second coder who coded a subsample of five targets. Intercoder reliability was strong: gaze: r = .95 and smile: r = .95. Nod and speaking time were automatically extracted and coded based on machine-learning algorithms. Nodding was defined as the duration the target nodded while the recruiter was speaking; the automated extraction of nodding was based on a computer algorithm developed and validated for the present video material (see also Frauendorfer et al., 2014; Nguyen, Odobez, & Gatica-Perez, 2012). Speaking time was extracted automatically with a “Microcone,” which is a microphone and software package designed to extract individual speaker speech from group conversations (Dev-Audio, 2012). Speaking time was the sum of all speaking turn lengths of the applicant as extracted from the Microcone.
Study 4
Target sample and procedures
College student targets (N = 98; 33 males, 65 females) were videotaped in a 5-min dyadic interaction (49 dyads total) while engaged in a mock job interview. Participants were randomly assigned to be an applicant for a newspaper reporting job or an interviewer in a role-play job interview. Participants were seated at approximately a 45° angle to the camera, facing their interaction partner. Participants were given short descriptions of the newspaper reporting job and their role of applicant or interviewer. For present purposes, only the randomly assigned applicant from each dyad was coded to maintain independence of behavior. The coded sample comprised 49 targets (31 women, 18 men). No other demographic information was collected from this sample, but this student population typically is 18-24 years old and 70% Caucasian.
Following the interaction, participants were separated and completed a series of measures (those that were used in the present article appear in Table 1). Hireability was judged by two independent coders who both viewed the entire 5-min interview, with the interviewer not visible and with the sound present, and rated (9-point scale) how likely they would be to hire the applicant if they were the interviewer. The two coders’ ratings were averaged to create a hireability composite. Further detail on the original study can be found in Ruben, Hall, and Schmid Mast (2015).
Behavioral coding and intercoder reliability
Each target was coded for five NVBs: gaze, gestures, nods, self-touch, and smiles. One primary coder, different from the coders of hireability, individually coded a single behavior. A second coder watched and coded 10% of the full 5-min interactions. Coder reliability was established by correlating the main coder’s data with that of the independent coder on 10% of the clips for the total of each of the NVBs. The reliability for each behavior was strong: gaze: r = .97, gesture: r = .96, nod: r = .92, self-touch: r = .99, and smile: r = .86. After reliability was established, coders watched each interaction in 30-s segments in random order; as in Study 2, the 30-s slices were combined into 1-min slices.
Study 5
Target sample and procedures
Same-gender dyads (N = 56 dyads, 50% females) were videotaped in a 4- to 10-min role-played roommate conflict discussion. Targets were undergraduate students and were 59% White, 26% Asian, 9% Hispanic, 5% African American, and 1% Other. Targets were recruited separately, and most participants did not know one another before the experiment. Both participants were assigned a role (either A or B), and for the purposes of this study, only targets in role B were analyzed and discussed. Note that, to maintain consistency with other studies included in this project, only videos ranging from 4 to 5 min in length were included, thus reducing the number of participants from 56 targets to 45 targets (24 females). For videos meeting that criterion, there were instances in which coding for a particular target could not be completed in a given segment (e.g., because the target was outside of the camera’s view). Data for that particular behavior for the individual participant were dropped from analysis, and this accounts for slightly varying Ns in analyses. Following the interaction, participants were separated and completed personality scales and self-ratings (those that were used in the present article appear in Table 1; Johnson, 2015).
Behavioral coding and intercoder reliability
Each target was coded for six NVBs: gaze, gestures, nods, self-touch, smiles, and speaking time. All behaviors were coded as frequency counts except for gaze and speaking time, which were coded as duration. Two coders individually coded half of the behaviors in their entirety and served as reliability checkers for one another by coding ten 30-s slices for each of the other half of behaviors. For each behavior, Pearson correlation was used to evaluate intercoder reliability between the assigned coder and one other coder for 10 targets’ 30-s slices across the interaction. As in previous studies, the 30-s slices were combined into 1-min slices. Note that there were 8 to 10 slices depending on the length of the video (tapes longer than 5 min were only coded up to the 5-min mark). Intercoder reliability was strong: gaze: r = .94, gesture: r = .92, nod: r = .85, self-touch: r = .99, smile: r = .96, and speaking time: r = .98.
Analyses
To summarize, the inputs from each study were, for each of several NVBs: five 1-min slices of the NVB, the 5-min Total NVB, and a series of outcome variables. As stated earlier, each study included a variety of outcome variables. For the present analyses, we collected only those outcome variables for which any of the 5-min Total NVB measurements had a statistically significant correlation (p ⩽ .05, two-tailed). Altogether there were 51 significant correlations with the 5-min Total NVB across the five studies, representing 33 unique outcome variables (see Table 1 for descriptions). These correlations between Total NVB and an outcome variable are the standard against which the correlations for individual 1-min slices, and cumulative combinations of 1-min slices, were compared.
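The inputs described above can be sketched in code for a single NVB and a single outcome variable. This is a minimal illustration rather than the analysis code used in the studies: the function names, the smile counts, and the friendliness ratings are hypothetical, the Total is assumed to be the sum of the five 1-min slices, and the significance screening of the Total correlation is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def slice_and_total_validity(slices_by_target, outcome):
    """Return (list of per-slice correlations, Total correlation).

    slices_by_target holds one list of 1-min slice values per target;
    the Total for a target is assumed to be the sum of that target's slices.
    """
    n_slices = len(slices_by_target[0])
    slice_rs = [
        pearson_r([target[i] for target in slices_by_target], outcome)
        for i in range(n_slices)
    ]
    totals = [sum(target) for target in slices_by_target]
    return slice_rs, pearson_r(totals, outcome)


# Hypothetical smile counts (five 1-min slices) for six targets.
smiles = [
    [2, 3, 1, 4, 2],
    [0, 1, 1, 0, 2],
    [5, 4, 6, 3, 5],
    [1, 2, 0, 1, 1],
    [3, 3, 4, 2, 3],
    [4, 2, 5, 4, 1],
]
# Hypothetical outcome variable (e.g., an observer rating) for the same targets.
rated_friendliness = [5.5, 3.0, 8.0, 3.5, 6.0, 6.5]

slice_rs, total_r = slice_and_total_validity(smiles, rated_friendliness)
```

The Total correlation returned here plays the role of the standard against which the five slice correlations are compared.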
Results
Individual Slice Validity
The first set of analyses compared the predictive validity of 1-min slices with the predictive validity of the full 5-min Total, for the six NVBs measured in the five studies. Table 1 shows every outcome variable that had a significant (p ⩽ .05, two-tailed) correlation with the 5-min Total and, for comparison, the average and range of the five 1-min slice correlations. These ranges show considerable variation, meaning that within a given study, for a given NVB and outcome variable, the predictive validity varies greatly for different slices. For parsimony and robustness, we discuss the average of these 1-min slices within the given study. The logic of comparing the averaged 1-min correlations against Total is that there should be no difference in magnitude between the 5-min Total correlations and the 1-min correlations if, on average, a 1-min slice is all one needs to obtain the same predictive validity as the full 5 min of behavioral coding.
Because the sample sizes varied considerably from study to study and the purpose was to appraise the relative magnitudes of correlations for slices versus Total, tests of statistical significance are of less importance here. What matters more is future researchers’ appraisal, in the context of their own studies and using the current results as a starting point, of how much loss in predictive validity they are willing to tolerate if they choose to use thin slices. Therefore, Table 1 does not give p values for tests of the 1-min slices against zero; however, every correlation for the full 5-min Total was significant at p ⩽ .05, two-tailed, as mentioned above. (A complete list of measured variables in each study with the corresponding 5-min Total per behavior appears in Supplementary Table 1.)
In the calculations for Table 1, all correlations were treated in terms of their absolute values because the signs were immaterial (i.e., strength was the issue, not direction). For a hypothetical illustration, the correlation of Total smiling with rated friendliness might be significantly positive (let us say, r = .30), whereas the correlation of Total smiling with rated anxiety might be significantly negative (r = −.30). Averaging those in a summative analysis, as we did as a later step, would mean that they would cancel each other out, and it would appear that smiling had no net relation to the two outcomes. However, using the absolute values of the correlations would yield the correct conclusion that Total smiling relates to those outcomes collectively with a strength of r = |.30|. The only (occasional) exceptions to this rule occurred when, within a study, the 1-min slice correlations were a mix of positive and negative. In such a case, treating the 1-min slices in terms of their absolute values would not accurately represent the inconsistency in the 1-min slice correlations within the given study. In these instances, the 1-min slice correlations retained their original signs when averaged.
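The sign-handling rule described above can be expressed as a short sketch. It rests on several assumptions that the text does not specify: the 1-min slice correlations are combined with a simple arithmetic mean, a correlation of exactly zero is treated as compatible with either sign, and the function name and example correlations are hypothetical.

```python
def average_slice_validity(slice_rs):
    """Average the 1-min slice correlations for one NVB and outcome variable.

    If all slice correlations share a sign, absolute values are averaged
    (strength, not direction, is of interest). If the signs are mixed,
    the original signed values are averaged so that the inconsistency
    across slices is reflected in the result.
    """
    all_nonnegative = all(r >= 0 for r in slice_rs)
    all_nonpositive = all(r <= 0 for r in slice_rs)
    if all_nonnegative or all_nonpositive:
        values = [abs(r) for r in slice_rs]
    else:
        values = list(slice_rs)
    return sum(values) / len(values)


# Hypothetical slice correlations for three NVB-outcome pairs.
consistent_positive = [0.20, 0.30, 0.10, 0.25, 0.15]
consistent_negative = [-0.20, -0.30, -0.10, -0.25, -0.15]
mixed_signs = [0.30, -0.10, 0.20, -0.20, 0.30]

avg_positive = average_slice_validity(consistent_positive)
avg_negative = average_slice_validity(consistent_negative)
avg_mixed = average_slice_validity(mixed_signs)
```

Under this rule, the two consistent examples yield the same average strength, whereas the mixed example yields a smaller value than averaging absolute values would give.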
Table 1 shows that for all instances except one (correlation of self-touch with “displayed flexibility”), the average of the 1-min slice correlations was smaller than the 5-min Total correlation. This means that, nearly always, some amount of predictive validity is lost if only 1 min out of the whole 5 min is coded. Across all behaviors and outcome variables shown in Table 1, the differences between the 5-min Total and averaged 1-min slice correlations ranged from .00 to .23, with a mean difference of r = .09. The size of this difference varied among behaviors (mean difference in r): gaze = .04, gesture = .06, nod = .10, self-touch = .05, smile = .08, and speaking time = .18. Thus, gaze, gesture, nod, self-touch, and smile generally showed the least loss of predictive validity between the 1-min and 5-min Total correlations, whereas speaking time showed the greatest loss. Such results suggest that 1-min slices of gaze, gesture, nod, self-touch, and smile may retain more predictive validity for outcome variables than 1-min slices of speaking time. Because the outcome variables varied so widely, we caution against drawing inferences about whether the 1-min slice predictive validity was better for some types of outcome variables than others.
Table 2 presents the same data in a different fashion. In Table 2, the correlations were averaged across studies and across outcome variables within each NVB while also showing the individual 1-min slices. This display enables us to investigate whether there are systematic differences between slices (first through fifth) as well as between NVBs. The table suggests only a modest amount of variation among these slice and NVB averages, and no striking patterns emerge. However, as with Table 1, it is again clear that the predictive validity of any of the 1-min slices is weaker than the predictive validity of the 5-min Total, with the grand means being r = .26 for 1-min slices and r = .34 for 5-min Total (as shown in the bottom row of Table 2). The last two columns of Table 2 show that this gap existed for all of the NVBs, with the gap being smallest for gaze (−.04 difference) and largest for speaking time (−.13 difference), replicating the Table 1 predictive validity loss findings presented in the previous paragraph.
Table 2. Predictive Validity of Individual 1-Min Slices of NVB, Average of 1-Min Slices of NVB, and 5-Min Total NVB, Across Outcome Variables.
Note. Correlations are absolute values (see text for explanation). NVB = nonverbal behavior.
Correlations were Fisher-z transformed before averaging, then returned to r metric.
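The averaging procedure named in the table note can be illustrated with a minimal sketch. The function name `mean_correlation` and the input values are hypothetical and are not taken from Table 2; the sketch assumes the correlations have already been converted to absolute values, as the note describes.

```python
import math

def mean_correlation(rs):
    """Average correlations via Fisher's z: r -> atanh(r), take the mean, map back with tanh."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical 1-min slice correlations (illustrative only, not values from the article)
slice_rs = [0.20, 0.31, 0.25, 0.23, 0.28]
print(round(mean_correlation(slice_rs), 3))
```

Averaging in the z metric rather than the r metric matters most when the correlations being combined are large or far apart; for modest values such as those reported here, the two approaches give similar results.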
The bottom row of Table 2 also illustrates the loss in predictive validity by slice location, regardless of behavior. Although the 5-min Total correlation showed the strongest predictive validity (mean r = .34) overall, the individual slice results indicate that Slice 2 showed the strongest average predictive validity (mean r = .31), and Slice 4 showed the weakest average predictive validity (mean r = .23). This implies that Slice 2 is a better choice than Slice 4; however, there is only a difference of .08, overall, between the weakest and the strongest average predictive validity correlations, suggesting that for the present cues and outcome variables, there is no strong reason for choosing one slice over another.
Cumulative Slice Validity
Now we turn to the question of the predictive validity of different-length, or cumulative, slices. We have already shown that individual 1-min slices often perform reasonably well, although not as well as the 5-min Totals in predicting the various outcome variables. Now we seek to determine the relative predictive validity of 2-min, 3-min, and 4-min cumulative slices, compared with the 5-min Total.
For these analyses, we simply added slices, cumulatively from the beginning, to form increasingly longer combined slices. Thus, we added together the first two slices, the first three slices, and the first four slices and calculated their correlations with the same outcome variables analyzed in the previous section, relative to the first slice alone and the 5-min Total (both of which were already reported in the previous section).
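As a rough illustration of this procedure, the sketch below sums each participant's 1-min codes from the beginning and correlates each cumulative score with an outcome variable. The data, the function names, and the use of a plain Pearson correlation are assumptions made for the example; they are not the study data or the authors' analysis code.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cumulative_slice_correlations(slices, outcome):
    """Return |r| between the outcome and the sum of the first k slices, for k = 1..number of slices.

    slices: one list per participant, holding that participant's 1-min codes in order.
    """
    n_slices = len(slices[0])
    results = []
    for k in range(1, n_slices + 1):
        cumulative = [sum(participant[:k]) for participant in slices]
        results.append(abs(pearson_r(cumulative, outcome)))
    return results

# Hypothetical codes for six participants across five 1-min slices (illustrative only)
slices = [
    [2, 3, 1, 4, 2],
    [5, 4, 6, 5, 5],
    [1, 0, 2, 1, 1],
    [3, 5, 2, 3, 4],
    [4, 2, 5, 4, 3],
    [0, 1, 1, 2, 0],
]
outcome = [3.1, 4.8, 2.0, 3.9, 4.2, 1.5]

for k, r in enumerate(cumulative_slice_correlations(slices, outcome), start=1):
    print(f"first {k} min: |r| = {r:.2f}")
```

The first entry corresponds to Slice 1 alone and the last to the 5-min Total, so the middle entries play the role of the 2-min, 3-min, and 4-min cumulative slices.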
Table 3 displays the results of these analyses. In the table, the first and last columns of correlations are the same as in Table 2 (first slice and NVB Total, respectively), whereas the middle columns show the predictive validity of the cumulative slices. As shown above, for all six NVBs, the first slice by itself predicts the outcome variable more poorly than a combination of slices does. However, for five of the six NVBs—gaze, gesture, self-touch, smile, and speaking time—the first 2 min (i.e., Slices 1 + 2) afford predictive validity that is on par with, or close to, that of the 5-min Total.
Table 3. Predictive Validity of Cumulative Slices of NVB, Across Outcome Variables.
Note. Correlations are absolute values (see text for explanation). NVB = nonverbal behavior.
Correlations were Fisher-z transformed before combining, then returned to r metric.
For nod, in contrast, comparable predictive validity was only evident when more slices were combined, as shown in Table 3. Slices 1 + 2 were substantially different from Total. But when Slices 1 + 2 + 3 and Slices 1 + 2 + 3 + 4 were used, predictive validity was close to that found for Total.
Discussion
The goal of this research was to ask whether a researcher wanting to use NVB coding to predict another variable is justified in coding less than the full amount of video available. This is a methodological issue, but one that intersects with theoretical considerations. Some behaviors, or some behaviors in certain contexts, may be relatively unsuited to the thin slicing approach, and researchers need to explore the reasons for this because it is this kind of knowledge that will guide best practice. We addressed the methodological question in the following operational terms. Five studies each provided 5-min videos of participants in different kinds of dyadic interactions. Across the five studies, six NVBs were coded: gaze, gesture, nod, self-touch, smile, and speaking time. Each NVB was coded in 1-min excerpts (slices), as well as combined into slices of varying length. In addition, the “Total” NVB was calculated by combining all five individual slices of NVB into a 5-min Total. These NVB measurements were then correlated with other (not NVB) variables to understand how much predictive validity was lost when the NVB slices were shorter.
On average, there was predictive validity loss when individual 1-min slices were used. The overall mean correlation for individual slices was .26, whereas the overall mean correlation for the 5-min Total was .34. There was little evidence that any particular slice had a predictive advantage over any other (Table 2), and there was little evidence of consistent trends. In terms of specific behaviors, speaking time showed the greatest loss in predictive validity, suggesting that speaking time may not be ideal for 1-min slices. It is up to individual researchers to decide the threshold beyond which they would consider a 1-min slice to entail an unacceptable amount of predictive validity loss. Even so, the results present promising evidence that individual slices have good predictive validity under some circumstances (e.g., slices of gaze, gesture, self-touch, and smile), although we acknowledge that the data may provide little guidance in terms of which slices work the best for NVBs and outcome variables other than those we measured.
The cumulative slice analyses produced a clear and optimistic picture. We investigated how many slices would have to be combined, starting at the beginning, to come closest to the correlation between the Total behavior and the outcome variable. For five of the six NVBs, Slices 1 and 2 combined produced a degree of predictive validity that was very close to that found for the 5-min Totals. This means it was not necessary to code all 5 min of the videotaped interactions in these studies. These findings are consistent with earlier work demonstrating that while the earliest slice may be the weakest at representing particular behaviors, coding 1.5 to 2 min from the start of a videotaped interaction reliably represents some behaviors, gaze above all (Murphy et al., 2015). The present results also are consistent with other thin-slice research, demonstrating that predictive validity between personality and specific behavioral outcomes increases with more exposure of targets’ thin slices across task contexts (e.g., Borkenau et al., 2004). We believe these results should give researchers considerable confidence in the thin-slice approach.
At a practical level, these results also provide guidance on power analyses and the selection of slice length, because power analyses rely on predicted effect sizes (Abraham & Russell, 2008). Irrespective of behavior, the results imply that a researcher can practically double statistical power by coding 2-min slices rather than 1-min slices and that coding slices longer than 2 min does not substantially increase power (for all of our NVBs except nodding). In terms of specific behaviors and outcomes, a researcher studying Behavior A and Outcome B could determine from these results that at 80% power they would need an N of 114 when coding 1-min slices or an N of 66 when coding the full 5 min. Given the number of behaviors and outcomes in the present analysis, the potential applications for guidance with power analyses and methodological decisions could be quite broad for researchers interested in the reported constructs, or could at least provide some foundation on which to base such decisions. (But to repeat, whether the presented effect size estimates apply to other NVBs or different outcome variables not measured here is unknown.)
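The sample sizes in this example can be approximated with a short calculation. The sketch below uses the standard Fisher z approximation for the sample size needed to detect a correlation with a two-tailed test at α = .05; the article does not state which procedure or which effect sizes were used, so both the method and the choice of the grand mean correlations (.26 for 1-min slices, .34 for the 5-min Total) as inputs are assumptions. The function name `required_n` is hypothetical.

```python
import math
from statistics import NormalDist

def required_n(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect correlation r (two-tailed), via the Fisher z approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-tailed test
    z_beta = NormalDist().inv_cdf(power)           # quantile corresponding to the desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(required_n(0.26))  # 114 (grand mean for 1-min slices, assumed input)
print(required_n(0.34))  # 66  (grand mean for the 5-min Total, assumed input)
```

Under these assumptions the approximation reproduces the two sample sizes quoted in the text. Exact power routines in dedicated software may differ from this approximation by a participant or so.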
We acknowledge a number of limitations in this research. First, only six NVBs were assessed. Whether such results apply to other discrete behaviors or to “fuzzier” constructs measured in thin slices such as “serious face” or “appears anxious” remains unknown (Epstein, 1983). Second, 5 min of video was operationally defined as the behavior “Total.” Many video corpora would have more than 5 min of codable data. Five min was used in the present research because this was the common denominator in our five studies. Coding of (for example) 15 min might produce larger predictive validity coefficients than coding of 5 min, thus potentially making the gap between 1- or 2-min slices and Total more striking. Yet, it is worth noting that in previous research 1-min slices of five different NVBs correlated very well with their respective 15-min Total coding (Murphy, 2005). Although the Murphy (2005) study did not assess predictive validity, it is strong indirect evidence that the 1-min slices would have the capacity to predict other variables. Furthermore, the fact that in the present study, we found that 2 min of coding was virtually equivalent to 5 min makes it less likely that adding additional minutes beyond five would produce a sharp increase in predictive validity.
Another consideration is that the 33 outcome variables analyzed from the five studies were a convenience sample of variables rather than a systematic a priori selection. Furthermore, the outcome variables were not equally distributed across studies and across NVBs. There is no reason to suspect bias, however, because they were measured in the five studies for purposes unrelated to the present analyses. Furthermore, we used an unbiased criterion for selecting from the five studies the outcome variables to be used in the present analyses: those with correlations with the 5-min Total NVBs that achieved p ⩽ .05, two-tailed. We used the p value as the selection criterion because it made sense to examine the predictive validity of shorter slices only if the Total had a credible relation with the outcome variable. In spite of this restrictive basis for selection, a wide assortment of different types of outcome variables was available for analysis.
A question for future research is what kinds of behaviors, interactions, and circumstances are more and less suitable for thin-slice analysis. The present studies involved relatively unstructured dyadic interactions in which participants are likely to display their NVBs rather predictably across the interaction, at whatever rate is their personal habit. This predictability is demonstrated by the finding that thin slices of NVB from comparable circumstances did correlate well with each other in Murphy et al. (2015). It is possible, however, and indeed likely, that there are interaction settings in which such predictability does not occur, due to the structuring of the interaction and/or task constraints. Also worth considering is that the interactions in the present studies were between unacquainted individuals. Whether the thin-slice predictive validity findings presented here apply to acquainted individuals (e.g., spouses, teacher–student, parent–child) is an open question.
The above limitations may introduce the question of the generalizability of the findings. It is worth noting that the analyses involved data collected from five previously conducted studies on unrelated topics (Frauendorfer et al., 2014; Johnson, 2015; Murphy et al., 2016; Ruben et al., 2015; Schmid Mast & Hall, 2006). These disparate studies were conducted across more than 10 years, involved various laboratories (across continents), and utilized different sample populations; we believe these are strengths for our analyses and increase the generalizability of results. In addition, more than 50 comparative analyses were conducted encompassing six NVBs and more than 30 different outcome variables, in more than 400 participants total. Although not the final word on thin-slice predictive validity, the present studies set the stage for important further research.
All told, both the potential implications and limitations of these findings should be considered by thin-slice researchers as to whether the findings are applicable to their own study designs and behavior measurement. As emphasized in earlier papers, researchers must consider the context and how well the measured behavior matches the construct of interest when measuring the behavior (Blackman & Funder, 1998; Epstein, 1983; Moskowitz, 1986; Murphy et al., 2015). Clearly, more research is needed to determine the generality of the present findings to other research methodologies and settings.
In sum, previous thin-slice studies presumed that the thin-slice method yielded valid results because thin slices of behavior predicted an outcome variable at a statistically significant level. But, in reality, there was little systematic investigation into the predictive validity of the thin-slice methodology as a whole. We believe the present work, in addition to our earlier paper regarding the reliability of thin slices (Murphy et al., 2015), provides strong support that thin slices may be an efficient method to measure expressive behavior as related to other study variables. Despite some limited generalizability, the present analyses were designed to systematically assess the relative predictive validity of thin slices of NVB, and as such, they begin to build an empirical foundation on which NVB researchers can base decisions about behavioral coding.
Supplemental Material
Murphy_OnlineAppendix – Supplemental material for Predictive Validity of Thin-Slice Nonverbal Behavior from Social Interactions
Supplemental material, Murphy_OnlineAppendix for Predictive Validity of Thin-Slice Nonverbal Behavior from Social Interactions by Nora A. Murphy, Judith A. Hall, Mollie A. Ruben, Denise Frauendorfer, Marianne Schmid Mast, Kirsten E. Johnson and Laurent Nguyen in Personality and Social Psychology Bulletin
Supplemental Material
Supplementary_Table_1 – Supplemental material for Predictive Validity of Thin-Slice Nonverbal Behavior from Social Interactions
Supplemental material, Supplementary_Table_1 for Predictive Validity of Thin-Slice Nonverbal Behavior from Social Interactions by Nora A. Murphy, Judith A. Hall, Mollie A. Ruben, Denise Frauendorfer, Marianne Schmid Mast, Kirsten E. Johnson and Laurent Nguyen in Personality and Social Psychology Bulletin
Footnotes
Acknowledgements
The authors appreciate the many research assistants who contributed to behavioral coding in each of the presented studies.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
Supplemental Material
Supplemental material is available online with this article.
