Abstract
Facial emotion recognition (FER) tasks are often digitally altered to vary expression intensity; however, such tasks have unknown psychometric properties. In these studies, an FER task, the Graded Emotional Face Task (GEFT), was developed and validated, which provided an opportunity to examine the psychometric properties of such tasks. Facial expressions were altered to produce five intensity levels for six emotions (e.g., 40% anger). In Study 1, 224 undergraduates viewed subsets of these faces and labeled the expressions. An item selection algorithm was used to maximize internal consistency and balance gender and ethnicity. In Study 2, 219 undergraduates completed the final GEFT and a multimethod battery of validity measures. Finally, in Study 3, 407 undergraduates oversampled for borderline personality disorder (BPD) completed the GEFT and a self-report BPD measure. Broad FER scales (e.g., overall anger) demonstrated evidence of reliability and validity; however, more specific subscales (e.g., 40% anger) had more variable psychometric properties. Notably, ceiling/floor effects appeared to both decrease internal consistency and limit external validity correlations. The findings are discussed from the perspective of measurement issues in the social cognition literature.
Social cognition encompasses the perceptual, interpretive, and problem-solving processes that occur during social interactions (Green et al., 2008). Psychopathology researchers have become increasingly interested in social cognitive processes because they may explain links between psychopathology, poor social functioning, and negative outcomes (Dodge, 1993; Fonagy et al., 2015; Pinkham et al., 2014). In particular, facial emotion perception or recognition is one of the most thoroughly investigated social cognitive processes linked to psychopathology (e.g., Gur & Gur, 2016). Indeed, impaired or biased facial emotion recognition (FER) has been linked to varied forms of psychopathology, such as psychosis, social anxiety, and autism spectrum disorders (Paiva-Silva et al., 2016; Pinkham et al., 2014). Thus, having reliable and valid measures of FER is critical for research to advance.
Despite this, oftentimes little is known about the psychometric properties of the performance-based FER tasks used in psychopathology research (Pinkham et al., 2014, 2016). Amplifying these concerns, researchers frequently modify existing FER measures, making it difficult to generalize psychometric properties and substantive findings across studies. Two common modifications in psychopathology research include (a) blending or “morphing” neutral and emotional faces to create expressions of varied intensity (e.g., subtly angry) and (b) making ad hoc scales to measure theoretically interesting processes (e.g., tendency to mislabel neutral faces as sad; Daros et al., 2014; Paiva-Silva et al., 2016; Staugaard, 2010). The borderline personality disorder (BPD) literature provides a particularly good example of these issues and is the focus of this study (Lazarus et al., 2014). Specifically, this study aims to (a) illustrate the roles of theory, task heterogeneity, and psychometric properties by reviewing the BPD-FER literature, (b) empirically examine common FER task modifications (i.e., morphing and ad hoc scoring), and (c) develop and validate an FER task useful for future research.
BPD and FER
Considerable theory and research indicate that FER errors or biases may be central to social cognition in BPD (Brück et al., 2017; Lazarus et al., 2014; Niedtfeld et al., 2017). Studies in the BPD-FER literature have typically involved participants completing computerized tasks, in which they view a face displaying an emotional expression or a neutral-unemotional face and then label the face, either while viewing it or immediately afterward (see Figure 1 for an example). Such studies have led to the consistent findings that BPD is related to (a) difficulty distinguishing neutral from emotional faces (e.g., mislabeling neutral faces as emotional; Daros et al., 2013) and (b) a bias toward perceiving anger (e.g., mislabeling other expressions; Domes et al., 2008; Fenske et al., 2015; Lazarus et al., 2014; Veague & Hooley, 2014).

Figure 1. GEFT Task Presentation Details.
The theoretical explanations for these findings remain contested. For instance, Meehan and colleagues (2017) proposed that BPD is related to generally enhanced emotion detection (e.g., whether any emotion is present) and specifically enhanced emotion labeling (e.g., which emotion is present) for threat-related facial cues (e.g., anger and fear), as well as related deficits (e.g., inaccuracy with neutral faces). Similarly, Daros and colleagues (2013) have suggested that BPD is related to increased arousal in response to social threat (e.g., angry faces), leading to enhanced recognition of low-intensity expressions and mislabeling neutral expressions; however, diverging from Meehan and colleagues (2017), they also predict that intense emotional expressions elevate arousal to an extent that depletes cognitive resources, leading to poor performance. Despite these promising findings and theories, an impediment to further progress is methodological variability in FER tasks and their unknown psychometric properties.
Specifically, FER tasks in the BPD-FER literature vary substantially in four ways: expression intensity, number/types of expressions, time constraints, and scoring. Researchers typically generate subtle facial emotion expressions via morphing a neutral face and a full-intensity emotion expression, varyingly weighting the contribution of the emotional expression (e.g., 40% neutral–60% anger; see Figure 2 for examples). This procedure is of particular importance, as it has generated many of the findings regarding BPD’s relation to subtle and intense emotions. Despite this, studies vary considerably in their coverage of expression intensity. Given their centrality to the theories above, the Daros and colleagues (2013) and Meehan and colleagues (2017) studies are particularly good examples of this variation, as the former only included tasks with full-intensity emotion and neutral expressions, whereas the latter included subtle emotion expressions (0%–75% intensity) and no full-intensity expressions (e.g., 100%). Given these differences in methodology, it is perhaps only natural that these studies would generate diverging theories.

Figure 2. Face Morphing Procedure.
Beyond expression intensity, the FER tasks used in BPD studies vary in their number of expressions, with some having as few as two emotional expressions (Veague & Hooley, 2014) and others having six (Daros et al., 2013). Aside from making study-to-study comparisons difficult, studies with a limited number of expressions may have lower ecological validity as previous research suggests at least six basic emotion expressions (i.e., anger, disgust, fear, happy, sad, and surprise; Ekman et al., 1987; Van Kleef, 2010). In addition, tasks in the BPD-FER literature vary in stimulus presentation time, with some using brief presentations (e.g., 500 ms; Veague & Hooley, 2014) and others extended presentations (e.g., 5,000 ms; Unoka et al., 2011), again complicating comparisons. Although it is difficult to gauge the precise impact of variation in presentation time, some work suggests that individuals with BPD are negatively affected by time constraints (Dyck et al., 2009) and basic research on social processes indicates that facial expressions are processed quickly (e.g., <1.0 seconds; Sadler et al., 2009; Streit et al., 2003), suggesting that lengthy presentations may have lower ecological validity. Finally, researchers often create ad hoc scales based on theory (e.g., bias for labeling neutral faces as angry; van Dijke et al., 2016), which involve rescoring or combining responses in unique ways. These ad hoc scales are often of critical importance to studies, providing the most direct measure of a hypothesized process.
Exacerbating the problem of between-study variability in tasks, the psychometric properties of the tasks used are often not reported and may be altogether unknown. Frequently, an existing FER task or facial stimuli are manipulated in terms of the above factors (i.e., intensity, breadth of expressions, time, and scoring), but the effects of such manipulations are rarely examined. The consequences are thus largely unknown, but these manipulations may impact reliability and correlations with other measures, and may represent fundamental alterations to the processes being assessed. Understanding the psychometric properties of FER tasks and common manipulations to them may help explain inconsistencies between studies. Unfortunately, despite these methods being common to other psychopathology literatures (Mendlewicz et al., 2005; Staugaard, 2010), the authors could only locate one study examining the reliability of morphed facial emotions. In Cecilione and colleagues (2017), children completed an FER task that included emotion expressions morphed in 10% increments (i.e., 10%–100%). Cecilione and colleagues’ (2017) results showed that retest reliability was low for brief (i.e., 6 stimuli), subtle emotion scales (e.g., 10% sad); however, reliability was adequate for scales that aggregated across intensities. This study serves as a caution to investigators who alter FER tasks and do not examine the reliability of the resulting task. It is necessary to generalize these results (adult samples, tasks with neutral faces, etc.) and examine how variation in task reliability impacts validity. Furthermore, the present review of the BPD-FER literature suggests that existing FER tasks lack the features that researchers need, leading to the ad hoc manipulations discussed above. The field would benefit from a validated FER task that is consistent with theoretical models of BPD and contains scales useful for psychopathology research (e.g., measures of bias).
Present Study
FER bias or inaccuracy is an important process that may underlie interpersonal dysfunction in BPD and other forms of psychopathology. Nonetheless, standard FER tasks are often poorly suited to testing psychopathology theories, leading researchers to modify tasks and create scales ad hoc. Notably, these modified tasks and scales have unknown psychometric properties. In this study, we aimed to create and validate a task with morphed expressions and theory-appropriate scales, while following important principles of test construction (Clark & Watson, 2019). Due to the centrality of facial expressions with varying or “graded” intensity, we henceforth refer to this as the Graded Emotional Face Task (GEFT). In addition to developing a task for future studies, this study also aimed to explore its reliability and validity, as doing so may provide a better understanding of similar tasks used in psychopathology research.
Specifically, we approached the task development process in a manner similar to how questionnaires and interview instruments are developed (e.g., Simms & Watson, 2008): we chose a representative stimulus set, selected the stimuli with the best psychometric characteristics, and examined the internal properties and external correlates of the task. This was accomplished through conducting three studies in large college student samples. Study 1 focused on stimulus selection for the GEFT, Study 2 provided data on the GEFT’s external correlates in a separate sample, and Study 3 examined relations between the GEFT and a measure of BPD symptoms in college students oversampled for BPD features. Regarding external correlates, we hypothesized that other measures of FER or nonverbal theory of mind would demonstrate the strongest convergent correlations, as they assess similar processes through similar methods (e.g., Mitchell & Phillips, 2015). Relatedly, although there is limited research on this topic, we expected convergent and discriminant validity to emerge for tasks with scales assessing recognition of a specific facial expression (e.g., anger recognition ability across tasks should be related). After nonverbal measures are considered, we hypothesized that interpersonal functioning measures that use different assessment methods (i.e., situational judgment task and self-report questionnaire) would show positive, but somewhat smaller correlations with the GEFT (e.g., Moeller et al., 2012). Finally, we hypothesized that a task-based measure of a theoretically unrelated construct (i.e., risk-taking) would demonstrate a minimal or nonsignificant correlation with the GEFT. In the following sections, we report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.
Study 1 Method
Participants and Procedures
Undergraduate students (N = 247) fluent in English were recruited from a psychology department research participant pool. Participants attended 2-hour sessions and were compensated with course credit for participation. Sessions included up to seven participants and took place in a computer laboratory, which contained eight Mac computers (OS X 10.6.8) stationed in privacy carrels. Participants began the session by completing a demographic questionnaire, which indicated that participants were on average 19.4 years old (SD = 4.0), 52% were female, and that 56% of the sample identified as Caucasian, 29% Asian, and 15% African American. In addition, 9% identified as Hispanic. Following completion of the demographic questionnaire, participants completed the GEFT using Inquisit 4.0. Given the number of FER stimuli, participants were randomized into two groups (N1 = 117, N2 = 130) who then completed separate versions of the task. These sample sizes were deemed adequate for the primary purpose of Study 1, which was to select stimuli for the final version of the GEFT through removing stimuli with weak item-total correlations (Cohen, 1992). Outliers were identified and removed on the basis of exceptionally poor performance (−3 SD) on stimuli for which most participants performed very well (i.e., 80% and 100% anger, disgust, sadness, surprise, and happiness), resulting in final samples of 100 and 124 participants.
GEFT Stimulus Pool
The GEFT was created using stimuli from the NimStim face database (Tottenham et al., 2009), which contains 672 color photographs of 43 racially diverse professional actors enacting 16 different facial expressions. The NimStim database of facial stimuli was chosen for a number of reasons: number of actors, variation in model ethnicity and gender, coverage of basic emotional expressions, previous use in morphing studies, prominence in neurobiological research, and frequent usage in the BPD literature (Matzke et al., 2014; Meehan et al., 2017; Smith et al., 2013). For the present task, only closed-mouth versions of anger, disgust, fear, happy, sad, and neutral faces were used, as open-mouth faces are more likely to produce morphing artifacts (Tottenham et al., 2009). Surprised faces with open mouths were included, as the NimStim database did not include closed-mouth surprise faces.
All relevant NimStim stimuli were mapped (i.e., delineated) for important facial landmarks by trained undergraduates in Psychomorph, a Javascript application developed by facial perception researchers at the University of York (Tiddeman et al., 2005). Separate undergraduates reviewed these mappings and rated their agreement with facial landmark placements; disagreements were discussed and resolved at team meetings.
After stimuli were mapped, they were morphed in Psychomorph. A five-stage morphing procedure was used to produce faces with 80% neutral to 20% emotion, 60% to 40%, 40% to 60%, 20% to 80%, and 0% neutral to 100% emotion blends. Examples of this morphing process are provided in Figure 2. This created a total of 31 stimulus categories (i.e., 6 emotions × 5 intensity levels + neutral faces). All morphed faces were reviewed by undergraduate assistants for picture quality (blurriness, naturalness, etc.). The variability in picture quality was relatively minor and, as such, no faces were excluded due to problems with mapping, morphing, or picture quality.
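The category arithmetic above can be checked with a short enumeration. This is only an illustrative sketch: the names `EMOTIONS`, `INTENSITIES`, `categories`, and `blend_weights` are hypothetical, and the code does not reproduce the landmark-based morphing performed in Psychomorph; it only lists the neutral/emotion weights for each stimulus category.

```python
from itertools import product

# Six emotions and five morph levels, as described in the text.
EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise"]
INTENSITIES = [20, 40, 60, 80, 100]  # percent emotion; the remainder is neutral

# 6 emotions x 5 intensity levels, plus one unmorphed neutral category.
categories = list(product(EMOTIONS, INTENSITIES))
categories.append(("neutral", 0))

# Blend weights per morphed category (hypothetical representation).
blend_weights = {
    (emotion, level): {"neutral": 100 - level, "emotion": level}
    for emotion, level in categories
    if emotion != "neutral"
}

print(len(categories))
print(blend_weights[("anger", 40)])
```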
Previous research has examined the reliability of the individual NimStim faces (i.e., agreement between participants; Tottenham et al., 2009). Based on this, we decided to choose faces from each category with kappas >.80. Notably, for fear and surprise faces, this would have resulted in fewer than 10 stimuli; therefore, the kappa requirement was lowered to .60 for these stimuli. All morphed stimuli that were generated from original NimStim faces that met these criteria were included in the item selection sample, resulting in 784 faces.
GEFT Stimulus Presentation
As mentioned above, participants were split into two groups (N1 = 100 and N2 = 124), each of which was assigned a selection of expression-intensity categories that were designed to maximize the similarity to the final task. Thus, all participants were presented with (a) the neutral face stimulus category and (b) both high (100% or 80%) and low (20% or 40%) intensity stimuli for each emotion category. The 60% intensity emotional stimuli were split between the groups (e.g., Group 2 viewed 60% disgust, fear, and happy stimuli).
Each face was preceded by a 1.0-second fixation cross, then displayed for 1.0 seconds, after which the following categorization options were presented without time constraints: anger, disgust, fear, happy, sad, surprise, and neutral. Participants were asked to “describe how the person feels.” Figure 1 provides an illustration of this process. Despite the lack of time constraints on responses, undergraduate research assistants timed participants and provided corrective feedback to participants completing the task more slowly than most (e.g., “please try to complete the next block more quickly, do not over-think the decisions you make in this task”).
Stimulus responses were scored as 0 = “incorrect” or 1 = “correct,” based on whether the correct emotion was identified. For the final scales, scores are presented as proportions, reflecting the total number of correctly identified stimuli divided by the total number of stimuli.
Analyses
The internal consistency of each emotion category was examined, and 10 stimuli were chosen, based on item-total correlations and the gender-ethnicity distribution. For the neutral face scale, 20 stimuli were selected. For each category, gender representation was equal (e.g., five male and five female stimuli) and ethnic representation was held constant (i.e., 60% Caucasian, 20% African American, 10% Hispanic, and 10% Asian). The following algorithm was used to make item selection decisions: (a) choose the stimulus with the highest item-total correlation; (b) once a gender or ethnic group requirement is met, stop selecting stimuli from that group; and (c) if a gender or ethnicity requirement is not met due to the preceding rules, deselect the stimulus with weakest item-total correlation and add a stimulus that would fulfill the requirement.
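Rules (a) and (b) of the selection algorithm can be sketched as a greedy pass over stimuli ranked by item-total correlation. The sketch below is an illustration under stated assumptions, not the code used in the study: the function name `select_items`, the dictionary keys, and the toy `pool` are hypothetical. Rule (c) is not automated here because the text leaves the exact swap unspecified; instead, the function reports any unmet quotas so that a swap can be made by hand.

```python
from collections import Counter

def select_items(items, gender_quota, ethnicity_quota):
    """Greedy selection following rules (a) and (b).

    items: list of dicts with keys 'id', 'r_it' (item-total correlation),
           'gender', and 'ethnicity' (hypothetical field names).
    Returns the chosen ids and a dict of quotas still unmet (input to rule c).
    """
    chosen = []
    gender_count, ethnicity_count = Counter(), Counter()
    # Rule (a): consider stimuli from highest to lowest item-total correlation.
    for item in sorted(items, key=lambda x: x["r_it"], reverse=True):
        g, e = item["gender"], item["ethnicity"]
        # Rule (b): skip stimuli from groups whose quota is already filled.
        if gender_count[g] >= gender_quota.get(g, 0):
            continue
        if ethnicity_count[e] >= ethnicity_quota.get(e, 0):
            continue
        chosen.append(item["id"])
        gender_count[g] += 1
        ethnicity_count[e] += 1
    unmet = {}
    for group, quota in gender_quota.items():
        if gender_count[group] < quota:
            unmet[group] = quota - gender_count[group]
    for group, quota in ethnicity_quota.items():
        if ethnicity_count[group] < quota:
            unmet[group] = quota - ethnicity_count[group]
    return chosen, unmet

# Toy pool (made-up values, for illustration only).
pool = [
    {"id": 1, "r_it": 0.50, "gender": "F", "ethnicity": "A"},
    {"id": 2, "r_it": 0.40, "gender": "F", "ethnicity": "B"},
    {"id": 3, "r_it": 0.30, "gender": "M", "ethnicity": "A"},
    {"id": 4, "r_it": 0.20, "gender": "M", "ethnicity": "B"},
]
chosen, unmet = select_items(pool, {"F": 1, "M": 1}, {"A": 1, "B": 1})
print(chosen, unmet)
```

For a 10-item GEFT scale, the quotas described above would correspond to five stimuli per gender and 6/2/1/1 stimuli across the four ethnic groups.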
Study 1 Results and Discussion
Following outlier removal, the item selection algorithm was applied. In general, the item selection algorithm was successful at maximizing scale homogeneity, as shown by increases in average interitem correlations (i.e., M increase of 60% in average interitem correlation [AIC]; see Table 1). The algorithm also maintained a balance of gender and ethnicity across scales, with the exception of Hispanic fear faces, which were excluded at an early stage because all potential stimuli were open-mouthed.
Table 1. Study 1 Stimulus Pool and Scale Results.
Note. This table provides results from the initial stimulus pool and final scales developed in Study 1. All descriptive statistics are based on the proportion of stimuli correct. AIC = Average Interitem Correlation.
Overall, internal consistency varied considerably across emotion category and intensity level, though it was generally below acceptable values (e.g., α = .70), with the average alpha across all scales being .64. The average internal consistency within emotion categories ranged from .53 (AIC = .10) for sadness to .72 (AIC = .21) for happiness; however, the effect of intensity was more complicated. The 20% intensity scales averaged an alpha of .56 (AIC = .13), but the other intensities differed from one another negligibly (i.e., α ranged .65–.67, AIC ranged .16–.18). An inspection of the scales suggests that the effect of intensity on internal consistency differed by emotion category, such that for some emotions, low-intensity stimuli (20%–40%) formed more homogeneous scales (e.g., 40% happiness, AIC = .29) and for other emotions, high-intensity stimuli performed better (e.g., 80% fear, AIC = .22). Examining means, standard deviations, minimum scores, and maximum scores suggested that ceiling and floor effects may exist in scales that performed particularly poorly in regard to internal consistency (e.g., 20% fear, M percent correctly identified = 5%).
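For readers who wish to compute these indices on their own data, the following sketch shows one common way to obtain Cronbach's alpha and the AIC from a participants-by-items matrix of 0/1 scores. The function names and the toy matrix `demo` are hypothetical and are not taken from the study materials. The third function applies the standard Spearman-Brown (standardized alpha) formula, which with 10 items and an AIC of .10 gives a value near .53, close to the average reported above for sadness; this correspondence is approximate because the reported values are averages across scales.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants x items matrix of item scores."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1).sum()
    total_variance = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def average_interitem_correlation(scores):
    """Mean of the off-diagonal item correlations (AIC).

    Note: an item with zero variance (e.g., a complete floor or ceiling
    effect) yields undefined correlations, so the result would be NaN.
    """
    x = np.asarray(scores, dtype=float)
    r = np.corrcoef(x, rowvar=False)
    return r[np.triu_indices_from(r, k=1)].mean()

def standardized_alpha(k, mean_r):
    """Spearman-Brown formula: alpha implied by k items with mean correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Toy data (made up): two items answered identically by four participants.
demo = [[1, 1], [0, 0], [1, 1], [0, 0]]
print(cronbach_alpha(demo), average_interitem_correlation(demo))
print(round(standardized_alpha(10, 0.10), 3))
```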
The low internal consistency of 10-item scales was somewhat surprising, given that instruments with a similar number of stimuli (e.g., JACBART’s 8 per scale; Matsumoto et al., 2000) have performed better in past analyses, though it is consistent with Cecilione and colleagues (2017). There are several possible explanations for this discrepancy. First, the present task’s use of intensity-specific scales, although in line with approaches used in the psychopathology literature (e.g., Meehan et al., 2017), may have suppressed interitem correlations by grouping items with low variances together (e.g., in case of floor effects). More generally, scales with ceiling or floor effects (e.g., 20% fear, M = 5%) will have reduced variability and thus lower reliability (Cooper et al., 2017; Spearman, 1904). To partially illustrate this point in the present data, scale variability (i.e., SD) correlated .72 with scale internal consistency. Second, and related, it may be the case that the procedure of morphing changes scale characteristics relative to simpler tasks (e.g., Matsumoto et al., 2000), such that there is more error variance due to nonsystematic response processes (e.g., guessing). Third, many studies do not include a neutral category, which may also alter the response process of individuals taking the test. Finally, participants were not exposed to the full task, which may have different psychometric properties. In addition to replicating these effects in a sample that received the full GEFT, Study 2 aimed to build upon these findings through examining (a) the internal consistency of overall emotion scales (i.e., across intensities) and (b) the GEFT’s external correlates.
Study 2 Method
Participants and Procedures
Recruitment procedures were identical to Study 1; however, we determined that, to test hypotheses regarding differences between external validity correlations (e.g., r differences = .18), a sample of at least 175 would be necessary (Cohen, 1992). Participants (N = 219) began the session by completing a demographic questionnaire, which indicated that participants were on average 19.1 years old (SD = 1.6), 49% were female, and that 48% of the sample identified as Caucasian, 38% Asian, and 14% African American. In addition, 10% identified as Hispanic. Following completion of the demographic questionnaire, participants completed in randomized order (a) the GEFT and (b) a battery of convergent and discriminant validity tasks. The GEFT was presented in Inquisit 4.0. The validity tasks began with the Situational Test of Emotion Understanding (STEU) and the Inventory of Interpersonal Problems-Short Circumplex (IIP-SC), in survey format in Qualtrics, and then switched to a battery of randomly ordered tasks presented in Inquisit 4.0 (i.e., Balloon Analogue Risk Task [BART], Emotion Recognition-40 [ER-40], and the Reading the Mind in the Eyes test [RMET]). Outliers were identified and removed on the basis of exceptionally poor performance (−3 SD) on GEFT stimuli that most participants performed very well on (i.e., 80% and 100% anger, disgust, sadness, surprise, and happiness), resulting in a final sample size of 200.
Measures
GEFT
The final task from Study 1 was used; however, several stimuli were added. As noted in the Study 1 discussion, there were no Hispanic fear faces included in the first study because there were no closed-mouth Hispanic fear faces in the NimStim database. Thus, two open-mouth Hispanic fear faces (i.e., one male, one female) were morphed and added for this data collection. As in Study 1, each face was preceded by a fixation cross for 1.0 seconds, the face itself was displayed for 1.0 seconds, and seven response options were then presented. Again, participants were encouraged by research assistants to complete the task quickly. Stimulus responses were scored as 0 = “incorrect” or 1 = “correct,” based on whether the correct emotion was identified. For the final scales, scores are presented as proportions, reflecting the total number of correctly identified stimuli divided by the total number of stimuli; this was done at the level of emotion subscales (e.g., 20% anger) and for overall emotions (i.e., anger, totaled across 20%–100%).
In addition, a number of alternative indices were computed, based on previous research on BPD (e.g., Meehan et al., 2017) and questions typically pursued by psychopathology researchers that pertain to social cognitive bias: (a) emotion detection sensitivity, a count of the number of times a participant correctly identifies a stimulus as displaying an emotion, with no penalty for ascribing emotions to neutral stimuli or for choosing the wrong emotion (e.g., labeling a sad face as angry would still be scored correct), (b) emotion mislabeling bias for each emotion category, a count of the number of times the category is incorrectly applied to another emotion (e.g., labeling a sad face as angry would contribute to the anger bias score, but labeling a neutral face as angry would not), and (c) neutral mislabeling bias for each emotion category, a count of the number of times a particular emotion category is applied to a face that is actually neutral (e.g., mislabeling a neutral face as angry would contribute to the anger neutral mislabeling bias score, but not mislabeling a sad face as angry). Descriptive statistics and information on reliability are presented in Table 2.
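The three alternative indices can be expressed compactly in code. The following is a minimal sketch based only on the verbal definitions above; the function name `score_indices`, the label strings, and the toy `trials` list are hypothetical, and the sketch returns raw counts rather than the proportions reported in the tables.

```python
EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise"]

def score_indices(trials):
    """Compute the alternative indices from (true_label, response) pairs.

    Labels are one of EMOTIONS or 'neutral' (hypothetical coding).
    Returns (detection, mislabeling_bias, neutral_bias) as counts.
    """
    detection = 0
    mislabeling_bias = {e: 0 for e in EMOTIONS}
    neutral_bias = {e: 0 for e in EMOTIONS}
    for true_label, response in trials:
        # (a) Emotional face labeled with any emotion, right or wrong.
        if true_label != "neutral" and response != "neutral":
            detection += 1
        if response in EMOTIONS and response != true_label:
            if true_label == "neutral":
                # (c) Emotion applied to a face that is actually neutral.
                neutral_bias[response] += 1
            else:
                # (b) Emotion incorrectly applied to a different emotion.
                mislabeling_bias[response] += 1
    return detection, mislabeling_bias, neutral_bias

# Toy responses (made up) for one participant.
trials = [
    ("sad", "anger"),        # counts toward detection and anger mislabeling bias
    ("neutral", "anger"),    # counts toward anger neutral mislabeling bias only
    ("anger", "anger"),      # counts toward detection
    ("fear", "neutral"),     # missed detection
    ("neutral", "neutral"),  # correct rejection; contributes to no index
]
detection, mislabeling_bias, neutral_bias = score_indices(trials)
print(detection, mislabeling_bias["anger"], neutral_bias["anger"])
```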
Table 2. Study 2 GEFT Internal Consistency and Descriptive Statistics.
Note. Data are presented from Study 2. All descriptive statistics are based on the proportion correct (or incorrect, for bias scales). The “0%” and “100%” columns indicate the number of participants who scored at the floor and ceiling of the scale, respectively. AIC = Average Interitem Correlation. “Emot. detect.” = emotion detection sensitivity, or the tendency to accurately perceive the presence of an emotional expression. “Bias” scales reflect tendencies to incorrectly apply that emotion to stimuli representing other emotions. “N-. . .” scales reflect tendencies to incorrectly apply that emotion to neutral stimuli. GEFT = Graded Emotional Face Task.
ER-40 (Carter et al., 2009; Kohler et al., 2004). The ER-40 is a commonly used facial affect recognition task that consists of 40 colored photographs of neutral, fearful, angry, sad, and happy faces. There are eight photographs per category, and each emotion category includes four high-intensity and four low-intensity expression photographs. Stimuli are balanced across categories for gender, age, and ethnicity. Research indicates that the total score is internally consistent, has good retest reliability, and is relatively free from floor/ceiling effects (Pinkham et al., 2016). In this study, the ER-40 was included as a measure of convergent validity, as it is similar to the GEFT. The ER-40 items were scored dichotomously (incorrect vs. correct) and used to generate the following scales: total score, neutral, four (fearful, angry, sad, and happy) high-intensity scales, and four low-intensity scales. To allow clear comparisons between the measures, the ER-40 was presented using the same format as the GEFT (e.g., faces displayed for 1.0 seconds), with the exception that there were only five response options (neutral, fear, anger, sadness, and happiness). In this study, the internal consistency for the ER-40 total score was acceptable (α = .74), but was more variable for the emotion-specific subscales (α = .48–.76; see Table 3). The ER-40 total score had a mean of 31.68 correct responses (SD = 4.48; M % = 79%, SD % = 11%), similar to a sample of BPD participants completing the task without time constraints (M = 82.24%, SD = 6.61%).
Table 3. Study 2 Correlations Between Emotion Scales and Validation Measures.
Note. Data are from Study 2. Correlations directly relevant to hypotheses are underlined and the highest correlation in each column is in bold. Correlations ≥.14 are significant at p < .05. AIC = Average Interitem Correlation. ER-40 = Emotion Recognition-40; RMET = Reading the Mind in the Eyes test; STEU = Situational Test of Emotion Understanding; IIP-SC = Inventory of Interpersonal Problems-Short Circumplex; BART = Balloon Analogue Risk Task.
RMET (Baron-Cohen et al., 2001). The RMET measures the capacity to identify others’ mental states (i.e., theory of mind), based on the eye region of the face. Stimuli are 36 black and white photographs of individuals’ eye region, presented with four response options describing mental states of varied complexity (upset, playful, convinced, etc.). These stimuli can be scored to create a total “mind reading” score. Previous research indicates that the RMET is sensitive to social cognitive dysfunction in BPD (Richman & Unoka, 2015). Thus, similar to the ER-40, the RMET was included in this study as a measure of convergent validity and was presented using a similar format (e.g., stimuli displayed for 1.0 seconds). In this study, the internal consistency for the RMET total score was somewhat low (α = .69).
STEU (MacCann & Roberts, 2008). The STEU is a 42-item situational judgment test of emotional intelligence, which measures a social cognitive ability that has a small positive relation with FER (MacCann et al., 2011). The task involves reading single-sentence descriptions of situations and selecting between four options to indicate how a person in the described situation might feel. Evidence for the STEU’s reliability and validity has been provided previously (MacCann & Roberts, 2008). The STEU was included for both convergent and discriminant validity purposes, as it was predicted to relate positively to the GEFT, but less so than the ER-40 or RMET. In this study, the internal consistency for the STEU total score was acceptable (α = .73).
IIP-SC (Soldz et al., 1995). The IIP-SC is a 32-item self-report questionnaire assessment of interpersonal behaviors that individuals perform excessively (e.g., “I argue with other people too much”) or find difficult to do (e.g., “It is hard for me to feel close to other people”), which are responded to on a 0 (not at all) to 4 (extremely) scale. Items form eight four-item scales (domineering, vindictive, cold, socially avoidant, nonassertive, exploitable, overly nurturant, and intrusive) that can be further used to create scores for interpersonal distress, agency (e.g., dominance vs. submissiveness), and communion (warmth vs. coldness; Wilson et al., 2013). Research in previous samples indicates adequate internal consistency and adherence to circumplex structure (Alden et al., 1990; Monsen et al., 2006). Previous research has demonstrated that both agency and communion have negative relations to FER accuracy (Moeller et al., 2011, 2012). Similar to the STEU, the IIP-SC was included as a measure of both convergent and discriminant validity because it was expected to correlate significantly with the GEFT, though more weakly so than the RMET or ER-40. In this study, the internal consistency for the IIP-SC score was acceptable (e.g., M α = .76), but was more variable for specific subscales (α = .66–.87; see Table 3). The mean total score in this sample was 31.90 (SD = 18.01), somewhat lower than what is seen in general psychiatric outpatient samples (M = 40.30, SD = 21.24; Williams & Simms, 2016).
BART (Pleskac et al., 2008). The automatic BART is a brief version of the classic BART (Lejuez et al., 2002), a behavioral task measuring the propensity for risk-taking behavior, which demonstrates considerable validity evidence (Fein & Chang, 2008; Hunt et al., 2005; Pleskac & Wershbale, 2014). In this task, participants pump air into balloons for points (i.e., 1 cent per pump), but must take care to not over pump and make the balloon explode; if the balloon explodes, they lose all their points. The number of times participants are willing to pump the balloon serves as a measure of their risk-taking propensity. The BART was included as a measure of discriminant validity, as it was expected to be minimally related or unrelated to the GEFT.
Analyses
Internal consistency (Cronbach’s α, average interitem correlation, and omega hierarchical; Revelle & Zinbarg, 2009), descriptive statistics, and distributional properties (e.g., skew) were examined for FER task scales, including the alternative social cognitive bias indices. Following this, correlations were computed between the FER task scales and the validity measures. To more directly test hypotheses regarding convergent and discriminant validity, Fisher’s r-to-z transformation (Silver & Dunlap, 1987) was used to test differences in magnitude between correlations, where correlations with the RMET and ER-40 were expected to be significantly (p < .05) larger than those with the STEU and IIP-SC, which were in turn expected to be larger than those with the BART.
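The comparison of correlation magnitudes described above can be illustrated with a short sketch. The specific variant of the Fisher r-to-z test used in these analyses is not stated here, so the code below is an assumption: it implements one common choice for two correlations that share a variable in the same sample (here, the GEFT), namely the test of Meng, Rosenthal, and Rubin (1992). The function name and all input values other than the Study 2 sample size are hypothetical.

```python
import math


def compare_dependent_correlations(r1, r2, r12, n):
    """Compare two correlations that share one variable in the same sample.

    r1, r2 : correlations of the shared variable with measures 1 and 2
    r12    : correlation between measures 1 and 2
    n      : sample size
    Returns (z, two-sided p). Assumed variant: Meng, Rosenthal, & Rubin (1992).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher r-to-z transformation
    mean_r_sq = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - mean_r_sq)), 1.0)  # f is capped at 1
    h = (1 - f * mean_r_sq) / (1 - mean_r_sq)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p from the normal distribution
    return z, p


# Hypothetical correlations; n = 219 matches the Study 2 sample size.
z, p = compare_dependent_correlations(r1=0.50, r2=0.20, r12=0.30, n=219)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because the test requires the correlation between the two validity measures (r12), it cannot be reproduced from the correlations with the GEFT alone.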
Study 2 Results and Discussion
Relative to Study 1, similar but slightly lower results were obtained for the internal consistency of subscales (e.g., M α decrease = .04, see Table 2). Notable exceptions to this similarity include fear—20% and surprise—60%, which showed decreases in alpha of at least .15; excluding these outliers, the average alpha decrease was .02, from Study 1 to Study 2. Regarding overall internal consistency, in line with Study 1, the emotion subscales had generally inadequate internal consistency (i.e., M α = .60), whereas good internal consistency was observed for the neutral scale. Similar to Study 1, an examination of distributional statistics suggests that floor and ceiling effects may be an important cause of low internal consistency in some scales (e.g., see “0%” and “100%” columns in Table 2). Indeed, the number of participants at the floor for an emotion subscale correlated −.53 with subscale variability (SD), which in turn was strongly correlated with internal consistency (r = −.84).
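For readers less familiar with the internal consistency statistic reported here, the following is a minimal sketch of Cronbach’s α computed from a participants-by-items matrix of accuracy scores (1 = correct, 0 = incorrect). The function name and the toy data are hypothetical and are not drawn from the GEFT.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of participants, each a list of item scores.

    Uses alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)).
    Undefined (division by zero) if every participant has the same total score,
    which is the limiting case of a floor or ceiling effect.
    """
    k = len(scores[0])                       # number of items
    items = list(zip(*scores))               # transpose to items-by-participants
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical data: six participants responding to four faces.
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(toy_scores), 2))
```

The formula makes the role of variability explicit: when most participants sit at 0% or 100%, the variance of the total score shrinks toward the sum of the item variances, and α falls.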
Aside from replicating emotion subscale and the neutral scale properties, this study examined additional scales. First, emotion-level scales that combined stimuli from intensity-based subscales were examined. These scales showed good internal consistency (α = .82–.89) and appeared to be free from ceiling or floor effects. Similar to the subscale results, however, the fear scale was an outlier; on average, participants correctly identified only 36% of all fearful faces. In addition, the FER total score demonstrated excellent internal consistency and did not show ceiling or floor effects. Considering the alternative indices, emotion detection sensitivity and emotion labeling bias scales all demonstrated adequate internal consistency; however, likely due to floor effects, no neutral labeling bias scales for specific emotions demonstrated adequate internal consistency. In looking at distributional statistics, it is clear that neutral labeling bias scales all had strong floor effects; on each scale, more than 100 participants were at 0% (i.e., did not incorrectly label any neutral faces with that specific emotion). Two likely causes are (a) participants’ relative accuracy in identifying neutral faces and (b) the low number of neutral stimuli (i.e., 20, compared with 250 for emotion labeling bias scales). As a follow-up analysis, a more general negative emotion mislabeling scale was examined, as this may be of use to BPD and psychopathology researchers. This scale was based on mislabeling neutral faces with any negative emotion (i.e., anger, disgust, fear, or sad) and demonstrated adequate reliability, with a less pronounced floor effect (i.e., 59 participants at 0%).
Results for validity analyses are displayed in Table 3. A strong correlation was observed between the GEFT and ER-40 total scores (r = .51), moderate correlations were observed between the GEFT and the RMET (r = .39) and STEU (r = .43) total scores, a weak but significant correlation was observed for the IIP-SC elevation score (r = −.15), and the correlation with the BART risk-taking index was nonsignificant (i.e., r = −.01, p > .05). In line with predictions, the correlations between the GEFT and nonverbal social cognitive measures (i.e., RMET and ER-40 total scores) were stronger than those observed for the IIP-SC elevation score (ER-40, z = 4.55, p < .001; RMET, z = 2.79, p < .01); however, these tasks did not differ from the STEU, which was substantially correlated with the GEFT’s total score, despite using a situational judgment test format (ER-40, z = 1.16, ns; RMET, z = −0.74, ns). In addition, although it was confirmed that the GEFT’s total score correlated more strongly with the STEU than with a behavior-based risk-taking task (STEU-BART, z = 4.70, p < .001), there was no difference between the BART and IIP-SC elevation (z = 1.41, ns). Thus, the hypotheses regarding convergent and discriminant validity were only partially confirmed.
Examining the validity correlations between the GEFT and ER-40 provided further surprising results at the level of specific emotions. For emotions with counterparts across measures (e.g., GEFT anger and ER-40 anger), only GEFT fear, sad, and neutral scales demonstrated their highest correlation with their respective ER-40 scale; GEFT anger and happy scales correlated most strongly with the ER-40 fear and total scores. The internal consistency column of Table 3 suggests a possible explanation for these findings. Although most validity measures had adequate internal consistency, the sad, happy, and anger scales of the ER-40 all had alphas below .60. In contrast, ER-40 fear had an alpha of .76, which was higher than even the ER-40 total score. In particular, considering the low convergent correlation between ER-40 and GEFT anger scales (r = .23), it is worth noting that correcting (i.e., disattenuating) the correlation for unreliability suggests that perfectly reliable measures would produce a correlation of .35. Thus, although the internal consistency of all GEFT emotion-level scales was high, their poor convergent and discriminant validity correlations may be explained by the low internal consistency of ER-40 scales. Despite this, other potential explanations for this finding exist, including differences in stimuli (e.g., ER-40’s three-dimensional [3D] modeling approach) and the number of emotion categories represented (i.e., 4 vs. 6).
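The correction for unreliability mentioned above follows Spearman’s (1904) classic formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below illustrates the arithmetic. The observed correlation of .23 is taken from the text, but the two reliability values are hypothetical placeholders chosen for illustration and are not the alphas reported in Table 3.

```python
import math


def disattenuate(r_observed, rel_x, rel_y):
    """Estimate the correlation between two measures if both were perfectly reliable.

    r_observed   : observed correlation between the two scales
    rel_x, rel_y : reliability estimates (e.g., Cronbach's alpha) for each scale
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)


# r = .23 is from the text; the reliabilities .85 and .50 are hypothetical.
corrected = disattenuate(0.23, rel_x=0.85, rel_y=0.50)
print(round(corrected, 2))  # prints 0.35
```

The lower either reliability, the larger the gap between the observed and corrected values, which is why a scale with an alpha near .50 can substantially mask a convergent relation.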
Looking more broadly at the criterion validity of the GEFT, several other patterns emerged. For instance, the disgust scale demonstrated some of the strongest criterion validity correlations (e.g., r with STEU = .43), which is notable given that the inclusion of disgust faces is inconsistent across FER measures (e.g., ER-40). In contrast, the GEFT neutral scale showed a moderate-strong correlation with ER-40 neutral (i.e., r = .49) and no correlations above .15 with any other measure. With the exception of surprise, GEFT emotion-level scales demonstrated a moderate or greater correlation (i.e., >.30) with at least one relevant social functioning scale and a nonsignificant correlation with the BART risk-taking index. Another notable finding is the GEFT total score’s −.30 correlation with cold-dominant (e.g., vindictive) interpersonal problems and the consistency of this relation across emotion-level scales, replicating previous work (e.g., Moeller et al., 2011, 2012).
Criterion validity results also are provided for emotion subscales and alternative indices in Table 4; neutral mislabeling biases for specific emotions were omitted, given (a) their poor reliability and (b) limited evidence of criterion validity (i.e., M |r| = .08, Max |r| = .22). Thirteen (43%) emotion subscales had at least one moderate correlation (i.e., ≥.30), another 12 (40%) had small significant criterion validity correlations, and the remaining five scales had no significant criterion validity correlations. Given that emotion subscales varied considerably in terms of internal consistency and range restriction, these differences were examined more closely. Overall, there were moderate correlations between emotion subscale internal consistency (α) and both the size of its average external correlate (r = .48) and largest external correlate (r = .49). This trend can be further demonstrated by examining Figure 3, which averages these results for various ranges of alpha and shows an increase in external correlation magnitude for more internally consistent scales. These findings are an empirical demonstration of the fact that the reliabilities of two variables set an upper limit on the magnitude of correlation between them (Parsons et al., 2019; Spearman, 1904). Beyond emotion subscales, the alternative indices also yielded an interesting pattern: low emotion detection sensitivity was related to high agency and low communion interpersonal problems (e.g., cold-dominance) and only weakly related to task measures of social cognition (i.e., ER-40, RMET, and STEU), whereas emotion mislabeling was moderately related to task measures of social cognition and less consistently related to self-reported interpersonal problems. This finding is consistent with Meehan and colleagues’ (2017) study of BPD, in which they differentiate between error patterns based on these processes.
Table 4. Study 2 Criterion Validity of Emotion Subscales and Alternative Indices.
Note. Data are from Study 2. Correlations ≥.14 are significant at p < .05. The largest correlation for each scale (row) is in bold. N-negative = mislabeling neutral faces with any negative emotion. ER-40 = Emotion Recognition-40; RMET = Reading the Mind in the Eyes test; STEU = Situational Test of Emotion Understanding; IIP-SC = Inventory of Interpersonal Problems-Short Circumplex; BART = Balloon Analogue Risk Task.

Figure 3. The Relation Between GEFT Internal Consistency and Validity.
As a whole, these results provided support for the reliability and validity of the GEFT, particularly for the total score and emotion-level scales, which showed generally good internal consistency and criterion validity correlations. Although the emotion mislabeling indices also demonstrated good psychometric properties, emotion subscales showed inconsistent psychometric properties, and neutral mislabeling scales performed very poorly. Moreover, there was evidence that scale properties such as internal consistency affected the external correlations of scales. These results raise concerns regarding the reliability and validity of tasks that use similar morphing paradigms to compute such scales and indices, as well as potential discrepancies between tasks that use differing numbers of emotion categories. The potential implications for psychopathology are further explored in Study 3, which examined correlations between the GEFT and a measure of BPD traits.
Study 3 Method
Participants and Procedures
Participants were recruited from an undergraduate psychology participant pool, in which students participate in studies for course credit. At the beginning of each semester of data collection, all students in the subject pool completed the McLean Screening Instrument for BPD (MSI-BPD). Previous research has identified MSI-BPD scores of 7 and above to be indicative of a likely BPD diagnosis (Melartin et al., 2009; Zanarini et al., 2003). Participants with moderate BPD features (MSI-BPD = 3–6) and high BPD features (MSI-BPD ≥7) were oversampled by offering additional study slots to participants in these groups and not to those with low BPD features (MSI-BPD <3). This resulted in a sample (N = 407) that was 21% high BPD, 44% moderate BPD, and 35% low BPD features, thus reflecting a higher degree of BPD features than a standard undergraduate sample (i.e., high = 14%, moderate = 36%, and low = 50% of full subject pool [N = 1,146]). The sample size for this study is based on requirements for structural equation modeling (Wolf et al., 2013) and is more than adequate for detecting small-to-moderate correlations. These participants were on average 19.16 years of age (SD = 1.82) and 63% identified as female. Participants described their race as follows: 51% White or Caucasian, 31% Asian, 18% Black or African American, and less than 1% described themselves as Native American, Hawaiian, or Pacific Islander. In addition, 11% of participants reported that they were Latinx.
Measures
Borderline Personality Questionnaire (BPQ)
The BPQ is a questionnaire-based assessment of BPD, consisting of 80 statements rated using a true/false format (Poreh et al., 2006), which form 7- to 10-item scales. The BPQ has nine scales, each one corresponding to one of the nine Diagnostic and Statistical Manual of Mental Disorders (DSM) BPD criteria: Abandonment (BPD1), Relationships (BPD2), Self-Image (BPD3), Impulsivity (BPD4), Suicide/Self-Mutilation (BPD5), Affective Instability (BPD6), Emptiness (BPD7), Intense Anger (BPD8), and Quasi-Psychotic States (BPD9). In addition, these scales can be summed to create a total score, representing overall severity of BPD criteria. The BPQ has demonstrated convergence with other commonly used measures of BPD (Poreh et al., 2006), related constructs (e.g., depression; Fonseca-Pedrero et al., 2011), and relations to social cognitive tasks (Snowden et al., 2013). In this study, the internal consistency of BPQ scales was generally adequate (Mdn α = .78, range = .64–.86). The total score had a mean of 26.68 (SD = 11.78), which is well below the cut-off score of 56 recommended for indicating a BPD diagnosis (Chanen et al., 2008), though it is slightly higher than typically found in community samples (M = 21.06, SD = 12.91; Poreh et al., 2006).
GEFT
The task was administered and scored identically to Study 2; however, given the focus on BPD, the main scales of interest were 100% intensity emotion accuracy, 20% and 40% intensity emotion accuracy, neutral expression accuracy, emotion mislabeling, and neutral mislabeling. These scales correspond to specific processes of interest in the BPD literature (Daros et al., 2013; Meehan et al., 2017). Descriptive statistics and internal consistency for the GEFT scales were generally similar to Study 2 (α range = .24–.92) and are presented in Table 5.
Table 5. Study 3 Correlations Between GEFT and BPQ Scales.
Note. Data presented are from Study 3 (N = 407). |r| ≥ .10 are statistically significant (p < .05) and in bold. BPQ = Borderline Personality Questionnaire; GEFT = Graded Emotional Face Task; BPD1 = Abandonment; BPD2 = Relationships; BPD3 = Self-Image; BPD4 = Impulsivity; BPD5 = Suicide/Self-Mutilation; BPD6 = Affective Instability; BPD7 = Emptiness; BPD8 = Intense Anger; BPD9 = Quasi-Psychotic States.
Study 3 Results and Discussion
Correlations between all BPQ scales and relevant GEFT scales were calculated. Overall, 42 significant (p < .05) correlations emerged, though all were of small magnitude (Table 5). The overall BPD severity score had two significant, though seemingly contradictory correlations: increased accuracy for 20% happy faces (r = .11) and bias toward mislabeling neutral faces as sad (r = .11). In examining individual BPD criteria scales, there was considerable heterogeneity, with one scale having 0 significant correlations (BPD1-Abandonment) and BPD9 (Quasi-Psychotic States) having 19 significant correlations with the GEFT, with all others having 1 to 6 significant correlations. Considering BPD9, the criterion with the most robust correlations, the results indicated overall inaccuracy for 100% negative emotions (rs = −.10 to −.17), inaccuracy for neutral expressions (r = −.18), enhanced emotion detection sensitivity (r = .10), and an overall tendency to mislabel emotional faces (r = .26). BPD9 also showed some relations to subtle emotion accuracy/inaccuracy and general neutral mislabeling (i.e., happy, sad, surprise, and negative mislabeling tendencies). The robustness of this criterion’s relation to FER has not previously been documented; however, it is consistent with previous findings linking psychosis and psychosis-like experiences to impaired FER (Pinkham et al., 2016).
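To give a sense of why correlations as small as .10 reach significance in a sample of this size, the sketch below computes the smallest correlation that is significant at a given two-tailed alpha level. It is an approximation under a stated assumption: the normal critical value is used in place of the exact t critical value, which is slightly larger, so the result slightly understates the exact threshold. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def critical_r(n, alpha=0.05):
    """Approximate smallest |r| that is significant (two-tailed) for sample size n.

    Inverts t = r * sqrt(df / (1 - r^2)) with df = n - 2, using the normal
    critical value as a large-sample stand-in for the t critical value.
    """
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    df = n - 2
    return crit / math.sqrt(crit ** 2 + df)


print(f"{critical_r(407):.3f}")  # Study 3 sample size; prints 0.097
```

A correlation at this threshold accounts for roughly 1% of the variance, which is consistent with describing the significant effects in Study 3 as small.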
It is also worth considering BPD9’s correlations with the GEFT, as well as those from other criteria, in the context of theories put forth by Meehan and colleagues (2017) and Daros and colleagues (2013) that connect BPD and FER. Consistent with Daros and colleagues’ theory, all significant correlations with 100% intensity emotion scales were negative, with BPD9 most strongly capturing this effect. Consistent with both theories, all significant correlations with the GEFT neutral expression category were negative (i.e., BPD4 [Impulsivity], BPD5 [Suicide/Self-Mutilation], and BPD9). In considering the possibility of BPD being related to a bias toward social threats (as both theories do), it is worth noting that correlations with subtle expressions (20%/40%), biases in mislabeling emotional expressions, and biases in mislabeling neutral expressions all may provide evidence for preferential processing of social threat information and such scales are used for this purpose throughout the BPD-FER literature (e.g., Meehan et al., 2017). All BPD scales, except BPD1, showed at least one significant small correlation with one of these scales, though the direction of these correlations was at times contradictory (e.g., BPD4-20% disgust, r = .11; BPD9-40% disgust, r = −.18). Summarizing the evidence for consistent biases across BPD scales by emotion, the clearest evidence emerged for biases toward labeling expressions as happy or sad, followed by some evidence for an anger bias. There were fewer significant correlations between BPD scales and relevant surprise, fear, and disgust scales, with several of these correlations suggesting biases in opposite directions.
Overall, there was some evidence of consistency with the BPD literature and prominent existing theories of the BPD-FER relation, particularly in findings related to deficits with neutral emotions and accuracy for 100% intensity emotion expressions. Interestingly, the finding that BPD criteria may be heterogeneously related to FER performance is also consistent with previous research (Meehan et al., 2017), though, in general, criterion-specific relations are underexplored and the relevance of psychosis-like experiences (BPD9) has previously not been shown. In contrast, findings regarding emotion-specific biases were less consistent with the previous literature, with the clearest effects being present for mislabeling faces as happy or sad and evidence for a “social threat bias” (anger, disgust, fear) being weaker. As observed in Study 2 and discussed further below, the internal consistency of subtle expression scales and neutral mislabeling scales was in general inadequate, suggesting that reliably measuring and replicating subtle FER biases may be difficult. Future work may need to consider alternative methods and novel analytic approaches for more reliably capturing such effects.
General Discussion
This study sought to develop and validate an FER task based on theory-driven modifications that psychopathology researchers often make, both to examine the impact of these modifications and to create a task for future use. Important features of this task included the use of all six “basic” emotions, gender and ethnic diversity in stimuli, neutral expressions, time constraints, morphed-graded intensity faces, scales for specific FER biases, and empirically driven stimulus selection. Beyond task features, this study used a multisample development approach and included a multimethod evaluation of the task’s validity. The resulting task—the GEFT—demonstrated evidence of reliability and validity; however, specific scales performed more poorly than expected and raised concerns about procedures used throughout the literature for developing tasks and comparing studies using differing FER paradigms.
Reliability and Range Restriction
Although the GEFT total score, emotion scales, emotion detection sensitivity index, and emotion mislabeling indices generally showed good internal consistency and limited evidence of floor or ceiling effects, GEFT emotion subscales and neutral mislabeling scales showed less favorable psychometric properties. Internal consistency varied from poor to adequate for emotion subscales, consistent with the results of Cecilione and colleagues (2017), and was almost universally poor for neutral mislabeling scales. In both cases, variability and ceiling/floor effects appear to have played a role. For the neutral mislabeling scales, the issue was clear: Participants generally were accurate with neutral expressions, and the base rate of mislabeling neutral faces with any single emotion was low, thus creating strong floor effects (i.e., >50% of participants scored “0%” for each scale). The exception to this finding would appear to also support this conclusion. The negative neutral mislabeling scale was based on the incorrect labeling of neutral faces with any negative emotion (i.e., anger, disgust, fear, and sad) and thus had a higher base rate of mislabeling, resulting in less extreme floor effects and higher internal consistency. Notably, this scale showed small negative correlations with overall performance on another FER task (ER-40 total) and small positive correlations with two BPD criteria, suggesting it may be of interest in future psychopathology research (e.g., Staugaard, 2010).
For emotion subscales, the results were more complicated, in that expression and intensity appear to interact, such that ceiling and floor effects differed by emotion. For instance, the 40% happy scale showed the highest internal consistency of happy subscales and ceiling effects were evident in higher intensity happy scales, whereas an almost opposite pattern emerged for fear subscales (i.e., floor effect with fear—20%). This is in contrast to Cecilione and colleagues (2017), who found that low retest reliability was most clearly associated with low-intensity stimuli. Plausible explanations for the variability observed in this study include (a) some expressions may be easier or more difficult to recognize (e.g., fear; Matsumoto et al., 2000), (b) there may be differences in the quality of actor expressions by category, and (c) there may be other important stimuli differences across categories (e.g., surprise faces being open-mouthed). More generally, the prevalence of floor and ceiling effects suggests that the practice of grouping extreme facial stimuli together may itself be problematic, as it results in complicated patterns of poor psychometric properties across emotion categories (i.e., reduced scale variability and reliability). Overall, these results suggest caution in using GEFT emotion subscales, as well as the general procedure of grouping FER stimuli into scales by intensity.
The poor internal consistency of intensity-based subscales and neutral mislabeling scales should be concerning to BPD and psychopathology researchers. The magnitude of a correlation between variables is directly constrained by each variable’s reliability, such that for two variables X and Y, $r_{\text{observed}} = r_{\text{true}} \sqrt{R_X R_Y}$, where $R_X$ and $R_Y$ are the reliabilities of X and Y. This relation is directly demonstrated by Figure 3. Consequently, it is likely that the poor reliability has limited the ability of researchers to discover and replicate important relations between psychopathology and social cognitive processes. Indeed, this was evident in Study 3, in which previous results for emotion-specific biases from the BPD literature were only partially replicated. Notably, many studies have used similar morphing procedures to alter the intensity of facial expressions (Lazarus et al., 2014; Paiva-Silva et al., 2016) and used the resulting stimuli to measure the effect of intensity (e.g., higher performance on low-intensity stimuli in BPD; Meehan et al., 2017), as well as created scales for biases in labeling neutral faces with particular emotions (e.g., count of number of times emotion incorrectly applied). It is difficult to estimate the impact of poor reliability on previous research because information on reliability is rarely reported for FER tasks and other social cognitive measures (e.g., Schlegel et al., 2017); however, the present findings nonetheless suggest that previous research using similar tasks should be interpreted cautiously.
To improve methods in the FER literature, we make several recommendations. First, studies should report the psychometric properties of FER and other social cognitive tasks. This would be helpful for evaluating and comparing studies, as well as improving methods in this area. Second, attempts should be made to study and improve the psychometric properties of tasks (e.g., Pinkham et al., 2018). As one solution, researchers may consider adding extra stimuli to increase reliability; however, in the case of measuring the effect of intensity upon accuracy, this may ignore the broader problem of range restriction mentioned above. Thus, a third recommendation is to consider novel analytic techniques, such as latent growth curve modeling (e.g., Cecilione et al., 2017) and signal detection theory analyses (Thomas et al., 2018) to more directly, and reliably, model how accuracy is related to the intensity (vs. subtlety) of facial expression and how psychopathology may relate to differing perceptual thresholds.
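The suggestion of adding stimuli to increase reliability can be made concrete with the Spearman-Brown prophecy formula, which projects the reliability of a lengthened scale. The sketch below is illustrative only: it assumes that any added stimuli are psychometrically parallel to the existing ones, and it does not address the range restriction problem noted above. The starting value of .60 corresponds to the mean alpha of the emotion subscales in Study 2; the function names and the target of .80 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by length_factor.

    length_factor = 2 means doubling the number of (parallel) stimuli.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which a scale must be lengthened to reach a target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


print(round(spearman_brown(0.60, 2), 2))           # doubling a scale with alpha = .60; prints 0.75
print(round(length_factor_needed(0.60, 0.80), 2))  # prints 2.67
```

Under these assumptions, a 10-stimulus subscale with α = .60 would need roughly 27 stimuli to reach .80, which illustrates why lengthening alone may be impractical for intensity-specific subscales.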
Convergent and Discriminant Validity
Although most GEFT scales demonstrated convergent validity, the evidence for discriminant validity was less definitive than hypothesized; the GEFT total score was most strongly correlated with the ER-40, RMET, and STEU, more weakly correlated with the IIP-SC, and not significantly correlated with the BART. That the STEU correlated so strongly with the GEFT is surprising, given that its situational judgment task format does not require the same perceptual process that the visual stimuli of the GEFT, ER-40, and RMET share in common (e.g., Mitchell & Phillips, 2015). Nonetheless, there are studies that suggest a link between the STEU and the RMET (e.g., Elfenbein & MacCann, 2017), as well as the STEU and FER (Austin, 2010; Vonk et al., 2015), and that these relations may be of similar magnitude to relations between FER tasks (e.g., Scherer & Scherer, 2011; Schlegel et al., 2013). Possible interpretations of this literature and the present findings include (a) poor task reliability limiting effect sizes, (b) FER tasks measuring less specific psychological processes than theorized, and (c) the need for neural data to disentangle these processes (e.g., Mitchell & Phillips, 2015). Further addressing such issues is beyond the scope of this article; however, we believe that psychopathology researchers should consider this possible lack of specificity when incorporating FER and other social cognitive tasks into their work.
Considering specific expression scales, the strongest evidence of convergent and discriminant validity emerged for GEFT fear, sad, and neutral scales, as these scales had specific relations to their respective ER-40 parallel scales. Somewhat weaker convergent and discriminant validity emerged for GEFT anger, happy, surprise, and disgust scales. One likely reason for this is that the ER-40 lacks disgust and surprise faces. In addition, it should be noted that the ER-40 scales had generally inadequate internal consistency at the emotion scale level, in contrast to evidence of better reliability for its total score (e.g., Pinkham et al., 2016). Thus, it is not surprising that GEFT happy and anger scales did not converge as strongly with their ER-40 counterparts, which both had poor internal consistency (i.e., α < .55). Although these explanations plausibly account for the mixed evidence of convergent and discriminant validity, it is also worth considering that the tests may have more fundamental differences. For instance, it may be that the two additional emotion categories of the GEFT (i.e., disgust and surprise) alter task difficulty or even the cognitive processes being assessed, relative to the ER-40. In future work, it would be useful to more systematically manipulate such task features, so that their impact upon reliability and validity can be better understood.
Implications for BPD and Psychopathology Research
The findings from this study have direct implications for FER research in the BPD literature, as well as psychopathology research more generally. First, considering the findings for BPD in Study 3, it is worth noting that this study supported previous findings for deficits in recognizing neutral and 100% intensity emotional expressions, particularly negative ones relevant to social situations (e.g., disgust); however, the evidence of specific FER biases was less consistent with previous research. Second, most effects were specific to individual BPD criteria and did not emerge at the level of the overall BPD severity scale, consistent with previous findings of moderators of the BPD-FER association (Meehan et al., 2017) and general findings that BPD may be a broad, nonspecific construct (Williams et al., 2018). Although these findings provide some support for the criterion validity of the GEFT and may advance our understanding of the BPD-FER literature, they also indicate concerns and challenges going forward for the BPD literature and other psychopathology literatures that use FER tasks.
One such challenge is that the processes of interest to psychopathology researchers may be difficult to reliably measure and the effects of interest may be relatively small in the general population. Significant correlations between BPD scales and the GEFT in Study 3 ranged from .10 to .26. Meehan and colleagues (2017) found relatively similar effect sizes using a task with some similarities to the GEFT (multiple emotional expressions, morphing, etc.). One possibility is that the effects of interest are attenuated due to the unreliability of the measures (Cooper et al., 2017). It is also conceptually possible that individual differences in social cognition in those with BPD traits are relatively small, but nonetheless have an important impact on social functioning as they compound over a great many social interactions (Funder & Ozer, 2019). Regardless, at present, the combination of small effect sizes and measures with low reliability suggests that effects may be difficult to replicate across studies. Indeed, Meehan and colleagues (2017) found effects that this study did not (e.g., fear accuracy), whereas this study identified effects not present in that study (e.g., happy bias for mislabeling neutral faces). Although it is difficult to estimate the effect that low reliability and small effect sizes have on the literature, the present results suggest that between-study differences should be interpreted with caution.
A related issue is the difficulty of understanding this study’s results in the context of previous studies because of variation in FER tasks. For instance, although the task used by Meehan and colleagues (2017) is relatively similar to the GEFT, differences may be noted: (a) the absence of happy or surprised faces, (b) the use of an “emotion detection” question prior to emotion labeling (i.e., only “neutral” and “emotion” options), (c) different intensities (i.e., 0%, 25%, 50%, and 75%), and (d) the absence of any time limit for responses. In even greater contrast, one study finding a negative bias for mislabeling neutral faces in BPD used only one level of intensity (60%), only three expressions (happy, anger, and neutral), varied presentation time (i.e., 100 milliseconds vs. 3 seconds), and preceded each face with an image that was manipulated to induce specific emotional states. Thus, these two studies differ not only in the number/type of emotions presented, time constraints, and emotion intensity, but use other experimental manipulations as well. Reconciling divergent findings across studies such as these requires considerable speculation, in part because the effect that such manipulations have on the reliability and validity of FER tasks is not well understood. Understanding the effect of such manipulations and better integrating this literature requires more common ground in methodology.
In response to this concern, this study provides a new FER task with known psychometric characteristics and relevant scales: the GEFT. The benefit of a measure with known baseline characteristics is that researchers can explore diverse task modifications and have a reference point for understanding what effects their modifications have, in terms of reliability and validity. The GEFT is designed to be simple and flexible to implement. The full set of morphed faces and Inquisit code can be obtained through requesting access to use the NimStim database, though unfortunately the task cannot be implemented online. Researchers using other laboratory software need only to be able to create a program with a fixation cross, picture presentation element, response options, and time constraints to implement the GEFT.
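As a rough illustration of the four components just listed, the sketch below represents a single trial as a sequence of phases (fixation cross, face picture with response options) and scores a response against a time limit. It is not the distributed Inquisit code: the class names, file name, durations, and response labels are all hypothetical placeholders, and researchers should take the actual parameters from the task materials.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    image_file: str   # hypothetical file name of a morphed face
    emotion: str      # emotion displayed by the face (the scoring key)
    intensity: int    # morph intensity in percent


@dataclass(frozen=True)
class Timing:
    fixation_ms: int        # placeholder duration of the fixation cross
    response_limit_ms: int  # placeholder time limit for responding


def trial_phases(trial: Trial, timing: Timing, options: list) -> list:
    """Return the ordered phases a presentation program would need to run."""
    return [
        {"phase": "fixation", "duration_ms": timing.fixation_ms},
        {"phase": "face", "image": trial.image_file, "options": options,
         "duration_ms": timing.response_limit_ms},
    ]


def score_response(trial: Trial, timing: Timing,
                   response: Optional[str], rt_ms: Optional[int]) -> dict:
    """Score one labeling response; late or missing responses count as timeouts."""
    if response is None or rt_ms is None or rt_ms > timing.response_limit_ms:
        return {"correct": False, "timed_out": True, "mislabel": None}
    correct = response == trial.emotion
    return {"correct": correct, "timed_out": False,
            "mislabel": None if correct else response}


# Illustrative values only; none of these are taken from the GEFT materials.
OPTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
timing = Timing(fixation_ms=500, response_limit_ms=5000)
trial = Trial(image_file="actor01_anger_40.png", emotion="anger", intensity=40)

print(trial_phases(trial, timing, OPTIONS))
print(score_response(trial, timing, response="fear", rt_ms=1800))
```

Recording the specific wrong label, rather than only accuracy, keeps the information needed to compute mislabeling (bias) scores alongside accuracy scores.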
Limitations and Future Directions
Although this study had many strengths, including large sample sizes and multimethod assessment, several limitations are worth noting. First, this study used student samples, which may be higher functioning than patient samples and thus increase the ceiling/floor effects and reduce variability. Nonetheless, a previous study examining a range of social cognitive tasks, including the ER-40, in both psychotic spectrum disorder patients and healthy controls did not find clear evidence of increased floor or ceiling effects (e.g., Pinkham et al., 2018). Furthermore, descriptive data reported in Study 2 suggested that participants in general had higher social functioning than patient samples, though the differences were not large. Regardless, it will be important for future studies to examine the GEFT in clinical samples. Second, any test can be potentially limited by its initial item/stimulus pool. In the case of the GEFT, one might note the inconsistent use of open-mouth faces, use of only static stimuli (e.g., vs. dynamic video-based stimuli; Preißler et al., 2010), and the reliance on Ekman’s six basic emotion paradigm, which has increasingly been called into question (e.g., Barrett et al., 2019). We chose to focus on static pictures of Ekman’s six basic emotional expressions because this has been the dominant paradigm used in psychopathology research and we desired to maximize our continuity with this literature, allowing the present findings to have more direct implications for this work. In addition, it is notable that despite the diversity of actors in the item pool, it was necessary to make adjustments between Studies 1 and 2 to ensure adequate representation of Hispanic actors. Future revisions to the GEFT and other measures should consider broadening the stimuli considered for inclusion, as well as exploring alternative underlying theoretical models (e.g., Said et al., 2009).
Although this study provided evidence that (a) FER task internal consistency may at times be poor and that (b) this poor internal consistency was related to attenuated validity correlations, it should be noted that this study did not systematically manipulate scale internal consistency. Thus, it is unclear if internal consistency’s relation to effect size is causal, as it might be confounded by emotion or intensity effects, though such a confound was not obvious in this study. Future research should examine this further using both simulated and real data that are manipulated to have lower internal consistency and restricted range (e.g., by systematically removing stimuli).
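One way such a simulation might be set up is sketched below. It generates binary accuracy data from a simple one-parameter logistic model, computes Cronbach's alpha, and then compares a full scale with a shortened scale and with a scale made up of very easy (ceiling) items. The model, sample size, item counts, and difficulty values are assumptions made for illustration and do not reproduce the analyses in this study.

```python
import math
import random
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of participant rows of item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def simulate_accuracy(n_people, n_items, difficulty, seed=0):
    """Simulate 0/1 accuracy: P(correct) = logistic(ability - difficulty).
    Very negative difficulty yields near-ceiling accuracy."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        ability = rng.gauss(0, 1)
        p_correct = 1 / (1 + math.exp(-(ability - difficulty)))
        data.append([1 if rng.random() < p_correct else 0
                     for _ in range(n_items)])
    return data


def keep_items(scores, n_keep):
    """Systematically remove stimuli by keeping only the first n_keep items."""
    return [row[:n_keep] for row in scores]


# Hypothetical design: 1,000 simulated participants, 20 stimuli per scale.
moderate = simulate_accuracy(1000, 20, difficulty=0.0, seed=1)
ceiling = simulate_accuracy(1000, 20, difficulty=-3.0, seed=1)

alpha_full = cronbach_alpha(moderate)
alpha_short = cronbach_alpha(keep_items(moderate, 5))
alpha_ceiling = cronbach_alpha(ceiling)

print(f"20 moderate items: alpha = {alpha_full:.2f}")
print(f" 5 moderate items: alpha = {alpha_short:.2f}")
print(f"20 ceiling items:  alpha = {alpha_ceiling:.2f}")
```

In this setup, both removing stimuli and pushing accuracy toward the ceiling lower alpha even though the underlying ability distribution is unchanged. Correlating each version of the scale with a simulated criterion would extend the sketch to the question of attenuated validity correlations.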
The GEFT is well positioned to be used for such future research, as it has a large stimulus pool (e.g., 50 stimuli of varied intensity per emotion) and this study provided baseline psychometric data on the measure, to which future alterations can be compared. Furthermore, these data suggest that many of the GEFT scales have sound psychometric properties (total score, overall emotion scales, emotion mislabeling, etc.), though there are limitations at the subscale level that may reflect wider problems in how social cognitive tasks are developed and used in psychopathology research (e.g., floor effects). Thus, this study offers the GEFT as a measure ready for use, but also challenges the field to rethink the tacitly assumed reliability and validity of social cognitive tasks.
Acknowledgements
A special thanks is due to Drs. Kristin Gainey, Kenneth DeMarree, and Mark Frank for their feedback on a draft of this article. In addition, we thank Dr. Nim Tottenham for agreeing to host files related to the task developed in this study.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
