Abstract
Temporal dynamics have been increasingly recognized as an important component of facial expressions. With the need for appropriate stimuli in research and application, a range of databases of dynamic facial stimuli has been developed. The present article reviews the existing corpora and describes the key dimensions and properties of the available sets. This includes a discussion of conceptual features in terms of thematic issues in dataset construction as well as practical features which are of applied interest to stimulus usage. To identify the most influential sets, we further examine their citation rates and usage frequencies in existing studies. General limitations and implications for emotion research are noted and future directions for stimulus generation are outlined.
Existing research points towards the benefits of facial motion in emotion perception and recognition. By providing unique information about the direction, quality, and speed of motion, dynamic stimuli enhance coherence in the identification of affect, lead to stronger emotion judgments, and facilitate the differentiation between posed and spontaneous expressions (for a review see Krumhuber, Kappas, & Manstead, 2013). In the last two decades, this advantage—paired with the stimuli’s greater realism and ecological validity—has led to increased questioning and criticism regarding the use of static images (e.g., Tcherkassof, Bollon, Dubois, Pansu, & Adam, 2007; Wehrle, Kaiser, Schmidt, & Scherer, 2000), with a gradual shift in interest towards dynamic expressions.
The trend is reflected in the literature with exponential increases of relevant entries over the past 35 years. For example, a Google Scholar search for the phrase “dynamic face” and related phrases returned merely 13 articles during 1980–1989 and 87 articles in the period 1990–1999. This figure rose to 889 results during 2000–2009 and has more than doubled to 2,184 results in the past 5 years, from 2010 to 2015. In order to meet new demands in research on human communication and, increasingly, on machine recognition and human–computer interaction, several databases of dynamic facial stimuli have been developed.
This article aims to provide a systematic review of the existing corpora and draw out the key dimensions and properties of the available dynamic sets. It should be noted that this review is not exhaustive with respect to the stimuli developed within the field of computer science (for an extensive overview of such see Cowie, Douglas-Cowie, & Cox, 2005; Sandbach, Zafeiriou, Pantic, & Yin, 2012; Zeng, Pantic, Roisman, & Huang, 2009). In order to account for the diversity of dynamic facial expression databases, the following selection criteria were applied for inclusion in the present review: (a) public accessibility of the database, (b) database article accessible and published between 2000 and 2015, (c) a minimum of five emotions, (d) digital format of recordings, (e) visual or audiovisual modality of stimuli, (f) real human encoders, and (g) individual portrayals (as opposed to emotive interactions; note that some might contain both types).
In an attempt to provide useful guidance for the readers of this article, we classified databases in terms of three fundamental issues that are relevant to decisions about stimulus sets. These include (a) conceptual features, which reflect thematic approaches in database construction and validation (Table 1), (b) practical features, which concern applied aspects related to stimulus usage (Table 2), and (c) citation and usage frequencies of dynamic datasets in the literature (Table 3), thereby elucidating their respective impact in the field. This latter issue can be categorized according to whether a dataset was used as stimulus material in research with human participants (social sciences) or for the training and testing of machine learning algorithms (computer sciences). With the tables designed to give specific information about each dataset, the accompanying text will focus on a general discussion, which is structured in terms of the key points listed in Table 1 and is intended to address both theoretical and technical issues, as well as possible directions for future stimulus development.
Table 1. Conceptual features of 22 dynamic facial expression datasets.
Table 2. Practical features of 22 dynamic facial expression datasets.
Note. #For databases containing multiple subsets, analyses pertain only to face-focused subsets (excluding social or postural subsets). °This aspect was controlled only to a limited degree. §Face-box was measured by means of custom software based on OpenCV (Version 2; Bradski, 2000) and a Haar classifier (Viola & Jones, 2001), which locates the human face in the image and returns the width and height of the face bounding box. It can be described as either the absolute number of square pixels or as the relative proportion of the visible facial area in comparison to the absolute image size (% area). *Box estimate based on a subset of (available/suitable) videos. †Box estimate based on one sample video only. ‡Box estimate based on image samples extracted from the article itself (PDF). K = thousands.
Key descriptions: Stimuli: V = video; SQ = sequences. Format: V = video; A = audio; S = still images; 3D = 3-dimensional object files. Visible elements: FA = face; HD = head, NE = neck, SH = shoulders; UT = upper torso; AM = arms. Controlled elements: BG = background; LI = lighting; HR = hair; CL = clothing; AS = accessories.
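The two face-box descriptors defined in the note above can be illustrated with a short sketch. It assumes that a face detector (for example, a Haar cascade) has already returned the width and height of the bounding box, and that the absolute measure is the box area (width × height); both the function name and this reading of "square pixels" are assumptions made for illustration, not the procedure used for Table 2.

```python
def face_box_stats(box_width, box_height, image_width, image_height):
    """Return (face-box area in square pixels, face-box area as % of image area).

    Hypothetical helper: the box dimensions are assumed to come from a face
    detector that reports the width and height of the face bounding box.
    """
    if box_width < 0 or box_height < 0:
        raise ValueError("box dimensions must be non-negative")
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    box_area = box_width * box_height
    image_area = image_width * image_height
    return box_area, 100.0 * box_area / image_area


# Hypothetical example: a 200 x 250 pixel face box in a 640 x 480 pixel frame.
area, percent = face_box_stats(200, 250, 640, 480)
print(area, round(percent, 1))  # 50000 16.3
```

Reporting both values is useful because the same absolute box area can correspond to very different proportions of the frame, depending on the nominal resolution of the video.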
Table 3. Citation and usage frequencies of dynamic facial expression datasets.
Note. Citing reference: Articles which referenced the dataset. Keyword or full title: Articles in which the abstract, full title of each dataset, or any known acronyms (if applicable) were mentioned in conjunction with combinations of the keywords “face,” “facial,” “expression,” and “database” using the Boolean “AND” and “OR” operators as appropriate. Dataset usage: Articles in which a dataset was used as stimulus material in research with human participants (social sciences) or for the training and testing of machine learning algorithms (computer sciences).
Emotion Content and Diversity
When choosing an appropriate database, selection criteria should be guided by the specific study question of the researcher (Wagner, 1997). Typically, these tap into two main areas: the production of facial expressions (encoding) or their perception (decoding). While encoding studies target the expressive features associated with an underlying emotional state, decoding studies investigate how those features are perceived and interpreted by observers (Ekman, Friesen, & Ellsworth, 1982). Depending on the question pursued, available sets of dynamic facial expressions may differ in their suitability.
Table 1 lists key conceptual points which shed light on the scope and potential application of each dataset. A brief review of the number and types of emotion concepts demonstrates that many databases adopt a categorical approach. The categorical view suggests a division of emotions into basic, mutually exclusive categories, such that each belongs to one category, with more complex, compound emotions accounted for by a blending between basic ones (Ekman, 1994; Ekman & Cordaro, 2011). Mostly, these categories are the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise (and occasionally, contempt; Ekman, 1992). Databases that are strictly categorically oriented, featuring between five and eight basic emotion concepts (i.e., BU-4DFE, DISFA, FG-NET, STOIC) are suitable for decoding studies. By allowing for the examination of the expressive cues used in perception, emotion attribution processes can be investigated using these sets. However, a greater variety of stimuli is needed for encoding studies to accurately represent the range of emotional states expressed in everyday life (Calvo & D’Mello, 2010; Zeng et al., 2009).
To account for this complexity, the hierarchical approach may be particularly valuable (Shaver, Schwartz, Kirson, & O’Connor, 1987). Whilst often retaining most (if not all) of the basic labels, databases arising from this framework differentiate to capture nonbasic emotions, with their numbers varying between 11 (UT Dallas) and 55 (MPI). Of note is the CAM Face-Voice Battery which contains a hierarchical organization of 412 emotion concepts in 24 overarching groups. Databases with subordinate differentiation within or in place of some of the basic emotion concepts (i.e., BNED, DynEmo, EU-Emotion, GEMEP) serve particularly well for representing different degrees of arousal, which would go unnoticed if generic labels alone were used (Russell, 1980). This approach increases the diversity within emotion types by offering subordinate exemplars of varying intensities (i.e., nervous, anxious and frightened under fear; or amusement, joy, and excitement under happiness).
Databases that span a large range of emotion categories are also well suited for human–computer interaction research (Pantic & Bartlett, 2007). Increasing efforts are targeted towards computer systems that are able to recognize and respond to emotional signals. Such systems have an enormous potential for affective computing in terms of automatic human affect analysis, which can be applied in fields as diverse as security, medicine, education, and telecommunication (Picard, 1997). This rising interest is also reflected in the citation and usage frequency of the available dynamic sets. As shown in Table 3, the most cited and frequently used databases are CK, CK+, FG-NET, and MMI, all of which were created by computer scientists. These databases typically tap into specific and applied research themes in the computer sciences (i.e., comparing and improving the recognition or detection accuracy of machine learning algorithms), whilst the datasets in the social sciences are generally used in a more diverse way. For affect-based recognition systems to process complex facial signals representative of numerous emotions, wide coverage of emotional phenomena including nonbasic affective states may therefore be fruitful (Sandbach et al., 2012).
Attempts to extend the range of emotions represented in the databases would likewise pave the way for larger stimulus numbers in a practical sense. Databases sometimes portray fairly comprehensive sets of different types of expressions (see Tables 1 and 2) but only for a relatively small number of trained encoders (e.g., D3D-FACS, GEMEP), whereas others provide fewer videos per encoder but use a larger subject pool (e.g., DISFA, DynEmo). A few sets (i.e., UT Dallas, BINED) include many videos for a medium number of encoders, but these databases typically do not provide behavioral coding for all stimuli, which is disadvantageous in terms of facial action classification (e.g., Facial Action Coding System [FACS]; Ekman, Friesen, & Hager, 2002). Given these trade-offs, the MMI database, with well over 800 FACS-coded stimuli and 78 encoders, is perhaps the closest to providing a large number of diverse and behaviorally coded stimuli.
Elicitation Type and Control
A major issue for the selection of a stimulus set concerns the type of expression it contains. As can be seen in Table 1, the available dynamic databases predominantly use deliberate expressions, with a majority employing a variant of posing over spontaneous emotion elicitation. Posed expressions can emerge from instructions to perform an expression/facial actions (i.e., ADFES, CK, MMI, STOIC) or the enactment of emotional scenarios using the Stanislavski or other method acting techniques (i.e., DaFEx, EU-Emotion, GEMEP, MPI). Such datasets typically allow for good experimental control and yield standardized and prototypical displays that are similar across encoders (Scherer & Bänziger, 2010). Because they eliminate confounds prevalent in everyday emotion communication (e.g., display rules or emotion regulation strategies), posed expressions are often the preferred choice in decoding studies. Facial behavior of this type is more intense and unambiguous due to the clear intention to convey the desired emotion (Cohn, Ambadar, & Ekman, 2007). This can enhance recognition accuracy (Hess, Blairy, & Kleck, 1997) in studies that aim to test observers’ judgments against a predefined label assigned to the expression (Sneddon, McRorie, McKeown, & Hanratty, 2007).
However, this advantage of comparability and reliability can be a disadvantage in terms of realism. Given that everyday emotional expressivity is relatively subtle and heterogeneous (Motley & Camden, 1988), posed expressions may have lower ecological validity, failing to occur in natural or pseudonatural (e.g., films) emotion episodes (Cowie, 2009; Cowie et al., 2005; Scherer & Ellgring, 2007a). Indeed, evidence suggests that spontaneous expressions differ in appearance and timing from posed ones (Ekman & Rosenberg, 2005). Such differences are also reflected in the stimulus durations of the reviewed dynamic databases (see Table 2). Whilst posed sets (i.e., CK, DaFEx, MMI, STOIC) feature expressions of short (500 ms) to medium length (180 s), stimuli composed of spontaneous expressions (i.e., DISFA, DynEmo, HUMAINE) can last up to several minutes. Approaches based on deliberate and often exaggerated portrayals may, therefore, potentially fail to generalize to real-world behavior (Zeng et al., 2009).
To study emotions that approximate more natural instances, spontaneous databases provide a valuable source of information, especially for encoding studies (Scherer & Bänziger, 2010). Respective expressions are captured inconspicuously in either the lab or field (i.e., BNED, HUMAINE) or via emotion-specific eliciting techniques such as the presentation of emotionally laden pictures/films (i.e., BINED, BP-4D, DISFA, DynEmo, UT Dallas; see Gross & Levenson, 1995). Besides allowing for more fine-grained and natural forms of expression, spontaneous displays can include context-specific information about the emotion-eliciting event. This makes them challenging to analyze as they are often blended rather than pure emotions, with significant variability in expression across encoders (Bänziger & Scherer, 2007). Also, video backgrounds may vary (see Table 2), some having wavy curtains (BNED, HUMAINE) or naturalistic office-type environments that show additional objects such as cables and microphone holders (BINED, FG-NET).
For authentic emotion induction to become the method of first choice, researchers will likely need to aim for a compromise between spontaneity and experimental control (Sneddon et al., 2007; Zhang et al., 2014). At the moment, recording conditions are often not well controlled technically, which affects the quality of the stimuli (see also Bänziger, Mortillaro, & Scherer, 2012). As a result, naturally oriented databases lag behind in providing high-quality, technically sound materials. Of the available sets that include spontaneous expressions, the best recommendations are probably BP-4D, DynEmo, and UT Dallas, all of which (partially) standardize background and lighting and are of acceptable nominal resolution.
In the future, more work could be done to capture facial expressions at higher frame rates (60 fps and higher) using specialized recording equipment. A distinction could also be made between what is visible to the encoder and to the camera/perceiver. For example, dataset authors might want to set up a comfortable and natural environment for the encoder (allowing for spontaneous behavior), while at the same time ensuring that what the camera captures is systematically controlled. In addition to existing and well-validated techniques for emotion induction (for an overview see Coan & Allen, 2007), novel social entities such as virtual agents, robots, and androids may constitute a viable option for eliciting spontaneous expressions. Since their appearance and behavior are fully controllable, human users’ response patterns can be evoked and recorded in a consistent manner (MacDorman & Ishiguro, 2006).
Measurement and Validation
To validate the emotional content of expressions, judgment tasks (also referred to as recognition tests) serve as the primary validation measure in the context of the reviewed posed datasets (Scherer & Bänziger, 2010). With the aim of assessing the accuracy of the conveyed relevant emotions, observers were asked to provide an emotion label that matched the viewed stimulus. Most often this occurred out of a closed set of categorical options (from seven to 24). In some databases (i.e., BNED, BP-4D, DynEmo, HUMAINE) interrater agreement on emotion categories or segments is used as an extra measure of reliability, thus assessing recognition from a second perspective (and accounting for chance agreement if measured by the kappa statistic; Sayette, Cohn, Wertz, Perrott, & Parrott, 2001). Although the forced-choice paradigm yields robust results, particularly in the case of basic emotions (Limbrecht-Ecklundt et al., 2013), it has been criticized for lacking ecological validity since it forces the use of labels that might not otherwise be selected (Russell, 1993; Wagner, 1997).
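The chance correction mentioned above can be made concrete with a small sketch. It assumes the two-rater (Cohen's) form of the kappa statistic applied to categorical emotion labels; the function name and the example labels are hypothetical, and the reviewed databases may have used other variants or more than two raters.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical label per stimulus."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of stimuli")
    n = len(rater_a)
    # Observed agreement: proportion of stimuli given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, computed from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two raters for four stimuli.
labels_a = ["joy", "joy", "fear", "anger"]
labels_b = ["joy", "fear", "fear", "anger"]
print(round(cohen_kappa(labels_a, labels_b), 3))  # 0.636
```

In this example the raters agree on three of four stimuli (75%), but once the agreement expected by chance (31.25%) is removed, kappa is about 0.64, which illustrates why raw agreement rates overstate reliability.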
In order to allow for a more flexible selection of emotion terms, without restraining the observer to one response option, alternative methods include confidence and intensity judgments applied to all emotion labels (e.g., Hi4D-ADSIP, STOIC) or continuous emotion ratings as expressions progress over time (e.g., DynEmo). A few databases use additional supportive measures that tap into the dimensions of valence, arousal and/or intensity (i.e., ADFES, BINED, BNED, EU-Emotion, GEMEP). These provide added value as they offer a more comprehensive framework for emotion assessment than mere categories or hierarchies (Russell, 1980) and can also increase the informative value of a given emotional episode.
For spontaneous datasets, introspective measures constitute an essential validation approach (e.g., BINED, BP-4D, DynEmo). Encoder self-reports of the emotion felt during the elicitation procedure provide insight into the elicitation effectiveness and accuracy of the resulting expression (Gray & Watson, 2007). This enables an evaluation of whether the target emotion was elicited. Nevertheless, reliance on self-report alone remains problematic due to potential discrepancies between what is experienced and what is reported (Nielsen & Kaszniak, 2007). In this context, additional information in the form of audiovisual cues (i.e., gesture, posture, speech) could be particularly useful to yield a coherent representation of the emotion in question (Cowie et al., 2005; Scherer & Ellgring, 2007b). Multimodal stimuli have long been acknowledged to improve emotion classification (Russell, Bachorowski, & Fernández-Dols, 2003; van den Stock, Righart, & de Gelder, 2007). Encoding studies may therefore substantially benefit from the presence of multimodal affective features in databases that allow examination across modalities (i.e., BINED, BNED, DynEmo).
Component measures such as the Facial Action Coding System (FACS) can be of considerable value in this regard by providing an objective classification of the observed behavior (Ekman & Friesen, 1982). Such measures permit a comparison between expressive features and emotion-related variables (i.e., self-reports, physiological responses) in the encoder. In the reviewed datasets, FACS coding is available for both deliberate and spontaneous expressions for the dimensions of action unit (AU) occurrence, intensity, and/or timing. The BP-4D set examines its stimuli using multiple methods (i.e., emotion self-reports, observer judgments, and FACS), thereby providing the most stringent validation of its content.
Some databases also submit their stimuli to machine recognition (i.e., DISFA, BP-4D, CK, D3D-FACS). FACS has been frequently used in studies of automatic expression classification, making it a prominent tool in affective computing (Cohn, Zlochower, Lien, & Kanade, 1999; Lien, Kanade, Cohn, & Li, 2000). Automatic AU recognition has been shown to achieve recognition rates comparable in accuracy to manual coding, indicating its potential to significantly facilitate the labor-intensive process (Cohn et al., 1999). However, most systems employing FACS for facial behavior measurement have still been trained on posed expressions, thereby restricting their applicability in natural settings (Zeng et al., 2009).
To develop automatic systems that are robust to natural variations in appearance, behavior, and context, future research should invest in more stimulus sets containing spontaneous expressions (see BP-4D, DISFA; Bartlett et al., 2006; Pantic, 2009). Such an endeavour would also be advantageous for the (automatic) analysis of the temporal dynamics of spontaneous expressions. Whilst there are a few such attempts (e.g., Cohn & Schmidt, 2004; Valstar, Pantic, Ambadar, & Cohn, 2006), the field is still in its infancy with respect to the extraction and modelling of the temporal structure of spontaneous facial actions, including their temporal relations. To fulfil this requirement, high frame rates and good resolution are necessary preconditions (see Sandbach et al., 2012). Whilst the nominal resolution has increased substantially for some of the most recent sets (i.e., BP-4D, BU-4DFE, D3D-FACS, Hi4D-ADSIP), the effectively available visible area of the face in the video (i.e., face-box; see Table 2) is still less than 300 square pixels for the majority of databases. Such a resolution could prove insufficient for exploring microexpressions or subtle temporal features that require small parts of the face to be clearly visible.
In the future, cooperative efforts between psychology and computer science to work on a common dataset are indispensable (for a positive example see the “Facial Expression Recognition and Analysis” [FERA] challenge; Valstar et al., 2015). At the moment, only a small number of dynamic stimulus sets tend to be commonly cited and employed (i.e., CK, CK+, FG-NET, MMI, GEMEP; see Table 3). When comparing dataset usage between disciplines over the past 15 years, the number of empirical articles in the computer sciences (n = 1,543) vastly outnumbers those in the social sciences (n = 124). It therefore appears as if dataset usage in the social sciences is more restricted, with an almost exclusive focus on posed expressions. For knowledge transfer and dialogue to increase, researchers from both sides will have to embrace the wide variety of available stimulus sets. We hope that the present review helps to enable more work on the dynamic nature of emotions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
