Abstract
The level of realism that real-time virtual humans have reached in recent years enables their use as an alternative to pictures and videos in the remediation of social cognition deficits. This paper presents the engineering principles and tools used to design facial expressions on virtual humans to play basic emotions. The proposal is based on the Facial Action Coding System, which makes it possible to easily represent facial expressions. Then, the paper describes how the designed virtual human facial emotions have been assessed by healthy people. For that purpose, 204 healthy participants took part in an experiment in which they had to recognize the six basic emotions (each of them with two levels of intensity) depicted by the virtual humans. The overall accuracy of the emotion identification task was 88.25%, which outperforms most results obtained by other authors using virtual humans and/or pictures. The best recognized emotions were neutral, happiness and anger. Particularly striking was the high success rate obtained for disgust, far superior to previous studies based on virtual reality. Unlike other works, no significant differences were found between women and men in the recognition of emotions, probably due to the enhanced dynamism and realism of the designed human faces. However, age-related differences were found for some emotions in favor of the younger participants. In addition, higher emotion identification rates were detected for higher intensity representations of each emotion, for more dynamic avatars and for faces shown frontally compared to lateral ones. Therefore, the results of the evaluation experiment have demonstrated that virtual humans can effectively convey emotions using facial expressions.
Introduction
Modeling social interactions is becoming increasingly important in the creation of interactive systems that apply Biomedical Engineering (BME) principles and design concepts [1, 2, 3]. BME seeks to close the gap between engineering and medicine to advance health care, including diagnosis, monitoring, and therapy [4]. Non-verbal communication (gaze, facial movements, gesture, interpersonal distance, etc.) is an essential dimension in the interaction of humans with humans and machines [5]. In this context, facial emotion (affect) recognition is the capacity to distinguish and perceive essential types of affective expressions in faces [6]. This everyday ability is critical for social interaction [7].
Indeed, the way an individual perceives emotional states in others influences his/her social success, which is important for integration into the community [8]. There is consistent evidence that patients with various neuropsychiatric disorders have considerable difficulty in accurately perceiving emotions expressed by others. This leads to a misinterpretation of social situations that supports the presence of psychotic symptoms and a decrease in social functioning [9]. This deficit in facial affect recognition has been widely observed in psychotic disorders, particularly in schizophrenia and related disorders [10]. The deficit appears to be stable throughout the course of the disorder, unrelated to psychopathology or pharmacological treatment, and independent of general cognitive deficits [11, 12, 13].
This is the reason why different psychological interventions have been designed to improve facial affect recognition in patients with schizophrenia. In fact, recent meta-analyses have demonstrated promising results of psychotherapeutic methodologies improving facial affect recognition and functionality [9]. For this purpose, computer-based treatments appear to be well suited. Indeed, access to digital technology and the Internet for individuals experiencing mental disorders [14] enables new ways to deal with their illness. Specifically, frameworks for emotion detection [15, 16, 17] and facial emotion detection [18] have been at the focal point of our research. The generation of frameworks based on virtual humans (VHs) for facial affect recognition [19, 20] is one of our current primary interests. In recent years, a few works on building multi-modal virtual characters for social cognition remediation programs [21, 22] and proposing the use of VHs to describe auditory hallucinations in schizophrenia patients [23, 24] have been presented.
The first effort to reproduce human faces in 3D using computer graphics imagery is probably over 45 years old [25]. Since then, 3D human faces have been widely used in computer games and films. Current advances in graphics technologies are moving out of the uncanny valley, as real-time rendered virtual characters are becoming increasingly realistic. However, an absence of human-like facial expressions portrayed by virtual characters increases the sense of strangeness felt despite physical authenticity [26]. Additionally, physical authenticity ought to be accompanied by social realism [27], as individuals’ expectations regarding realistic movements or behaviors are raised by the level of physical authenticity [28].
A few physical models depicting facial expressions have been proposed from the perspective of muscle activation and can be used to accomplish a higher level of authenticity. One of the most popular models in the literature is the one by Ekman and colleagues [29]. Nevertheless, conveying emotions using virtual characters is not a simple task, and this may be related to the lack of knowledge about the time course of facial movements and the affective content perceived from those movements [30].
The present paper introduces the engineering principles and tools used to design facial expressions on virtual humans for showing basic emotions. An experiment is then carried out with healthy individuals to verify that the designed facial emotions are accurately interpreted by people who have no social cognitive deficits. Thus, the objective of the present paper is the description of the design process of virtual human facial expressions and their validation by 204 healthy participants. The novelty of the article lies, first, in the description of the virtual human design process, which combines several techniques and commercial software packages; second, in the large sample size used to validate the design; and finally, in the fact that the designed facial expressions are shown to each participant with two levels of intensity (low and high), two levels of dynamism (low and high), and from different perspectives (frontal and lateral views). Should the outcomes be conclusive, our virtual human emotional faces could be used in studies with patients with schizophrenia. Conversely, if healthy people were not able to detect facial expressions on virtual humans with high accuracy, it would make no sense to use the avatars in further studies with real patients. The current paper must be considered an initial step towards planning a complete therapy to improve facial affect recognition in these patients.
Facial action coding system
People are social creatures who need to convey emotions in order to socialize. Individuals express emotions in various ways, and the face plays a significant role in how emotions are transmitted in both verbal and non-verbal communication. Affect has been studied in the fields of Psychology and Psychiatry. For example, Ekman and Friesen [31] concluded that there exist six universal basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). Facial muscles are used to display these emotions, thereby changing the appearance of the face. A few studies have attempted to find a framework to evaluate the various changes in the face, providing a few approximations [32].
To reproduce the changes in the face, our proposal is not based on an exact representation and analysis of the muscles, but on the well-known Facial Action Coding System (FACS). We chose this framework since it is the most widely used and has demonstrated its adequacy through a psychometric evaluation for assessing spontaneous expressions [33]. In addition, it gives more information about the changes in the face, such as intensity, than other systems [32]. In fact, FACS, designed in 1978 and modified and improved in 2002 [29], is a notable system that categorizes facial movements based on various Action Units (AUs), instructing how to perceive and score them.
Every AU characterizes a group of muscles that work together to produce a change in facial appearance. The various AUs are grouped according to the location of the facial muscles, which are divided into upper and lower face muscles. Upper face muscles involve the eyebrows, forehead, eye cover fold, and upper and lower lids. The lower face involves the muscles around the mouth and lips, and it is divided into different categories according to the muscles’ movement directions. There are other AUs based on muscles that move the neck and the gaze direction.
Facial expressions in FACS
Only fifteen AUs from the twenty-eight fundamental ones are required to describe the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise) and the Neutral expression used in this research. The AUs used and their connection with every emotion are shown in Table 1.
Twelve animations were modelled. The table shows that: (1) two levels (1, 2) of animation were designed for every emotion, whereas no AU was needed to model the Neutral expression; each level indicates the emotion intensity, with a higher number denoting greater emotional intensity; and (2) new AUs were added for some emotions, indicating that more muscles are moved and more facial changes appear.
The workflow for the generation of the affective VHs is one of the novelties of this proposal. The contribution of our approach lies in the combination of several well-known techniques from the game development and VR fields, using different software packages in order to obtain the desired result. This workflow is depicted in Fig. 1 and described in the following paragraphs.
Workflow of the generation of affective virtual humans.
We began by choosing and tuning two predefined characters available in Adobe Fuse CC [35]. Adobe Fuse CC is a software tool principally aimed at video game developers to create 3D characters using an advanced character creation editor. The characters obtained with this tool exploit numerous visual features that are typical in recent high-end video games, specifically high-resolution textures, normal maps, ambient occlusion maps, etc. Besides, they are completely configurable, from the face to the length of the arms, the torso or the garments they wear. For this work, predefined VHs were chosen (using the option “Assemble”) and not changed apart from picking garments and choosing a hair style (option “Clothing”) that did not occlude significant parts of the face such as the forehead and eyes.
By clicking the “Send to Mixamo” button from inside Adobe Fuse CC, the characters were then exported to Mixamo [36], a web platform that provides auto-rigging of humanoid 3D models. Rigging is a technique used in skeletal animation for representing a 3D character model using a series of interconnected digital bones. It refers to the process of creating the bone structure of a 3D model, which enables manipulating the 3D model like a puppet for animation. Next, an idle animation was chosen from the available ones. This platform also provides a basic form of facial animation that can be used in other programs. However, it is not based on FACS, so there is no direct correspondence between them.
After that, the resulting 3D character was brought into the 3D Studio Max [37] authoring tool to generate the AUs, always starting from the Neutral facial expression. We used blend shapes, also known as morph animation targets, which consist of altering the mesh and storing the vertex positions for every AU. These AUs are then smoothly morphed and combined to shape the desired facial animation by using a “Morph Modifier”. Alternatives to implement AUs are the use of a muscle-based animation system or a hierarchy of bones to modify the geometry. A blend shape animation system was chosen because of its simplicity and the possibility of combining AUs to generate complex facial expressions.
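The blend shape idea described above can be illustrated with a minimal sketch. It assumes the common additive formulation, in which each AU stores a full set of target vertex positions and the displayed mesh is the Neutral mesh plus the weighted offsets of every active AU. The function name, the toy two-vertex mesh and the labels `AU_a` and `AU_b` are hypothetical; the actual “Morph Modifier” in 3D Studio Max may combine targets differently.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def blend_shapes(neutral: List[Vertex],
                 targets: Dict[str, List[Vertex]],
                 weights: Dict[str, float]) -> List[Vertex]:
    """Return neutral + sum_i (w_i / 100) * (target_i - neutral), per vertex.

    Weights use a 0-100 scale and are clamped to that range.
    """
    result = []
    for idx, base in enumerate(neutral):
        x, y, z = base
        for name, weight in weights.items():
            w = max(0.0, min(100.0, weight)) / 100.0
            tx, ty, tz = targets[name][idx]
            # Add the weighted offset of this AU relative to the Neutral mesh.
            x += w * (tx - base[0])
            y += w * (ty - base[1])
            z += w * (tz - base[2])
        result.append((x, y, z))
    return result


# Toy mesh with two vertices; the AU labels are hypothetical placeholders.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = {
    "AU_a": [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],  # moves only vertex 0
    "AU_b": [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],  # moves only vertex 1
}
blended = blend_shapes(neutral, targets, {"AU_a": 100.0, "AU_b": 50.0})
```

Because the offsets are added independently, several AUs can be active at once, which is the property that makes blend shapes convenient for composing complex expressions.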
Once all the AUs were incorporated into the virtual human models, they were exported into Unity 3D [38], the real-time engine used to play the animations. Time is essential information for the optimal planning and scheduling of interaction systems [39]. This software tool makes it possible to use the initial idle animation incorporated in Mixamo in the same manner as the blend shapes included in 3D Studio Max. The facial expressions were then enhanced by adding wrinkles to the models. For this, a custom surface shader using several “Normal map” textures was generated in Unity 3D and used as a material parameter. Normal maps are used to simulate fine details on an object’s surface by changing the surface normals and, therefore, affecting the light calculation on the surface. Each one incorporated a different wrinkle pattern related to each emotion. This shader described the whole visual appearance of the virtual human’s face, using textures for the skin color, normal mapping, reflections (specularity) and ambient occlusion. The standard lighting model was used, and shadows were enabled on all light types (fullforwardshadows option).
The textures generated by Adobe Fuse CC were used in the surface shader. For each facial expression, the Nvidia Normal Map Filter for Adobe Photoshop [40] was used to generate the normal maps based on photos of individuals depicting the facial emotion and on the wrinkle descriptions provided by Ekman and Friesen [34]. We used 7 different normal maps per virtual human, one for each facial expression plus the Neutral one. Finally, the custom surface shader smoothly interpolated between the Neutral normal map and the normal map of every facial expression, simulating a dynamic generation of wrinkles. This interpolation was synchronized with the blend shape animation of the facial expression in such a way that the wrinkles appeared gradually as the animation progressed. A value of 0 for a blend shape related to a given facial expression means that this facial expression is not displayed by the virtual human, while 100 means that the facial expression is completely displayed. This denotes that the vertices of the face depicting the expression have moved from the initial position (Neutral) to the final position for the given expression. Similarly, this numeric value ranging from 0 to 100 was used to blend the normal maps from the Neutral normal map to the corresponding normal map for the target facial expression. These two actions happening together gave the impression that the wrinkles are generated by the motion of the face when depicting each expression. For instance, Fig. 2 shows how the Surprise and Disgust facial expressions are enhanced by using wrinkles.
Male virtual human demonstrating the surprise emotion (top images) and female virtual human showing disgust (bottom). Both pairs of pictures show how they look without (left) and with (right) normal maps. Note the wrinkles on the forehead and on both sides of the mouth in the male VH, and the wrinkles around the nose and the mouth in the female.
As the fine-tuning of most parameters related to the AUs was done by hand, the way decisions were taken should be highlighted. Two engineers designed a first version of the avatars from scratch. Then, another engineer and the psychiatrists discussed the likeness of the virtual humans’ emotions to human ones. The final version of the avatars was obtained in an iterative manner.
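The synchronization between the blend shape value and the wrinkle normal maps can be sketched for a single texel as follows. This is only an illustration written in Python rather than shader code: it assumes a plain linear interpolation driven by the 0 to 100 blend shape value, followed by renormalization. Whether the actual Unity 3D shader renormalizes the result is not stated in the text, and the function name is hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def blend_normal(neutral_n: Vec3, expr_n: Vec3, weight: float) -> Vec3:
    """Interpolate one normal from the Neutral map towards the expression map.

    `weight` is the blend shape value (0 = Neutral, 100 = full expression),
    so the wrinkles grow at the same pace as the vertices move.
    """
    t = max(0.0, min(100.0, weight)) / 100.0
    mixed = tuple(a + t * (b - a) for a, b in zip(neutral_n, expr_n))
    length = math.sqrt(sum(c * c for c in mixed))
    if length == 0.0:
        # Degenerate case (opposite normals at t = 0.5): keep the Neutral normal.
        return neutral_n
    return tuple(c / length for c in mixed)


# A flat Neutral texel blended halfway towards a tilted "wrinkle" texel.
flat = (0.0, 0.0, 1.0)
tilted = (1.0, 0.0, 0.0)
halfway = blend_normal(flat, tilted, 50.0)
```

Driving both the vertex interpolation and this normal interpolation from the same value is what keeps the geometry and the shading in step.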
Different variations of the virtual humans used in the emotion recognition evaluation.
In a previous experiment [41], we detected problems with some facial expressions. Originally, Happiness included a third variation of lower intensity, a very subtle smile which was mainly confused with the Neutral expression. It seems that the participants did not consider it to be a sign of happiness, even when they noticed the smile. Therefore, that variation was excluded from the present study. There were also some identification problems for Disgust and Sadness, which led to their redesign.
Validation of facial expressions by healthy people
A validation of the system was performed to assess its suitability for a future clinical therapy aimed at enhancing facial affect recognition in people with schizophrenia. Therefore, it was first considered mandatory to evaluate whether healthy people are capable of recognizing the virtual facial expressions generated. In addition, the acceptance and use of such an avatar-based proposal has been positively assessed by 41 therapists in a recent previous work [42].
The sample size was 204 healthy volunteers. The single inclusion criterion was to be aged between 20 and 79 years. The mean age of the participants was
Experimental procedure
As explained before, this experiment aims at examining whether the emotions shown by the designed virtual humans convey the same emotional meaning as real human expressions. Two variations of each of the original VHs previously used by our research team [41] were designed. This made a total of six VHs (see Fig. 3). Four were Caucasian (two males and two females) and two African (one male and one female). All six VHs were designed with two age representations, namely an adult of about 30 years and an older adult. As said, seven emotions were portrayed for this experiment based on FACS: the six basic emotions and the Neutral expression. Two different intensities were implemented for each basic emotion, giving a total of thirteen emotion animations, labeled as neutral, surprise1, surprise2, fear1, fear2, anger1, anger2, disgust1, disgust2, happiness1, happiness2, sadness1, and sadness2 (written in lower case).
Emotion recognition rates for each of the 13 animations. Average successful recognition rate: 88.25%
Each facial expression was presented to the participants four times (two frontal and two lateral views, one from each side). This made a total of fifty-two facial expression representations. Thirty-six of them were shown using the younger adult versions of the Caucasian VHs, eight using the older adult version of the Caucasian VHs and eight using the African VHs. These fifty-two facial representations using different variations were randomly presented to the participants. Half of them were presented with less dynamism (only the most characteristic facial features presented movement, i.e. the AUs described in Table 1) and the other half with more dynamism, including movement of the head and the neck in order to bring more realism to the expressions.
The validation process is depicted in Fig. 4. It starts by requiring the participant to complete two questionnaires. The first one includes a collection of social, demographic and clinical data. The second is the Spanish version of the positive and negative affect schedule (PANAS [43]), which is a 20-item self-report questionnaire that measures the individual’s positive and negative affect. This schedule is included in order to control for the participant’s mood state or non-specific depression symptoms. After completing the questionnaires, the participant starts the evaluation test.
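The composition of the fifty-two presentations can be sketched as a small trial generator. This is a hypothetical reconstruction, not the software used in the experiment: it crosses the thirteen animations with the four views, marks half of the trials as more dynamic and shuffles the order. How dynamism and the VH variations (age and ethnicity) were distributed across emotions is not detailed in the text, so dynamism is assigned at random here and the VH variation is left out.

```python
import random
from itertools import product

EMOTIONS = ["surprise", "fear", "anger", "disgust", "happiness", "sadness"]
# Thirteen animations: Neutral plus two intensity levels per basic emotion.
ANIMATIONS = ["neutral"] + [f"{e}{lvl}" for e in EMOTIONS for lvl in (1, 2)]
# Four presentations per animation: two frontal, one from each side.
VIEWS = ["frontal", "frontal", "lateral_left", "lateral_right"]


def build_trials(seed=None):
    """Return the 52 trials in random order, half of them with high dynamism."""
    rng = random.Random(seed)
    trials = [{"animation": a, "view": v} for a, v in product(ANIMATIONS, VIEWS)]
    half = len(trials) // 2
    dynamism = ["high"] * half + ["low"] * (len(trials) - half)
    rng.shuffle(dynamism)  # assumption: random assignment of dynamism
    for trial, level in zip(trials, dynamism):
        trial["dynamism"] = level
    rng.shuffle(trials)    # random presentation order
    return trials


trials = build_trials(seed=0)
```

Passing a fixed `seed` makes a given presentation order reproducible, which can be useful when auditing a session afterwards.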
A short tutorial describing the task is presented to the participant. He/she must press a button to start the experiment. The validation process starts by displaying the first facial expression. Each time a new facial expression is shown, the character’s face is faded-in from a black background. A transition is made from the neutral expression to the new emotion (lasting 0.4 seconds), which is held for 1.5 seconds. Then, there is a new transition to the neutral expression (again, 0.4 seconds). This is in accordance with transition times studied in well-known works [44], as expression time lasts between 0.5 and 4 seconds.
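The timing just described can be written as a function that returns the blend shape value at a given time after onset. The sketch below assumes linear ramps for the two 0.4 s transitions, since the easing curve actually used is not stated; the function and constant names are hypothetical.

```python
RAMP_SECONDS = 0.4   # Neutral -> emotion, and emotion -> Neutral
HOLD_SECONDS = 1.5   # time the full expression is held
TOTAL_SECONDS = 2 * RAMP_SECONDS + HOLD_SECONDS  # 2.3 s overall


def expression_weight(t, ramp=RAMP_SECONDS, hold=HOLD_SECONDS):
    """Blend shape value (0-100) at `t` seconds after the expression starts."""
    total = 2 * ramp + hold
    if t <= 0.0 or t >= total:
        return 0.0                           # Neutral before and after
    if t < ramp:
        return 100.0 * t / ramp              # ramp up (assumed linear)
    if t <= ramp + hold:
        return 100.0                         # expression fully displayed
    return 100.0 * (total - t) / ramp        # ramp down (assumed linear)
```

With these values the whole expression lasts 2.3 seconds, which falls inside the 0.5 to 4 second range cited above.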
Once this process has finished, a panel is presented asking the participant to identify the expression just shown by the VH. This panel includes a button for each of the six basic emotions and the Neutral expression. Once the participant has selected an option, the character’s face is faded out. This process is repeated for each of the fifty-two facial expressions. The experiment finishes once the system has presented all the facial expressions to the participant.
The same idle animation is used for all virtual humans. This animation is subtle enough not to distract the participant, while it adds a slight swing that provides more realism to the character. Similarly, a blinking animation is also added, but only during the time the system is waiting for a participant’s response, not during the actual visualization of the emotion.
Accuracy in facial expression recognition
The number of correct answers did not follow a normal distribution (Kolmogorov-Smirnov test:
As shown in Table 2, the percentage of successful recognition for each face expressing an emotion is high, all well above random chance (set at 14%), and all above 80% except fear1 and fear2. These results are consistent with previously published studies in which Fear and Disgust were the least identified emotions [45, 46]. In our study, although Fear is the least recognized emotion, it obtained a percentage of hits similar to previous studies [46, 20].
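To make the quantities in this comparison concrete, the sketch below computes per-animation recognition rates from a confusion table and the chance level for a seven-option forced choice (1/7, about 14%). The counts are made up for illustration only and do not reproduce Table 2; the function name is hypothetical.

```python
LABELS = ["neutral", "surprise", "fear", "anger", "disgust", "happiness", "sadness"]
CHANCE_LEVEL = 1 / len(LABELS)  # seven answer buttons -> about 0.143


def recognition_rates(confusion):
    """Map each shown animation to the fraction of correct answers.

    `confusion[shown][answered]` holds answer counts. The intensity suffix
    of the animation label ("fear1" -> "fear") is stripped to find the
    correct answer.
    """
    rates = {}
    for shown, answers in confusion.items():
        total = sum(answers.values())
        target = shown.rstrip("12")
        rates[shown] = answers.get(target, 0) / total if total else 0.0
    return rates


# Made-up counts, for illustration only.
confusion = {
    "fear1": {"fear": 60, "surprise": 25, "sadness": 15},
    "neutral": {"neutral": 95, "sadness": 5},
}
rates = recognition_rates(confusion)
```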
Emotion recognition rate with (left) and without (right) VH movement. Average successful recognition rates are 90.12% and 85.03%, respectively
A closer look at the table shows that fear1 was confused with Surprise and Sadness, while fear2 was mainly confused with Surprise. This is surprising, as it differs from the results of our previous experiment using the same facial expressions [41], especially for Fear. With regard to disgust2 and sadness2, a great improvement is noticeable, as they improve from 68.5% to 89.6% for disgust2 and from 31.5% to 82.6% in the case of sadness2.
Regarding the rest of the emotions, the only one with a confusion percentage above 10% is disgust1, which was mainly confused with Anger. Remarkably, this is also consistent with previous literature [45, 47]. Nevertheless, the high percentage of successes for Disgust is striking in relation to previous outcomes [20, 45, 46, 47]. Previous studies have reported a limitation in the recognition of Disgust through virtual reality. This phenomenon may be due to the difficulty of realistically recreating the nasolabial area [48]. For this reason, we paid special attention to this region during the redesign of emotions.
In general terms, the average percentage of successful recognition is high (88.25%), an improvement of 4.69 percentage points over our previously reported result (83.56%) [41]. This degree of accuracy is superior to that obtained in previous works with classical stimuli (natural faces) like the Ekman-60 Faces Test [49] or the Penn Emotion Recognition Test-96 Faces version [50]. With the first stimulus, the general percentage of success in two different populations was approximately 82% [49, 51]. With the second stimulus, a general percentage of success close to 70% was obtained [20]. In the latter study, the classic stimulus was compared with a set of virtual faces. With the virtual stimulus the overall accuracy was over 73%.
Similar studies based on virtual faces obtained an average percentage of success ranging from 62 to 78% [45, 47, 46, 19]. The last two papers included more emotions, namely contempt, embarrassment and pride. Thus, the results obtained in our study are in line with previous studies using virtual humans for the recognition of emotions. Only one previous study obtained a percentage of success superior to all other studies (91.7% [52]). Nonetheless, this study was only applied to 41 participants. Moreover, other findings of our experiment are consistent with previous studies using virtual humans to recognize emotions, where the Neutral expression and the Happiness emotion were the most easily recognized, followed by Anger and Surprise.
There was no significant difference in the overall number of correct answers for all emotions per gender (Mann-Whitney:
Emotion recognition rates for frontal (left) and lateral (right) views of the VHs. Average successful recognition rates are 9.06% and 87.45%, respectively
Emotion recognition rates with virtual human movement and using front cameras (left), and without virtual human movement and using lateral cameras (right). Average successful recognition rates are 91.18% and 84.42%, respectively
Regarding age, we created two groups (younger and older adults) using the median (47 years) to split the two classes. Similarly, no significant differences were found in the total number of correct answers. However, there were differences for some individual emotions. The results for anger1, happiness2 and sadness1 were significantly better for the younger group (
As discussed earlier, each emotion was depicted by the VHs with two different intensities. In this case, the Wilcoxon Signed Rank test revealed a significant difference for the intensity parameter (
A deeper study focusing on individual emotions revealed that fear1, anger1 and disgust1 obtained a higher number of correct answers for higher intensity emotions (
A subtle movement of the VHs’ head and upper body was included in 50% of the emotions presented to the participants in order to study its effect on the recognition rate. Our hypothesis was that it would increase the rates of successful recognition because the number of hints was increased. The Wilcoxon Signed Rank test (
Table 3 shows the confusion matrices for both conditions. The average successful identification rate with movement was 90.12%, while it was 85.03% without movement. A general improvement is noticeable, for example, when looking at fear1. Using the less dynamic virtual humans, this emotion was confused with Sadness in almost one-fourth of its presentations to the participants. This was reduced to 3.7% when more dynamic VHs were used. It makes sense that greater dynamism in the area of the neck and face is related to a better identification of emotions due to the increased realism.
In addition, three different camera angles were randomly used to present the VHs to the participants (50% from the front and 50% from both sides). Our hypothesis was that front views would obtain a higher number of correct answers than lateral views. This was confirmed for the total number of correct answers by a Wilcoxon Signed Rank test (
Histogram showing number of participants in terms of ranges of bad responses.
Identification errors per each of the 52 faces presented to the participants. The trend line shows a negative slope of 18%.
This confirmed that the combination of virtual human movement and frontal cameras obtains the best possible results. Table 5 shows this enhanced combination side-by-side with the worst combination, which does not use virtual human movement and only lateral cameras. This comparison maximizes the differences. Apart from the ones mentioned before (fear1 and fear2), there are other noticeable differences between both tables. sadness2 is confused with Fear 10% of the times it is presented to the participants in the worst combination (see left side of Table 5). This percentage is reduced to 5% for the best combination (see right side of Table 5). Moreover, this issue is of great importance, as there are no studies published to date that include different camera angles and compare the identification of emotions in the avatars presented with frontal and lateral views.
The average number of errors per participant (measured as the mistakes made when identifying an emotion performed by a virtual human) was 6.11 (
Influence of the number of faces presented on the number of errors made
We wanted to study whether the number of emotion identification errors increased or decreased during the progression of the test. Figure 6 plots in the
Two situations would be possible: (1) more errors are made at the beginning, which would mean that the participants learn and improve along the test, or, (2) more errors are made at the end, which would mean that they become tired. The trend line (colored in red in Fig. 6) shows a reduction in the number of errors as the test progresses. The slope of the line is
The progressive loss of attention during the performance of a task is well known, as is the improvement in cognitive performance after the repetition of an instruction. However, no previous studies of facial emotion recognition that evaluate the “fatigue effect” or “learning effect” have been published. Therefore, our results cannot be compared with any previous work.
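A trend line of the kind plotted in Fig. 6 can be computed with an ordinary least-squares fit of the number of errors against the presentation order. The sketch below is an assumption about the method, since the text does not state how the trend line was fitted, and the input values are made up. A negative slope would point to a learning effect and a positive one to a fatigue effect.

```python
def trend_slope(errors_per_face):
    """Least-squares slope of errors versus presentation position (1, 2, ...)."""
    n = len(errors_per_face)
    if n < 2:
        raise ValueError("at least two presentations are needed")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(errors_per_face) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errors_per_face))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up error counts for four consecutive presentations.
slope = trend_slope([10, 8, 6, 4])
```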
Conclusions
The objective of this paper was twofold. On the one hand, it aimed at describing the complete engineering design process of virtual humans capable of expressing facial emotions. Following the Facial Action Coding System, six facial expressions with two levels of intensity were designed to convey the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise), plus the Neutral one. On the other, this paper has described in detail the assessment by 204 healthy people of the avatar expressions designed.
The work described in this paper is a follow-up of a previous pilot study [41]. The current work has improved the design of the virtual humans, introducing variations for age and race, and increasing the number of participants in the evaluation from 23 to 204. This has enabled a further analysis of the results and allowed us to reach broader and more precise conclusions.
The overall success recognition rate was 88.25%, which is consistent with (and mostly outperforms) the results obtained by other authors in previous works using virtual humans and/or pictures [19, 20, 46, 53].
Moreover, age and gender were found to have no statistically significant influence on the overall recognition rate. However, the difference between the two age groups proved to be significant for some emotions, with better results for the younger group. Another interesting result is that the intensity of each emotion had a statistically significant effect, the more intense emotional expression being easier for the participants to recognize than the less intense one. Other technical aspects have also been evaluated, such as the camera angle from which the faces were presented to the participants, and the level of dynamism of the VH. The results showed that frontal cameras (versus lateral cameras) and more dynamic VHs (in relation to less dynamic VHs) provided the best results in terms of overall successful recognition rates.
In summary, the results show that the virtual faces designed in the experiment are valid for accurately recreating human facial affect expressions. Current findings show that virtual reality environments allow the design of virtual faces that can be controlled externally and in real time, overcoming some of the limitations associated with the use of static faces. This has clinical implications, since the advances provided by virtual reality can help to design therapies for patients with difficulties in identifying basic emotions.
The next objective of our research team is the design of a facial affect recognition therapy for patients with schizophrenia. The lessons learned through the current research can also be used to adapt and personalize the therapies to each patient. For example, different camera angles and even more dynamic VHs can be used to make emotion recognition tasks easier or harder to learn. In future work, the idea and methodology proposed in this paper could be easily extended to various science and engineering applications [57, 58, 59].
Acknowledgments
This work was partially supported by Spanish Ministerio de Ciencia, Innovación y Universidades, Agencia Estatal de Investigación (AEI) / European Regional Development Fund (FEDER, UE) under PID2019-106084RB-I00, DPI2016-80894-R and TIN2015-72931-EXP grants, and by Biomedical Research Networking Centre in Mental Health (CIBERSAM) of the Instituto de Salud Carlos III.
