Introduction
Long-standing research in the field of medical communication has shown that the way in which information is conveyed to patients affects several important patient outcomes, 1–5 including the quality of the diagnosis, patient satisfaction, reported physical and emotional health, understanding of medical information, recall of and adherence to instructions, performance in daily activities, and markers of disease. In fact, next to expert knowledge, the communicative process is considered to be one of the two fundamental components of medical care. 3 It is alarming that patients tend to misunderstand or forget 20–75% of the information they receive, 6 which may result in additional hospital time and treatments for side effects, for instance, due to patients' errors in taking medication.
Misunderstanding medical information may be due to various factors. First, when faced with unexpected bad news or a new diagnosis, patients may become overwhelmed and confused. 7 Second, the way information is framed (i.e., positively, rather than negatively) affects the decision-making process, particularly in older patients, who tend to rely on heuristics rather than on analytic information processing. 8 Finally, older patients in particular may misunderstand the information they receive because of poor health literacy. 9 Past research has indicated that checks of understanding are infrequent in medical interactions. 10,11 In addition, older patients might be reluctant to signal a communication problem because they value the competence and authority of medical professionals and prefer not to challenge it in any way. 12 In this context, correct detection of nonverbal cues provides valuable information as to the patients' degree of comprehension. Moreover, automatic detection of confusion is highly relevant for Web-based information services for the elderly 13 (e.g., in the form of online instruction videos or electronic visits). 14
Several experimental studies have demonstrated that confusion and uncertainty can be detected with the help of facial cues. 15,16 To describe the relevant cues, research in nonverbal communication makes use of various annotation systems. The most commonly used system for facial cue analysis is the Facial Action Coding System (FACS), originally developed by Ekman and Friesen. 17 The FACS offers a comprehensive, anatomically based coding of facial expressions by decomposing them into 44 Action Units (AUs). According to Ekman and Friesen, 17 AUs (e.g., lip raise, jaw drop, and blink/eye closure) are the building blocks of facial expressions, including emotional expressions. For example, the facial expression of happiness is detected by the simultaneous presence of cheek raiser (AU 6) and lip corner puller (AU 12). For confusion, studies performed in learning environments involving an online tutor indicated that it is typically associated with a lowered brow (AU 4), tightening of the eyelids (AU 7), and a notable lack of a lip corner puller (AU 12). 18,19
In addition to manual annotation, the AUs can be detected with the Computer Expression Recognition Toolbox (CERT), a toolbox that offers real-time recognition of the facial units described by the FACS and can perform with 80% accuracy on a dataset of videotaped spontaneous behavior. 20 Given that some of the cues may be barely detectable for humans, 21 the first objective of this study was to compare the results of the automatic detection with the performance of human raters. Because the issue of confusion detection in older age groups is particularly pertinent in the context of doctor–patient interactions, the raters in the perception study were all medical students. The second objective was to identify the relevant cues; we assumed that the same AUs that have been found to be indicative of confusion in learning environments (lowered brow, tightening of the eyelids, lack of a lip corner puller) might also play a role in the detection of confusion in medical interactions involving the elderly.
Materials and Methods
Recordings
In total, 24 participants (50% male) ranging in age from 70 to 90 years (mean=79.6; standard deviation [SD]=6.2) were recorded in the production study. They were recruited in a nursing home in Rotterdam (n=16) and an activity center for the elderly in Tilburg (n=8) in The Netherlands. Preliminary selection was conducted with the help of the organization staff in order to exclude participants suffering from deafness, dementia, and other cognitive impairments that could obstruct the effects of the experimental manipulation. Prior to each session, the participants were informed of the purpose of the study and gave written consent for the recorded material to be used for research and educational purposes. Approval for the study was obtained from the Institutional Review Board, Faculty of Humanities, Tilburg University, Tilburg. The recruitment and the experimental procedure followed the guidelines of the Institutional Review Board.
All participants were presented with two 2-min medical instruction videos: a complex one and a control video. A pilot of the experimental procedure conducted prior to the study demonstrated that the presentation of the complex video instructions needed to take place after the control condition. The control video served as a baseline for the participants; moreover, presenting the complex condition first in some cases led to participants' unwillingness to continue because the task was perceived as too difficult. In the first video, a female medical professional conveyed information about the symptoms of a health problem (heart attack or stroke, divided equally across the participants) using everyday medical terminology; in the second video, the same speaker presented information about a different problem (stroke/heart attack, counterbalanced) using complex medical terminology. After each video, participants were asked to indicate the comprehensibility of the medical instructions they received on a scale from 0=“not comprehensible” to 10=“comprehensible” and to name at least one symptom of the health problems discussed in the instruction material. The pilot showed that some elderly participants were reluctant to evaluate the complex video critically. Therefore, next to the oral evaluation, an additional form of manipulation check was used in each experimental session, with participants asked to evaluate the instructions in a text form using pen and paper.
The information presented in the videos included the listing of symptoms associated with the health problem. The speaker in the video was looking straight into the camera while slowly reciting the instructions. Prior to the study, the content of the instruction videos and the experimental procedure were pretested with a male and a female participant from the same age group (not part of the experimental group). The recordings of the 24 participants were made with a Samsung (Seoul, Korea) optical image stabilizer (OIS) H300 HD camera located above and behind the computer screen on which the medical instruction videos were played. The distance between the screen and the participant was approximately 70 cm. The volume of the sound recording was adjusted for each participant individually to ensure optimal listening conditions.
Data Evaluation
We first compared the evaluations of the medical instruction videos obtained from the elderly participants. A two-way analysis of variance with Complexity (high, control) and Scenario (heart attack, stroke) as independent variables and the Comprehensibility scale as the dependent variable was conducted to test the effect of the experimental manipulation. With respect to the oral evaluation of the medical instruction videos, complex videos were evaluated as less comprehensible than the control videos: F(1,20)=4.67, p=0.04, ηp²=0.19. The interaction of Scenario and Complexity was of the same size as the main effect of Complexity (F(1,20)=4.67, p=0.04, ηp²=0.19), with no main effect of Scenario (F(1,20)=0.01, p=0.93): the effect of the experimental manipulation was stronger for the scenarios involving stroke (complex: mean=8.22, SD=0.38; control: mean=7.45, SD=0.35) than for the scenarios involving heart attack (complex: mean=7.92, SD=0.32; control: mean=7.92, SD=0.35). The written evaluation using pen and paper showed that complex videos were evaluated as significantly less comprehensible by the elderly viewers than the control videos: F(1,20)=57.94, p<0.001, ηp²=0.74; complex: mean=7.14, SD=0.26; control: mean=8.10, SD=0.21.
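As a consistency check on the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom as ηp² = (F × df_effect) / (F × df_effect + df_error). The following is a minimal sketch of that identity, not part of the original analysis; the function name `partial_eta_squared` is illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# F values as reported in the text, each with 1 and 20 degrees of freedom.
oral = partial_eta_squared(4.67, 1, 20)
written = partial_eta_squared(57.94, 1, 20)
print(f"oral: {oral:.2f}, written: {written:.2f}")  # oral: 0.19, written: 0.74
```

Both values agree with the effect sizes reported for the oral and written evaluations.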
Given the measured differences between the complex and control videos, the video recordings were judged to be valid for a perception study by third-party observers, as well as an automatic facial expression analysis. For these purposes, we made use of thin slices (5-s fragments 22), obtained in the complex and control conditions immediately after a complex medical term or its matched “easy” equivalent was used by the health professional in the video. The recordings of one female participant were excluded from the collection because her face was positioned outside of the camera view.
Perception Study
The goal of the perception study was to investigate whether third-party human observers can correctly interpret elderly viewers' reactions to medical instructions involving complex terminology. The study was conducted with 40 participants (50% male) acting as third-party observers, all advanced medical students at Erasmus University in Rotterdam between the ages of 18 and 26 years (mean=23.9, SD=1.8). They participated on a voluntary basis. Approval for the study was obtained from the Institutional Review Board, Faculty of Humanities, Tilburg University.
The study consisted of two within-participant conditions (Complexity: high and control) with stimulus order randomized per observer. After a brief introduction to the experiment, observers were instructed to evaluate short video fragments without auditory input on a 7-point scale, indicating the perceived level of confusion signaled by the facial cues of the person in the video. The scale was anchored at 1=absolutely [does] not [understand] and 7=[understands] extremely well. At the end of the questionnaire, observers were asked which facial features they considered to be indicative of confusion. To prevent public access to the experimental recordings, the study was conducted with the help of the LimeSurvey questionnaire software installed on a local lab server.
The stimulus set consisted of 92 short fragments collected from 23 elderly viewers with four fragments per viewer: two in the high-complexity condition and two in the control condition. In the high-complexity condition, the stimuli involved 5-s fragments of recordings of elderly viewers collected immediately after an infrequent medical term (e.g., “myocardium” or “diplopia”) was introduced in the medical instruction videos described in the production study. In the control condition, we used fragments of the same length collected after the presentation of matched terms in the control condition, as previously described. The video recordings were displayed in mp4 format fixed to 720×480 size, with a frame rate of 30 frames/s.
Results
To evaluate the ability of human observers to detect confusion, we first examined the effect of the within-participant factor Complexity (two levels: high and control) and observer's Gender as the between factor on the perceived level of confusion. A mixed design analysis of variance showed a significant main effect of Complexity: F(1,38)=38.72, p<0.001, ηp²=0.51. The elderly viewers in fragments from the high-complexity condition were evaluated as showing a lower level of understanding of the received instructions (mean=2.83, SD=0.05) than viewers from the control condition (mean=3.01, SD=0.06). There was no main effect of observer's Gender (F(1,38)<1, p=0.63) and no interaction effect of Complexity and Gender (F(1,38)<1, p=0.50).
To calculate the human observers' sensitivity to confusion cues, we created a new binary variable based on the average value of all human judgments (mean=2.93, SD=0.07). A video stimulus was classified in the category “understanding” if its mean value was higher than the median (2.91) of the average judgments; it was classified as “not understanding” if its mean value was lower than or equal to the median. Calculated on the basis of the median values, the success of the human observer classification was 41%, with sensitivity to confusion (probability that a complex video was classified into the “not understanding” category) of 0.43.
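The median-split procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, and the six ratings are made-up values rather than data from the study.

```python
from statistics import median


def median_split_labels(mean_ratings):
    """Label a stimulus 'understanding' if its mean rating exceeds the median,
    and 'not understanding' if it is lower than or equal to the median."""
    cutoff = median(mean_ratings)
    return ["understanding" if r > cutoff else "not understanding"
            for r in mean_ratings]


def accuracy_and_sensitivity(labels, conditions):
    """Compare labels with the condition ('complex' or 'control') of each stimulus.

    Sensitivity is the share of complex stimuli labeled 'not understanding'.
    """
    correct = 0
    complex_total = 0
    complex_hits = 0
    for label, condition in zip(labels, conditions):
        expected = "not understanding" if condition == "complex" else "understanding"
        correct += label == expected
        if condition == "complex":
            complex_total += 1
            complex_hits += label == "not understanding"
    return correct / len(labels), complex_hits / complex_total


# Hypothetical mean ratings for six stimuli (NOT data from the study).
ratings = [2.5, 3.4, 2.9, 3.1, 2.7, 3.0]
conditions = ["complex", "control", "complex", "control", "control", "complex"]
labels = median_split_labels(ratings)
accuracy, sensitivity = accuracy_and_sensitivity(labels, conditions)
```

Because the split is made at the median, about half of the stimuli fall into each category regardless of how the observers used the scale, so the resulting accuracy reflects relative rather than absolute judgments.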
In order to explore possible individual differences in the display of confusion, we made use of a linear mixed-model analysis with Complexity as the fixed factor and Viewer ID, the identity of the elderly viewer in the recording, as a random factor. The dependent variable was the mean perceived level of understanding, averaged over all observers in the perception study. The relationship between complexity of instructions and perceived level of understanding showed significant variance in intercepts across elderly viewers: var(u0j)=0.32, χ²(1)=55.91, p<0.001. The results of the mixed-model analysis, summarized in Table 1, show that not all expressions of confusion were decoded as such, with some elderly viewers exhibiting facial behavior that was incorrectly interpreted by the third-party observers.
Table 1. Mixed-Models Results of the Perception Study
SE, standard error.
We next performed computational analyses of the videos 23 to determine (1) the extent to which facial expressions provide cues to confusion regarding medical instructions, (2) the degree of agreement between the computer performance and human performance, and (3) which facial cues are diagnostic for confusion detection, using the CERT. 20
For each video frame, CERT detected the face region and analyzed the face in order to determine the presence of 20 AUs. When presented with a video containing a human face, CERT outputs (among other measures) 20 real-valued AU scores. To determine the extent to which the facial expressions of the elderly participants reveal their confusion, we trained a learning algorithm, the 1-nearest neighbor classifier (e.g., Hastie et al. 24), on the binary task of determining whether a 5-s fragment corresponds to a participant who had just heard a control (class 1) or a complex instruction (class 2). The classifier was trained using CERT AU scores averaged over the 5-s period (i.e., given a frame rate of 30 frames/s, the averages were taken over 150 AU estimates). A leave-one-subject-out cross-validation procedure was used to assess the performance of the classifier. 24 This procedure provides a reliable estimate of the prediction accuracy, a measure of how well the trained classifier predicts unseen video fragments. A prediction accuracy of 50% would indicate that the trained classifier performs at chance level, whereas a prediction accuracy of 100% indicates perfect prediction.
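The classification procedure can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation and does not call CERT: the AU scores are hypothetical toy values with two AUs instead of 20, the function names are illustrative, and Euclidean distance is an assumption, since the text does not state which distance the 1-nearest neighbor classifier used.

```python
import math


def mean_au_vector(frames):
    """Average per-frame AU scores (a list of equal-length lists) into one vector."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def nearest_neighbor_label(query, train_vectors, train_labels):
    """Return the label of the training vector closest to the query.

    Euclidean distance is assumed here; the source does not name the metric.
    """
    best = min(range(len(train_vectors)),
               key=lambda i: math.dist(query, train_vectors[i]))
    return train_labels[best]


def leave_one_subject_out_accuracy(vectors, labels, subjects):
    """Hold out all fragments of one subject at a time and classify them
    using only the fragments of the remaining subjects."""
    correct = 0
    for held_out in sorted(set(subjects)):
        train_idx = [i for i, s in enumerate(subjects) if s != held_out]
        test_idx = [i for i, s in enumerate(subjects) if s == held_out]
        train_x = [vectors[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            predicted = nearest_neighbor_label(vectors[i], train_x, train_y)
            correct += predicted == labels[i]
    return correct / len(vectors)


# Hypothetical averaged AU scores for three viewers (NOT data from the study).
vectors = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.0, 0.3], [1.0, 0.7]]
labels = [1, 2, 1, 2, 1, 2]            # 1 = control, 2 = complex
subjects = ["a", "a", "b", "b", "c", "c"]
accuracy = leave_one_subject_out_accuracy(vectors, labels, subjects)
```

Holding out all fragments of a viewer at once keeps that viewer's other fragments out of the training set, so the estimate reflects performance on previously unseen people rather than on unseen fragments of familiar faces.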
Using the 20 AUs, the prediction performance obtained was higher than the performance in the perception study with human observers. The facial expressions provided cues to the difficulty of the instructions, in that the confusion of a previously unseen participant could be inferred with an estimated certainty of 64%. This compares favorably with the chance level of 50% and with the human observer performance of 41%. The sensitivity (true-positive rate) to confusion was 0.64, compared with the human observer sensitivity of 0.43. Human observers and the computer classifier were in agreement for 59% of the videos.
Subsequently, we determined the visual cues that were most informative in predicting confusion. To this end we performed feature selection by means of exhaustive grid search. Instead of taking (averages of) all 20 AUs as inputs to the classifier, we determined the prediction performances for individual AUs and for all possible pairs, triples, and quartets of AUs. (Evaluating the prediction performance obtained with combinations of 5 or more AUs would be too computationally demanding to be performed exhaustively.) The intention was to find those combinations of AUs that would give the best prediction performances because they would form the most informative visual cues for confusion. (It is important to note that the prediction performances obtained in feature selection should not be interpreted as indicating how well we are able to predict confusion. Instead, they indicate how well we would have predicted confusion if we had known the optimal AUs beforehand.) Table 2 lists the optimal combinations of AUs obtained in our feature selection experiment. For a single AU, chin-raising was the most informative visual cue, yielding a prediction performance of 63%. For 2, 3, and 4 AUs, the optimal combinations of AUs consist of lip-related visual cues (lip corner pull, lip pucker, and lip stretch), jaw drop (for 2 and 3 AUs), and cheek and eyebrow cues. Examination of the (nonaveraged) intensities for the optimal combination of 2 AUs (lip corner pull and jaw drop) revealed considerable individual differences in the separation between the scores in the two conditions.
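The exhaustive search over AU subsets can be sketched as below. This is a hedged illustration rather than the authors' code: `best_subset` and `toy_score` are hypothetical names, and the toy scoring function merely stands in for the cross-validated prediction performance that would be computed on the selected AUs.

```python
from itertools import combinations
from math import comb


def best_subset(n_features, subset_size, score):
    """Return the feature-index subset of the given size with the highest score."""
    return max(combinations(range(n_features), subset_size), key=score)


# Number of candidate subsets of 1 to 4 AUs out of 20.
n_candidates = sum(comb(20, k) for k in range(1, 5))
print(n_candidates)  # 6195


def toy_score(subset):
    """Hypothetical stand-in for classifier accuracy: rewards indices 3 and 11."""
    return len(set(subset) & {3, 11})


best_pair = best_subset(20, 2, toy_score)
print(best_pair)  # (3, 11)
```

Subsets of exactly 5 AUs alone would add `comb(20, 5)`, that is 15504, further candidates, each requiring a full cross-validation run, which illustrates how quickly the cost of the search grows with subset size.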
Table 2. Results of the Feature Selection Experiment
The left column lists the number of action units (AUs) (out of the total of 20 AUs) for which the optimal combination was sought. The center column lists the corresponding combination of AUs that yields the best prediction performance. The right column lists the prediction performances.
Discussion
Misunderstanding of medical information due to patients' low health literacy or cognitive limitations is a relatively common phenomenon in medical interactions. In our study, we investigated if confusion can be automatically detected from the facial cues and compared the performance of automatic classification with that of human observers. The analysis of facial movements was performed with the help of a computer toolbox that annotates facial movements within the FACS system. FACS has been widely used in research on spontaneous and simulated facial expressions of emotions and several mental and physical health variables, including expressions of pain. 25 Manual annotation of the AUs is laborious, but recent developments in the field of affective computing demonstrate that automatic annotation is possible.
In our study, the performance of the automatic detection analysis was better than the human performance, with the combination of 4 AUs (AU4, brow lower; AU6, cheek raise; AU12, lip corner pull; and AU20, lip stretch) achieving the highest score. These results are in agreement with cues to confusion identified in other settings. 19 They are also comparable to the findings reported by Durso et al., 16 who explored perceived expressions of confusion by means of surface facial electromyography (a technique that uses electrodes placed on the skin to detect electrical signals emitted by muscles when they contract). In their study, confusion was mainly associated with the movement of right and left corrugator supercilii (eyebrow movement) and the right depressor anguli oris (right lip corner pull), movements typically involved in negatively valenced expressions. 26
Statistical analysis of perceived levels of understanding indicated significant individual differences in the encoding of confusion, with some elderly participants being judged inversely to the original design. It is interesting that individual variation in nonverbal cues has been observed in the past for expressions of confusion, 26 with some participants primarily making use of the eyebrow regions, whereas others were especially expressive in the mouth area. The results of our study suggest, however, that facial cues associated with confusion may, at least in some cases, be mistaken for expressions of understanding by human observers.
Future research should first focus on the origin of and motivation for misperceptions due to the ambiguity of some facial expressions. For example, it could be the case that some of the recorded older individuals engaged in prosocial behavior to hide negative emotions associated with misunderstanding. Deciphering the mechanisms behind an emotion/stance display is likely to support the process of patient empowerment and improved decision making. Second, the use of Web-based health instruction videos is by no means limited to a particular age category. In the past, Web-based health programs have been tested not just with adults, but also with children 27 and adolescents. 28 Apart from dealing with different sources of comprehension difficulties, these age groups may, in fact, signal confusion differently because of different degrees of expressivity as a function of variance in emotional intensity 29 and emotion regulation 30 when contrasted with older adults. Therefore, it is important to compare their use of nonverbal cues in follow-up studies that would include age-suitable methods for collecting self-reported measures. Finally, the use of automatic confusion detection methods could potentially be beneficial during the development and testing phase of Web-based health applications, providing an alternative to the performance-based measures and think-aloud protocols currently used in user experience research.
Conclusions
Confusion caused by misunderstanding of medical terminology is signaled by facial cues that can be automatically detected with currently available facial expression detection technology. Automatic detection of facial movements provides a nonintrusive basis for building technology used in healthcare delivery applications on the Internet, including personalized health information and electronic visits.
Acknowledgments
The authors would like to thank Elise Peters and Tineke van den Hoek for their help with collecting the data for this study, as well as all the participants for their time.
Disclosure Statement
No competing financial interests exist.
