Abstract
Ensemble coding and attention are two mechanisms that our visual system uses to overcome the limitations of visual processing when confronted with overwhelming visual information. Recent evidence from ensemble coding of size suggests that attended items contribute more to the average. On the other hand, some new evidence also indicates that reduced attention jeopardizes the perceptual averaging of stimuli. What is the relationship between attention and ensemble coding? To answer this question, in the current study, we tested whether an exogenous attentional cue would influence the reported mean emotion of a crowd. We showed participants a group of four faces with different emotions. Participants’ attention was guided to the happiest or saddest face (attention conditions), or not to any specific face (baseline condition). The results supported the notion that attention alters the ensemble perception of facial expressions by elevating the weight of the attended face in the ensemble representation. This opens the question of the neural mechanisms of ensemble coding and their connection to visual attention.
Introduction
Despite the capacity limitations of our visual system (Brady & Alvarez, 2015; Cohen et al., 2016; Haberman & Whitney, 2012; Whitney & Leib, 2018), we are able to perceive the surrounding world, which is always computationally overwhelming, seamlessly and effortlessly. One possible reason is that our visual system can compress the visual input by extracting summary statistics (mean or distribution) of the heterogeneous information (Alvarez, 2011; Cohen et al., 2016; Haberman & Whitney, 2012; Whitney & Leib, 2018; Ying et al., 2019, 2020). Such ensemble perception has been observed at multiple levels of visual processing: from low-level features like size and orientation, to high-level features like facial expressions and identities (Alvarez, 2011; Alvarez & Oliva, 2009; Ariely, 2001; Corbett et al., 2012; Haberman et al., 2015; Haberman & Whitney, 2007, 2012; Leib et al., 2014; Sweeny & Whitney, 2014; Ying & Xu, 2017). Most studies of ensemble statistics have concentrated on whether the ensemble representation extracts the mean and variance of the set. However, one question remains unclear: whether the ensemble statistic is a weighted average of the visual input (each stimulus contributes differently to the ensemble representation and has a different “coefficient” in averaging) or just an arithmetic mean (every stimulus contributes equally to the ensemble representation and has the same coefficient in averaging). In other words, does each stimulus have the same weight in ensemble coding?
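The distinction between the two accounts can be made concrete with a minimal sketch. The emotion values and weights below are hypothetical and chosen purely for illustration; they are not data or parameters from this study.

```python
def weighted_mean(values, weights):
    """Weighted average: each item contributes in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical crowd of four faces, expressed as % of happiness.
faces = [95.0, 65.0, 35.0, 5.0]

# Arithmetic mean: every face has the same coefficient.
equal = weighted_mean(faces, [1, 1, 1, 1])

# Weighted mean: the happiest face is ASSUMED to count twice as much as the others.
biased = weighted_mean(faces, [2, 1, 1, 1])

print(equal, biased)
```

Under equal weights the summary sits at the midpoint of the set; raising the coefficient of one extreme face pulls the summary towards that face, which is the signature a weighted-averaging account predicts.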
On the other hand, selective attention is another mechanism that allows our visual system to deal with the capacity limitation mentioned above (Palermo & Rhodes, 2007; Vuilleumier et al., 2001; Vuilleumier & Pourtois, 2007). We are able to actively devote limited cognitive resources to a subset of the sensory input, the processing of which is facilitated by attention (Posner, 1980; Rensink, 2000; Rhodes et al., 2011; Treisman & Gelade, 1980). The impact of attention has been observed both behaviorally and neurally. Behaviorally, for instance, attention can lower the contrast threshold (Bisley & Goldberg, 2003), decrease reaction time (Posner, 1980; Zhang et al., 2014), and increase the adaptation aftereffect (Rhodes et al., 2011). Neurally, for instance, attention can enhance the firing rate of cells (Wurtz et al., 1982), as well as alter the BOLD signal (Vuilleumier et al., 2001). One effective way to modulate attention is the cueing technique (Bisley & Goldberg, 2003; Posner, 1980; Vuilleumier et al., 2001). It has been found that response time is shorter when subjects are cued to the stimuli rather than cued away from them (Posner, 1980). Also, when guided by attentional cues to emotional faces rather than to houses, subjects exhibit a significant emotion-related frontal positivity (Holmes et al., 2003). The effectiveness of cueing might (partially) stem from the high saliency of the abrupt onset (Yantis & Jonides, 1984, 1996).
Although researchers have extensively studied ensemble representation as well as attention, we do not yet fully understand the relationship between them. Although several recent studies suggested that the ensemble coding of faces could be conducted implicitly and explicitly (Haberman & Whitney, 2009; Ying & Xu, 2017), and that ensemble coding could occur with reduced attention (Alvarez & Oliva, 2009), one cannot assert the “automaticity” of ensemble coding, where “automaticity” means requiring minimal attentional resources (Palermo & Rhodes, 2007). Some studies of the ensemble representation of size show an impact of attentional modulation (Chong & Treisman, 2005; de Fockert & Marchant, 2008; Li & Yeh, 2017). For instance, de Fockert and Marchant (2008) found that attended items contributed more to the ensemble representation: the attended items had larger weights in the ensemble representation. Therefore, the ensemble coding of size might occur in brain circuits that are subject to attention, possibly through the early visual cortex (Fang et al., 2008; Moran & Desimone, 1985). Considering the similarities between the ensemble coding of lower- and higher-level objects (Haberman & Whitney, 2012; Whitney & Leib, 2018; see also Haberman et al., 2015), it is reasonable to hypothesize that the ensemble coding of faces might also be subject to attentional modulation.
However, there are reasons to be skeptical that attention could interfere with the ensemble statistics of facial expressions. Facial expressions are processed through both cortical and subcortical pathways (Haxby & Gobbini, 2011; Haxby et al., 2000, 2002). A substantial body of evidence from brain imaging studies suggests that face processing in the amygdala is hardly affected by attention modulation, while cortical face processing (e.g., STS) is gated by spatial attention (Holmes et al., 2003; Vuilleumier et al., 2001). Consequently, attention might not necessarily impact the ensemble coding of facial expressions. Therefore, it is reasonable to further validate the relationship between attention and the ensemble statistics of facial expression.
Method
Subjects
Thirty subjects (17 females, mean age 20.1), with normal or corrected-to-normal vision, participated in this study. This number of subjects was based on a power analysis from a small-sample pilot test (based on the pilot data of 11 subjects, a sample size of 22 is needed to reach a power of 95%) together with previous experiments. Considering the sample sizes of other experiments in ensemble coding as well, we finally chose 30 as the sample size. Written informed consent was provided by participants before the experiment. The study was approved by the Ethics Committee at Soochow University, China.
Apparatus
Face stimuli were presented on a 22-inch ASUS PG278Q LCD monitor (spatial resolution 2560 × 1440 pixels, refresh rate 120 Hz). The monitor was controlled by a host computer (Linux OS) running Matlab R2016a (MathWorks) via Psychtoolbox (Brainard, 1997; Pelli, 1997). Participants were asked to sit with their chins rested on a chin rest (53 cm away from the monitor). Each pixel subtended 0.025° on the screen. Each target face subtended 2.38° × 2.93° on screen.
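As a quick arithmetic check on the viewing geometry, the sketch below converts the reported visual angles into pixels and into the physical pixel size implied by the viewing distance. The three constants come from the text; the derived pixel dimensions and pixel pitch are not reported by the author and are shown only as a consequence of those constants.

```python
import math

VIEW_DIST_CM = 53.0      # viewing distance reported in the text
DEG_PER_PIXEL = 0.025    # visual angle per pixel reported in the text
FACE_DEG = (2.38, 2.93)  # face width and height in degrees, from the text


def deg_to_pixels(deg, deg_per_pixel=DEG_PER_PIXEL):
    """Convert a visual angle to pixels (linear approximation, valid for small angles)."""
    return deg / deg_per_pixel


def implied_pixel_pitch_cm(deg_per_pixel=DEG_PER_PIXEL, dist_cm=VIEW_DIST_CM):
    """Physical size of one pixel that subtends deg_per_pixel at dist_cm."""
    return 2 * dist_cm * math.tan(math.radians(deg_per_pixel) / 2)


# Derived values (not stated in the text).
face_px = tuple(round(deg_to_pixels(d)) for d in FACE_DEG)
pitch_cm = implied_pixel_pitch_cm()

print(face_px)
print(round(pitch_cm, 4))
```

The linear conversion is adequate here because the faces subtend only a few degrees; for much larger stimuli the tangent formula would need to be applied to the full extent rather than per pixel.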
Stimuli
In this experiment, we selected facial identity AM14 from the Karolinska Directed Emotional Faces (KDEF; Lundqvist et al., 1998) database as the testing stimuli. Pictures with Happy, Neutral, and Sad expressions (AM14HAS [Happy Expression], AM14NES [Neutral Expression], and AM14SAS [Sad Expression] from the KDEF database) of this identity were selected. We are aware of the fact that the faces are Caucasian, whereas our participants are Singaporean. However, based on our past experiments (Luo et al., 2015), we do not believe this affects the interpretation of our data. In addition, we have used the KDEF database in a different experiment testing the ensemble coding of facial expressions (Ying & Xu, 2017) with Asian participants. Moreover, the emotional expressions from the KDEF database have been validated across different cultures (e.g., Goeleven et al., 2008; Yan et al., 2016). To create the test stimuli, we used Webmorph software (Debruine, 2017) to morph these faces and Matlab (Mathworks, Natick, MA) to manipulate them further.
To create the testing stimuli, we built one continuum of emotional faces running from Sad through Neutral to Happy (Figure 1). The happiest face showed 100% of happiness, the neutral face showed 50% of happiness, and the saddest face showed 0% of happiness.

Figure 1. Illustration of stimuli. The images are AM14HAS, AM14SAS, and AM14NES from the KDEF database. The continuum of emotional faces runs from sad (0% of happiness), through neutral (50% of happiness), to happy (100% of happiness).
All of the faces were converted to greyscale and cropped by an oval-shaped mask so that only the central region of each face remained visible. The luminance and contrast of the faces were matched using the SHINE toolbox (Willenbockel et al., 2010). The target stimuli were four faces from the continuum. Their mean emotion was either 45%, 50%, or 55% of happiness (in each trial, in randomized order). We did so to minimize potential learning effects. The four faces were +45%, +15%, −15%, and −45% of happiness relative to the mean. Therefore, there was one happy face, one sad face, and two neutral faces. The test faces were seven faces from the continuum, with 0%, 20%, 35%, 50%, 65%, 80%, and 100% of happiness. Note that these test faces offered seven options for reporting the mean emotion. We presented all of the test faces on the screen simultaneously on a black background (adapted from Shah et al., 2015). Each face corresponded to one number.
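The composition of the target sets follows directly from the offsets and set means given above, as the short sketch below illustrates. The function and constant names are hypothetical; only the numeric values come from the text.

```python
OFFSETS = (45, 15, -15, -45)                # offsets from the set mean, from the text
SET_MEANS = (45, 50, 55)                    # possible mean emotions (% of happiness)
TEST_FACES = (0, 20, 35, 50, 65, 80, 100)   # response options (% of happiness)


def target_faces(set_mean):
    """Return the four target faces (% of happiness) for a given set mean."""
    return [set_mean + offset for offset in OFFSETS]


for set_mean in SET_MEANS:
    faces = target_faces(set_mean)
    # The offsets are symmetric, so the arithmetic mean equals the set mean.
    print(set_mean, faces, sum(faces) / len(faces))
```

Every resulting face falls within the 0% to 100% range of the continuum, and no set mean of 45% or 55% coincides exactly with one of the seven response options, so an exact match is available only for the 50% sets.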
Procedure
In this experiment, we examined whether attention to a single face affects the perceived mean of four faces with different expressions. Subjects were exposed to the four faces and cued to the happiest face, to the saddest face, or simply to the central fixation (i.e., not to any face; the baseline). If the perceived mean emotions are similar among the different cueing conditions, this would suggest that ensemble coding is merely an averaging of the crowd, regardless of attention. Alternatively, if the perceived mean emotions differ significantly among cueing conditions, it would suggest that ensemble coding is subject to the locus of spatial attention.
Before the actual experiment, subjects completed a practice session to become familiar with the procedure. Then they commenced the experiment. The trial sequence is illustrated in Figure 2. Each trial commenced with a fixation cross (1000 ms). Subjects were required to fixate on it until it disappeared. Then a visual cue appeared on the screen for 188 ms. It appeared randomly at one of three locations: the location of the happiest face (which would appear after the 94 ms interval); the location of the saddest face (which would appear after the 94 ms interval); or the location of the fixation cross (without cueing attention to any specific face). After a 94 ms interval, the four target faces appeared on the screen for 1000 ms. Finally, subjects were asked to select the one face from the list of test faces that represented the mean emotion of the target faces by pressing the corresponding button as soon as possible. During the response phase, participants were allowed to view the seven test faces freely and to make their decisions as accurately as possible with unlimited time.
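The timing of one trial can be laid out as a simple timeline. The sketch below uses only the durations stated in the procedure; the phase labels and the helper function are hypothetical, and the untimed response phase is omitted.

```python
# Durations in milliseconds, taken from the procedure described above.
PHASES = [
    ("fixation", 1000),
    ("cue", 188),
    ("blank interval", 94),
    ("target faces", 1000),
]


def phase_onsets(phases):
    """Return (name, onset_ms, offset_ms) for each phase, in order."""
    timeline, t = [], 0
    for name, duration in phases:
        timeline.append((name, t, t + duration))
        t += duration
    return timeline


timeline = phase_onsets(PHASES)
for name, onset, offset in timeline:
    print(f"{name:15s} {onset:5d} to {offset:5d} ms")

# Cue-to-target onset asynchrony: cue duration plus the blank interval.
soa_ms = timeline[3][1] - timeline[1][1]
print(soa_ms)
```

The cue-to-target asynchrony is a derived quantity, not one the text states directly; it is simply the sum of the 188 ms cue and the 94 ms interval.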

Figure 2. The trial sequence of one example trial. Each trial began with a 1000 ms fixation stage. Then a cue appeared on the screen for 188 ms. In this example, the cue guides subjects’ attention to the location of the happiest face, which appears after the 94 ms interval.
Analysis
We calculated the reported mean emotion of the face crowd and then corrected it against the actual mean emotion of the crowd, for each trial and for each condition separately. In short, a value of zero means that the internal representation of the crowd is a perfect reproduction of the actual faces. A positive value means that the perceived mean emotion of the crowd is happier than the actual mean, and a negative value means that it is sadder. We then measured the cueing effect by calculating the difference between the perceived mean emotion in each of the two cueing conditions and that in the baseline condition (the neutral cue condition).
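The two steps of this analysis can be sketched as follows. This is an illustrative reconstruction under the assumption that the correction is a simple subtraction of the actual set mean from the reported face; the trial data are hypothetical, and the author's actual analysis code is not available in the text.

```python
from statistics import mean


def bias(reported, actual):
    """Signed error in % of happiness: positive means 'happier than the actual mean'."""
    return reported - actual


# HYPOTHETICAL trials for one subject: (cue condition, actual mean, reported face).
trials = [
    ("baseline", 50, 50), ("baseline", 45, 50),
    ("happy", 50, 65), ("happy", 55, 50),
    ("sad", 50, 35), ("sad", 45, 50),
]


def mean_bias_by_condition(trials):
    """Average the per-trial bias within each cue condition."""
    by_condition = {}
    for condition, actual, reported in trials:
        by_condition.setdefault(condition, []).append(bias(reported, actual))
    return {condition: mean(values) for condition, values in by_condition.items()}


biases = mean_bias_by_condition(trials)

# Cueing effect: each attention condition relative to the baseline condition.
cueing_effect = {c: biases[c] - biases["baseline"] for c in ("happy", "sad")}
print(biases, cueing_effect)
```

Subtracting the baseline removes any response bias that is common to all conditions, so the cueing effect isolates the shift attributable to the cue.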
Results
To clarify the impact of attention on the ensemble coding of facial expressions, we compared the shift of the perceived mean emotion from baseline (cued-to-fixation condition) across all subjects (Figure 3). Paired-sample t-tests showed that cueing to the happiest face (M = 4.40%, SEM = 1.50%; t(29) = 2.93, p = .007, Cohen's d = 0.53) and cueing to the saddest face (M = −2.73%, SEM = 1.27%; t(29) = −2.14, p = .041, Cohen's d = 0.39) each significantly increased the weight of the attended face in the reported mean emotion of the crowd, compared to the baseline condition. We also found a significant difference between these two conditions (t(29) = 2.70, p = .011, Cohen's d = 0.49). Moreover, the data from the “Cueing to Fixation Cross” (baseline) condition did not differ significantly from zero (M = 1.85%, SEM = 1.20%; t(29) = 1.54, p = .14, Cohen's d = 0.28).
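The reported test statistics can be cross-checked from the means and standard errors alone, because a paired t-test is a one-sample t-test on the difference scores. The sketch below assumes that t = M / SEM and that Cohen's d was computed as |t| / sqrt(N) (the within-subject convention); the second assumption is an inference from the reported numbers, not something the text states.

```python
import math

N = 30  # number of subjects; degrees of freedom = N - 1 = 29

# (mean shift in %, SEM in %) as reported in the text.
reported = {
    "happy cue": (4.40, 1.50),
    "sad cue": (-2.73, 1.27),
    "baseline vs zero": (1.85, 1.20),
}


def t_from_mean_sem(m, sem):
    """One-sample (or paired) t statistic from a mean and its standard error."""
    return m / sem


def cohens_d_from_t(t, n):
    """Within-subject effect size, ASSUMING the d = |t| / sqrt(n) convention."""
    return abs(t) / math.sqrt(n)


stats = {}
for label, (m, sem) in reported.items():
    t = t_from_mean_sem(m, sem)
    stats[label] = (t, cohens_d_from_t(t, N))
    print(f"{label}: t({N - 1}) = {t:.2f}, d = {stats[label][1]:.2f}")
```

The recomputed values agree with the reported ones to within rounding of the published means and standard errors. The p values are not recomputed here because that would require a t-distribution routine outside the standard library.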

Figure 3. Summary of all subjects’ results. Cueing to one of the faces significantly biased the reported mean emotion of the face crowd towards the emotion of the cued face. Therefore, as hypothesized, the exogenous cues influenced the reported mean emotion of the crowd by elevating the weight of the cued face.
Discussion
In this study, evidence showed that the emotion of the face(s) within the locus of attention heavily influences the ensemble representation of the crowd. When the subjects were required to report the mean emotion of the crowd, the emotion of the attended face biased the perceived mean. As hypothesized, the exogenous cues influenced the reported mean emotion of the crowd by elevating the weight of the cued face. When subjects were cued to the happiest face, their perceived mean emotion of the crowd was happier; while they found the crowd sadder when cued to the saddest face. The findings here were consistent with previous research which showed that attention modulates the ensemble statistics of size (Chong & Treisman, 2003, 2005; de Fockert & Marchant, 2008; Li & Yeh, 2017), indicating a potential overlap between the mechanisms of the ensemble coding of low- and high-level objects.
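To give a rough sense of how large a change in weight the observed shifts would correspond to, the back-of-envelope sketch below inverts a deliberately simple weighting model. This is not an analysis reported by the author. It assumes that the baseline corresponds to equal weights of 0.25, that the three uncued faces share the remaining weight equally, and that responses scale linearly with % of happiness; all function names are hypothetical.

```python
OFFSETS = (45, 15, -15, -45)  # face offsets from the set mean, from the Stimuli section


def predicted_shift(cued_index, w):
    """Shift of the weighted mean (in %) when the cued face has weight w.

    ASSUMPTION: the uncued faces split the remaining weight (1 - w) equally.
    """
    others = (1 - w) / (len(OFFSETS) - 1)
    return sum((w if i == cued_index else others) * offset
               for i, offset in enumerate(OFFSETS))


def implied_weight(shift, cued_index):
    """Invert the model: the shift is linear in w, so solve shift = a * w + b."""
    b = predicted_shift(cued_index, 0.0)
    a = predicted_shift(cued_index, 1.0) - b
    return (shift - b) / a


w_happy = implied_weight(4.40, cued_index=0)   # reported shift, happiest face cued
w_sad = implied_weight(-2.73, cued_index=3)    # reported shift, saddest face cued
print(round(w_happy, 3), round(w_sad, 3))
```

Under these assumptions, the cued face would carry a weight of roughly 0.32 (happy cue) or 0.30 (sad cue) instead of 0.25. The figures are illustrative only, since other weighting schemes could produce the same shifts.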
In this study, the results suggested that the impact of attention on the ensemble coding of facial expressions resembles its impact on the ensemble coding of size (de Fockert & Marchant, 2008). Although it has been shown that ensemble statistics abilities for high- and low-level features do not predict each other (Haberman et al., 2015), the current findings together with a growing body of evidence suggest that the ensemble statistics of high- and low-level objects are both subject to attentional modulation. Considering that face perception is hierarchical (Xu et al., 2008), future research may further examine the possible neural mechanisms shared between high- and low-level ensemble statistics.
During the presentation of the face group, subjects were cued to one of the faces with the most extreme emotion in the crowd. This method is similar to that of a previous report (de Fockert & Marchant, 2008) in which the authors instructed subjects to attend to either the smallest or the largest circle in the set. Both their study and the current study found that attention modulates the mean representation by altering the weight of the attended item in averaging. However, the researchers in that study modulated attention differently in their two experiments: they specified the size of the circle to attend to in their first experiment and highlighted one circle by changing its luminance in the second experiment. This ensured that they measured ensemble statistics explicitly, as the current study also did. The similarities in the experimental design and the observed findings suggest that attentional modulation is ubiquitous and fundamental across various levels of ensemble statistics.
During the response phase, participants were presented with seven test faces and were asked to select the one that best matched the mean emotion of the previously shown target faces. This paradigm has been widely used in face studies (e.g., 3AFC in Burns & Bukach, 2021; 6AFC in Shah et al., 2015) and is believed to be more accurate (Mickes et al., 2012). On the other hand, it is noticeable that the seven test faces were presented at close spatial locations. However, we do not think this would evoke a significant visual crowding effect. In classic face crowding studies, the facial stimuli were presented at extrafoveal locations (e.g., Fischer & Whitney, 2011; Whitney & Levi, 2011). In the current study, by contrast, the faces were presented horizontally along the center of the screen. Moreover, during the response phase, participants were allowed to view all seven faces freely, which minimizes the possible influence of crowding. Therefore, the response phase is unlikely to have been compromised by visual crowding.
In this study, the emotion variance among the faces was large. Previous researchers studied ensemble statistics with relatively small variances among objects. For instance, Haberman and Whitney (2009) tested the temporal ensemble statistics of facial expressions with four faces. The emotion difference between faces in their first experiment equals 6% of the emotional units in the current study. By comparison, the emotion variance here is much larger. However, it is reasonable to believe that such a large variance does not necessarily obscure ensemble coding. Elias and colleagues (2017) found that the visual system is capable of averaging the emotion between fully happy (or angry) and neutral faces (50%). In addition, another study showed that a stream consisting of half happy and half sad faces could be averaged together (Ying & Xu, 2017). That study also suggested that the variance of emotion does not alter the perceived mean emotion.
To summarize, the evidence here showed that the ensemble representation of facial expressions is sensitive to the emotion of the cued face(s). Attentional modulation occurs in the ensemble statistics of facial expressions through an alteration of the weight of the cued faces in the perceptual averaging. The findings indicate that weighted averaging is an important characteristic of the ensemble statistics of faces.
Footnotes
Acknowledgements
H. Ying is supported by the Natural Science Foundation of Jiangsu Province (BK20200867), and the Entrepreneurship and Innovation Plan of Jiangsu Province. The design of this study was presented as a poster at the Annual Meeting of the Vision Sciences Society (VSS, May 2018, St. Pete Beach, Florida). The pilot of this study forms part of H. Ying's PhD thesis at Nanyang Technological University. The author thanks Dr. Hong Xu, Dr. Edwin Burns, Dr. Paul Boyce, and Nadine Garland for proofreading and helpful comments.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Jiangsu Province (BK20200867), and the Entrepreneurship and Innovation Plan of Jiangsu Province.
