Abstract
When people learn perceptual categories, if one feature makes it easy to determine the category membership, learning about other features can be reduced. In three experiments, we asked whether this cue competition effect could be fully eradicated with simple instructions. For this purpose, in a pilot experiment, we adapted a classical overshadowing paradigm into a human category learning task. Unlike previous reports, we demonstrate a robust cue competition effect with human learners. In Experiments 1 and 2, we created a new warning condition that aimed at eradicating the cue competition effect through top-down instructions. With a medium-size overshadowing effect, Experiment 1 shows a weak mitigation of the overshadowing effect. We replaced the stimuli in Experiment 2 to obtain a larger overshadowing effect and showed a larger warning effect. Nevertheless, the overshadowing effect could not be fully eradicated. These experiments suggest that cue competition effects can be a stubborn roadblock in human category learning. Theoretical and practical implications are discussed.
With adequate training, people can often master fairly sophisticated perceptual classification tasks, such as telling the species of a bird or diagnosing cancer from x-ray images. With perceptual categorisation tasks arising in the real world, multiple features of an object are often each partly predictive of category membership. As an example, a grizzly bear in the wild can be distinguished from black bears based on its overall size, colour, size of the shoulders, profile of the face, and length of the claws. However, these attributes are not always available to the person who seeks to categorise objects in a naturalistic environment. For example, at dusk, colour of the animal may not be well perceived, and a perceiver would have to utilise other attributes she knows about the animal.
This raises a rather basic question: Does having a highly salient predictor of category membership present during training hamper a learner’s ability to learn other, less-predictive attributes? Putting this question into our grizzly bear example, if the colour of a bear is the most salient feature in distinguishing grizzly bears from black bears, do learners acquire less information about the less predictive features such as overall size, size of the shoulders, profile of the face, or length of the claws, if they learn them in a situation in which colour differences are available? This phenomenon is generally related to the cue competition effects that were first reported by Pavlov (1927). Since then, cue competition effects have been extensively studied. This broad family of effects includes the blocking effect (e.g., Kamin, 1967, 1969) and overshadowing effects due to frequency or salience (Wagner et al., 1968).
Numerous studies have reported cue competition effects, such as blocking and overshadowing, among human learners in causal judgement tasks (e.g., Mitchell & Lovibond, 2002). However, the two human category learning studies available to date (Bott et al., 2007; Murphy & Dunsmoor, 2017) appear to suggest that human learners are largely immune to cue competition effects (but see Soto & Wasserman, 2010).
In the rest of the Introduction, we provide readers with a brief overview of cue competition effects. We then introduce our category learning task. Specifically, we take the opportunity to highlight the differences between our category learning task and the causal judgement paradigm, as the latter is traditionally thought to be the closest human learning task analogous to classical animal learning paradigms.
Having introduced our category learning task, we will describe our top-down intervention to see whether cue competition effects can be mitigated. Top-down instructions have been applied to mitigate cue competition effects with varying degrees of success in previous studies. They have the benefit of being easy to apply. In contrast, these interventions are hard to apply to nonhuman animals without language capacity. The effectiveness of these interventions may shed light on potential differences between human and animal learning.
Cue competition effects
Within the framework of Pavlovian learning (Pavlov, 1927), multiple conditioned stimuli (CS) can be predictive of an unconditioned stimulus (US). As a concrete example, a whistle (CS1) and a tone from a tuning fork (CS2) are presented before some food (US) that induces a dog’s salivation (unconditioned response). If the whistle and the tuning fork tone have similar psychological intensity, then after pairing, the whistle and the tone, each presented in isolation, elicit the same amount of salivation. In an Overshadowing condition, the whistle is made more salient than the tuning fork tone. After a considerable amount of training with the paired stimuli, when the dog is presented with the whistle alone, salivation is elicited. Presenting the weak tone alone, on the other hand, induces little salivation. This is taken as evidence that the CSs, or cues, compete with each other to form associations with the US.
Other types of competition effects work similarly. In the well-known blocking paradigm (Kamin, 1967, 1969), instead of having stimuli with different levels of salience, all stimuli have similar psychological intensity. One stimulus may be paired with the US prior to the rest of the stimulus set. Once the association has been established, the stimulus prevents new ones from pairing with the US.
Decades of research have shown that cue competition effects are quite robust in Pavlovian conditioning, human causal learning paradigms (e.g., Gluck & Bower, 1988; Price & Yates, 1993; Shanks, 1991; Vogel et al., 2015), and human categorisation tasks (e.g., Soto & Wasserman, 2010; but see Bott et al., 2007, and Murphy & Dunsmoor, 2017, for conclusions about weak cue competition effects). Attributes that are less salient, occur less frequently, or are paired later with the US are less likely to be learned. Learning only a subset of predictive attributes in a categorisation task is undesirable, since not all attributes helpful in distinguishing the categories would be available in a real-world scenario. On the other hand, the ability to select relevant features that are highly predictive of the outcome is crucial for survival in many naturalistic settings. It is perhaps an important skill for a learner to ignore the less predictive features for effective learning. The dynamics between the feature selection process and the top-down control over the process are the focus of the current study.
Category learning versus causal judgement
To study whether human learners are susceptible to cue competition effects, researchers have largely focused on causal judgement tasks designed to be analogous to animal learning tasks. One example of these paradigms is commonly known as the “food allergy task” (see Shanks, 2010, for a review). Human learners are shown one or two food items in a trial. They have to guess whether a fictional customer in the story will get sick after consuming the food. In a blocking paradigm, Food item A is followed by an allergy, which can be denoted as an A+ trial. A combination food item, AX, is then presented in a subsequent trial, possibly after some delay. The food item AX is also followed by an allergy (hence can be denoted as AX+). The human learner is later given test trials, in which only a single food item is presented in each trial. Her task is to indicate on a rating scale how likely the food item is to lead to an allergy. Human learners in this condition tend to judge X as less likely to be the cause of allergy, compared with another group to whom A+ trials are never shown. In this scenario, Food items A and X are known as cues, and they map onto the two CSs within a Pavlovian framework. The allergy outcome, on the other hand, maps closely onto the US.
We set out to study a far less well-explored type of human learning in the cue competition literature, within the domain of category learning. In the category learning literature, human learners have usually been shown one artificial stimulus at a time. The stimuli contain multiple attributes, or feature dimensions, that would allow learners to classify them into one of the (usually) two categories. To be successful at the task, learners have to associate individual features of the stimuli with the category membership.
The two paradigms described above have many parallels. In both cases, learners learn about the features of objects and put them into distinct categories. Learners also usually learn about the category membership through feedback over trials: They respond with their decision and then receive feedback as to whether the response was correct or incorrect. It is assumed that they update their decision rules upon receiving the feedback.
Differences between the two paradigms are worth noting. In a causal judgement task, a stimulus usually denotes a standalone object, such as a food item. Each stimulus is treated as an independent entity that causes the outcome. The attributes of a stimulus, such as colour, taste, or smell, are irrelevant to the outcome. Stimuli that lead to the same outcome do not necessarily possess semantic or visual similarities.
On the other hand, stimuli within the same category in a category learning task usually share a number of attributes. They usually possess a certain degree of resemblance, semantically, visually, or both. Stimuli tend to have multiple attributes, or feature dimensions, that are relevant to the category membership. These attributes can be discrete, marking the presence or absence of a feature, or can vary along a continuum. To manage learners’ working memory load, researchers usually opt for either a large number of binary attributes (around 4–8) or a small number of attributes (1–3) with continuous feature values.
We decided to study whether human learners show cue competition effects in category learning tasks. This was motivated partly by Bott et al.’s (2007) report of a weak to nonexistent cue competition effect and partly by the lack of research in the area. During the preparation of the current study, Murphy and Dunsmoor (2017) reported a series of human category learning experiments. In their study, they utilised two aversive salient features (loud sound and electrical shock) of a category in an attempt to establish an overshadowing effect. Comparing two independent groups of participants, learning with or without the salient features, they showed that people learned other defining features of the categories equally well. In agreement with the previous report, these experiments suggest nonexistent cue competition effects for human category learning.
These reports suggest that the mechanisms of category learning and causal judgement tasks may be very different. With human participants, no reliable cue competition effects have been shown using the former procedure, whereas evidence for cue competition effects has been abundant for the latter. It is also possible that the category learning tasks used by the researchers were not sensitive enough to detect any cue competition effects. The first step of our study was to establish a category learning paradigm that is sensitive to the effects, to show that a salient feature can indeed affect the learning of other features.
Top-down control over cue competition effects
Due to the important implications of cue competition effects for human learning, there have been efforts to eradicate the effects after their occurrence. A bottom-up, stimulus-driven approach usually involves dissociating one of the elements in a compound cue from the outcome. Thus, a participant would first encounter AX+, and then A−. Logically, she would then attribute the outcome to X, instead of A. This procedure is known as backward unovershadowing, which is also called “release from overshadowing” (e.g., Simms et al., 2012).
Other ways of mitigating these cue competition effects also exist. For example, researchers have found that when human learners are told that the effects of the cues are additive, cue competition effects tend to be smaller (Lovibond et al., 2003). There is initial evidence that even though the bottom-up, stimulus-driven unovershadowing procedure and top-down instructions about stimulus additivity both lead to a decrease in cue competition effects, their mechanisms are quite different. The bottom-up procedure requires cue competition effects to be established before the salient stimulus’ effect is lowered, whereas the top-down instruction can mitigate the cue competition effect before it is established. There are suggestions that additivity pretraining, but not unovershadowing, requires higher order reasoning capacity. For example, Simms et al. (2012) reported that both younger (4- to 5-year-olds) and older children (6- to 7-year-olds) respond to unovershadowing procedures, but only the latter group is affected by top-down additivity pretraining.
Along the same line, it would be curious to know whether an overshadowing effect can be controlled through direct, top-down instructions to ignore the salient feature dimension.
If this is successful, the weight associated with the salient attribute is assumed to be lower, despite its high association with the outcome. These weights are often termed attention and memory strength in computational models of category learning (e.g., Nosofsky, 1984), even though they are actually decision weights in nature (see Hoffman & Murphy, 2006).
Modern-day computational models generally have no problems accounting for cue competition effects. In the animal learning literature, there has been substantial effort attempting to understand the mechanisms of cue competition effects through computational modelling since Rescorla and Wagner (1972, also see Mackintosh, 1976). In the human learning literature, various computational models have been successful in explaining changes in learning outcomes due to base-rates (Gluck & Bower, 1988), cue validity or salience (Kruschke, 1992; Nosofsky et al., 1994), or serial positions of stimulus presentation.
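As a minimal sketch of how an error-correction model of this kind produces overshadowing, the following simulation applies the Rescorla and Wagner (1972) learning rule to two cues trained in compound that differ only in salience. The salience values, learning rate, asymptote, and number of trials are illustrative assumptions and are not taken from any of the studies cited here.

```python
def rescorla_wagner(saliences, n_trials=200, beta=0.1, lam=1.0):
    """Train all cues together in compound and return their final strengths.

    saliences: per-cue salience parameters (alpha); illustrative values only.
    beta: learning rate associated with the outcome (assumed value).
    lam: asymptote of learning supported by the outcome (assumed value).
    """
    strengths = [0.0] * len(saliences)
    for _ in range(n_trials):
        # All cues present on a trial share a single prediction error.
        error = lam - sum(strengths)
        strengths = [v + a * beta * error for v, a in zip(strengths, saliences)]
    return strengths


# Compound training: a salient cue and a weak cue always appear together.
v_strong, v_weak = rescorla_wagner([0.8, 0.2])

# Comparison: the weak cue trained on its own for the same number of trials.
v_weak_alone = rescorla_wagner([0.2])[0]
```

Under these assumed parameters, the weak cue ends compound training with far less associative strength than it gains when trained alone, because the salient cue absorbs most of the shared prediction error.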
Despite the effort spent in developing computational models of cue competition effects, most of them cannot account for effects generated by top-down additivity pretraining or instructions. Take ALCOVE as an example (Kruschke, 1992): it has parameters to explain why some features are attended to, or memorised better, in a bottom-up fashion. The attention parameter, in particular, explains how features with similar psychological strength can sometimes be attended to differently. For example, if the training is preceded by other forms of learning, attention to some new features may be blocked. In its current instantiation, however, the model does not implement any parameters for top-down control. This would suggest that, in the case of top-down involvement, the effects would be absorbed into the weights for attention and memory strength. An alternative is to assume that top-down control has no effect on learning outcomes. Our experiments directly test these propositions.
The current research
In all the experiments reported in the current paper, we asked human participants to learn to classify visually presented stimuli into two different categories. To maintain high external validity, we designed a category learning task with a relatively small number of feature dimensions defining the categories. Moreover, feature values varied along a continuum, and they were associated with category membership probabilistically. This is different from the “one away” design popular in category learning studies, in which a larger number of feature dimensions is used, and the feature values are usually binary. The design allows us to detect a training effect with high sensitivity and also permits some statistical analyses that would otherwise not be possible. Each experiment involved two phases: a learning phase and a transfer test phase.
Our pilot experiment aimed at establishing a clear cue competition effect. In the experiment, human learners went through a supervised learning phase to learn to distinguish two artificial categories of objects. Participants were assigned to one of the two between-subjects conditions. In the Control condition, multiple features (values on specific continua) were independently and probabilistically predictive of category membership. In the Overshadowing condition, variation on an extra feature was included, which predicted category membership with 100% accuracy. Thus, if a learner utilised the predictive significance of this feature, she could potentially achieve 100% performance in the learning phase of the study without giving any weight whatsoever to the other features. A final test was included in which the deterministic feature was no longer present. In this pilot experiment, classification accuracy was markedly reduced in the final test for the Overshadowing condition, compared with the Control condition. This shows that people’s learning of the less predictive features is not effective in the presence of a deterministic one, indicating an overshadowing effect.
Having established a paradigm that shows a strong overshadowing effect, Experiment 1 was performed with two goals in mind. The first was to replicate the findings in the pilot experiment. The second was to test whether the overshadowing effect could be mitigated by top-down control. To this end, an additional Warning condition was included. The Warning condition was identical to the Overshadowing condition in every aspect, except that participants were informed in advance of the training phase that the perfectly predictive feature, while present in training, would not be available during the transfer test.
In the transfer test phase, participants in all conditions were presented with the same types of test stimuli. Their task was to classify new examples that had never been shown in training but that followed the same generative process used in the training phase of the Control condition. Only the probabilistic features were present in this phase; the deterministic predictor feature was absent. If the overshadowing effect could be mitigated through top-down control, performance of the Warning group should resemble that of the Control group, and be better than that of the Overshadowing group. If top-down control was not effective, performance of the Warning and Overshadowing groups should be similar, and both groups should perform worse than the Control group. As in the pilot experiment, participants in the Overshadowing condition showed a strong cue competition effect. The effect appeared to be mitigated by top-down instructions, although the mitigation was not statistically significant owing to large variation among participants.
To obtain a clearer instruction effect, we attempted to amplify the overshadowing effect to provide more room for the instruction to exert its top-down effects. Experiment 2 had the same design as Experiment 1. We replaced the stimuli in Experiment 1 with a set of visually more naturalistic ones (Blair et al., 2009). At the same time, we reduced the number of category-defining feature dimensions so that the task was easier to master. We expected to see the same general pattern of results as in Experiment 1, with a greater statistical effect. This would provide evidence that the effect we found is not specific to a particular set of stimuli.
If the warning manipulations in Experiments 1 and 2 were effective, it would suggest that cue competition effects could be mitigated by top-down control. Such a procedure might have the potential to enhance people’s flexibility in learning. It would suggest that the process of category learning could be made less dependent on a small subset of attributes and idiosyncratic environmental factors. We defer discussion of possible implications to the “General discussion” section.
Pilot experiment
The goal of the pilot experiment was to establish a paradigm that shows a robust overshadowing effect in perceptual category learning. We would also make use of the effect size obtained in the experiment to estimate the number of participants needed for the two experiments that followed. In the subsequent experiments, we would try to eradicate this overshadowing effect by imposing some top-down control.
Participants were given two categories of cartoon-like artificial stimuli (referred to as “demons”), and their task was to classify the stimuli into two categories. Participants were randomly assigned to the Control or Overshadowing condition in the training phase. They were given feedback upon responding during the training phase, so that they could adjust their classification rules accordingly. In the transfer test, participants classified a new set of stimuli without being given feedback.
Two categories of “demons,” labelled Old World and New World, were generated. Figure 1 shows examples of the demons. For the Control condition, the demons varied in three features: eye colour, eye width, and horn height. The distribution of each feature within a category was Gaussian and the values were probabilistically predictive of the category membership. The separation between the means for the two categories was 1 SD for each of the three features. These features varied independently. In general, Old World demons tended to have shorter horns, larger eyes, and their eyes tended to be blue or purple. New World demons tended to have taller horns and smaller eyes that tended to be green or blue.

Figure 1. Exemplars used in the pilot experiment. The demon-like stimuli differed in eye colour, eye size, and horn size in the Control condition. In the training phase of the Overshadowing condition, the horn colour was indicative of the category identities. The feature was not available during the transfer test.
For the Overshadowing condition, variations in the three features described above (for the Control condition) were present. In addition, there were variations on one extra feature. Specifically, the horns for Old World demons were always in brown colour, and the horns of the New World demons were always in pink colour.
The transfer test phase consisted of stimuli that were similar to those used in the Control condition, but the exact stimuli presented in the transfer test had not appeared in the training phase. Each demon had three probabilistic features that predicted its category membership. During transfer, the overshadowing feature did not vary: all demons had horns of the same colour. Category membership was defined only by the three probabilistic predictors. No feedback was given in the transfer test.
Methods
Design
The experiment contained two phases: a training phase and a transfer test. Two training conditions, Control and Overshadowing, were manipulated between subjects. All participants took a transfer test. The transfer test was identical for all participants regardless of training condition.
The proportion of correct classifications in both training and transfer test was recorded to assess participants’ learning.
Participants
Seventy-six participants recruited from our laboratory’s online research participant pool completed the experiment. The pool includes adults of various ages living in a variety of countries and has been prescreened for excellent comprehension of English and careful attention to instructions. Participants were randomly assigned to one of two training conditions: Control (n = 34) or Overshadowing (n = 42).
Materials
The experimental environment was controlled by code written in Adobe Flash and delivered through participants’ web browser. Participants completed the experiment online using their own devices, so the absolute sizes of the stimuli on the screen could vary.
A total of 700 unique image files were generated, each depicting a 400 × 400-pixel demon. Three hundred images were used for training in the Control condition, and another 300 images were used for training in the Overshadowing condition. The remaining 100 images were used in the transfer test for both conditions. Depending on whether the demon came from the Old World or the New World, the values of the demon’s eye colour, eye size, and horn height were randomly selected (with replacement) from their respective Gaussian distributions. Means of the two distributions for each feature were 1 SD apart. To determine how that would translate to the predictive power of a probabilistic feature, we ran a simple simulation. If an ideal observer relied on a single probabilistic feature for categorisation and had acquired the distributions perfectly, she would achieve an accuracy of around 69% in the transfer test.
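The figure of roughly 69% can be checked with a short calculation. The sketch below is not the authors’ simulation; it assumes equal-variance Gaussian distributions, equal category base rates, and a decision criterion at the midpoint between the two means, in which case the expected accuracy is the standard normal cumulative probability at half the separation. The function names and the Monte Carlo settings are hypothetical.

```python
import random
from statistics import NormalDist


def ideal_accuracy(separation_sd, n_features=1):
    """Expected accuracy of an ideal observer who optimally combines
    n independent, equally informative Gaussian features
    (assumes equal variances and equal category base rates)."""
    d_prime = separation_sd * n_features ** 0.5
    return NormalDist().cdf(d_prime / 2)


def simulate_single_feature(n_trials=100_000, separation_sd=1.0, seed=0):
    """Monte Carlo check using a criterion at the midpoint of the means."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        category = rng.randrange(2)  # 0 or 1 with equal probability
        value = rng.gauss(category * separation_sd, 1.0)
        guess = 1 if value > separation_sd / 2 else 0
        correct += guess == category
    return correct / n_trials


single = ideal_accuracy(1.0)       # one feature, means 1 SD apart
combined = ideal_accuracy(1.0, 3)  # three features combined optimally
simulated = simulate_single_feature()
```

Under these assumptions, `single` is approximately 0.69, in line with the value stated above, and `combined` gives the ceiling for an observer who uses all three probabilistic features, roughly 0.81.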
For training images used in the Control condition and the transfer test images, the value of the horn colour was held constant. Training images used in the Overshadowing condition were identical to those in the Control condition, except that colour of the horn was determined based on the demon’s category membership.
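The generative process described here can be sketched as follows. Because the original generation parameters are not reported, everything numeric in this example is a hypothetical stand-in: features are expressed in standardised units, category means are placed at ±0.5 so that they lie 1 SD apart, categories are sampled with equal probability, and the constant horn colour is labelled with the placeholder `"neutral"`.

```python
import random

# Hypothetical coding: sign of the Old World mean on each standardised axis
# (Old World demons have larger eyes and shorter horns; the direction of the
# eye colour axis is arbitrary).
OLD_WORLD_SIGN = {"eye_colour": +1, "eye_size": +1, "horn_height": -1}
HORN_COLOUR = {"old": "brown", "new": "pink"}
CONSTANT_HORN_COLOUR = "neutral"  # placeholder; the actual colour is not stated


def make_demon(category, rng):
    """Sample the three probabilistic features for one demon."""
    sign = 1 if category == "old" else -1
    demon = {
        name: rng.gauss(sign * direction * 0.5, 1.0)  # category means 1 SD apart
        for name, direction in OLD_WORLD_SIGN.items()
    }
    demon["horn_colour"] = CONSTANT_HORN_COLOUR
    demon["category"] = category
    return demon


def make_set(n, seed):
    """Generate n demons, choosing each category with equal probability."""
    rng = random.Random(seed)
    return [make_demon(rng.choice(["old", "new"]), rng) for _ in range(n)]


def add_overshadowing_feature(demons):
    """Copy a stimulus set, making horn colour a deterministic category cue."""
    return [dict(d, horn_colour=HORN_COLOUR[d["category"]]) for d in demons]


control_training = make_set(300, seed=1)
overshadowing_training = add_overshadowing_feature(control_training)
transfer_test = make_set(100, seed=2)
```

Deriving the Overshadowing set from the Control set mirrors the statement that the two training sets were identical apart from horn colour.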
A demon might or might not have a nose, but this feature did not predict whether it was from the Old World or New World. Due to an error in data storage, the parameters for generating the stimuli were lost. We replicated the results with a generative stimulus-creation procedure in Experiment 1, and the results of the two experiments are comparable. Interested readers can refer to the Experiment 1 section below for details.
Procedure
The experiment started with a written introduction explaining the task to the participants. Participants were told to distinguish Old World from New World demons. They were told that Old World demons share some common features, as did the New World demons. They were encouraged to learn the category properties through feedback in the training phase and advised that there would be no feedback in the transfer test. Participants were also instructed to spend no more than 3 s with each demon. A short multiple-choice quiz then followed the introduction to ensure that participants understood the task.
The training phase was divided into six blocks, with 50 trials in each block. Depending on the participant’s condition, a training stimulus was drawn randomly from the respective training stimulus set without replacement. During each trial, a demon was shown at the centre of the screen on a black background. The participant pressed one of the two keys on the keyboard to indicate the demon’s category membership. The demon disappeared upon a response, or when 3 s had elapsed. A new trial began after a 1-s auditory feedback period and a 2-s intertrial interval (ITI). Participants were encouraged to take short breaks in between the blocks.
Immediately after the training phase, participants received instructions for the transfer test. They were told that feedback would not be given. They were also encouraged to use the same strategies they employed during training to classify the demons.
Participants in both conditions had the same set of test stimuli in the transfer test. The order of stimulus presentation in this phase was randomised for each participant. Each test trial began with the presentation of a test stimulus. Participants could take as long as they needed to decide which category the demon belonged to. When the participant responded, no feedback was given and a new trial began after a 2-s ITI. Each participant completed 100 transfer test trials, broken down into two blocks.
Results
Training phase
The average accuracy for both Control and Overshadowing conditions increased steadily from Block 1 to Block 3, reaching asymptotes for their respective conditions (Figure 2, left panel). Overall accuracy for the Overshadowing condition was higher than that for the Control condition throughout the training phase. This suggests that participants in the Overshadowing condition made use of the horn colour, the deterministic predictor, in the classification process. Comparing the last two blocks of trials between conditions, participants in the Overshadowing condition (M = 90%, SD = 19%) significantly outperformed those in the Control condition (M = 67%, SD = 14%), t(74) = 6.12, p < .001. As a measure of effect size for the difference between the two groups, we computed Cohen’s d, which proved to be very large, d = 1.41, 95% confidence interval (CI) = [0.90, 1.93].
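For readers who wish to see how an effect size of this kind is derived, the sketch below computes Cohen’s d for two independent groups from summary statistics using the pooled standard deviation. Whether the authors used exactly this formula is an assumption. The inputs are the rounded means and SDs reported above, so the result (about 1.36) only approximates the reported d = 1.41, presumably because of rounding in the published values.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


def t_from_d(d, n1, n2):
    """Independent-samples t statistic implied by d and the group sizes."""
    return d / sqrt(1 / n1 + 1 / n2)


# Rounded summary statistics for the last two training blocks:
# Overshadowing (n = 42) versus Control (n = 34).
d = cohens_d(90, 19, 42, 67, 14, 34)
t = t_from_d(d, 42, 34)
```

The same relation between d and t means that the reported t(74) = 6.12 and d = 1.41 are consistent with each other for these group sizes.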

Figure 2. Performance in the Control and Overshadowing conditions over blocks in the pilot experiment. Participants in the Overshadowing condition outperformed those in the Control condition during training, but the pattern reversed in the transfer test. Error bars denote between-subjects standard errors of the means.
Transfer test
The right panel of Figure 2 shows the overall performance for each condition during the transfer test blocks. Data from the last two training blocks and the transfer test were entered into a 2 (Training Condition) × 2 (Last Two Training Blocks, Test Phase) analysis of variance (ANOVA). A significant interaction between the two factors suggests that knowledge acquired in training in the two conditions was differentially transferred to the transfer test, F(1, 146) = 56.45, p < .001. Specifically, participants in the Overshadowing condition showed a substantial drop in performance from training (M = 90%, SD = 19%) to the transfer test (M = 57%, SD = 14%), t(41) = 9.14, p < .001. The drop, measured by effect size, was large, Cohen’s d = 1.76, 95% CI = [1.25, 2.26]. Their performance at test, with an average 57% accuracy, was significantly different from chance, t(41) = 26.4, p < .001. This indicates that some, albeit limited, learning of the probabilistic features took place.
On the other hand, participants in the Control condition showed no reliable differences in performance between training (M = 67%, SD = 13%) and transfer test (M = 68%, SD = 10%), t(33) = 0.70, p = .49. The change was negligible as measured by effect size, Cohen’s d = .08, 95% CI = [−0.40, 0.57], indicating a successful transfer of knowledge learned in training to the transfer test. The overshadowing effect, indicated by the difference between the two conditions in the transfer test performance, was reliable, t(74) = 4.01, p < .001, Cohen’s d = .93, 95% CI = [0.44, 1.41].
Discussion
In the pilot experiment, participants in the two conditions received slightly different training. In the Control condition, participants learning to classify exemplars of the two categories could potentially rely on up to three different probabilistic predictor variables. In the Overshadowing condition, in addition to the probabilistic features that were available to the Control participants, participants also had a fourth feature available that had a completely deterministic association with category membership. Although this deterministic feature enhanced performance in training, it impeded the learning of the other, partially predictive, features. Hence, in the transfer test, when the deterministic feature was no longer available, participants in the Overshadowing condition performed so poorly that their performance was only slightly above the 50% chance level. As we will discuss in the “General discussion” section, similarly large overshadowing effects have been observed in some animal studies of overshadowing (e.g., Pavlov, 1927), albeit using quite different task designs. Unlike other animals, however, human participants can be instructed to try to learn despite the presence of the overshadowing feature. Experiment 1 examines this kind of top-down instruction effect.
Experiment 1
In Experiment 1, we first attempted to replicate the overshadowing effect observed in the pilot experiment with stimuli that were generated at runtime. This procedure ensured that the stimulus set adhered to the same generative process for the categories in both training and the transfer test. Any idiosyncratic stimulus on a particular trial was unlikely to repeat. With this procedure, every participant received a different set of stimuli, all of which belonged to the same categories. The results obtained are therefore more likely to generalise to related contexts. This procedure differs from that of our pilot experiment, in which a set of images was pregenerated. It also differs from the “one away” design popular in the field, in which a small set of stimuli (generally fewer than 10 exemplars in each category) is used.
Having established a robust overshadowing effect in the pilot experiment, we aimed to test whether the effect could be mitigated by top-down attention control. In addition to the Control and Overshadowing conditions in the pilot experiment, a Warning condition was utilised. The Warning condition was identical to the Overshadowing condition, except that participants were informed of the overshadowing feature before the training. They were instructed to ignore the overshadowing feature, the horn colours of the demons in this experiment. Participants were told to try to learn other features that would help them classify the demons into the two categories.
If top-down control could be deployed effectively, performance in the Warning condition could potentially resemble that of the Control condition in both training and the transfer test (or conceivably just in the transfer test). On the other hand, if top-down control was ineffective, performance of the condition should resemble that of the Overshadowing condition.
Methods
The design of Experiment 1 was very similar to that of the pilot experiment, except that there was an additional between-subjects Warning condition. In addition, the study was run in the laboratory rather than online. Other changes made in Experiment 1 are detailed below.
Design
As in the pilot experiment, the experiment had two phases, training and transfer test. Three between-subjects conditions were compared: Control, Overshadowing, and Warning. The Control and Overshadowing conditions were identical to those in the pilot experiment. In the Control condition, three probabilistic features were independently predictive of the stimuli’s category memberships. In the Overshadowing condition, an additional binary feature was indicative of the stimuli’s category memberships. The new Warning condition was identical to the Overshadowing condition, except that participants were informed about the deterministic predictor before the training phase. All participants took a transfer test in which new stimuli were shown.
Participants
One hundred twenty participants (83 females, mean age = 21) from the University of California, San Diego Psychology Participant Pool participated for course credits. The number of participants was determined by a power analysis using the data from the pilot experiment. With a Cohen’s d of around 1.0 between the Control and Overshadowing conditions obtained from the pilot experiment, an estimated 17 participants per condition were needed to achieve a power of 80%. To help satisfy the normality assumptions required for the statistical tests, we decided to run at least 30 participants per condition.
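As a rough check on the sample-size figure above, the sketch below uses a normal approximation with a small-sample correction for a two-sided, two-group comparison. The choice of test, the α level of .05, and the function name `n_per_group` are assumptions made for illustration; the text does not state how the power analysis was actually carried out.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided, two-sample t-test.

    Uses the normal approximation plus a z^2/4 small-sample correction.
    This is an assumed method, not necessarily the one used in the paper.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    n = 2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4
    return math.ceil(n)


print(n_per_group(1.0))  # 17
```

Under these assumptions, an effect size of d = 1.0 gives 17 participants per group, in line with the estimate reported above.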
All participants took a computerised version of Ishihara’s Test for Colour Deficiency. Eight colour plates were used in the test, and participants had to report the numbers in the plates. One participant did not pass the test and was replaced. Participants were randomly assigned into one of the three conditions: Control (n = 41), Overshadowing (n = 36), and Warning (n = 43). Informed consent was obtained before the experiment began.
Materials
As in the pilot experiment, two categories of demons were created. In both the training phase of the Control condition and the transfer test of all conditions, the two categories of demons varied in eye colour, eye size, and horn height. Demons in the training phase of the Overshadowing and Warning conditions had an additional binary feature: each of the two horn colours mapped perfectly onto one of the categories. Once we had generated the values of these features, we created a brown rectangle (width: 500 pixels, height: 300 pixels, RGB values: [180, 150, 110]) on the screen denoting the face of the demon. We then mapped the other features onto the brown rectangle to create an image of a demon.
Specifically, the Old World demons had shorter horns (mean height: 70 pixels, SD: 20 pixels) and bigger round eyes (mean diameter: 70 pixels, SD: 20 pixels). New World demons had taller horns (mean height: 90 pixels, SD: 20 pixels) and smaller round eyes (mean diameter: 50 pixels, SD: 20 pixels). The colour distributions of the eyes of Old World and New World demons were independently generated for each participant. A random colour was generated within the HSB colour wheel, with saturation and brightness set to 100%. This colour determined the mean of the distribution of the Old World demons’ eye colour. The distribution had a 30° SD. The mean of the New World demons’ eye colour distribution was 30° clockwise to that of the Old World demons on the HSB colour wheel, again with a 30° SD.
Another colour was randomly generated for each participant in the Overshadowing and Warning conditions. In the training phase, the colour served as the horn colour of the Old World demons. Another colour that was 30° counterclockwise on the HSB colour wheel served as the horn colour of the New World demons.
Instead of having a fixed set of images for all the participants as in the pilot experiment, the stimuli were created with a generative process at the runtime of the experiment. Feature values were randomly generated from the predefined Gaussian distributions of horn height, eye size, and eye colour. The values were then translated into the corresponding geometric shapes and mapped onto a brown rectangle (the demon face) to form a coherent object. Hence, no two stimuli were identical except by coincidence. This process guaranteed the generation of two families of stimuli that were visually similar to one another within each category. The values were also parameterised in mathematical formulae to allow certain statistical analyses. The experimental environment was written in the Java programming language.
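The generative process described above can be sketched as follows. This is an illustrative Python sketch, not the original Java code: the names `make_participant` and `make_demon`, the mapping of “clockwise” to a positive hue shift, and the 1-pixel floor on sizes are all assumptions. The means and SDs are taken from the text.

```python
import random

# Category means from the Materials section (pixels).
CATEGORIES = {
    "old_world": {"horn_height": 70, "eye_diameter": 70},
    "new_world": {"horn_height": 90, "eye_diameter": 50},
}
SIZE_SD = 20    # SD of horn height and eye diameter (pixels)
HUE_SD = 30     # SD of eye hue (degrees)
HUE_SHIFT = 30  # New World mean hue relative to Old World (sign is assumed)


def make_participant(rng):
    """Draw the participant-specific mean eye hue for each category."""
    old_hue = rng.uniform(0, 360)
    return {"old_world": old_hue, "new_world": (old_hue + HUE_SHIFT) % 360}


def make_demon(category, hue_means, rng):
    """Sample one stimulus; sizes are floored at 1 pixel (an assumption)."""
    params = CATEGORIES[category]
    return {
        "category": category,
        "horn_height": max(1.0, rng.gauss(params["horn_height"], SIZE_SD)),
        "eye_diameter": max(1.0, rng.gauss(params["eye_diameter"], SIZE_SD)),
        # Hue is circular, so wrap the sampled value back onto the wheel.
        "eye_hue": rng.gauss(hue_means[category], HUE_SD) % 360,
    }


rng = random.Random(0)
hue_means = make_participant(rng)
trial = make_demon(rng.choice(list(CATEGORIES)), hue_means, rng)
print(trial)
```

Because every trial is a fresh draw, two participants seeded differently would see different stimuli from the same two categories, which is the property the text emphasises.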
Procedure
The Control and Overshadowing conditions were identical to the corresponding conditions in the pilot experiment, except that each block was 100 trials long. The total numbers of trials in the training and transfer test phases were again 300 and 100, respectively. Participants were seated in a normally-lit, sound attenuated room. In the training phase, they were told that the task was to classify the visual stimuli into two categories. They were encouraged to spend no more than 3 s on each trial, and the stimulus would be replaced by a white blank screen if no responses were given within 10 s. Audio feedback was given upon response.
For the Warning condition, the basic procedure was identical to the Overshadowing condition. Participants were trained with the presence of the deterministic feature (horn colour). The only difference between the two conditions was that participants in the Warning condition were informed about the deterministic feature. They were told that one of the horn colours signified one category, the other horn colour signified another category. They were instructed to ignore the horn colour and try to learn about the category properties through other probabilistically defined features. Specifically, they were told:
“Old- and New-world Demons you see in Training will differ in their horn colors: Old-world Demons have one horn color and New-world Demons have another horn color. However, you should know that this horn color difference will not be present in the test. To maximize your performance in the final Test phase, please try to learn about the intrinsic differences between the Old- and New-world Demons. Just ignore their horn colors and attend to their other properties.”
The same message was repeated before each block began. They were not told which other features defined the categories.
All participants performed a transfer test in which the demons’ horns were always grey, so participants had to classify the demons using the three probabilistic predictors. Again, all stimuli were created according to the generative process, and every participant encountered a different set of stimuli.
Results
Training phase
Performance of the Control and Overshadowing conditions during training phase was very similar to that in the pilot experiment (Figure 3, left panel). Accuracy increased steadily across blocks. Performance of the Warning condition was superior compared with the Control condition, but inferior compared with the Overshadowing condition.

Participants’ performance in Experiment 1. In the last training block, those in the Overshadowing condition performed the best, followed by those in the Warning condition. Those in the Control condition performed the worst. In the transfer test, when the overshadowing feature was no longer available, the pattern reversed. Error bars denote between-subjects standard errors of the means.
A 3 (Block) × 3 (Training Condition) mixed-design ANOVA was employed to analyse the training performance. A significant main effect of Block, F(2, 234) = 46.63, p < .001, indicates improvements in performance across training blocks. A significant main effect of Condition, F(2, 117) = 23.27, p < .001, indicates differential performance across training conditions. A significant interaction between the two factors, F(4, 234) = 4.84, p < .001, provides evidence that the rates of improvement in training were different in the three conditions. Specifically, the Overshadowing condition reached a performance asymptote quickly, whereas the other two conditions improved at slower rates.
In the last training block, the Control condition had the lowest performance and the least variability (M = 63%, SD = 9.5%), the Overshadowing condition had the highest performance but also greatest variability within condition (M = 87%, SD = 19.7%). Performance of the Warning condition was in between the two (M = 81%, SD = 16.7%). A one-way ANOVA, F(2, 117) = 23.42, p < .001, indicates significant differences between these conditions. A Tukey’s honest significant difference (HSD) test indicates reliable differences between the Control and Overshadowing conditions (p < .001, Cohen’s d = 1.48, 95% CI for d = [0.97, 1.99]), and Control and Warning conditions (p < .001, Cohen’s d = 1.28, 95% CI for d = [0.80, 1.76]). However, there was no reliable difference between Overshadowing and Warning conditions (p = .20, Cohen’s d = .34, 95% CI for d = [−0.12, 0.79]).
Transfer test
Comparing performance of the final training block and the transfer test, a 2 (Final Training Block, Test Block) × 3 (Training Condition) mixed-design ANOVA was conducted. It suggests that the changes in performance from the final training block to the transfer test block in the three conditions were different, F(2, 117) = 6.36, p = .002. Although performance of the Control condition improved slightly in the transfer test, Mdifference = 3.0%, t(40) = 2.77, p < .01, a drop in performance was seen for both Overshadowing, Mdifference = −29.8%, t(35) = 7.78, p < .001, and Warning conditions, Mdifference = −18.9%, t(42) = 6.08, p < .001.
Participants’ performance in the Control and Overshadowing conditions during the transfer test (Figure 3, right panel) was very similar to that in the pilot experiment. In contrast to their high level of performance in the training phase, participants in the Overshadowing condition (M = 57%, SD = 11.0%) did the worst, whereas those in the Control condition did the best (M = 66%, SD = 7.8%). Participants in the Warning condition scored in between the two conditions (M = 62%, SD = 11.1%). Transfer test performance in all three conditions was significantly different from chance (ps < .001).
A one-way ANOVA, F(2, 117) = 8.43, p < .001, indicates significant differences between these conditions. A Tukey’s HSD test indicates a reliable difference between the Control and Overshadowing conditions (p < .001, Cohen’s d = .98, 95% CI for d = [0.49, 1.46]). The effect size obtained is almost identical to that in the pilot experiment, indicating a successful replication.
Importantly, even though participants in the Warning condition appeared to outperform the Overshadowing group, the pairwise difference between the Overshadowing and Warning conditions (p = .098, Cohen’s d = .43, 95% CI for d = [−0.03, 0.88]) is not significant. The difference between the Control and Warning conditions does not reach statistical significance either (p = .087, Cohen’s d = .49, 95% CI for d = [0.05, 0.93]). These inconclusive results led us to design Experiment 2, which aimed to produce a larger overshadowing effect.
Receiver operating characteristic analysis
To better quantify how effective the three training conditions were, we applied a receiver operating characteristic (ROC) analysis to the transfer test data (Green & Swets, 1966; Macmillan & Creelman, 2004). The ROC analysis has an additional advantage over accuracy analysis in that it estimates participants’ ability to differentiate the categories regardless of where the participants place their decision criteria (Wixted et al., 2017). Each point on Figure 4 denotes the proportion of correct classifications of Old World demons and incorrect classifications of New World demons for a learner. Correctly classifying an Old World demon as Old World was considered a hit, whereas incorrectly classifying an Old World demon as New World was considered a miss, and so forth. Discriminability (d′) was calculated for each participant assuming an equal-variance signal detection model. The value of d′ measures how well a participant was able to tell the two categories apart.
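Under the equal-variance model, d′ is the difference between the z-transformed hit rate and false alarm rate. The sketch below is a minimal illustration of that calculation. The function name `d_prime` and the log-linear correction (adding 0.5 to each cell so that rates of exactly 0 or 1 stay finite) are assumptions; the text does not say how extreme rates were handled.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Equal-variance d' = z(hit rate) - z(false alarm rate).

    hits / misses: Old World demons classified as Old / New World.
    false_alarms / correct_rejections: New World demons classified
    as Old / New World.
    The +0.5 log-linear correction is an assumption of this sketch.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# Hypothetical participant: 35 of 50 Old World and 35 of 50 New World correct.
print(round(d_prime(35, 15, 15, 35), 2))
```

A participant who responds at chance (equal hit and false alarm rates) obtains d′ = 0, which corresponds to the diagonal in the ROC plot.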

Receiver operating characteristic (ROC) analysis for test blocks in Experiment 1. Each point on the graph represents a participant’s performance in the transfer test blocks. We denote hit rate as the proportion of correct classifications to the Old World demons. The false alarm rate represents the proportion of incorrect classifications to New World demons. The diagonal line represents chance level in discriminating Old World demons from New World ones.
In line with our conclusions from the pilot experiment and from the accuracy analysis, participants in the Control condition achieved the highest mean discriminability (d′ = 0.90, SD = 0.45), followed by the Warning condition (d′ = 0.65, SD = 0.61) and the Overshadowing condition (d′ = 0.38, SD = 0.63). A one-way ANOVA confirmed that discriminability differed across conditions, F(2, 117) = 8.24, p < .001. Tukey’s HSD test suggested that participants in the Control condition could better discriminate the two categories compared with those in the Overshadowing condition (p < .001, Cohen’s d = .97, 95% CI for d = [0.48, 1.46]). However, discriminability of the Warning condition was not reliably different from either the Control condition (p = .11, Cohen’s d = .47, 95% CI for d = [0.02, 0.92]) or the Overshadowing condition (p = .09, Cohen’s d = .44, 95% CI for d = [−0.02, 0.90]).
Classification strategies in transfer test
To examine in more detail which features individual participants relied upon in the transfer test, we performed a logistic regression for each participant using his or her transfer test data. Features of the stimuli, namely eye colour, eye width, horn height, and nose, were used to predict a participant’s classification decisions in the transfer test. Weights of features were standardised. The relative strength of a participant’s standardised weights gives an indication of which features were utilised. Each row in Figure 5 denotes a participant’s data. The colour of each of the four tiles indicates the level of reliance on each of the features in classifying a demon. A deep green tile means that a feature had a heavy weight in the appropriate direction, a white tile suggests no reliance, and a deep red tile indicates a weight in the wrong direction.
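The per-participant regression described above can be illustrated with a small, dependency-free sketch. Everything here is hypothetical: the simulated participant, the plain gradient-descent fitting routine, and the names `standardise` and `fit_logistic` are assumptions for illustration, and the sketch does not compute the significance tests used to colour the tiles.

```python
import math
import random
from statistics import mean, pstdev

FEATURES = ["eye_colour", "eye_width", "horn_height", "nose"]


def standardise(column):
    """Rescale one feature column to mean 0 and SD 1."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]


def fit_logistic(rows, responses, lr=0.5, epochs=2000):
    """Gradient-descent logistic regression; returns (intercept, weights)."""
    n, k = len(rows), len(rows[0])
    b, w = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for x, y in zip(rows, responses):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w


# Simulated participant (hypothetical): says "New World" (1) mostly when
# the horns are tall, and ignores the other three features.
rng = random.Random(1)
raw = [[rng.gauss(0, 1) for _ in FEATURES] for _ in range(100)]
responses = [1 if row[2] + rng.gauss(0, 0.5) > 0 else 0 for row in raw]

columns = [standardise([row[j] for row in raw]) for j in range(len(FEATURES))]
rows = [list(r) for r in zip(*columns)]
_, weights = fit_logistic(rows, responses)
weights_by_feature = dict(zip(FEATURES, weights))
print(weights_by_feature)
```

For this simulated learner, the largest standardised weight falls on horn height, which is the kind of pattern a single deep green tile in one row of the figure would represent.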

Classification strategies adopted by individual participants in the transfer test. Each row represents the data of a participant. The colour of a tile represents the weight a participant put on a feature (deep green indicates correct utilisation, white indicates no utilisation, and red indicates incorrect utilisation). A nonwhite colour for nose column indicates an incorrect weight of that feature in the decision rule. Participants were grouped by conditions: (a) Control condition, (b) Overshadowing condition, and (c) Warning condition.
For the nose feature, any nonzero weight, regardless of colour, would be inappropriate since it was not predictive of category membership. Therefore, all weights reaching statistical significance are red. For a clearer visualisation, only weights that reach statistical significance at .05 level were shown in colour (Wills et al., 2015).
Participants were sorted in a descending order according to their classification accuracy in the transfer test, so the higher rows refer to the better-performing participants.
Regardless of condition, it can be seen that participants with higher classification accuracy tended to use a mix of multiple defining features in their classifications while correctly ignoring the nose as a predictive feature. More participants in the Control condition (Figure 5a) and Warning condition (Figure 5c) appeared to utilise multiple defining features compared with those in the Overshadowing condition (Figure 5b). More than half of the participants in the Overshadowing condition did not show any obvious dominant classification strategies.
Discussion
Experiment 1 successfully replicated the overshadowing effect found in the pilot experiment. When a highly predictive feature was available in the training phase, learning of other probabilistic predictors of category membership was markedly reduced. Importantly, these results also show that the overshadowing effect seems to be quite stubborn. Warning against the overshadowing feature appears to shrink the effect, but it does not completely eradicate the overshadowing feature’s influence on learning. The ROC analysis suggests that participants’ ability to discriminate the two demon categories was reduced by the overshadowing feature. Warning against the feature did not reliably improve discriminability.
In Experiment 1, the difference in transfer test performance between the Control (66%) and Overshadowing (57%) groups was relatively small. The difference between the Warning and Overshadowing conditions was trending towards statistical significance, but the results are inconclusive. This motivated us to design a similar task with higher sensitivity to the effects of top-down instructions. In the following experiment, we attempted to replicate the main findings of Experiment 1 using stimuli that are simpler and more naturalistic in appearance. We specifically tried to enlarge the overshadowing effect, such that there would be more room to assess the effects of top-down instructions. As seen in Figure 6, the stimuli looked somewhat like living cells. The category membership was predicted by a single probabilistic feature in the Control condition. We tested whether the overshadowing effect demonstrated in the preceding experiments would still hold in such a case, and if so, whether a warning would help mitigate the effect.

Sample stimuli used in Experiment 2. In general, (a) Goopies had fewer dots and (b) Plunketts had more dots. In the training phase of the Control condition, all stimuli were presented horizontally. In those of Overshadowing and Warning conditions, Goopies were tilted to the left and Plunketts to the right. All stimuli were presented horizontally in the transfer test trials.
Experiment 2
Methods
Design
The basic design of Experiment 2 was identical to that of Experiment 1. The experiment consisted of two phases, training and transfer test. Three training conditions were compared between subjects: Control, Overshadowing, and Warning. All participants took the same transfer test, in which new stimuli were shown. As in the previous experiments, the proportion of correct classifications in both training and transfer test was recorded to assess participants’ learning.
Participants
Ninety participants (61 females, mean age = 20.2) from the University of California, San Diego, were recruited for the experiment. All of them participated in exchange for course credits. Participants were randomly assigned to the three conditions: Control (n = 31), Overshadowing (n = 31), and Warning (n = 28). Informed consent was obtained before the experiment began. All participants passed a computerised version of Ishihara’s Test for Colour Deficiency as described above.
Materials
Cell-like stimuli were used in the experiment (see Figure 6). Two categories of cell-like stimuli were created, which we named Goopies and Plunketts. Stimuli in the Control condition and the transfer test of all conditions varied in colour, height, and number of black dots, but only the number of dots was predictive of category membership. In general, Goopies had fewer dots, and Plunketts had more dots. The number of dots was drawn at the runtime of the experiment, from a discretised Gaussian distribution with a mean that differed for the two categories. The means of the two distributions were 12 and 22, with an SD of 5. Therefore, the means of the two categories were separated by 2 SDs. We ran a simulation and determined that the probabilistic feature would predict the correct category on 84% of trials.
All stimuli had a fixed width of 450 pixels, whereas the two nondefining features, cell height and cell colour, varied across trials. The cell height distribution had a mean of 220 pixels, with an SD of 30 pixels. Therefore, the aspect ratio of the cell changed on each trial. The two categories followed the same colour distribution, whose mean was randomly determined between 0° and 359° on an HSB colour wheel for each participant. The distribution had a standard deviation of 40°.
In the training phase of the Overshadowing and Warning conditions, the stimuli were defined the same way as in the Control condition. The only difference was that these stimuli were tilted: Goopies were tilted to the left, whereas Plunketts were tilted to the right. The mean orientations of the two distributions were −45° and +45°, with an SD of 10°. We jittered the orientation so that the stimuli did not appear at the same orientation across trials. This also made orientation a slightly less salient feature. The stimuli were presented on a white background.
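The 84% figure can be reproduced with a short calculation. With means 2 SDs apart, the best single criterion sits midway at 17 dots, 1 SD from each mean, so the expected accuracy for the continuous case is Φ(1) ≈ .84. The sketch below checks this analytically and with a Monte Carlo simulation of the discretised counts. The rounding rule, the floor at zero dots, the 50/50 category mix, and the guess on ties are assumptions; the original simulation is not described in detail.

```python
import random
from statistics import NormalDist

MEAN_GOOPY, MEAN_PLUNKETT, SD = 12, 22, 5
CRITERION = (MEAN_GOOPY + MEAN_PLUNKETT) / 2  # midpoint: 17 dots

# Continuous case: each category mean lies 1 SD from the criterion.
analytic = NormalDist().cdf((CRITERION - MEAN_GOOPY) / SD)


def simulate(n_trials=100_000, seed=0):
    """Accuracy of an observer who classifies by the dot-count criterion."""
    rng = random.Random(seed)
    correct = 0.0
    for _ in range(n_trials):
        is_plunkett = rng.random() < 0.5
        mu = MEAN_PLUNKETT if is_plunkett else MEAN_GOOPY
        # Discretise by rounding and floor at 0 dots (both assumptions).
        dots = max(0, round(rng.gauss(mu, SD)))
        if dots == CRITERION:
            correct += 0.5  # tie at the criterion: count as a coin flip
        elif (dots > CRITERION) == is_plunkett:
            correct += 1
    return correct / n_trials


print(round(analytic, 3))  # 0.841
print(round(simulate(), 3))
```

Both values come out close to 84%, which is the ceiling against which transfer test accuracy in this experiment can be judged.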
Procedure
The procedure of Experiment 2 was identical to that of Experiment 1, except that each experimental session consisted of 600 trials. Participants went through 400 training trials, and 200 transfer test trials. After completing every 100 trials, participants were encouraged to take a 30-s break.
In each of the training trials, participants in the Control condition saw a cell-like stimulus in the middle of the computer screen. The stimulus was always in a horizontal position. Participants were encouraged to make a keyboard response within 3 s. The stimulus disappeared in 10 s if no responses were made. A feedback tone indicated whether the response was correct.
The training phase of the Overshadowing and Warning conditions was identical to the Control condition except that the stimuli were tilted differently in the two categories. Goopies were always tilted to the left, and Plunketts to the right. In the Warning condition, participants were told that the stimuli would always be lying horizontally in the transfer test, and they were encouraged to learn other orientation-invariant properties of the categories that would help them do well in the transfer test.
After 400 training trials had elapsed, all participants took the same transfer test. In the transfer test, one stimulus was presented at a time. The stimulus was positioned horizontally. Participants had no time constraints for their responses. The experiment ended with a debrief after 200 transfer test trials were completed.
Results
Training phase
Performance of all three conditions improved from Block 1 to Block 2 and stayed at a high level until the end of the training phase (Figure 7, left panel). Accuracies for the three conditions were quite different even in the first block, suggesting that utilisation of the overshadowing feature commenced early in both the Overshadowing and the Warning conditions. Performance differed across conditions in the last two training blocks, as indicated by a one-way ANOVA, F(2, 87) = 30.3, p < .001. Tukey’s HSD test indicated that participants in the Control condition performed significantly worse than those in both the Overshadowing (p < .001, Cohen’s d = 2.03, 95% CI for d = [1.40, 2.65]) and the Warning conditions (p < .001, Cohen’s d = 1.45, 95% CI for d = [0.86, 2.03]). However, performance in the Overshadowing and Warning conditions was not reliably different (p = .49, Cohen’s d = .29, 95% CI for d = [−0.24, 0.81]).

Training and test accuracies in Experiment 2. Participants in the Overshadowing and Warning conditions reached an asymptote within the first training block, but their performance was worse than those in the Control condition in the transfer test. Error bars denote within-participants standard errors of the means.
Transfer test
A 2 (Last Two Training Blocks vs. Test Blocks) × 3 (Training Condition) mixed-design ANOVA was conducted to examine the changes from training to transfer test phases in the three conditions. Accuracy in the transfer test was generally lower than that in the last training blocks, F(1, 87) = 147.4, p < .001. Importantly, a significant interaction between the two factors suggests that the changes are different in the three conditions, F(2, 87) = 54.4, p < .001. Specifically, participants in the Control condition showed a slight improvement in the transfer test, Mdifference = 2.9%, t(30) = 3.84, p < .001. A huge drop in performance was seen in both the Overshadowing, Mdifference = −40.7%, t(30) = 11.87, p < .001, and the Warning conditions, Mdifference = −27.6%, t(27) = 6.58, p < .001, when the deterministic binary feature was no longer available. The main effect of Condition was not significant when accuracy was averaged across the final two training blocks and the transfer test blocks, F(2, 87) = 0.90, p = .41.
As in the previous experiments, participants in the Control condition performed the best (M = 77%, SD = 11.5%) in the transfer test. Performance of the group was close to the theoretical ceiling of the task (84%). Participants in the Overshadowing condition performed the worst (M = 54%, SD = 15.1%). Participants in the Warning condition performed at an intermediate level (M = 64%, SD = 15.1%). Performance of all three groups was above chance (ps < .001).
A one-way ANOVA suggested that participants in the three conditions performed differently in the transfer test (Figure 7, right panel), F(2, 87) = 19.68, p < .001. Importantly, Tukey’s HSD test indicated that all pairwise comparisons were statistically significant (ps < .05). Specifically, there was a difference of 22.2 percentage points between the Control and Overshadowing conditions (p < .001, Cohen’s d = 1.65, 95% CI for d = [1.06, 2.24]), indicating a cue competition effect. The instruction effects were clear, as indicated by a difference of 9.7 percentage points between the Overshadowing and Warning conditions (p = .02, Cohen’s d = .64, 95% CI for d = [0.11, 1.18]). The top-down instructions did not completely eradicate the cue competition effect, as indicated by a difference of 12.5 percentage points between the Control and Warning conditions (p = .002, Cohen’s d = .93, 95% CI for d = [0.38, 1.48]). The latter two comparisons provide more conclusive evidence of the top-down instruction effects than Experiment 1 did.
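The effect size for the Control versus Overshadowing comparison can be recovered from the summary statistics reported above (difference = 22.2, SDs = 11.5 and 15.1, n = 31 per group). The sketch below uses the standard pooled-SD formula for Cohen’s d and a large-sample approximation for its confidence interval. The function names are hypothetical, and the approximate interval may differ slightly from the reported one, which was presumably computed by a different method.

```python
import math


def cohens_d(mean_diff, sd1, n1, sd2, n2):
    """Cohen's d with a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return mean_diff / math.sqrt(pooled_var)


def approx_ci(d, n1, n2, z=1.96):
    """Large-sample 95% CI for d (an approximation, not the paper's method)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Control vs. Overshadowing, transfer test of Experiment 2.
d = cohens_d(22.2, 11.5, 31, 15.1, 31)
print(round(d, 2))  # 1.65
print(approx_ci(d, 31, 31))
```

Because the inputs are rounded summary values, small discrepancies from the reported statistics are to be expected for the other comparisons.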
ROC analysis
Figure 8 shows participants’ performance in terms of hit and false alarm rates. It can be seen that most of the points of the Control condition clustered in the upper-left-hand corner, indicating an overall high discriminability (d′ = 1.56, SD = 0.70). A sizable portion of the Warning condition also clustered in the same region, indicating high discriminability for those participants. However, the remaining participants clustered around the grey line, which denotes chance-level performance. This brought the mean discriminability down (d′ = 0.87, SD = 0.88), compared with the Control condition. Discriminability of the Overshadowing condition was the lowest (d′ = 0.25, SD = 0.87), which was in fact not reliably different from chance level, t(30) = 1.63, p = .11. A one-way ANOVA indicates significant differences between the conditions, F(2, 87) = 19.72, p < .001. Mirroring the accuracy data, all pairwise comparisons are significant (ps < .05). Specifically, there was a 1.30 difference in discriminability between the Control and Overshadowing conditions (p < .001, Cohen’s d = 1.65, 95% CI for d = [1.06, 2.24]), a 0.69 difference between the Control and Warning conditions (p = .005, Cohen’s d = .86, 95% CI for d = [0.31, 1.41]), and a 0.62 difference between the Overshadowing and Warning conditions (p = .01, Cohen’s d = .71, 95% CI for d = [0.17, 1.24]).

ROC analysis for Experiment 2. The ability to discriminate the two categories was high for the Control condition. Performance of participants in the Warning condition was more diverse, whereas those in the Overshadowing condition did not differ from chance level.
Classification strategies in transfer test
As in Experiment 1, we performed logistic regressions to determine which features were utilised by individual participants in the transfer test. The number of dots and the height and colour of the cell were entered into the logistic regression to predict each participant’s responses.
In Experiment 2, only the number of dots predicted category membership in the transfer test. We predicted that best-performing participants would utilise this feature strongly in their decisions. As shown in Figure 9a, most of the participants in the Control condition made use of the number of dots in classification, as shown in deep green in the first column. Best-performing participants did not incorrectly rely much upon other nondefining features in their decision rules. More than half of the participants in the Warning condition also utilised the number of dots (Figure 9c), whereas fewer than one third of all the participants in the Overshadowing condition reliably did so (Figure 9b).

Classification strategies adopted by individual participants in the transfer test. Participants were grouped by condition: (a) Control condition, (b) Overshadowing condition, and (c) Warning condition. Participants who performed well were more likely to make use of the number of dots (shown in deep green in the first columns in each figure) while not utilising other nondefining features.
Discussion
Experiment 2 suggests that the overshadowing effect reported in the previous experiments may be quite robust. Here, we utilised a simpler mapping of features to categories and more naturalistic-appearing stimuli. As in Experiment 1, we saw that when a deterministic binary predictor feature was made available in training, performance on a transfer test was impaired. From the logistic regression analysis, it appears that a large portion of participants in the Warning condition were able to effectively ignore the overshadowing feature, or use it to their advantage during training. This ability allowed the Warning group to outperform the Overshadowing group in terms of overall accuracy. However, it should be noted that performance of participants in the Warning condition was worse than that of participants in the Control condition, indicating suboptimal use of the instructions.
General discussion
In three experiments, we showed a powerful overshadowing effect in human perceptual category learning, a point that had been in doubt given the small preexisting literature on the question (Bott et al., 2007; Murphy & Dunsmoor, 2017). Specifically, when a stimulus feature highly predictive of its category membership was made available during training (but not in the transfer test), participants acquired much less ability to perform the classification based on one or multiple probabilistic feature dimensions. In the three experiments, we reported, this overshadowing effect showed up as a robust performance difference between Control and Overshadowing conditions in the transfer test. The effect is quantitatively impressive, both in terms of proportional change of performance and in terms of Cohen’s d (effect size relative to interindividual variability in performance).
At the group level, the overshadowing effect was so powerful that performance of the Overshadowing groups in Experiments 1 and 2 was close to chance level in the transfer test. This is in sharp contrast to the Control groups, which clearly retained most of the category information. At an individual level, our analyses through ROC curves and logistic regression show a considerable amount of participant variability. Some participants in the Overshadowing conditions were able to pick up the probabilistic features, and to utilise them when the overshadowing feature was no longer available. Our logistic regression analysis shows how individual participants utilised each feature in the transfer test. Logistic regression has been under-utilised as a tool for understanding the strategies of individual learners (but see Wills et al., 2015). Our observation mirrors that of Kemler Nelson (1984, Experiments 1 and 2), who showed that some participants may take a more analytic approach in category learning, whereas others may take a more holistic approach.
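To illustrate the kind of individual-level analysis described above, the following minimal sketch fits a logistic regression to the transfer-test responses of a single simulated participant. It is not the analysis code used in the experiments: the simulated participant, the feature coding (three features scaled to the range -1 to 1), the function names, and the fitting settings are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(features, responses, lr=1.0, n_iter=1000):
    """Return [intercept, w_1, ..., w_k], fitted by gradient ascent
    on the mean log-likelihood of the binary responses."""
    n, k = len(features), len(features[0])
    weights = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for x, y in zip(features, responses):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            grad[0] += y - p
            for j, v in enumerate(x):
                grad[j + 1] += (y - p) * v
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights

# Hypothetical participant who bases category responses on feature 0 only.
rng = random.Random(0)
trials = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
choices = [1 if rng.random() < sigmoid(4.0 * x[0]) else 0 for x in trials]

weights = fit_logistic(trials, choices)
print("intercept and feature weights:", [round(w, 2) for w in weights])
```

In this sketch, a large fitted weight on one feature and near-zero weights on the others would be read as a single-feature strategy, whereas comparable weights across features would suggest that the participant integrated several features.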
Contrast with previous human category learning studies
The current study was partly motivated by Bott et al. (2007). In their Experiments 1 and 2, the researchers had participants learn to associate features of a car with one of the category labels. Half of the participants were in the blocking group. This group went through a pretraining phase where they were told that a particular feature always predicted the group membership. In an actual training trial, the participants were shown a list of car features and had to classify the car into one of the two categories. In the test phase, the features were shown one at a time, and the participants had to decide which category the feature belonged to. The authors predicted that participants in the blocking group would classify the features at chance in the test. This was not the case. The blocking group learned nearly as much about the features as the control group. This was taken as evidence that the blocking effect was weak in human learning in their study. Only in their third experiment, when participants were told to predict the outcome of the features instead of performing a classification task, was a blocking effect clearly shown. Their first two experiments utilised a “train to criterion” approach, such that participants in the control and the blocking groups received very different amounts of feature training. The blocking groups received an average of one to two blocks of training, whereas the control group received four to five blocks. This might be the main reason for the failure to show a blocking effect.
In their Experiment 3, participants in the control condition actually learned fewer features than the blocking group in the categorisation task, contradicting the authors’ original hypothesis. A blocking effect was shown only when the blocking group was asked to predict the outcome of the attributes. There were a number of differences between our task and the categorisation task reported by Bott and colleagues. First, as mentioned in the Introduction, the authors opted for a “one away” design with a large number of dimensions and binary feature values. Eight features were utilised in their study, but only up to three in the current study. We also allowed our feature values to vary along a continuum. These choices allowed learners in our study to build a more detailed representation of individual features. Our approach has the clear advantage of being more sensitive to differences across conditions in the transfer test. Second, the features in Bott and colleagues’ study were verbally presented, whereas ours were visually presented. Clearly, the features were processed by different perceptual systems, and one may be more sensitive for learning than the other. Third, the two studies adopted very different transfer tests. In Bott et al.’s study, features were learned holistically as parts of an object during training, but were shown one at a time during the test. Strictly speaking, generalisation of knowledge is not required in the test since all the features had been shown in training. However, there might be a switch cost as the tasks were different in the two phases. In contrast, the training and the transfer test in the current study were almost identical. Learners saw features embedded in objects in both phases. They were also shown new objects defined by the same generative process in the transfer test; therefore, an ability to generalise over superficial surface features was required. The two methods may tap into different kinds of category learning (Kemler Nelson, 1984; Wills et al., 2015).
Instead of focusing on whether a cue competition effect is robust to procedural variations (Maes et al., 2016; but see Soto, 2018), we are primarily interested in whether human learners have voluntary control over the cue competition effects when they do arise (Mitchell et al., 2009). The basic question is concrete: If a strongly predictive feature inhibits learning about weaker predictors, can that inhibition be voluntarily reduced if the learner is consciously motivated to learn as much as possible about all predictors that are present?
In the rest of the Discussion, we will first explore top-down voluntary control in the Rescorla–Wagner model and some of its variants, and then consider other computational models that may explain some of the findings. We will then end with some practical implications of our findings.
Cue competition effects in computational models
The results from our study seem at least qualitatively compatible with the hypothesis of error-driven learning. A majority of participants learned in these tasks when they made mistakes, presumably because they adjusted their classification rule in response to the feedback they received. When participants were explicitly instructed not to rely upon the overshadowing feature in training, this overshadowing effect appeared to be reduced. The group means may not tell the whole story in this regard, however. The ROC analysis in Experiment 2 showed that around half of the participants in the Warning condition seemed to show a dramatically shrunken overshadowing effect, whereas others seemed not to show any such reduction (solid dots in Figure 8).
According to the Rescorla–Wagner Model, associations between US and CS are enhanced because doing so would reduce the error in future categorisation trials. In general, such acquisition-focused models have little problem explaining the results of cue competition effects.
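As a rough illustration of how an error-driven account produces overshadowing, the sketch below implements the standard Rescorla–Wagner update, in which every cue present on a trial changes its associative strength in proportion to its salience and to a shared prediction error. The cue names, salience values, and number of trials are hypothetical, and the sketch uses the classical conditioning form (every trial reinforced) rather than the continuous, probabilistic features of the present experiments.

```python
def rescorla_wagner(trials, saliences, beta=1.0, lam=1.0):
    """Return associative strengths after training.

    trials: sequence of tuples naming the cues presented together on a trial.
    saliences: dict mapping cue name to its salience (alpha); values are assumed.
    beta, lam: learning rate for the outcome and its asymptote.
    """
    strengths = {cue: 0.0 for cue in saliences}
    for cues in trials:
        # Shared prediction error: outcome minus the summed prediction
        # of all cues present on this trial.
        error = lam - sum(strengths[c] for c in cues)
        for c in cues:
            strengths[c] += saliences[c] * beta * error
    return strengths

saliences = {"strong": 0.5, "weak": 0.1}  # hypothetical salience values

# Control: the weak cue is trained alone.
control = rescorla_wagner([("weak",)] * 100, saliences)
# Overshadowing: the weak cue is always accompanied by the strong cue.
overshadow = rescorla_wagner([("strong", "weak")] * 100, saliences)

print(round(control["weak"], 3), round(overshadow["weak"], 3))
```

Because the two cues share a single error term, the compound's total strength approaches the asymptote quickly and the weak cue is left with only the share proportional to its salience, which is far less than it acquires when trained alone.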
Because these error-driven learning models take a bottom-up approach in explaining how the stimulus structure affects learning, however, it is not entirely clear how they can incorporate voluntary control during category learning. Having shown some top-down voluntary control in our Warning conditions, we explore here the implications for currently popular models of categorisation.
A number of category-learning models that rely on error-driven learning, such as ALCOVE or RULEX (Kruschke, 1992; Nosofsky et al., 1994), seem to be able to account for overshadowing and blocking of the kind reported here. Fundamentally, these networks learn at the feature level, not at the exemplar level, and they change their weights only when errors are made. In the present task, participants performing in the training phase of the Overshadowing condition did so with few errors, and thus the ALCOVE model would predict that little useful adjustment would take place in irrelevant or redundant feature attention weights, resulting in little or no learning for irrelevant or redundant features.
The same appears true for error-driven hypothesis testing models such as RULEX. These models assume that individuals are biased towards learning simple rules for exemplar categorisation. Importantly, like ALCOVE, these models adjust the weights on the given set of exemplar features to find the ones that are best predictive of category membership. Put in the context of the current task, RULEX would impose a heavy weight on the Overshadowing feature. Once an individual learned about the Overshadowing feature and stopped making errors, the model leaves little room for learning other less valid features.
Less clear is how the current set of results fits with clustering models of categorisation such as SUSTAIN (Love et al., 2004). Although SUSTAIN can selectively weigh certain exemplar features more than others like ALCOVE and RULEX, it is not necessarily the case that learning about irrelevant or redundant feature information would be profoundly dampened. According to the model, exemplars are grouped into various clusters based on feature similarity, which in turn drives decisions about category membership. Although a single feature like the horn colour may drive classification performance during training, it is possible that SUSTAIN would still learn to categorise test exemplars that lack information about horn colour. This is because cluster membership in SUSTAIN is based on similarity matching. In our Experiment 1, features varied independently. Test exemplars would continue to be associated with clusters of similar looking exemplars. This may explain why a minority of participants in the Overshadowing condition were able to perform well in the transfer test.
Other theoretical models of categorisation, such as exemplar models (e.g., Hintzman, 1986; Medin & Schaffer, 1978; Nosofsky, 1984) and prototype models (e.g., Homa et al., 1979; Posner & Keele, 1968; Rosch, 1973; Smith & Minda, 1998), do not appear to fit with the data we have. During learning, these models construct a boundary that would separate the categories. The boundary is determined by two main factors, the selective-attention weights and the memory strength of training items. For the selective-attention component, the more relevant features are likely to be stretched and given more weight. In our Overshadowing condition, it means that the completely valid feature is likely to be attended, and gain a heavy weight. Other features are far less valid, and hence the psychological space of these features would be shrunk. The resulting decision rule is likely to resemble a single-dimension one, with the overshadowing feature as the sole component. Obviously, this rule would fail miserably during the transfer test, when the overshadowing feature is no longer available for classification. These models may be able to capture the group-level at-chance accuracy patterns of the Overshadowing condition, but are unlikely to explain why some participants were able to utilise the less valid features in the transfer test.
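The consequence of such attentional stretching and shrinking can be sketched with an attention-weighted exemplar similarity rule of the general kind used in exemplar models. This is an illustration under stated assumptions, not a fit of any published model to our data: the exemplar coordinates, the sensitivity parameter c, the attention weights, and the coding of the unavailable overshadowing feature as a neutral midpoint value are all hypothetical.

```python
import math

def similarity(x, y, attention, c=2.0):
    # Similarity decays exponentially with attention-weighted city-block distance.
    distance = sum(w * abs(a - b) for w, a, b in zip(attention, x, y))
    return math.exp(-c * distance)

def prob_category_a(probe, exemplars_a, exemplars_b, attention):
    # Choice probability from summed similarity to stored exemplars.
    s_a = sum(similarity(probe, e, attention) for e in exemplars_a)
    s_b = sum(similarity(probe, e, attention) for e in exemplars_b)
    return s_a / (s_a + s_b)

# Dimension 0: deterministic feature (0 for category A, 1 for category B).
# Dimension 1: probabilistic feature with overlapping category distributions.
exemplars_a = [(0.0, 0.2), (0.0, 0.4), (0.0, 0.6)]
exemplars_b = [(1.0, 0.4), (1.0, 0.6), (1.0, 0.8)]

# Transfer probe: deterministic feature unavailable (assumed neutral value 0.5),
# probabilistic feature typical of category A.
probe = (0.5, 0.2)

p_single = prob_category_a(probe, exemplars_a, exemplars_b, attention=(1.0, 0.0))
p_spread = prob_category_a(probe, exemplars_a, exemplars_b, attention=(0.5, 0.5))
print(round(p_single, 3), round(p_spread, 3))
```

With all attention on the deterministic dimension, the probe is equally similar to both categories and the predicted response is at chance; when some attention remains on the probabilistic dimension, the same probe is classified above chance.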
Finally, one multiple-systems account of category learning, COVIS (Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Ashby, Paul, & Maddox, 2011), posits that category learning takes place by one of two types of learning systems: a rule-based hypothesis testing system and a procedural learning system that relies on the physical similarity of category exemplars. The rule-based system of COVIS involves testing rules that could determine category membership by learning about one or a conjunction of multiple, independent features. Once a rule is discovered, it is presumably used until an error is made. Learning via the procedural learning system involves integrating information across multiple exemplar features, so that categorisation judgements are made based on the overall physical similarity of test objects to the learned category. Thus, it might be the case that the procedural learning system in COVIS would learn all predictive features of the presented exemplars. If there is a conflict between the outputs of the two systems, the final output is determined by two factors, confidence and trust. Confidence is determined by the distance between the predicted and actual values of the current stimulus. Trust is determined by the history of success of the systems. Hence, in the Overshadowing condition, when the rule always produces the correct answer, trust in the rule-based system is high. COVIS would therefore predict that learners learn very little from the procedural learning system in general. COVIS has the potential to account for learning in the Overshadowing condition though, if it allows learning from the procedural learning system even when the rule-based system has a clear advantage. Regardless, it requires some adjustments to the current model specifications.
All of these formal models mentioned above are bottom-up and stimulus-driven. They assume no top-down intervention in biasing the attention towards a particular feature. According to these models, if two features are independently and equally predictive of a stimulus’ category membership, they will end up having equal, or very similar weights. This is a logical assumption in early development of a formal model of learning using artificial stimuli. In naturalistic learning environments, however, learners surely often have biases towards certain particular feature dimensions, reflecting previous learning experience, instructions from educators, and so forth. The popular models mentioned above do not have any independent parameters to bias top-down attention. As we mention below, top-down attention modulation is likely to be prevalent in day-to-day life. Therefore, modifications of these computational models are needed so that they can be capable of explaining top-down effects such as the one reported here.
Alternative models explaining the cue competition effects
The computational models mentioned above fit reasonably well with the data in the Control and Overshadowing groups. They explain how much participants in a group, as a whole, learn about a particular feature. When looking at the performance of individual participants within a condition, however, these models do not work as well. As we can see in the ROC analyses and logistic regression analyses in Experiments 1 and 2, behaviours of some of the participants in the Overshadowing condition resembled those in the Control condition, whereas those of others resembled those in the Warning condition. These individual differences may be captured by a set of models that is known as performance-focused models (Miller & Escobar, 2001).
Performance-focused models posit that learners encode the events in which stimuli are paired. They have a lot of parallels with a class of models supporting the Exemplar Theory in the category learning literature (e.g., Kruschke, 2001). This contrasts with the acquisition models discussed above. From the perspective of the acquisition models, the learners are assumed to maintain only some summary statistics of the categories that they are trying to learn, such as the mean, standard deviation of each category, or the category boundary. Information encountered during the learning process beyond summary statistics is not encoded. In our experiments, that means that nondeterministic features would never enter the memory.
Data from our Overshadowing conditions suggest otherwise. Even when the nondeterministic features were not needed in the training phase, some of the learners clearly encoded the information. This observation appears potentially compatible with the view of performance-focused models. One formulation of the performance-focused models, the Comparator Hypothesis (Miller & Matzel, 1988), suggests that the association between the overshadowed CS and the US is acquired, but not expressed, in the presence of another stronger overshadowing CS (Denniston et al., 2003). When the overshadowing CS is removed, the overshadowed CS has a chance of being expressed. Our data, at least for a proportion of participants in the Overshadowing condition, fit well with this description. They showed clear learning of some of the probabilistic features in the transfer test, when the deterministic feature no longer appeared.
Top-down instructions affecting learning
As one of our reviewers pointed out, our results would also align with the propositional approach to learning (Mitchell et al., 2009). This approach argues that a participant generates explicit propositions about the relationships between different elements in the environment and creates associations between them through a controlled reasoning process. It contrasts with the dual-system approach to learning, which states that the association between the CS and US is formed automatically, often outside consciousness. A corollary of this argument is that top-down instructions may be able to affect cue competition effects in some cases. As mentioned in the Introduction, this seems to be true under some situations.
For example, Colgan (1970) trained participants to associate a light and electric shock. After pairing, participants’ skin conductance increased in response to the light. Half of the participants were told how to predict the shock. This group showed little increase in skin conductance on the subsequent no-shock trials, indicating an effect of the instructions. Given the evidence that associations between stimuli could be modified by instructions, the effect of warning on the overshadowing effect fits naturally with this propositional approach to learning.
A number of other studies have examined the effect of top-down instructions in the human learning literature, but almost all of them utilised a causal judgement paradigm. In general, these top-down instructions are found to be effective in reducing cue competition effects (e.g., Lovibond et al., 2003; Mitchell & Lovibond, 2002; Waldmann, 2001; Williams et al., 1994). Interestingly, Simms et al. (2012) also found that these instructions are effective only for children with the capacity for logical thinking.
Why was warning not fully effective?
Although our results showed an effect of warning, it did not improve performance up to the level of the control condition. Why is this the case?
Many researchers in category learning have suggested that attention has to be deployed to a dimension before it is utilised in category learning tasks (see Goldstone, 1998, for a review). In fact, many early computational models of category learning included a parameter that regulates how attention is weighed across features (e.g., Nosofsky, 1991). A feature that is predictive of category identity would be assigned a higher attentional weight in the decision rule. This larger attentional weight also makes differentiation within the feature dimension more fine-grained.
If variations within a feature dimension are not predictive of category membership, attentional weight to that feature is tuned down, denying maximal influence to the feature in the decision rule. This modulation is believed to be not completely voluntary (Rehder & Hoffman, 2005). For example, Shiffrin and Schneider (1977) showed that when stimuli previously served as targets, they may capture attention automatically. Participants in our Warning conditions might have faced a similar dilemma. They were instructed that the overshadowing feature should be ignored, yet it was highly predictive of category identity. The instruction might be overridden by the automatic attention deployment, preventing the overshadowing feature from being fully ignored.
Alternatively, the overshadowing effect may be affecting the calculation and storage of predictive relationships. In both the Overshadowing and Warning conditions, once participants figured out the mapping of the overshadowing feature to category identity, they did not need to make any estimations based on other features. On the other hand, to do well in training, participants in the Control condition had to make a decision taking into account all the features. Long-term memory of the to-be-remembered items, the defining features of the categories in the current experiments, is believed to improve with repeated retrieval (as in the retrieval practice effect; e.g., Carrier & Pashler, 1992; Gates, 1917; Roediger & Karpicke, 2006).
The retrieval practice effect has been shown to apply to human perceptual category learning. Jacoby et al. (2010) had participants classify two families of birds with study-only blocks or with repeated testing blocks. Those in the repeated testing with feedback condition outperformed the study-only condition in the transfer test. In the experimental design examined here, participants in the Overshadowing and Warning conditions may be deprived of the opportunity to test themselves, leading to suboptimal learning assessed by the transfer test.
Practical relevance of the current study
Perceptual category learning has always been a central part of cognitive psychology due to its practical implications in education. The practical implications of cue competition effects extend even beyond explicit perceptual tasks. As an example, it has been suggested that some variability in mathematical performance can be attributed to the differences in the ability to distinguish different problem types (Rohrer et al., 2014, 2015). In typical math textbooks and exercises, however, it is common to have similar problem types grouped together. As a consequence, the type of a problem can be easily inferred from the chapter it is presented in, or from neighbouring questions. This format for presentation deprives students of the opportunity to sort different problems into their corresponding types. In this classification problem, the chapter number serves as an overshadowing feature with complete validity, and the actual cues in the problems are not utilised during practice. Given the robust overshadowing effect in the current study, our results suggest that students may not learn much about the cues that differentiate problem types, and hence they may perform poorly in their final exams.
A further question we examined here is whether people have voluntary control over how much they learn about weak predictors in the presence of a strong predictor. For example, medical students learn to distinguish melanoma from benign skin lesions by examining photographs of different lesions that are depicted in different chapters of a book. The figure captions that label the images, and the title of the chapter the images are presented in, are all strong predictors making it immediately evident to the learner which category each training stimulus belongs to. Obviously, however, those cues will not be available in real clinical settings. If learners can completely suppress any interference from the labels, then there may be little cost to having them present in training. If they cannot be suppressed, this might imply that standard training regimens could be usefully modified. Our partially successful Warning conditions suggest the latter.
We showed that with top-down instructions counteracting the cue competition effects, many learners can effectively suppress their influence. However, it is unlikely that the cue competition effects can be fully eradicated. The effectiveness of top-down instructions may also vary from task to task, which is worth examining in follow-up studies.
Implications for real-world category learning
We conclude with some very tentative suggestions on possible implications for real-world category learning and training. Our studies utilised stimuli with a relatively small number of feature dimensions with continuous feature values, which we believe have at least a moderate resemblance to a certain proportion of real-life human category learning tasks. The three experiments showed that when a deterministic predictor feature was present, learning of some simultaneously presented probabilistic predictors was dramatically impaired. Future research could test whether our findings apply to the real-world setting. Here, we provide some possible directions.
First, training conditions might be profitably structured in a way that obligates the learner to perform the actual categorisation task using the predictors that will be available in the field (but no additional “easy predictors”). If a powerful predictor will not be available at the transfer test, it should not be present in the training process either.
Second, individual differences in the Overshadowing conditions appear to be strikingly large and important, as seen in Experiments 1 and 2. In addition, warning of overshadowing features in Experiments 1 and 2 appeared to have quite different effects on different learners. Akin to the instructions that Kemler Nelson (1984) gave to her participants, warning is a type of top-down modulation (see also Wills et al., 2015). Participants with better attentional control may be able to ignore the overshadowing feature. As a result, their performance in the transfer test becomes more similar to that in the Control condition. Selective attention capacity and short-term and long-term memory might all play a role in the differential performance. Identifying these factors using standardised tests may help educators devise useful auxiliary training interventions.
Finally, it might be worth exploring training regimes in which learners are trained with one feature at a time. In our classification task, features are independently manipulated. That is, the value of one feature does not predict the value of another feature. Some classification tasks in real life may have similar properties. In those cases, independent features could be trained by withholding other features of the stimuli during the training phase. In the case of learning to distinguish two kinds of birds, for example, exemplars of beaks can first be trained, followed by the legs, and so on. Learners may then acquire an independent weight for each feature dimension. Omission of some features in the transfer test stimulus set may then have a less detrimental effect on classification accuracy.
Acknowledgements
The authors are grateful to Michael Waldmann, Andy Wills, Tom Beckers, and three anonymous reviewers for useful comments and discussion on previous versions of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Office of Naval Research (Grant N00014-10-1-0072), by a collaborative activity grant from the James S. McDonnell Foundation, by the Institute of Education Sciences, US Department of Education (Grant R305B070537), and by the National Science Foundation (Grant SBE-0542013 to the UCSD Temporal Dynamics of Learning Center, and Grant SES 1461535 to Michael Mozer). The opinions expressed here are those of the authors and do not represent views of the agencies that have supported this work.
