Abstract
In sentence processing, semantic and syntactic violations elicit differential brain responses observable in event-related potentials: An N400 signals semantic violations, whereas a P600 marks inconsistent syntactic structure. Does the brain register similar distinctions in scene perception? To address this question, we presented participants with semantic inconsistencies, in which an object was incongruent with a scene’s meaning, and syntactic inconsistencies, in which an object violated structural rules. We found a clear dissociation between semantic and syntactic processing: Semantic inconsistencies produced negative deflections in the N300-N400 time window, whereas mild syntactic inconsistencies elicited a late positivity resembling the P600 found for syntactic inconsistencies in sentence processing. Extreme syntactic violations, such as a hovering beer bottle defying gravity, were associated with earlier perceptual processing difficulties reflected in the N300 response, but failed to produce a P600 effect. We therefore conclude that different neural populations are active during semantic and syntactic processing of scenes, and that syntactically impossible object placements are processed in a categorically different manner than are syntactically resolvable object misplacements.
Imagine a world consisting entirely of randomly arranged objects. This would be disconcerting because you have learned over a lifetime that the world is a highly structured, rule-governed place. Objects in scenes, like words in sentences, seem constrained by a “grammar” that you understand implicitly and that allows you to process scenes efficiently. Biederman, Mezzanotte, and Rabinowitz (1982) first applied the terms semantics and syntax to describe different object-scene relational violations. According to their taxonomy, semantic relations concern the probability, position, and size of objects in scenes, as these characteristics require access to object meaning. A green lawn, instead of a carpet, on your office floor would be a violation of scene semantics. In contrast, your laptop hovering above your desk would be a syntactic violation because it defies a physical constraint, the law of gravity. Biederman et al. found that both semantic and syntactic violations in scenes resulted in slower, less accurate object detection (see also Võ & Henderson, 2009).
Here, we use the semantics/syntax distinction slightly differently: Syntax refers to the local positioning of objects within a scene, and semantics refers to the more global relationship of objects to the scene’s meaning. Both our definitions and those of Biederman et al. (1982) are efforts to specify a grammar of scenes, but is this semantics/syntax distinction a merely metaphorical borrowing of terms from linguistics (Henderson & Ferreira, 2004), or does the brain actually distinguish different types of object-scene relationships? We present evidence that distinct event-related brain potentials (ERPs) are associated with semantic and syntactic processing of objects in scenes and that these potentials resemble the potentials associated with semantic and syntactic processing of language.
Ganis and Kutas (2003) used ERPs to investigate the nature and time course of semantic context effects on object identification. After a 300-ms preview of a scene (e.g., soccer players), either a semantically consistent object (soccer ball) or a semantically inconsistent object (toilet paper) appeared in the scene. Ganis and Kutas reported a “scene congruity effect” in the N400 time window for incongruous relative to congruous objects; this effect closely resembled the N400 effect found for violations of semantic expectations elicited by verbal information (e.g., Holcomb, 1993; Kutas & Hillyard, 1980) or pictorial information (e.g., Barrett & Rugg, 1990; McPherson & Holcomb, 1999). Whereas Ganis and Kutas set up expectations by presenting scene previews and had participants attend to the semantic congruity of objects, Mudrik, Lamy, and Deouell (2010) found that even without a preview and without instructions directing attention to semantic congruity, a pronounced, slightly more anterior N300-N400 effect emerged. Thus, task relevance and preactivation of expectations are not prerequisites for detecting semantic anomalies in scenes. The N400 effect has been observed across many types of stimuli (linguistic stimuli, pictures, objects, actions, and sounds; for a review, see Kutas & Federmeier, 2011), which implies that the semantic processing of very different types of input might be based on one common mechanism.
In the language domain, a different ERP component, the P600, has been identified as a marker for syntactic problems that prompt reanalysis of the sentence (e.g., Friederici, Pfeifer, & Hahne, 1993; Osterhout & Holcomb, 1992). If syntactic violations in scenes are processed similarly, then those violations should elicit similar, late positive brain responses. Previous attempts to find ERP components specific to structural processing of scenes have yielded mixed results. Cohn, Paczynski, Jackendoff, Holcomb, and Kuperberg (2012) compared responses to comic strips with violations in the meaning or sequencing of images. They replicated the N300-N400 effects for semantic violations. Although they did not observe P600 responses to sequencing violations, they reported a distinct left-lateralized anterior negativity (LAN), which in studies of language has also been associated with syntactic violations (e.g., Friederici, 2002). Recently, Demiral, Malcolm, and Henderson (2012) manipulated the spatial congruency of objects in scenes while keeping semantic congruency constant. Following a 300-ms scene preview, spatially incongruent objects elicited an early N300-N400 component. However, despite manipulations of structural congruency, no P600 or LAN response was observed.
In the study reported here, we directly compared brain responses to semantic and syntactic violations in images of real-world scenes to test whether such violations produce neural dissociations similar to those in the language domain. In addition to presenting semantically incongruent objects, we created two types of syntactic violations: mild violations, in which objects were merely misplaced within scenes, and extreme violations, in which objects were implausibly balanced or hovered in midair. As we show later, semantic violations were marked by N300-N400 responses, whereas mild syntactic inconsistencies elicited a late positivity resembling the P600 found for syntactic processing of language. Extreme syntactic violations failed to produce a P600 effect.
Method
Stimulus material
We created 608 colored images of real-world scenes by photographing each of 152 different scenes in four versions: (a) with a semantically consistent object in a consistent location (consistent control condition), (b) with a semantically consistent object in a syntactically inconsistent location (inconsistent-syntax condition), (c) with a semantically inconsistent object in a syntactically consistent location (inconsistent-semantics condition), and (d) with a semantically inconsistent object in a syntactically inconsistent location (double-inconsistency condition; see Fig. 1). The latter, double-inconsistency condition was included to control for possible position effects.

Fig. 1. Exemplar images of the four versions of a scene. The versions were created by crossing a manipulation of semantic consistency with a manipulation of syntactic consistency. Thus, in the consistent control condition, a semantically consistent object was in a syntactically consistent location (a computer mouse next to a computer); in the inconsistent-syntax condition, a semantically consistent object was in a syntactically inconsistent location (the mouse on the computer screen); in the inconsistent-semantics condition, a semantically inconsistent object was in a syntactically consistent location (a bar of soap next to the computer); and in the double-inconsistency condition, a semantically inconsistent object was in a syntactically inconsistent location (the bar of soap on the computer screen).
Each observer saw each of the 152 scenes only once during the ERP experiment. The scenes were evenly divided among the four conditions just described. We further divided the inconsistent-syntax condition so that in one third of the scenes, the violation was mild (i.e., mislocated object), and in the other two thirds, the violation was more extreme (i.e., violation of physics: hovering or balancing object; Fig. 2). Fourteen extra scenes were used as fillers for a repetition detection task.

Fig. 2. Example scenes showing the two types of syntax violations used in the inconsistent-syntax condition. Mild violations were created by mislocating otherwise semantically consistent objects (left panel), and extreme violations (violations of physics) were created by showing objects either hovering in midair or critically balancing (right panel). Note that in the extreme case of violating physics, semantic understanding of the object is not necessary for observers to detect the syntactic violation (a hovering beer bottle should be as unexpected as a hovering book).
Images were not created by post hoc insertion of objects into scenes. Rather, hovering objects, for instance, were actually photographed hovering in midair (attached to invisible strings) to ensure realistic lighting conditions and minimize Photoshop editing (see the Supplemental Material available online for more examples of the scenes). In addition, the bottom-up saliency of the critical objects was assessed using Itti and Koch’s (2000) MATLAB Saliency Toolbox. The rank order of saliency peaks assigned to the critical objects was used to ensure that consistent and inconsistent objects did not differ in mean low-level saliency, F < 1.
Participants
Twenty-eight subjects (16 female, 12 male) participated. Their ages ranged from 19 to 35 years (M = 25, SD = 5). All were paid volunteers who gave informed consent. Each had at least 20/25 visual acuity and normal color vision.
Procedure
Participants were seated in a sound-attenuated, dimly lit room, where scenes were presented on a 17-in. monitor with a resolution of 1024 × 768 pixels and a refresh rate of 75 Hz. Stimulus presentation and recording of the subjects’ responses were controlled using MATLAB and Psychophysics Toolbox (Brainard, 1997; Pelli, 1997). The scenes were viewed at a distance of about 70 cm and subtended a visual angle of 26° (horizontal) by 20° (vertical). Participants were told that they would see a series of scenes, each containing one critical object marked by a cue. Each trial began with a blink phase, and then a preview of a scene without the critical object was presented for 500 ms. Next, a red dot appeared at a location in the scene, indicating where to move the eyes and where to expect the critical object to appear. To avoid eye movement artifacts, we instructed participants not to move their eyes away from the cued location and to confine blinking to the initial phase of the trial. Five hundred milliseconds after onset of the cue (plus a random jitter between 0 and 300 ms, to prevent anticipatory effects), the critical object appeared in the scene; it remained visible together with the scene for 2,000 ms (see Fig. 3).
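The reported viewing geometry can be checked with the standard visual-angle relation, size = 2 · d · tan(θ/2). The following Python sketch is illustrative only: the function names are hypothetical, and no conversion to pixels is attempted because the physical dimensions of the display area are not reported.

```python
import math

def size_from_visual_angle(angle_deg: float, distance_cm: float) -> float:
    """Physical extent (cm) that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

def visual_angle(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by size_cm at distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

# Values from the text: 26 deg x 20 deg at a viewing distance of about 70 cm.
width_cm = size_from_visual_angle(26.0, 70.0)
height_cm = size_from_visual_angle(20.0, 70.0)
print(f"{width_cm:.1f} cm x {height_cm:.1f} cm")  # prints "32.3 cm x 24.7 cm"
```

Under these assumptions, the scenes would have covered roughly 32 cm by 25 cm on the screen.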

Fig. 3. Trial sequence. Each trial started with the presentation of a fixation cross that indicated blinking was encouraged. Once ready, subjects pressed a button, which triggered the presentation of a preview scene without the critical object (500 ms). Next, a cue appeared (500 ms plus jitter), and participants moved their eyes to the cued location. Finally, the object appeared at the cued location and remained visible on the screen together with the scene (2,000 ms).
To keep participants engaged in viewing the scenes without signaling the object-scene inconsistencies, we asked them to view each scene carefully and to press a button when they spotted an exact repetition (i.e., a repeated scene with the cued object in the same location as it had been previously). All repeated scenes were filler scenes and were excluded from subsequent analysis; they were taken equally often from all four conditions. Assignment of scenes to the four conditions was randomized across participants using a Latin square design. At the end of the experiment, participants viewed all the scenes again and rated the inconsistency of each object-scene pairing (from 1, very consistent, to 6, very inconsistent).
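One way to realize such a Latin square assignment is to rotate the mapping of scenes to conditions from one participant to the next. The sketch below is an assumption about how this could be done, not a description of the actual procedure: the rotation rule and the name `assign_conditions` are hypothetical, and the further split of the inconsistent-syntax condition into mild and extreme violations is ignored.

```python
from collections import Counter

CONDITIONS = (
    "consistent control",
    "inconsistent syntax",
    "inconsistent semantics",
    "double inconsistency",
)
N_SCENES = 152  # number of experimental scenes reported in the text

def assign_conditions(participant: int) -> list[str]:
    """Condition in which each scene (by index) is shown to one participant.

    Rotating Latin square: the condition index is shifted by the
    participant number, so each scene cycles through all four
    conditions across every four consecutive participants.
    """
    k = len(CONDITIONS)
    return [CONDITIONS[(scene + participant) % k] for scene in range(N_SCENES)]

# Each participant sees every scene once, 38 scenes per condition.
counts = Counter(assign_conditions(0))
print(counts)
```

With this scheme, every participant sees each scene exactly once, and every scene appears equally often in each condition across participants whenever the sample size is a multiple of four.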
Electrophysiological recording and analysis
The electroencephalogram (EEG) was recorded from 64 scalp sites (positioned according to the 10-20 system) assigned to nine regions (see Mudrik et al., 2010), a vertical eye channel for detecting blinks, two horizontal eye channels for monitoring saccades, and two additional electrodes affixed to the mastoid bones. The EEG was acquired with the Active Two Biosemi system (BioSemi B.V., Amsterdam, The Netherlands) using active Ag-AgCl electrodes. All channels were referenced off-line to the averaged signals from the mastoid electrodes. The EEG was recorded at a 512-Hz sampling rate and was high-pass filtered off-line at 0.1 Hz (24 dB/octave) to remove slow drifts. It was subsequently segmented into 1,000-ms epochs time-locked to the onset of the cued object, and waveforms were averaged separately for each of the trial types. Average waveforms were low-pass filtered with a cutoff of 30 Hz, and each epoch was baseline-adjusted by subtracting the mean amplitude in the prestimulus period (−100 ms to 0 ms) from all the data points in the epoch. Trials with blinks, eye movements, and muscle artifacts were rejected prior to averaging (12% of all epochs).
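The segmentation, baseline adjustment, and averaging steps can be illustrated with a minimal single-channel sketch in Python. It is not the analysis code used in the study: the function names are hypothetical, the signal is synthetic, re-referencing, filtering, and artifact rejection are omitted, and the epoch is assumed to span the 100-ms prestimulus baseline plus 1,000 ms after object onset.

```python
from statistics import fmean

FS = 512  # sampling rate in Hz, as reported in the text

def ms_to_samples(ms: float, fs: int = FS) -> int:
    """Convert a duration in milliseconds to the nearest number of samples."""
    return round(ms * fs / 1000.0)

def extract_epoch(signal, onset, pre_ms=100, post_ms=1000, fs=FS):
    """Cut one epoch around onset (a sample index) and subtract the prestimulus mean."""
    n_pre = ms_to_samples(pre_ms, fs)
    n_post = ms_to_samples(post_ms, fs)
    if onset - n_pre < 0 or onset + n_post > len(signal):
        raise ValueError("epoch extends beyond the recording")
    segment = signal[onset - n_pre : onset + n_post]
    baseline = fmean(segment[:n_pre])  # mean amplitude from -100 ms to 0 ms
    return [value - baseline for value in segment]

def average_epochs(epochs):
    """Pointwise average across equally long epochs (the ERP)."""
    return [fmean(samples) for samples in zip(*epochs)]

# Synthetic channel: constant 5-unit offset plus a 2-unit step after each onset.
signal = [5.0] * 2000
onsets = (600, 1300)
for onset in onsets:
    for i in range(onset, onset + 100):
        signal[i] += 2.0

erp = average_epochs([extract_epoch(signal, onset) for onset in onsets])
```

In this toy example, the baseline adjustment removes the constant offset, so the averaged waveform is zero before object onset and shows only the stimulus-locked step afterward.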
Results
Behavioral results
Repetition detection task
The overall error rate on this task averaged 3%. Trials that were erroneously reported as repetitions were excluded from the ERP analyses.
Inconsistency ratings
Inconsistent objects were rated higher on the inconsistency scale than consistent objects were (consistent control: M = 1.26; inconsistent syntax/mild violation: M = 3.13; inconsistent syntax/extreme violation: M = 4.38; inconsistent semantics: M = 4.33), all ts(27) > 11.0. Rated inconsistency was lower for mild syntactic violations than for extreme syntactic violations and semantic violations, t(27) = 5.21, p < .01, and t(27) = 7.12, p < .01, respectively. Rated inconsistency did not differ between the latter two trial types, t < 1.
ERP results
We were interested in whether we could replicate N300-N400 incongruity effects previously found for semantically inconsistent objects in scenes. In addition, we wanted to test whether syntactically inconsistent objects would elicit a late positivity resembling the P600 response known from sentence processing. Figure 4 shows the scalp distributions of ERP difference waves for the inconsistent-semantics, inconsistent-syntax/extreme-violation, and inconsistent-syntax/mild-violation trials (each inconsistent trial type minus the consistent control condition) in three time windows: 250–350 ms (N300), 350–600 ms (N400), and 600–1,000 ms post object onset (P600; see Sitnikova, Holcomb, Kiyonaga, & Kuperberg, 2008).

Fig. 4. Scalp distributions of event-related potential difference waves (inconsistent condition minus consistent control condition) in the N300, N400, and P600 latency windows. Results are shown separately for the inconsistent-semantics, inconsistent-syntax/extreme-violation, and inconsistent-syntax/mild-violation trial types.
These scalp plots clearly show a dissociation between semantic processing (semantic violations) and syntactic processing of mislocated objects (mild syntactic violations). Figure 5 presents grand-average waveforms from the midcentral region for the inconsistent-semantics, inconsistent-syntax/extreme-violation, inconsistent-syntax/mild-violation, and consistent control trial types. Semantic inconsistencies triggered a negative response in the N300 and N400 time ranges. In contrast, mild syntactic inconsistencies elicited a significant positive response in the P600 range and a trend for a positive response in the N400 time window. Extreme syntactic inconsistencies produced a significant negative response in the N300 range and no positive response in the P600 range (statistics are reported later in this section).

Fig. 5. Grand-average event-related potential (ERP) waveforms measured at the midcentral region (electrodes FC1, FCz, FC2, C1, Cz, C2, CP1, CPz, and CP2; see the scalp diagram) for the consistent control, inconsistent-semantics, inconsistent-syntax/extreme-violation, and inconsistent-syntax/mild-violation trial types. The time windows for the N300, N400, and P600 ERP components are highlighted.
To test whether the observed ERPs for the different inconsistency types differed significantly from the ERPs in the consistent control condition, we measured the mean amplitude for each time window for each trial type and submitted these values to paired t tests (see Fig. 6). We confined our analyses to the midcentral region (averaged across electrodes FC1, FCz, FC2, C1, Cz, C2, CP1, CPz, and CP2), which has previously shown strong “N390 scene congruity effects” (Ganis & Kutas, 2003, p. 129), as well as P600 effects in sentence processing (e.g., Osterhout & Holcomb, 1992).

Fig. 6. Effects of inconsistency on amplitude of the event-related potential in the midcentral region in the N300, N400, and P600 latency windows. For each time window, the graph shows the mean difference in amplitude between the consistent control condition and the inconsistent-semantics, inconsistent-syntax/extreme-violation, and inconsistent-syntax/mild-violation trial types, respectively (each inconsistent condition minus the consistent control condition). Error bars depict ±1 SE. Asterisks indicate statistically significant differences from the consistent control condition (*p ≤ .05, **p < .01).
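The window-based analysis, in which mean amplitude per latency window is compared with the control condition by paired t tests, can be sketched in Python as follows. The function names and the toy numbers are hypothetical, and treating the windows as half-open intervals (so that a sample at exactly 350 ms is counted only once) is an assumption rather than something stated in the text.

```python
import math
from statistics import fmean, stdev

# Latency windows in ms after object onset, as reported in the text.
WINDOWS_MS = {"N300": (250, 350), "N400": (350, 600), "P600": (600, 1000)}

def window_mean(erp, times_ms, start_ms, end_ms):
    """Mean amplitude of one waveform over the samples with start_ms <= t < end_ms."""
    return fmean(v for v, t in zip(erp, times_ms) if start_ms <= t < end_ms)

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-participant values a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Toy per-participant window means (arbitrary numbers, not data from the study).
inconsistent = [1.0, 2.0, 3.0, 4.0]
control = [0.0, 0.0, 1.0, 1.0]
t_value, df = paired_t(inconsistent, control)
print(f"t({df}) = {t_value:.2f}")
```

In an analysis of this kind, `window_mean` would be applied to each participant's region-averaged waveform in each condition, and the resulting values would be passed to `paired_t` once per window and inconsistency type.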
N300 time window (250–350 ms)
In this time window, both semantic violations and extreme syntactic violations elicited significantly more negative responses than the consistent control condition—semantic violations: mean difference = −1.65 µV, t(27) = 3.67, p < .01; extreme syntactic violations: mean difference = −1.06 µV, t(27) = 3.08, p < .01. The mild syntactic violations did not elicit an effect in this time window, t < 1.
N400 time window (350–600 ms)
In this time window, we observed a pronounced negative response to semantic inconsistencies (−1.50 µV), t(27) = 2.70, p < .05. The extreme syntactic violations did not elicit an effect in this window, t < 1, whereas the mild syntactic violations elicited a marginally significant positive response (+1.33 µV), t(27) = 2.03, p = .05.
P600 time window (600–1,000 ms)
In this time window, we observed a strong positive response to mild syntactic inconsistencies (+2.33 µV), t(27) = 3.00, p < .01, whereas neither semantic inconsistencies nor extreme syntactic violations elicited responses significantly different from those in the consistent control condition, t < 1 and t(27) = 1.23, p = .22, respectively.
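For readers who wish to relate a reported t statistic to its two-tailed p value, the following Python sketch integrates the density of Student's t distribution numerically. The function names are hypothetical, and Simpson's rule is used here only to keep the example free of external libraries; a statistics package would normally be used instead.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)

# Example: the statistic reported for mild syntactic violations in the P600 window.
p_mild = two_tailed_p(3.00, 27)
print(f"p = {p_mild:.4f}")
```

With 27 degrees of freedom, the two-tailed critical values are approximately 2.05 for p = .05 and 2.77 for p = .01.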
Double-inconsistency condition
Results for the double-inconsistency condition (semantic and syntactic violations combined) showed that when the syntactic violation was extreme (i.e., a violation of physics), responses were very similar to those observed for the inconsistent-syntax/extreme-violation trials (e.g., it did not matter whether a hovering object in a kitchen was a banana or a toothbrush). However, when the syntactic violation was mild, the response looked more like the response to semantic inconsistencies (e.g., the response to a football in a kitchen was essentially the same regardless of where in the kitchen the football was located).
Discussion
The aim of our study was to investigate whether semantic and syntactic processing of objects in scenes draw on qualitatively different neural mechanisms. To do this, we used electrophysiological markers known to distinguish semantic and syntactic processing in the language domain. We replicated previous findings of an early N300 component suggesting initial difficulties in perceptual processing of inconsistent objects (e.g., Eddy, Schmid, & Holcomb, 2006; McPherson & Holcomb, 1999). We also found an N400 for semantic inconsistencies, which might signal increased postidentification processing based on semantic knowledge (e.g., Ganis & Kutas, 2003; Mudrik et al., 2010).
Most important, we demonstrated a clear dissociation between semantic and syntactic processing, as we found a late positive response to syntactic scene violations resembling the P600 response to syntactic inconsistencies in language. This late positivity was observed only for mild syntactic violations (i.e., mislocated objects). More extreme syntactic violations elicited no such effect. We speculate that although a pot in a kitchen (Fig. 2, upper left illustration) is nothing unusual, an unexpected structural relationship between the pot and its usual location within the kitchen may trigger scene reanalysis marked by a P600 response.
Violations of physics, such as a hovering beer bottle (Fig. 2, upper right illustration), may be odd enough to impede initial perceptual processing, triggering an N300 effect, but such violations may be too odd to permit resolution by the later scene reanalysis that yields a P600. This interpretation is consistent with findings that extremely ungrammatical sentences do not elicit a P600 response (see Hopf, Bader, Meng, & Bayer, 2003). It might also explain why Demiral et al. (2012) did not observe P600 responses but instead observed early N300-N400 effects: In addition to repeating a limited set of scenes, they did not distinguish between mild (e.g., a painting on the floor) and extreme (e.g., a bus in the air) structural manipulations. Averaging across these different types of structural violations might have concealed the P600 response. On the basis of our data, we suggest that a hovering bus constitutes such an extreme syntactic violation that reanalysis is impeded.
Even more than in linguistics, the distinction between syntax and semantics in scenes is not always clear-cut. Biederman et al. (1982) offered a first, thought-provoking classification of object-scene inconsistencies according to which only inconsistencies with physical constraints, such as gravity, are considered syntactic. We have proposed that syntactic processing goes beyond physical constraints to include evaluation of relative object positions within a scene. A computer mouse, sitting on top of a computer screen, is physically legal, but unexpected with regard to the local, interobject arrangement of the scene, which makes it a syntactic violation by our usage. Semantic processing, in contrast, examines object meaning relative to the globally assessed semantic scene category. In the preceding example, the mouse is in the wrong location with regard to the computer, but in the right scene, which makes it semantically congruent. In our data, we found different neural signatures for “plausible object in the wrong arrangement” and “implausible object in this scene,” and these signatures were similar to those seen for “plausible word in the wrong arrangement” and “implausible word in this sentence.” This similarity suggests that there might be some commonality in the mechanisms for processing meaning and structure across a wide variety of cognitive tasks. As Chomsky (1965, 1987, 2006) might put it, the general principles of language are not entirely different from the general principles of thought, including thought about visual scenes.
Footnotes
Acknowledgements
We thank John Gabrieli and Marianna Eddy at MIT for their invaluable support of this project.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
This work was supported by Office of Naval Research (ONR) Grant N000141010278 to J. M. Wolfe and by German Research Foundation (DFG) Grant VO1683/1-1 and National Institutes of Health (NIH) Grant F32EY022558 to M. L.-H. Võ.
