Abstract
Pitch accent serves multiple duties (encoding lexical accent, syntactic structure, and focus) in spoken Japanese. This study investigates how listeners interpret a role-ambiguous pitch prominence surfacing as F0 rise, which could be a cue to the resolution of a syntactic ambiguity between two possible branching structures, or a signal of contrastive focus on the constituent accompanied by the rise. Two visual world paradigm experiments tested the same Japanese linguistic stimuli with and without pitch emphasis on the second word of structures of the following form: modifier + N1 + N2. In Experiment 1, the visual context suppressed the availability of the contrastive interpretation; in Experiment 2, the visual context made the contrastive interpretation available. We found that the same pitch event can be interpreted as both syntax-encoding and contrast-encoding information within the course of processing the same sentence, as long as contextual information is made visually available. When contrastive focus is pragmatically felicitous, it is computed immediately, as soon as the incoming input is accompanied by a notable pitch prominence (Experiment 2). The same prosodic cue can then be re-interpreted as a signal to syntax after the branching ambiguity is recognized due to subsequent input (Experiments 1 and 2). This is most consistent with the view that an initially assigned prosodic boundary is exploited for re-interpretation.
1 Introduction
Extensive research has examined how prosodic information influences or guides sentence comprehension (for reviews, see Cutler, Dahan, & van Donselaar, 1997; Wagner & Watson, 2010). Most studies focus on prosodic boundaries (e.g., by controlling prosodic breaks). Prosodic boundaries can serve as a relatively direct cue to prosodic phrasing, which supposedly indicates how constituents are grouped together and can in turn signal correspondence to syntactic phrasing in more or less straightforward ways.
In contrast, prosodic prominence phenomena, particularly pitch accent, are mostly discussed as useful information in computing discourse or information status. The immediate effects of prominence in online spoken sentence processing have been reported in a number of studies in English (e.g., Carlson, 2001; Dahan, Tanenhaus, & Chambers, 2002; Ito, Bibyk, Wagner, & Speer, 2012; Ito & Speer, 2008; Kurumada, Brown, Bibyk, Pontillo, & Tanenhaus, 2014; Pierrehumbert & Hirschberg, 1990) and cross-linguistically (Ito, Jincho, Minai, Yamane, & Mazuka, 2012; Li, Hagoort, & Yang, 2008; Weber, Braun, & Crocker, 2006). Evidence for an immediate effect of pitch accent in the processing of discourse or information structure in languages like Chinese and Japanese is especially worth attention; in these languages, pitch accent primarily conveys lexical information (by way of either lexical tone or lexical pitch accent) and therefore its extra-lexical function could be expected to be less informative.
Whether pitch prominence can have an effect on syntactic parsing is a matter of debate, but only a relatively small number of studies have focused on this issue. Some of these studies have demonstrated that, in English attachment ambiguities where there are two possible syntactic heads for the incoming modifier in the structure built so far, the accented head tends to attract attachment (Carlson & Tyler, 2017; Lee & Watson, 2011; Schafer, Carter, Clifton, & Frazier, 1996). These studies take different stands in accounting for the nature of this effect. One view is that the accented head is understood as focused, and thus more important to the main assertion of the sentence (Schafer et al., 1996), while the other maintains that the accented head is more salient in memory and thus draws attachment (Lee & Watson, 2011). Although the link between pitch accent and syntactic structure allows different theoretical stands, the empirical finding that an accented head attracts attachment is uncontroversial and consistent across different syntactic structures in English (Carlson & Tyler, 2017).
2 Pitch accent in processing spoken Japanese
The role that pitch prominence plays in syntactic processing can become more complicated in languages where pitch information is already serving multiple duties (including lexical-level distinctions) such as Chinese (lexical tone) and Japanese (lexical pitch accent). In this paper, we focus on Japanese, where pitch prominence plays a role in guiding spoken sentence comprehension at three levels.
2.1 Lexically conditioned pitch accent phenomena: lexical pitch accent and downstep in Japanese
Words in Japanese have lexically determined pitch accent. A word is either accented or unaccented and, if it is accented, the position of the accent is lexically specified. The HL tonal fall occurs at the accented position. In example (1a) below, the accented word ne’ko (“cat”) carries the accent on the initial mora (an apostrophe indicates the position of the accented mora for the word). The accented H tone of the HL fall is linked to the accented mora, and the following L tone is assigned to the subsequent mora. The unaccented word tori (“bird”) in (1b) emerges with an LH sequence because in Tokyo Japanese, the first mora receives an L followed by H (initial rise). That H tone spreads rightwards if the word is more than two moras long (for detailed descriptions of the association between lexical accent and surface tone, see Haraguchi, 1977; McCawley, 1968; Poser, 1984). The F0 of the H peak in accented words tends to be realized higher than that of unaccented words (accentual boost; Kubozono, 1988).
(1) a. ne’ ko “cat”
       H   L
    b. to ri “bird”
       L  H
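The tone assignment just described can be sketched in a few lines of Python. This is only a minimal illustration, not a full model of Tokyo Japanese tonology: the function name `surface_tones` and the 0-based `accent` index are illustrative conventions, and the treatment of words accented on a non-initial mora (initial L, H up to the accent, L afterwards) is an assumption based on standard descriptions rather than something spelled out above.

```python
from typing import List, Optional


def surface_tones(moras: List[str], accent: Optional[int]) -> List[str]:
    """Return one tone label ("H" or "L") per mora.

    `accent` is the 0-based index of the accented mora, or None for an
    unaccented word. Hypothetical helper; simplified sketch only.
    """
    tones = []
    for i in range(len(moras)):
        if accent is not None and i == accent:
            # The H of the HL fall is linked to the accented mora.
            tones.append("H")
        elif accent is not None and i > accent:
            # Moras after the accent carry L.
            tones.append("L")
        else:
            # Unaccented word, or moras before the accent:
            # initial L followed by H that spreads rightwards (assumption
            # for the pre-accent case).
            tones.append("L" if i == 0 else "H")
    return tones


print(surface_tones(["ne", "ko"], 0))       # accented on the first mora
print(surface_tones(["to", "ri"], None))    # unaccented
print(surface_tones(["a", "ka", "i"], None))
```

Under these assumptions, ne’ko surfaces as H L and tori as L H, matching (1a) and (1b).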
A relatively rich literature demonstrates that lexical pitch accent information plays an immediate role in word recognition (Cutler & Otake, 1999; Otake & Cutler, 1997; Sato, Ito, & Mazuka, 2008).
Interestingly, the purely lexical property of pitch accent is further related to phrasal level prosody. Quite independently of accentual boost, introduced above, lexical accent has an effect on subsequent words, in that an accented word lowers the H peak of the following words. This is called downstep (Poser, 1984) (or catathesis; Pierrehumbert & Beckman, 1988). It results in the H peak of ne’ko in (2a) being realized lower than that in (2b), because the former is preceded by an accented word (ao’i) and therefore is subject to downstep, while the latter is preceded by an unaccented word (akai), and therefore is not.
(2) a. ao’i ne’ko
       blue cat
    b. akai ne’ko
       red cat
The downstep can occur cumulatively over more than two words, each immediately preceded by an accented word, resulting in a staircase-like F0 contour (3), unless there is some factor enforcing the start of a new prosodic phrase, thereby blocking downstep from continuing, or interfering with the tonal contour resulting from downstep to hinder the surface staircase-like pattern from emerging.
(3) ao’i ne’ko-no ka’sa
    blue cat-Gen umbrella
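The cumulative, staircase-like lowering in (3) can be made concrete with a toy calculation. The sketch below is purely illustrative: the starting peak of 300 Hz, the scaling factor of 0.8, and the function name `peak_contour` are hypothetical choices, not measurements or a model proposed in the literature cited here. It also ignores accentual boost and every other pitch-raising factor discussed in this section. The optional `resets` argument stands in for any factor that starts a new prosodic phrase and thereby blocks downstep.

```python
from typing import List, Sequence


def peak_contour(accented: Sequence[bool],
                 start_hz: float = 300.0,
                 step: float = 0.8,
                 resets: Sequence[int] = ()) -> List[float]:
    """Toy H-peak heights (Hz) for a sequence of words.

    accented[i] -- whether word i is lexically accented.
    step        -- hypothetical scaling applied to a word's peak when the
                   immediately preceding word is accented (downstep).
    resets      -- word indices that begin a new prosodic phrase, where
                   the peak level returns to start_hz.
    """
    peaks = []
    current = start_hz
    for i in range(len(accented)):
        if i in resets:
            current = start_hz      # new phrase: downstep is reset
        elif i > 0 and accented[i - 1]:
            current *= step         # preceding accent triggers downstep
        peaks.append(current)
    return peaks


# (3): three accented words, downstep applies twice.
print(peak_contour([True, True, True]))
# (2b): unaccented akai does not lower the following peak.
print(peak_contour([False, True]))
# Same words as (3), but with a new phrase starting at the second word.
print(peak_contour([True, True, True], resets=[1]))
```

With these made-up values, the first call yields a descending staircase, while the third shows the second peak restored to the initial level before downstep resumes.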
In a series of accented words undergoing downstep, the phrase-initial accented word tends to be realized higher in F0 as a function of the number of downstepping words to follow (especially when downstep continues over more than two words). This is called anticipatory raising (Rialland, 2001) and is evidence that the speaker looks ahead several words to decide on the pitch register of the utterance-initial item. There is another piece of evidence that the surface pitch of a word is planned by reference to the subsequent part of the sentence, that is, the number of accented words (or, more precisely, of minor phrases; Selkirk, 2000) to follow: Kubozono (1988) observed that the F0 peak of the third accented word is realized higher when the whole phrase involves four accented words instead of three. This is called rhythmic boost, and was reconfirmed in later studies (e.g., Shinya, Selkirk, & Kawahara, 2004), demonstrating that syntactic structure and the number of prosodic phrases being involved can influence the prosodic structure independently.
The domain of downstep defines a prosodic phrasing called Major Phrase (Kubozono, 1988; Poser, 1984; Selkirk & Tateishi, 1991) or intermediate phrase (Pierrehumbert & Beckman, 1988). Downstep is reset at the beginning of a new prosodic phrase (Major Phrase or intermediate phrase), which results in recovering the higher F0 peak level. What complicates the issue is that it is not always clear when it re-sets because the F0 peak height can be elevated for different reasons, as discussed above (e.g., rhythmic boost), not necessarily involving a prosodic phrase boundary. Thus far, we have seen that the same or very similar surface phenomena of pitch prominence (elevation of H of the pitch peak) can be accounted for as (part of) different phonological events, with or without involving phonological phrasing. At this point, it may not be clear how the theoretical arguments introduced thus far relate to the syntactic or discourse structure of sentences, but we revisit this issue in our general discussion (Section 5), as our results have a potential implication regarding such arguments. Next, we will discuss yet more factors, at higher levels of linguistic representation, that lead to pitch prominence phenomena.
2.2 Syntax-encoding pitch accent in Japanese
Studies have found that the realization of downstep is affected by syntactic structure in Japanese (Kubozono, 1988; Selkirk & Tateishi, 1991). Whether pitch prominence phenomena can serve as a direct signal to syntactic structure in English is controversial, but there is an established finding in Japanese that the surface tonal outcome of downstep depends on the syntactic branching associated with the phrase that undergoes downstep. Consider example (3) again, repeated here:
(3) (repeated) ao’i ne’ko-no ka’sa
               blue cat-Gen umbrella
(3) can mean “an umbrella with a blue cat on it” by assuming that “blue” modifies the immediately following word, “cat,” as shown in (4a). It can also have an alternative interpretation by assuming that “cat” and “umbrella” first form a unit, which is then modified by “blue,” as in (4b).
(4) a. [[ao’i ne’ko]-no ka’sa] (Left Branching (LB))
       blue cat-Gen umbrella
       “umbrella with blue cats on it”
    b. [ao’i [ne’ko-no ka’sa]] (Right Branching (RB))
       blue cat-Gen umbrella
       “blue umbrella with cats on it”
The syntactic structures for (4a) and (4b) will be referred to as left-branching (LB) and right-branching (RB), respectively, throughout this paper. The LB structure has been identified as a default preferred interpretation by native speakers; it is realized with a downstep continuing over all three prosodic words that are lexically accented, as we have already seen in (3) and (4a). In contrast, in the RB structure, which is considered somewhat marked, prosody marks this non-default structure by expanding the pitch of the H peak of the second word (in this case, ne’ko ‘cat’), counteracting the descending pitch pattern produced by downstep, as shown in (4b).
One interpretation of the prosodic marking associated with RB would be to take the rising of F0 as an indication of a new prosodic phrase (Selkirk & Tateishi, 1991), in which case the downstep is re-set once at the second prosodic word, inserting a prosodic boundary between the first (e.g., ao’i) and the second (e.g., ne’ko) prosodic word. In contrast, Kubozono (1988, 1993) called this phenomenon of elevated pitch peak metrical boost, whereby the F0 of the H peak is boosted on an element that lies at the left edge of a right-branching syntactic structure without starting a new prosodic phrase. 1
The extent to which native speakers actually distinguish the two branching structures with recourse to the difference between the default downstepping pattern and a pattern involving metrical boost is subject to debate. Inconsistency or variation among individual speakers in ways of encoding the distinct structures into different F0 patterns (Hirose, 2006) would make it difficult for listeners to estimate the reliability of surface pitch difference as a cue to resolve branching ambiguity. In this study’s experiments, we test whether metrical boost is effectively and promptly used by listeners during online processing to resolve the left-branching versus right-branching ambiguity.
So far, we have discussed that some information about upcoming input in Japanese can be provided by the F0 peak height of constituent words and downstep, including the occurrence of various types of pitch boosts (accentual boost, rhythmic boost, and metrical boost) that can modify or even counteract the surface output of downstep. This information may include the number of constituents to follow within the domain of the current downstep, and whether the incoming materials form a left-branching or a right-branching syntactic structure. We will next discuss yet another linguistic dimension that affects the realization of the prosodic pattern with respect to F0.
2.3 Pitch prominence and focus
In Japanese, as in many other languages, pitch prominence, or an expanded pitch range, can express extra-syntactic information such as discourse and information status conveyed by focus. Focus refers to a part of an utterance that is particularly important or informative, including emphasis on new information (as opposed to given, shared information), wh-phrases as the target of interrogative clauses, and contrastive information, as in (5). (Capital letters indicate the word in focus.)
(5) ao’i NE’KO
    blue CAT
    (e.g., as opposed to a blue squirrel, in response to a question such as “Is that a blue squirrel?”)
Whether or not the pitch prominence that encodes the focus status of a word involves the start of a new prosodic phrase is again an issue among researchers. Poser (1984) demonstrated that emphasized elements are pronounced with boosted pitch compared to the same word in the same syntactic structure without emphasis, and that this boosting phenomenon is independent of the occurrence of downstep (i.e., this boost does not block downstep, hence does not affect prosodic phrasing). Shinya (1999) presented evidence that the focused item can undergo downstep if preceded by an accented word; that is, downstep and focus are independent phenomena. What this means is that even if focus causes notable pitch expansion, it could still be subject to concurrent downstep. In contrast, Pierrehumbert and Beckman (1988) attributed the pitch rise on focus to the presence of a prosodic boundary.
In addition to a pitch boost on a focused item itself, pitch range is known to be notably compressed or reduced following the focused item (post-focal reduction; Ishihara, 2003). This is illustrated in Figure 1, where the pitch track on the left demonstrates the pitch expansion on the focused item ne’ko-no (“cat-Gen”) and a post-focal reduction on a’isu (“popsicle”), where the pitch range is notably compressed, as compared to the pitch track on the right-branching phrase without a focus (see Ishihara, 2011, for a more detailed illustration).

Figure 1. Example pitch tracks of phrases with a contrastive focus on W2 (left) (suppose ne’ko “cat” is assigned a contrastive interpretation to yield either “popsicle with blue CATS, as opposed to popsicles with something else that’s blue” or “blue popsicle with CATS as opposed to other blue popsicles with something else”), and a metrical boost on W2 (right) without a focus.
Whether this post-focus phenomenon is independent of prosodic phrasing is yet another issue under debate (for a review, see Ishihara, 2011). While some researchers have claimed that post-focus materials are incorporated into the same prosodic phrase (post-focal dephrasing; Nagahara, 1994), others have observed that there can be a prosodic boundary in the post-focus domain (Sugahara, 2003). We shall come back to this point in the final discussion of the experiments.
If the focus is signaled by a sequence of pitch events (the pitch boost on the focused item followed by the post-focal reduction of the pitch range), the processing of the focused element itself would not be sufficient for recognizing its focus status, which would become apparent only when the post-focal part of the input becomes available. However, previous psycholinguistic evidence in Japanese suggests that in real-time incremental processing, listeners rely on the signal of an increased pitch peak of the focused element, rather than waiting to confirm its status by checking for post-focal reduction. Ito, Jincho, et al. (2012) reported results from a series of visual world paradigm studies with adult and child (six-year-old) Japanese speakers. Both groups fixated on the contrastive target in response to pitch expansion on the contrastive information (e.g., pink cat → GREEN cat) when the contrastive status of the target had been established between the two successive trials. For example, “GREEN” facilitated looks to a picture of a green cat among other green-colored animals in the time interval before the head information (type of animal) became available, after participants had responded to a trial in which they heard “pink cat.” This is compelling evidence that listeners compute contrastive information as soon as the signal of focus-related pitch expansion becomes available, given a focus-eliciting context established by the preceding trial. Further evidence comes from a study by Kitagawa and Hirose (2012). The study demonstrated that when Japanese listeners decide how to interpret ambiguous wh-scope (between a matrix and embedded clause), which will be disambiguated by the endpoint of the post-focal reduction domain (i.e., at the end of either the embedded or the matrix clause), they do not necessarily wait until the post-focus part of the sentence.
Instead, they tend to make the decision by weighing the degree of F0 elevation of the sentence-initial wh-focused item, which tends to vary as a function of the length of the domain of post-focal pitch reduction (i.e., the longer the reduction will be, the higher the initial pitch register is), as an indirect cue to wh-scope. This also suggests that sentence comprehension in Japanese in general tends to rely heavily on pitch rise, which can be a theoretical or practical problem, as we address below.
2.4 Ambiguity in pitch prominence: what does it signal?
As we have discussed, pitch prominence in Japanese can influence comprehension of spoken sentences at multiple levels of linguistic representation. In Japanese, F0 information’s multiple duties can lead to ambiguities in how to interpret certain F0 phenomena. This paper focuses on the potential ambiguity between two distinct pitch events, metrical boost and focus-driven pitch prominence, which are qualitatively different but could be perceptually very similar. That is, when a phrase contains LB versus RB structural ambiguity (as seen in (4)), an expanded pitch on the phrase-second item can be interpreted as metrical boost, signaling the RB structure, as in (6b). If the left-branching structure were intended, the second word (N1) ne’ko would have to be pronounced with downstep, as in (6a).
(6) ao’i ne’ko-no …(ka’sa)
    blue cat-Gen …(umbrella)
    a. ao’i ne’ko-no … [downstep on ne’ko]
    b. ao’i ne’ko-no … [expanded pitch on ne’ko]
At the same time, the notable pitch elevation on ne’ko-no (as in (6b)) could also be part of a contrast-marking event, as discussed in 2.3, as long as the context in which it is produced is compatible with such an interpretation, for example, when the referent stands in a contrastive relationship with some other entity, as in ao’i NE’KO (as opposed to a blue squirrel), or ao’i [NE’KO-no ka’sa] (as opposed to a blue umbrella with a squirrel printed on it).
In order to correctly identify the pitch prominence on ne’ko as part of a focus-marking event, the listener would have to confirm that the subsequent item is subject to post-focal pitch range reduction. As we have already discussed, however, there is good reason to believe that listeners make processing decisions incrementally without waiting for more confirming information to appear later. If so, such a hasty decision would only be motivated when the context clearly supports the contrastive interpretation, and therefore encourages the listeners to anticipate focus marking. Ito, Arai, and Hirose (2015) tested the possibility of multiple interpretations of pitch expansion on the second item of NPs with potential LB versus RB ambiguity, using phrases comprising a color adjective + N1-Gen + N2, where all three relevant words were lexically unaccented. Because the occurrence of downstep was technically not an issue as there was no accented word to trigger it, the informativeness of the phrase-medial pitch expansion (on the second prosodic word, henceforth “W2”) in terms of branching structure would be less obvious. Nevertheless, because the W2 pitch rise should block the dephrasing of the three accentless items, which would have been observed as the default prosody, it could still function as a signal of RB syntax. Ito et al. manipulated contrastive status in their experiments’ prime-target design. For example, for a target trial asking for “pink + FROG + hat,” the preceding prime trial asked for either a hat with a pink pig on it (LB-prime) or a pink hat with a pig on it (RB-prime), which could be uniquely identified in the provided visual scene as there was only one visual object relevant to the linguistic description. That is, the prime trials did not allow branching ambiguity. 
While the results of the study’s eye-tracking experiments confirmed that W2 pitch rise increased bias toward the RB interpretation, the study did not find direct evidence that pitch prominence on W2 elicited a contrastive interpretation. It instead found that the listeners’ comprehension generally slowed down when the contrastive interpretation of the pitch expansion did not match the contrastive status provided by the visual context. Ito et al. thus provided support for the idea that one and the same prosodic cue can be considered a syntactic cue and a contrastive cue within the same trial to some extent, or at least that the processing of W2 pitch expansion is sensitive to contextual information. What has not been found so far is positive direct evidence that pitch prominence actually facilitates the contrast analysis independently of the effect of increasing the RB-interpretation preference.
The primary goal of this paper is to investigate how listeners interpret the (potentially) ambiguous F0 rise on W2, that is, the second prosodic element (in this case, N1-Gen) of an adj + N1-Gen + N2 sequence with LB and RB ambiguity. We are particularly interested in the timing at which the interpretation of the pitch rise as contrastive information or as the RB-signaling metrical boost takes place. Experiment 1 has two goals. First, it is intended to reconfirm the effectiveness of the F0 rise on W2 as a cue to RB syntax. Second, it aims to identify the timing of the effect, when the metrical boost is the only reasonable interpretation of the pitch event due to the absence of a context that would allow listeners to accommodate a contrastive entity to validate the contrastive interpretation of the pitch prominence. Experiment 2, in contrast, creates a situation where the pitch rise could present an ambiguity between a cue to syntax and a cue to contrastive focus. Will there be a tug-of-war between the two alternative interpretations of that cue, or will both of them affect interpretation at different times in the course of input processing? Which interpretation will eventually determine the final interpretation?
3 Experiment 1
Experiment 1 examines the effect of pitch prominence on the second word (W2) of structures taking the form: adj + N1-Gen + N2. The linguistic input is structurally ambiguous (LB or RB), and the listeners were instructed to indicate their interpretation by clicking on one of the eight objects in a visual scene presented on a computer monitor. The visual scene always had one object that corresponded to the LB interpretation (LB-target) and another object that corresponded to the RB interpretation (RB-target). The idea is that when the portion of the sentence in question in the audio material was subject to downstep, there would be an overall bias toward LB, while when the audio material had a pitch rise on W2 (N1), the choice of the RB interpretation would increase. We use online eye-movement patterns to identify the timing at which the prosodic manipulation takes effect. In this experiment, the interpretation of the pitch prominence as contrastive focus is not supported, as the visual scene never includes a color-contrasting competitor, whether the target interpretation is LB or RB, and there is no preceding context facilitating the contrastive interpretation. Therefore, we can expect that the pitch rise on W2 (N1) should predominantly be interpreted as a syntactic cue for RB syntax.
3.1 Method
3.1.1 Participants
Thirty-two undergraduate students (18–23 years old) from the University of Tokyo participated in the study for a small payment. They were all native Tokyo Japanese speakers with no hearing or vision impairment.
3.1.2 Materials
3.1.2.1 Audio stimuli
The critical materials consisted of 12 experimental audio items of the form: [color word] + N1-Gen + N2-Top, followed by do’re (“which one”), as exemplified in (7).
(7) ao’i ne’ko-no ka’sa-wa do’re?
    blue cat-Gen umbrella-Top which
    “which one is the umbrella with blue cats/the blue umbrella with cats?”
(Note: Japanese does not have a grammatical number distinction; ne’ko is compatible with either a single cat or multiple cats.)
All three constituents of the NP were lexically accented words (hence subject to downstep if they form an LB structure). There were also 12 filler audio items with no branching ambiguity or particular pitch emphasis. Some of the fillers mentioned the color/pattern of an entire object while some referred to the color/pattern of a part of the object, in both cases using various syntactic forms.
The audio materials were recorded by a female speaker of Tokyo Japanese who was familiar with Japanese phonology. Each item came in two audio versions. In one, the speaker had been asked to read the item to express the intended meaning of the LB interpretation. This version was the downstep (default prosody) condition. In the other version, the speaker had been asked to have the RB interpretation in mind. The second words (e.g., ne’ko) were pronounced with a raised pitch (compared to the first version) so as to counteract downstep. This version was the W2 pitch rise condition. A summary of the relevant measurements of the recorded utterances is presented in Table 1, and example pitch tracks contrasting the two conditions can be found in Figure 2.
Table 1. Acoustic profiles of the relevant part (the syntactically ambiguous noun phrases) of the audio stimuli (mean duration and mean peak F0 values over twelve items followed by standard deviation in parentheses).

Figure 2. Example pitch tracks of an experimental item in the downstep condition (dotted line) and W2 rise condition (solid line).
It was confirmed that both the second constituent and its genitive-marking particle of the recorded utterances in the downstep condition had significantly lower F0 peaks on average compared to their counterparts in the W2 pitch rise condition (with a consistent difference of at least 50 Hz across all items). It was also found that the F0 peaks of the initial color words were significantly higher on average in the downstep condition than in the W2 rise condition, presumably due to anticipatory raising preceding downstepping elements (Rialland, 2001) and/or due to the lowering of the word in the W2 rise condition before a planned rise. We found no durational difference between the two conditions except that the genitive marker had a longer duration on average in the downstep condition than in the W2 rise condition.
Each participant listened to all 12 sentences, six in the downstep condition and six in the W2 pitch rise condition, arranged into two counterbalanced lists. The presentation order was varied in each list. For half of the participants, the first encountered experimental item was in the downstep condition; for the other half, it was in the W2 pitch rise condition.
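One way to construct two counterbalanced lists of this kind is sketched below. The alternation rule (odd and even item numbers swapping conditions across lists) and the names `CONDITIONS` and `build_lists` are assumptions made for illustration; the text above specifies only that each list contained six items per condition and that the lists were counterbalanced.

```python
from typing import Dict, List

CONDITIONS = ("downstep", "w2_rise")


def build_lists(n_items: int = 12) -> List[Dict[int, str]]:
    """Return two lists mapping item number -> condition.

    Each item appears in one condition in list 0 and in the other
    condition in list 1 (hypothetical alternation scheme).
    """
    lists = []
    for offset in range(2):
        assignment = {
            item: CONDITIONS[(item + offset) % 2]
            for item in range(1, n_items + 1)
        }
        lists.append(assignment)
    return lists


list_a, list_b = build_lists()
print(sum(c == "downstep" for c in list_a.values()))  # items in downstep, list A
print(all(list_a[i] != list_b[i] for i in list_a))     # conditions swap across lists
```

Presentation order within each list would then be varied separately, as described above.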
3.1.2.2 Visual stimuli
Each visual display was divided into eight areas, each containing an object of one of the types listed in (8). The critical objects are the LB- and RB-targets, one of which should be selected in the final clicking task. LB- and RB-distractors mimicked the design of LB- and RB-targets but featured different animals and colors than the targets. The presence of such objects with the same colors would have established a contrastive relationship with each of the LB- and RB-targets, but the sentence-initial color information would eliminate these distractors from the possible candidates. Other objects were unrelated fillers, one of which had the same color mentioned by the audio sentence so as to serve as an additional color distractor. An example visual scene appears in Figure 3 (NB: Japanese does not have grammatical number distinction so multiple cats can be referred to as ne’ko. The apparent number mismatch between the linguistic stimuli and the visual object is not an issue for any item). The positions of the different object types were balanced across items so that any target type (listed in (8)) appears at least once but not more than twice in each of the eight positions.

Figure 3. Example visual scene for (7) in Experiment 1. Distractor objects are not in the same color as the target objects; therefore, they do not stand in a contrastive status.
(8) LB target
    RB target
    LB distractor (different color, different W2 object)
    RB distractor (different color, different W2 object)
    Unrelated object in the color mentioned in the audio stimulus
    Unrelated objects in unmentioned colors (x3)
In some of the filler trials, one or two pairs of visual objects that mimicked the LB/RB visual contrast appeared as unrelated objects so that the LB/RB target object pairs in the experimental materials would not draw excessive attention from the participants. Eight filler items included two pairs with LB/RB-related visual contrast, presenting in total four objects of the same color, in order to keep the target visual scenes from standing out from the rest of the trials. Other filler items included five (in two cases), three (in six cases), or two (in seven cases) objects of the same color, so that the distribution of the number of visual objects of the same color within a scene would not be expected to create biases in the participants’ judgment.
3.1.3 Procedure
Participants sat in front of a Tobii 1750 eye-tracker, which was calibrated for each participant using Tobii ClearView. The stimuli were presented and controlled using E-Prime. After successful fixation of 1000 ms on the fixation cross, the visual display appeared on screen. There was no automatic drift-correction procedure, but the presentation of a new item was contingent on successful fixation on the cross. If that was problematic for any trial, recalibration was performed. The sentence was played through a pair of speakers 2500 ms later (the delay allowed participants to scan the eight visual objects). After hearing the sentence, participants clicked on the object that they thought was the target referent. Participants’ eye movements were recorded at a sampling rate of 50 Hz from the onset of the audio stimuli until the mouse click.
Participants were told to find and click on the visual object that matched the referent mentioned in the audio sentence. They were told to take their best guess if they were not sure which one to pick. The participants were also told that they would not have to compromise their comprehension for speed.
3.2 Data analysis and results
3.2.1 Final target selection responses
Participants’ final clicking responses were analyzed in a generalized linear mixed model (GLMM) (dependent variable (DV): binary target choice where LB choice and RB choice are coded as 0 and 1, respectively; fixed factor – prosodic manipulation where W2 pitch rise and downstep conditions are effect coded as 0.5 and -0.5, respectively, before being centered; random factors – participant and item). The final model was selected through a backward selection (the description of the model is reported in the endnote), 2 and revealed that the W2 pitch rise increased the sentence-final choice of the RB target (β = 0.642, z = 4.31, p < .001). The overall percentage of the default LB choice was 71% in the downstep condition and 56% in the W2 pitch rise condition.
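To give a sense of the size of this effect, the reported coefficient can be converted from the log-odds scale to an odds ratio. The short calculation below is only an illustration: it assumes that, with the two conditions coded one unit apart, β is the log-odds difference in RB choice between the W2 pitch rise and downstep conditions, and it compares the result with the odds ratio implied by the raw percentages. The two figures need not match exactly, because the model estimate is conditional on the random effects.

```python
import math

beta = 0.642  # reported fixed-effect estimate (log-odds scale)
odds_ratio_model = math.exp(beta)

# RB-choice proportions implied by the reported LB percentages.
p_rb_downstep = 1 - 0.71
p_rb_rise = 1 - 0.56


def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1 - p)


odds_ratio_raw = odds(p_rb_rise) / odds(p_rb_downstep)

print(round(odds_ratio_model, 2))
print(round(odds_ratio_raw, 2))
```

Both values come out at roughly 1.9, that is, the odds of choosing the RB target were nearly twice as high with the W2 pitch rise as with downstep.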
3.2.2 Eye-tracking data
The goal of this study’s two experiments is to examine the effect of prosody (W2 pitch rise), which can potentially have multiple functions, which in turn can emerge at different points in time. The exact time course of these possible effects cannot be predicted in advance. We used generalized additive mixed-effects models (GAMM; Wood, 2015; Wood & Saefken, 2016), which are capable of handling time-series data that do not presuppose a linear relationship between the predictor and the outcome while also taking random variables (such as participants and items) into consideration. The analysis was performed in R (R Development Core Team, 2014) using the package mgcv (Wood, 2015).
We broke the gaze data down into 100 ms time windows, starting from the offset of the second word (e.g., ne’ko), where the prosodic manipulation between the conditions becomes distinct but before further information as to whether the pitch rise is followed by post-focal reduction is fully evident. This time window was later included in the model as a predictor variable. For each of the visual target types (LB-target and RB-target), we summed the number of gazes for each window and calculated the logged ratio between the sum of gazes on the left-branching target over the sum of gazes on the right-branching target (“log-ratio,” hereafter). The dependent variable was the difference in the log-ratio (the positive value indicating a bias toward the LB target) between the two prosodic conditions (log-ratio in the W2 rise condition minus that in the downstep condition) for all intervals for each participant. As our experiment adopted a design in which each participant saw only one occurrence of each item in only one of the conditions, the item information was collapsed in this analysis so as to preserve the participant random factor to be considered in the analysis.
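The dependent variable described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the gaze counts are invented, and the small `smoothing` constant is an assumption added here to avoid taking the logarithm of zero, since the text does not say how empty cells were handled.

```python
import math

def log_ratio(lb_gazes, rb_gazes, smoothing=0.5):
    """log(LB / RB); positive values mean more gaze samples on the LB target.

    `smoothing` is an assumption of this sketch (not from the paper) that
    keeps the ratio defined when one of the counts is zero.
    """
    return math.log((lb_gazes + smoothing) / (rb_gazes + smoothing))

def log_ratio_difference(rise_counts, downstep_counts):
    """Per-window DV: log-ratio in W2 rise minus log-ratio in downstep.

    Each argument is a list of (lb_gazes, rb_gazes) pairs, one per window.
    """
    return [
        log_ratio(*rise) - log_ratio(*down)
        for rise, down in zip(rise_counts, downstep_counts)
    ]

# Made-up gaze-sample counts for one participant over three 100 ms windows.
rise = [(10, 10), (14, 6), (5, 15)]
downstep = [(10, 10), (10, 10), (10, 10)]
dv = log_ratio_difference(rise, downstep)
print(dv)
```

In this toy input the first window shows no difference between conditions, the second a relative LB bias under W2 rise (positive value), and the third a relative RB bias (negative value).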
In running the models, both intercept and slope for participant were initially considered in the random structure and were subject to backward selection in each analysis. Once the change in the dependent variable over the time windows with the calculated 95% CI estimated by the selected model is plotted, we can expect the time intervals where the CI range (as indicated by a gray band in the figures) does not overlap with the zero axis to indicate the time period (as marked by a blue area) in which the prosodic manipulation produced a significant effect (the bottom of the range being above zero indicates a significant bias toward the LB targets, whereas the top of the range under zero indicates a reliable bias toward the RB targets).
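The reading rule for the plots (a window counts as reliable when its 95% CI lies entirely above or entirely below zero) can be sketched as below. The function name and the CI bounds are hypothetical; in the actual analysis the bounds would come from the fitted GAMM, which this sketch does not reproduce.

```python
def significant_runs(lower, upper):
    """Return (first_window, last_window, direction) for each run of
    consecutive windows whose CI lies entirely on one side of zero.
    Windows are numbered from 1; direction is "LB" (CI above zero)
    or "RB" (CI below zero)."""
    runs = []
    current = None  # [start, end, direction] of the run being built
    for i, (lo, hi) in enumerate(zip(lower, upper), start=1):
        if lo > 0:
            direction = "LB"
        elif hi < 0:
            direction = "RB"
        else:
            direction = None  # CI overlaps zero: no reliable difference
        if current is not None and direction == current[2]:
            current[1] = i  # extend the current run
        else:
            if current is not None:
                runs.append(tuple(current))
            current = [i, i, direction] if direction else None
    if current is not None:
        runs.append(tuple(current))
    return runs

# Made-up CI bounds for six windows.
lower = [-0.2, 0.1, 0.2, -0.3, -0.5, -0.4]
upper = [0.3, 0.5, 0.6, 0.1, -0.1, -0.05]
runs = significant_runs(lower, upper)
print(runs)
```

Each returned run corresponds to one shaded interval of the kind marked in blue in the figures.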
As a planned procedure, we set two time intervals in advance. The first ran approximately up to the offset of the entire noun phrase in question (within-NP region, i.e., up to the offset of ka’sa “umbrella” in this example; this refers to time windows 1–7, corresponding to 1–700 ms from the offset of the second word, as the mean offset of the NP was 785 ms), and should reflect the immediate response to the prosodic information on W2 (e.g., ne’ko “cat”). The other was the interval following the offset of that noun phrase (post-NP region, referring to time windows 8–40, corresponding to 701–4000 ms), which should reveal how the prosodic information in question was evaluated once the overall NP had been received. This division into time windows is a precaution against overapplying a single model to the entire 4000 ms of data, which could potentially contain a series of independent events (see endnote 3). The predictor variable of interest was prosody (W2 pitch rise vs. downstep, i.e., no W2 pitch rise), effect-coded and centered (positive values indicate the W2 rise condition). This predictor was fit to our dependent variable with by-participant and by-item random effects. Model selection among different random effect structures followed the backward selection procedure, using the compareML function in the itsadug package. The pattern of relative bias between the LB and RB targets as a function of time windows 1–7 is plotted in Figure 4, and that for time windows 8–40 in Figure 5. The corresponding statistical profiles are presented in Table 2 and Table 3, respectively. The selected models are reported in endnotes 4 and 5.
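Since the results are reported both as window numbers and as millisecond ranges, a small helper makes the correspondence explicit. The names below are hypothetical; the sketch assumes only what the text states, namely 100 ms windows counted from the offset of W2, with windows 1–7 forming the within-NP region and windows 8–40 the post-NP region.

```python
WINDOW_MS = 100       # window width stated in the text
LAST_WITHIN_NP = 7    # windows 1-7: within-NP region
LAST_WINDOW = 40      # windows 8-40: post-NP region

def window_span_ms(first, last=None):
    """Return (start_ms, end_ms) covered by windows first..last, measured
    from W2 offset and labelled the way the text does (e.g., 701-4000 ms)."""
    last = first if last is None else last
    return ((first - 1) * WINDOW_MS + 1, last * WINDOW_MS)

def region(window):
    """Assign a window number to one of the two planned regions."""
    if not 1 <= window <= LAST_WINDOW:
        raise ValueError("window outside the analysed range")
    return "within-NP" if window <= LAST_WITHIN_NP else "post-NP"

print(window_span_ms(8, 40))  # (701, 4000), the post-NP region
print(region(5), region(17))
```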

Figure 4. Plotted log-ratio in W2 rise condition minus log-ratio in downstep (DS) condition over time windows 1–7 in Experiment 1. The gray band shows the 95% CI range, where its overlap with the zero-axis indicates no reliable difference between the two prosodic conditions.

Figure 5. Plotted log-ratio in W2 rise condition minus log-ratio in downstep (DS) condition over time windows 8–40 in Experiment 1. The gray band shows the 95% CI range. The blue area indicates the time interval where the two prosodic conditions produced a significant effect.
Table 2. Model details of the generalized additive mixed model (GAMM) reporting parametric coefficients and estimated degrees of freedom (edf), reference degrees of freedom (Ref. df), F- and p-values for smooth terms for the analysis of time windows 1–7 in Experiment 1.
Table 3. Model details of the GAMM for the analysis of time windows 8–40 in Experiment 1.
As can be seen in Figure 4, no reliable difference between the two prosodic conditions was found during the within-NP time windows. The blue area in Figure 5 indicates that the W2 rise condition had a reliably larger RB bias compared to the downstep condition during time windows 22–32 (2101–3200 ms), which largely coincide with the time in which the participants made their clicking responses as they decided between the LB and the RB targets.
We took a further step to reconfirm the effect of the prosodic manipulation that emerged in this particular time interval, taking both participant and item random effects into consideration. A linear mixed-effects (LME) analysis (Jaeger, 2008) was conducted using the lme4 package in R (R Development Core Team, 2014) on the same time interval (2101–3200 ms). The log-ratio was the dependent variable and prosody was the fixed variable, with both item and participant as random variables. The model selected through the backward selection process (using the anova function) is reported in endnote 6. The results confirmed what the GAMM analysis had demonstrated: Looks to the RB target (over the LB target) increased in the W2 pitch rise condition compared to the default downstep condition (β = -0.20331, SE = 0.07417, t = -2.741, p = .0091) during the processing of the NP in question.
3.3 Discussion (Experiment 1)
There was an overall preference for the LB interpretation, as indicated by the participants’ final clicking responses, but the eye-tracking data suggest that pitch rise on N1 counteracted the LB bias; it increased the RB interpretation and looks to the RB target around the time that the participant made the choice (by clicking on an object). Though the effect was not observed as an immediate reaction to the prosodic manipulation, the results are consistent with our prediction that the pitch rise on the phrase-second word would be recognized as metrical boost, indicating the RB interpretation, when there was no reason for any other interpretation (i.e., when neither target object appeared in a legitimate contrastive relation with the other visual objects). Our next question is what happens if the visual scene supports the contrastive interpretation of the W2.
4 Experiment 2
4.1 Materials and design
4.1.1 Participants
Twenty-six undergraduate students (18–24 years old), also from the University of Tokyo, participated in the study for a small payment. They were all native Tokyo Japanese speakers with no hearing or vision impairment.
4.1.2 Materials
In Experiment 2, the materials (audio and visual) were the same as in Experiment 1, except that there were color-sharing competitors for each LB- and RB-target object in the visual scene (Figure 6), replacing the LB- and RB-distractors that did not share color information with the respective targets in Experiment 1 (Figure 3). In other words, under either branching interpretation of the target, the visual context in Experiment 2 is supportive of the contrastive interpretation of the pitch prominence of W2 (N1). In addition, the unrelated distractors used in Experiment 1 were no longer in the critical color, as there were enough color-competitors in the scene in Experiment 2. The filler items were the same as in Experiment 1.

Figure 6. Example visual scene for (7) in Experiment 2. Competitor objects and the target objects are in the same color, establishing a contrastive status between branching counterparts.
It should be recalled that contrastive focus involves a sequence of pitch events: pitch prominence on the focused item, followed by the compression of the pitch range on the subsequent input. Therefore, ambiguity in W2’s prosodic status is not an issue if the listeners wait to hear the sentence beyond the second word. In this experiment, however, we are interested in the incremental/first-course interpretation of pitch prominence as soon as it happens (i.e., before the listeners hear the input following W2), even if that immediate (possibly erroneous) interpretation is later overridden by more decisive information provided by the subsequent input.
In fact, assigning a contrastive focus status to the W2 (e.g., ne’ko) alone should not have a direct influence on how the branching ambiguity is resolved. The contrastive interpretation could be based on the LB syntactic analysis (e.g., blue CAT(S), as opposed to blue squirrel(s), on an umbrella), or could as well be based on the RB analysis (a blue umbrella with CAT(S) on it, as opposed to a blue umbrella with squirrel(s) on it). Hence, the visual scene had two color competitors, corresponding to both the LB interpretation and the RB interpretation, so as to keep both branching interpretations under contrastive focus equally available in the scene.
Our main interest is the online reaction of the listeners to the pitch prominence on W2 at the moment they recognize it. At this point, only the adjective and the W2 have been received. If the pitch prominence on W2 is taken as part of the contrastive focus marking, the only available interpretation of it at this point is, for example, “blue CAT(S), as opposed to something else that’s blue,” because the subsequent noun is yet to be received. This is based on the local interpretation of the color adjective as modifying the following noun (ne’ko “cat”), which is the only available head. So, an increased bias toward the LB interpretation in response to the pitch prominence, compared to the default downstep prosody, would be evidence for the immediate contrastive interpretation of the prosodic manipulation.
In contrast, if the pitch prominence on W2 is taken as a cue to the RB syntax in the same way as in Experiment 1, the pitch prominence condition would lead to an increase in looks to the RB target as a sentence-final effect, and the pattern of the results would not differ from that of Experiment 1.
It is also possible that these opposing prosodic effects are independent and can both occur during processing. If so, they may cancel each other out or compete if they are processed simultaneously, or they may emerge at different points in time. In any case, the resulting patterns should differ from that of Experiment 1.
4.2 Data analysis and results
4.2.1 Final object selection responses
Participants’ final clicking responses were analyzed with a GLMM, with the same coding procedure as in Experiment 1. The final model selected by the backward procedure (endnote 7) revealed that the W2 pitch rise increased the choice of the RB target (β = 0.883, z = 4.720, p < .001). The percentage of the LB choice was 57% in the downstep condition and 35% in the W2 pitch rise condition.
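As a rough plausibility check, the reported percentages can be converted into a raw log odds ratio and compared with the model coefficient. This is only a back-of-the-envelope sketch with a hypothetical helper function: the GLMM estimate is conditional on the participant and item random effects, so the two numbers are not expected to match exactly.

```python
import math

def log_odds_ratio(p_rise, p_downstep):
    """Log odds ratio of choosing the RB target, W2 rise vs. downstep."""
    def odds(p):
        return p / (1.0 - p)
    return math.log(odds(p_rise) / odds(p_downstep))

# RB choice = 1 - LB choice, using the percentages reported for Experiment 2.
rb_downstep = 1 - 0.57
rb_rise = 1 - 0.35
raw_estimate = log_odds_ratio(rb_rise, rb_downstep)
print(round(raw_estimate, 2))
```

The raw value comes out at roughly 0.9, in the same range as the reported β = 0.883. Because the prosody predictor separates the two conditions by one unit, β can be read as the difference in log odds between them.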
We ran cross-experiment statistical comparisons where prosody and experiment were considered as fixed variables. The selected model, again obtained through the backward selection procedure, is reported in endnote 8. There was a robust main effect of prosody, meaning W2 rise increased the RB choice in both experiments (β = 0.731, z = 6.514, p < .001). In addition, the overall rate of the default LB choice was lower in Experiment 2 than in Experiment 1 (in other words, more overall RB bias in Experiment 2), irrespective of prosody, as revealed by a significant main effect of experiment (β = -0.856, z = -2.371, p = .018). However, no reliable interaction between prosody and experiment was found (β = -0.1040, z = -0.928, p = .35). The percentage of the RB target choice in the two conditions in each experiment is plotted in Figure 7.

Figure 7. Percentage of RB target choice in Experiment 1 and Experiment 2.
4.2.2 Eye-tracking data
The data analysis procedure using GAMM was the same as in Experiment 1. The selected model resulted in the same random structure as in Experiment 1 (participant intercept and slope). The pattern of the relative bias between the LB and RB targets as a function of time windows 1–7 is plotted in Figure 8, and that for time windows 8–40 in Figure 9. The statistical profiles for these models are presented in Table 4 and Table 5, respectively. The selected models are reported in endnotes 9 and 10.

Figure 8. Plotted log-ratio in W2 rise condition minus log-ratio in downstep (DS) condition over time windows 1–7 in Experiment 2. The gray band shows the 95% CI range. The blue area indicates the time interval where the two prosodic conditions produced a significant effect.

Figure 9. Plotted log-ratio in W2 rise condition minus log-ratio in downstep (DS) condition over time windows 8–40 in Experiment 2. The gray band shows the 95% CI range. The blue area indicates the time interval where the two prosodic conditions produced a significant effect.
Table 4. Model details of the GAMM for the analysis of time windows 1–7 in Experiment 2.
Table 5. Model details of the GAMM for the analysis of time windows 8–40 in Experiment 2.
As can be seen in Figure 8, the W2 rise condition produced a larger bias toward the LB target compared to the downstep condition over time windows 4 and 5 (301–500 ms). This was also checked with an LME analysis, where the log-ratio was taken as the dependent variable, prosody as the fixed variable, and both participant and item as random variables. The selected model (endnote 11) confirmed that looks to the LB target (over the RB target) increased in the W2 pitch rise condition compared to the default downstep condition (β = 0.2543, SE = 0.077, t = 3.318, p = .001). Because this effect occurred well before the offset of the entire NP in question, it should reflect an ongoing process that takes place while the NP is being heard.
In contrast, as seen in Figure 9, there was a temporary bias toward the RB target caused by the W2 rise in comparison to the downstep condition in time windows 17–18 (1601–1800 ms), at which point the NP in question has already been heard. A separate LME analysis (the selected model had random intercepts for participant and item, same as the model in endnote 11) revealed a significant effect of prosody in this time interval (β = -0.32114, SE = 0.07737, t = -4.1509, p < .001). Together with the fact that the W2 rise condition led to the RB bias in the final clicking choice, we can assume the observed initial LB bias was eventually overridden in Experiment 2.
4.2.3 Cross-experimental comparisons
To confirm the effects observed in the time window analyses using GAMM, and to check whether the difference in the manipulations in the two experiments is responsible for those effects, LME analyses were conducted on the time windows identified by the GAMM analyses in Experiments 1 and 2, with the log-ratio as the dependent variable, and the prosodic manipulation (W2 pitch rise vs. downstep, effect coded as 0.5 and -0.5, respectively, and then centered) and experiment (Experiment 1 and 2, effect coded as 0.5 and -0.5, respectively, and then centered) as the fixed factors. Both participants and items were included as random factors. We simplified the random effects structure from the maximal model using a backward selection procedure.
The immediate LB bias found in Experiment 2
The analysis focusing on time windows 4–5 aimed at confirming that the temporary attention to the LB target was a consequence of the context manipulation (see the selected model in endnote 12). There was a significant interaction between prosody and experiment (β = -0.13749, SE = 0.05311, t = -2.589, p = .01), indicating that the transient bias toward the LB target was mainly due to the experimental manipulation in Experiment 2, where the contrastive interpretation of the W2 was felicitous in the visual context.
The RB bias emerging in the post-NP time regions
An LME analysis targeted time windows 22–32, where there was a sustained bias toward the RB target in Experiment 1 (see the selected model in endnote 13). There was a reliable main effect of prosody (β = -0.16046, SE = 0.06131, t = -2.617, p < .05), reflecting an overall bias toward the RB target caused by the W2 rise prosody, regardless of the contextual manipulation. There was also a main effect of experiment (β = -0.25454, SE = 0.09631, t = -2.643, p < .05), indicating that the overall bias toward the LB interpretation was stronger in Experiment 1 than in Experiment 2. However, the interaction between prosody and experiment was not significant (β = 0.04492, SE = 0.04558, t = 0.985, p > .05).
A further LME analysis targeted time windows 17–18, where there was a bias toward the RB target in Experiment 2. The selected model again had an item random slope for prosody as well as intercepts for participant and item. Although the effect found in Figure 9 seems rather weak, there was a reliable main effect of prosody (β = -0.18773, SE = 0.090281, t = -2.079, p < .05), as expected, and a main effect of experiment (β = -0.24989, SE = 0.12412, t = -2.013, p < .05), indicating the lack of a comparable effect in Experiment 1 in this time interval. There was an interaction between prosody and experiment (β = -0.12017, SE = 0.05070, t = -2.370, p < .05), suggesting that the bias toward the RB target in the post-NP region emerges earlier in Experiment 2 than in Experiment 1.
4.3 Discussion (Experiment 2)
In Experiment 2 as well as in Experiment 1, the pitch rise on N1 increased the final RB choice. Looks to the RB target also increased toward the end of the sentence (rather than immediately upon hearing W2). In addition, the important difference between the eye-movement pattern in the two experiments is that the pitch rise manipulation created an immediate and temporary bias toward the LB syntax in Experiment 2: Looks to the LB object relative to looks to the RB object increased as soon as the pitch prominence was recognized. We interpret this as evidence that pitch prominence alone quickly facilitated the contrastive interpretation that was available at that point in time (i.e., when the focused element was locally identified, which is part of the LB interpretation of the entire phrase completed by the subsequent input). That interpretation would eventually be overridden by an updated interpretation of the same prosodic cue as evidence for the RB syntax.
5 General Discussion
5.1 Timing
The results from Experiments 1 and 2 together reveal that one and the same prosodic event can be interpreted as different cues at different points in time. Listeners recognize the pitch rise as evidence of RB-marking information such as metrical boost only later, that is, when the subsequent constituents have already been heard. This is in fact reasonable because metrical boost can only be relevant when a right-branching structure is a compatible analysis for the input. At least the head noun (in the example, the third word ka’sa) must be processed to confirm it.
In Experiment 1, we found only an increase in the RB interpretation in response to pitch prominence on the second word. There was no sign of that prosodic cue being interpreted as contrastive focus. This was expected, because the experimental stimuli provided no information, linguistic or visual, supporting the contrastive interpretation of the target. What, then, could make the eventual RB bias in Experiment 2, which is the result of the re-interpretation of the signal, show up more quickly than that in Experiment 1?
In Experiment 2, the only difference from Experiment 1 was that the visual context did support a contrastive interpretation (in principle, with either the LB or the RB syntax); this had the effect of facilitating attention to the LB target, which was compatible with an immediate and local contrastive interpretation of the pitch prominence on the second prosodic word. It seems reasonable to assume that listeners would wait to establish a contrastive interpretation until they could process the entire two-step prosodic event, which would consist of pitch prominence on the focused item and pitch reduction on the subsequent item. However, our results suggest that listeners react to the initial pitch prominence alone to enable themselves to make an immediate calculation of the focus interpretation.
The immediate focus interpretation of the pitch rise is overridden by the structural interpretation by the time the listeners settle on a final interpretation. This allows two accounts as to how it happens. One possible interpretation is that when the elevated pitch is perceived, a competition between syntax and focus interpretations takes place. The input alone is not adequate for either analysis, as metrical boost would only be relevant if the input unfolding actually involved an RB structure, which is not confirmed at this point. The input signal is also only partially compatible with the contrastive focus interpretation, without providing evidence for post-focal reduction. So both competitors are still uncertain candidates, but the results suggest that the contextual interpretation rules at this point. Then, after the subsequent input confirms that the RB structure is available and that metrical boost is a legitimate interpretation of the prosody, the syntactic analysis becomes dominant and wins this tug-of-war at the cost of the prosodic reanalysis.
An alternative account is that the contrastive interpretation is initially hypothesized on the basis of the expanded pitch event but then quickly refuted by the fact that the subsequent F0 contour is not compatible with the expected contrastive focus prosody, due to the lack of post-focal pitch reduction. In either case, both the prosodic representation and the interpretation that derived from the prosodic representation would have to undergo a reanalysis to arrive at the results observed in Experiment 2.
5.2 Possible theoretical implications for the status of prosodic phrasing
A curious fact is that the ultimate RB decision was achieved earlier in Experiment 2 than in Experiment 1, despite the need for reanalysis (which would predict an additional delay, if anything). A possible account is motivated by the theoretical debates concerning whether metrical boost and focus are phenomena independent of prosodic phrasing. If contrastive focus is actually expected to be accompanied by a prosodic boundary preceding the focused item, the listeners would hypothesize such a boundary preceding W2, which has a notable pitch expansion facilitating the focus interpretation. Even after the contrastive interpretation becomes untenable due to the later input, the hypothesized prosodic boundary could be maintained in the listeners’ mental prosodic representation, which in turn would encourage the syntactic analysis in which the second and third items are grouped together. This would result in facilitating the assignment of the RB structure to the input because it is more compatible with the previously computed prosodic boundary than the LB structure. So our experimental outcome is compatible with a group of theories that argue that focus prosody and metrical boost actually create a prosodic boundary.
In fact, the latest evidence argues that these prosodic events do not necessarily involve changes in prosodic phrasing in spoken sentence production (e.g., Ishihara, 2011). Nevertheless, this line of theories does not totally contradict our account, which looks at the processing aspect of linguistic communication. The studies that have presented the most compelling evidence for the independence of metrical boost and focus from prosodic phrasing have employed a paradigmatic approach where they diagnosed the occurrence or non-occurrence of downstep on the basis of entire utterances by comparing controlled pairs with contrasting lexical accent, thereby controlling the condition for downstep to occur. This approach, in a controlled laboratory setting, may provide a more objective and reliable basis for theoretically motivated diagnoses. However, it should be remembered that strict prosody-syntax correspondence, as judged on the basis of an entire utterance, is not something listeners in the real world can use in their incremental sentence comprehension, not to mention that controlled counterparts (where downstep cannot occur) are not available to real-world listeners. To make their online judgments as promising as possible, given their limited access to information about how the prosody they are perceiving links to syntactic or higher levels of linguistic representation, listeners could hypothesize signals that are not confirmed by the bottom-up evidence provided by partial input or a controlled baseline. In this case, listeners can hypothesize a prosodic boundary when they hypothesize contrastive focus immediately upon recognition of W2 pitch rise, which would later be used as evidence for RB syntax (as in Experiment 2). Such an account is of course only speculative at the moment and should be further tested in future investigations.
5.3 Other issues
One part of our results that we have not explained so far is the fact that there were more RB clicking choices in Experiment 2 than in Experiment 1, irrespective of the prosodic manipulation, as revealed by a significant main effect in the cross-experimental analysis. This is slightly counterintuitive because at one point, in an early part of the sentence, the experimental manipulation attracts more attention to the LB target, presumably because the LB analysis is the only available resolution of the contrastive prosody that is accessible at that point in the input. A possible reason may be the fact that the contrastive interpretation with the RB syntax is also available in the visual contexts provided in Experiment 2, once the entire configuration of the NP is processed to ensure that the RB analysis, where the adjective (e.g., ao’i “blue”) modifies the non-local head (e.g., ka’sa “umbrella,” instead of the immediately following ne’ko “cat”), is a possible structural choice. If so, participants may have been “re-interpreting” the W2 rise information not only as a realization of metrical boost (as hypothesized in this study), but also as the contrastive focus assigned to an alternative syntactic structure, possibly being primed by the visual information that suggested that such an interpretation was also an option. It is probable that this situation would lead to a higher acceptance of the RB target irrespective of the prosodic condition. This possibility will be our focus in a future investigation.
6 Conclusion
Our study found the following: (a) listeners interpret one and the same pitch event (pitch rise on the second word in [color word] + N1-Gen + N2) as both syntax-encoding and contrast-encoding information during the course of processing a sentence in relation to contextual information, and (b) listeners compute contrastive focus as soon as they note pitch prominence on incoming input, without waiting for subsequent input to confirm the status of that pitch prominence, while they analyze the same prosodic cue as a signal to structure (metrical boost) only after they encounter subsequent items (such as the head noun that guarantees branching is relevant). These findings support the suggestion that the same prosodic cue can be reanalyzed in listeners’ mental prosodic representations. They also provide motivation for the theoretical stance that prosodic boundaries are involved in F0 events associated with contrast marking and branching disambiguation. The fact that the eventual RB bias caused by the W2 rise emerged faster in Experiment 2 (when the rise was initially processed as a contrastive cue) than in Experiment 1 can be accounted for by hypothesizing that an initially assigned prosodic boundary is exploited for re-interpretation. In online processing, once a prosodic boundary is identified by the listener, it allows for reinterpretation at different linguistic levels in response to the accumulating information.
