Abstract
This article presents a generalized model of reference relations in the discourse of television news based on Montgomery’s principles of intelligibility, Halliday and Hasan’s reference cohesion, Martin’s identification system, Tseng’s cohesive reference and some intersemiotic models. Reference in television news involves reference patterns from the verbal track (verbal reference), the visual track (visual reference) and across the two tracks (visual–verbal reference). Based on the categorization of verbal reference, this study classifies visual reference as personals (such as visual reappearance), demonstratives (such as proximity and direction) and comparatives (such as similarity and difference). Reference across visual and verbal tracks includes three general types of visual–verbal reference, i.e. complementary, visual-as-bridge and parallel, among others. Through these patterns of reference and their reference chains, participants in television news can be tracked and identified. This model is applied to a comparative analysis of two news items broadcast separately by BBC’s News at Ten and CCTV’s News Simulcast. Differences and their implications are presented and discussed.
Introduction
Television news involves a complex interaction of various modalities in which reference plays a primary role in achieving coherence, newsworthiness and, above all, intelligibility. Montgomery (2007: 97–98) suggests that the intelligibility of a television news report is governed by two principles:
For any referring expression in the verbal track, search for a relevant referent in the image track; and
Treat any element depicted in a shot in the visual track as a potential referent for a referring expression in the verbal track.
This article takes Montgomery’s principles of intelligibility as a point of departure and presents a comprehensive map of the reference relations that these principles govern in television news. Specifically, it aims to build a model for analyzing how visual and verbal references cue viewers to identify the participants (i.e. persons, places, things, objects or settings) in television news so that they can achieve an intelligible and coherent understanding of the news stories.
Reference, in Halliday and Hasan’s (1976: 31) view, means ‘the specific nature of the information that is signalled for retrieval’ within and outside a linguistic text (namely, endophora or exophora). Tseng and Bateman (2010, 2012) have recently introduced endophoric (not exophoric) reference into film studies and treated audiovisual reference as the ‘reappearance’ of filmic elements. Such reappearance is realized by presenting and presuming those elements across shots or scenes (Tseng and Bateman, 2010: 222). I will take the Hallidayan notion of reference as the conceptual basis and draw on Tseng’s ‘reappearance’ in discussing visual reference.
Earlier work in this field can be traced back to Halliday and Hasan’s (1976) account of textual cohesion, which is realized by five types of cohesive ties: reference, substitution, ellipsis, conjunction and lexical cohesion. Hasan (1984) later added the notions of ‘cohesive chains’ and ‘cohesive harmony’, arguing that a text involves two types of cohesive chains: identity and similarity. The proportion of cohesive chains in a text determines the degree of its cohesive harmony. Martin (2004[1992]) developed Halliday and Hasan’s cohesive ties/chains and proposed a system of identification in terms of reference. Subsequently, such Hallidayan cohesive ties/chains have been frequently introduced into the study of intersemiotic relations in different multimodal texts. This includes the mapping of image–text grammars (e.g. Kong, 2006; Marsh and White, 2003; Martinec, 1998, 2013; Martinec and Salway, 2005), intersemiotic interaction and conjunction (e.g. O’Halloran, 1999, 2005; Van Leeuwen, 1991, 2005, 2006) and image–word cohesion and complementarity (e.g. Liu and O’Halloran, 2009; Royce, 1998, 2007). These developments have greatly enriched the study of intersemiotics, but most of them, except for Van Leeuwen’s (1991) conjunctive relations, still focus on page-based analysis. Furthermore, most studies centre on logico-semantic relations without paying adequate attention to reference. The situation is similar in film studies, even though film analysts have long proposed Hallidayan cohesion for the study of film language (e.g. Palmer, 1989).
Recently, however, some exciting progress (e.g. Janney, 2010; Tseng, 2008, 2012, 2013; Tseng and Bateman, 2010, 2012) has been made on visual reference. Drawing on Martin’s (2004[1992]) identification system, Tseng builds a model of ‘filmic cohesive reference’ for describing the filmic resources that realize the presentation and reappearance of elements in films such as people, objects and settings (Tseng, 2012, 2013; Tseng and Bateman, 2012). Likewise, Janney (2010) has specified some realization devices for reference in filmic images based on Halliday and Hasan’s (1976) typology of personal, demonstrative and comparative reference. The present article builds upon these findings but differs in at least three ways. First, their studies are restricted primarily to cohesive reference within texts (i.e. endophora) while this study covers reference within and outside multimodal texts (i.e. endophora and exophora). Second, their research focuses on how reference works in achieving textual cohesion while the present study concerns not only cohesion but also the newsworthiness that is achieved through reference in television news. Third, and above all, the genres addressed are different: their studies concern primarily reference patterns in film while the present study focuses on those patterns in television news reports. Usually, a film is a fictional narrative whereby visual and verbal messages are designed to tell cohesive and coherent (fictional) stories, whereas a television news item combines presenting and commenting on a factual and recently occurring event (Ekström, 2002; Montgomery, 2007). It concerns not only the cohesion and coherence of the news text, but also the newsworthiness of the news. Against this background, this study is predicated on the view that television news might involve quite different patterns of reference and identification from those addressed in film studies (e.g. Janney, 2010; Tseng, 2008, 2012, 2013; Tseng and Bateman, 2010, 2012). Specifically, it is designed to address the following questions: how are patterns of reference and identification constructed in television news? How do they relate to the newsworthiness of the news? And to what extent may cultural differences (e.g. between Chinese and Western news) lead to different reference/identification patterns in television news?
Reference in television news is complex since it is realized in a multimodal fashion and involves various broadcasting patterns such as direct visual address, voiceovers, sound bites, diagrams, news footage, diegetic sound, etc. Nonetheless, these patterns are all presented, at least up to now, within two tracks, the verbal and the visual. We can therefore group all kinds of reference in television news accordingly, taking those in the verbal track as verbal reference, those in the visual track as visual reference and those across the two tracks as visual–verbal reference.
Verbal Reference
Reference is either situational (or exophoric, referring to ‘a thing as identified in the context of situation’) or textual (i.e. endophoric, referring to ‘a thing as identified in the surrounding text’) (Halliday and Hasan, 1976: 32). The latter may be anaphoric (referring to an entity in the previous text) or cataphoric (referring to an entity in the subsequent text).
Reference in linguistic text is achieved primarily in three ways: personal, demonstrative and comparative. Personal reference is achieved ‘through the category of person’ (Halliday and Hasan, 1976: 37), including first-, second- and third-person pronouns. They are either existential when functioning as a head (such as I, me, we, us, you, he, him, she, her, it, they and them) or possessive when functioning as a determiner (such as my, our, your, his, her, its, and their). Another type can be both adjective and noun such as mine, yours, etc., functioning as either a head or a determiner.
Demonstrative reference concerns proximity to the encoder in terms of location and time. Determiners such as this and these and adverbs such as here and now denote ‘near’ while determiners such as that and those and adverbs such as there and then denote ‘far’. The definite article the shows a sense of neutrality in the sense that it is treated by the encoder to denote not ‘far’ or ‘near’ but shared knowledge. In television news, demonstratives are often employed as exophoric items due to their deictic properties. For example: [1] (NDY, BBC, 2011): There aren’t many laughs
In [1],
Comparative reference is realized ‘by means of identity or similarity’ (Halliday and Hasan, 1976: 37), including adjectives such as same, identical, similar, different and else, comparatives such as better and more, and adverbs such as identically, likewise and differently. A comparative reference can be identical, similar or different in terms of extent or manner. For example: [2] (NDY, BBC, 2011): There were [3] (BBC’s News at Ten, 8 January 2013): Increasingly, it’s an uneasy relationship between the Afghans and their foreign partners.
Here in [2],
To sum up, we can outline verbal reference as shown in Figure 1.

Types of verbal reference. The curly brace means that the immediate categories within it can be chosen simultaneously, while the square bracket means that only one of the immediate categories within it can be chosen at a time. For example, a reference can be {[exophora] + [personal: existential]} or {[endophora: anaphora] + [demonstrative: near]}, but not {[exophora] + [endophora: anaphora]} or {[personal] + [demonstrative]} (similarly hereinafter).
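The selection logic of this notation can be sketched in code. The following is only a minimal illustration under stated assumptions: the feature labels are hypothetical renderings of the categories named in this section (Figure 1 itself is not reproduced here), and `is_valid_selection` is an invented helper name. One option is taken from the phoricity system and one from the reference-type system, which mirrors the curly-brace (simultaneous) and square-bracket (exclusive) readings.

```python
from __future__ import annotations

# Hypothetical feature labels, following the categories named in the text.
PHORICITY = {"exophora", "endophora: anaphora", "endophora: cataphora"}
REF_TYPE = {
    "personal: existential",
    "personal: possessive",
    "demonstrative: near",
    "demonstrative: far",
    "demonstrative: neutral",
    "comparative: identity",
    "comparative: similarity",
    "comparative: difference",
}


def is_valid_selection(features: set[str]) -> bool:
    """Check a feature bundle against the two simultaneous systems.

    Curly brace: one option must be taken from each system.
    Square bracket: options within a system exclude one another.
    """
    if not features <= PHORICITY | REF_TYPE:
        return False  # contains a label outside both systems
    return len(features & PHORICITY) == 1 and len(features & REF_TYPE) == 1


# Combinations taken from the caption above.
print(is_valid_selection({"exophora", "personal: existential"}))   # True
print(is_valid_selection({"exophora", "endophora: anaphora"}))     # False
```

Representing a reference as a set of labels keeps the sketch close to the bracket notation; a stricter encoding could use one enumerated field per system instead.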
Visual Reference
We can map visual reference according to the categorization of verbal reference. First, it includes exophora and endophora. Those whose referents are outside the visual–verbal text are exophoric, and those whose referents are within the text are endophoric. Second, visual reference has its own personals, demonstratives and comparatives. But unlike verbal reference, whose personals, demonstratives and comparatives are mutually exclusive, the visual counterparts can coexist owing to the ‘polyinterpretability’ of visual images (Van Leeuwen, 1991: 112). For instance, a close-up shot of President Obama’s reappearance in the news can be both personal, i.e. [REAPPEARANCE: repetition], and demonstrative, i.e. [PROXIMITY: detail], as we shall see later. The following sections examine how these types of visual reference occur and function in television news. The analytic units are primarily independent shots or key-frame images captured from the news footage.
Visual reappearance
Palmer (1989: 321–322) offers an early elaboration of visual cohesion through the notion of ‘visual recurrence’. He summarizes five types of visual recurrence: recurrence (‘straightforward repetition’), parallelism (‘repetition of formal elements with change of content’), paraphrase (‘repetition of content with some change of formal elements’), ellipsis (‘partial repetition of the item’) and surface signals (‘markers of tense, aspect, and juncture’) (1989: 321–322). His notion of ‘visual recurrence’ maps well onto visual personal reference (see note 1), dubbed here ‘reappearance’, a term borrowed from Tseng (2012, 2013). As Tseng claims, audiovisual reference is the ‘reappearance’ of filmic elements across shots or scenes. When a person, for instance, is interviewed in television news, his or her torso/profile may appear repeatedly during the interview. In this way, his or her identity can be traced and located. If an element reappears with full repetition of content and form, or with repetition of content and some change of formal elements, just like Palmer’s (1989) ‘recurrence’ and ‘paraphrase’, I will call it [repetition]. Take [4] as an example:
[4] (NDY, BBC, 2011) (LS = Long shot; MCU = Medium close-up shot):
Shot 9: LS, from behind, the reporter and the man are talking and watching Obama’s TV talk.
↑ [REAPPEARANCE: repetition]
Shot 10: MCU the interviewee. Caption: SCOTT TALBOTT: The Financial Services Roundtable.
The reappearance of the same person in shots 9 and 10 can help viewers locate and identify him, whom we can recognize from the caption in shot 10.
If an element reappears only partially, as in Palmer’s ‘ellipsis’ and ‘partial signals’ (see note 2), I will call it [meronymy]. For instance, the image of rolling numbers on the screen may serve as a sign for the US national debt. It shows a part-and-whole relationship (or meronymy) between the referred and referring images, explicitly or symbolically.
Visual demonstratives
Demonstrative reference in language is essentially ‘a form of verbal pointing’ (Halliday and Hasan, 1976: 57). By verbal pointing such as this and that, a referent becomes identifiable. Some visual images also possess the nature of ‘pointing’, which we call ‘visual demonstrative reference’. A visual demonstrative reference can be achieved by [proximity] or [direction]. [Proximity] denotes the spatiotemporal distal or proximal location of the referred image. Sometimes a temporal distance can be identified through colour. Black-and-white images, for instance, may indicate a past time while coloured images may show the present (Tseng, 2008: 96). Nevertheless, [proximity] is mostly realized by the size of images (Janney, 2010: 253). Van Leeuwen (1991: 98) regards an image elaborated with a distal, long shot as its ‘overview’ while an image depicted with a proximal, close shot as its ‘detail’. Tseng (2012) has a similar view and categorizes visual elements into ‘generic’ and ‘specific’, but her categorization seems to include also other modalities such as verbal text (cf. Tseng, 2013: 41–48). To distinguish the visual from the verbal and to foreground visual proximity, I borrow Van Leeuwen’s idea here. As a result, a [proximity] can be realized by either [colour] or [size] whereby [size] is further realized by [overview] or [detail]. For example, shot 7 in [5] establishes an overall setting of the event by zooming away from shot 6. In [6], the participant (a man at the counter) in shot 16 is part of the setting depicted in shot 15. As a result, the image of the man has been drawn near by the zooming technology from an ‘overview’ to a ‘detail’.
[5] (NDY, BBC, 2011) (CS = Close shot; LS = Long shot; R = Right):
Shot 6: CS some Smurfs at the opening of the stock market.
↑ [PROXIMITY: overview]
Shot 7: LS the opening of the stock market. Camera pans slightly R and zooms away.
[6] (NDY, BBC, 2011) (PS = Panoramic shot; CU = Close-up shot):
Shot 15: PS people are dining at a restaurant.
↑ [PROXIMITY: detail]
Shot 16: CU a man is working at the counter.
[Direction] refers to the vector signalled by a visual image or an element in the image (cf. Kress and Van Leeuwen, 2006[1996]: 60–62, 70–72). It can be realized by both [camera position] and [participant position]. A [camera position] involves the camera movement when taking pictures such as zooming, high/low angle, tilting, panning, etc. whereby the camera acts as a referring object, cueing viewers to the referred item (Janney, 2010: 253). A [participant position] concerns the participants’ action direction or point of view such as gazes, gestures and movements. Gaze can be designed as a vector of actions (cf. Kress and Van Leeuwen, 2006[1996]: 116–124), pointing viewers’ attention from the present act to the next. For instance, we may follow the direction of a participant’s gaze to notice what he or she is looking at, as Figure 2 shows.

Gaze as [DIRECTION: participant/camera position] (CNN News, 2011).
Figure 2 shows two successive key-frame images extracted from a news footage shot. The camera established the setting with a long shot of the former Italian Prime Minister Silvio Berlusconi. Initially, he was speaking to one guard, and then turning right and looking out of the frame. At the same time, the camera followed his gaze to include another three guards as the right image of Figure 2 shows. As a result, a [direction] reference is realized by both participants’ and camera’s positions.
A [direction] reference can also be realized by participants’ pointing gestures. In [7], for instance, the camera first took a long shot of a person (shot 8), and then followed his pointing finger to take a big close-up shot of the robots’ eyes (shot 9). In this process, the pointing gesture acts like a demonstrative, cueing viewers to an intended item.
[7] (RSP, BBC, 2010) (LS = Long shot; CU = Close-up shot):
Shot 8: LS a man introducing the robot seals. Camera zooms in on him.
↓ [DIRECTION: participant position: gesture]
Shot 9: Big CU the robots; camera zooms in on their eyes with the man’s gestural direction.
Another way of realizing [direction] is participants’ movement, or match on action. Sometimes moving images are designed to match actions so that a coherent visual text is likely to be achieved. For example:
[8] (NDY, BBC, 2011) (MLS = Medium long shot; MCS = Medium close shot; R = right):
Shot 4: MLS the door from which Obama walking out to R.
↓ [DIRECTION: participant position: movement]
Shot 5: MCS President Obama giving a speech.
Shot 4 in [8] shows President Obama walking out of the door and turning right. In shot 5, we see that he is making a speech. Thus, his moving direction in shot 4 can be seen as a cataphoric act, cueing viewers to predict the next image.
Visual comparatives
Janney (2010: 254) states explicitly that filmic images have comparative reference. As he says, ‘although film lacks lexical means of expressing comparative concepts, it has means of envisualizing likeness and differences.’ Van Leeuwen (2005: 229) also points out that the connection between images can sometimes be contrastive or similar. An image in a filmic-televisual text can be seen as signalling or referring to its previous or subsequent images if it is similar to or different from them in either content or form. In other words, we may have a visual comparative reference in terms of difference or similarity. This idea is further supported by Palmer’s (1989: 321–322) notion of ‘recurrence’, as hinted in note 1. His ‘parallelism’ is analogous to the comparative system of [difference], and we can group his ‘recurrence’, ‘paraphrase’, ‘ellipsis’ and ‘partial signals’ into the system of [similarity]. Consider two examples:
[9] (HRC, CCTV, 2011) (CU = Close-up shot):
Shot 17: CU stacks of American dollars.
↑ [SIMILARITY]
Shot 18: CU another stack of American dollars.
[10] (PIPM, CNN, 2011) (LS = Long shot; R = Right; CU = Close-up shot):
Shot 14: LS Berlusconi speaking to guards with ease, hands in pocket. He then turns R; camera follows him talking to three guards.
↑ [DIFFERENCE]
Shot 15: CU some people waving white sheets outside windows. Some stand at the window, watching the demonstration.
Images in [9] reinforce the similarity between shots 17 and 18. This is achieved with a strategy of ‘paraphrase’, borrowed from Palmer (1989: 322); that is, the same image content of ‘the US dollar banknotes’ is filmed from two different angles. By contrast, images in [10] stress the differences between shots 14 and 15. One is about Berlusconi’s privilege as the then Prime Minister of Italy while the other is about a mass demonstration against his sex scandal and misuse of his privileged power.
To summarize, we can outline visual reference as shown in Figure 3.

Types of visual reference.
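Because visual categories can coexist, a single shot-to-shot link is better thought of as a bundle of features than as one exclusive choice. The sketch below illustrates this under stated assumptions: the mapping table and the `categories` helper are hypothetical names introduced for illustration, and the system labels simply follow the bracketed annotations used in the extracts above.

```python
# Top-level category for each visual system named in the text.
# The table itself is an illustrative assumption, not the figure.
SYSTEM_TO_CATEGORY = {
    "REAPPEARANCE": "personal",
    "PROXIMITY": "demonstrative",
    "DIRECTION": "demonstrative",
    "SIMILARITY": "comparative",
    "DIFFERENCE": "comparative",
}


def categories(features):
    """Return the top-level categories realized by a bundle of visual
    features such as "PROXIMITY: detail" or "SIMILARITY"."""
    result = set()
    for feature in features:
        # The system name is whatever precedes the first colon.
        system = feature.split(":", 1)[0].strip()
        if system not in SYSTEM_TO_CATEGORY:
            raise ValueError(f"unknown visual system: {system}")
        result.add(SYSTEM_TO_CATEGORY[system])
    return result


# The close-up reappearance of a participant discussed earlier can be
# personal and demonstrative at once.
close_up = {"REAPPEARANCE: repetition", "PROXIMITY: detail"}
print(sorted(categories(close_up)))  # ['demonstrative', 'personal']
```

Returning a set rather than a single label is the design choice that separates this sketch from a model of verbal reference, where one reference type excludes the others.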
Visual–Verbal Reference
Visual–verbal reference concerns reference relations between the verbal and visual tracks whereby the referring item and its referent are presented synchronously across the two tracks. Montgomery (2007: 97–98) treats it as a ‘co-reference’ between the visual and verbal tracks that is governed by two principles of intelligibility (see the Introduction to this article). Drawing on this notion, we can explicate the visual–verbal reference in two aspects. First, it is the reference crossing two tracks; i.e. if the referring item is in the visual track, its (potential) referent must be in the verbal track, and vice versa. Second, it presupposes a synchronous presentation of the referring item and its referent. The referring item and its referent must be presented at the same time or nearly at the same time. It therefore excludes non-synchronous visual–verbal reference where a referring item in the visual may appear much earlier or much later than its referred item in the verbal or the other way around. Three types will be discussed in the ensuing sections: complementary, visual-as-bridge and parallel.
Complementary co-reference
A well-formed multimodal text, in Royce’s (1998) view, realizes visual–verbal complementarity. As Royce states, ‘the visual and verbal modes occurring in page-based multimodal text complement each other semantically to produce a single textual phenomenon in a relationship’ (p. 103). Martinec and Salway (2005: 343) hold a similar view, ‘when an image and a text are joined equally and modify one another, their status is considered complementary’. This is also applicable to visual–verbal reference in television news reports. Messages from both visual and verbal tracks can refer to the same item simultaneously (see Figure 4). As Montgomery (2007: 105) points out, a television news report is governed by ‘the presumption of overlapping or complementary reference between the verbal and visual’.

Visual–verbal complementary co-reference.
Complementary co-reference is realized by either visual-to-verbal reappearance or verbal-to-visual reference. According to Barthes (1977: 25), an image can illustrate a text so that the text information can be confirmed and authenticated (also see Van Leeuwen, 1991: 89–90, 2005: 230). I take this as a visual-to-verbal reappearance in the sense that such image illustration is often the reappearance, partially or fully, of the element mentioned in the verbal track (cf. Tseng, 2012, 2013; Tseng and Bateman, 2012; see also the ‘visual reappearance’ section of this article). For example: [11] (NDY, BBC, 2011)(PTS = Panoramic telephoto shot):
Shot 3: PTS
[Diegetic sound:] (broadcasting) What would a default on the nation’s debt mean for you?
[Reporter’s voiceover:] The nation waits on tenterhooks for
Here, ‘the Congress’ in the verbal track can be seen as referring to the concept of the US Congress, which is also denoted by the image of ‘the Capitol’ in the visual track. Thus, the two tracks form a co-reference. However, the image of the Capitol only partially stands for the Congress, and so it is a meronymic visual reappearance. [12] shows a fully repetitive visual reappearance of ‘Smurfs’: [12] (NDY, BBC, 2011) (CS = Close shot; LS = Long shot; R = Right):
Shot 6: CS some Smurfs at the opening of the stock market.
[Diegetic sound:] Rhythmic tinkling.
[Reporter’s voiceover:] Sometimes it might seem
it’s in the hands of clumsy Smurfs
Shot 7: LS the opening of the stock market. Camera pans slightly R and zooms away.
who helped open the New York Stock Exchange this morning
The image of Smurfs (in shot 6) in the visual track and the ‘clumsy Smurfs’ in the verbal track can be seen as fully referring to each other; the former is a visual repetition of the latter while the latter acts as verbal (re)mentioning of the former.
Visual-as-bridge reference
Visual-as-bridge reference takes the visual track as a bridge between the referring item in the verbal text and the referent in the context (see Figure 5). This involves two aspects. First, the visual track acts as a bridge that connects the verbal text and the referent. Second, the visual track as the bridge is essential: the verbal track cannot form the reference on its own. It must first refer to an item in the visual track, which then denotes the referent in the context. This differs from visual–verbal complementary co-reference, in which the verbal item can identify its referent even without resorting to the visual track; in visual-as-bridge reference it cannot. In other words, a verbal item in television news can be used to refer to a visual image and then denote its meaning in the real world. During this process, the visual image acts to anchor the news as if it is happening ‘here’ and ‘now’ while the verbal message serves, through reference, to comment on the relevant images (Edginton and Montgomery, 1996: 97; Montgomery, 2007: 94–97). This is quite different from films. In a film, a participant (usually a character) is recognizable from the structure of the narrative, and visual and verbal messages are designed to tell coherent stories. An image in a film is not intended to anchor the story in the here and now of the real world, nor does a verbal message comment on the shot itself. In other words, it is common for a reporter in television news, who is outside the news story, to comment on the image depicted in the news footage through a deictic reference such as this or here (Montgomery, 2007: 94–97, 104), whereas it is fairly rare for a character in a fictional film to comment on its moving pictures. Extract [13] illustrates this point: [13] (PIPM, CNN, 2011) (CU = Close-up shot):
Shot 19 CU a sign ‘Vergogna’ on a young woman’s face.
[Reporter’s voiceover:] Message

Visual-as-bridge reference.
In [13],
Besides deixis or demonstratives, personal pronouns can also form visual-as-bridge reference. In a sound bite in television news, the speaker often uses first-person pronouns to refer to his or her personal identity, experiences or opinions, with his or her torso or profile simultaneously depicted in the visual track. For example: [14] (NDY, BBC, 2011) (CU = Close-up shot):
Shot 18: CU
[Woman:] It’s er unfortunate that they’re putting the United States at risk in order to carry out their own personal agendas. So, yeah
The speaker in [14] addresses the camera directly. It is obvious the pronoun
Parallel reference
Visual–verbal parallel reference (see note 3) means that visual and verbal messages refer to items independently but are related in an indirect or abstract way. In other words, the visual and verbal tracks, at a specific moment, refer to items in a parallel way; neither explicitly complements the other, as Figure 6 shows.

Parallel visual–verbal reference.
Referent1 and referent2 are different, but potentially related, items. Usually, when parallel reference occurs in television news, the verbal message refers to one thing while its synchronous visual element refers to another. For example: [15] (HRC, CCTV, 2011) (LS = Long shot; PS = Panoramic shot):
Shot 4: LS a corner of
得到足够支持。在目前(15)
(17)
has not received enough support. Among the present
Shot 5: PS
根据(19)
According to (19)
Shot 4 in [15] refers to the scene of a Congress meeting while the verbal track tells us about the vote on Boehner’s scheme in the House. Shot 5 refers to the Capitol, which stands for the US Congress, while the verbal track tells us about the US legislative process. In both cases the two tracks are only implicitly related (both concern the solution to the American national debt).
Sometimes, visual messages are symbolically represented in a visual–verbal parallel reference. That is, the images are symbolically encoded so that the two tracks appear to present quite separate elements. Only after the visual messages are understood symbolically in accordance with the verbal messages can visual–verbal cohesion be read out. Take [16] as an example.
[16] (HRC, CCTV, 2011) (CU = Close-up shot):
Shot 8: CU, blurred, the process of printing
民主党议员(23)
Democrat (23)
Shot 9: CU, blurred, the flow of
[Diegetic sound:] Noises of the flowing coins.
[播音员画外音:] 已经明确表示反对(24)
[Presenter’s voiceover:] objects explicitly (24)
The visual track in [16] depicts a scene of printing money while the verbal track tells us about the Democrats’ objection to Boehner’s scheme. Literally, there is no obvious link between the visual and verbal tracks unless we associate the images with the news topic and the verbal information and interpret them symbolically.
To sum up, we can outline visual–verbal reference as shown in Figure 7.

Types of visual–verbal reference.
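The distinctions drawn in this section can be summarized as a small decision procedure. The sketch below is one possible operationalization, not the article's own formalization: the function name, the three boolean judgements and their ordering are assumptions introduced for illustration, and each judgement would in practice be made by the analyst rather than computed.

```python
from typing import Optional


def classify_visual_verbal(
    synchronous: bool,
    verbal_needs_visual: bool,
    same_referent: bool,
) -> Optional[str]:
    """Classify a pairing of a verbal item and a visual item.

    synchronous:         the two items are presented at (nearly) the same time
    verbal_needs_visual: the verbal item cannot reach its referent without
                         first pointing to the image (e.g. deictic 'this')
    same_referent:       both tracks refer to the same item
    """
    if not synchronous:
        return None  # non-synchronous pairings are excluded from the model
    if verbal_needs_visual:
        return "visual-as-bridge"
    if same_referent:
        return "complementary"
    return "parallel"


# Illustrative judgements loosely based on extracts [12], [13] and [15].
print(classify_visual_verbal(True, False, True))    # complementary
print(classify_visual_verbal(True, True, True))     # visual-as-bridge
print(classify_visual_verbal(True, False, False))   # parallel
```

Checking dependence on the visual track before sameness of referent reflects the point made above that the bridge is essential in visual-as-bridge reference; other orderings are conceivable.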
Reference Chain
Martin (2004[1992]: 140) demonstrates that participants in reference relations can be traced and identified through reference chains. As he puts it, ‘phoric items depend semantically on the items they presume … presuming items can themselves be presumed.’ This notion of reference chain has been introduced by Tseng (2012, 2013) into the analysis of multimodal filmic texts. According to her, filmic elements are chained together through the system of [presenting/presuming]. [Presenting] means ‘introducing an identity for the first time’ while [presuming] means ‘resources for tracking a previously presented identity’ (Tseng, 2012: 130). This idea is also applicable to reference chains in television news. All reference items (except exophora) can be chained together by the system of [presenting/presuming], whether they are within the visual or verbal track or across the two tracks, as Figure 8 shows.

System of reference chain in the text of television news.
Two points need to be stressed. First, the reference chain concerns mainly endophoric reference rather than exophoric reference. Second, a reference chain occurs not only within verbal or visual tracks, but also across them. For instance, a referent may be presented in the verbal track and then presumed in the visual, or vice versa.
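A reference chain of this kind can be represented as an ordered list of mentions grouped by participant. The following is a minimal sketch under stated assumptions: the `Mention` class and the helper names are hypothetical, the sample data are illustrative rather than taken from a transcript, and the sketch simplifies by labelling every mention after the first as [presuming], whereas an analyst may also treat later nominal expressions as re-presenting.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Mention:
    referent: str  # analyst's label for the participant
    track: str     # "verbal" or "visual"
    form: str      # the referring item as it appears


def build_chains(mentions):
    """Group mentions by referent in order of appearance. The first
    mention is labelled 'presenting', every later one 'presuming'."""
    chains = defaultdict(list)
    for m in mentions:
        mode = "presuming" if chains[m.referent] else "presenting"
        chains[m.referent].append((mode, m.track, m.form))
    return dict(chains)


def crosses_tracks(chain):
    """True if the chain links items from more than one track."""
    return len({track for _, track, _ in chain}) > 1


# Illustrative data only; the ordering is assumed for the example.
mentions = [
    Mention("US debt", "verbal", "the borrowing limit"),
    Mention("US debt", "verbal", "it"),
    Mention("US debt", "visual", "rolling numbers on screen"),
    Mention("Obama", "verbal", "the President"),
]
chains = build_chains(mentions)
print(chains["US debt"][0])               # ('presenting', 'verbal', 'the borrowing limit')
print(crosses_tracks(chains["US debt"]))  # True
```

Exophoric items would simply not be entered as mentions, in line with the first point stressed above.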
Sample Analysis
In this section, I will apply the above-sketched model to the analysis of two news items, which were broadcast separately by BBC’s News at Ten (or the BBC Ten O’clock News), in the UK, and CCTV’s News Simulcast (or ‘Xinwen Lianbo’ in Pinyin), in China. Both covered the following event. On 29 July 2011, the US House of Representatives cancelled a scheduled vote on a scheme proposed by the Republican leader, John Boehner, concerning the borrowing limit of the US national debt. With the deadline of 2 August looming, resolving the issue of the borrowing limit was becoming the main focus of attention within America and around the world.
The BBC case
BBC’s News at Ten covered this event in various ways, including news presentation, voice-overs, stand-uppers, sound bites and vox pop. Since it is too complex, and also unnecessary, to list all patterns of reference, I will concentrate on those that refer to the main participants in the news. These participants and their referring items are underlined in the following transcript.
[17] NDY (BBC’s News at Ten, 29 July 2011) (MCU = Medium close-up shot; PTS = Panoramic telephoto shot; MS = Medium shot; LS = Long shot; CU = Close-up shot; MLS = Medium long shot; MCS = Medium close shot; PS = Panoramic shot; R = Right; L = Left):
Shot 1: MCU Shot
[Presenter on camera:] In (1)
raising (2)
looming, leading (3)
through (4)
by another televised appeal by (5)
(6)
(7)
From Washington, (8)
Shot 2: MS
[Diegetic sound:] people humming in unison. (9)
[Reporter’s voiceover:] (10)
need of some divine inspiration. (11)
Shot 3: PTS
[Diegetic sound:] (broadcasting) What would a default on (12)
(13)
[Reporter’s voiceover:] (14)
(15)
Shot 4: MLS the door from which
Slower than expected growth. (16)
way out of (17)
Shot 5: MCS
[Obama:] There’re (18)
predict or avoid. Hurricanes, earthquakes, tornadoes, terrorist (19)
attacks. (20)
(to) solve (21)
Shot 6: CS some
[Diegetic sound:] Rhythmic tinkling. (22)
[Reporter’s voiceover:] Sometimes it might seem (23)
(24)
Shot 7: LS the opening of the
who helped open (25)
Shot 8: CU
[Diegetic sound:] Someone talking. (26)
[Reporter’s voiceover:] There aren’t many laughs (27)
Watching every twist and turn, (28)
lobbyist represents (29)
Shot 9: LS, from behind,
(30)
Shot 10: MCU
[Man1:] The possibility of a downgrade in (31)
will send ripples across (32)
standard for paying (33)
a rating if (34)
(35)
Shot 11: Flashing graph
[Reporter’s voiceover:] For decades (36)
been raising (37)
can (38)
(39)
(40)
There were (41)
(42)
(43)
(44)
Shot 12: Back to Shot 3.
[Diegetic sound:] (broadcasting) Working together is something (45)
Democrats and Republicans aren’t doing yet. (46)
[Reporter’s voiceover:] (47)
arrived in Washington, (48)
vote for any deal that allows (49)
Shot 13: MS
(50)
Shot 14: CU
[Paul Broun:] (51)
(52)
the things. (53)
(54)
Shot 15: PS
[Reporter’s voiceover:] At Washington’s Eastern Market (55)
(56)
Shot 16: CU
(57)
Shot 17: CU
[Man2:] totally disgusting (58)
the fourteenth amendment and overrides (59)
Shot 18: CU
[Woman:] It’s er unfortunate that (60)
at risk in order to carry out (61)
So, yeah (62)
Shot 19: LS
[Reporter on camera:] (63)
(64)
But (65)
minute. The one trouble with (66)
who actively think it would be a good thing to go over the brink. (67)
Mark Mardell BBC news Washington. (68)
The reference chains in [17] are outlined in Figure 9. Reference to minor participants has been omitted. Special cases such as text and extended reference (see note 4) are not shown in the figure but are explained in the notes. Information indirectly presumed by time expressions is treated in the same way. Arrowed lines indicate the direction of reference, whereby the presented items are pointed at by the presumed or re-mentioned items. Straight lines are used for visual–verbal reference.

Figure 9. Reference chains of the news item NDY in [17].
Figure 9 shows that the verbal track employs mostly personal pronouns as referring items. Nominal expressions are used to refer to participants when they are mentioned for the first time and are occasionally re-used when they are re-mentioned. According to this figure, four main participants can be identified in the verbal track: the US as a country, the borrowing limit (of the US national debt), Congressmen, and President Obama. The US is mentioned primarily by nominal expressions and personal pronouns. Nominal expressions such as ‘the United States/the US’ and ‘the world’s largest economy’ are used to accommodate its multiple identities, while pronouns such as ‘its’ and ‘we’ are used to track its reappearance. The US debt is first presented, and occasionally re-presented, by slightly different nominal expressions such as ‘the debt’, ‘the borrowing limit’ and ‘the debt ceiling’, but it is mostly presumed by the personal pronoun ‘it’. On one occasion the demonstrative ‘this’ is used to indicate its immediate re-mention. The third participant, Congressmen, is presented and presumed primarily by nominal expressions such as ‘Republicans’ and ‘idiots’ and personal pronouns such as ‘they’, and occasionally by demonstratives such as ‘these’. The fourth participant, President Obama, is referred to mainly with nominal expressions such as ‘the President’ and ‘Obama’.
Two comparative reference chains in the verbal track are also noteworthy. The first is $1 trillion<<2.8 trillion<<almost 6 trillion<<more than 11 trillion<<$14 trillion. The second runs parallel to the first: 18 increases<<4 increases<<7 more increases<<3 times. Both are illustrated in the visual track with comparative chains that match those in the verbal track (but with more precise numbers). These patterns of comparative reference present viewers with a coherent and concise picture of the US national debt, vividly illustrating the increases in the borrowing limit from the Reagan administration to the Obama administration.
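To make the notion of a comparative chain concrete, the two verbal chains quoted above can be sketched as ordered lists. This is only an illustrative sketch: the class `ChainItem`, the helper functions and the numeric readings (for instance, treating ‘almost 6 trillion’ as 6.0) are assumptions introduced here and are not part of the model itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChainItem:
    wording: str   # the expression as quoted in the analysis
    value: float   # rough numeric reading (an assumption for illustration)


def render(chain: List[ChainItem]) -> str:
    """Join the items with '<<', the notation used in the text above."""
    return "<<".join(item.wording for item in chain)


def is_ascending(chain: List[ChainItem]) -> bool:
    """True if every item has a larger value than the one before it."""
    return all(a.value < b.value for a, b in zip(chain, chain[1:]))


# First chain: size of the debt (values in trillions of US dollars).
debt_chain = [
    ChainItem("$1 trillion", 1.0),
    ChainItem("2.8 trillion", 2.8),
    ChainItem("almost 6 trillion", 6.0),       # approximate reading
    ChainItem("more than 11 trillion", 11.0),  # approximate reading
    ChainItem("$14 trillion", 14.0),
]

# Second chain: number of increases of the borrowing limit.
increase_chain = [
    ChainItem("18 increases", 18),
    ChainItem("4 increases", 4),
    ChainItem("7 more increases", 7),
    ChainItem("3 times", 3),
]

print(render(debt_chain))
print(is_ascending(debt_chain), is_ascending(increase_chain))
```

Under these assumed readings, the first chain rises steadily while the second does not, which suggests that the two chains are parallel in sequence rather than in magnitude.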
From the visual track, we can identify three main types of participants. The first is the setting: the Capitol, as a general setting, is filmed from different angles and positions and runs throughout the visual text (shots 3, 12 and 19). The second type is journalists: the reporter, for instance, appears several times (shots 8, 9, 13, 19), either interviewing somebody or making a direct visual address. His personal involvement not only brings the reported event closer to viewers but also reinforces his journalistic authority over the authenticity of the news (Zelizer, 1990). The third major type is social actors in the news. Consider two examples. President Obama is one of the main figures depicted in the visual track. He is first presented in shot 4, walking to the platform. His image is then presumed in shots 5 and 8 by way of visual reappearance. Interestingly, the image in shot 4 can also be seen as a demonstrative reference to the image in shot 5, which shows him making a speech. This is achieved by [direction: participant position]. Another main figure in the news is an interviewee named Scott Talbott. Besides the verbal information and visual images that identify him, his ‘reappearance’ is also supported by a caption on the screen (i.e. ‘TALBOTT: The Financial Services Roundtable’). In short, the visual track in [17] contains three kinds of reference chains that are formed mainly through visual reappearance of the participants. Through these visual reference chains, we can identify the participants and then understand the relevant visual messages.
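As a rough illustration of how such visual reference chains might be assembled, the sketch below groups depictions by participant, in shot order. The shot numbers are those reported in the paragraph above; the `Depiction` class and the `build_visual_chains` function are hypothetical names used only for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Depiction:
    participant: str  # who or what is shown
    shot: int         # shot number in the transcript


def build_visual_chains(depictions: List[Depiction]) -> Dict[str, List[int]]:
    """Group shots by participant; each sorted list is one reappearance chain."""
    chains: Dict[str, List[int]] = defaultdict(list)
    for d in depictions:
        chains[d.participant].append(d.shot)
    return {name: sorted(shots) for name, shots in chains.items()}


# Depictions as reported in the analysis of [17] (order of entry is arbitrary).
depictions = [
    Depiction("the Capitol", 3),
    Depiction("President Obama", 4),
    Depiction("President Obama", 5),
    Depiction("President Obama", 8),
    Depiction("the reporter", 8),
    Depiction("the reporter", 9),
    Depiction("the Capitol", 12),
    Depiction("the reporter", 13),
    Depiction("the Capitol", 19),
    Depiction("the reporter", 19),
]

chains = build_visual_chains(depictions)
for name, shots in chains.items():
    # The first shot presents the participant; later shots presume it.
    print(name, "presented in shot", shots[0], "presumed in", shots[1:])
```

The first element of each list corresponds to the presenting image and the remaining elements to images that presume it through visual reappearance.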
Extract [17] includes at least two major patterns of visual–verbal reference: complementary and visual-as-bridge. At least 12 cases of complementary reference are recognizable. Some concern background settings such as ‘the Capitol’ and ‘the stock exchange’, and some concern social actors such as ‘President Obama’ and ‘Smurfs’. Social actors are co-referred to by the visual and verbal items primarily through [visual repetition/verbal reference]. To illustrate, shot 4 presents an image of ‘President Obama walking to the platform’ while the corresponding verbal information is also about him (‘the President’ at line 16). Thus, the visual image is a full repetition of the verbal referential information, co-referring to President Obama together with the verbal information. By contrast, the settings are co-referred to across the two tracks mainly through [visual meronymy/verbal reference]. For example, shot 7 presents the opening of the Stock Exchange while the verbal track refers explicitly to the ‘New York Stock Exchange’. Thus, the visual is a partial reappearance of the verbally referred entity, forming a [visual meronymy/verbal reference] co-reference.
Figure 9 identifies six cases of visual-as-bridge reference, including reference to journalists, interviewees and places. Most of them are formed through deictic first-person pronouns such as ‘I’ and ‘we’ and demonstratives such as ‘here’ and ‘this’. To illustrate, shot 19 presents the image of ‘the reporter standing in front of the Capitol’ while in the verbal track the reporter is making a direct visual address and referring to the place depicted in the visual with a deictic ‘here’. The visual track thus serves as a bridge through which the referent in the real world is signalled.
The CCTV case
Unlike the previous case, the present one covers the event mainly through the presenter’s direct visual address and voiceovers, without any field reports such as stand-uppers, sound bites or vox pop (see [19]).
[19] HRC (CCTV’s News Simulcast, 29 July 2011) (abbreviations as in [17]):
Visual track
Verbal track
Shot 1: MCS
1  [播音员:] (1)
2  取消原定于(3)
3  (6)
4  (8)
5  [Presenter:] (1)
Shot 2: PS
6  [播音员画外音:] 美国众议院共和党议员(10)
7  [Presenter’s voiceover:] The Republican Representative (10)
Shot 3: LS
8  [Diegetic sound:] voices and noises.
9  [播音员画外音:] 在宣布暂时取消(11)
10  在于议长(13)
11  [Presenter’s voiceover:] said, when announcing the cancelation of (11)
Shot 4: LS a corner of
12  得到足够支持。在目前(15)
13  (17)
14  has not got enough supports. Among the present
Shot 5: PS
15  根据(19)
16  According to (19)
Shot 6: CU
17  (21)
18  be passed in (21)
Shot 7: Super CU
19  给(22)
20  signed by (22)
Shot 8: CU, blurred, the process of printing
21  民主党议员(23)
22  Democrat (23)
Shot 9: CU, blurred, the flow of
23  [Diegetic sound:] Noises of the flowing coins.
24  [播音员画外音:] 已经明确表示反对(24)
25  [Presenter’s voiceover:] objects explicitly (24)
Shot 10: LS
26  [Diegetic sound:] Part of his speech.
27  [播音员画外音:] 按照(25)
28  提高(26)
29  [Presenter’s voiceover:] According to (25)
Shot 11: CS
30  并在
31  and to reduce nine hundred seventeen billion dollars of (29)
Shot 12: MLS part of
32  (30)
33  (30)
Shot 13: CU blurred, counting
34  美国白宫发言人(32)
35  (33)
Shot 15: PS the left front of
36  敦促(34)
37  urged (34)
Shot 16: MCS
38  [Diegetic sound:] Part of his speech.
39  [播音员画外音:] 能在(35)
40  [Presenter’s voiceover:] that could be passed in (35)
Shot 17: CU stacks of
41  (36)
42  (36)
Shot 18: CU another stack of
43  都把(38)
44  put (38)
Shot 19: CU
45  在提高(40)
46  it should be very easy to reach an agreement on (40)
The reference chains in [19] are outlined in Figure 10, which shows that the verbal track contains six main reference chains through which six participants are identifiable: 美国国会 (the US Congress), 债务上限 (the borrowing limit), 投票 (the voting), 赤字 (the deficit), 博纳 (John Boehner) and 博纳的方案 (Boehner’s scheme). Five of them, all except 美国国会 (the US Congress), are presented and presumed with nearly the same nominal expression each time. For example, 博纳的方案 (Boehner’s scheme) is mentioned six times: three times as ‘博纳的方案’ (Boehner’s scheme), once as ‘方案’ (scheme), once as ‘这个方案’ (this scheme) and once as ‘博纳的最新方案’ (Boehner’s latest scheme). By contrast, 美国国会 (the US Congress) is mentioned with at least eight different nominal expressions (and one pronoun, ‘其’ (their), at line 43), such as ‘美国众议院’ (the US House of Representatives), ‘美国国会’ (the US Congress), ‘众议院’ (the House of Representatives), ‘参议院’ (the Senate), ‘些议员’ (some Congressmen), ‘两党’ (the two parties), ‘国会两院’ (both houses of the Congress) and ‘立法者’ (legislators) (see note 5).

Figure 10. Reference chains of the news item HRC in [19].
Second, some deictics and demonstratives are also noticeable in the verbal track, such as ‘当地时间’ (local time), ‘当晚’ (that night) and ‘当天’ (that day). Such spatiotemporal reference is likely to help promote the factuality of the news. At the same time, however, it may also undermine other news values: ‘当地时间’ (local time) indicates an event happening ‘there’, not ‘here’. Similarly, ‘当晚’ (that night) and ‘当天’ (that day) indicate an event happening ‘in the past’, not ‘at present’. As a result, such uses can implicitly situate the event in a past time and a remote place and thus undermine such news values as ‘recency’ and ‘proximity’ (Bell, 1991; Galtung and Ruge, 1965; Montgomery, 2007; Van Dijk, 1988).
Four main participants are identifiable in the visual track of [19]: the Capitol, US dollar banknotes, John Boehner and Jay Carney. When depicting the Capitol, the camera first takes a panoramic shot of the building (shot 2) and then zooms in on its interior (shots 3 and 4). The image of the Capitol is re-depicted in shots 12 and 19 with a medium long shot and a long shot. Each time it is depicted, it serves as a general setting for the verbal messages. The images of people in the news are edited for nearly the same purpose. The report covers three main news figures with five different shots: one medium close shot for the presenter (shot 1), a long shot plus a medium close shot for John Boehner (shots 10–11) and a panoramic shot plus a medium close shot for Jay Carney (shots 15–16). Both Boehner’s and Carney’s images recur in the visual track, moving from a background ‘overview’ to a foregrounded ‘detail’, but this recurrence still looks distant to viewers for the simple reason that the figures do not face the camera and their voices are lowered or erased. In other words, they are not portrayed as directly interacting with viewers, but as third-party characters or part of the background setting. Background settings, of course, can serve as specifying circumstances or as ‘being there’ in the news field, so that their (re)depiction might suggest the truth of the news accounts. However, too much scanning of background settings, rather than direct visual address, might reduce such effects and create a sense of distance instead (cf. Feng, 2013: 270; Zhou and Xu, 2002: 26).
US dollar banknotes are the most frequently depicted element in the visual track of [19] (shots 8, 9, 13, 14, 17, 18). They are first presented with a close-up image in shot 8 and then presumed with a contrastive close-up image of ‘flowing coins’ in shot 9. Later, shots 13 and 14 pick them up again through a similar close-up image – ‘counting money’. Finally, shots 17 and 18 re-depict them with similar contents and forms – ‘stacks of US dollar banknotes’. These shots and images not only (re-)present the US dollar banknotes but also illustrate and reiterate, symbolically, a general theme of this report – the US national debt.
Figure 10 demonstrates two prominent types of visual–verbal reference: complementary and parallel. The first complementary reference shows a [visual meronymy/verbal reference] relation between shot 2 and line 3. The image of the Capitol in shot 2 maps partially onto ‘美国国会’ (the US Congress) at line 3 in the verbal track, and vice versa. The second is also of the [visual meronymy/verbal reference] type. The meeting room in shot 4 illustrates part of the verbal message – ‘众议院’ (the House of Representatives) at line 12, which in turn signals a congress meeting depicted in the visual. The third, in shots 15–16 and at line 34, is a visual–verbal repetition whereby the visual and the verbal co-refer to the White House spokesman ‘卡尼’ (Jay Carney).
Visual–verbal parallel reference is the most common type in [19]. Apart from the three cases of complementary reference mentioned above, almost no visual or verbal information explicitly refers to its synchronous counterpart. Among these cases of parallel reference, two types are prominent. One is visual-as-being-there; the other is visual-as-symbolic. Let me first exemplify the former. Shot 3 presents an image of ‘a Congress meeting room’ while the corresponding verbal information involves ‘博纳’ (John Boehner), ‘方案’ (the scheme) and ‘投票’ (the voting). None of these matches the visual element well, but the image seems to indicate that the event mentioned at lines 6–11 takes place right there in ‘the Congress meeting room’. It thus indirectly echoes the verbal information. The latter type can be illustrated by the following examples. Shot 19 presents a picture of a clock hand (or time), but the corresponding verbal information is about ‘债务上限’ (the borrowing limit). The visual and verbal messages are not compatible with each other, unless we interpret the image of the ‘clock hand’ as a symbolic indicator of the looming deadline for the US national debt. Such a symbolic relationship also occurs between shot 8 and line 24. The visual in shot 8 is about ‘the US dollar banknotes’ while the verbal at line 24 talks about ‘Boehner’s scheme over the US borrowing limit’. They are not visually–verbally cohesive, but they can be potentially coherent if we see the visual information as symbolically indicating the US national debt mentioned in the verbal.
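The three general types of synchronous visual–verbal reference discussed in the two cases can be summarized, very roughly, as a decision rule. The sketch below is a deliberate simplification offered only as an illustration: the function `classify`, the set `DEICTICS` and the idea of reducing a shot to a set of depicted entities are assumptions of this sketch, and symbolic or ‘being-there’ readings of parallel reference are not modelled.

```python
from enum import Enum
from typing import Set


class CrossRef(Enum):
    COMPLEMENTARY = "complementary"
    VISUAL_AS_BRIDGE = "visual-as-bridge"
    PARALLEL = "parallel"


# Deictic items named in the analysis of the BBC case.
DEICTICS = {"I", "we", "here", "this"}


def classify(verbal_item: str, referent: str, depicted: Set[str]) -> CrossRef:
    """Classify one synchronous pairing of a verbal item and a shot.

    `depicted` lists the entities shown in the shot, in whole or in part
    (so meronymy counts as depiction in this simplified rule).
    """
    if referent not in depicted:
        # The image does not show what the words refer to.
        return CrossRef.PARALLEL
    if verbal_item in DEICTICS:
        # A deictic points through the image to the real-world referent.
        return CrossRef.VISUAL_AS_BRIDGE
    return CrossRef.COMPLEMENTARY


# BBC shot 4: 'the President' spoken over an image of Obama.
bbc_shot4 = classify("the President", "President Obama", {"President Obama"})
# BBC shot 19: the reporter says 'here' in front of the Capitol.
bbc_shot19 = classify("here", "the Capitol", {"the reporter", "the Capitol"})
# CCTV shot 19: '债务上限' spoken over an image of a clock.
cctv_shot19 = classify("债务上限", "the borrowing limit", {"a clock"})

print(bbc_shot4.value, bbc_shot19.value, cctv_shot19.value)
```

A fuller treatment would need to distinguish repetition from meronymy within complementary reference and to allow for the symbolic interpretations that make parallel pairings potentially coherent.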
By comparison, we can summarize some major reference commonalities between the BBC and CCTV cases:
Common use of various nominal expressions to refer to the participants in the verbal track;
Preference for personal reference over demonstratives and comparatives in the verbal track;
Common use of visual reappearance to form visual reference chains;
Common use of ‘proximity’ demonstratives such as ‘overview’ or ‘detail’ in the visual track.
Their differences, however, appear much greater than their commonalities. To start with, the BBC case contains more deictic personal pronouns and demonstratives than the CCTV case. Since the former adopts various live broadcasting patterns such as direct visual address, voiceovers and sound bites, it is unsurprising that deictic personal pronouns and demonstratives are commonly used. These deictics are interesting because they not only make the participants easy to identify but also achieve a sense of proximity and factuality. Moreover, the use of deictic personal pronouns and demonstratives indicates speakers’ personal experience of, and involvement in, the event (cf. Tannen, 2007: ch. 2). Such personal experience and involvement might serve as strong evidence for the authenticity of news accounts (Montgomery, 2001a, 2001b; Scannell, 2001). By contrast, the CCTV case seldom uses deictic personal pronouns and demonstratives. Instead, it uses many nominal expressions, even though they are wordy and repetitive. This is partly because this news item is presented solely by the presenter from the studio, without any other broadcasting patterns. Further, News Simulcast is a highly authoritative news bulletin programme (Ai, 2008: 124–125; Feng, 2013: 271). To avoid ambiguity, information needs to be clearly and accurately expressed, which often leads to the repetition of nominal expressions. Such repetition undoubtedly adds solemnity to the practice of news broadcasting, but at the same time it reduces its naturalness and authenticity and ultimately undermines its potential credibility.
As for the visual track, the BBC case tends to present participants and ideas in a live and direct manner. For example, within less than three minutes it uses one sound bite of President Obama’s televised speech, two fragments of journalists’ direct visual address and four sound bites of interviewees’ responses. By contrast, the CCTV case contains few direct audiovisual presentations apart from the presenter’s direct visual address at the beginning. It either pairs a large proportion of its visual images with the presenter’s voiceovers or takes some of them as background settings for the verbally presented news event. The BBC case also takes some visual images as background settings, but far fewer than the CCTV case does.
Lastly, the BBC case uses more complementary and visual-as-bridge reference and less parallel reference than the CCTV case. Most participants in the former are easily identifiable from both the visual and verbal tracks. It uses quite a few cases of visual-as-bridge reference, with which the news appears all the more factual, immediate and proximate (Allan, 2010; Bell, 1991; Galtung and Ruge, 1965; Montgomery, 2007; Van Dijk, 1988). By contrast, the CCTV case uses more parallel reference patterns than other ones. As a result, participants depicted in the visual track are often difficult to recognize, which is likely to lead to audio–visual incoherence even though the verbal text is usually well formed and self-contained.
Concluding Remarks
To conclude, this study has sketched a generalized model of reference relations in television news. It argues that reference patterns in the visual track of television news can be analyzed in the same way as those in the verbal track, which, drawing on Halliday and Hasan (1976), can be classified into personal reference, demonstrative reference and comparative reference. Visual images function as personal reference through the reappearance of the participants depicted in the visual track. Visual images are demonstrative in terms of direction and proximity. A direction-oriented visual demonstrative can be realized through film techniques such as participants’ gazes, gestures and movements, and camera positions. A proximity-oriented visual demonstrative can be achieved through the colours and sizes of the images filmed. A comparative visual reference is achieved mainly through the juxtaposition of similar or different shots and scenes. This article also discusses three major types of visual–verbal reference across the visual and verbal tracks: complementary co-reference, visual-as-bridge reference and visual–verbal parallel reference. All these patterns, mono-track or cross-track, are able to form reference chains by connecting the referring and referred items. Through these chains, participants referred to or depicted in the visual and verbal tracks can be identified.
I have applied this model to the analysis of two news items chosen respectively from BBC’s News at Ten and CCTV’s News Simulcast. The analysis shows that reference between the visual and verbal tracks contributes to the overall coherence and intelligibility of a television news report (Montgomery, 2007: 97–98). We have found that by locating the reference and reference chains, we can identify the participants encoded in television news, which might help us form a coherent reading of the news report. A television news report might not be verbally or visually self-cohesive, but it may become globally coherent as long as a visual–verbal co-reference is presumed. This view fits the BBC case well, even though its verbal text is not as consistent as that of the CCTV case. The latter seems to be verbally self-cohesive, but this is achieved mainly at the cost of visual–verbal incoherence. Its visual images are primarily designed as background settings or symbolic illustrations subordinate to the verbal information, which leads to a parallel, or unequal, and thus inconsistent presentation of messages between the visual and verbal tracks. The analysis also shows that different reference patterns in television news might have a considerable impact on the newsworthiness of the report. By analyzing the visual–verbal reference of the BBC case, we note that its report seems more factual, proximate and timely than that of the CCTV case. This is because, on the one hand, it uses more deixis and demonstratives to refer to the information in the visual track. On the other, it tends to present the news with the participants’ personal involvement, especially that of journalists and interviewees. Such a live and direct presentation not only guarantees a seamless correspondence between the visual and verbal messages but also accentuates the factuality of the news (Montgomery, 2006: 243). In the CCTV case, most participants are not presented in this way.
Another finding relates to multimodal genres. Television news, like film, is an audio–visually based multimodal genre whereby meanings are, for the most part, jointly accomplished through visual and verbal interactions. However, television news is also a particularized genre whereby the meaning-making process might be achieved solely by the verbal or visual track. Usually, messages in different tracks of television news are presented by different participants. A voiceover, say, may come not from the person depicted in the visual track but from a news presenter or reporter. Hence, messages from the visual track might differ substantially from those in the verbal track. As mentioned above, an image in the visual track can serve as a bridge between the verbal text and the outside world in order to achieve a sense of factuality, proximity and recency. Furthermore, the visual and verbal tracks might move separately and tell seemingly different stories at the same time, as in the case of the CCTV news. These features suggest that the visual and verbal messages can be meaningfully separated and distinguished when analyzing the meaning-making process of a television news report. This is quite different from some other multimodal genres such as film. For those genres, a separation of visual and verbal strands of analysis might be pointless and unnecessary because the visual and verbal, among others, are mutually complementary, and the meanings are only arrived at by a dynamic interaction and exchange between them. For instance, a film might push forward a story with audio and visual actions that are performed by one and the same actor. His or her audio actions must be seamlessly compatible with his or her visual actions and are thus inseparable from them.
There remain some limitations, though. First, visual reference needs further development. I have demonstrated that a single image can be identified as personal, demonstrative and comparative reference at the same time, but some implicit referential meanings still need to be specified owing to polyinterpretability (Van Leeuwen, 1991: 112). The second limitation concerns cross-reference between the visual and verbal tracks. As I suggested at the beginning of the ‘visual–verbal reference’ section, this article deals only with synchronous visual–verbal reference. Television news can exploit the synchronous presentation of information from both visual and verbal tracks to create a sense of immediacy, factuality or proximity. This is important for the news to achieve its intelligibility and newsworthiness (Montgomery, 2007). Nevertheless, television news presents information not only synchronously between the visual and verbal tracks but also chronologically both within and across the two tracks. This is quite similar to the genre of film. As Tseng (2012, 2013) suggests when addressing cohesive reference in film, an element may first appear in the visual mode and then reappear in the verbal mode, or vice versa. Third, modalities other than the visual and verbal modes remain untouched. It would be more informative, for example, to take into account the diegetic sound, colours, layout, etc., rather than just focusing on the image and voiceovers when analyzing the reference in an excerpt of news footage.
Footnotes
Acknowledgements
This article is supported by the University of Macau (Project No. MYRG (Y1-L4)-FSH12-MM). Thanks go to Professor Martin Montgomery for his guidance during my writing of this article, Professor Theo Van Leeuwen and reviewers for their insightful comments, and Alison Cheetham for proofreading my transcripts.
Notes
Biographical Note
Address: 6-201 Teachers’ Building, Jiangxi University of Finance and Economics, 169 Shuanggangdong Road, Lushannan Avenue, Changbei District, Nanchang, China.
