Abstract
This article introduces a new methodology for deriving the dynamics of visual segmentation in relation to the underlying cognitive processes involved. The method combines social semiotics approaches to visual segmentation with eye-tracking studies on authentic image viewing and simultaneous image description. The authors’ thesis is that visual segmentation suggested by the social semiotic approach is traceable in the behaviour of the viewers who perceive images while creating meaning. From this perspective, visual zooming is seen as perceptually, cognitively, grammatically and analytically relevant. The interdisciplinary approach developed in the article presents new perspectives on the ways images are segmented and interpreted.
1 Introduction
The notion of visual segmentation, image segmentation or scene segmentation has been discussed in various research areas. This has been done in visual perception (in terms of natural partitioning of a given image into meaningful regions and objects conducted by the viewers’ visual and cognitive system; see Henderson, 2007; Holsanova 2001, 2008), in computer vision and image processing (in terms of an automated partitioning of a given image into meaningful regions and objects conducted by feature extraction, edge detection, and region extraction; see DeLiang Wang, 2003) and in social semiotics (in terms of a contextually and functionally based description and partitioning of images as a communicative resource for meaning-making; see Boeriis, 2009, 2012; O’Toole, 2011[1994]).
The human visual system extracts and groups similar features and separates dissimilar ones, segments the scene into coherent patterns, identifies objects, distinguishes between figure and ground, perceives object properties and parts, and recognizes perceptual organization (Palmer, 1999). Our ability to perceive figures and meaningful wholes instead of collections of lines has been described in terms of gestalt principles of (a) proximity, (b) similarity, (c) closure, (d) continuation, and (e) symmetry (Köhler, 1947). For instance, the proximity principle implies that features and objects placed close to each other appear as groups rather than a random cluster. The similarity principle means that there is a tendency for elements of the same shape, brightness or colour to be seen as belonging together. The principle of closure implies that elements tend to be grouped together if they are parts of a closed figure. The continuation principle means that units that are aligned appear as integrated perceptual wholes. Finally, the symmetry principle implies that regions bounded by symmetrical borders tend to be perceived as coherent figures.
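The proximity principle lends itself to a simple computational illustration. The following is a minimal sketch only, not a model proposed in the literature cited here: it treats image elements as points and places two elements in the same group whenever they lie within a distance threshold of each other, directly or through a chain of neighbours. The function name `group_by_proximity` and the threshold `max_dist` are hypothetical and chosen for illustration.

```python
from math import dist


def group_by_proximity(points, max_dist):
    """Group 2D points: points within max_dist of each other (directly or
    through a chain of neighbours) end up in the same group."""
    parent = list(range(len(points)))  # each point starts as its own group

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of sufficiently close points.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return sorted(groups.values())


# Illustrative coordinates (invented): two tight pairs and one isolated element.
elements = [(0, 0), (1, 0), (10, 10), (11, 10), (30, 30)]
print(group_by_proximity(elements, max_dist=2.0))
```

A sketch of this kind covers spatial proximity only; the remaining gestalt principles (similarity, closure, continuation, symmetry) would each require features beyond position.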
In the following, we will focus on visual segmentation in social semiotics and in visual perception and cognition. The aim of our case study is to compare the social semiotics approach to visual segmentation (Boeriis, 2009, 2012) to eye-tracking studies of authentic image viewing and simultaneous image description (Holsanova, 2001). In particular, we will first apply categories and rank mechanisms from the social semiotic framework to a complex image. We will then compare the results of the semiotic analysis with the results of the eye-tracking study of authentic viewing and description of the same image. Our thesis is that visual segmentation suggested by the social semiotic approach is traceable in the behaviour of the viewers who perceive visuals while creating meaning. It becomes visible in visual fixation patterns accompanied by a simultaneous image description. Comparing semiotic analysis with eye-tracking measurements has proved useful in earlier work (Holsanova et al., 2006). By combining these two perspectives and methodologies, we hope to come to a better understanding of the dynamics of visual segmentation.
2 Background
2.1 Visual segmentation and the social semiotic approach
The social semiotic approach describes communicative resources as intersubjectively emergent rather than normative, rule-governed phenomena. The resource systems emerge from social use and, as such, are regulated by socially conventionalized communicative ‘aptness’ (Kress, 2010: 55). The individual visual text is the result of intentional strategic choices from the resource systems regulated by the given context. Consequently, an image is seen as a communicative product (multimodal text) constructed with an ideal model viewer/reader in mind. The situational and cultural context is seen as playing a crucial role in the meaning-making, regulating the communicative consequences of the grammatical choices realized by certain structural elements (Baldry and Thibault, 2006; Halliday and Matthiessen, 2004; O’Toole, 2011[1994]; Van Leeuwen, 2005). Consequently, from a social semiotic perspective, any grouping in an image is ‘suggested’ by the image itself in context, even the more top-down conceptual groupings discussed below, because the way the individual image is arranged as a communicative product constrains what groups may be conceived of in that particular image. An image may lend itself to several readings but the overall (model reader) meaning is a complex intertwining of all of them.
Preliminary studies show that there seems to be a socially shared understanding of the segmentation of visual texts and of the general role played by segmentation in visual communication (Boeriis, 2012). We assume that segmentation plays an important role in visual communication as part of the socially shared communicative resources employed both by text producers and by text receivers (Kress, 2010: 44–5).
One of O’Toole’s major contributions to visual social semiotics is his segmentation of images into four general rank scale levels called ‘Member’, ‘Figure’, ‘Episode’ and ‘Work’ (O’Toole, 2011[1994]). This is inspired by Halliday’s (2004: 32) segmentation of verbal language into ‘morpheme’, ‘word’, ‘group’ and ‘clause’. O’Toole proceeds from visual art and defines Figure as what we recognize as one (whole) depicted entity in the image. Episodes are groupings of Figures that are involved in shared processes of different kinds, and Members are elements of the Figures that play important roles in the overall meaning. Work is the overall level of the entire piece. He argues that when viewing an image we tend to ‘home in’ on configurations of Members, Figures and Episodes and then ‘a kind of shuttling process’ takes place between the individual parts and the whole image (O’Toole, 2011[1994]: 12). According to Baldry and Thibault (2006: 26), the reading of the fixated structure in images (they term it ‘cluster hopping’) is based on a number of mechanisms such as periodicity and variation. Their cluster theory builds on a multi-variable rank scale with the spatial grouping of elements as the major factor. As we shall argue below, we find that there is more to the grouping of elements than the fact they merely occupy the same area of space in the pictorial frame.
In their ambitious description of visual semiotic resources, Kress and Van Leeuwen (2001, 2006[1996]) do little explicit sub-segmentation of the visual text. Kress (2010) operates with a general multimodal distinction between ‘text’, ‘module’ and ‘sign’, in what he calls a ‘top-down approach’, where the elements are ‘shaped by the contingent circumstances of those who make the text in its social setting’ (p. 148, original emphasis). Kress focuses on rank scale levels as situated semiotic functions within visual text rather than as a priori scalar levels. We will elaborate the idea of a more dynamically conceptualized rank scale in section 3.1.
2.2 Visual segmentation and the cognitive approach
Within the cognitive approach, researchers in the field of scene perception investigate where and when people look in a scene, and why they do so (Henderson, 2007). In order to assess these processes in detail, they use eye-tracking methodology and controlled studies with a careful experimental design. Eye fixations have been considered to constitute a boundary between perception and cognition since they overtly indicate that information was acquired. Therefore, eye movements ‘provide an unobtrusive, sensitive, real-time behavioural index of ongoing visual and cognitive processing’ (Henderson and Ferreira, 2004: 18).
Where we look in a scene is partly determined by the scene constraints and partly by the viewer’s goal, interest and expectation. Current theory suggests two different mechanisms for eye movement guidance: bottom-up (driven by low-level image features, such as luminance, contrast, edge density, colour and motion; see Itti and Koch, 2000) and top-down (driven by high-level cognitive factors, such as task, goals, prior knowledge and expectations of the viewer). Separating the two is problematic, however, since fixated regions (for example, faces) often contain both low-level factors of visual saliency and high-level factors of potential semantic importance. The influence of, and interaction between, bottom-up and top-down factors is the subject of a current debate in the field (Foulsham and Underwood, 2007; Harding and Bloj, 2010; Nyström and Holmqvist, 2008).
What is the time-course in scene perception? Can we find recurring patterns and phases? First of all, there is agreement on our ability to get the ‘gist’ of the scene very early in the viewing process. Secondly, several researchers have found regularities concerning the time course of image viewing, characterized by specific patterns and phases of visual exploration. Back in the 1930s, George T. Buswell identified two general eye-movement patterns: an initial global and subsequent local phase of image exploration. The first consisted of a general survey in which the eyes moved with a series of relatively short pauses over the main portions of the image. The second consisted of a series of long fixations, concentrated over small areas of the image, evidencing detailed examination of those sections (Buswell, 1935: 142). Similar patterns have been identified by other researchers, where the phases have been given alternative names according to the inferred activity: evaluating – orienting phase, focal – ambient phase, exploration – inspection phase (Unema et al., 2005).
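The contrast between the two phases can be made concrete with a small sketch. It is an illustration under stated assumptions, not a reconstruction of Buswell's procedure or of any later classification method: a fixation is labelled 'local' when it is both long and close to the preceding fixation, and 'global' otherwise. The thresholds `min_long` (seconds) and `max_step` (pixels) are hypothetical values, and the sample fixations are invented.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Fixation:
    x: float         # horizontal position in pixels
    y: float         # vertical position in pixels
    duration: float  # fixation duration in seconds


def label_phases(fixations, min_long=0.3, max_step=50.0):
    """Label each fixation 'global' (short pause or large jump) or
    'local' (long fixation landing close to the previous one)."""
    labels = []
    previous = None
    for fixation in fixations:
        if previous is None:
            step = float("inf")  # the first fixation has no preceding saccade
        else:
            step = dist((fixation.x, fixation.y), (previous.x, previous.y))
        is_local = fixation.duration >= min_long and step <= max_step
        labels.append("local" if is_local else "global")
        previous = fixation
    return labels


# Invented sample: two short, widely spaced pauses, then two long, nearby fixations.
sample = [
    Fixation(100, 100, 0.15),
    Fixation(400, 120, 0.18),
    Fixation(410, 130, 0.45),
    Fixation(415, 128, 0.50),
]
print(label_phases(sample))
```

Real eye-tracking data would call for empirically motivated thresholds; the point here is only that the global and local patterns differ on two measurable dimensions, duration and spatial spread.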
Recent studies show a revival of so-called scanpaths, spatial and chronological sequences of fixations, which were introduced as a concept by Noton and Stark (1971). Scanpaths offer more information than single fixations and saccades since they ‘encompass a whole range of oculomotor data into one construct, which reveals visual processing over time and in space’ (Dewhurst et al., 2012).
Which units do viewers perceive? How do they segment the content of the scene? A visual fixation on an image element does not indicate what aspects and properties of the image element have been focused on, or at what level of abstraction. Scanpaths contain more information than single fixations but they do not reveal which concept was associated with it, or what the viewer had in mind. We still need some kind of referential framework in order to infer the ideas and thoughts to which these fixations and scanpaths correspond. In order to capture the dynamics of perception and meaning-making, we need two ‘windows to the mind’, in the form of eye movement protocols and (simultaneous) verbal protocols (Holsanova, 2001, 2008). This method is described and illustrated in section 3.2.
2.3 Differences and commonalities between the two approaches
One difference between the two approaches discussed above is that the semiotic approach considers visual segmentation to be a result of intentional, socially typical choices that are made to achieve the optimally desired communicative effect on a hypothetical model reader/viewer, whereas the cognitive approach focuses on the dynamic process of viewers’ actual interaction with the visuals from a reception perspective. While the semiotic approach emphasizes resources as socially shared knowledge based on conventions and consensus, the cognitive approach investigates general and individual patterns in the unfolding scene perception and segmentation. In particular, researchers study factors that may influence the perception and segmentation process, such as task or instruction, properties of various types of scenes and images, viewers’ expert knowledge, etc. The relation between expertise and the evolving socially shared semiotic system has been discussed extensively in recent social semiotics (e.g. Kress, 2010; Kress and Van Leeuwen, 2001) but the theoretical focus lies on ‘typical’ ways of doing things communicatively and therefore less on individual preferences and style. The social semiotics description of a fully evolved semiotic system will always be from an expert’s point of view.
Both approaches recognize principles of perceptual organization. In particular, the more structural gestalt principles based on proximity, similarity and other coherent patterns are crucial both from the social semiotic perspective and the cognitive perspective. As both approaches share these basic assumptions about structuring and perception of visual elements, we would at least expect to find evidence of the textual (structural) ranking mechanisms in the eye-tracking data.
3 Method
Firstly, we will describe visual segmentation based on a social semiotic approach (Baldry and Thibault, 2006; Boeriis, 2012; O’Toole, 2011[1994]). Secondly, we will present the cognitive approach and introduce the method of combining eye-movement analysis with simultaneous image descriptions (Holsanova, 2001).
3.1 Social semiotic method: the dynamic rank scale
The fact that an image may function as the overall work in one context and as part of a text in another poses a challenge for a static rank scale. As parts of overall wholes, images do not lose any unique features in the embedding and therefore they are at the same time a ‘whole’ and a ‘part’. We find that a good way to address this issue may be to adopt a dynamic approach to ‘wholeness’ in visual texts. In the analytical zoom approach (Boeriis, 2012), we zoom in or out between separate parts and more overall wholeness, and this dynamic approach renders it possible to analyse complex structures of parts and embedded wholes in an overall visual text. The analytical zoom is the analytical equivalent of the perceptual shuttling/cluster hopping discussed above. The hypothesis is that images are perceived as conceptual zooms back and forth from detail to wholes.
We operate with Boeriis’s (2012) four basic rank scale functions: ‘Whole’, ‘Group’, ‘Unit’ and ‘Component’ (see Figure 1). Although comparable to O’Toole’s rank levels, these functions are not distinct, static a priori levels, but rather dynamic text mechanisms in which certain parts or groups of elements may perform one of these functions.

Figure 1. The dynamic rank scale.
An image may function as a Whole framed on the wall, as a Unit in an article and as a webpage background. When analysing an article with an embedded image, it is possible and fruitful to ‘zoom in’ and analyse the image itself, and then relate to its function as Unit as we ‘zoom all the way out’. The analysis will be a process of motivated zooming in and out between different rank scale levels. The responsibility for the identifiable rank scale levels rests with the instantiated ranking mechanisms.
The segmentation of visual texts in social semiotics is typically based on gestalt theory laws of proximity and prägnanz (e.g. Baldry and Thibault, 2006; O’Toole, 2011[1994]), but other mechanisms have influence on the perceived grouping of elements. For instance, the gestalt laws of closure, symmetry and similarity are also relevant (Boeriis, 2009: 147). From a visual social semiotic viewpoint, we would expect to find a number of grouping mechanisms (rank-defining choices) that stem from different metafunctional realizations. To avoid resorting to ‘running commentaries’ (Bateman, 2008: 13), it is important to outline the important basic mechanisms defining the ranking in visual texts. Visual elements may play different united functional roles in the rank scale. Elements functioning as Components may join in certain ways and elements functioning as Units may group in certain other ways. These ‘ways of coming together’ we call ranking mechanisms.
Visual social semiotics inherits the concept of metafunctionality from Halliday (Kress and Van Leeuwen, 2006[1996]: 20), which is the idea that three types of meaning are always simultaneously conveyed in communication, namely ideational, interpersonal and textual meaning. Ideational meaning is the functional representation of the world around us, and it is often divided into two separate metafunctions, named the experiential and the logical metafunction. The experiential metafunction is visually evident through processes (e.g. action), participants (e.g. the actor or the goal) and circumstantials (e.g. the setting). The logical metafunction is about the complex relations between several processes and participants in the same image. This metafunction has not been discussed much in visual social semiotics, except in relation to moving images (Boeriis, 2009: 194–200; O’Halloran, 2004: 125). The interpersonal metafunction is made up of interpersonal relations between the communicative participants as they are realized in the text. In visual communication, this is realized through different viewpoint and modality systems (Boeriis, 2009). The textual metafunction is the structural meaning of an image in the composition where elements ‘are integrated into a meaningful whole’ (Kress and Van Leeuwen, 2006[1996]: 176). The meaning is realized through systems such as information value, salience, framing, frame dimension and frame shape (Kress and Van Leeuwen, 2006[1996]: 201, O’Halloran, 2004: 120).
The ranking mechanisms presented by Boeriis (2012) are based on the descriptions of grammatical realizations of metafunctional meaning in visual social semiotics. They describe the mechanisms by which elements are made to function at different rank scale levels by joining Components into combined Units or grouping Units into Groups. Figure 2 organizes the ranking mechanisms by their metafunctional origin.

Figure 2. Visual ranking mechanisms based on visual social semiotics.
We assume that it is possible to detect some of these rank-defining mechanisms in the eye-tracking and verbal protocols to be discussed below.
3.2 Cognitive method: image viewing and image description
How do viewers perceive and segment complex images? What units do they identify and at what level of abstraction? What does the temporal and semantic build-up of the visual examination look like? In order to answer these questions, we will use a dynamic, sequential method, combining the analysis of two sources, eye movement data and simultaneous verbal descriptions (Holsanova, 2001). This section presents the method and shows how these two kinds of data can give us distinct hints about the dynamics of the underlying cognitive processes. Our starting point is a complex picture (Figure 3).

Figure 3. Complex picture: the motif comes from a children’s book by Sven Nordqvist (1990). Reproduced by kind permission from Sven Nordqvist.
The visual fixation pattern in Figure 4 illustrates the path of image discovery. It shows: (a) which objects and areas of the visual scene have been fixated by the viewer; (b) in what order; and (c) for how long. The circles indicate the position and duration of the fixations, the diameter of each circle being proportional to the duration of the fixation it represents. The lines connecting fixations represent saccades. The white circle in the lower right-hand corner is a reference point: it represents the size of a one-second fixation. This scanpath comes as a result of 7 seconds of image viewing.

Figure 4. Scanpath: image discovery of one participant during 7 seconds (Holsanova, 2008: 132).
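The drawing convention of the scanpath figure, in which circle size is proportional to fixation duration and a reference circle stands for a one-second fixation, amounts to a simple scaling rule. The sketch below is an illustration only; the function name `scanpath_circles`, the default `reference_diameter` of 40 pixels and the sample coordinates are assumptions, not values taken from the study.

```python
def scanpath_circles(fixations, reference_diameter=40.0):
    """Turn (x, y, duration_in_seconds) fixations into drawing instructions.

    Returns one (order, x, y, diameter) tuple per fixation. A fixation of
    exactly one second gets reference_diameter; others scale proportionally.
    """
    return [
        (order, x, y, reference_diameter * duration)
        for order, (x, y, duration) in enumerate(fixations, start=1)
    ]


# Invented sample: a half-second fixation followed by a one-second fixation.
for circle in scanpath_circles([(100, 200, 0.5), (300, 250, 1.0)]):
    print(circle)
```

The order index preserves the chronological sequence, so that consecutive circles can be connected by lines representing the saccades.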
However, the image content can be perceived and described on different levels and the fixation itself does not indicate what properties of an object in a scene have been acquired. Thus, in order to study the process of meaning-making and visual segmentation, we still need some kind of referential framework, to infer the ideas and thoughts to which these fixations and scanpaths correspond. Therefore, we need to combine visual data with the (simultaneous) verbal descriptions. The verbal description, uttered during the visual examination illustrated above, is shown below:
… in the middle is a tree/
with one … with three birds doing different things//
one is sitting on its eggs/
the other is singing/
and the third female bird is beating a rug or something//
Each line in the transcript represents a new verbal focus expressing the content of active consciousness. Verbal focus is usually a phrase or a short clause, delimited by prosodic and acoustic features: it has one primary accent, a coherent intonation contour, and is usually preceded by a pause or hesitation (Holsanova, 2001: 15ff.). It implies that one new idea is formulated at a time and active information is replaced by other, partially different information at approximately two-second intervals (Chafe, 1994). Several verbal foci are clustered into superfoci (for example, a summarizing superfocus or a list of items in the above example, delimited by lines). A verbal superfocus is a coherent chunk of speech, typically a longer sentence, consisting of several foci connected by the same thematic aspect and having a sentence-final prosodic pattern. Superfoci can be conceived of as thresholds into a new complex unit of thought (Holsanova, 2008: 8ff.).
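The two-level organization of the transcript can be sketched as a nested data structure. The example assumes that a single slash in the transcript closes a verbal focus and a double slash closes a superfocus; this reading of the notation is our assumption for illustration, and the function name `parse_transcript` is hypothetical.

```python
def parse_transcript(text):
    """Split a transcript into superfoci, each a list of verbal foci.

    Assumes '//' closes a superfocus and '/' closes a focus.
    """
    superfoci = []
    for chunk in text.split("//"):
        foci = [focus.strip() for focus in chunk.split("/") if focus.strip()]
        if foci:
            superfoci.append(foci)
    return superfoci


# The verbal description quoted above, as a single string.
transcript = (
    "in the middle is a tree/ "
    "with one … with three birds doing different things// "
    "one is sitting on its eggs/ the other is singing/ "
    "and the third female bird is beating a rug or something//"
)

for number, foci in enumerate(parse_transcript(transcript), start=1):
    print(f"superfocus {number}: {len(foci)} foci")
    for focus in foci:
        print("   ", focus)
```

Under this reading, the description falls into two superfoci: a summarizing one with two foci, followed by a list of three foci, one per bird.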
Both the visual scanpath (Figure 4) and the verbal transcript illustrate the result of 7 seconds of viewing and description. They can be seen as a functionally delimited segmentation unit. If we want to look at the process of image viewing and image description in detail, however, we need to use a different visualization, known as multimodal score sheets (Holsanova, 2001). A multimodal score sheet (Figure 5) enables us to synchronize visual and verbal behaviour, follow and compare the content of the attentional spotlight and extract clusters in the visual and verbal flow. With its help, we can examine the relationship between what is looked at and what is said at a particular point in time.

Figure 5. Multimodal score sheet (Holsanova, 2008: 111).
The score sheet contains two different streams: it shows visual behaviour (objects fixated visually during description on line 1; thin box = short fixation; thick box = long fixation) and verbal behaviour (verbal idea units on line 2), synchronized over time. Simple bars mark the borders of verbal foci (expressing the conscious focus of attention) and double bars mark the borders of verbal superfoci (thematic clusters of foci that form more complex units of thought).
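The synchronization of the two streams can be sketched as an interval-overlap computation. This is a simplified illustration, not the procedure used to produce the score sheets: it lists, for each verbal focus, the objects fixated at any point while that focus was being uttered. The function name `align_streams`, the tuple layout and all timings in the sample are invented for the example.

```python
def align_streams(fixations, foci):
    """Pair each verbal focus with the objects fixated during its utterance.

    fixations: list of (object_name, start_s, end_s)
    foci:      list of (focus_text, start_s, end_s)
    Two intervals count as overlapping when each starts before the other ends.
    """
    sheet = []
    for text, focus_start, focus_end in foci:
        looked_at = [
            name
            for name, fix_start, fix_end in fixations
            if fix_start < focus_end and fix_end > focus_start
        ]
        sheet.append((text, looked_at))
    return sheet


# Invented timings, for illustration only.
visual_stream = [("tree", 0.0, 0.8), ("bird 1", 0.8, 1.5), ("nest", 1.5, 2.4)]
verbal_stream = [
    ("in the middle is a tree", 0.2, 1.0),
    ("one is sitting on its eggs", 1.2, 2.4),
]

for text, objects in align_streams(visual_stream, verbal_stream):
    print(f"{text!r}: {objects}")
```

Strict temporal overlap is a simplifying assumption; an analysis of real data would also have to consider that looking at an object and speaking about it need not coincide exactly in time.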
The scene is viewed, described and interpreted stepwise, in terms of sub-scenes. One portion of the image is in the focus of an attentional spotlight at a time. Thus, verbal data combined with visual data can be used as two windows to the mind. They contain the content of the attentional spotlight and reflect the functional visual segmentation of the image. It is not only detection and recognition of objects that matter but also how the perception process unfolds. When looking at an image and describing it verbally, viewers not only report what they see but also how the image appears to them. In other words, they are involved in perceptual, categorizing and interpreting activities.
To sum up, the combination of visual and verbal data showed that objects were conceptualized on different levels of specificity. We have witnessed a process of stepwise specification, evaluation, interpretation and even reconceptualization of image elements and the image as a whole. Informants started by looking at scene-inherent objects, units and gestalts. As their image viewing progressed, they tended to create mental units independently of the concrete image elements. They made large saccades, picking up information from different locations to support concepts that were distributed across the image. With the increasing cognitive involvement, observers and describers tended to return to certain areas, changed their perspective and reformulated or recategorized the scene. Their perception of the image changed over time. Active mental groupings were created on the basis of similar traits, symmetry and common activity. The process of mental zooming in and out could be documented, whereby concrete objects were refixated and viewed on another level of specificity or with another concept in mind (Holsanova, 2001, 2008, 2011; see examples in 4.2).
4 Empirical Examples
In this section, we will show examples of how the categories from the social semiotic framework can be applied to the complex image that served as a stimulus in the cognitive eye-tracking study (see Figure 3). We will also show examples of units created and delimited by informants during authentic image viewing and image description (Holsanova, 2001, 2008). This section will be concluded by a comparison of the social semiotic and cognitive approaches.
4.1 Application of the social semiotic analytical method to an image
The overall Pettson image (see Figure 3) is easily identified as the Whole since there are no marked variations in modality profile in the picture or any overall graphical framing devices. The image employs several ranking mechanisms and the ranking hierarchy they create is pivotal to the understanding of the image. Salience plays an important role in selecting the most prominent elements in a picture. The prevailing notion of salience in social semiotics is inherited from Kress and Van Leeuwen’s (2006[1996]) work and applied in the salience hierarchy of a (multimodal) text (e.g. Baldry and Thibault, 2006; Bateman, 2008; Boeriis, 2008, 2009). In this tradition, salience is based both on genre-predefined mechanisms and perception-based mechanisms. The four men (Pettsons) are the most salient elements in the hierarchy, closely followed by the cats. The birds and flying insects have a more circumstantial function.
The repetitive representation of a participant with the exact same intensive and possessive attributes identifies him as one and the same person in the same circumstantial setting. This short-circuits the spatial–temporal logic and makes the image a so-called simultanbild. Although complex and at times contradictory, the rank structure seems to quite clearly support an understanding of the image as a temporal progression from left to right: rather static on the left side, while on the right the narrative is more dynamic with the action processes and the compositional complexity.

Figure 6. Examples of ranking mechanisms in the Pettson image.
4.2 Results of the cognitive multimodal analysis
How did the viewers perceive and segment the image? What units did they identify and at what level of abstraction? This section presents results from a dynamic sequential analysis of eye-movement data and simultaneous verbal protocol data (Holsanova, 2001, 2008, 2011). Figure 7 illustrates examples of meaningful units created during the process of image viewing and image description by five viewers.

Figure 7. Results of the cognitive multimodal analysis.
4.3 Comparison between the cognitive and semiotic approach
As hypothesized, the data analysis confirms that semiotic and cognitive approaches give rise to common units based on perceptual organization, compositional characteristics and ideational aspects. We found parallels between compositional grouping and the mechanisms of segregation, separation, proximity and rhyme, and between taxonomic grouping and the mechanisms of relational coincidence, classification and typification. Groupings based on similar traits and common activity were in line with the mechanisms of relational (attributive) coincidence, process involvement and actional coincidence. Finally, the mental zooming in and out documented in the eye-tracking study has been described as local ranking mechanisms at the various zoom steps of the dynamic rank scale, but without attaching importance to the temporal unfolding in the perception.
The interpersonal ranking mechanisms of the semiotic model, however, were not found explicitly in the empirical study. This may be because the image has a homogeneous modality profile and only very subtle variations in viewpoint/perspective. The social semiotic analysis factors in the divergent perspectives as one mechanism among others in the division into the 1+3 structure. Even though a viewer does not explicitly detect the variations, these may have a subliminal impact which can be very difficult to verify with eye-tracking and verbal protocols.
5 Relevance
The results of this explorative interdisciplinary study of visual segmentation are relevant for further investigation of the rank scale in visual social semiotics in general and for multimodal rank scale theory in particular. They are of importance for further investigations of visual segmentation in scene perception and give new perspectives on the discussion of conceptual categories and on the role of bottom-up and top-down processes. The results can also contribute to future research on grammatical/structural functions that may not as yet be empirically verifiable, as well as to research on circumstantial meaning.
The ranking mechanisms and methods presented here can be applied, for instance, to figurative images, photographs, illustrations, graphics and layout on two-dimensional visual canvas. Even theories of moving images may benefit from the dynamic approach. Other modalities such as in three-dimensional or auditory communication, however, will probably employ other systems, rank scales and ranking mechanisms.
The combination of the social semiotic and cognitive approaches is beneficial in several ways. Empirical data can verify or refute theoretical assumptions and categories deduced from visual grammar and perhaps suggest new issues that have not yet been addressed. This may induce considerations about the aptness of certain grammatical categories from which they originally stem.
6 Limitations
This is merely an explorative case study and it is not based on hundreds of viewers or hundreds of images. The Pettson image cannot be considered representative of images in general, and therefore all tendencies revealed by this investigation have to be tested with a whole range of other images and with larger groups of viewers. Also, there is a need for carefully designed controlled experiments that would investigate in a systematic way the role of factors such as individual differences and expertise, the role of the task or instruction and the role of bottom-up and top-down processes for visual segmentation.
The fact that the social semiotic dynamic visual rank scale applied here is only a first tentative proposition for a rank scale is also a limitation. Moreover, the dynamic rank scale needs further discussion and development. Another potential limitation may be that the eye-tracking and verbal protocols are restricted by limitations in the respondents’ knowledge and awareness. The respondents can only express what they have concepts and words for, and implicit factors that may subliminally impact the meaning-making will not be mentioned. This gap may to a certain degree be filled by the combination with the social/grammatical approach. Also, the verbalization process itself has an impact on the perceptual process.
Due to limited space, a number of issues emerging from combining the two approaches are not pursued in this article. Also, a number of nuances which are significant in each approach could not be elaborated in detail because the focus is directed towards the integration of the two perspectives, rather than an individual examination of each approach. This is always a challenge in explorative interdisciplinary studies.
7 Discussion
This study compares the social semiotic model of visual segmentation with eye-tracking studies of image viewing and simultaneous image description. Our main thesis was verified as we found that visual segmentation was traceable in the behaviour of viewers who perceive visuals while creating meaning. We found the concept of the visual zoom applicable in both a social semiotic and a perceptual cognitive approach. We also found closely corresponding segmentation categories (rank scale levels) and ranking mechanisms in the two approaches, as many of the semiotic categories were demonstrated by the eye-tracking and verbal protocols. The shared inspiration from gestalt theory was one clear common denominator, which facilitated the unification of the cognitive and social semiotic approaches to visual segmentation.
The different approaches revealed different perspectives on the same phenomenon and there were of course discrepancies in what was emphasized by the two approaches. Certain distinct aspects that appear important in the empirical study of image perception were not as accentuated in the social semiotic approach (and vice versa). Among these discrepancies were: (1) the temporal aspects of image perception and the dynamics of visual segmentation; (2) the role of individual differences and expertise; (3) the role of the context, task, instruction or goal; (4) the role of an implicit model reader; (5) the role of interpersonal segmentation; (6) the role of paradigmatic relations within system resources; and (7) the understanding of salience. Space constraints prohibit us from further elaboration of these discrepancies, but they would all be very interesting areas for future investigation.
The semiotic versus the reception focus did yield interesting perspectives on each other. From both perspectives – and in unison with both – we found that rank scale segmentation plays a very important role in visual meaning-making. This indicates that the two approaches can indeed support each other, and in combining the two perspectives and methodologies, we came to a better understanding of the dynamics of visual segmentation and the underlying cognitive processes. Even though this is merely a first explorative study in an area that needs much more research, we find it plausible to suggest that taking similar interdisciplinary approaches to this as well as to other multimodal phenomena could be very fruitful.
Acknowledgements
We would like to thank Kay O’Halloran, Anders Björkvall, Roger Johansson and the Eye Tracking Group at Lund Humanities Lab for their comments on previous versions of the manuscript. The work has been supported by the Linnaeus Center for Thinking in Time: Cognition, Communication and Learning (CCL) at Lund University, funded by the Swedish Research Council (grant no. 349-2007-8695). The work has also been supported by the Faculty of Humanities and the Institute of Language and Communication at University of Southern Denmark.
Biographical Notes
MORTEN BOERIIS is an Assistant Professor at the Institute of Language and Communication at the University of Southern Denmark. He specializes in multimodality, visual communication, moving images and business communication, and teaches various courses at the Department of International Business Communication.
Address: Institute of Language and Communication, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark.
JANA HOLSANOVA is Associate Professor in Cognitive Science at Lund University, Senior Researcher in Linné Environment ‘Cognition, Communication, Learning’ and Project Leader at Humanities Laboratory in Lund. She has been elected the Vice Chair/Chair Elect of the Visual Communication Division, International Communication Division (2011–2014). She has been using eye-tracking methodology to study image perception, the interplay between language and images, the role of images for learning, visual thinking and interaction with various media. Her publications include Discourse, Vision and Cognition (Benjamins, 2008) and Myths and Facts about Reading: On the Interplay between Language and Pictures in Various Media (Norstedts, 2010).
Address: Cognitive Science Department, Lund University, Kungshuset, Lundagård S-222 22 Lund, Sweden.
