Abstract
This study reconceptualizes redundancy, complexity, and emotion in terms of cognitive load (specifically as resources allocated and required), then measures the combined real-time impact of these variables on available resources and encoding over the course of an hour-long television news program. Operational definitions of redundancy in the literature were ordered by their theoretically predicted level of resources required, then coded over time. Dynamic measures of audio and video complexity in terms of resources allocated and required were developed and tested. Over the course of the news program, all combinations of independent variables occurred, and the theoretically derived combinations of cognitive load successfully predicted changes in resources available, as measured by Secondary Task Reaction Times (STRTs), and in encoding, as indexed by recognition. The results suggest that defining message variables in terms of dynamic changes in cognitive load can allow us to predict the simultaneous dynamic impact of multiple message variables that contribute to complexity on processing capacity and message processing.
This article attempts to predict the moment to moment variation in memory processing of television messages as a function of the moment to moment variation in the difficulty of the message processing task. To do so, this article will reconceptualize audio/video redundancy, message complexity, and emotion in terms of the human processing system rather than in terms of media or content variables. Specifically, each of these variables will be defined in terms of the over-time change in cognitive load they place on the human’s limited-capacity processing system. The ultimate goal is to develop a measure of complexity that can be used across media platforms, genres, and messages and thereby contribute to the development of a general theory of message processing. Such a measure, when completed, should allow any message to be coded based on how many resources it causes to be automatically allocated and required by the human processing system, which will then be able to predict the thoroughness of message processing.
The development of such a theory requires a rejection of many of the underlying and often unstated assumptions of mass communication research, an instantiation of a new set of assumptions, and a redefinition of basic terms like communication and mass communication. The underpinnings of this approach are explained at length elsewhere (see A. Lang, 2011; A. Lang & Ewoldsen, 2010; A. Lang, Potter, & Bolls, 2009, for discussion) and will be presented more briefly here. First, traditional effects research is media-centric and has an assumption of media determinism (Lowery & DeFleur, 1995). Following from this assumption are definitions of media that are related to their “form” (e.g., radio, TV, print, computer, cell phone, etc.), their “goals” (e.g., persuasion, entertainment, information, education, etc.), or their “content” (e.g., news, pornography, advertising, etc.). The approach used here is evolutionary and human-centric. It argues that humans evolved so as to survive in their environment. Everything in that environment was real. Humans have not evolved since media were invented. Therefore, to some extent, they process media as real environmental stimuli (Reeves & Nass, 1996). Thus, media need to be defined in terms of how they are processed by that evolved human processing system.
Second, traditional approaches tend to ignore time. Media use is lumped into large categories such as amount viewed on average, whole messages, whole media use sessions, before and after. Time and change are central to this approach, which defines communication as the over-time interaction between evolved human brains (Sherry, 2011) and mediated communication as the real-time interaction of at least one evolved-human-brain and a mediated stimulus. Note that communication is not defined in terms of senders and receivers, or the extent to which the receiver and the sender agree on message content, or the prior or later behaviors of the evolved human. Communication is the real-time interaction. It is argued that by understanding and explaining that interaction we will be able to predict what Weber and his colleagues call the “synchronicity” in human brains during media use—that is, the things that most human brains do as a result of automatic (evolved) processing (Weber, Tamborini, Westcott-Baker, & Kantor, 2009). Later research examining how individual differences and environment influence processing will be able to predict the “lack of synchronicity” in human brains during media use. An eventual complete theory would be able to predict a great deal of the variance in the over-time attention to and processing of messages. Success would mean that we would be able to predict which parts of media messages become part of a person’s store of knowledge and understand why, as well as understand what emotions users will experience and attach to that knowledge.
Of course, that goal is far in the future, and this study is only a small step, but long journeys begin with a single step. In this study, we will take what we have learned from more traditional media-centered static research and reconceptualize a set of three important message variables as dynamic and human-centric. We will then predict the local, moment-by-moment level of available processing resources in the evolved brain in response to complex over-time change in the presence and level of these three variables. This approach will allow us to investigate the real-time interactive effect of all combinations of these three message features on viewers’ cognitive load as a result of automatic processing at specific local moments. While all these features have been studied before as static global message variables, only some have been examined in terms of their impact at the moment when they are occurring, and for none of them has an attempt been made to predict and measure their variously combined impact as they naturally occur (singly, in pairs, or all together) in real time.
An examination of research on television news processing yields a plethora of message variables that have been theorized to play a major role in what is remembered from television messages. This study reconceptualizes, from a human-centric dynamic viewpoint, three of the most commonly studied: emotional content, message complexity, and audio/video redundancy. This study will ground those reconceptualizations in the theoretical framework of the limited-capacity model of motivated mediated message processing (a dynamic human-centric theory), which has been developed to explain the dynamic interaction among message variables and message viewers in terms of the controlled and automatic allocation of and requirement for cognitive resources during message processing. This article will briefly lay out the relevant aspects of the model and then use them to reconceptualize each of the independent variables in terms of its automatic contribution to local cognitive load, in order to predict how local cognitive load affects memory processing of the simultaneously occurring information.
Limited Capacity Model of Motivated Mediated Message Processing
Within the framework of the limited capacity model of motivated mediated message processing (A. Lang, 2006a, 2006b), mediated message processing is an ongoing interaction between mediated messages and media users. Mediated messages are conceptualized not in terms of delivery systems (radio, TV, computer, print), but as some combination of dynamic perceptual information. For example, what is usually called television is defined as continuous, variably redundant streams of audio and visual perceptual information that could be delivered over a cell phone, a TV, a computer, or some other box. Media users are conceptualized as active motivated cognitive processors with finite cognitive resources that are allocated through automatic and controlled mechanisms to the processing of the mediated messages. It has been shown that many structural features of media messages (e.g., camera changes, animations, voice changes, sound effects) and personally salient stimuli (e.g., an individual’s name or information relevant to an ongoing goal) elicit orienting responses (ORs) in media users, which in turn elicit an automatic allocation of resources to processing the OR-eliciting stimulus (A. Lang, 1990; Ohman, 1979; Pavlov, 1927). Similarly, motivationally relevant stimuli contained in messages (e.g., food, mates, sex, and predators) also elicit automatic resource allocation (A. Lang, 2006a, 2006b) as a result of the automatic activation of the viewers’ appetitive and aversive motivational systems. These two independent motivational systems are thought to be an evolutionary mechanism that enables an organism to respond automatically and quickly to motivationally relevant stimuli in order to promote survival (Cacioppo, Gardner, & Berntson, 1999; P. J. Lang, Bradley, & Cuthbert, 1997). Motivational activation has two primary characteristics. 
In the absence of motivationally relevant stimuli, the organism is thought to be slightly appetitively activated (called the positivity offset), but in response to the appearance of negative stimuli, the aversive system activates more quickly and vigorously (called the negativity bias). Both emotional content (either positive or negative) and the intensity of emotional content (the level of arousing content) of a mediated message are theorized to affect motivational activation and thereby influence resource allocation. The more the motivational system (either appetitive or aversive) is activated by arousing content, the more resources are allocated to the message. Thus, arousing content in either a positive or negative message elicits motivational activation; the initial level and rate of increase of that activation, however, differ depending on whether the message is positive or negative, due to positivity offset and negativity bias (A. Lang, 2006a, 2006b).
The level of resources actually allocated to processing a message, called resources allocated, is, as a result, automatically influenced by the number of OR eliciting structural and content features and by the presence of personally and motivationally relevant stimuli in the message. In contrast, the level of resources needed to process a message, called resources required, is influenced by the interaction of the human with message content features such as difficulty, familiarity, and arousing content, and so forth. The interaction is important since the same message feature may require various amounts of resources for different humans as a result of individual differences in knowledge, experience, temperament, and so forth. However, some message features have more individual variation (e.g., difficulty or familiarity) while others produce less (e.g., camera changes, violent or sexual images). The difference between resources allocated and resources required is called the level of available resources, which should predict the thoroughness of message processing (A. Lang, 2006b; A. Lang & Basil, 1998). Message processing is good when resources allocated is greater than or equal to the resources required, resulting in a sufficient number of available resources. In contrast, if resources required exceeds resources allocated, the resultant insufficient available resources predicts less thorough message processing (A. Lang, 2000, 2006a, 2006b; A. Lang & Basil, 1998).
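The relation among resources allocated, resources required, and available resources described above can be illustrated with a minimal Python sketch. The function names and the arbitrary numeric units are assumptions made for illustration only; the model specifies relative levels, not a measurement scale.

```python
def available_resources(allocated: float, required: float) -> float:
    """Available resources as the difference between allocated and required.

    Units are arbitrary and hypothetical; only the sign and relative
    size of the result are meaningful in this sketch.
    """
    return allocated - required


def processing_prediction(allocated: float, required: float) -> str:
    """Predicted thoroughness of processing for one moment of a message."""
    # Allocated >= required means sufficient available resources.
    if available_resources(allocated, required) >= 0:
        return "thorough"
    # Required exceeds allocated: processing is predicted to be less thorough.
    return "less thorough"
```

In this sketch, a moment with equal allocated and required resources still counts as sufficient, matching the "greater than or equal to" wording above.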
This concept of available resources is central to the reconceptualization we are undertaking in this study because we are making the argument that eventually we will be able to conceptualize any message variable in terms of how many resources it causes to be automatically allocated to processing and how many resources it requires in order to be processed, that is, in terms of dynamic change in available resources. Thus, widely different types of message variables can all be placed onto the same conceptual scale and, as a result, their combined influence on over-time processing can be investigated and eventually understood. The approach demonstrated here argues that a variety of message variables continuously change over time in any mediated message. As a result of this real-time combination and recombination, available resources change continuously over time. As we learn how various message features impact resource allocation and requirements, we can develop resource-based conceptual definitions of more and more message variables, allowing us to compare multiple message variables on the same processing-based ruler. If we know whether a specific message feature elicits the automatic allocation of resources and the extent to which it requires those resources to be processed, then we can, theoretically, calculate the available resources at any given point in time of the message and, based on that number, predict how well the concurrently presented information will be processed and remembered. This study, in an attempt to test this prediction, uses three message variables about which we have enough data to begin to form such resource-based conceptual definitions. Those three variables are emotional content, structural complexity, and audio/video redundancy.
Reconceptualizing Emotional Content as Available Resources
Previous findings support the notion that arousing compared to calm content both in still images (Bradley, Greenwald, Perry, & Lang, 1992) and TV messages (A. Lang, Bolls, Potter, & Kawahara, 1999; A. Lang, Dhillon, & Dong, 1995; A. Lang, Newhagen, & Reeves, 1996) requires more resources to process. Recently A. Lang et al. (2007) used the limited capacity model described above to examine the effects of emotional content on available resources. Positive and negative emotional content were defined as motivationally relevant material that would automatically activate the appropriate motivational system, leading to automatic allocation of resources to encoding and storing the motivationally relevant content. It was predicted that as the level of arousing content increased more resources would be automatically allocated to processing the messages. At the same time, it was expected that much of the additional allocation of resources would be required to process the motivationally relevant material. However, because of the positivity offset it was predicted that calm positive messages would receive a greater allocation of resources compared to calm negative and therefore have relatively more available resources. As a result of the negativity bias moderately arousing negative compared to positive messages were predicted to receive more resources and therefore have more available resources. Results showed that as the level of arousing content increased, resources required increased for both positive and negative messages. In addition, resources allocated was greater for calm positive compared to calm negative messages and increased more sharply, as a function of arousing content, for negative compared to positive messages (A. Lang, Park, Sanders-Jackson, Wilson, & Wang, 2007). Therefore, it is proposed that emotional content can be reconceptualized as changes in available resources as follows. 
Increasing arousing content will lead to increases in both allocated and required resources. There will be relatively more available resources during calm positive compared to calm negative periods of time, and during moderately arousing negative compared to moderately arousing positive periods of time.
Reconceptualizing Message Structural Complexity as Available Resources
Message structural complexity in mass communication research has been operationalized in many ways—as structure, as content, and as structure and content combined. However, it has been commonly defined at the global message level by coding the relative level of various characteristics such as difficulty or pacing (Anand & Sternthal, 1990; A. Lang et al., 1999; Thorson, Reeves, & Schleuder, 1985; Watt & Welch, 1983). In this study, the presence of a set of message characteristics known to influence real-time resource allocation or requirements will be coded at the intramessage local level. This will allow an over-time summation of predicted resources allocated and required and provide a predicted level of available resources for each point in time. The fewer resources available at a given point in time the more complex the message. Thus, the predicted level of available resources (as a result of automatic resource allocation and automatic resource requirements) at a given point in time can be used as an indicator of message complexity, assuming controlled resource allocation is held steady. When more resources are required to do a task than are allocated to or are present in the cognitive system, that is when available resources are predicted to be negative, it is often called cognitive overload. However, this may be a somewhat misleading description. When insufficient resources are allocated to a task like watching television people do not stop watching or become incapable of understanding the message. Rather, one or more aspects of processing (encoding, storage, or retrieval) are performed more poorly. Indeed, the decrease in performance may not be completely transparent to the viewer but careful measurement can detect these decrements.
In this study, we will use a specific set of structural features that have been demonstrated to impact available resources. Recent work within the model used here (A. Lang et al., 2007; A. Lang, Bradley, Park, Shin, & Chung, 2006) provides us with a set of common television message features that we know elicit automatic resource allocation and a set of characteristics that automatically require resources. Measuring the two over a given period of time provides a relative level of estimated available resources. As discussed above, the elicitation of an orienting response (OR) leads to the automatic allocation of resources to encoding. Thus, the developed measure of automatic resources allocated is a count (over some period of time) of the number of Orienting Eliciting Structural Features, called OESFs. The measure of resources required is derived by coding the amount of information introduced (ii) by each OESF on a number of defined dimensions (depending on the OESF’s sensory characteristics, such as audio, video, or audio/video). Previous studies have examined the validity of these measures (A. Lang et al., 2006, 2007). In one study, resources allocated and required were manipulated at the message level. The number of a particular type of OESF, the camera change, was manipulated to produce two levels of resources allocated to a message (low and high). Similarly, the amount of information introduced at each level of resources allocated was manipulated to create three levels of resources required (low, medium, and high). Results showed that at a given level of resources allocated, viewers had fewer available resources as resources required increased (A. Lang et al., 2006). The result was also replicated across messages that varied in their emotional content (A. Lang et al., 2007).
Of particular interest here is that previous tests of this theory have looked at resource allocation and resource requirements either as a global message variable (that is, as average resources allocated and resources required per second for an entire message) or as static local variables (that is, resources allocated and resources required at a particular point in time), whereas the actual level of available resources is thought to be continuously changing in response to the continuous change in a message’s structural and content features. This study will therefore attempt to modify these measures so that they can be used to produce a dynamic measure of available resources. It will further modify the measures by including the additional resources allocated and required as a function of audio/video redundancy and emotion at a given moment. Thus, the new dynamic measure of available resources is based not only on local audio and video OESFs and ii (as in previous work by A. Lang et al.) but also on local audio/video redundancy and emotion.
Reconceptualizing Audio/Video Redundancy as Available Resources
Psychologists have a long history of investigating processing efficiency when congruent and incongruent information is conveyed through multiple modalities positing that information processing becomes more difficult when congruency declines (Alais, Morrone, & Burr, 2006; Bonnel & Hafter, 1998; Cocchini, Logie, Sala, & MacPherson, 2002; Jolicoeur, 1999; Penney, 1989). Similarly, there is a relatively long history of studying audio/video redundancy in communication research. In general, audio/video redundancy has been defined as a categorical message level variable. Most theoretical approaches have focused either on how redundancy (either through dual coding, increased salience, or synchronicity) increases memory for messages or on how lack of redundancy (either through conflict, distraction, or cognitive overload) decreases memory for messages (A. Lang, 1995; Basil, 1992; Drew & Grimes, 1987; Grimes, 1990, 1991). Definitions of audio/video redundancy have varied from the number of physical channels to the congruency of the information in the two channels. Those examining number of channels were often designed to compare different media (i.e., print vs. TV and radio vs. TV) where print and radio were considered less redundant and television more redundant (Drew & Grimes, 1987; Hsia & Jester, 1968; Rolandelli, Wright, Huston, & Eakins, 1991; Severin, 1967). Other studies examined meaning relatedness of the information in the two channels. While some defined audio/video redundancy as an exact match between audio and video channels (Hsia & Jester, 1968; Severin, 1967), complete redundancy was difficult to achieve because the audio and video use different symbol systems (Graber, 1990). 
Instead, audio/video redundancy was widely defined as the degree of semantic relatedness between audio and video, which ranged from redundant, where the audio and video share facilitative, not contradictory, information, to nonredundant, which specifies the presence of actually conflicting information in the audio and video channels.
In general, the results of this research were somewhat mixed, with some studies suggesting that redundant information was easier to process and others that it was more difficult. A. Lang (1995) undertook a review of the literature and suggested that many of the problems were due to a lack of consistency across studies in controlling message complexity, definitions of redundancy, and measurements of processing efficiency. She suggested that audio/video redundancy likely influenced the resources required to process messages, with more congruent messages requiring fewer resources and conflicting messages requiring more. In a reanalysis of the literature, she provided support for this notion. This study goes beyond the message-level prediction and analysis in A. Lang’s (1995) study to examine how over-time variation in audio/video redundancy should influence resources required by the message from moment to moment. It is predicted that moments of time with less redundancy will require more resources than moments of time with more redundancy. Further, the level of audio and video complexity (that is, the over-time change in available resources as a function of OESFs and information introduced) will determine the level of available resources at the same point in time. Thus, the combination of the complexity and redundancy measures will predict the available resources for a period of time. In this study, a new measure called Dynamic Audio/Video Redundancy and Complexity (DAVRC) attempts to do just that.
Dynamic Audio/Video Redundancy and Complexity (DAVRC) Measure
As discussed above, DAVRC is a modification of the already developed video (A. Lang et al., 2006, 2007) and audio (Potter et al., 2006; Potter, Lang, & Bolls, 2008) measures designed to produce a predicted relative level of available resources. To make these measures dynamic, they need codable periods of time longer than the moment following an OESF and shorter than a whole message. Further, the measure must be manageable: coding every 4 or 5 seconds of a long message for all possible audio and video OESFs and types of information introduced would render the measure unusable. Therefore, DAVRC is calculated as a ratio of a subset of the audio and video features known to alter resources required and allocated, along with emotion and audio/video redundancy. It is predicted that as available resources decrease, memory for information in the message will decrease.
In this study, DAVRC was coded for each 4-second period. The 4-second window was chosen on logical and empirical grounds. First, logically, even though available resources can change over milliseconds, the impact of an OESF has been shown to last about 2 seconds, and other message variables change even less quickly. For example, verbal content presents only 1 to 2 words per second, and structural features, even in fast-paced messages, rarely occur more frequently than one every 2 seconds or so. For this reason, time periods shorter than 2 seconds would be unlikely ever to have more than one structural or content feature, particularly for audio coding. Second, the choice was tested empirically by coding 2-, 3-, 4-, and 5-second periods of the same news stories and comparing the results. As expected, the shorter time periods (2 and 3 seconds) produced very little variability across time segments. It was determined that 4-second periods provided ample variability in DAVRC while still producing a dynamic measure. Further, 4 seconds is long enough to contain enough audio content to create segment-specific targets for the audio recognition test. DAVRC codes separately for audio and video resource availability (i.e., OESFs and information introduced) and audio/video redundancy resource availability. Available resources for each segment are then estimated by the ratio of resources allocated (as a function of OESFs) to resources required (as a function of information introduced and audio/video redundancy) within the 4-second segment. In this way, it is possible to produce a time series of 4-second intervals of predicted available resources over the course of a television message (A. Lang et al., 2006, 2007).
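As a rough illustration of how such a time series might be assembled, the following Python sketch bins OESF onsets into 4-second segments and takes the ratio of allocated to required resources for each segment. The function names, the simple additive combination of information introduced and a numeric redundancy cost, and the handling of a zero denominator are all assumptions made for illustration. They are not the coding scheme used in this study, and emotion is omitted for brevity.

```python
from collections import Counter

SEGMENT_SECONDS = 4  # window length reported in the text


def oesf_counts(onsets_sec, n_segments):
    """Count OESF onsets (in seconds) falling in each 4-second segment."""
    counts = Counter(int(t // SEGMENT_SECONDS) for t in onsets_sec)
    # Onsets beyond the last coded segment are ignored.
    return [counts.get(i, 0) for i in range(n_segments)]


def predicted_available(allocated, required):
    """Ratio of resources allocated to resources required for one segment."""
    if required <= 0:
        return None  # ratio undefined; this handling is an assumption
    return allocated / required


def davrc_series(onsets_sec, info_introduced, redundancy_cost):
    """Per-segment predicted available resources.

    info_introduced and redundancy_cost hold one hypothetical numeric
    code per segment; summing them is an illustrative assumption.
    """
    n_segments = len(info_introduced)
    allocated = oesf_counts(onsets_sec, n_segments)
    return [
        predicted_available(a, ii + r)
        for a, ii, r in zip(allocated, info_introduced, redundancy_cost)
    ]
```

Lower values in the returned list correspond to segments predicted to have fewer available resources, that is, to be more complex in the sense used here.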
Measuring Available Resources and Memory/Encoding
This study uses Secondary Task Reaction Times (STRTs) to index available resources and recognition as indicators of thoroughness of encoding/processing. STRT has been shown to be a reliable indicator of available resources (A. Lang et al., 2006, 2007; A. Lang & Basil, 1998). In an experiment using the STRT paradigm, participants are instructed to pay close attention to a message for a later memory test on the information contained in the message (Basil, 1994). This is the primary task given to the participants. At the same time, they are told to be alert for a signal and to press a button as fast as possible after perceiving the signal. This is the secondary task. STRTs, in the television environment, have been shown to vary as a function of both resources allocated and resources required (A. Lang et al., 2006). In general, as with simpler, more static psychological stimuli, the faster a viewer presses the button after the signal, the more resources are thought to be available in the viewer to perform the secondary task, either because more resources are allocated to or because fewer resources are required by the primary task (A. Lang & Basil, 1998) but in any case because there are sufficient available resources to perform both the primary and secondary tasks well. However, in the television environment, it has been shown that if resources allocated (i.e., number of OESFs) is held steady while resources required (i.e., information introduced) is increased, STRTs become slower up to a point after which they become very fast and stay fast. It has been argued that this is the point where resources required exceeds resources allocated (A. Lang et al., 2006).
To distinguish the two instances in which STRTs are fast (plenty of available resources and insufficient available resources) requires a measure of task performance (A. Lang et al., 2006; A. Lang & Basil, 1998). When there are sufficient available resources, the task of encoding the news should be performed well, producing good recognition, but when there are insufficient available resources, it should be performed less well, producing poor recognition (A. Lang, Geiger, Strickwerda, & Sumner, 1993; A. Lang, Potter, & Bolls, 1999). Thus, this study predicts that as available resources decrease, STRTs will become slower and recognition will initially increase, then remain relatively stable up to the point where resources required exceed resources allocated, at which point STRTs will become fast and recognition will decrease.
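The logic for separating the two fast-STRT cases can be sketched in Python as follows. The function name and both cutoff parameters are hypothetical; the text specifies only relative comparisons (faster or slower, better or worse recognition), so any numeric thresholds would have to be chosen by the analyst.

```python
def interpret_segment(strt_ms, recognition, fast_below_ms, good_at_or_above):
    """Classify one segment from its STRT and recognition score.

    fast_below_ms and good_at_or_above are hypothetical cutoffs
    supplied by the caller; they are not values from this study.
    """
    fast = strt_ms < fast_below_ms
    good = recognition >= good_at_or_above
    if fast and good:
        return "sufficient available resources"
    if fast and not good:
        return "resources required exceed resources allocated"
    if not fast and good:
        return "reduced but still sufficient available resources"
    # Slow STRT with poor recognition is not covered by the predictions above.
    return "no prediction"
```

The point of the sketch is that STRT alone cannot separate the first two outcomes; the recognition score is what distinguishes them.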
Processing Dynamic Audio/Video Redundancy, Complexity, and Emotion
In this section of the article, we will build our complex dynamic prediction of available resources and recognition one variable at a time. First let us consider the unique contribution of audio/video redundancy to the equation. As discussed above, most studies have defined audio/video redundancy as a continuum from congruent to conflicting meaning relatedness. In addition, a number of studies have included a “talking head” condition, often as a control. DAVRC’s coded categories of audio/video redundancy were developed to match this history. Three of the categories are based on the meaning relatedness between the audio and video. Called semantic, thematic, and conflicting, they range from the most to the least related. Three levels were chosen because the basic available resources/processing relationship is conceived of as curvilinear (that is, processing should improve until available resources are insufficient and then get worse). Two points cannot show a curvilinear relationship. In semantic redundancy, the video content must correspond closely to what is heard in the audio content. In thematic redundancy, the video and the audio are about the same topic but they do not actually correspond. The difference between semantic redundancy and thematic redundancy lies in how closely the audio and the video are coupled in the story, both in terms of identification of specific objects, as well as description of events and actions taking place in both audio and video. In conflicting redundancy, there is no correspondence between the audio and video. The fourth category coded, because it is a qualitatively different presentation format, is talking heads. Talking heads have both video (i.e., an image of a speaker in a studio) and audio (i.e., the words of the speaker) but cannot be called semantically or thematically redundant given the above definitions. 
As well, it is difficult to call them conflicting because seeing someone talking about something else is “a natural phenomenon” that people are used to (A. Lang, 1995). A great deal of research in psychology suggests that people are efficient processors of talking heads both because faces are biologically and socially significant visual stimuli that are processed automatically and because they provide additional information related to the presenter’s traits (Glenberg & Grimes, 1995; Lynn, Shavitt, & Ostrom, 1985; Palermo & Rhodes, 2007; Pryor & Ostrom, 1981). Further, given that we evolved to use language, which enables people to communicate about things not present in the immediate environment, this should be a well-developed skill. Thus, from fewest to most required resources, DAVRC codes four levels, talking heads, semantic, thematic, and conflicting. If available resources were held constant, we would expect STRTs to be fast for talking heads and then to get slower up to the point where available resources become insufficient (which may or may not be included in our range of stimuli), when they should become significantly faster. Insufficient available resources would be demonstrated by a significant decrease in task performance (i.e., recognition). Thus, we predict a main effect of audio/video redundancy (collapsed across complexity) such that:
Hypothesis 1a (H1a): STRTs will increase as audio/video redundancy becomes more resource intensive such that STRTs will be fastest for segments containing talking heads, followed in order by those containing semantic, thematic, and conflicting redundancy, unless resources required exceed resources allocated, in which case STRTs will again become fast.
Hypothesis 1b (H1b): Recognition will remain relatively stable or decrease slowly as audio/video redundancy becomes more resource intensive, unless resources required exceed resources allocated, in which case recognition will drop precipitously.
Of course, complexity does matter, so here we add in complexity’s contribution. Recall that complexity is also defined by the relative level of available resources: fewer available resources mean greater complexity. Thus, because lower levels of redundancy reduce the resources available for processing the information contained in a message segment, the impact of redundancy on available resources will be greater for high-complexity conditions than for low-complexity conditions. Specifically:
Hypothesis 2 (H2): STRTs will get slower and perhaps reach the point of insufficient available resources and fast STRTs at a more redundant level of audio/video redundancy during high-complexity segments (low available resources) compared to low-complexity segments (high available resources).
Finally, we need to add to our complex dynamic determination of available resources the contribution of positive and negative emotional content. Recall that we predicted (based on the concepts of positivity offset and negativity bias) that there would be more available resources for calm positive and moderately arousing negative messages. Thus, decreasing redundancy (and the resultant increase in resources required) will reduce available resources faster during calm negative compared to calm positive messages and during arousing positive compared to arousing negative messages. Specifically:
Hypothesis 3 (H3): STRTs will get slower at a lower level of redundancy for negative messages compared to positive messages at the calm content level, and for positive messages compared to negative messages at the arousing content level.
The final step is to examine the combined contribution of all three variables. Here we would expect that there would be the most available resources during low-complexity calm positive and moderately arousing negative messages and the fewest during high-complexity calm negative and arousing positive messages. Decreasing redundancy should, therefore, reduce available resources faster for high-complexity segments during calm negative and arousing positive messages compared to low-complexity segments during calm positive and moderately arousing negative messages. Specifically:
Hypothesis 4 (H4): STRTs will get slower and recognition will decrease as a function of decreasing audio/video redundancy and the point of faster STRTs and poor recognition performance will occur soonest for high-complexity conditions during calm negative and arousing positive messages and occur the latest for low-complexity conditions during calm positive and arousing negative messages.
Method
Design and Stimuli
The study had a mixed and nested experimental design in which within-segment factors were nested in message factors. The within-segment design was a fully within-subjects 4 (Audio/Video Redundancy) × 2 (Audio Complexity) × 2 (Video Complexity) × 4 (Segment Repetition) design. Redundancy had four levels: semantic, thematic, and conflicting redundancy, and talking heads. Both audio complexity and video complexity had two levels: low and high. These three factors were fully crossed, yielding 16 segment types; each type had four segment repetitions, resulting in 64 segments. The message-level design was a mixed 2 (Valence) × 2 (Arousing Content) × 4 (Message Repetition) × 6 (Presentation Order) design. Valence had two levels: positive and negative. Arousing content also had two levels: calm and moderately arousing, called arousing in the remainder of the article. These two message factors were completely crossed, resulting in four types of emotional content: calm positive, calm negative, arousing positive, and arousing negative. There were four arousing negative messages and five messages of each of the other types of emotional content. Message repetition varied because the design of the study required each type of emotional content to contain the 64 segments satisfying the within-segment factors. Presentation order was the only between-subjects factor in the experiment. Six semirandom presentation orders were created to control possible order effects and to preserve the external validity of the news program. Messages were blocked based on their emotional content, and the messages were then randomized within each block. In the completed stimulus, the first block was always the arousing negative content, followed by either the calm negative or calm positive content, and finishing with the arousing positive content. That is, if it bleeds it leads, followed by less and less exciting news, and finishing up with human interest (Kerbel, 2000).
The complete news program stimuli were created through a four-stage process. First, national and local television news programs were recorded between August 2008 and February 2009 and then edited into individual news messages. Second, the messages were coded using the DAVRC measures to capture the type of redundancy and the level of audio and video complexity of continuous 4-second segments of the messages, and potential stimulus messages were selected based on the DAVRC coding and on the messages’ emotional content as assessed by the researcher. Third, a pretest was conducted to confirm the overall emotional content level of the selected messages. Finally, 19 messages were selected for the final stimuli based on DAVRC coding and pretest emotional ratings. The messages were edited into six versions of a news program using Final Cut Pro editing software. The number of experimental segments in a message ranged from 1 to 19. The news program for the experiment was 1 hour, 4 minutes long. The total length of the 256 four-second experimental segments of the news program was 17 minutes, 4 seconds.
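The segment counts and running time reported in this section follow directly from the crossed design. The sketch below is only an illustrative check of that arithmetic; the variable names are ours and are not part of the study's materials.

```python
from itertools import product

# Within-segment factors as described in the Design and Stimuli section.
REDUNDANCY = ["talking heads", "semantic", "thematic", "conflicting"]
AUDIO_COMPLEXITY = ["low", "high"]
VIDEO_COMPLEXITY = ["low", "high"]
SEGMENT_REPETITIONS = 4

# Message-level emotional content (arousing content x valence).
EMOTIONAL_CONTENT = list(product(["calm", "arousing"], ["positive", "negative"]))

SEGMENT_SECONDS = 4

# Fully cross the three within-segment factors.
segment_types = list(product(REDUNDANCY, AUDIO_COMPLEXITY, VIDEO_COMPLEXITY))
segments_per_content = len(segment_types) * SEGMENT_REPETITIONS
total_segments = segments_per_content * len(EMOTIONAL_CONTENT)
minutes, seconds = divmod(total_segments * SEGMENT_SECONDS, 60)

print(len(segment_types))    # 16 segment types
print(segments_per_content)  # 64 segments per type of emotional content
print(total_segments)        # 256 experimental segments
print(minutes, seconds)      # 17 4
```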
Independent Variables
Audio/video redundancy, audio complexity, and video complexity
DAVRC was used to code the available resources for each 4-second period of the entire stimulus tape. Intercoder reliability for the overall DAVRC coding, including the ratios computed for each 4-second period, was 0.90 (Krippendorff’s α). Available resources, measured with modified versions of Lang and Potter’s measures of audio and video orienting eliciting features and information introduced, and the resources required by variation in audio/video redundancy were coded separately as follows.
Audio/video redundancy was defined based on the semantic relationship between audio and video in a 4-second segment. Semantic redundancy was defined as present when some of the video displayed specific elements of what was talked about in the audio during the time period. Thematic redundancy was present when the audio and video were about the same topic but what was shown in the video did not contain actual elements of what was talked about in the audio during the time period. Conflicting redundancy was present when the audio and video did not appear to be related. Finally, talking heads was defined as the appearance of a news reporter or anchor as the main object on screen, speaking or reading a news story with no elements of the story being shown on the screen. Thus, the video contained only news personnel in a studio or on location at a time when the event being reported was not shown. Intercoder reliability for dynamic audio/video redundancy alone was high (Krippendorff’s α = 0.95).
Audio available resources/complexity was defined as the cognitive load required to process the audio structural features in a segment and was computed as the ratio of resources required to resources allocated, where resources allocated quantifies the elicitation of orienting responses and resources required quantifies the information introduced by audio features. Audio structural features were coded on five dimensions: (1) the number of onsets of voices, music, sounds, and special effects; (2) the number of human voices; (3) background music; (4) natural sound effects; and (5) special sound effects. Dimension 1 indexes resources allocated because it counts the number of orienting eliciting audio structural features, which result in automatic resource allocation. Dimensions 2 to 5 index resources required because they represent audio information that requires cognitive resources to process. Audio complexity is indexed by the ratio of resources required (the summed value of dimensions 2 to 5) to resources allocated (dimension 1). The equation used was (D2 + D3 + D4 + D5 + 1) / (D1 + 1). To avoid division by 0, 1 was added to both the numerator and the denominator. Audio complexity ranged from 0.667 to 4. Low audio complexity was defined as 0.667 to 1.66 and high as 2 to 4. Intercoder reliability for dynamic audio complexity was 0.86 (Krippendorff’s α).
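As a minimal sketch of the audio complexity ratio just described, the Python below computes the index for a single 4-second segment. The function and parameter names are hypothetical, the example codes are invented for illustration, and the single cutoff at 2 is our assumption: the text reports only the observed low range (0.667 to 1.66) and high range (2 to 4).

```python
def audio_complexity(onsets, voices, music, natural_fx, special_fx):
    """(D2 + D3 + D4 + D5 + 1) / (D1 + 1) for one 4-second segment.

    onsets is dimension 1 (resources allocated); the remaining
    arguments are dimensions 2 to 5 (resources required).
    """
    required = voices + music + natural_fx + special_fx
    return (required + 1) / (onsets + 1)


def audio_level(ratio):
    """Assumed binning: ratios below 2 are low, ratios of 2 or more are high."""
    return "low" if ratio < 2 else "high"


# Hypothetical segment: two audio onsets and one human voice.
quiet = audio_complexity(onsets=2, voices=1, music=0, natural_fx=0, special_fx=0)
# Hypothetical segment: no onsets, two voices over background music.
busy = audio_complexity(onsets=0, voices=2, music=1, natural_fx=0, special_fx=0)

print(round(quiet, 3), audio_level(quiet))  # 0.667 low
print(busy, audio_level(busy))              # 4.0 high
```

Because the onset count sits in the denominator, a segment with many orienting eliciting onsets but little new information scores low, which matches the reading of the index as resources required relative to resources allocated.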
Video available resources/complexity was also computed as the ratio of resources required to resources allocated, where resources allocated quantifies the elicitation of orienting responses and resources required quantifies the amount of information introduced by visual features. The visual structural features of each segment were coded on six dimensions: (1) the number of camera changes; (2) the number of focal object changes; (3) the number of objects; (4) visual occurrences of color video footage, still pictures, text, animation, computer graphics, black/white video footage, and so forth; (5) object movement; and (6) the number of camera movements (i.e., pan, dolly, and zoom). Dimension 1 was used to index resources allocated because it counts the number of orienting eliciting features, which elicit automatic resource allocation. Dimensions 2 to 6 were used to index resources required because they represent information introduced, which then requires cognitive resources to process. Video complexity was then indexed by the ratio of the summed value of dimensions 2 to 6 to dimension 1. The equation used was (D2 + D3 + D4 + D5 + D6 + 1) / (D1 + 1). Again, 1 was added to both the numerator and the denominator to avoid division by 0. Video complexity ranged from 1 to 11. Low video complexity was defined as 1 to 3.75 and high as 4.4 to 11. Intercoder reliability for video complexity was 0.90 (Krippendorff’s α).
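The video index can be sketched the same way. In the illustration below, the function names and the example codes are hypothetical, and returning None for ratios that fall between the reported low range (1 to 3.75) and high range (4.4 to 11) is our assumption, since no such values are reported.

```python
def video_complexity(camera_changes, focal_changes, objects,
                     visual_types, object_movement, camera_movements):
    """(D2 + D3 + D4 + D5 + D6 + 1) / (D1 + 1) for one 4-second segment.

    camera_changes is dimension 1 (resources allocated); the other
    arguments are dimensions 2 to 6 (resources required).
    """
    required = (focal_changes + objects + visual_types
                + object_movement + camera_movements)
    return (required + 1) / (camera_changes + 1)


def video_level(ratio):
    """Bin by the reported ranges; None marks the unobserved gap (assumption)."""
    if ratio <= 3.75:
        return "low"
    if ratio >= 4.4:
        return "high"
    return None


# Adding 1 to both terms keeps the ratio defined when nothing is coded.
print(video_complexity(0, 0, 0, 0, 0, 0))  # 1.0

# Hypothetical segments with no camera changes versus three camera changes.
print(video_complexity(0, 2, 4, 2, 1, 1))  # 11.0
print(video_complexity(3, 2, 5, 3, 2, 2))  # 3.75
```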
Valence and Arousing Content
The valence and arousing content variables were message factors determined through a priori consideration of the emotional content of messages and then confirmed by pretest self-report ratings. First, the researcher selected messages based on a theoretical and practical understanding of emotion and motivation. Messages were considered positive if they were about individual or public success, or creating a public or individual good. Negative messages were about a failure, threat, or something bad for the public or an individual. Arousing content was present when messages contained a great deal of excitement, energy, or strong displays of emotion whereas calm content was present when messages lacked energy, excitement, and strong displays of emotion. Next, 63 students (Mage = 19.95, 37 male) participated in the pretest in exchange for course credit. The participants were asked to continuously rate how positive, negative, or aroused they felt while viewing each message using the continuous response measure technique (CRM; Biocca, David, & West, 1994). The CRM was implemented using an online-rating function of MediaLab (Jarvis, 2008b). Each message was viewed by a participant only once. The messages and ratings were assigned to participants in such a way that each message was rated on all the three CRM scales by an equal number of participants. After viewing each message, the participants were also asked to rate on a 9-point scale how positive, negative, and aroused the message made the participant feel overall.
To confirm the emotional content at the message level, the overall positivity, negativity, and arousal ratings were each submitted to a mixed 2 (valence) × 2 (arousing content) ANOVA. Results showed that positive messages were rated more positively than negative ones, F(1, 62) = 357.50, p < .001, η² = 0.85, whereas negative messages were rated more negatively than positive ones, F(1, 62) = 474.83, p < .001, η² = 0.89. Arousing messages were rated as more arousing than calm ones, F(1, 62) = 51.20, p < .001, η² = 0.45. These results are shown in Table 1. Medians of the CRM data were also plotted against time to ensure that all the experimental 4-second segments occurred within parts of the message that met the requirements for valence and arousing content.
Table 1. Mean Ratings of Positivity, Negativity, and Arousal.
Dependent Variables
STRTs
To index available resources, STRTs were measured by recording the time in milliseconds from the onset of an STRT probe to the moment the participant responded to the probe by pressing a key on the computer keyboard. STRT probes were administered as a 200-millisecond audio tone at a frequency of 1,000 Hz on a laptop computer using MediaLab. The probes were inserted during the 3rd or 4th second of each experimental segment of the final news program to ensure that they were influenced primarily by the contents of the current 4-second segment rather than the preceding segment. A 2-second window was used so that the probes could be placed at least 1 second before or after any orienting eliciting structural feature that occurred in the segment.
To reduce possible fatigue effects, participants were assigned to one of two STRT orders created for the same news program: order 1 had probes occurring only during odd-numbered segments, and order 2 had them occurring only during even-numbered segments. Each participant heard and responded to a total of 128 STRT probes. Order 1 and order 2 participants were paired to create one complete set of data (256 STRT probes). Data were collected from 144 participants in each order. Due to equipment failure, complete data were collected from 138 participants in order 1 and 139 in order 2, resulting in 138 complete sets of data. Analyses were run on the STRT reciprocals to reduce the effect of possible outliers (Ratcliff, 1993). Results were then converted back to milliseconds for interpretability.
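To illustrate the probe split and the reciprocal transformation, the sketch below divides 256 segments into the two probe orders and averages a few invented reaction times in the reciprocal domain before converting back to milliseconds. It shows only the transformation of a mean, not the full analysis reported here; the function name and the reaction times are hypothetical.

```python
from statistics import mean

# Order 1 is probed on odd-numbered segments, order 2 on even-numbered ones.
segments = range(1, 257)
order1 = [s for s in segments if s % 2 == 1]
order2 = [s for s in segments if s % 2 == 0]
print(len(order1), len(order2))  # 128 128


def mean_strt_via_reciprocals(strts_ms):
    """Average the reciprocals (speeds), then convert back to milliseconds."""
    speeds = [1.0 / rt for rt in strts_ms]
    return 1.0 / mean(speeds)


# Hypothetical reaction times with one slow outlier.
strts = [400.0, 500.0, 2000.0]
print(round(mean(strts), 1))                       # 966.7
print(round(mean_strt_via_reciprocals(strts), 1))  # 600.0
```

The outlier pulls the arithmetic mean far above the typical response, whereas the back-transformed mean stays close to it, which is the motivation for analyzing reciprocals.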
Recognition
Encoding was assessed using forced-choice yes/no visual and audio recognition tests, which were then summed to form an overall recognition measure. Each experimental segment had an audio target and foil pair and a video target and foil pair, creating 1,024 items (256 audio target-foil pairs and 256 video target-foil pairs). Visual items were still frames, selected so that each target and its corresponding foil had either the same news anchor or similar types of events, objects, or scenes and were similar in visual complexity. Audio items were 3.5- to 4-second audio clips (in movie clip format with an all-black video stream). Target-foil pairs either had the same news anchors or were about similar topics (therefore they were the same in terms of either theme or voice and sound quality) and were matched on audio complexity. Both recognition tests were administered on laptop computers using DirectRT (Jarvis, 2008a). Visual items were presented in random order for 250 milliseconds each. Participants were asked to respond quickly “yes” if they remembered the scene/audio from the experiment or “no” if they thought the scene/audio was new. Again, pairing the data from two participants, one from each order, creates one complete set of recognition data. Visual recognition data for three participants (n = 1 for 1st order, n = 2 for 2nd order) and audio recognition data for two participants (n = 1 for each order) were lost due to equipment failure, resulting in N = 142 for the recognition data. For the analysis, the proportion of correct target identifications was calculated.
Participants
Two hundred and eighty-eight students (177 male) were recruited from communication courses at a Midwestern university and received a predetermined amount of course credit for their participation. Participants’ ages ranged from 18 to 27 (M = 20.66, SD = 1.98). Two hundred and eighty-six participants reported their race. Two hundred and twelve identified themselves as White, not of Hispanic origin, 5 as Hispanic, 21 as Black, not of Hispanic origin, 35 as Asian or Pacific Islander, and 13 as other.
Procedure
Upon arrival at the laboratory, participants were informed of the purpose of the study and the experimental procedure. After informed consent was obtained, each participant was randomly assigned to one of six message presentation orders and seated in a separate cubicle, to ensure privacy, with an individual laptop computer and headphones. Participants were instructed to pay close attention to the news program (to hold controlled resource allocation constant) because their memory for the show would be tested. In addition, they were told that from time to time they would hear an audio tone, at which point they should press the spacebar as quickly as possible. Participants were run in groups of six. After the news program, participants viewed a 4-minute distracter video clip, followed by the audio and visual recognition tasks (the order of the tests was randomly assigned). Participants were then thanked and dismissed.
Results
A Redundancy (4) × Audio Complexity (2) × Video Complexity (2) × Arousing Content (2) × Valence (2) model was run to test the hypotheses. The results are shown in Table 2. STRT and recognition results are considered jointly to indicate the level of available resources. Fast STRTs and good recognition indicate ample available resources. Slowing STRTs with stable recognition indicate fewer but sufficient available resources. Slowing STRTs with falling recognition indicate very low levels of available resources. Very fast STRTs and poor recognition indicate insufficient available resources. This determination is made if the STRTs are as fast as or faster than they are when there are ample resources and recognition performance is as poor as or poorer than it is at the lowest level of available resources.
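The joint reading of STRT and recognition described above can be written as a small lookup. This is only a sketch of the stated decision rules: the labels, function names, and numeric values are hypothetical, and the comparison of raw values stands in for the significance tests on which the actual determinations rest.

```python
# Joint patterns and the level of available resources they indicate.
RESOURCE_STATE = {
    ("fast", "good"): "ample",
    ("slowing", "stable"): "fewer but sufficient",
    ("slowing", "falling"): "very low",
    ("very fast", "poor"): "insufficient",
}


def available_resources(strt_pattern, recognition_pattern):
    """Look up the resource state for a joint STRT/recognition pattern."""
    key = (strt_pattern, recognition_pattern)
    return RESOURCE_STATE.get(key, "not covered by the stated rules")


def is_insufficient(strt_ms, recognition, ample_strt_ms, lowest_recognition):
    """STRT as fast as or faster than under ample resources, and recognition
    as poor as or poorer than at the lowest level of available resources."""
    return strt_ms <= ample_strt_ms and recognition <= lowest_recognition


print(available_resources("slowing", "falling"))  # very low
# Hypothetical values: the STRT matches the ample-resources level,
# and recognition is no better than the poorest level seen so far.
print(is_insufficient(350.0, 0.55, ample_strt_ms=360.0, lowest_recognition=0.60))  # True
```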
Table 2. Summary of F Test Results by Dependent Variable.
H1a and H1b
These hypotheses predicted that decreasing redundancy would lead to fewer available resources. The main effect of redundancy was significant on both the STRT and the recognition data (see Table 3 for results). As predicted, there appear to be ample available resources during talking heads, with very fast STRTs and excellent recognition. At semantic redundancy, STRTs slow dramatically and significantly and recognition drops significantly; thus, there are still sufficient available resources, but not many. At thematic redundancy, STRTs remain the same but recognition drops again, suggesting even fewer available resources. At conflicting redundancy, STRTs are as fast as they were during talking heads and recognition is as poor as during thematic redundancy, meeting the definition of insufficient available resources. Thus, as expected, each successive level of audio/video redundancy results in fewer available resources.
Table 3. Main Effects of Redundancy on STRT and Recognition.
H2
This hypothesis predicted that decreasing audio/video redundancy would reduce available resources more quickly (meaning at a higher level of redundancy) during high compared to low audio complexity segments and during high compared to low video complexity segments. The redundancy × audio complexity and redundancy × video complexity interactions were significant on both the STRT and recognition data. These results are shown in Figure 1. In both graphs, light bars and dotted lines represent low complexity and dark bars and solid lines represent high complexity. In both graphs, STRTs start fast, get slower, and then get fast again. In both graphs, as expected, the STRTs become significantly faster at an earlier level of redundancy for high-complexity than for low-complexity messages.

Figure 1. Redundancy × complexity interaction on STRT and recognition.
Slight differences in the pattern of recognition exist for audio and video complexity. For low audio complexity, recognition is excellent for talking heads and then declines steadily across redundancy conditions. A state of insufficient available resources exists at the conflicting level of redundancy. For high audio complexity message segments, recognition is never good: available resources are already low during talking heads, and a state of insufficient available resources exists at thematic redundancy. The pattern is similar for video complexity, except that during talking heads both low and high video complexity segments have ample resources. Low-complexity message segments reach a state of insufficient available resources at conflicting redundancy, and high-complexity message segments reach it at thematic redundancy.
H3
This hypothesis predicted that there would be fewer available resources during calm negative compared to calm positive messages and during arousing positive compared to arousing negative messages, and that available resources would decrease across the levels of redundancy more quickly during calm negative and arousing positive messages compared to calm positive and arousing negative messages, respectively. The results are shown in Table 4.
Table 4. Redundancy × Valence × Arousing Content on STRT and Recognition.
The first prediction is that STRTs should be faster (more available resources) for calm positive compared to calm negative messages at each level of redundancy, and they are. Similarly, STRTs should be faster for arousing negative compared to arousing positive messages at each level of redundancy, and they are. The second prediction is that available resources should decrease across levels of redundancy for all types of emotional content, and they do. For the calm messages, the prediction is that available resources will decrease faster for calm negative compared to calm positive messages. Both types of calm content reach insufficient available resources at the conflicting level. However, task performance (recognition) is significantly worse and declines faster for calm negative compared to calm positive messages. During the arousing content messages, the same pattern holds, with both positive and negative messages reaching insufficient available resources at the conflicting level of redundancy. Counter to our prediction, task performance is somewhat better during arousing positive compared to arousing negative messages at three of the four levels.
H4
Finally, this hypothesis predicted the two four-way interactions (one for audio complexity and one for video complexity) with insufficient available resources, indexed by fast STRT and poor recognition, occurring soonest for high-complexity calm negative and arousing positive messages and latest during low-complexity calm positive and arousing negative messages. Both the redundancy × audio complexity × valence × arousing content and the redundancy × video complexity × valence × arousing content interactions were significant. They are shown in Figures 2 (audio complexity) and 3 (video complexity).

Figure 2. Redundancy × audio complexity × valence × arousing content on STRT and recognition.

Figure 3. Redundancy × video complexity × valence × arousing content on STRT and recognition.
Our expectations for both figures were that high-complexity segments would have fewer available resources than low-complexity segments, that less redundant segments would require more resources than more redundant segments, and that, for calm content, positive messages would have more available resources than negative messages whereas, for arousing content, negative messages would have more than positive messages. Figure 2 contains four graphs. The first column compares the valence × audio/video redundancy interaction for high and low audio complexity for calm content only. The second column compares the same interaction for arousing content. For calm content, STRTs are generally faster and recognition is generally better for positive compared to negative messages, especially when the segments are complex. For arousing content, however, we see a different pattern, with STRTs being slower for positive compared to negative messages (indicative of fewer available resources, as predicted) but combined with better recognition (contrary to our predictions) for positive compared to negative messages. The differences are magnified during complex content. Of particular interest here is the improvement in recognition at the conflicting level for arousing positive messages.
The expectations for the video complexity four-way interaction were the same, the form of Figure 3 follows the same pattern, and the results are quite similar. Again, STRTs are generally faster for positive compared to negative messages during calm content and slower during arousing content. During calm content, recognition is generally better for positive messages, though only slightly during low-complexity messages. During arousing messages, valence-related differences in encoding are not consistent.
Discussion
This study examined news processing using a human-centric dynamic approach, which conceptualized a set of message variables in terms of the level of automatically imposed cognitive load and predicted their real-time combined impact on available resources and encoding thoroughness. Unlike previous research conceptualizing complexity as available resources, which either collapsed the coding over an entire message or examined the impact at a single point, this study coded available resources for a set of variables (audio/video redundancy, audio complexity, and video complexity) continuously for each 4-second period over the course of an hour-long news program. Multiple exemplars of all combinations of the variables were selected to measure the dependent variables of STRT and audio and visual recognition. For this study, the two recognition measures were combined into an overall measure of recognition. The results generally showed that this approach may provide a powerful way to predict moment-to-moment variation in the complexity of messages, the cognitive load of users, and memory for specific moments in messages. Further, the results also provide additional support for refining conceptualizations of the specific variables used in this study (audio and video structural complexity, emotional content, and audio/video redundancy) as available resources.
First, the study made predictions about how four levels of audio/video redundancy should vary in terms of cognitive load, and these predictions were supported. Results suggest that conceptualizing audio/video redundancy in terms of four levels of decreasing available resources, in the order talking heads, semantic, thematic, and conflicting, as suggested by the literature, was warranted. Second, results suggested that conceptualizing emotional content as having fewer available resources as arousing content increases, with calm positive and arousing negative messages having more available resources than calm negative and arousing positive messages as a function of the built-in positivity offset and negativity bias of the human motivational system, was also warranted. Third, conceptualizing message complexity as available resources resulting from the ratio of orienting eliciting audio and video structural features (resources allocated) and the information they introduced (resources required) was also warranted. Finally, the interactions were built to show that one could combine the influence of all of these variables simultaneously, in multiply changing combinations, to predict available resources at given intramessage time periods.
Of course, there is much more work to be done. In particular the four-way interactions—though supporting some of the predictions—clearly suggest that more is going on as does the model on which these measures and predictions are based. First off, all participants are averaged together even though we know that differences in news viewing and public affairs knowledge strongly impact cognitive load. Given the within subjects nature of the design, these differences are controlled, but future analyses may be done to split the participants into groups and see if some of the unexpected findings are then explained.
Similarly, for this article audio and video recognition are combined, though we know from previous research that audio and video encoding have different resource requirements. Indeed, it has been suggested that visual encoding is relatively automatic while audio encoding is relatively controlled and requires more resources to process (A. Lang, 1995, 1999; Shiffrin & Schneider, 1977). It has also been suggested that viewers use different strategies to maximize understanding and comprehension of difficult messages, such as shifting attention to either the audio or the video portion of the messages (Bergen, Grimes, & Potter, 2005). Given the hypothesis that audio and video require different levels of resources for processing, they will be differentially influenced by the level of available resources. Therefore, the combined recognition measure may be hiding channel-switching behaviors or very different available resource states between audio and video encoding. Future analyses should examine the two recognition measures separately to see whether such resource allocation shifts are taking place as available resources decline.
Another possibility would be to perform a signal detection analysis of the recognition data to examine the influence of real-time available resources on the yes/no decision-making process that yields the recognition data. In the communication field, signal detection theory has been used to explore whether improvements in recognition performance are due to more complete message processing or simply to a more liberal decision criterion (Fox, 2004). In fact, previous research by Fox et al. (2007) has shown that liberal shifts in criterion bias may herald the onset of insufficient available resources. Thus, it would be interesting to examine whether such shifts occur in our data and whether they precede decreases in sensitivity.
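As a hedged illustration of what such an analysis could compute, the sketch below derives the standard equal-variance sensitivity (d′) and criterion (c) indices from yes/no counts. The function name and the counts are hypothetical, and the log-linear adjustment (adding 0.5 to each count) is one common way to keep rates away from 0 and 1, assumed here rather than taken from this study.

```python
from statistics import NormalDist


def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) for yes/no recognition counts.

    Rates use a log-linear adjustment (an assumption of this sketch)
    so the inverse normal transform is always defined.
    """
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Hypothetical counts: an unbiased responder and a more liberal responder.
d_neutral, c_neutral = sdt_indices(30, 10, 10, 30)
d_liberal, c_liberal = sdt_indices(35, 5, 20, 20)

print(round(d_neutral, 2), round(c_neutral, 2))
print(round(d_liberal, 2), round(c_liberal, 2))
```

A negative criterion value indicates a liberal bias toward responding “yes,” so a shift toward more negative values at lower levels of redundancy would be the pattern of interest.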
In addition, this article introduces the DAVRC measure, extending previous work that measured these variables at the message level. The findings suggest that this measure can be coded reliably and may be a valid indicator of available resources for any given 4-second period. At a minimum, the measure provides a model for creating new dynamic message coding tools that can combine multiple message variables on the single metric of cognitive load in order to begin to understand real-time message processing. In addition, it produces continuous, real-time, theoretical ratio-level values that would make appropriate inputs to future time-series-based cognitive models (see Wang, Lang, & Busemeyer, 2011, for an example of such a model). Such future models could, by doing lagged analyses, also examine the assumption made here that the primary influence of coding messages in terms of predicted available resources occurs very quickly, within 2 seconds of relevant OESFs. Additional analyses can be done with these data to sort message segments by DAVRC score from low to high and examine the extent to which that variable predicts variance in STRT and audio and video recognition.
Finally, this article provides a model of how any set of message variables of interest can be reconceptualized in terms of the resources they cause to be automatically allocated and required. This would allow us to compare studies across message types, media, and laboratories to truly extend our understanding of dynamic message processing. It is hoped that this article provides an exemplar of a new methodological paradigm that will allow us to begin studying the situated and evolved real-time dynamic interaction between complex media messages and complex human motivated cognitive processors.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
