Abstract
Video visual analytics is the research field that addresses scalable and reliable analysis of video data. The vast amount of video data in typical analysis tasks renders manual analysis by watching the video data impractical. However, automatic evaluation of video material is not reliable enough, especially when it comes to semantic abstraction from the video signal. In this article, we describe the video visual analytics method that combines the complementary strengths of human recognition and machine processing. After inspecting the challenges of scalable video analysis, we derive the main components of visual analytics for video data. Based on these components, we present our video visual analytics system that has its origins in our IEEE VAST Challenge 2009 participation.
Introduction
A large portion of the rapidly increasing global data volume is video. In fact, video is the dominant data type in many different data domains: for example, video accounts for half of the consumer internet traffic in 2011, 1 and more than 95% of clinical data is video. 2 Furthermore, the amount of video data organized in large collections grows at a fast pace; for example, YouTube reports 3 that 72 h of video footage are uploaded to its network every minute. The vast amount of data renders the traditionally human-centered analysis of video footage impractical. This affects different application domains with different video characteristics, such as closed-circuit television (CCTV), surgery surveillance, sports analysis, scientific studies, or the analysis of large video or movie collections.
Considering the complexity and data scale of the video domain, the new field of video visual analytics was established in recent years. Although video visual analytics has several older roots, its beginning as a research area of its own can be traced back to IEEE VisWeek 2009. There, a workshop on Video Analytics was held to connect multimedia analysis with visual analytics, 4 and the IEEE VAST Challenge 2009 5 included a video mini-challenge for the first time. Video visual analytics differs from the already established application areas of visual analytics mainly in its huge amount of complex and dynamic data and in the questions and tasks related to video analysis.
In this article, we review the typical questions asked in the context of video analysis as well as the particular challenges to visual analytics that arise from the complex type of video data. Aware of these challenges, we formulate a description of the video visual analytics process based on the research and development agenda of visual analytics, 6 and on the work of Pirolli and Card, 7 and Keim et al. 8 In section “Architecture of video visual analytics,” we then introduce a visual analytics system for video data that is based on this process model. For the prototypical implementation, however, we move away from the general notion of video analysis toward the more specific problem of object-based analysis of nonnarrative and unedited video data. (Nonnarrative video data are not intended to convey any narrative, nor have they been edited by a human composer. Others 9 use the terms scripted and nonscripted to distinguish between edited and unedited video footage. Such video data are therefore considered unstructured.) Since this system has its origins in the IEEE VAST Challenge 2009, we use the data and the scenario of the VAST Challenge to demonstrate the capabilities and workflow of our video visual analytics system.
Challenges in video analysis
In general, we can identify three main goals targeted by semantic video analysis: status determination, event detection, and model generation (cf. Figure 1). (These goals are derived from a literature survey of applications in the video domain, such as video retrieval and video surveillance. Considering the scope of this article, only a small selection of relevant publications10–15 can be mentioned.) In the case of status determination, the amount of data considered in video analysis remains quite manageable since the (overt) status of a scene or a recorded entity at a previously specified point in time can be determined using a small fraction of a video sequence (e.g. the current weather or the number of persons in a queue at a particular point in time). However, the detection of events in, or the generation of models from, video data typically involves analysis of a large portion of video footage. While the interval to be processed to generate a model of the video data coincides with the target interval for which the model has to be built, the processing interval and the target interval in event detection typically differ. Often, a large portion of video footage has to be analyzed to detect a typically small interval of the video sequence that covers the requested event. The main objective in event detection is therefore to determine the point in time at which a distinct event occurs. In contrast, models are built to describe the common patterns of a single sequence or a set of multiple video sequences (e.g. common movement patterns of a soccer player in sports analysis). Based on such a description, predictions on future data can be made, or the model can be used to define an event detection task: for example, the search for abnormal behavior of subjects in video surveillance as an outlier with respect to a common pattern (model).

Figure 1. Types of questions asked in video analysis with their corresponding processing and target intervals. There is a diffuse transition between online and offline processing for the particular tasks. Time is represented on the right side by the x-axis, and known points in time are marked by “a” and “b.” For status determination, only a small portion of data has to be taken into account. When searching for an event, the point in time is usually unknown (denoted by “?”), and hence, the processing interval often ranges over a large time-span. For model generation, the time-span of analysis typically matches the processed interval.
Furthermore, the problem of video analysis can be categorized into offline analysis that considers only historic video data and online analysis of real-time video streams. In practical applications, the two types of analysis differ in the questions they typically ask. The question of status arises more often in real-time applications, whereas model generation is usually performed on historic video data. However, a sharp separation does not exist (e.g. there are also online model training techniques), as depicted in Figure 1. We will use these two dimensions (temporal aspect of the video data and type of target goal) to categorize the different tasks in video analysis throughout this article.
Two further challenges in video analysis, besides the large amount of video data to be analyzed, are the quality of search target definition and the complexity of the video domain. Well-defined search targets (e.g. the definition of a critical event in monitoring situations) require a proper model and information about the acceptable or necessary deviation from this model. The model can either be built from data or be compiled from previous knowledge of the analyst. Although precise search target definition is important for all three video analysis tasks, the impact of vaguely defined search targets on event detection is most severe. In event detection, vaguely defined search targets (e.g. a target defined as “all suspicious events”, cf. Höferlin et al. 16 ) hinder the application of known-item search by automatic event detection, which would otherwise yield a huge reduction of analysis costs due to the large ratio between target and processing interval. Instead, vaguely defined search targets demand more exploratory data analysis (EDA) that involves the human analysts to benefit from their background knowledge. In the same way, the complexity of the video domain prevents successful application of automatic video analytics approaches. The complexity of the video domain and its data is characterized by the mainly unstructured and dynamic type of data, the numerous degrees of freedom (e.g. uncontrolled environment with regard to imaging parameters or illumination conditions 17 ), many ambiguities, and often a low signal-to-noise ratio. Much of this complexity is directly introduced by the imaging process itself, such as ambiguities that stem from mapping the three spatial real-world dimensions onto two image dimensions with subsequent sampling and discretization of the captured image.
The unreliability of automatic video analytics approaches becomes especially severe in the context of security applications (e.g. video surveillance). In such a context, high recall is obviously mandatory. However, high precision is similarly important because high false alarm rates annoy and desensitize the security personnel. Unfortunately, the low precision achieved by recent video analytics approaches is the reason that this area is questioned in general. 18 Hence, human analysts are required to analyze the video data manually or in a semiautomatic fashion by utilizing low-level (close to the signal) computer vision approaches that are rather reliable. For example, the calculation of optical flow can be considered reliable in realistic scenarios: the Middlebury benchmark indicates that the average absolute end-point error is typically below 1 pixel (cf. http://vision.middlebury.edu/flow/). In contrast, the higher semantic task of human activity recognition is rather unreliable in more realistic scenarios (e.g. Wang et al. 19 report recognition accuracies of about 50% on the Hollywood2 20 dataset). As a consequence of the main challenges of video analysis (vast amounts of data, complex data, and vaguely defined search targets), purely automatic or purely manual analysis is not applicable. Visual analytics, however, provides a way to combine the strengths of both worlds (cf. Keim et al. 21 ) to form a scalable way of both targeted and exploratory video analysis.
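To make the interplay of recall and precision concrete, the following minimal Python sketch computes both measures for a hypothetical alarm log. The function name and all counts are invented for illustration and do not stem from any of the cited studies.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of raised alarms."""
    raised = true_positives + false_positives      # all alarms shown to the operator
    relevant = true_positives + false_negatives    # all incidents that really happened
    precision = true_positives / raised if raised else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical detector: it finds 9 of 10 real incidents but raises 91 false alarms.
precision, recall = precision_recall(true_positives=9, false_positives=91,
                                     false_negatives=1)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these invented counts, recall is 0.90 but precision is only 0.09: the operator would have to dismiss roughly ten false alarms for every real incident, which illustrates the desensitization problem described above.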
Video visual analytics
The general goal of visual analytics as “the science of analytical reasoning facilitated by interactive visual interfaces” 6 is to generate insight from data. The process of making sense of data with respect to investigational tasks was investigated by Pirolli and Card 7 and inspired the research agenda of visual analytics. 6 According to Pirolli and Card, the sense-making process can be split into two conceptual loops: a foraging loop and a sense-making loop. The foraging loop includes processes aimed at gathering, searching, and filtering data and extracting relevant information as foundation for further reasoning. These elementary reasoning products are then developed into a mental theory or hypothesis that is best supported by the evidence extracted from the data or inferred from an argumentative basis.
A simplified model of the sense-making process integrated into the visual analytics process is depicted in Figure 2. This visual analytics process model differs slightly from that of Keim et al., 8 by putting stronger emphasis on the sense-making process. The core of this model represents the sense-making loop in which hypotheses are developed from elementary reasoning products such as evidence. After hypothesis generation, each mental theory has to be checked against the data by prediction and hypothesis testing and may finally lead to some outcome in the form of task-dependent insights or knowledge from data. In contrast to the sense-making loop that mainly involves information products of higher levels of abstraction, the foraging loop essentially consists of extracting relevant information from, and discovering knowledge in, raw data. The tasks appearing within the sense-making loop can therefore be summarized as reasoning and deduction, whereas the foraging loop involves separating signal from noise, relevant from irrelevant information, as well as extracting patterns and building models.

Figure 2. Integrative view on visual analytics and sense-making process. The tight integration of visualization, data mining, and user feedback (marked by the blue dashed rectangle) leverages different data analysis methodologies for pattern and structure discovery in the foraging loop (components are depicted in blue). Higher level reasoning products (yellow boxes) are involved in the sense-making loop. Based on elementary reasoning artifacts, more complex scenarios are generated during the analytical discourse of the analysts. After verification or rejection of these hypotheses against the databases, users may go back to the foraging loop several times in this iterative process, until the final outcome or insight is produced.
Tight coupling of the human recognition capabilities with the processing power of computers typically characterizes the sense-making process (foraging loop and sense-making loop) in visual analytics. Bertini and Lalanne 22 term this as “integration of automatic and interactive data analysis” and the “fingerprint of Visual Analytics”. Considering the visual analytics’ foraging loop, the combination of automatic and interactive data analysis is reflected in the close integration of three major knowledge extraction methodologies: EDA, knowledge discovery in databases (KDD), and information retrieval (IR). EDA, a term coined by Tukey, 23 is the human-centered data-driven process (bottom-up) of generating models of phenomena of the data (patterns and structure). In contrast to classical mathematical data analysis, EDA is facilitated by visual representations of the data and, thus, allows deriving statistical models by data exploration instead of relying on models pre-imposed by the analysts. However, after data exploration, any discovered model has to be evaluated within a confirmatory step.
As counterpart to the visual representation in EDA, the data-driven KDD process (bottom-up) utilizes automatic data mining by “applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data.” 24 These automatically mined patterns are subsequently confirmed or rejected by the analysts. Data mining algorithms allow evaluating a large amount of different models, while the confirmatory step of interpretation by the analyst within the KDD process assures that statistical significance is not undermined.
The models in EDA and KDD are largely data driven. In contrast to this bottom-up data analysis by exploration, IR enables the analysts to pose queries to the database to satisfy their information need. These queries stem from previous knowledge of the analysts or previous iterations of the foraging process and make up their contextual model. The top-down IR process can be used either in exploratory or in confirmatory ways and helps the analysts develop and check higher knowledge constructs (as defined in the research agenda for visual analytics 6 ) for subsequent sense-making.
Although Figure 2 suggests a clear separation and order of processes, in practical realizations of the visual analytics process, transitions are diluted. All the processes are iterative and recursive, which means that besides repeating particular tasks, subtasks may also be represented by a visual analytics process or parts of it. This especially applies to the support of (sub)tasks by visualization and automatic methods, contributing to the high integration of visualization and automatic approaches in visual analytics.
In the context of visual analytics of video data, “new techniques must be developed to integrate the [these] capabilities for analyzing streaming video data into the analyst’s toolkit.” 6 We first addressed this integration with our participation in the VAST Challenge 2009. Since then, our toolkit to analyze video data has matured into a complete visual analytics system that tackles video analysis with all its challenges and tasks described earlier, with a focus on scalability. In section “Architecture of video visual analytics,” we describe this system and point out its close connection to the presented visual analytics model.
Scalability
As data scales, the information gap, that is, the gap between the available data and the required information, increases. (Here, we follow the recommendation of Wurman 25 that information that does not inform anymore should be just considered as data.) What is required are techniques that facilitate knowledge extraction and scale with the growth of data. Therefore, the research agenda of visual analytics identifies scalability as a major challenge. 6 The authors distinguish five types of scalability that in turn all arise from the problem of exploding data volume. In general, we find three common methods applied to achieve scalability in the presence of huge amounts of data: serialization, parallelization, and data reduction. In the context of visual analytics, these three methods have to be observed from the perspective of both human and machine.
Serialization is often used to process datasets that are too large to be kept in working memory of either human or machine. Hence, data can be processed in so-called data streams. Since video data are time-oriented, a natural streaming dimension already exists, and is usually used to process the data sequences.
To increase the processing capacity compared to a single working instance (e.g. a computer or a human analyst), work distribution is often the method of choice. This leads to parallelization of work on the one hand by collaboration of analysts or by crowdsourcing anonymous labor, and on the other hand, by distributed computing, such as cluster computing or cloud computing.
To enable data scalability of a single working instance, the problem size (most commonly the amount of data) has to be reduced. Data reduction is a natural approach often applied in the form of task-dependent filters (cf. filtering of the sensory input to the consciousness in the human brain) or aggregation methods, such as hierarchical problem solving or divide and conquer approaches that always start with a coarsely aggregated view on the data and end up with the details of the data if required. Note that aggregating data reduction methods—in contrast to filters or queries—can easily conflict with sequential processing of the data, if aggregation and serialization dimensions overlap.
These three methods and their application to visual analytics finally lead to the different notions of scalability enumerated in the research agenda of visual analytics.
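On the machine side, the three methods can be combined in a few lines. The following Python sketch is only an illustration under stated assumptions: the per-frame value is a synthetic stand-in for a real measurement (e.g. motion energy), and all function names, the threshold, and the chunk size are hypothetical rather than part of the presented system.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def frame_stream(n_frames):
    """Serialization: yield one (index, value) sample at a time instead of
    holding the whole sequence in memory."""
    for i in range(n_frames):
        yield i, (i * 7) % 10  # synthetic stand-in for a per-frame measurement


def chunks(stream, size):
    """Cut the stream into consecutive chunks along the time axis."""
    it = iter(stream)
    while chunk := list(islice(it, size)):
        yield chunk


def reduce_chunk(chunk, threshold=5):
    """Data reduction: filter weak samples, then aggregate the rest."""
    kept = [value for _, value in chunk if value >= threshold]
    return {"start": chunk[0][0], "kept": len(kept), "max": max(kept, default=None)}


def summarize(n_frames, chunk_size, workers=4):
    """Parallelization: distribute the chunks over several workers.

    Note: Executor.map submits all chunks up front, so a truly unbounded
    stream would additionally need a bounded queue.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reduce_chunk, chunks(frame_stream(n_frames), chunk_size)))


summaries = summarize(n_frames=20, chunk_size=10)
print(summaries)
```

Because the aggregation here runs along the same (time) dimension as the serialization, a chunk summary only becomes available once the whole chunk has been streamed in, which is a small instance of the conflict between aggregation and sequential processing noted above.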
Involving the human in the sense-making process introduces an additional risk that has to be considered in the context of data scalability. Due to the human’s perceptual limits, the concept of situational awareness becomes important. According to Endsley, 26 situational awareness “can be thought of as an internalized mental model of the current state of the operator’s environment” that covers three levels of understanding (i.e. perception, comprehension, and prediction of entities and their status 27 ) in the context of dynamic systems, such as video analysis. An erroneous or incomplete model, which is a deficit in situational awareness, can have severe consequences. This is especially the case if decisions are made based on analysis that lacks situational awareness. Furthermore, the pressure of time that is involved in many real-time analysis tasks, such as online, proactive video surveillance, aggravates the situation even more. Major common perceptual deficits relevant in video analysis are as follows:
Change blindness reflected by difficulties in identifying unexpected changes during blinks, flickers, or disruptions; 28
Inattentional blindness reflected by poor recognition of changes that are outside the focus of attention; 29
The short period of attention when monitoring video screens of about 20 min. 30
Additionally, the analysts face distractions and additional responsibilities within their work environment that have to be reflected in a holistic view on their situational awareness. Besides taking into account perceptual limits of the human analysts, missing data and uncertainty of the data have to be conveyed to the users in order to maintain proper situational awareness for decision making. Hence, a scalable visual analytics system has to support the users in keeping track of the sequentially processed data and its uncertainties originating from measurement process and data transformations. In collaboration environments, situational awareness further involves an overview of the analysis state of the team members including their data interpretations, derived reasoning artifacts, and the implied confidences. In situations where data reduction techniques have been applied to the data, the confidence of applied filters has to be communicated to the human analysts, as well as the patterns or structure lost by aggregation. The three methods to enable scalable analysis (serialization, parallelization, and data reduction) and the notion of situational awareness inspired the visual analytics system for video data introduced in section “Architecture of video visual analytics.”
Related work
Video visual analytics is a quite new research area that is part of the field of multimedia analytics. 4 However, due to its integrative character, it has connecting points with many research areas, among them are automatic video analysis and computer vision, data mining and IR, data stream management systems (DSMSs) and moving object databases, and visualization and human–computer interaction. We focus the following discussion on a few fields in which we see the main contributions of this article.
There have been several projects on retrieval, exploration, and analysis of large archives of video data. Due to their integration of human and machine analysis, some of these approaches can be considered as video visual analytics. Among them are MediaMill,31,32 the Informedia project,33,34 and the works of Luo et al.35,36 In contrast to these approaches, we put more emphasis on the analytical discourse than on the pure retrieval of particular parts of the video stream.
Besides these approaches to video visual analytics, there are two areas of visual analytics that cope with data domains related to video data: geospatial visual analytics 37 and visual analytics of dynamic data. Although both video visual analytics and geospatial visual analytics are concerned with spatiotemporal data dimensions, the questions and challenges in these areas differ in many respects. Therefore, a few approaches from geospatial visual analytics may carry over to video analysis (e.g. the VideoPerpetuoGram (VPG) 38 described in section “Visualization and data mining” can be seen as an adaptation of the space–time cube, the most prominent element in Hägerstrand’s 39 time geography that is applied in geospatial visual analytics 37 ), but others have to be adapted or newly designed. The questions in video visual analytics are even more focused on time than on space, although the general components of the questions are the same: “what”, “where”, and “when.” 40 Often, the “where” question plays a secondary role in video analysis since location is largely predetermined by the recorded field of view. Also, the data characteristics differ. Video data typically offer much higher resolutions than georeferenced data, both temporally and spatially. (Videos are typically captured at frame rates between 24 and 30 frames per second (fps) to allow for continuous motion experience by the human visual system. The perception of such apparent motion in film is based on two phenomena: 41 stroboscopic motion and visual persistence. Stroboscopic motion appears at frame rates between 5 and 10 fps. To reduce the flicker effect, that is, to increase visual persistence, higher frame rates are required. 
Hence, “at a rate of at least 16 frames per second, the ‘motion’ in a film seems smooth and natural.” 42 Exceptions to these frame rates can be found in time-lapse recording of video footage (the frame rate for recording the video sequence is lower than typical video frame rates) and slow motion captures (the frame rate for recording the video is much higher than typical video frame rates). In the geographic information system (GIS) domain, various primary data sources are used. Remote sensing imagery, for example, typically offers frame rates between one image a day and one image a year, 43 and popular global positioning system (GPS) datasets, such as Microsoft’s GeoLife GPS Trajectories dataset, provide no sampling intervals shorter than a second. More information on temporal and spatial resolution of data used in the GIS domain can be found in appropriate textbooks. 44 For video analysis, the resolution largely depends on the field of application. For video surveillance, for example, typical and useful resolutions are also discussed in handbooks 45 and studies46,47). Additionally, raw video data provide much redundant information by dense optical sampling. Hence, it requires more storage and processing capacities and additional feature extraction calculations. Geospatial visual analytics, in contrast, typically has to cope with more objects and longer observation periods, which entails its own open challenges. 48 Generally, video analysis adopts the Eulerian view on the data: data are sampled on a static grid, the pixels. However, the Lagrangian particle-based view dominates geospatial visual analytics. The Lagrangian view reflects the flow transport of a single particle in a vector field. In video analysis, an additional tracking step is required to move from the Eulerian to a Lagrangian perspective (cf. Kuhn et al. 49 ). However, missing information in video data poses challenges in this feature extraction step. 
Due to the projection from a three-dimensional (3D) scene to a two-dimensional (2D) image and the discretization and quantization of the continuous signal, much information is lost. This problem has a large impact on tracking due to the presence of occlusion and a limited field of view. As a result, many dedicated research areas have emerged that try to solve the correspondence problem for tracking under these conditions (e.g. multitarget multicamera tracking, person re-identification). In the context of geospatial visual analytics, these issues usually do not exist since the data already contain correspondences and metadata necessary for analysis (e.g. GPS data). Nevertheless, both fields have the analysis of spatiotemporal trajectories of moving objects in common. Therefore, some of the applied methods overlap.
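The tracking step that turns per-frame (Eulerian) detections into per-object (Lagrangian) trajectories can be illustrated with a deliberately simple greedy nearest-neighbor association. This sketch is not the method of Kuhn et al. or of the presented system; the function name, the distance threshold, and the input format are assumptions made for illustration only, and real trackers handle occlusion and re-identification far more carefully.

```python
import math


def link_detections(frames, max_dist=2.0):
    """Greedy nearest-neighbour linking of per-frame detections.

    frames: list of lists of (x, y) positions, one inner list per frame.
    Returns a list of trajectories, each a list of (frame_index, x, y).
    """
    finished, active = [], []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            _, px, py = track[-1]
            # Closest detection that no earlier track has claimed in this frame.
            best = min(unmatched, key=lambda d: math.dist((px, py), d), default=None)
            if best is not None and math.dist((px, py), best) <= max_dist:
                track.append((t, *best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                # Correspondence lost, e.g. by occlusion or leaving the field of view.
                finished.append(track)
        # Every detection that no track claimed starts a new trajectory.
        for detection in unmatched:
            still_active.append([(t, *detection)])
        active = still_active
    return finished + active


frames = [[(0, 0), (10, 10)], [(1, 0), (10, 11)], [(2, 0)]]
trajectories = link_detections(frames)
print(trajectories)
```

In the toy input, the second object disappears after the second frame, so its trajectory ends early. This is exactly the kind of broken correspondence that occlusion and a limited field of view cause, and that GPS-based data typically do not exhibit.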
Because of their volume, video data are generally processed as a stream and can therefore be considered dynamic data, regardless of whether they are recorded historic data (offline) or real-time data from a live capture device (online). Hence, many aspects of dynamic visual analytics 50 also apply to video visual analytics. Despite this overlap, there are fundamental differences between video visual analytics and dynamic visual analytics as it is defined by Mansmann et al. 50 They especially consider dynamic visual analytics in the context of real-time tasks, such as monitoring of network traffic. However, video visual analytics inherently applies to dynamic data, regardless of the origin of the data. In the same way, their notion of situational awareness falls short for video visual analytics. In contrast to their definition of situational awareness as the target of dynamic visual analysis, situational awareness in the context of video visual analytics is a mandatory element for a successful analysis in each stage of the process. We therefore consider situational awareness from a broader perspective.
A large body of literature addresses the issue of visualization of video data and other time-dependent data. For background information, we refer to two comprehensive and recent surveys, dedicated to time-dependent data 51 and video data in visualization and graphics. 52
To summarize, the presented video visual analytics approach differs from previously published work by its tight integration of the analytical reasoning process, enabling the users to extract knowledge from video data by a unique combination of known-item retrieval and video browsing. Besides detailed analysis of the tasks and scalability challenges of the video domain, our contribution lies in the novel architecture of scalable video visual analytics and its prototypical implementation in the context of video surveillance that addresses the technical, perceptual, and cognitive challenges of visual analytics in the video domain.
Architecture of video visual analytics
As described earlier, visual analytics of video data has to cope with large amounts of complex video data and often vaguely defined task descriptions. The system we present in this section integrates most aspects of the three methods to achieve data scalability for humans and computers, as introduced in section “Scalability.” Together with the issue of data scalability, we support maintenance of the analysts’ situational awareness, both in the foraging loop and during sense-making. Besides data scalability, we aim at a visual analytics system that can be flexibly used for any of the different tasks of video analysis introduced in section “Introduction.” This type of scalability, termed task scalability, likewise covers online and offline tasks.
The video visual analytics pipeline in Figure 3 results from the mentioned requirements. A first prototype of this system was developed for our VAST Challenge 2009 participation. It received the award “Outstanding Video Analysis Tool.” 53 Over the last years, it has been largely extended by several visualizations and interaction techniques as well as methods of automatic video processing. It now provides a rich visual analytics system with wide support of reasoning in video data. A revised usage example that shows the new video visual analytics system applied to the IEEE VAST Challenge 2009 scenario is provided in the supplementary material of this article. Before we consider each component of our video visual analytics pipeline in detail within the next sections, we first provide a general overview. Please note that for the discussion of the prototypical implementation, we move away from the broad notion of video content analysis to a more specific problem type. With respect to the dataset and problem scenario defined by the IEEE VAST Challenge 2009, the presented system represents an instantiation of the general video visual analytics process with application to object-based video surveillance. (Objects and their interaction with the environment play a key role for the understanding of nonnarrative video data in object-based video surveillance. Since it is quite common for surveillance video data that only visuals are captured and audio is neglected, we will primarily focus on the visual component of surveillance video data in this article. However, the concepts presented here can in principle be extended to involve audio data as well.) In this domain, event detection (an event can also be the occurrence of a particular object) is the dominant task, as various studies13–15 reveal.

Figure 3. The video visual analytics pipeline. The video data streams are unidirectionally processed and finally presented to the human analysts (solid arrows). Feedback of the analysts (dashed arrows) closes the processing loop and allows them to adjust data transformation at any of the processing stages.
The unidirectional flow of video data reflects the fundamental idea of stream processing realized in our system (solid arrows). Video data are streamed through several stages before they are presented to the human analyst. First, a video manipulator with the objective to enhance video quality (e.g. by color correction or noise reduction) is applied. Then, additional features, such as foreground masks, trajectories, and diverse properties of these features, are extracted (feature extraction). Here, subsequent stages determine the kind of features calculated (e.g. trajectory-based filters, particular properties that have to be visualized and are used for data mining). In the filtering stage, which is important for the scalability of the system, filters reject the data that have been defined to be irrelevant to the analyst. The importance of the filtered data is then judged by relevance measures. Subsequent methods may access the relevance rating to transform the representation of the data. Finally, the data are presented to the user by the highly integrated visualization and data mining stages. The human analysts can interact with and give feedback to each of the particular stages; this is outlined by the dashed arrows in Figure 3. In the early stages, analysts select the video data to be analyzed and apply video manipulators. They decide which parts of the data are relevant for their analysis by designing filters, relevance measures, and larger filter pipelines. Combining various coordinated views allows them to observe the aspects of the data important to them. Moreover, they interact with the visualizations and apply different kinds of aggregation and mining methods that allow them to discover interesting patterns.
These stages of the video visual analytics pipeline mainly correspond with the foraging loop in Figure 2. The IR process is represented by the subloop between filtering and the human analyst. The KDD process can be found within the subloop data mining/visualization–human analyst with focus on data mining, where the EDA process is integrated into the same subloop with focus on visualization.
The support of the sense-making loop in Figure 2 can be found in the video visual analytics pipeline of Figure 3 in the form of the connection between the reasoning sandbox and the human analyst. In the reasoning sandbox, elementary reasoning artifacts, such as relevant information, assumptions, patterns, higher order knowledge constructs, and evidence, can be organized to support the analysts in formulating sound hypotheses. After hypotheses generation, the human analyst can check them against the data utilizing again the foraging pipeline in Figure 3.
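The unidirectional stream processing of the foraging pipeline can be illustrated with chained Python generators. This is only a minimal sketch under stated assumptions, not the implementation of our system: the frame record (a dictionary with a time stamp `t` and a `motion` value), the stage functions, and the clamping used as a stand-in for a manipulator are all hypothetical. The analyst feedback (dashed arrows) corresponds to exchanging the `predicate` and `measure` arguments.

```python
from typing import Callable, Iterable, Iterator

Frame = dict  # hypothetical record: {"t": seconds, "motion": activity value}


def manipulate(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Manipulation: enhance the signal without changing its data type."""
    for f in frames:
        # Clamping to [0, 1] is a stand-in for, e.g., noise reduction.
        yield {**f, "motion": min(1.0, max(0.0, f["motion"]))}


def extract_features(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Feature extraction: enrich each frame with a derived feature."""
    for f in frames:
        yield {**f, "has_foreground": f["motion"] > 0.0}


def filter_stage(frames: Iterable[Frame],
                 predicate: Callable[[Frame], bool]) -> Iterator[Frame]:
    """Filtering: reject data the analyst defined as irrelevant."""
    for f in frames:
        if predicate(f):
            yield f


def rate_relevance(frames: Iterable[Frame],
                   measure: Callable[[Frame], float]) -> Iterator[Frame]:
    """Relevance measure: annotate data, but do not exclude anything."""
    for f in frames:
        yield {**f, "relevance": measure(f)}


def run_pipeline(frames, predicate, measure):
    """Chain the stages; every frame passes through once, in order."""
    staged = extract_features(manipulate(frames))
    return rate_relevance(filter_stage(staged, predicate), measure)


# Invented example data.
stream = [{"t": 0.0, "motion": 0.0},
          {"t": 1.0, "motion": 0.4},
          {"t": 2.0, "motion": 1.7}]
result = list(run_pipeline(stream,
                           predicate=lambda f: f["has_foreground"],
                           measure=lambda f: f["motion"]))
```

Because every stage is a generator, no stage holds more than the frame it is currently processing, which mirrors the scalability argument for streaming.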
Data streams
In the first stage of the video visual analytics pipeline (see Figure 3), the analysts can select one or multiple data streams to be considered in the analysis. This step corresponds to the first step, search and filter, of the foraging loop of Pirolli and Card. 7 The data sources are not restricted to video data alone. In particular, any additional time-dependent data stream may provide useful enrichment in the analysis. Examples of such data streams are automatic teller machine (ATM) transactions or badge data (i.e. entrance and exit times of persons for particular buildings as provided by the IEEE VAST Challenge 2009). Please note that data streams, although dynamic, can be prerecorded off-line data or real-time online data streams, such as from live cameras. Furthermore, the sampling rates of the added data streams can vary widely (e.g. VAST Challenge 2009 video data: ~15 fps; CCTV time-lapse video ~0.25–8 fps 46 ), and the streams can be sampled either regularly (e.g. video streams) or irregularly (e.g. badge data).
To achieve data scalability, we generally stream all data sources within our visual analytics system. However, as soon as temporal aggregation techniques or aggregation of multiple data streams come into play, online algorithms or time windows are required. These windows can be either sliding windows or user-defined static time windows. Furthermore, parts of the streamed data may sometimes be cached by particular stages (e.g. the feature extraction or the visualization stage) to resolve temporal dependencies in processing.
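How several time-ordered streams with different sampling can be merged and then aggregated over a sliding window may be sketched as follows. The tuple layout `(timestamp, source, value)`, the function names, and the example values are assumptions made for illustration only.

```python
import heapq
from collections import deque


def merge_streams(*streams):
    """Merge time-ordered (timestamp, source, value) streams lazily."""
    return heapq.merge(*streams, key=lambda item: item[0])


def sliding_windows(stream, width):
    """For each arriving item, yield the items of the last `width` seconds."""
    window = deque()
    for item in stream:
        window.append(item)
        # Drop items that are older than the window allows.
        while window[0][0] < item[0] - width:
            window.popleft()
        yield list(window)


# Invented example: a regularly sampled video stream and
# irregularly sampled badge events.
video = [(0.0, "video", "frame0"), (0.5, "video", "frame1"),
         (1.0, "video", "frame2")]
badge = [(0.2, "badge", "enter"), (0.9, "badge", "exit")]

windows = list(sliding_windows(merge_streams(video, badge), width=0.5))
```

Only the content of the current window is kept in memory, so the approach also applies to live streams that never end.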
Manipulation
The algorithms in this stage are characterized by not changing the data type between input and output data. Therefore, this stage can be deemed to be a processing stage rather than a stage of extraction or understanding. The objective of the manipulation stage is to enhance the raw data signals in order to improve the quality and effectiveness of successive stages (especially feature extraction, relevance measure, and visualization). With respect to video, this stage may contain approaches to improve contrast (e.g. in foggy situations), to correct colors (e.g. to counterbalance illumination changes during the day), or to deshake (e.g. important for pillar mounted cameras) or deblur video sequences. The choice of the manipulators applied to the data streams is left to the human analysts.
Feature extraction
In the feature extraction stage, a variety of features are calculated that are utilized in later stages. In general, the feature extraction stage consists of many specific feature extractors that depend either on the original data stream (i.e. the manipulated video stream) or on a composition of previously calculated features. Some feature extractors further build and adapt their own (online) models or cache the data stream and required features using the sliding window technique. For example, features such as optical flow or foreground segmentation require only the video stream and their internal models. The internal model for foreground segmentation is represented by a background model trained on the video data seen so far. Blob extraction, in contrast, depends on the optical flow (motion blobs) and the foreground segmentation (foreground blobs) as well as on the video stream itself. Finally, trajectories of moving objects are extracted by applying a tracking algorithm to the previously extracted blobs. This also involves an internal model of the movement characteristics as well as the sliding window technique. A more detailed description of the feature extraction algorithms used can be found elsewhere. 16 The structure of the feature extraction stage is mostly created automatically. A graph of feature extractors is constructed that is based on the features required for filters, relevance measures, and visualizations selected by the users as well as on these features’ own dependencies.
The feature extraction stage can be seen from different perspectives. The important role of trajectories in our video visual analytics system originates from the assumption that changing parts in a video sequence are more relevant than static parts. Hence, feature extraction represents, on the one hand, a first data reduction step triggered by a model of background knowledge about the relevant parts in video data. To this end, the raw video stream is aggregated into a more abstract type of information. In fact, some visualizations only present features without the original video data, such as the Interactive Schematic Summaries (see section “Visualization and data mining”), which use only trajectory and background image features. On the other hand, feature extraction can serve as an enrichment of the raw data by additionally created features. This is the case for the extraction of trajectories of moving objects from the data. Inspired by the flow visualization community, the transition from video data to trajectories also marks the transition between the Eulerian perspective on the data and the corresponding Lagrangian view. However, both types of descriptions have their raison d’être and complement each other.
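The automatic construction of the extractor graph can be sketched as a dependency resolution followed by a topological sort. The dependency table below follows the examples given above (optical flow, foreground segmentation, blobs, trajectories); the names and the function `plan_extraction` are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency table: feature -> features it is computed from.
DEPENDENCIES = {
    "optical_flow": {"video"},
    "foreground": {"video"},
    "blobs": {"video", "optical_flow", "foreground"},
    "trajectories": {"blobs"},
}


def plan_extraction(requested):
    """Return the extractors to run, ordered so dependencies come first."""
    needed, stack = set(), list(requested)
    while stack:  # collect the requested features and all their ancestors
        feature = stack.pop()
        if feature not in needed:
            needed.add(feature)
            stack.extend(DEPENDENCIES.get(feature, ()))
    graph = {f: DEPENDENCIES.get(f, set()) for f in needed}
    order = TopologicalSorter(graph).static_order()
    # The raw video stream is an input, not an extractor.
    return [f for f in order if f != "video"]


# A visualization that needs trajectories pulls in the whole chain.
plan = plan_extraction({"trajectories"})
```

Requesting only `"foreground"` yields a plan with a single extractor, so features that no filter, relevance measure, or view asks for are never computed.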
Filtering
Data reduction by filtering is one of our main concepts for data scalability, besides data aggregation. Moreover, the filtering stage plays a versatile role in our visual analytics system. Users can apply filters to define continuous queries to retrieve particular data instances of video data and its calculated features. Continuous queries are a concept from DSMSs. In contrast to static queries, continuous queries “are evaluated continuously as data streams continue to arrive.” 54 By using filters as tools for IR, analysts can check their reasoning hypotheses against the data. For these reasons, the filters are processed prior to the relevance measure and visualization stages.
Our visual analytics system provides various ways of filter definition, according to the different requirements of the users. However, all definition methods provide visual support and context information, such as charts of data distributions and video context information. The filters we provide mostly apply to features extracted from video, such as blobs and trajectories, but can provide relevance feedback (see the subsequent section) to the raw video stream as well. We will therefore use the term data instance interchangeably for video frames and their extracted features. However, our main focus is the query formulation for trajectory-based video retrieval. 55
A rather exploratory approach to filter creation is their definition by examples (Figure 4(a)). This method allows users to directly transfer selected data instances into white or black list filters. Scatter/gather browsing on Interactive Schematic Summaries, as explained in section “Visualization and data mining,” further allows users to define such a list filter by one click. However, black and white lists of data instances do not generalize to new and unseen data that were not added to the list. To enable generalization, users can decide to convert a list filter into a decision tree filter, which is trained on the examples of the black list and white list. The approach of filter definition by ad hoc training of a classifier model can be extended to an interactive learning scheme 56 that requires fewer examples and, thus, less definition effort.
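The conversion of a list filter into a generalizing classifier can be illustrated with the simplest possible decision tree, a one-level tree (decision stump), trained on the white list and the black list. This is a deliberately reduced sketch: the feature vectors, the exhaustive threshold search, and all names are assumptions for illustration and do not describe the classifier used in our system.

```python
def train_stump(white, black):
    """Fit a one-level decision tree on accepted and rejected examples.

    Returns (feature_index, threshold, accept_if_greater).
    """
    samples = [(x, True) for x in white] + [(x, False) for x in black]
    best = None
    for i in range(len(samples[0][0])):
        values = sorted({x[i] for x, _ in samples})
        # Candidate thresholds lie halfway between neighboring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for greater in (True, False):
                errors = sum(((x[i] > threshold) == greater) != label
                             for x, label in samples)
                if best is None or errors < best[0]:
                    best = (errors, i, threshold, greater)
    if best is None:
        raise ValueError("examples must differ in at least one feature")
    return best[1:]


def stump_filter(model):
    """Turn a trained stump into a pass/reject predicate."""
    i, threshold, greater = model
    return lambda x: (x[i] > threshold) == greater


# Invented examples: (mean speed, lifetime in seconds) per trajectory.
white = [(2.5, 4.0), (3.0, 9.0), (2.8, 6.0)]
black = [(0.4, 5.0), (0.6, 8.0)]
accepts = stump_filter(train_stump(white, black))
```

Unlike the plain lists, the trained predicate also decides on trajectories that were never selected, for example `accepts((2.0, 1.0))`.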

Filter definition in our video visual analytics system. Images (a)–(c) show different types of filter formulation: (a) by examples, (b) by property, and (c) by sketch. The selected property in image (b) is the movement azimuth of trajectories. The histogram (blue bins) depicts the distribution of the trajectories’ mean direction. An interval of the azimuth is chosen by fuzzy selection. The fuzzification function is a Gaussian in this example. More details including also parameter restrictions of other dimensions (e.g. velocity, lifetime, location) can be found elsewhere. 16 Image (d) shows a combination of different filters in a filter graph. Connected nodes are conjunctions of filters, and parallel routes denote disjunctions. Filters can be organized in containers (here, two containers) and activated or deactivated (see the right container).
In cases where the information need of the analyst can be specified more precisely (e.g. known-item search), another group of filters allows us to constrain the continuously retrieved data instances by properties, such as the position, direction, velocity, and time of appearance of trajectories (Figure 4(b)). Each of these filters allows us to constrain the range of accepted values of one particular property. To help the analysts determine proper ranges for the filter definition, the property distribution and video context are additionally displayed. Furthermore, we allow fuzzy filter definitions to account for potential vagueness and uncertainty of the analysts’ query. 16
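A fuzzy property filter on the movement azimuth with a Gaussian fuzzification function, as in the example of Figure 4(b), might look like the following sketch. The parameter names (`center`, `sigma`) and the cut level `alpha` used to turn the membership into a pass/reject decision are assumptions made for illustration.

```python
import math


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def gaussian_membership(azimuth, center, sigma):
    """Degree of membership in [0, 1] of an azimuth to the fuzzy interval."""
    d = angular_difference(azimuth, center)
    return math.exp(-0.5 * (d / sigma) ** 2)


def fuzzy_azimuth_filter(center, sigma, alpha):
    """Pass data instances whose membership reaches the cut level alpha."""
    return lambda azimuth: gaussian_membership(azimuth, center, sigma) >= alpha


# Invented query: movement roughly toward 0 degrees, tolerant to deviations.
northbound = fuzzy_azimuth_filter(center=0.0, sigma=20.0, alpha=0.5)
```

The angular difference wraps around at 360 degrees, so an azimuth of 350 is treated as 10 degrees away from 0 and still passes, whereas 60 does not.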
An alternative approach of specifying a filter model is its definition by sketch (Figure 4(c)). This definition interface allows the users to freely sketch a trajectory on top of a context video frame. After sketching the trajectory, the users may select the type of properties of the trajectory that will be included in the signature for retrieval.
The previously mentioned filters consider only one data instance at a time. Their decision to pass or reject a trajectory is based solely on that trajectory and the filter model. To enable aggregation queries, another type of filter is required that considers the relationship between two trajectories. However, continuous aggregation queries come at the cost of approximate answers calculated by sliding windows or any form of model prediction approach. 54 Hence, filtering for trajectory relationships requires the definition of a time window besides other interaction characteristics.
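A relationship filter of this kind can be sketched for one hypothetical interaction characteristic, a spatial encounter of two moving objects. The trajectory layout as a list of `(t, x, y)` samples, the function name, and both thresholds are assumptions; the brute-force comparison is chosen for clarity, not efficiency.

```python
import math


def encounters(a, b, max_distance, window):
    """True if two trajectories come close in space within a time window.

    a, b: lists of (t, x, y) samples. Two samples count as an encounter
    if they are at most `window` apart in time and at most `max_distance`
    apart in space.
    """
    return any(abs(ta - tb) <= window
               and math.hypot(xa - xb, ya - yb) <= max_distance
               for ta, xa, ya in a
               for tb, xb, yb in b)


# Invented trajectories.
person = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0)]
car = [(1.2, 1.0, 0.5), (5.0, 9.0, 9.0)]

met = encounters(person, car, max_distance=1.0, window=0.5)
```

Shrinking the time window changes the answer for the same data, which illustrates why the window has to be part of the filter definition.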
The single filter definitions can further be arranged in a filter graph for each data type and data source in the analysis project (Figure 4(d)). At the time of evaluation, the stream of data traverses this graph from the defined input to the node in the graph marked as output node. This way, analysts can construct complex queries that consist of Boolean combination of filter expressions. Switching the output node and rearranging the connectivity of the nodes enable the analysts to use the filter graph both as query history and as toolbox of modules and alternative filter branches. To support the more complex usage of the filter graphs, branches can be encapsulated in containers and used as a single filter unit. Containers for fuzzy filter definitions further feature defuzzification functionality and can be applied as relevance measure, which is discussed in the next section.
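The Boolean semantics of the filter graph (connected nodes as conjunctions, parallel routes as disjunctions, and containers that can be switched off) can be expressed with a few combinators. This is a sketch under assumptions: the names `series`, `parallel`, and `Container` are hypothetical, and a deactivated container is assumed to pass all data.

```python
def series(*filters):
    """Connected nodes: an instance must pass every filter (conjunction)."""
    return lambda x: all(f(x) for f in filters)


def parallel(*filters):
    """Parallel routes: passing one route is sufficient (disjunction)."""
    return lambda x: any(f(x) for f in filters)


class Container:
    """Encapsulates a branch as a single unit that can be deactivated."""

    def __init__(self, graph, active=True):
        self.graph, self.active = graph, active

    def __call__(self, x):
        # Assumption: a deactivated container lets everything pass.
        return self.graph(x) if self.active else True


# Invented elementary filters on trajectory records.
fast = lambda t: t["speed"] > 2.0
long_lived = lambda t: t["lifetime"] > 10
eastbound = Container(lambda t: 45 <= t["azimuth"] <= 135)

query = series(parallel(fast, long_lived), eastbound)

trajectories = [
    {"id": 1, "speed": 3.0, "azimuth": 90, "lifetime": 2},
    {"id": 2, "speed": 1.0, "azimuth": 90, "lifetime": 20},
    {"id": 3, "speed": 3.0, "azimuth": 270, "lifetime": 2},
    {"id": 4, "speed": 1.0, "azimuth": 90, "lifetime": 5},
]
passed = [t["id"] for t in trajectories if query(t)]
```

Setting `eastbound.active = False` and evaluating the query again relaxes it without deleting the branch, which is how the graph can double as a toolbox of alternative filter branches.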
Relevance measure
The idea behind the relevance measure stage is to automatically evaluate the importance of each data element (e.g. video frames, trajectories) according to a user-defined model. In contrast to filters, a relevance measure does not exclude data from further analysis. However, the assigned relevances can be utilized in the subsequent visual representation of the data elements to guide the analysts’ attention to important areas in the data. This addresses perceptual scalability and facilitates situational awareness during analysis. Detailed description of the different mappings of calculated relevance and visual representation is provided in the next section.
To define a proper relevance model, users can select different relevance measures and connect them as a relevance graph. Hence, defining the relevance models is quite similar to the definition of filters. We provide two types of elementary relevance measures. The first type contains relevance ratings that are directly derived from the data elements’ degree of membership to a fuzzy filter. The others include their own relevance model parameterized by the users. For example, the importance of video frames can be assessed by motion activity level, 57 information theory, 58 or a visual attention model trained on eye-tracking data. 59
Visualization and data mining
In our system, the visualization stage is responsible for conditioning and communicating information to the human analyst. Multiple coordinated views provide complementary perspectives to particular facets of the data and enable the analysts to explore the dataset. 60 The analysts can select and combine these views according to their specific task requirements. Data streams of all views are temporally synchronized and can be controlled by the users via the timeline (see Figure 5(a)). Selected data instances are highlighted in all views (brushing and linking). Further details about the selected data instances are available by the selection manager, which also allows transferring these instances as relevant information artifacts into the reasoning sandbox. Moreover, each visualization allows exporting specific views on the data as pattern artifacts into the reasoning sandbox for further sense-making.

Screenshot of our video visual analytics system that shows several views of the IEEE VAST Challenge dataset. The timeline (a) depicts the video stream split into four different parts: one for each position the camera pans to. The first scene is selected (green bar) and the static time window is set to the end of the video (saturated bars). The conventional video player is shown in (b) and highlights the selection of a woman (blue bounding box). (c) The VideoPerpetuoGram shows a dynamic summarization of the video within a sliding time window as space–time volume. The volume is augmented by two trajectories of present moving objects; three key frames depict context information. In the VideoPerpetuoGram, the selection is also highlighted (blue colored trajectory) via brush and link. The configuration of the auditory display is illustrated in (d). By the auditory display, we intend to support the situational awareness of the analyst. Users can place and orient a virtual listener (blue arrow inside yellow circle) in the video scene. Moving objects are mapped to particular auditory icons (e.g. steps for persons and engine sounds for cars) that are adapted to the movement velocity (pitch and speed of the auditory icons) and distance to the listener (volume). Head-related transfer function (HRTF) is applied to enable 3D localization of the objects relative to the virtual listener. This requires the user to wear headphones. (e) The chart view depicts the time series of the velocities of two trajectories. (f) The reasoning sandbox shows the links between a couple of elementary reasoning artifacts, such as relevant information (I), assumptions with their support level mapped to color (A), higher order knowledge constructs (K), and hypotheses (H). (g) The interactive schematic summaries show a schematic summarization of the static time window that is selected in the timeline: the trajectories are clustered according to their position into three sets. The spatial distribution of the trajectories is depicted on the top, and the temporal coverage is visualized in the temporal context view on the bottom. (h) The filter and relevance graphs.
Data dimensions
Video data are naturally represented in spatiotemporal dimensions. However, it depends on the task for which data dimensions are important for analysis. When searching for instances of cause and effect, the temporal dimension becomes more relevant than spatial dimensions. Other tasks, however, consider the spatial dimensions (e.g. detection of access to a forbidden area), additional properties (such as movement velocity or assignment to an object class), or a combination of them (e.g. spatiotemporal dimensions to detect encounter of multiple entities). To achieve task scalability and to support the users in exploratory pattern discovery, it is therefore important to provide different views that show various perspectives of the data. Our visual analytics system provides complementary visualizations of different data dimensions.
The timeline view (see Figure 5(a)), for example, illustrates the temporal context of different data streams by depicting their sampling intervals as bars. Additionally, the intervals of selected data instances, such as trajectories, are highlighted. Such visualization is useful to show both the temporal “location and duration of intervals. One can also see how intervals are related to each other.” 51
Object trajectories extracted from georeferenced video sources or directly captured by GPS devices can be displayed by the map view. 61 The representation in geographical context helps to identify patterns that span over a large area and possibly over multiple cameras. This view represents a connection between geospatial and video visual analytics.
Data mining and data aggregation
Depending on the analysis task and the availability of exact search target definitions, the analysis process features more or less exploratory characteristics. If the task is, for example, to search for a vaguely specified search target in off-line analysis, a first goal may be to gain an overview of the data. This may be achieved by building a mental model of typical patterns and structures in the data using data exploration techniques. Our visual analytics system therefore features tight integration of visual data exploration and automatic data mining. The different methods in data mining can be categorized as classification, regression, clustering, summarization, dependency modeling, and change/deviation detection. 24 Within the visualization and data mining stage, methods of clustering, summarization, change detection, and regression are applied (however, regression only plays a minor role). Methods of classification are mainly used in the filtering stage, and dependency modeling is utilized in the reasoning sandbox.
An example is the Interactive Schematic Summaries 62 view (see Figure 5(g)) that combines automatic clustering of trajectories with a schematic visualization of the generated model. The view applies trajectory bundling to summarize calculated trajectory clusters as visual feedback of the data mining results. Furthermore, scalable video data exploration is achieved by scatter/gather browsing of the trajectory clusters. Cluster selection automatically adds a trajectory filter to the dataset and can be used to explore the dataset either by common structure or by comparison to the trained model.
Besides filtering, aggregation is the method of choice to enable scalable analysis of large data as well as facilitating visual pattern discovery by providing views of different scales on the data. Visualizations that apply data aggregation, such as Interactive Schematic Summaries, enable hierarchical exploration of the data from coarse to fine. Visual exploration and data aggregation further support the analysts’ abilities of pattern discovery and thus the formulation of new ideas and mental models of the data.
Another example of the usage of aggregation techniques to leverage the analysts’ pattern mining abilities is the chart view 61 visualization (see Figure 5(e)). This view depicts time series (e.g. of properties of trajectories as well as of additional one-dimensional time-dependent data streams) by different types of standard charts supporting several granularities of temporal aggregation.
Another type of aggregation that is typically applied to enable data scalability and visual pattern mining is the reduction of the resolution of the presented data. Resolution reduction allows inspecting a larger area of the data space and puts emphasis on coarse structures in the data. Most views of our system support the adaptation of the spatial and temporal resolution. In addition to this general adaptation of the spatiotemporal resolution of different visualizations, the representation interval of streamed and dynamically displayed data elements (i.e. video frames) can be adjusted. Besides conventional fast-forward video playback at constant pace, our system also provides relevance-based subsampling of the video stream, called adaptive fast-forward. Adaptive fast-forward sets each video frame’s playback duration in proportion to the importance assigned to it within the relevance measure stage. This in turn can be regarded as relevance-based adaptation of the temporal data stream resolution, which helps maintain data scalability even with respect to the low-level data signal. Thus, uninteresting parts (according to the relevance measure) are shown for a shorter period of time to avoid boredom, while relevant periods are presented for a longer time to allow the analyst to better keep track of all activities in the video data. 58,59 The display duration, either at constant sampling interval or adapted to relevances, is controlled by a central heartbeat mechanism to ensure temporally synchronized views on the data.
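The relevance-proportional playback duration of adaptive fast-forward can be sketched as a simple normalization. The function names, the fixed total playback time, and the `floor` parameter (which keeps frames with zero relevance briefly visible) are assumptions made for illustration, not the published method.

```python
from itertools import accumulate


def playback_durations(relevances, total_time, floor=0.25):
    """Distribute `total_time` over frames in proportion to relevance.

    `floor` is a hypothetical minimum weight so that irrelevant frames
    are shortened but not skipped.
    """
    weights = [max(r, floor) for r in relevances]
    scale = total_time / sum(weights)
    return [w * scale for w in weights]


def heartbeat_schedule(durations):
    """Time at which each frame ends; all views can synchronize on it."""
    return list(accumulate(durations))


# Invented relevance values for four frames.
durations = playback_durations([0.0, 0.5, 1.0, 0.5], total_time=4.5)
schedule = heartbeat_schedule(durations)
```

In this example the most relevant frame is shown four times as long as the irrelevant one, while the overall playback time stays fixed.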
Situational awareness
Situational awareness is an important aspect in all parts of the analysis process. In the context of video fast-forward, two aspects become important to maintain the situational awareness of the users: the support of object identification and of motion perception. 63 Furthermore, the current playback speed has to be communicated to the users if the playback speed is adapted to the frames’ relevances. 58 Since all fast-forward visualizations have strengths and drawbacks in specific scenarios, we provide a variety of different fast-forward visualizations, 63 from which the analyst can choose the most suitable one according to task and data.
Our visual analytics system further addresses the issues of situational awareness arising when analyzing a large amount of dynamic video data. To cope with the perceptual deficits of humans, such as change blindness and inattentional blindness, we provide a couple of data representations that support situational awareness. Inspired by the multiple-resource theory, 64 we provide both visual data displays and displays using another communication modality: auditory displays for video data (see Figure 5(d)). Dependent on the level of data abstraction, either sonification of low-level video data 65 or sonification of higher level features, such as trajectories, 66 can be applied.
Another view on the video data that facilitates situation assessment is the VPG 16 (see Figure 5(c)). The VPG displays a particular period of a continuous video stream in its spatiotemporal dimensions, using a sliding window. Hence, a 3D video volume is rendered, where time is extruded in the third dimension. Sparse sampling of video frames, additional illustration of extracted trajectory features, and viewpoint navigation increase visibility and allow inspection of interaction and activity patterns in the data. This joint visualization of a short period of the video sequence and its extracted features alleviates change and inattentional blindness.
Furthermore, several views, such as the VPG and the conventional video player view, allow mapping the measured relevance of particular data elements on several attributes of their visual representation. For example, the VPG allows us to map the relevance value of a trajectory to its display color, and object segments in the video player are surrounded by a color-coded bounding box. In the same way, the degree of membership of a particular data element to a defined fuzzy filter can be mapped to the color of its visualization. This allows displaying the filter confidence of a data element. Moreover, other uncertainties, such as the positional uncertainty of trajectories (calculated by the feature extraction) can be either modeled by relevance measures or directly visualized by the VPG as a semi-transparent blur surrounding the visual trajectory representation. 16 This helps the analysts to be aware of their data quality.
Reasoning sandbox
The reasoning sandbox supports the human analysts in their analytic discourse and sense-making. Primarily, it manages the reasoning artifacts generated during the foraging and sense-making processes by the human analysts. By providing an overview of the different hypotheses, their support by evidence, alternative reasoning paths, and conflicting ideas, the sandbox helps maintain and assess situation awareness.
According to the research agenda of visual analytics, the “analytic discourse is the technology-mediated dialogue between an analyst and his or her information to produce a judgment about an issue.” 6 The agenda further describes the discourse as an iterative and evolutionary process that involves three classes of information: (1) the issue, (2) corresponding information that the analysts gathered, and (3) the analysts’ evolving knowledge.
The reasoning sandbox supports this iterative discourse and is therefore tightly coupled with the foraging loop. It visually organizes reasoning artifacts in all three levels of abstraction (closely following the reasoning artifacts defined in the research agenda of visual analytics 6 ). The first level (see Figure 5(f) top) includes relevant information, pattern artifacts, and higher order knowledge constructs. The second level consists of assumptions, which can be either supported or refuted by the top-level reasoning artifacts and, thus, may provide evidence or contradiction to the hypotheses in the third level. The degree of confidence of an assumption is modeled by its level of support (LoS) that is judged by the analysts with respect to the related top-level artifacts. Highly supported and, thus, proved assumptions are called evidence following the reasoning terminology of the agenda. 6 Assumptions that are not supported or refuted by any reasoning artifacts of the first level are considered as knowledge gaps. It is especially important for the analysts to keep these gaps in mind during hypothesis generation. That is why the research agenda of visual analytics emphasizes the separate handling of assumptions and evidence.
Within the sandbox, the reasoning process is represented as a graph of artifacts (nodes) connected with each other by supporting or refuting arguments, provided by the human analysts. To account for the separation of assumption not proved and evidence, the visual appearance of an assumption changes its color with its level of support: from yellow (LoS: not proved) to blue (LoS: proved (evidence)) or to red (LoS: disproved). Assumptions (as well as hypotheses) can further be interconnected to highlight conflicting or competing assumptions (or hypotheses).
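The handling of assumptions in the sandbox can be sketched as a small data structure. The class and field names are hypothetical, and the level of support is modeled here as three discrete states for simplicity; the color mapping and the notion of a knowledge gap follow the description above.

```python
from dataclasses import dataclass, field

# Color per level of support (LoS), as described for the sandbox.
LOS_COLORS = {"not proved": "yellow", "proved": "blue", "disproved": "red"}


@dataclass
class Assumption:
    text: str
    supported_by: list = field(default_factory=list)  # first-level artifacts
    refuted_by: list = field(default_factory=list)    # first-level artifacts
    los: str = "not proved"                           # judged by the analyst


def color(assumption):
    """Visual appearance of an assumption node."""
    return LOS_COLORS[assumption.los]


def is_knowledge_gap(assumption):
    """An assumption without any supporting or refuting artifact."""
    return not assumption.supported_by and not assumption.refuted_by


# Invented reasoning artifacts.
a1 = Assumption("Person A entered the building at noon",
                supported_by=["badge record"], los="proved")
a2 = Assumption("Person A met person B in the lobby")

gaps = [a.text for a in (a1, a2) if is_knowledge_gap(a)]
```

Listing the gaps explicitly keeps unproved assumptions visible while hypotheses are being formulated.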
Within the foraging loop, data elements that seem to be of relevance for the current case are selected by the analysts and directly imported into the reasoning sandbox as relevant information artifacts. Insight of patterns or structures in the data that is retrieved by visualization and data mining can be added to the reasoning sandbox as pattern artifacts. Together with the higher order knowledge constructs that can be directly formulated by the analysts, these three types of reasoning artifacts form the corresponding information gathered by the analysts (2).
The analysts’ evolving knowledge (3) is represented by the elements of the second and third levels as well as by the connections between them. The elements of the second level are assumptions with different levels of support. The third level consists of hypotheses, which are more complex reasoning constructs, such as complete scenarios.
After a sound and well-supported hypothesis has been found that properly answers the issue (1), and competing hypotheses have been rejected, the final outcome of the analysis (cf. Figure 3) is produced. This outcome is typically presented in the form of a dissemination product (i.e. report of the analysis result and its reasoning path).
Evaluation
An open question is how complex visual analytics systems can be evaluated. The importance of this question is indicated by the biennial BELIV workshop (http://www.beliv.org), which has addressed it since 2006. Basically, there are two problems when evaluating complex systems.
First, complex systems need expert users who are familiar with the system. Training of nonexperts to achieve comparable results requires a substantial amount of time. Moreover, the participants of the user studies should not only be experts with the system but also be professionals in their application domain, which further reduces their availability. This leads to studies with only a few participants and thus insufficient statistical power.
Second, objective measurements of complex systems are impractical. The performance of a complex system depends on a combination of a vast number of properties that cannot be assessed with a practical number of participants. For example, an objective performance measurement conducted to evaluate the fast-forward visualizations included 24 participants. However, even for these simple visualizations, which are far less complex than highly interactive visual analytics systems with multiple coordinated views, significant differences between two pairs of visualizations could not be shown.
For this reason, the proposed video visual analytics system was validated following Ellis and Dix. 67 They claim that validation should consider two parts: justification and evaluation. While the justification of the approach can be found in the particular sections and the corresponding articles, the strategy pursued for the evaluation of the video visual analytics system can be summarized as follows:
Separation of smaller parts and evaluation by quantitative (objective and subjective) user studies (e.g. fast-forward visualizations, 63 sonification approaches65,66);
Evaluation of approaches with medium complexity by qualitative (subjective) user studies such as expert interviews or think aloud (e.g. Interactive Schematic Summaries, 62 information-based relevance measure 58 );
Evaluation of complex systems by participation in challenges (e.g. VAST Challenge 2009).
Nevertheless, the overall performance of the system as a whole is not thoroughly evaluated, and further evaluation should be considered in future work.
Conclusion
In this article, we introduced the tasks and challenges of video data and pointed out a possible solution for its scalable and reliable analysis: video visual analytics. On the one hand, video analysis benefits from visual analytics, which is a powerful methodology that helps tackle the challenges existing in the context of complex data and vaguely defined search targets. On the other hand, we believe that visual analytics can benefit from video visual analytics too. Video visual analytics can provide a new perspective on the definition of visual analytics, due to its strong focus on the temporal aspect of data. This focus is reflected in our notion of scalability, situational awareness, and in the extension of the visual analytics process (cf. Figure 2). By presenting our implementation of the video visual analytics process, we hope to encourage further research in the field of video visual analytics in particular as well as multimedia analytics in general.
Footnotes
Funding
This work was funded by the German Research Foundation (DFG) as part of the Priority Program “Scalable Visual Analytics” (SPP 1335).
