Abstract
Data visualizations typically show a representation of a data set with little to no focus on the repeatability or generalizability of the displayed trends and patterns. However, insights gleaned from these visualizations are often used as the basis for decisions about future events. Visualizations of retrospective data therefore often serve as “visual predictive models.” However, this visual predictive model approach can lead to invalid inferences. In this article, we describe an approach to visual model validation called Inline Replication. Inline Replication is closely related to the statistical techniques of bootstrap sampling and cross-validation and, like those methods, provides a non-parametric and broadly applicable technique for assessing the variance of findings from visualizations. This article describes the overall Inline Replication process and outlines how it can be integrated into both traditional and emerging “big data” visualization pipelines. It also provides examples of how Inline Replication can be integrated into common visualization techniques such as bar charts and linear regression lines. Results from an empirical evaluation of the technique and two prototype Inline Replication–based visual analysis systems are also described. The empirical evaluation demonstrates the impact of Inline Replication under different conditions, showing that both (1) the level of partitioning and (2) the approach to aggregation have a major influence over its behavior. The results highlight the trade-offs in choosing Inline Replication parameters but suggest that using
Introduction
Visualization tools are most often designed to depict the entirety of a data set—subject to a set of filters applied to focus the analysis—as accurately as possible. In this typical pattern, the goal is to provide the user with an accurate understanding of all of the data in the underlying data set that matches the active set of filters. This ethos was captured, perhaps most famously, in Shneiderman’s Visual Information Seeking Mantra: overview first, zoom and filter, then details-on-demand. 1 Variations of this basic approach have since been adopted in most modern visualization systems.
The foundations for these systems are visual mappings that specify a graphical representation for the underlying data. For small and low-dimensional data sets, these mappings can be direct (e.g. a scatterplot for a small two-dimensional (2D) data set). As problems grow in data size or dimensionality, algorithmic data transformation methods can be used to filter, manipulate, and summarize raw data into a more easily visualized form.
On top of these mappings, interactive controls are often provided to give users even more flexibility to filter or zoom to specific subsets of data. These interactions can be linked to more detailed information about data objects, such as via levels-of-detail or multiple coordinated views. The result, when well designed, is an effective visual interface for data exploration and insight discovery.
For this reason, these steps form the core stages of the canonical visualization pipeline.2,3 This approach can be enormously informative and has led to advancements in how people seek to understand information across a wide range of domains, such as helping computer users navigate through large file systems, visualizing medical records to help doctors understand patient histories, and visualizing weather data to identify regions most impacted by a recent storm.
Critically, these visualization use cases are all retrospective in nature. More specifically, they employ visualizations that attempt to faithfully report data exactly as it was observed. Users aim to see an overview of the entirety of a given data set. If a user applies constraints to focus the visual investigation (e.g. via zoom and filter), the visualization is expected to show the full set of data that satisfies the applied constraints.
In many visualization scenarios, however, users are in fact more interested in conducting prospective analysis: using visualizations of these historical data to reason about future or not-yet-observed data. For example, medical experts examining data for a cohort of patients might be most interested in what treatments would work best for a future patient with similar characteristics. Visualizations of historical sports statistics are often used to inform strategic decisions that are used in upcoming competitions. Financial visualization tools are often used to inform future investment decisions. In each of these use cases, visualizations of historical data are used to inform future decisions. For such prospective analysis tasks, these retrospective visual depictions of the data are often used, in essence, as naïve visual predictive models, with the assumption that visualizations of historical data can be used to infer or predict future observations. In many ways, these visual models are analogous to statistical models developed during statistical analyses, from which users often also attempt to predict future observations.
In many cases, retrospective representations are indeed very informative. However, just like the underlying descriptive statistics that such visualizations often depict, 4 traditional retrospective visualizations often provide insufficient evidence for making predictive inferences, even as the visual depiction itself might be especially suggestive for making such inferences. In many cases, the trends and patterns that a visualization of retrospective data presents to a user may be artifacts of noise and expected randomness within the underlying data. For users to make valid inferences or predictions based on historical data, a more nuanced understanding of the data being visualized is required.
This critical gap between (1) retrospective visualization designs and (2) the predictive requirements of many users has been recognized within the visualization community. 5 Some have attempted to bridge this gap by adding support for inferential statistics within the visualization. Typically, this approach combines carefully designed statistical models with visualizations of the model’s results. For example, visualizations can be instrumented to estimate and display uncertainty, confidence intervals, or statistical significance. Alternatively, predictive modeling methods can be used to generate additional data, with the predictions themselves being incorporated into the visualization. These systems go beyond traditional descriptive reporting, but typically require a careful and sometimes onerous focus on modeling, including estimating underlying statistical distributions, which is incompatible with many applications for which a more accurate assessment of the repeatability of a given visualization would be useful.
This article presents Inline Replication (IR), an alternative approach to enabling inferential interpretation that is designed to overcome the above challenges. We introduce a partition function within the visualization pipeline to produce multiple folds for each visualized data subset. Metric functions are applied to each fold, and an aggregation function combines the individual measures prior to visualization. Interaction techniques enable users to examine both aggregate and individual fold metric values.
Our method is motivated by the bootstrap sampling and cross-validation techniques used widely in the statistics and machine learning communities. The IR approach is non-parametric, making it easy to apply and use generically within a visualization system without arduous modeling or assumptions about distribution parameters. IR integrates easily with the standard visualization pipeline and is also ideal for use in large-scale visualization systems where progressive or sample-based approaches are required. Finally, our method provides users with validation information that is both intuitive and easy to interpret. (This article includes material described in a pre-print that has been released by the authors via arXiv. 6 This version of the article is heavily revised and contains an entirely new evaluation section describing and analyzing results from a set of empirical experiments to test the behavior of IR under various conditions.)
The remainder of this article is organized as follows. It begins with a review of related work, then describes the details of the IR methodology. After that, we present an experiment that tests the accuracy of the method against traditional statistical analysis. We then share example results from a variety of proof-of-concept systems that include the IR technique. These examples range from simple bar charts to more sophisticated interactive visualizations of large-scale event data collections. 7 The article concludes with a discussion of limitations and outlines key areas for future work.
Related work
The IR approach to visual model validation is informed by advances in several different areas of research. These include the topics of uncertainty, predictive visualization, and progressive or incremental visualization. Also relevant are visualization systems that utilize inferential statistics methods and conceptual models of the visualization pipeline.
Visualization of uncertainty
The visualization of uncertainty has been an active research area within the visualization community for many years. Studies have explored the problem from many perspectives, including developing taxonomies that have examined types of uncertainty 8 as well as visualization methods for conveying uncertainty. 9 In addition, there have been many efforts to formally study alternative methods for depicting uncertainty measures10–12 through user studies that explore the perceptual understanding of various uncertainty representations. However, these studies focus on the visual representation rather than methods for determining the degree of uncertainty.
Perhaps most relevant to the IR approach proposed in this article is work that has focused on estimating uncertainty via measures of entropy within a data set rather than using carefully constructed statistical models. 13 Compatible with IR, this work proposes using entropy as a non-parametric measure of uncertainty for categorical data, one that neither requires formal modeling nor makes assumptions about specific distributions within the data. IR provides a broader and more general framework for this approach, within which the entropy metric could be easily adopted.
In other work, the distinction between the “visualization of uncertainty” and “the uncertainty of visualization” has been highlighted. 14 The latter is a related but separate concept from traditional uncertainty visualization. Such work highlights that the rendered graphics of a visualization can convey a sense of authority which may not be warranted, even when the underlying data itself are considered to be beyond reproach. This challenge is a key motivation for IR, especially with respect to the confidence of the user in making predictions based on this unwarranted authority, as outlined in the discussion presented in section “Visualization as a predictive model.”
Finally, we distinguish IR’s focus on variation of data over partitions from challenges related to non-representative samples. Non-representativeness and selection bias are important threats to visualization validity, 15 and recent research has proposed methods to address these challenges within the context of interactive exploratory visualization.16,17 IR can complement these methods but focuses on issues of variation rather than representativeness.
Predictive visual analytics
Visualization has long been used to support predictive analysis tasks. However, most often, the “prediction” is performed by users reviewing historical data and making assumptions about what might happen in the future for similar situations. In fact, the relatively limited history of work on visualizations that incorporate more formal predictive modeling methods was the topic for a workshop at a recent IEEE VIS Conference. 5
The work that does exist in this area has often focused on model development and evaluation rather than supporting end users’ predictive analysis tasks. For example, BaobabView 18 supported interactive construction and evaluation of decision trees. More recent work has focused on building and evaluating regression models. 19 This method, like ours, adopts a partition-based approach to avoid making structural assumptions about the data. However, the focus on building regression models leads to an overall workflow that is very different from the proposed IR approach.
Others have focused on visualizing the output produced by predictive models. For example, Gosink et al. 20 have visualized prediction uncertainty based on formalized ensembles of multiple predictors. This approach, however, requires careful modeling to develop the predictors, including the specification of priors that enables the Bayesian method that they propose.
Outside the visualization literature, where novel visual or interaction methods are not a concern, predictive features are typically visualized using traditional statistical graphics, for example, systems that visually prioritize and threshold p values to rank features for prediction. 21 Such methods are fully compatible with the IR process proposed in this article.
Progressive/incremental visualization
Model overfitting and other sampling challenges are common to “Big Data” visualizations that rely on progressive or incremental techniques.22,23 Initial samples are small, grow over time, and can change in distribution as time proceeds. Some have addressed this challenge by including confidence intervals along with partial sets of query results. 24 However, relying on the query platform to assess confidence in data subsets does not easily support interactive zoom and filter operations after the query because these changes in visual focus do not necessarily result in new queries that generate new result sets. Moreover, these papers do not propose methods for computing confidence intervals, but rather assume that such data will be provided by the database.
Inferential statistics
Statistical inference is a discipline with a very long and distinguished history. Most relevant to the IR method described in this article are challenges related to statistical significance and null hypotheses, and in particular, Type 1 and Type 2 errors. Type 1 errors refer to improper rejections of the null hypothesis which lead to conclusions that are not real effects, while Type 2 errors refer to falsely retaining the null hypothesis which can lead to assumptions that a true effect is false. 25
These types of errors are of critical concern in high-dimensional exploratory visualization where computational methods can quickly assess vast numbers of dimensions for statistical significance. Statistical correction methods have been proposed to reduce Type 1 errors, 26 but arguments have also been made against such approaches. Those arguments suggest that parameterized models or assumptions of “default” null hypotheses do not match real-world situations where distributions are rarely straightforward or independent. Suggesting that these correction methods are the wrong approach for exploratory work, Rothman 27 argues that “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”
This tension is present in many interactive exploratory systems which make it easy to generate vast numbers of potential hypotheses. As a result, a wide range of methods have been proposed for modeling measures of confidence or significance.28–30 These efforts, however, typically rely on formal statistical methods that make assumptions about distributions and variable independence. For example, confidence intervals have some conceptual similarities to IR. However, calculating a confidence interval requires assumptions about underlying distributions of the data and knowledge of key distribution parameters such as mean and variance. Such approaches are problematic for exploratory visualizations which enable users to rapidly apply filters or constraints that can quickly change the underlying assumptions. The IR method we propose enables users to visually assess the reliability of hypotheses, providing a high degree of flexibility. Similar approaches that rely on user judgment have been shown to be quite effective.31,32
Models of the visualization pipeline
The traditional visualization pipeline model describes the process of transforming raw data first to an analytical abstraction, then to a visualization abstraction, and finally to a rendered graphic for interaction.2,3 We add partitioning and aggregation stages to support the IR approach. As we will describe, a special case of the IR model (with just one partition) is equivalent to the traditional model. By extending the canonical pipeline, our work has similarities with Correa et al.’s 33 paper describing pipeline extensions for an uncertainty framework focused on the data transformation process. However, unlike Correa et al., IR proposes a different set of extensions which adopt an empirical approach rather than assuming a Gaussian or any other formal error model.
Visualization as a predictive model
Visualization design is often conceptualized as a mechanism for reporting. This retrospective approach is so ubiquitous that terms such as prediction, forecast, and inference cannot be found within the indices of many leading visualization texts from the past 25 years.34–36
Many visualization consumers, however, use graphical representations of historical data as the basis for decisions about future performance. This is done even when the underlying data and transformations do not support such prospective conclusions. Despite potentially fatal flaws in terms of generalizability and repeatability, retrospective visualizations are in essence being used as predictive models.
The tendency to assume predictive power in visualization can be seen, for example, in modern casinos. Roulette wheels commonly include an electronic display 37 which shows the table’s recent history. Assuming a fair table, “red” and “black” numbers should be equally likely to appear. However, as illustrated in Figure 1, the history provided to gamblers is not sufficiently long to learn if the table has any systemic bias.

Roulette wheels allow users to bet on “black” or “red” squares. Casinos often display a simple visualization of “recent spins” to provide gamblers with a false sense of predictive knowledge. In this example, the display shows a recent preponderance of black numbers with the implication to gamblers that this may influence future spins of the wheel.
Why then is the gambler presented with a simple visualization of the history? The data are visualized to provide gamblers with a false sense of knowledge, to suggest to a hesitant gambler that a bet is an informed decision rather than a random choice. A gambler may infer that the recent streak of black suggests more black spins will soon appear. Alternatively, the gambler may infer an imminent return to red. To the casino, it does not matter what predictive inference is drawn as long as it provides a false confidence that leads to increased betting.
It is tempting to dismiss this scenario as one in which the gambler should be more informed about basic statistics. The small sample size and the independence of each roulette spin should make it clear that the display is not especially informative. However, relatively sophisticated users performing visual analysis of data from more complex systems can make similarly poor predictive assessments on the basis of visual representations that do not properly convey the underlying limits of their predictive power.
For example, consider a business analyst attempting to learn about why sales are declining or a physician using historical patient data to compare treatment efficacy. In these complex real-world cases, in which it is essentially impossible to fully understand the underlying statistical processes, it is natural for analysts to turn to visualization as a predictive model for their problem. Visualization enables these users to see what has happened and, based on trends or patterns in the visual representation, make assumptions about what will happen in the future.
However, just as the casino gambler draws inference from a not-so-meaningful visualization, these power users can be led to make poor predictions on the basis of visualizations that are essentially “overfit” models based on poor visual representations of the underlying process. This problem has even been documented in highly quantitative fields such as epidemiology, where public health analysts have had trouble discounting statistics from small sample sizes when visualized. 38
Issues of poor sampling and overfitting are especially problematic during exploratory visualization in which users can interactively apply arbitrary combinations of filters to produce new ad hoc subsets of data for visualization. Such systems are at greater risk of generating misleading visualizations that occur “by chance” rather than due to real properties of the underlying problem. 39 The same is true for visualization systems that use sampled or progressive queries to address issues of scale.
The potential for this sort of “visual model overfitting” is analogous to the overfitting problem in more traditional modeling tasks. In the machine learning community, this is addressed in part by cross-validation, a widely used technique for assessing the quality and generalizability of a model. 40 Rather than relying on a single model solution, cross-validation methods create and compare multiple solutions, one for each of several partitions of a data set (often called “folds”). This enables an assessment of model repeatability, with models that work consistently across partitions considered more trustworthy. Similarly, bootstrap sampling techniques in statistics 41 produce multiple samples from a single set of observations in order to derive better estimates of the original sample’s statistical properties.
If one considers—as we argue here—that a visualization is often used as a form of predictive model, then validation becomes a critical guard against problems associated with visual model overfitting. When a visualization is zoomed and filtered to focus on a specific subset, is the visual representation repeatable, or is it due to chance variation? Are the conclusions drawn from the visualization generalizable? The IR method outlined in the next section seeks to help answer such questions by embedding an approach similar to cross-validation and bootstrap sampling within the visualization pipeline so that each new view produced during user interaction can be evaluated for validity.
IR
IR is an approach to visualization in which the data set associated with each visualized measurement is partitioned into multiple subsets, or folds, processed independently to calculate derived statistics or metrics, then aggregated back together to be rendered in a visualization. This partitioned approach embeds an automated and non-parametric workflow for data replication within the visualization pipeline, as illustrated in Figure 2. The result is that visualizations based on IR can provide users with important information about the repeatability of observed visual trends, reducing the likelihood of certain types of erroneous conclusions.

The Inline Replication (IR) visualization pipeline sends each derived measure’s subset of data
The IR pipeline begins with the same initial step as a traditional visualization pipeline. A set of query or filter constraints is first applied to a primary data source D to produce a focused data set
Traditionally, the data for each
The IR pipeline, however, behaves differently. Each
This section provides an overview of the IR pipeline, focusing on the three functions at the core of the design: the partition function, the metric function, and the aggregation function. It then describes the IR approach to visual display and interaction and concludes with a discussion of useful variations to the core design.
Partition function
Conceptually, the partition function is designed to subdivide the data in a given measure-specific subset
Formally, we define the partition function as an operator that subdivides a measure-specific set of data
This function is applied to the raw data in
As discussed previously, multiple folds are created with the goal of supporting repeated calculations for each measure. Increasing the value of n to produce more folds increases the replication factor. However, higher n values also produce smaller
Partitioning with
Choosing an appropriate n is necessarily a compromise between increased replication and smaller sample size. We can look to the machine learning community for guidance, however, where empirical studies have shown that there is no meaningful benefit for values of n over 10. 40 Moreover, as data sets grow larger in many fields, smaller sample size becomes less of a concern.
Finally, there are certain conditions (e.g. very small data sets with little data to partition or very large data sets where sampled queries are required) where the basic formulation for the partition function can be problematic. Variations to the partitioning process, designed to help address these challenges of scale, are discussed in section “Variations.”
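The baseline partition function can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation used in the prototype systems: the function name `partition`, the fixed seed, and the history of 100 spins are all hypothetical. The sketch shuffles the data and deals it into n folds so that the folds are disjoint, randomly assigned, and exhaustive.

```python
import random


def partition(items, n, seed=None):
    """Split items into n disjoint, randomly assigned, exhaustive folds."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items round-robin so fold sizes differ by at most one.
    return [shuffled[i::n] for i in range(n)]


# Hypothetical roulette history: 'B' marks a black spin, 'R' a red spin.
spins = ["B"] * 56 + ["R"] * 44
folds = partition(spins, n=5, seed=1)
print([len(fold) for fold in folds])
```

Because every item is dealt into exactly one fold, concatenating the folds recovers the original subset, which is the property the baseline formulation relies on.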
To illustrate the partitioning process, consider the roulette example from earlier in this article. The example bar chart showing the fraction of spins resulting in black or red is based on a single measure-specific subset of data
Metric function
The folds produced during partitioning are sent to a metric function which is applied independently to each fold as illustrated in Figure 2. The metric function computes derived statistics
The specific measures computed by the metric function are application specific but could range from simple descriptive statistics (e.g. sums, averages) to more complex analyses (e.g. classification, regression). Generally speaking, metric functions produce the same derived values that would normally be computed as part of a more traditional visualization process. The key difference in IR is that the metrics are computed n times for each
For example, consider the roulette use case described earlier. The metric function in this example would compute the fraction of spins resulting in black and red in each fold
An actual implementation of IR using a similar “fraction of the population” metric function is discussed in section “Use cases.” However, more sophisticated systems may adopt more advanced measures. For example, correlation statistics, p values, metrics of model “fit,” and regression lines are all compatible with the IR approach. Examples of IR using linear regression, correlation, and statistical significance testing are also described in section “Use cases.”
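As a rough illustration of a “fraction of the population” metric applied independently to each fold, consider the following Python sketch. The fold contents and the name `fraction_metric` are hypothetical and chosen only to keep the example small; they are not data from the systems described in this article.

```python
def fraction_metric(fold, color):
    """Fraction of spins in a single fold that landed on the given color."""
    return sum(1 for spin in fold if spin == color) / len(fold)


# Hypothetical folds of roulette outcomes ('B' = black, 'R' = red).
folds = [
    ["B", "B", "R", "B"],
    ["R", "B", "R", "R"],
    ["B", "R", "B", "B"],
    ["B", "B", "B", "R"],
    ["R", "R", "B", "B"],
]

# The metric is computed once per fold rather than once for the whole subset.
black_by_fold = [fraction_metric(fold, "B") for fold in folds]
red_by_fold = [fraction_metric(fold, "R") for fold in folds]
print(black_by_fold)
print(red_by_fold)
```

The same pattern would apply to a more advanced metric: the per-fold function changes, while the loop over folds stays the same.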
Aggregation function
The metric function produces a set of statistical measures
A variety of aggregation algorithms can be employed, with different approaches appropriate to different types of metrics. For example, for count-based metrics which capture the frequency of data items in each fold, a summation across all folds might be the most appropriate because a sum of counts for each fold provides an accurate total for the overall data subset
The aggregation function produces a single aggregate measure
As a concrete example, consider again the roulette scenario. The metric function described previously computed the fraction of spins resulting in black and red numbers for each of the five folds created by
Visual display and unfolding of partition data
Once aggregation has been performed, the merged data
First, an initial visualization is created using only the aggregate measures. The process for this stage is similar to a traditional visualization pipeline. The aggregate measures are mapped to visual properties of the corresponding graphical marks, which we call aggregate marks. These marks are then rendered to the screen for display and interaction. In the roulette example, for instance, the aggregate data for black and red spin rates (produced by the Aggregate function) can be used to generate a basic bar chart that is identical to what is shown in Figure 1.
Second, an IR visualization enables aggregate marks to be unfolded. An unfolding operation—typically triggered by a user interaction event such as selection or brushing—augments the aggregate marks with a visualization of the individual fold statistics that contribute to the aggregate measures. In the ongoing roulette example, the fold data would show the variation in proportion of spins that result in black and red numbers across each of the
Discussion
The ability to unfold aggregate measures into repeated measurements is a central contribution of the IR approach. By graphically depicting the repeatability of a particular measure across multiple folds, IR provides users with important and easy-to-interpret cues as to the variability of a given measure. Traditional visualization methods do not convey this information, meaning it is often not considered when predictive conclusions are made by users.
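The relationship between an aggregate mark and its unfolded fold values can be sketched in a few lines of Python. This is an illustration under assumptions: the per-fold fractions are hypothetical, the mean is used as one possible aggregation for fraction-based metrics, and the `unfold` helper and its dictionary layout are invented for this example rather than taken from the prototype systems.

```python
from statistics import mean


def aggregate_mean(fold_values):
    """Combine per-fold metric values into a single aggregate measure."""
    return mean(fold_values)


def unfold(fold_values):
    """Return the aggregate together with the fold-level detail behind it."""
    return {
        "aggregate": aggregate_mean(fold_values),
        "folds": list(fold_values),
        "min": min(fold_values),
        "max": max(fold_values),
    }


# Hypothetical fraction of black spins measured in each of five folds.
black_by_fold = [0.75, 0.25, 0.75, 0.75, 0.5]
detail = unfold(black_by_fold)
print(detail["aggregate"], detail["min"], detail["max"])
```

A wide gap between the minimum and maximum fold values is the kind of cue that unfolding is meant to surface. Note that the mean of fold fractions matches the fraction over the whole subset only when the folds are of equal size.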
Another benefit of IR comes from the aggregation function. In particular, embedding within the visualization pipeline an ability to aggregate categorical values such as statistical significance classification can lead to more accurate results. Repeated measures combined with voting-based aggregation can, for instance, reduce the exposure to Type 1 errors when looking for statistically significant p values. For example, a statistically significant
Variations
Following the traditional approach to k-fold cross-validation, the baseline Partition function defined in section “Partition function” specifies that the constructed folds are disjoint, randomly partitioned, and exhaustive (equation (2)). However, relaxing these constraints leads to several valuable variations to the baseline IR procedure.
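Two of these relaxations, folds that are not exhaustive and folds that are not disjoint, can be sketched in Python as follows. The function names, the `fraction` and `fold_size` parameters, and the example data are hypothetical; the sizes used here are illustrative assumptions and not values prescribed by the IR method.

```python
import random


def partition_partial(items, n, fraction, seed=None):
    """Disjoint random folds that together cover only part of the data."""
    rng = random.Random(seed)
    pool = list(items)
    # Draw a random subset without replacement, then deal it into n folds.
    kept = rng.sample(pool, int(len(pool) * fraction))
    return [kept[i::n] for i in range(n)]


def partition_with_replacement(items, n, fold_size, seed=None):
    """Folds drawn with replacement, in the spirit of bootstrap resampling."""
    rng = random.Random(seed)
    pool = list(items)
    # The same item may appear in several folds or repeatedly within one fold.
    return [rng.choices(pool, k=fold_size) for _ in range(n)]


data = list(range(100))  # hypothetical data points identified by index
partial_folds = partition_partial(data, n=5, fraction=0.5, seed=7)
bootstrap_folds = partition_with_replacement(data, n=5, fold_size=100, seed=7)
print(sum(len(f) for f in partial_folds), [len(f) for f in bootstrap_folds])
```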
Partial partitioning
Relaxing the requirement of equation (2) enables the creation of partitions that do not contain all data points within
Partitioning with replacement
Relaxing the requirement that all folds are disjoint enables partitioning with replacement. Similar to bootstrap resampling, 41 this approach enables the same data point to be included in multiple folds (or even multiple times within the same fold). When using replacement, the data set in
Incremental partitioning
A number of progressive or sampled methods have been proposed in recent years to address the challenges of “Big Data” visualization.23,24 In these approaches, the full data set
Empirical experiments
A series of simulated data experiments were conducted to empirically measure the behavior of IR in conditions where—unlike many real-world situations—accurate ground truth information about data distributions was available. The experiments focused on evaluating the impact of IR with respect to Type 1 and Type 2 errors for a common type of analysis: correlations between two data sets.
Data generation
To simulate empirical experiments under varying conditions, two different types of base data distributions were created as illustrated in Figure 3. These base distributions include a unimodal test case with simple unimodal normal distributions—the typical assumption for traditional statistics—and a multi-modal test case designed to test how IR behaves when the unimodal assumption is broken—a common occurrence in real-world data sets.

Empirical experiments measured IR behavior under two different conditions: (a) data drawn from simple unimodal normal distributions and (b) data drawn from a more complex condition of two non-aligned multi-modal Gaussian distributions. This figure illustrates the distributions from which the eight data sets used in the empirical experiments were drawn.
A second factor in the empirical experiments was the amount of known correlation in a data set. The experiment tested four levels of correlation: strong correlation (Pearson’s r = 0.61), moderate correlation (r = 0.25), weak correlation (r = 0.1), and completely random data with no correlation (r = 0.0).
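One simple way to produce data with a known correlation, shown here only for the unimodal case, is to mix a standard normal variable with independent noise. This sketch is an assumption about how such data could be generated, not a description of the procedure used in the experiments; the function names and the sample size of 5000 are illustrative.

```python
import math
import random

def correlated_pairs(n_points, r, seed=None):
    """Draw (x, y) pairs from standard normals with population correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_points):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # y has unit variance and covariance r with x.
        y = r * x + math.sqrt(1.0 - r * r) * noise
        pairs.append((x, y))
    return pairs

def pearson_r(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Print the target and observed correlation for the four levels in the text.
for target in (0.61, 0.25, 0.1, 0.0):
    sample = correlated_pairs(5000, target, seed=1)
    print(target, round(pearson_r(sample), 3))
```

The observed coefficient differs slightly from the target in any finite sample, which is exactly the kind of sampling noise that the experiments probe.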
This two-by-four experimental design (two types of base distributions, four levels of correlation) resulted in the creation of eight artificial data sets—eight D in the IR nomenclature—as summarized in Table 1. Each D consists of 5000
The empirical evaluation utilized eight artificially generated data sets corresponding to the two-by-four experimental condition design of the experiments (two distribution types, four correlation levels).
Experiments
For each D, 10,000 randomized IR experimental trials were conducted for each of the
In each trial, 1000

For each of the eight artificially generated data sets D, we created 10,000 randomly sampled subsets
A first set of experiments was designed to explore three different parameters of the IR process. First, IR was applied to each
A second set of experiments was designed to characterize IR’s behavior in response to different numbers of data points in
Results and discussion
The results of the experimental trials outlined above are illustrated in Figures 5 and 6 for the unimodal and multi-modal distributions, respectively. The charts reflect the 4 × 5 × 3 experimental design: the number of folds

Results of significance tests on correlation for 10,000 trials of subsets of 1000

Results of significance tests on correlation for 10,000 trials of subsets of 1000
The charts in Figure 5 show the results for the unimodal experiments. As might be expected, when the correlation was high
There is a similar reduction as n increases for correlations of 0.1 and 0 for all conditions except under the “any pass” aggregation method. This result may seem counter-intuitive at first. However, in this case, the only correlation present in a given trial is the result of random noise since the data set from which the sample is drawn has zero correlation by design. Increasing the number of folds results in an increase in the number of statistical tests. This in turn allows for a higher chance of noise producing a result that appears significant. Because the “any pass” aggregation method marks any result as significant as long as at least one fold was marked as significant, an increase in n results in an increase in false positives. This result is critical. Using an aggregation method such as “any pass” essentially eliminates the benefits of IR because it enables any single fold result to determine the aggregate measure result. This means that results need not replicate across folds, increasing the chance for false-positive results.
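The growth in false positives under “any pass” can be approximated with a short calculation. The sketch below assumes that the per-fold tests are independent (reasonable for disjoint folds of uncorrelated data) and that each fold is tested at a threshold of 0.05, which is an illustrative value rather than one taken from the experiments.

```python
def any_pass_false_positive_rate(alpha, n_folds):
    """P(at least one of n independent null tests has p < alpha)."""
    return 1.0 - (1.0 - alpha) ** n_folds

# The rate rises with the number of folds even though no real effect exists.
for n in (1, 3, 5, 10):
    print(n, round(any_pass_false_positive_rate(0.05, n), 3))
```

Under these assumptions the chance of a spurious "significant" aggregate result grows steadily with the number of folds, matching the trend described above.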
As expected, trials that used the stricter “majority-vote” and “all pass” aggregation methods had fewer samples exceed the significance threshold when compared to “any pass.” As the correlation level decreased, the “all pass” aggregation method returned the fewest significant results. This reflects the stricter replication requirements for that approach. This trend is evident across all permutations of the experiment, suggesting that IR behaves predictably, providing system designers with multiple controls (the number of folds n and the aggregation method) over the trade-off between false positives and false negatives.
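The three aggregation methods compared here can be sketched as simple functions over the per-fold p values. The function names, the default threshold of 0.05, and the reading of "majority" as strictly more than half of the folds are assumptions made for this illustration.

```python
def any_pass(p_values, alpha=0.05):
    """Significant if at least one fold is significant."""
    return any(p < alpha for p in p_values)

def majority_vote(p_values, alpha=0.05):
    """Significant if strictly more than half of the folds are significant."""
    passed = sum(p < alpha for p in p_values)
    return passed > len(p_values) / 2

def all_pass(p_values, alpha=0.05):
    """Significant only if every fold is significant."""
    return all(p < alpha for p in p_values)

# Hypothetical p values from five folds: three fall below the threshold.
fold_p_values = [0.01, 0.03, 0.20, 0.04, 0.60]
print(any_pass(fold_p_values), majority_vote(fold_p_values), all_pass(fold_p_values))
```

In this hypothetical case the finding passes under “any pass” and “majority vote” but fails under “all pass,” illustrating how the aggregation choice shifts the balance between false positives and false negatives.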
We note that the results of these experiments confirm that the “any pass” aggregation method should not be used for IR as it results in an increase in false positives. The results are reported to contrast with the other aggregation options tested in the experiments (“all pass” and “majority vote”) and to highlight that some aggregation functions are counter-productive.
Comparing the data from the unimodal experiments to the multi-modal results in Figure 6, the results are nearly identical. All of the observations noted above with respect to the unimodal results also apply in the multi-modal case. This suggests that one of the primary design goals for IR has been met: that IR is robust to the underlying data distribution, without any assumption of normality.
The results of the second set of experiments, which focused on the impact of sample size on IR behavior, are shown in Figure 7. As expected, larger sample sizes produced more reliable detection of the correlations at all correlation levels. Using stricter p value thresholds (e.g.

Results of significance tests on correlation for 10,000 trials of subsets of data points sampled from unimodal and multi-modal distributions with correlation of
Overall, the experimental results present a clear and familiar picture of the trade-offs that exist in the parameterization of IR (e.g. the number of folds
Use cases
The IR approach is compatible with a broad range of visual metaphors and interaction models, from basic charts to more sophisticated exploratory visual analysis systems. To demonstrate this flexibility and to explore the impact of adopting an IR pipeline, we developed two prototype IR systems: (1) a reference prototype to study IR in isolation and (2) a sophisticated visual analysis system to examine IR in the context of a more complex analysis environment.
Reference prototype
We developed a reference IR implementation as part of a simplified visual analysis prototype with the goal of exploring the IR parameter space in isolation, without concern for the more complex interactions that are part of a real-world application such as the one described in section “DecisionFlow2.” The prototype supports two basic visualization metaphors: (1) bar charts and (2) scatterplots with linear regression lines. The prototype was tested using a data set of electronic medical data containing over 40,000 intensive care unit (ICU) stays. 43
The prototype interface, shown in Figure 8, contains three panels. In the center is the visualization canvas itself. The left panel enables users to issue queries and control key parameters of the IR process. Options include the number of folds

The IR-based prototype shown here was developed to test the proposed pipeline and to explore the parameter space with two baseline visualization types: bar charts and linear regression lines. The left panel shows the query and IR controls, the middle panel shows the visualization space, and the right panel shows detailed descriptive statistics computed for both the aggregate representation and the individual folds.
Figure 9 shows a series of bar charts rendered using the IR prototype to visualize the gender distribution across three subpopulations from the ICU database. This example is directly analogous to the roulette wheel bar chart example introduced in section “Visualization as a predictive model,” as both summarize the distribution of a binary variable in a given population.

Six charts produced by the IR prototype system. (a–c) The top three charts show the gender distribution for three different sets of ICU patients. The relatively similar bar charts suggest that the underlying populations are comparable. However, when the same populations are visualized with (d–f) five folds, a different story appears. The charts now clearly demonstrate that we know less about the population visualized in the left column than we do about the population on the right. In this case, the difference is due largely to the size of the respective populations.
The top row of charts in Figure 9 shows the aggregate gender distribution for each of the three populations. The charts show a relatively similar distribution across all three populations, with a moderate increase in female representation moving from Figure 9(a) to (b) to (c). The bar chart shows the gender breakdown in each population quite clearly. However, there is no indication of the distribution’s stability across different groups of patients. Consumers of the visualization are left to assume that the bar charts provide an accurate depiction.
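The stability that an aggregate bar hides can be estimated by recomputing the same proportion within each fold. The following sketch uses synthetic labels rather than the ICU data, and the function name `fold_proportions` is hypothetical.

```python
import random

def fold_proportions(labels, category, n_folds, seed=None):
    """Share of `category` in each of n random, disjoint folds."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    return [fold.count(category) / len(fold) for fold in folds]

# Synthetic populations with the same 40% share but very different sizes.
small = ["F"] * 20 + ["M"] * 30
large = ["F"] * 2000 + ["M"] * 3000

for name, population in (("small", small), ("large", large)):
    shares = fold_proportions(population, "F", n_folds=5, seed=0)
    spread = max(shares) - min(shares)
    print(name, [round(s, 2) for s in shares], "range:", round(spread, 2))
```

Both populations would produce the same aggregate bar, yet the per-fold shares for the smaller population will typically vary far more.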
Figure 9(d–f) shows the exact same populations as Figure 9(a–c), respectively. However, these views incorporate measures computed for multiple folds
Figure 10, meanwhile, shows three screenshots of the IR prototype displaying data from the ICU data set using scatterplots with linear regression lines. In this case, the examples show data for populations of neonates, with weight mapped to x position and height mapped to y position. A linear regression model was calculated in all three cases using the IR pipeline with

Weight versus height distribution for patients admitted to a neonatal intensive care unit. Simulating the results from a progressive visualization system, this figure shows both the raw data and best fit regression line (shown in blue) for (a) 500 patients, (b) 1000 patients, and (c) 2500 patients. In all three cases, the IR pipeline has computed a regression across five folds, shown in red. The decreasing spread across the red regression lines conveys the expected (but often overlooked) change in variation between folds as the sample size increases. The gray band across all three charts has been added to this figure to emphasize these differences and reflects the variation across folds in (c) at the maximum observed weight.
These screenshots show how IR helps convey uncertainty during progressive analysis, using the incremental sampling feature of the prototype to vary the number of samples while keeping all IR parameters constant. In Figure 10(a), only 500 patients are included in the scatterplot. As captured by the varying slopes between the five red lines, there is relatively large disagreement across folds in the linear models they produce. This uncertainty would be invisible in a similar plot rendered without the folds.
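The per-fold regression lines can be sketched as an ordinary least-squares fit repeated on each fold. The data below are synthetic (a line with slope 2 plus Gaussian noise), not the neonatal measurements, and the helper names are assumptions made for illustration.

```python
import random

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fold_slopes(points, n_folds, seed=None):
    """Fit one regression line per random, disjoint fold; return the slopes."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    return [fit_line(shuffled[k::n_folds])[0] for k in range(n_folds)]

# Synthetic data: y = 2x + noise.
data_rng = random.Random(42)
points = []
for _ in range(2500):
    x = data_rng.uniform(0.0, 5.0)
    points.append((x, 2.0 * x + data_rng.gauss(0.0, 3.0)))

# Mimic incremental retrieval by fitting folds on growing prefixes.
for size in (500, 1000, 2500):
    slopes = fold_slopes(points[:size], n_folds=5, seed=0)
    print(size, "slope spread:", round(max(slopes) - min(slopes), 3))
```

The spread between the largest and smallest fold slope gives a rough numeric counterpart to the visual disagreement between the red lines.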
As expected, the spread between the individual fold regression lines decreases as more patients are retrieved by the incremental query feature. For example, Figure 10(b) shows the same visualization with the same
As previously stated, the improvement in agreement as sample size increases is expected. However, as evidenced by the “recent history” charts at casino roulette tables and the other examples referenced throughout this article, visualizations are often assumed to be accurate, without taking into account issues of sample size or variation. This use case shows that IR can effectively convey this variation in the data without the need for careful modeling and in a non-parametric way that avoids assumptions about the underlying distributions.
DecisionFlow2
To test IR within a more fully featured exploratory visual analysis environment, we developed DecisionFlow2, a new IR-based version of our existing visual analysis system for high-dimensional temporal event sequence data. 7 A screen capture of the DecisionFlow2 interface is shown in Figure 11.

The DecisionFlow2 visual analytics system is shown here displaying medical event data using the Inline Replication (IR) process outlined in this article. The data in this example have been analyzed using five folds, without replacement. The inset subfigures show (a) an initial visualization of the aggregation function’s results for a particular medical event, and (b) a more detailed “unfolded” representation showing the variation in positive and negative support as observed across the five folds produced by the partition function. Figures 12 and 13 show how differences in the unfolded representation can help inform users during an analysis.
Original DecisionFlow design
The original version of DecisionFlow made heavy use of p values to help users identify event types that had a statistically significant correlation to a user-specified outcome measure. When visualizing medical data, for example, this approach enables users to find types of medical events (such as specific diagnoses, medications, and procedures) that—when appearing in a particular pattern in a patient’s history—are associated with better or worse medical outcomes.
An interactive timeline at the top of the screen enables users to segment a cohort of event sequences based on the presence of so-called “milestone” events. For a given subgroup, DecisionFlow visualizes statistics for the potentially thousands of different event types that occur between milestones, with the goal of helping users identify good candidates for new milestones. DecisionFlow conveys the event type statistics via an interactive bubble chart similar to the one seen in Figure 11.
In the bubble chart, each event type is represented by a circle whose x-axis position is determined by its positive support (the fraction of “good outcome” event sequences that contain the event type). Similarly, each circle’s y-axis position is determined by its negative support (the fraction of “bad outcome” sequences with the event type). Circle size and color encode correlation and odds ratio, respectively. Importantly, circles representing event types whose presence correlates significantly
Design adaptation for IR
In the IR-based DecisionFlow2 system developed for this article, a similar bubble chart design is used to visualize the event-type statistics. However, rather than showing data for measures computed for the overall population
The aggregate view (without showing the unfolded data) in Figure 11(a) looks essentially identical to the original DecisionFlow design. This is as intended, with the goal of making IR compatible with typical visualization designs. However, while the visual encoding is similar, the number of statistically significant correlation scores is reduced. In particular, a number of event types that were labeled as statistically significant in the original design were no longer found to be significant once majority voting across the five folds was used to determine which event types were significant. This makes the visualization system more selective in rejecting the null hypothesis. The result is a reduction in the likelihood of Type 1 errors, which are a common problem in high-dimensional exploratory analysis. More detailed results and discussion are provided in section “Results and analysis.”
Another important part of the IR-based DecisionFlow2 is the ability to unfold the aggregate statistics for each event type. Users can unfold an event type by hovering the mouse pointer over the corresponding circle. For example, after hovering the mouse pointer for a few seconds over the circle shown in Figure 11(a), the unfolded representation shown in Figure 11(b) is added to the visualization.
As this example shows, DecisionFlow2 displays the unfolded data as a convex region drawn around the original circle and outlined with a dashed border. This region corresponds to the convex hull determined by the
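A convex hull of this kind can be computed with a standard algorithm such as Andrew's monotone chain, sketched below. This is a generic illustration rather than the DecisionFlow2 code, and the per-fold (positive support, negative support) coordinates are hypothetical values.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain because it starts the other chain.
    return lower[:-1] + upper[:-1]

# Hypothetical (positive support, negative support) values for five folds.
fold_points = [(0.30, 0.20), (0.34, 0.22), (0.31, 0.27), (0.36, 0.26), (0.33, 0.23)]
hull = convex_hull(fold_points)
```

A small hull indicates that the folds agree closely, while a large hull signals high variation in the measure across folds.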
The typical behavior observed when utilizing the IR-based implementation of DecisionFlow2 is shown in Figure 12. Figure 12(a) shows an event type from a very large subset of data that shows very limited variation across folds. This is represented by the very small unfolded region located near the center of the red circle just above the mouse pointer. Figure 12(b), meanwhile, shows an event type with much higher variation (the dashed unfolded region surrounding the mouse pointer). This figure, visualizing data from a smaller sample size, demonstrates what one might expect: findings based on smaller sample sizes tend to have more variability and therefore should typically be given less weight in a decision-making process.

In general, (a) smaller differences between folds are seen when sample sizes are larger, while (b) higher levels of variation are seen for smaller sample sizes.
However, this very critical difference is not observable via the original bubble chart. The size of the data corresponding to each bubble is made available elsewhere in the user interface for users who consciously seek it out, but the implications of the differences in data size are left to the user’s imagination. It is only through the unfolding process that the visualization itself conveys the difference in confidence that users should place in one view versus the other.
Moreover, it is critical to note that the size of the data set is not the sole determinant of repeatability for a given measure across folds. Major differences in measure values can be seen even for similarly sized data sets. For example, Figure 13 shows three different event types from the exact same subset of event sequences. While the number of event sequences was the same for each type, the association between ACE Inhibitors (center panel) and the user-defined outcome (eventual diagnosis with heart failure) was far more consistent across folds.

Even with the same sample size, different measures can have different levels of repeatability across folds. In this example, both (a) and (c) show relatively high levels of variability, while the small unfolded region in (b) suggests that the relationship between outcome and ACE Inhibitors was fairly consistent across all five folds. All three views were calculated using identical sample sizes.
Results and analysis
The IR-based DecisionFlow2 prototype provides visual feedback regarding the variation in positive and negative support. As previously described, the system also uses IR to assess the statistical significance of each event type’s correlation with patient outcome. For a given event type, correlation coefficients and p values are computed for each fold, then aggregated via majority vote. Event types with more than
To better understand the impact of IR and the choice of n on the visualized results, we conducted a quantitative experiment in which we compared performance for a sample user interaction sequence under various conditions. More specifically, we experimented repeatedly by performing the exact same exploratory analysis steps using DecisionFlow2, using the exact same input data, varying only the number of folds. The experiment was conducted at three partition settings:
In all three cases, the input data set consisted of event data from the medical records of 2899 patients containing 1,074,435 individual medical events. These timestamped events contained 3631 distinct medical event types: specific diagnoses, lab tests, or medication orders that were present in the patients’ records. Of the 3631 distinct event types, 381 were deemed prevalent enough by the DecisionFlow2 system to be the target of correlation analysis within the metric function. The same threshold was used across all three partition settings, enabling us to compare analysis results across the exact same control conditions.
The results of our analysis are shown in Table 2. With
A comparison of statistically significant findings in three different IR configurations with DecisionFlow2 applied to the same data.
The number of event types flagged as significantly associated with outcome was largest for
As expected—and as intended—the number of statistically significant findings is reduced as n grows from one to five. There are two primary reasons for this reduction. First, because each condition is applied to the same set of event sequences for the same patients, the partition size is smaller as n increases. The smaller number of patients reduces the statistical power for each partition. The expected impact of this is higher p values and fewer statistically significant findings. With the ever-growing size of data sets in many applications, however, the impact on statistical power due to partitioning should be minimal in many use cases. At the same time, the majority-vote aggregation function requires that a significant level be repeatedly observed across multiple partitions (
While statistical significance based on p value thresholds has known limitations in medical research and beyond, 44 it is a widely used metric in exploratory visualization because it enables a rough filtering of data to manage visual complexity and the user’s analytic attention. Follow-up analysis of any discovered insights is required. For this reason, reducing Type 1 errors becomes critical for modern visual analysis applications where vast numbers of data points can be tested and prioritized for user analysis. As the results presented here show, IR applies a higher bar for statistical significance, which has the potential to limit unsupported conclusions from the data in cases where users make quick predictive assessments directly from a visualization. It can also save significant effort in cases where follow-up analysis is performed by reducing the number of falsely generated hypotheses.
Discussion of limitations
The IR approach is designed to embed the process of replication directly within the visualization pipeline, providing a non-parametric approach to calculating and visualizing the repeatability of derived measures. As the examples in section “Use cases” demonstrate, the approach can be effective when applied to a variety of different measures and visual metaphors. However, there are limitations to IR that must be acknowledged.
First, the proposed approach does nothing to combat selection bias or other problems in the creation of the original data set. Any systemic sampling biases in the original data will be present across all folds created by the partitioning algorithm. Therefore, even measures that generalize well across multiple partitions are not necessarily generalizable to entirely new data sets.
Second, the IR approach is not truly predictive in nature. While information about the ability of various measures to replicate across multiple folds can be useful in vetting potential conclusions, findings uncovered via IR should be considered hypotheses that require testing using more rigorous methods when important decisions are to be made.
In particular, hypothesis testing often requires the collection and analysis of new data to fully understand the conditions under which a given insight holds true. Our method does not replace this step. Instead, IR helps reduce the number of Type 1 errors, which can lower the number of conclusions that need testing. However, IR does not eliminate the necessity of a post-hypothesis validation process.
Conclusion
Traditional data visualizations show retrospective views of existing data sets with little to no focus on prediction or generalizability. However, users often base decisions about future events on the findings made using these visualizations. In this way, visualization can be considered to be a visual predictive model that is subject to the same problems of overfitting as traditional modeling methods. As a result, visualization users can often make invalid inferences based on unreliable visual evidence.
This article described an approach to visual model validation called IR. Similar to cross-validation and bootstrap resampling techniques, IR provides a non-parametric and broadly applicable approach to visual model assessment and repeatability. The IR pipeline was defined, including three key functions: the partition function, the metric function, and the aggregation function. In addition, methods for visual display and interaction were discussed. The article reported results from empirical experiments that capture how IR performs under different conditions, providing insights into how the choice of IR parameters impacts performance. Finally, two use cases were described, including a new IR-based implementation of a previously published exploratory visual analytics system. The use cases demonstrated the successful compatibility of IR with a variety of visual metaphors and derived measures.
While the results presented in this article are promising, they represent only one step in a growing effort to bring high repeatability and predictive power to visualization-based analysis systems. There are many areas for future work including improved techniques for detecting and conveying issues related to missing data, techniques for addressing and visually warning users regarding selection bias, and improved methods for conveying the degree of compatibility between a given statistical model’s assumptions and the actual underlying data.
Footnotes
Acknowledgements
The authors thank Brandon A Price for his contributions to the software development process for the reference prototype described in the “Use Cases” section of this article.
Funding
This article is based in part upon work supported by the National Science Foundation under grant no. 1704018.
