Prediction of data visibility in two-dimensional scatterplots

Abstract

The result of a visualization process depends on the decisions the user makes along the way. To accelerate this process and guarantee an appropriate visualization of the data, we aim to semi-automate it and assist users in their decision-making. Such semi-automation benefits from metrics that characterize important aspects of visualization techniques, such as the visibility of the data representation. Scatterplots, in turn, are a widely used technique for visualizing scalar datasets. In this context, this work presents a metric that evaluates data representation visibility in terms of glyph visibility in scatterplots. The metric estimates the proportion of glyphs that will be visible regardless of the drawing order; it depends on the number of items in the dataset, the size of the window, and the size of the glyphs that represent the data. To define and approximate the metric, we experimented with several random datasets in which both dimensions followed a normal distribution. This metric offers an alternative way to characterize scatterplots and contributes to the semi-automation of the user’s decisions along the visualization process.

Keywords

Information visualization, visualization technique, scatterplots, visual scalability metric, visibility metric

Introduction

The goal of visualization is to obtain a visual representation of a dataset; this representation should help the user to interpret the dataset correctly and achieve a proper and useful analysis.

Given the constant growth of datasets in different application areas, choosing the most suitable technique to visualize a dataset is not easy. Besides, the user’s decisions along the visualization process alter the final visualization: an unskilled user is prone to make wrong decisions that negatively affect it. Eventually, this may frustrate the user’s experience with the visualization.

In a drive to accelerate the process and guarantee a suitable visualization of the data, we aim to semi-automate the process by guiding the user in the selection of a visualization technique and of the parameters that configure the chosen one. This semi-automation partially depends on the existence of metrics that characterize different aspects of the visualization techniques and help to decide the most suitable one for a given dataset.

Both Tufte¹ in 1983 and Miller et al.² in 1997 raised the problem of measuring the quality of a visualization. In particular, Tufte¹ proposed measuring it based on the amount of ink consumed and the purpose of that ink; he stated that most of the ink used in a graph must represent information about the data. To reinforce the usefulness of metrics, Tatu et al.³ presented a study to validate the hypothesis that quality metrics are able to simulate the selection of the “best” view according to human perception.

This work focuses on two-dimensional (2D) scatterplots. Scatterplots are a very useful technique widely used for bi-dimensional data visualization. Moreover, this visualization technique is extensible to multidimensional data and very appropriate for large data visualization.

The result of a visualization with scatterplots depends not only on the dataset but also on the particular characteristics of the visualization: How large is the visualization? How large are the glyphs? What is the shape of the glyphs? Our purpose is to introduce a new decision element that spares the user a trial-and-error approach to obtaining a good and useful visualization. To that end:

We propose a metric that estimates the amount of always-visible glyphs in the scatterplot regardless of the drawing order.

We present a mathematical approximation of the metric as a function of the amount of data to visualize, the window’s size, and the glyph’s size.

We analyze how, within a specific context (data, technique, and maximum available space), the defined measure assists the user in the selection of parameters that result in the best possible visualization regarding glyphs’ visibility.

The remainder of the article is organized as follows. The next section briefly presents the limitations of scatterplots and the previous work on the definition of reference frameworks and specific metrics for scatterplots. Section “Metrics and prediction” presents the context in which this metric was conceived. Then, in section “A metric to quantify data visibility,” a metric to measure data visibility in a scatterplot and its mathematical model are defined. In section “How this metric helps the users with scatterplots configuration,” we present two theoretical examples to show the possible usages of the metric and a real-data case to show how it can help users along the visualization process. Finally, in the last section, we draw some conclusions and outline future work.

Previous work

Given the wide variety of visualization techniques, it is advisable to have metrics that help to decide which technique is the most suitable to visualize a particular dataset. Besides, the result of the visualization process depends on the selection of a technique and the associated parameters. Since this work is focused on a metric for 2D scatterplots, this section presents their limitations, reference frameworks for the definition of metrics, and finally, metrics particularly defined for scatterplots.

Scatterplots’ limitations

Scatterplots have two major limitations: the number of representable dimensions and the amount of data that can be visualized given the overlapping among glyphs:

Dimensionality

Even though scatterplots are inherently bi-dimensional, the concept is extensible to three-dimensional (3D) visualizations.⁴ In both cases, it is possible to represent multidimensional data with complex glyphs instead of points⁵⁻⁷ or with 2D scatterplot matrices,⁸ which contain one scatterplot for every pair of dimensions. Both alternatives are limited in the number of representable dimensions. Very complex glyphs can represent a great number of dimensions, although this implies bigger glyphs and, consequently, fewer representable ones. Scatterplot matrices, on the other hand, are suitable for approximately 10 dimensions at most; beyond that, the individual scatterplots become too small on a single display.

Overlapping

When visualizing big datasets, scatterplots tend to present high superposition among glyphs. Depending on the visualization application, two cases are possible: the user needs to distinguish individual glyphs from one another, or the user is satisfied with identifying different densities of glyphs across the scatterplot. In both cases, the overlapping makes the visual analysis more difficult. In the first case, it can be dealt with by using different glyphs (shape or color), distortion,⁹ motion,¹⁰ or multiresolution. In the second case, transparency, histograms, or binning are suitable solutions.

Reference frameworks

Several authors have focused on confirming the necessity and utility of metrics to characterize the behavior of different visualization techniques. In particular, some authors worked on conceptual frameworks for defining such metrics; among them, there are criteria for evaluating visualization techniques,¹¹ a first systematization of quality metrics,¹² and an analysis of quality metrics for multidimensional data visualizations.¹³

Freitas et al.¹¹ defined four classes of criteria to evaluate the usability of a visual representation (completeness, spatial organization, codification of information, and state changes after user’s actions) and three classes of criteria to evaluate interactions (help and user orientation, navigation and browsing, and dataset reduction).

Bertini and Santucci¹² analyzed quality metrics and presented a first systematization. They proposed a classification based on three main classes of metrics: size metrics, visual effectiveness metrics, and feature preservation metrics. They also presented an outline of a methodology to define metrics for visual effectiveness and feature preservation.

To provide a common framework to analyze different quality metrics of multidimensional data visualizations, Bertini et al.¹³ presented a systematic analysis of published metrics. They characterized metrics based on common factors: technique, measured aspect, where the aspect is measured, purpose, and interactions.

Metrics for scatterplots

There are several metrics defined on scatterplots. Even though these metrics are not focused on visual scalability, some of them measure aspects such as occlusion or visual degradation, which directly impact the visual scalability of the technique.

Brath¹⁴ proposed conceptual measures to help with the design and evaluation of static 3D visualizations, which are applicable to scatterplots:

Number of data points. Amount of discrete data values represented;

Data density. Ratio between the amount of data and the amount of pixels in the window (the window does not include toolbars, menu bars, borders, etc.);

Cognitive complexity. It includes the number of simultaneous dimensions, the maximum number of dimensions for each separable representation based on the task, and the effectiveness (represented by a point scheme that quantifies the effectiveness of a visual representation);

Occlusion percentage. Ratio (between 0 and 1) between the number of data points completely occluded and the total number of data points;

Percentage of identifiable points. Ratio between the number of visible and identifiable data points in relation with every other visible data point and the square of the number of data points.

Based on the measures of information content developed by Shannon,¹⁵ Yang-Pelaez and Flowers¹⁶ developed measures to evaluate the effectiveness of visualizations; these measures are related to information content covered by the data, information content of the data in the visualization, information capacity of a visualization, and topological information content. Each dimension contributes $\log_{2} (\text{range} / \text{precision})$ to the total information of both the data and the visualization.

Bertini and Santucci¹⁷,¹⁸ focused on determining a correct data sampling to automatically guarantee quality parameters in a 2D scatterplot visualization with 1-pixel glyphs. They estimated the amount of active pixels (those that are distinguishable from the background), the available free space, and the collisions caused by the data sampling. The final visualization is divided into small areas of p pixels, and several measures are calculated to evaluate different aspects of the quality of the image, such as degradation, density differences, and negative effects of the data sampling.

The metrics introduced above focus on measuring characteristics of already rendered visualizations. Besides, the collision points ratio¹⁷,¹⁸ considers only 1-pixel glyphs and therefore leaves out glyph size, one of the most important aspects affecting superposition in scatterplots.

In the visual analytics research area, scagnostics (scatterplot diagnostics) were developed to interpret the information through visual representation. By scagnostics, it is possible to detect anomalies in the dataset through indices calculated over the visualization of big scatterplot matrices. For example, Dang and Wilkinson¹⁹ worked with scagnostics to find anomalies and similar distribution among the scatterplots of a scatterplot matrix with more than 100 dimensions. However, scagnostics are focused on extraction and deduction of information about a dataset from a visualization, but not on the quantification of the characteristics of the visualization itself.

Metrics and prediction

The unified visualization model²⁰ (UVM) is a reference model that gives users and designers a unique mental model to express their needs. It defines a theoretical framework for describing the intermediate states and transformations of the data from its raw state in the application domain to the final view construction. From a dynamic point of view, the visualization can be perceived as a process that takes data from the user domain (i.e. the input data or raw data), processes them, and gives the view back. The UVM represents the different transformations that affect the dataset and the states that the data go through (see Figure 1).

Figure 1.

Pipeline of the unified visualization model (UVM). The different transformations that affect the dataset are colored in blue, the states that the data goes through are colored in red, and the representation of the interactions along the pipeline is colored in green.

The quality of a visualization could be measured along the different stages of the UVM. The view is the most straightforward stage to evaluate the result. However, an evaluation of the visualization in this last step implies the generation of the visualization, even if it is not going to be effective. Our goal is to predict the quality of a visualization before reaching the view, that is, before applying a particular visualization technique to the dataset. Previously, during the technique transformation, the dataset to visualize is already defined and the visualization technique to apply must be selected. In this last transformation, it is possible to evaluate measures that predict the result of the visualization of the dataset with a selected technique. In both cases, each technique must have its own set of measures to predict its performance with the given dataset.

Metrics can be used to guide the selection or the configuration of a visualization technique. It should be noted that unconstrained flexibility makes it difficult to choose appropriate or even optimal visualization techniques for a particular visualization goal. Given a dataset, we identified three different usages of metrics during the technique transformation:

Assisting the user in the selection of good parameters to visualize the dataset with a particular visualization technique;

Warning the user that a particular visualization technique is not advisable for the dataset despite the configuration;

Preventing the user from selecting a non-advisable visualization technique for the dataset. A semi-automatic system could expose only the potentially acceptable visualization techniques to the user for him or her to choose one.

The first usage of metrics should work in conjunction with and complement any of the other two. Moreover, this combined usage of metrics encourages the users to try out those visualization techniques that may result in potentially acceptable visualizations.

Metrics associated with each visualization technique are of great help when the user is selecting a technique to visualize a dataset. However, metrics are not enough since they may not consider the type or nature of the data to visualize. Therefore, when we propose the evaluation of a metric of a particular visualization technique with the given dataset, we are assuming that there exists a previous instance that has identified the technique as suitable for those data, for example, by using semantics.²⁰

A metric to quantify data visibility

An appropriate decision-making guidance for users along the visualization process depends partially on the existence of metrics. These metrics need to characterize distinct aspects of the techniques and help to configure them suitably for a given dataset. To accomplish this, it is necessary to define a metric that predicts how good the resultant visualization will be without rendering it. The ultimate goal is to reach a semi-automatic system that gives the user a visualization as a starting point for data exploration and analysis.

Given that superposition is an important limiting factor of scatterplots, a metric was defined that mathematically expresses the concept of visibility, that is, the complement of the occlusion percentage,¹⁴ and takes into account the parameters of the visualization:

$$
\begin{aligned}
\text{visibility} &= 1 - \text{occlusion percentage} \\
&= f (\text{dataset}, \text{visualization’s parameters})
\end{aligned}
$$

In this section, the concept of visibility index is defined and formalized by a method for its calculation. Following the guidelines presented by Tatu et al.,³ an algorithm to calculate the visibility index of scatterplot visualizations was developed. Then, a mathematical model that approximates this index was defined. This model allows the prediction of data visibility in the resultant visualization.

Visibility index

The visibility index is defined as a specific metric for scatterplots. Given a scalar dataset and the window’s and glyph’s dimensions, it estimates the expected percentage of glyphs that are not completely overlapped with other glyphs (there exists at least one pixel of the glyph which is not overlapped with another glyph), that is, the expected amount of glyphs that are always visible despite the rendering order.

Adopting Brath’s¹⁴ convention, window’s dimensions (height and width) include only the drawing area of the scatterplot, that is, it excludes menus, borders, buttons, supplementary visualizations, and so on. In the following analysis, only square windows are considered; then, only one value is enough to represent height and width (the size) of the window. As glyphs are also considered square, only one value is also enough to represent their size. In both cases, the side of the square is considered as the size of the window or the glyph.

Algorithm to calculate the visibility index

Algorithm 1 calculates the visibility index $τ$ of a dataset of size $n$ visualized with a 2D scatterplot. Figure 2 shows two examples of glyphs in a scatterplot and the integer matrix that corresponds to each example.

Algorithm 1

Require: The dataset

The integer matrix M that represents the drawing area.

Ensure: The visibility index

1: Create a list G of glyphs. Each glyph must have information to obtain the exact positions (pixels) in the integer matrix M in order to represent it.

2: for each datum in the dataset do

3:     Create a glyph g.

4:     Add g to the list G.

5:     Place g in the matrix and increase by one each element in M that represents g.

   end for

6: The amount of always-visible glyphs is the amount of glyphs g in G such that at least one of its pixels in the visualization is occupied only by itself, that is, the element of M that represents at least one pixel has value 1.

7: The visibility index is the ratio between the amount of always-visible glyphs and the total amount of glyphs.

Figure 2.

Two examples of glyphs placed in a 2D scatterplot and their respective integer matrix. (a) A case where the central glyph is visible depending on the drawing order: if it is drawn before all the other glyphs, it will be hidden behind them; but if the central glyph is drawn at last, it might be visible (depending, for instance, on the glyph or border color). (b) A case where, regardless of the drawing order, at least one pixel of each glyph is always visible.
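As a minimal sketch, Algorithm 1 can be written in Python as follows. The function name `visibility_index`, the convention that each glyph is given by the pixel coordinates of its top-left corner, and the clipping of glyphs at the window border are assumptions made for illustration; the mapping from data values to pixel positions is left out of the sketch.

```python
def visibility_index(positions, window, glyph):
    """Return the ratio of always-visible glyphs, following Algorithm 1.

    positions: list of (col, row) integer pixel coordinates of the top-left
               corner of each glyph (illustrative convention, an assumption).
    window:    side of the square drawing area, in pixels.
    glyph:     side of the square glyphs, in pixels.
    """
    if not positions:
        raise ValueError("the dataset must contain at least one item")

    # Integer matrix M: number of glyphs covering each pixel of the drawing area.
    M = [[0] * window for _ in range(window)]

    def pixels(col, row):
        # Pixels covered by one glyph, clipped to the drawing area.
        for r in range(max(row, 0), min(row + glyph, window)):
            for c in range(max(col, 0), min(col + glyph, window)):
                yield r, c

    # Steps 2-5: place every glyph and increase the elements of M it covers.
    for col, row in positions:
        for r, c in pixels(col, row):
            M[r][c] += 1

    # Step 6: a glyph is always visible if at least one of its pixels
    # is covered by that glyph only (value 1 in M).
    always_visible = sum(
        1
        for col, row in positions
        if any(M[r][c] == 1 for r, c in pixels(col, row))
    )

    # Step 7: ratio between always-visible glyphs and all glyphs.
    return always_visible / len(positions)


# Four 2x2 glyphs tile a 4x4 window and a fifth one sits on their shared
# corner, so the central glyph has no pixel of its own (compare Figure 2(a)).
covered_center = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
print(visibility_index(covered_center, window=4, glyph=2))  # 0.8
```

In this small example, four of the five glyphs keep at least one exclusive pixel, so the index is 0.8 whatever the drawing order.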

Mathematical model for the visibility index

To analyze the behavior of the defined metric, experiments with a total of 2760 datasets were conducted. The datasets were randomly generated following 16 different normal distributions for each of the two dimensions and with 23 different dataset sizes ($10, 10^{1.25}, 10^{1.5}, 10^{1.75}, 10^{2}, \dots, 10^{6}, 10^{6.25}$, and $10^{6.5}$). Each dataset was visualized with 375 scatterplots with different configurations: 25 window sizes ($100, 300, 500, \dots, 4700$, and 4900 pixels per side) and 15 glyph sizes ($2, 4, 6, \dots, 28$, and 30 pixels per side). The 16 normal distributions all had mean 1 but different standard deviations ($0.05, 0.08, 0.1, 0.3, 0.5, 0.8, 1, 3, 5, 8, 10, 30, 50, 80, 100$, and 300).
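The parameter grid described above can be enumerated with a few lines of Python. This is only a sketch: rounding the dataset sizes to integers, the helper name `make_dataset`, and the use of a seeded generator are assumptions, and the way the original experiments paired the standard deviations of the two dimensions is not reproduced here.

```python
import random

# Dataset sizes 10^1, 10^1.25, ..., 10^6.5 (exponent step 0.25),
# rounded to integers (the rounding is an assumption).
dataset_sizes = [round(10 ** (1 + 0.25 * k)) for k in range(23)]

# Window sides 100, 300, ..., 4900 and glyph sides 2, 4, ..., 30 (pixels).
window_sizes = list(range(100, 4901, 200))
glyph_sizes = list(range(2, 31, 2))

# Standard deviations of the normal distributions (all with mean 1).
std_devs = [0.05, 0.08, 0.1, 0.3, 0.5, 0.8, 1, 3, 5, 8, 10, 30, 50, 80, 100, 300]

# Every dataset is drawn once per (window size, glyph size) pair.
configurations = [(h, p) for h in window_sizes for p in glyph_sizes]


def make_dataset(n, sd_x, sd_y, seed=0):
    """Draw n two-dimensional items, each dimension normal with mean 1."""
    rng = random.Random(seed)
    return [(rng.gauss(1.0, sd_x), rng.gauss(1.0, sd_y)) for _ in range(n)]


print(len(dataset_sizes), len(window_sizes), len(glyph_sizes), len(configurations))
# 23 25 15 375
```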

The normal distribution was used to perform the experiments. This distribution is the most common one; it fits most natural phenomena when the sample is large enough and the random errors are sufficiently small.²¹

For each pair $\langle \text{size of the window}, \text{size of the glyph} \rangle$, the variation in the average visibility index based on the amount of data was analyzed (see Figure 3).

Figure 3.

Behavior of the visibility index depending on the logarithm of the amount of data. p and h represent the size of the glyphs and the window, respectively, and x is the amount of items in the dataset.

Limits of the function f

Given the size of the window h, the size of the glyph p, and the size of the dataset x, and taking into account that h, p, and x are independent variables, the function f which approximates the visibility index $τ$ should meet the following conditions:

$\lim_{x \to 0} f (x, h, p) = 1$. As the amount of data to visualize tends to 0, the probability of every datum being always visible, independently of the window or glyph size, tends to 1.

$\lim_{x \to \infty} f (x, h, p) = 0$. As the amount of data to visualize grows toward infinity, the probability of any datum being always visible becomes smaller and tends to 0.

$\lim_{h \to 0} f (x, h, p) = 0$. As the area available for the visualization tends to 0, the proportion of always-visible data becomes smaller and tends to 0.

$\lim_{h \to \infty} f (x, h, p) = 1$. As the area available for the visualization grows toward infinity, the proportion of always-visible data tends to 1, independently of the amount of data or the size of the glyph.

$\lim_{p \to 0} f (x, h, p) = 1$. As the size of the glyphs that represent the data shrinks, the proportion of always-visible data tends to 1.

$\lim_{p \to \infty} f (x, h, p) = 0$. As the size of the glyphs grows, the proportion of always-visible data tends to 0, given that the probability of any datum being always visible tends to 0.

Definition of the function f

For each triple $\langle x, h, p \rangle$, a function $f (x, h, p)$ approximates the visibility index $τ$:

$$
τ \approx f (x, h, p) = \frac{1}{1 + e^{γ (x, h, p)}}
$$

Then, in order for the function f to satisfy the previously enumerated conditions, function $γ$ must satisfy:

If $\lim_{x \to 0} f (x, h, p) = \lim_{x \to 0} \frac{1}{1 + e^{γ (x, h, p)}} = 1$, then $\lim_{x \to 0} γ (x, h, p) = - \infty$.

If $\lim_{x \to \infty} f (x, h, p) = \lim_{x \to \infty} \frac{1}{1 + e^{γ (x, h, p)}} = 0$, then $\lim_{x \to \infty} γ (x, h, p) = + \infty$.

If $\lim_{h \to 0} f (x, h, p) = \lim_{h \to 0} \frac{1}{1 + e^{γ (x, h, p)}} = 0$, then $\lim_{h \to 0} γ (x, h, p) = + \infty$.

If $\lim_{h \to \infty} f (x, h, p) = \lim_{h \to \infty} \frac{1}{1 + e^{γ (x, h, p)}} = 1$, then $\lim_{h \to \infty} γ (x, h, p) = - \infty$.

If $\lim_{p \to 0} f (x, h, p) = \lim_{p \to 0} \frac{1}{1 + e^{γ (x, h, p)}} = 1$, then $\lim_{p \to 0} γ (x, h, p) = - \infty$.

If $\lim_{p \to \infty} f (x, h, p) = \lim_{p \to \infty} \frac{1}{1 + e^{γ (x, h, p)}} = 0$, then $\lim_{p \to \infty} γ (x, h, p) = + \infty$.

Then, considering $γ (x, h, p) = a \ln (x) + b \ln (h) + c \ln (p) + d$ , the following is obtained:

$\lim_{x \to 0} γ (x, h, p) = \lim_{x \to 0} [a \ln (x)] + b \ln (h) + c \ln (p) + d = - \operatorname{sign} (a) \infty$

$\lim_{x \to \infty} γ (x, h, p) = \lim_{x \to \infty} [a \ln (x)] + b \ln (h) + c \ln (p) + d = \operatorname{sign} (a) \infty$

$\lim_{h \to 0} γ (x, h, p) = a \ln (x) + \lim_{h \to 0} [b \ln (h)] + c \ln (p) + d = - \operatorname{sign} (b) \infty$

$\lim_{h \to \infty} γ (x, h, p) = a \ln (x) + \lim_{h \to \infty} [b \ln (h)] + c \ln (p) + d = \operatorname{sign} (b) \infty$

$\lim_{p \to 0} γ (x, h, p) = a \ln (x) + b \ln (h) + \lim_{p \to 0} [c \ln (p)] + d = - \operatorname{sign} (c) \infty$

$\lim_{p \to \infty} γ (x, h, p) = a \ln (x) + b \ln (h) + \lim_{p \to \infty} [c \ln (p)] + d = \operatorname{sign} (c) \infty$

Matching these limits with the conditions on $γ$ ($- \infty$ as $x \to 0$, $+ \infty$ as $h \to 0$, and $- \infty$ as $p \to 0$), coefficients a, b, and c must satisfy

$a > 0$, $b < 0$, and $c > 0$
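The sign constraints can be checked numerically with a small Python sketch of $γ$ and f. The coefficient values in `COEFFS` are hypothetical: they were chosen only to satisfy $a > 0$, $b < 0$, and $c > 0$ and are not the fitted values of this work. The function names `gamma` and `f` mirror the symbols used above.

```python
import math

# Hypothetical coefficients (a, b, c, d): illustrative only, NOT fitted values.
COEFFS = (2.0, -3.0, 3.0, 1.0)


def gamma(x, h, p, a, b, c, d):
    """gamma(x, h, p) = a ln(x) + b ln(h) + c ln(p) + d."""
    return a * math.log(x) + b * math.log(h) + c * math.log(p) + d


def f(x, h, p, a, b, c, d):
    """f(x, h, p) = 1 / (1 + exp(gamma)), evaluated without overflow."""
    g = gamma(x, h, p, a, b, c, d)
    if g >= 0:
        e = math.exp(-g)          # 1 / (1 + e^g) == e^-g / (1 + e^-g)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(g))


# Probe each limit with a very small and a very large value of one variable.
for name, low, high in [
    ("x", f(1e-12, 500, 10, *COEFFS), f(1e12, 500, 10, *COEFFS)),
    ("h", f(1000, 1e-6, 10, *COEFFS), f(1000, 1e9, 10, *COEFFS)),
    ("p", f(1000, 500, 1e-9, *COEFFS), f(1000, 500, 1e9, *COEFFS)),
]:
    print(f"{name} -> 0: {low:.3f}   {name} -> inf: {high:.3f}")
```

With these signs, f approaches 1 for few data, large windows, or small glyphs, and approaches 0 in the opposite cases, as required by the six conditions.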

Note on f and γ

From $γ (x, h, p) = a \ln (x) + b \ln (h) + c \ln (p) + d$ and $f (x, h, p) = 1 / (1 + e^{γ (x, h, p)})$, an equivalent expression can be obtained:

$$
\begin{aligned}
f (x, h, p) &= \frac{1}{1 + e^{γ (x, h, p)}} \\
&= \frac{1}{1 + e^{a \ln (x) + b \ln (h) + c \ln (p) + d}} \\
&= \frac{1}{1 + x^{a} h^{b} p^{c} e^{d}}
\end{aligned}
$$

Even though the expressions are mathematically equivalent, in practice the first expression for f gives better results than the last, expanded one.

Approximation of coefficients a, b, c, and d

From function $γ (x, h, p) = g_{1} (x) = a \ln (x) + b_{0}$ , the $τ$ index is approximated by parts. Coefficient a is the first one to be approximated. The new coefficient $b_{0}$ is approximated with a function $g_{2} (h) = b \ln (h) + c_{0}$ , and finally, coefficient $c_{0}$ is approximated with a function $g_{3} (p) = c \ln (p) + d$ .

Schematically, to perform the approximation, the function $γ$ was decomposed in the following way:

$$
γ (x, h, p) = g_{1} (x) = a \ln (x) + \underbrace{b_{0}}_{g_{2} (h) \, = \, b \ln (h) + \underbrace{c_{0}}_{g_{3} (p) \, = \, c \ln (p) + d}}
$$

Figure 4 shows the steps followed to approximate the $τ$ index with the function $1 / (1 + e^{γ (x, h, p)})$ and to obtain coefficients a, b, c, and d; besides, it shows examples of the obtained values in each step. The approximation was performed using gnuplot (http://www.gnuplot.info/) with initial values of one. gnuplot uses an implementation of the nonlinear least squares (NLLS) Levenberg–Marquardt algorithm.

Figure 4.

The seven function fittings performed to obtain the coefficients a, b, c, and d of the approximation of $τ$.
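The staged decomposition $g_{1}$, $g_{2}$, $g_{3}$ can be illustrated with a short Python sketch. Note the substitution: instead of the NLLS Levenberg–Marquardt routine of gnuplot, the sketch linearizes the model through $γ = \ln (1 / τ - 1)$ and applies ordinary least squares at each stage. The helper names, the averaging of the per-configuration slopes, and the coefficients in `TRUE` are all assumptions for illustration; the data are synthetic and are not the experimental data of this work.

```python
import math
from collections import defaultdict


def linear_fit(u, v):
    """Ordinary least squares for v ~ slope * u + intercept."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u


def staged_fit(samples):
    """samples maps (h, p) to a list of (x, tau) pairs with 0 < tau < 1."""
    # Stage 1 (g1): gamma = a ln(x) + b0, one fit per (h, p) pair.
    slopes_a = []
    b0_by_p = defaultdict(list)  # p -> [(ln h, b0), ...]
    for (h, p), points in samples.items():
        u = [math.log(x) for x, _ in points]
        v = [math.log(1.0 / tau - 1.0) for _, tau in points]  # gamma from tau
        a_hp, b0 = linear_fit(u, v)
        slopes_a.append(a_hp)
        b0_by_p[p].append((math.log(h), b0))

    # Stage 2 (g2): b0 = b ln(h) + c0, one fit per glyph size p.
    slopes_b, c0_points = [], []
    for p, pairs in b0_by_p.items():
        b_p, c0 = linear_fit([u for u, _ in pairs], [v for _, v in pairs])
        slopes_b.append(b_p)
        c0_points.append((math.log(p), c0))

    # Stage 3 (g3): c0 = c ln(p) + d.
    c, d = linear_fit([u for u, _ in c0_points], [v for _, v in c0_points])

    # Averaging the per-configuration slopes is an assumption of this sketch.
    a = sum(slopes_a) / len(slopes_a)
    b = sum(slopes_b) / len(slopes_b)
    return a, b, c, d


# Synthetic, noise-free data generated from hypothetical coefficients.
TRUE = (2.0, -3.0, 3.0, 1.0)


def synthetic_samples():
    a, b, c, d = TRUE
    samples = {}
    for h in (300, 500, 700):
        for p in (2, 4, 8):
            points = []
            for x in (100, 1000, 10000):
                g = a * math.log(x) + b * math.log(h) + c * math.log(p) + d
                points.append((x, 1.0 / (1.0 + math.exp(g))))
            samples[(h, p)] = points
    return samples


fitted = staged_fit(synthetic_samples())
print([round(value, 6) for value in fitted])
```

On noise-free synthetic data the three stages recover the generating coefficients; with real measurements, values of $τ$ equal to 0 or 1 would have to be excluded or handled by a nonlinear fit such as the one used in this work.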

Error analysis

To analyze the approximation error of f, the relative and absolute errors between f and $τ$ are compared.

Error	Minimum	Maximum	Average
Absolute	0.0	0.16581	0.01219
Relative	0.0	0.99997	0.28261

Mean squared error = 0.00364.

It should be pointed out that the high values of the relative error correspond to the values of $τ$ very close to 0, while the high values of absolute error correspond to the transition from 1 to 0 (see Figure 5).

Figure 5.

Plot of the relative and absolute errors between f and $τ$ .
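The error measures reported above can be computed as in the following Python sketch. The definition of relative error as $|f - τ| / τ$, the skipping of entries with $τ = 0$, and the sample values are assumptions made for illustration; the numbers are not taken from the experiments.

```python
def error_summary(predicted, observed):
    """Absolute error, relative error, and mean squared error of f against tau."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("both sequences must be non-empty and of equal length")

    absolute = [abs(fv - tv) for fv, tv in zip(predicted, observed)]
    # Relative error is undefined for tau == 0; such entries are skipped
    # (an assumption of this sketch).
    relative = [abs(fv - tv) / tv for fv, tv in zip(predicted, observed) if tv > 0]

    def stats(values):
        return {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}

    return {
        "absolute": stats(absolute),
        "relative": stats(relative) if relative else None,
        "mse": sum(e * e for e in absolute) / len(absolute),
    }


# Hypothetical values: tau near 1, in the transition zone, and near 0.
observed = [0.95, 0.50, 0.001]
predicted = [0.94, 0.55, 0.0001]
summary = error_summary(predicted, observed)
print(summary["absolute"]["max"], summary["relative"]["max"])
```

In this toy input the largest absolute error comes from the transition value, while the largest relative error comes from the value of $τ$ closest to 0, which mirrors the behavior described for Figure 5.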

How this metric helps the users with scatterplots’ configuration

Even though the formulae contemplate windows as big or glyphs as small as necessary, in practice, the window should not be bigger than the display and the glyph cannot be smaller than 1 pixel. In this context, for a given amount of data, it is not possible to obtain a better result than the one with the largest possible h and the smallest possible p.

The goal of this metric is to guide the user while he or she chooses the parameters of the visualization, in particular the window’s and glyph’s sizes, knowing the amount of data to visualize. Based on a display with a maximum resolution of $1920 \times 1080$ pixels, we present two examples, using normally distributed synthetic datasets, of how the visibility index may guide users in configuring their visualization.

Synthetic dataset 1: the user wants to visualize a dataset with 1058 data

In this example, the goal is to analyze the relationship between the glyph size and the visibility index while visualizing 1058 data in a $400 \times 400$-pixel window. The user could decide to use glyphs of a given size or to set a restriction on the visibility index.

In the first option, if he or she chooses a glyph of $16 \times 16$ pixels (see Figure 6(a)), then the user may notice that with those parameters the visibility index $τ$ is approximately 0.2976. That is, only 29% of the glyphs are expected to have at least one pixel not overlapped with another glyph.

Figure 6.

In case of (a), using the metric, the user may notice that only 29% of the glyphs are always visible regardless of the drawing order. In case of (b), the metric may help the user to notice that to get 90% of always-visible glyphs in a $400 \times 400$-pixel window, the glyphs should not be larger than $5 \times 5$ pixels.

On the other hand, if he or she restricts the minimum acceptable value for the visibility index $τ$ to 0.9, then the user may notice that the glyph should not be larger than $5 \times 5$ pixels (see Figure 6(b)):

$$
\begin{aligned}
f (1058, 400, p) = \frac{1}{1 + e^{a \ln (1058) + b \ln (400) + c \ln (p) + d}} &\geq 0.9 \\
p &\leq 5.60941
\end{aligned}
$$
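Solving the inequality for p gives a closed form, sketched below in Python. The function name `max_glyph_size` and the coefficients in `COEFFS` are hypothetical; since the coefficients are illustrative and not the fitted ones, the printed bound differs from the value 5.60941 obtained above.

```python
import math

# Hypothetical coefficients (a, b, c, d): illustrative only, NOT fitted values.
COEFFS = (2.0, -3.0, 3.0, 1.0)


def f(x, h, p, a, b, c, d):
    """Model of the visibility index: 1 / (1 + exp(gamma(x, h, p)))."""
    g = a * math.log(x) + b * math.log(h) + c * math.log(p) + d
    return 1.0 / (1.0 + math.exp(g))


def max_glyph_size(x, h, tau_min, a, b, c, d):
    """Largest glyph side p such that f(x, h, p) >= tau_min (requires c > 0)."""
    if c <= 0:
        raise ValueError("the bound on p assumes c > 0")
    # f >= tau_min  <=>  gamma <= ln(1 / tau_min - 1)
    gamma_max = math.log(1.0 / tau_min - 1.0)
    # Isolate c ln(p) and divide by c > 0, which keeps the inequality direction.
    return math.exp((gamma_max - a * math.log(x) - b * math.log(h) - d) / c)


p_max = max_glyph_size(1058, 400, 0.9, *COEFFS)
print(p_max, f(1058, 400, p_max, *COEFFS))
```

In practice the bound would be rounded down to a whole number of pixels, as done above when 5.60941 is reported as a $5 \times 5$ glyph.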

Synthetic dataset 2: the user wants to visualize a dataset with 300,000 data

In this example, the goal is to analyze the feasibility of a 300,000-data scatterplot visualization.

If the user restricts the visualization to a $400 \times 400$-pixel window, he or she may notice that, even with a 1-pixel glyph, the visibility index of the visualization could not be greater than 0.036.

On the other hand, if the user restricts the visibility index $τ$ to a minimum value of 0.9, he or she may notice that even with a 1-pixel glyph, the size of the window needs to be greater than the available space in the display:

$$
\begin{aligned}
f (300{,}000, h, 1) = \frac{1}{1 + e^{a \ln (300{,}000) + b \ln (h) + c \ln (1) + d}} &\geq 0.9 \\
h &\geq 2155.73
\end{aligned}
$$
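The bound on h follows from the model in a few steps. As a worked sketch, write $τ_{min}$ for the minimum acceptable visibility index (a symbol introduced here for convenience) and recall that $b < 0$, so dividing by b reverses the inequality:

$$
\begin{aligned}
\frac{1}{1 + e^{γ (x, h, p)}} \geq τ_{min}
&\iff γ (x, h, p) \leq \ln \left( \frac{1}{τ_{min}} - 1 \right) \\
&\iff b \ln (h) \leq \ln \left( \frac{1}{τ_{min}} - 1 \right) - a \ln (x) - c \ln (p) - d \\
&\iff h \geq \exp \left( \frac{\ln \left( \frac{1}{τ_{min}} - 1 \right) - a \ln (x) - c \ln (p) - d}{b} \right)
\end{aligned}
$$

For $τ_{min} = 0.9$, the logarithmic term is $\ln (1 / 9) \approx -2.1972$, and for a 1-pixel glyph the term $c \ln (p)$ vanishes.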

Furthermore, if the user decides to use the biggest possible window ($1080 \times 1080$ pixels), he or she may notice that the visibility index $τ$ remains low, that is, less than 50%:

$$
f (300{,}000, 1080, 1) = \frac{1}{1 + e^{a \ln (300{,}000) + b \ln (1080) + c \ln (1) + d}} \approx 0.48
$$

In such cases where the visibility index is not promising, even with the smallest possible glyph and the biggest possible window, a system that supports 2D scatterplots should discourage this technique as a potentially acceptable one for this dataset.
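A semi-automatic system could implement this check by evaluating the best case, that is, the largest window and the smallest glyph, before offering the technique. The following Python sketch assumes hypothetical coefficients in `COEFFS` (not the fitted ones) and hypothetical function names, so its verdicts illustrate the mechanism only.

```python
import math

# Hypothetical coefficients (a, b, c, d): illustrative only, NOT fitted values.
COEFFS = (2.0, -3.0, 3.0, 1.0)


def f(x, h, p, a, b, c, d):
    """Model of the visibility index: 1 / (1 + exp(gamma(x, h, p)))."""
    g = a * math.log(x) + b * math.log(h) + c * math.log(p) + d
    return 1.0 / (1.0 + math.exp(g))


def best_case_visibility(x, max_window, coeffs, min_glyph=1):
    """Visibility with the largest window and the smallest glyph available."""
    return f(x, max_window, min_glyph, *coeffs)


def scatterplot_is_advisable(x, max_window, tau_min, coeffs, min_glyph=1):
    """True if some admissible configuration can reach tau_min."""
    return best_case_visibility(x, max_window, coeffs, min_glyph) >= tau_min


# A 1080-pixel square is the largest window that fits a 1920 x 1080 display.
for n_items in (329, 300_000):
    verdict = scatterplot_is_advisable(n_items, 1080, 0.9, COEFFS)
    print(n_items, "advisable" if verdict else "not advisable")
```

Because f increases with h and decreases with p, checking this single configuration is enough: if the best case fails, every other configuration fails too.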

Case study

The travel book Places Rated Almanac²² numerically rates 329 communities according to nine criteria: Climate & Terrain, Housing, Health Care & Environment, Crime, Transportation, Education, The Arts, Recreation, and Economics. For all but two of these criteria, the higher the score the better; for Housing and Crime, the lower the score the better. The dataset includes the population of each city as additional information; as suggested by OpenStreetMap (https://www.openstreetmap.org), cities are classified as towns or as different ranks of cities based on their population (see Table 1).

Table 1.

Cities are classified by their population and each category is represented with a different color.

Place type	Population	Color
Town	10,000–100,000	Orange
Rank-30 City	100,000–500,000	Violet
Rank-20 City	500,000–1,000,000	Green
Rank-10 City	1,000,000–10,000,000	Blue
Rank-0 City	>10,000,000	Red

Let us suppose that a user wants to generate a 2D scatterplot to compare Climate & Terrain, Transportation, and city classification; the first two attributes correspond to the axes of the scatterplot and the latter to the color of the glyph (see Table 1). Membership disambiguation tasks²³ include tasks where the user explores the data to find objects with specific characteristics, counts the number of objects in a selection, or identifies objects in an area. Overlapping obscures the structure and the information present in the data and makes it difficult for users to accomplish these tasks. Moreover, if the visualization offers a semantic-zoom interaction (i.e. the values of other attributes are shown for a selected glyph) to help with these tasks, glyphs should be identifiable in order to be picked.

Considering that the distribution of the two variables Climate & Terrain and Transportation resembles a normal distribution (see Figure 7), the visibility index could be used to guide the selection of the glyph size.

Figure 7.

The density histogram, the cumulative distribution function of each individual variable against the cumulative distribution function of a normal distribution, and the histograms for each variable are graphical indicators that the distributions of the variables Climate & Terrain and Transportation resemble a normal distribution.

The user generates the 2D scatterplot in a $400 \times 400 - pixel$ window. The amount of always-visible glyphs depends on their size. However, if the user needs to use a glyph-selection facility, glyphs need to be big enough for picking. Then, there is a trade-off between the always-visible glyphs and their size.

For 20-pixel glyphs, the actual ratio of always-visible glyphs is about 0.6603, while the predicted visibility index is 0.5045. Even though the actual value is higher than the estimated one, it is still low: about 34% of the glyphs are not always visible. For 10-pixel glyphs, however, the actual ratio is 0.9361 and the estimated one is 0.9452. For 5-pixel glyphs, the actual ratio rises to 0.9910, while the estimated one is 0.9818; in this last scenario, the glyphs could turn out to be too small for picking. Figure 8 shows how identifiable each category of cities is for the different glyph sizes. Note that in Figure 8(a) the glyphs are big and suitable for picking but overcrowded and difficult to identify. On the other hand, in Figure 8(c) the glyphs are identifiable, although they may be too small to be picked. A trade-off solution could be Figure 8(b), where glyph visibility is good and the glyphs are still large enough for picking.
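
An "actual ratio" of this kind could be measured with a brute-force pixel count, as in the minimal sketch below. It rests on assumptions that are not taken from the text above: glyphs are axis-aligned squares given by the pixel coordinates of their top-left corners, glyphs are clamped so that they lie inside the window, and a glyph counts as always visible when at least one of its pixels is covered by no other glyph (so that pixel shows the glyph in any drawing order). The function name `always_visible_ratio` is hypothetical, and this is not necessarily the procedure used to obtain the values reported above.

```python
def always_visible_ratio(corners, glyph_px, window_px):
    """Fraction of square glyphs that own at least one uncontested pixel.

    corners   -- iterable of (x, y) top-left pixel coordinates (assumed input)
    glyph_px  -- side of the square glyph, in pixels
    window_px -- side of the square window, in pixels
    """
    if glyph_px < 1 or glyph_px > window_px:
        raise ValueError("glyph must fit inside the window")

    # cover[row][col] = number of glyphs drawn over that pixel.
    cover = [[0] * window_px for _ in range(window_px)]
    clamped = []
    for x, y in corners:
        # Assumption: glyphs are shifted to lie fully inside the window.
        x = min(max(int(x), 0), window_px - glyph_px)
        y = min(max(int(y), 0), window_px - glyph_px)
        clamped.append((x, y))
        for row in range(y, y + glyph_px):
            line = cover[row]
            for col in range(x, x + glyph_px):
                line[col] += 1

    if not clamped:
        raise ValueError("at least one glyph is required")

    # A glyph is always visible if some pixel under it is covered only once.
    visible = sum(
        1
        for x, y in clamped
        if any(
            cover[row][col] == 1
            for row in range(y, y + glyph_px)
            for col in range(x, x + glyph_px)
        )
    )
    return visible / len(clamped)
```

Under these assumptions, two glyphs drawn at exactly the same position are both counted as not always visible, whereas two partially overlapping glyphs are both counted as always visible.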

Figure 8.

Comparison of how distinguishable the glyphs are: (a) scatterplot visualized in a 400 × 400-pixel window with 20-pixel glyphs, (b) scatterplot visualized in a 400 × 400-pixel window with 10-pixel glyphs, and (c) scatterplot visualized in a 400 × 400-pixel window with 5-pixel glyphs.

The visibility index is an appropriate tool for selecting the biggest glyph that still results in a high rate of always-visible glyphs. If the user restricts the minimum acceptable value for the visibility index $\tau$ to 0.9, then the glyph should not be larger than $11 \times 11$ pixels:

\begin{matrix} f (329, 400, p) = \frac{1}{1 + e^{a \ln (329) + b \ln (400) + c \ln (p) + d}} \geq 0.9 \\ p \leq 11.82 \end{matrix}
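
The bound on $p$ follows from inverting the logistic expression: $f \geq \tau$ is equivalent to $a \ln(n) + b \ln(w) + c \ln(p) + d \leq \ln(1/\tau - 1)$, which can be solved for $\ln(p)$ when $c > 0$ (visibility decreases as glyphs grow). The sketch below illustrates that inversion. The function names are hypothetical, and the coefficients $a$, $b$, $c$, $d$ are passed in as arguments because their fitted values are not restated in this section.

```python
import math


def visibility_index(n, w, p, a, b, c, d):
    """Logistic visibility index f(n, w, p) in the form shown above.

    n -- number of items, w -- window side in pixels, p -- glyph size in pixels
    a, b, c, d -- fitted coefficients (must be supplied by the caller)
    """
    z = a * math.log(n) + b * math.log(w) + c * math.log(p) + d
    return 1.0 / (1.0 + math.exp(z))


def max_glyph_size(n, w, tau, a, b, c, d):
    """Largest p for which visibility_index(n, w, p, ...) >= tau.

    Assumes c > 0, i.e. the index decreases as the glyph grows.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if c <= 0:
        raise ValueError("this inversion assumes c > 0")
    # f >= tau  <=>  a ln n + b ln w + c ln p + d <= ln(1/tau - 1)
    bound = math.log(1.0 / tau - 1.0)
    log_p = (bound - a * math.log(n) - b * math.log(w) - d) / c
    return math.exp(log_p)
```

With the fitted coefficients of the model, `max_glyph_size(329, 400, 0.9, a, b, c, d)` should reproduce the bound $p \leq 11.82$ stated above; in practice the result would be rounded down to a whole number of pixels.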

Conclusion and future work

In the process of guiding the user in the selection of the different parameters of a visualization, metrics become more useful if every technique has at least one visual scalability metric attached to it. A semi-automatic system could use these metrics to alert the user about decisions that degrade the visualization and to guide the user in the selection of the parameters that generate the best possible view. If the best possible view is not acceptable, then the semi-automatic system could use the metrics to suggest alternative techniques.

This work presented a metric that, given the size of the scalar dataset to visualize, the window’s size, and the glyph’s size, estimates the proportion of always-visible glyphs in a scatterplot visualization with such characteristics. This metric could be useful to help the users to choose the most adequate parameters to get an acceptable visualization.

Given that the presented metric was derived from datasets with a particular (normal) distribution, we plan to analyze and extend the mathematical model to other data distributions. More broadly, in order to help users along the visualization process, the ultimate goal is to have a variety of metrics that measure different characteristics of the potential resulting visualization and provide additional tools for choosing the most appropriate parameters to visualize a dataset with scatterplots.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially funded by PGI 24/N028 and PGI 24/N037, Secretaría General de Ciencia y Tecnología, Universidad Nacional del Sur, Bahía Blanca, Argentina.

References

1. Tufte. The visual display of quantitative information. 2nd ed. Cheshire, CT: Graphics Press, 2001.

2. Miller, Hetzler, Nakamura. The need for metrics in visual information analysis. In: Proceedings of the 1997 workshop on new paradigms in information visualization and manipulation (NPIV ’97), 1997, pp. 24–28. New York: ACM.

3. Tatu, Bak, Bertini. Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data. In: Proceedings of the international conference on advanced visual interfaces (AVI ’10), 2010, pp. 49–56. New York: ACM.

4. Donoho, Gasko. MacSpin: dynamic graphics on a desktop computer. IEEE Comput Graph 1988; 8(4): 51–58.

5. Chernoff. The use of faces to represent points in k-dimensional space graphically. J Am Stat Assoc 1973; 68(342): 361–368.

6. Pickett, Grinstein. Iconographic displays for visualizing multidimensional data. In: Proceedings of the 1988 IEEE international conference on systems, man, and cybernetics, Beijing, China, 8–12 August 1988, pp. 514–519. New York: IEEE.

7. Borgo, Kehrer, Chung DHS. Glyph-based visualization: foundations, design guidelines, techniques and applications. Eurographics State of the Art Reports (EG STARs). Eurographics Association, 2013, pp. 39–63, http://www.cg.tuwien.ac.at/research/publications/2013/borgo-2013-gly/

8. Becker, Cleveland. Brushing scatterplots. Technometrics 1987; 29(2): 127–142, https://www-jstor-org.web.bisu.edu.cn/stable/1269768

9. Keim, Hao, Dayal. Generalized scatter plots. Inform Visual 2010; 9(4): 301–311.

10. Etemadpour, Forbes. Density-based motion. Inform Visual. Epub ahead of print 7 October 2015, http://ivi.sagepub.com/content/early/2015/10/06/1473871615606187

11. Freitas CMDS, Luzzardi PRG, Cava. On evaluating information visualization techniques. In: Proceedings of the working conference on advanced visual interfaces (AVI ’02), 2002, pp. 373–374. New York: ACM.

12. Bertini, Santucci. Visual quality metrics. In: Proceedings of the 2006 AVI workshop on beyond time and errors: novel evaluation methods for information visualization (BELIV ’06), 2006, pp. 1–5. New York: ACM.

13. Bertini, Tatu, Keim. Quality metrics in high-dimensional data visualization: an overview and systematization. IEEE T Vis Comput Gr 2011; 17(12): 2203–2212.

14. Brath. Metrics for effective information visualization. In: Proceedings of the IEEE symposium on information visualization (INFOVIS), Phoenix, AZ, 20–21 October 1997, pp. 108–111. Washington, DC: IEEE Computer Society.

15. Shannon. A mathematical theory of communication. Bell Syst Tech J 1948; 27: 379–423, 623–656.

16. Yang-Peláez, Flowers. Information content measures of visual displays. In: Proceedings of the IEEE symposium on information visualization 2000 (INFOVIS ’00), 2000, pp. 99–103. Washington, DC: IEEE Computer Society.

17. Bertini, Santucci. Quality metrics for 2D scatterplot graphics: automatically reducing visual clutter. In: Butz, Krüger, Olivier (eds) Proceedings of symposium on smart graphics, vol. 3031. Berlin, Heidelberg: Springer, 2004, pp. 77–89.

18. Bertini, Santucci. Give chance a chance: modeling density to enhance scatter plot quality through random data sampling. Inform Visual 2006; 5(2): 95–110.

19. Dang, Wilkinson. ScagExplorer: exploring scatterplots by their scagnostics. In: Proceedings of the 2014 IEEE pacific visualization symposium (PACIFICVIS ’14), 2014, pp. 73–80. Washington, DC: IEEE Computer Society, https://dx-doi-org.web.bisu.edu.cn/10.1109/PacificVis.2014.42

20. Escarza, Larrea, Urribarri. Integrating semantics into the visualization process. In: Hagen (ed.) Scientific visualization: interactions, features, metaphors, vol. 2 of Dagstuhl Follow-Ups. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2011, pp. 92–102, http://drops.dagstuhl.de/opus/volltexte/2011/3304

21. Alfassi, Boger, Ronen. Statistical treatment of analytical data. Boca Raton, FL: CRC Press, LLC, 2005.

22. Boyer, Savageau. Places rated almanac: your guide to finding the best places to live in America. Chicago, IL: Rand McNally & Co., 1985.

23. Etemadpour, Linsen, Crick. A user-centric taxonomy for multidimensional data projection tasks. In: Proceedings of the 6th international conference on information visualization theory and applications, Berlin, 2015, pp. 51–62, https://dx-doi-org.web.bisu.edu.cn/10.5220/0005313400510062