Automating statistical diagrammatic representations with data characterization

Abstract

The search for an efficient method to enhance data cognition is especially important when managing data from multidimensional databases. Open data policies have dramatically increased not only the volume of data available to the public, but also the need to automate the translation of data into efficient graphical representations. Graphic automation involves producing an algorithm that necessarily contains inputs derived from the type of data. A set of rules is then applied to combine the input variables and produce a graphical representation. Automated systems, however, fail to provide an efficient graphical representation because they either consider only a one-dimensional characterization of variables, which leads to an overwhelmingly large number of available solutions; rely on a compositional algebra that leads to a single solution; or require the user to predetermine the graphical representation. Therefore, we propose a multidimensional characterization of statistical variables that, when complemented with a catalog of graphical representations that match any single combination, presents the user with a more specific set of suitable graphical representations to choose from. Cognitive studies can then determine the most efficient perceptual procedures to further shorten the path to the most efficient graphical representations. The examples used herein are limited to graphical representations with three variables, given that the number of combinations increases drastically as the number of selected variables increases.

Keywords

Automated design, graphic design, statistical graphics, information visualization, graphic user interface, statistical graphics taxonomy

Introduction

Despite its short history, the field of statistical graphics has been very prolific. Friendly and Denis,¹ for example, identified more than 50 different types of statistical graphics. In any specific situation, however, a series of rules excludes many of these graphics. Factors such as the number and the type of variables, the number of unique values for each variable, or the order of the values can restrict the number of viable options. Consequently, choosing an adequate graphical representation for each case is a rather complex undertaking that requires prior experience and knowledge of the applicable rules and the available options.

One solution to this problem is to define automatic graphic selection algorithms that, for greater convenience, can be implemented on computers to determine what graphic or limited number of graphics is appropriate for a given situation. This solution is mentioned in the literature and included in certain computer systems, but a solution that enjoys sufficiently broad consensus has yet to be found. In response, our article aims to advance a new set of rules for automatic graphic selection based on the characterization of variables in a dataset.

The set of rules for the automatic selection of graphics refers to the number of variables to be represented graphically and, separately, the characterization of the variables. The latter are the characteristics that can be described for each of the variables (e.g. “numeric” or “alphanumeric”) independently of the characteristics of the relationships between the variables.

A review of strategies to automate statistical graphical representations

This section reviews the strategies that have been proposed in the literature to automate statistical graphical representations. In general, these strategies all include a more or less sophisticated characterization of the separate variables. Nonetheless, they are organized here according to the following factors: (1) the characteristics of the data to be represented; (2) the characteristics and needs of the user; (3) the representation models used, and (4) the limitations of the hardware used. Each of these factors is further discussed in the following.

The characteristics of the data to be represented

Kamps² uses the term “functional design” to refer to the methods that determine the aesthetic of the graphics based on characteristics of the data. More recently, Schulz et al.³ identified data descriptors that also consider the data acquisition, storage, and utility context. Different aspects of the data characteristics can help reduce the number of graphic possibilities for a given situation. Thus, considering the properties of each separate variable, automated graphical representations have been proposed based on, for example, the variable’s level of measurement (i.e. nominal, ordinal, interval, and ratio), its independent or dependent role, and the source type (empirical or theoretical) of the data. Implicit information such as the number of unique values of a variable and the presence of missing data are also factors. Examples of systems based on these properties include the CHART program,⁴ the BHARAT⁵ system, and ViSta.⁶ Another aspect is the relationship between pairs of variables, such as whether each value of a variable corresponds to a single value of another variable (functional dependency) or to multiple values of another variable (as in multilevel and hierarchical data). Examples of systems that use these aspects include the APT⁷ tool, the SAGE⁸ system, and the EAVE² system. Finally, the total number of variables to be represented is a relevant criterion. APT, for example, includes the “expressivity” of a graphic as a criterion and regards a graphical language as expressive if it conveys all of a dataset’s information and only that information.
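
The functional dependency between a pair of variables mentioned above can be checked mechanically. The following is a minimal sketch, assuming the data arrive as a list of value pairs; the function name `is_functionally_dependent` and the sample data are hypothetical and are not taken from any of the systems cited.

```python
def is_functionally_dependent(pairs):
    """Return True if every value of the first variable maps to exactly
    one value of the second variable (a functional dependency a -> b)."""
    seen = {}
    for a, b in pairs:
        # setdefault stores b the first time a appears, then returns the stored value
        if seen.setdefault(a, b) != b:
            return False
    return True


# Hypothetical data: each country has a single capital, so the dependency holds.
capitals = [("France", "Paris"), ("Spain", "Madrid"), ("France", "Paris")]
# Hypothetical data: a country has several cities, as in hierarchical data.
cities = [("France", "Paris"), ("France", "Lyon"), ("Spain", "Madrid")]

print(is_functionally_dependent(capitals))
print(is_functionally_dependent(cities))
```

A system could run such a check on every pair of selected variables, although, as discussed later in the text, the cost of characterizing all pairs grows with the number of variables.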

The characteristics and needs of the user

The characteristics and needs of the user should be taken into account when selecting what graphic to construct. If, for example, the user wishes to know the precise value of an observation for a variable, then a table is adequate. But if the aim is to detect trends in the evolution of the variable over time, then it is more appropriate to depict the variable’s variation over time instead of comparing numbers in a table. BOZ⁹ is a system based on this type of information. It promotes what its author calls task-based graphic design. Another user characteristic is the ability of human perception. This led to what Kamps² defined as perceptual design, which has been implemented in, for example, APT.⁷ APT includes effectiveness in graphic selection as a criterion, for which Mackinlay used a ranking based on the precision in the decoding of the variables according to the variable’s measurement scale and the type of visual variable with which it is encoded. Perceptual design is also the basis for the development of the quality metrics encompassed in the methods of automated evaluation of visualizations.¹⁰,¹¹ The user’s preferences can also be used to select a graphic; these preferences can be gathered from the user’s graphic selection habits and history, as is done in VizRec.¹² Finally, users carrying out the statistical analysis of data require a type of graphic that is different from what is needed to present the information to a wider audience. ViSta⁶ is an example of a system oriented toward statistical analysis, while the infogr.am web platform is oriented toward presentation purposes.

Representation models

Representation models use predefined graphic types to interpret the data. These can be classified as one of the three basic forms (“point,” “line,” or “area”) that a mark can take on a plane, a reduced set of visual variables, or a specified taxonomy of the set of graphical representations compiled by Engelhardt.¹³ But representation models may also refer to multiple types of commonly accepted graphics, such as histograms, scatterplots, and bar graphs. One of the first systems to use this criterion was SageBrush,¹⁴ which was implemented in SAGE and made possible the construction of graphics based on prototypes. Another example is MS Excel, which prompts the user to choose among a gamut of graphic types and subtypes.

Hardware limitations

Various hardware limitations can also be considered. First, there may be computation or data transmission limitations, as implemented in systems like Polaris¹⁵ that can suggest, for example, a static instead of a dynamic graphic. Second, there are visualization limitations that adapt the graphic to be produced to the resolution and the size of the display screen, such as have been implemented in the BHARAT⁵ system. These considerations have come to be known as responsive design, which refers to the adaptability of the presentation type to the characteristics of the graphic display.

Table 1 summarizes the graphic selection strategies used by the different automated statistical graphic selection systems. We can see that the aforementioned functional design strategy is implemented in all the systems and complemented with other strategies to refine the final graphic to be presented. Thus, the SAGE, NSP, BOZ, and VizRec systems also use information about the task to be performed, and the APT, NSP, BOZ, Vista,¹⁶ and EAVE systems incorporate a compositional algebra to consider human perception abilities and create new graphic types, hence avoiding reliance on an ad hoc catalog of graphics. The CHART, BHARAT, BOZ, SAGE + SageBrush, Vista, Polaris, and Tableau systems allow the user to choose between different coordinate systems and visual variables, while the NSP, SAGE + SageBrush, and Tableau + Show Me¹⁷ systems allow the user to choose among a limited group of graphical representation types, such as tables, maps, scatterplots, and line and area graphs. Finally, systems like CHART, BHARAT, SageBook, Polaris, ViSta, and VizRec present the user with an ad hoc catalog of graphics to choose from.

Table 1.

Strategies used by different graphics automating systems.

	Data characteristics			User characteristics			Representation models			Hardware characteristics	
System	Individual characteristics of variables	Between pairs of variables	Number of variables	Task to be performed	Human perception	User preferences	Coordinate system or visual variable	Graphical representation taxonomy	Ad hoc catalog of graphics	Processing limitations	Graphic display limitations
CHART⁴			▪				▪		▪		▪
BHARAT⁵	▪	▪	▪				▪		▪
APT⁷	▪	▪	▪		▪
SAGE⁸	▪	▪	▪	▪					▪
NSP¹⁸	▪	▪	▪	▪	▪			▪
BOZ⁹	▪	▪	▪	▪	▪		▪
SAGE + SageBrush¹⁴	▪	▪	▪	▪			▪	▪	▪
Vista¹⁶	▪	▪	▪		▪		▪
EAVE²	▪	▪	▪		▪
Polaris¹⁵	▪	▪	▪				▪		▪	▪
Tableau¹⁷	▪	▪	▪				▪		▪	▪
ViSta⁶	▪	▪	▪						▪
VizRec¹²	▪	▪	▪	▪		▪			▪

Limitations of the various strategies

The following section discusses the limitations of the aforementioned data-automation strategies. It is presented here as a preliminary step to the presentation of our proposal. These limitations are summarized in Table 2.

Table 2.

Limitations of the various strategies.

Strategy	Limitations
Data characteristics	Characteristics that cannot be implicitly deduced have to be defined by an experienced user. Systems have mostly used a one-dimensional characterization of the variables
User characteristics	The task cannot be automated, so each individual user has to define it. Systems using a compositional algebra based on human perception abilities fail to take advantage of new and creative graphical methods. Systems based on user preferences do not take into account the different groups of users for which the graphic is intended
Representation models	The system requires the user to predetermine the characteristics of the graphic which is undesirable for users without data visualization expertise
Hardware limitations	The system may simplify the graphics without the user being aware. Graphics generated by a system are usually stored and reproduced in other systems with different characteristics

Characteristics of the data to be presented

The first of the strategies, functional design, is implemented in all the systems. This strategy uses the characteristics of the variables taken separately, and the relationships between them that, ideally, are implicit in the datasets, and which the system uses to determine the graphic that is best suited to the data.

However, a limitation of this strategy is that these characteristics are not always implicit in the dataset; in these cases, it becomes necessary for the user to have a deep understanding of the nature of the data. For example, this limitation can be seen in the distinction between nominal and ordinal measurement scales, as well as between interval and ratio scales, in data displayed in tables. Usually tables contain text and numeric fields. The text fields may refer to unordered or ordered categories. If the categories are ordered, this should be reflected in the graphic. Let us suppose we have the categories “high,” “low,” and “medium”; in this case, the three categories reflect an ordered relationship and it would therefore seem natural for the graphic to reorder them as “high,” “medium,” and “low.” Yet it is difficult for an automated system to correctly deduce this relationship, making it necessary for the user to somehow specify this order. The same occurs with numeric fields; the systems are incapable of discerning if they are dealing with magnitudes with arbitrary units and origin (or zero) and consequently an interval measurement scale, or if, on the contrary, they are dealing with absolute zero and a ratio measurement scale. In short, it is often necessary for the user to specify the non-implicit characteristics of the data in order for the systems to automatically generate a graphic.
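
The ordering problem described above can be illustrated with a short sketch. It assumes that the user supplies the intended order explicitly; the helper name `order_categories` is hypothetical and does not come from any of the systems reviewed.

```python
def order_categories(values, user_order):
    """Sort category labels by a rank supplied by the user rather than
    by the default (alphabetical) order of the text field."""
    rank = {label: position for position, label in enumerate(user_order)}
    return sorted(values, key=lambda label: rank[label])


observed = ["high", "low", "medium"]

# Without extra information a system can only fall back on alphabetical order.
alphabetical = sorted(observed)

# With a user-specified order the ordered relationship is preserved.
ordered = order_categories(observed, user_order=["high", "medium", "low"])

print(alphabetical)
print(ordered)
```

The alphabetical fallback keeps “low” before “medium,” which is exactly the kind of result that the user-specified order is meant to avoid.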

Another dataset limitation that impacts the characterization of variables is the number of dimensions with which the values of each variable tend to be characterized. Information is generally structured in databases that relate to tables, and these tables store information on the characteristics of the values of each variable, known as data types, such as Boolean, alphanumeric characters, and whole or real numbers. This characterization of the data is generally one dimensional, such that each category is exclusionary and does not allow for the combining of qualities from other dimensions, in order to thus restrict the gamut of graphic possibilities. An example of a multidimensional characterization is to consider, in addition to the aforementioned list of data types, whether the variable is a response or predictor variable, such that a variable can be characterized as, for example, Boolean and predictor or Boolean and response. Thus, the set of possible graphics to represent Boolean variables is further broken down into two smaller subsets.
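
The effect of adding a second dimension can be sketched as a Cartesian product of levels. The snippet below is only an illustration under stated assumptions: the level names follow the data types and roles mentioned in the paragraph above, and the variable names are hypothetical.

```python
from itertools import product

# Levels named in the text: data types (first dimension) and roles (second dimension).
data_types = ["Boolean", "alphanumeric", "whole", "real"]
roles = ["predictor", "response"]

# One-dimensional characterization: one exclusionary category per variable.
one_dimensional = [(data_type,) for data_type in data_types]

# Two-dimensional characterization: every data type is split by role.
two_dimensional = list(product(data_types, roles))

print(len(one_dimensional), len(two_dimensional))
print([c for c in two_dimensional if c[0] == "Boolean"])
```

Each category of the first dimension is divided into as many subsets as the second dimension has levels, which is how the gamut of candidate graphics for, say, Boolean variables is narrowed.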

The characteristics and needs of the user

The task to be performed is one way to complement functional design. For a given dataset, such as the number of traffic accidents per kilometer on a country’s highways, a graphic that makes it easy to look up the accident ratio for each specific highway takes the form of a table with one row per highway, ordered by highway name, with the accident ratio in the adjoining column. But if we wish to group highways according to accident rate (“normal” versus “atypical”), a single-axis plot is preferable, since it depicts the distribution of the ratios and makes it easy to identify atypical values.

However, complementing functional design with the task to be performed has its limitations. It is often the case that the user undertakes exploratory data analysis without a clear idea of the concrete tasks to be performed and may therefore be interested in visualizing representations for a dataset that serve different purposes. For example, the user may want to see whether the values of two variables are more or less correlated, to identify concentrations in the distribution of the observations for a pair of variables, to identify bimodal distributions in either of the two variables, or to compare the dispersion or the ranges of the values of both variables.

When the task is previously specified, the system can present graphics that are useful for a predetermined purpose. However, in addition to identifying the task to be performed, this requires identifying the tasks for which each graphic to be evaluated by the system can be useful. Let us assume that the task the user wants to undertake is to find the most economical flight between two cities in a certain period of time. The system can use a bar chart with the lowest daily prices when the period of time is short, or a line graph when the time period is longer. In general, though, the possibilities are more limited and it is easier to produce them automatically.

Quality metrics, based on perceptual design, can help in the automatic selection of graphics, but they are mainly used to optimize some aspect of specified graphics. Another way to generate automated graphics is via a compositional algebra that can generate graphics based on rules derived from human perception abilities. But the drawback here is that this approach fails to take advantage of new and creative methods of graphical representation that are tailored to a specific perceptual task. Systems based on compositional algebra tend to limit options to only one graphic instead of a gamut of graphics. Such is the case, for example, with the BOZ system,⁹ which provides two results: the graphic that theoretically is the most effective to execute a specific task; and a set of instructions about how to use the graphic to complete the task satisfactorily. In order to obtain another graphic, one must change the dataset or the task to be performed. Examples of tasks that BOZ handles include those of the type “determine horizontal distance,” which generates a graphic that depicts the difference between two points on a horizontal axis, and “search for objects with shade,” which displays only those objects with a specified shade level. Thus, the system may not permit new graphics created ad hoc for specific problems because they do not fit within the framework defined by these rules. For example, tasks that use complex symbols like box plots and Chernoff faces might be omitted, which paradoxically results in users being unable to evaluate them.

One final way to incorporate the characteristics and needs of users is through their preferences. This is one of the newest methods and it is likely to undergo significant development, similar to that undergone by search engine recommender systems. VizRec¹² is an example of such a system; it combines recommendations based on the data type with recommendations based on the scores given to the graphics by users with respect to criteria such as “boring,” “useful,” “effective,” and “satisfactory.” There are two main problems with this approach. First, the user may have an insufficient registry of preferences for the system to suggest graphics based on it. Second, users may be interested not only in the suggestions based on their past history, but also in the preferences of the specific social group for which the graphic is intended and which may have its own particular communication register.

Representation models

The use of representation models is the third aforementioned strategy. It consists of the use of predefined graphics to interpret the data. In this case, the system requires information such as the coordinate system to be used, the visual variables into which it is to transform the variables of the dataset, and whether a point graph, line graph, or area graph should be used versus some other specific graphic, such as a scatterplot or histogram. This requirement could easily lead the non-expert user to produce graphics that are semantically incorrect, for example, a pie chart for values that it does not make sense to add up or that lack a maximum possible value. On the other hand, if an ad hoc catalog of solutions is used instead, the user would be limited and lack sufficient flexibility to include new graphical methods.

Hardware limitations

Finally, keeping hardware limitations in mind, especially in terms of memory, screen size, and resolution, allows for the transformation of large amounts of data into graphics that are comprehensible across any device, thus increasing usability. The drawback, however, is that these graphics can, without the user being aware of it, become simplifications that limit interpretation. Furthermore, graphics generated by a system are stored and reproduced in other graphic displays with different characteristics.

System limitations in characterizing the data to be represented

Characterizing variables implies assigning attributes to them. These attributes tend to be exclusionary qualities and they can condition the selection of the graphic to be used because some graphics are better suited to some qualities than others. Below we look at how different systems characterize the variables to be represented and discuss the limitations of each.

As previously stated, functional design strategy is based on limiting the range of possible graphics based on the characteristics of the data. It might be possible to deduce these characteristics implicitly; otherwise, the user will need to define them. There are three levels of characteristics: those that refer to each variable separately, those that define relationships between variables, and those that consider the overall number of variables to be represented.

If the number of variables to be represented cannot be deduced implicitly from the dataset, it is easy for the skilled user to indicate which variables to represent among those included in the dataset. It is also relatively easy to characterize each variable to be represented separately once these variables have been chosen and the rules of a specific characterization are known. Therefore, this is not overly problematic. With respect to the characterization of relationships between variables, however, it would be very costly for the user to characterize all the relationships between variable pairs because this requires a solid understanding of the data and, additionally, the number of pairwise relationships grows quadratically with the number of variables considered. Nonetheless, we should also keep in mind that statistical graphics are generally displayed as diagrams and not as networks with nodes and connections that basically depend on these relationships; therefore, if the purpose is to obtain a gamut of acceptable graphics, the effort on the part of the user to characterize the relationships between variables does not seem justified. For this reason, this section only considers the limitations of the systems in terms of characterizing each variable separately, so that we may later propose a new approach.

Depending on the number of dimensions that the various systems use to characterize each variable separately, we may be looking at one-dimensional or multidimensional characteristics. While the former categorize the variables from a reduced number of exclusionary characteristics (e.g. nominal, ordinal, and quantitative variables), the multidimensional characterizations allow for successive subdivisions in each category based on each dimension being considered. A clear example of this is also considering the role of the variables as predictor or response variables. Thus, a quantitative variable, for instance, can be further characterized as a predictor or a response variable, and this distinction produces two possible gamuts of acceptable graphics that can display this quantitative variable as one or the other.

There are certain dimensions, such as the number of unique values, for which the automated graphic systems can utilize one of two strategies. One strategy, as proposed by Bertin,¹⁹ consists of categorizing this dimension according to pre-established limits, such as short variables when four or fewer unique values exist and long variables when more than 15 exist, and then evaluating the gamut of graphics according to these limits. Another strategy is to include ad hoc limits in the graphic selection algorithms based on the type of graphic being considered; for example, a maximum of 10 unique values per variable in a vertically oriented bar diagram that could, for example, increase to 30 if the bars are arranged horizontally. In this strategy, there are no pre-established exclusionary levels for the variable; in other words, the levels are diffuse.
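
The two strategies for handling the number of unique values can be contrasted in a small sketch. The thresholds are the ones quoted in the paragraph above; the function names are hypothetical, and treating everything between the short and long limits as “medium” is an assumption made here for illustration.

```python
def bertin_length(n_unique):
    """Strategy 1: pre-established, exclusionary levels for the variable."""
    if n_unique <= 4:
        return "short"
    if n_unique > 15:
        return "long"
    return "medium"  # assumed label for values between the two limits


# Strategy 2: ad hoc limits attached to each graphic type (diffuse levels).
BAR_LIMITS = {"vertical": 10, "horizontal": 30}


def bar_orientations(n_unique):
    """Return the bar-diagram orientations that can hold n_unique values."""
    return [name for name, limit in BAR_LIMITS.items() if n_unique <= limit]


print(bertin_length(12), bar_orientations(12))
```

In the first strategy the level is a property of the variable and is fixed before any graphic is considered; in the second, the same variable may be acceptable for one graphic and unacceptable for another.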

The greater the number of identified dimensions and levels, the greater the number of resulting variable combinations and the greater the number of subsets of the sample space of available graphics, which restricts the search for an acceptable graphic. One-dimensional characterizations in two levels, for instance, allow for nine possible combinations to represent one, two, or three variables. In three levels, the number of combinations increases to 19, in four levels to 34, in five levels to 55, and in six levels to 83.
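
The text does not state the counting rule behind these figures, but they are consistent with counting unordered selections with repetition (multisets) of one, two, or three variables over $n$ levels, that is, $\sum_{k=1}^{3} \binom{n+k-1}{k}$. The sketch below checks this reading; the function name `count_combinations` is hypothetical.

```python
from math import comb


def count_combinations(levels, max_variables=3):
    """Count unordered selections with repetition of 1..max_variables
    variables, where each variable takes one of `levels` characterizations."""
    return sum(comb(levels + k - 1, k) for k in range(1, max_variables + 1))


counts = {levels: count_combinations(levels) for levels in range(2, 7)}
print(counts)
print(count_combinations(9))
```

Under the same assumption, nine levels (two dimensions with three levels each) give the total of 219 combinations quoted for Bertin’s characterization.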

As can be observed in Table 3, the majority of the systems utilize a one-dimensional characterization of each variable taken separately that derives from Bertin’s¹⁹ proposal, which characterized the variables according to two dimensions: first, the level of organization of the input variables, which distinguished qualitative, ordered, and quantitative variables; and second, the length of the variables, which also distinguished three levels: short (between two and four values), medium, and long (more than 15 values). These two dimensions, each with three levels, make possible a total of 219 combinations for one, two, and three variables. The APT and BOZ systems, for example, classify the domain of the variables in three levels: nominal, ordinal, and quantitative. SAGE subdivides ordinal and quantitative levels based on whether they refer to amounts or reference values and adds another dimension called the “domain of membership.” NSP subdivides the nominal domains into simple and multiple values and the ordinal domains into discrete and continuous values. Vista divides the quantitative level into three sublevels (scalars, vectors, and tensors). Polaris distinguishes between ordinal and quantitative variables, while Tableau distinguishes between categorical variables (with three sublevels) and quantitative variables (with two sublevels). VizRec distinguishes between categorical, temporal, and numerical variables.

Table 3.

Characterization of each variable taken separately in different automated graphics systems.

System	Dimensions	Levels	Details
CHART⁴	1	1	Quantitative
BHARAT⁵	5	2	Continuity (yes, no)
		2	Totality (yes, no)
		Diffuse	Cardinality
		Diffuse	Units
		Diffuse	Range
APT⁷	1	3	Nominal, ordinal, quantitative
SAGE⁸	2	5	Nominal, ordinal[coord. or amounts], quantitative[coord. or amounts]
SAGE⁸		4	Time, space, temperature, mass
NSP¹⁸	1	4	Nominal[simple or multiple], ordinal[discrete or continuous]
BOZ⁹	1	3	Nominal, ordinal, quantitative
Vista¹⁶	1	5	Nominal, ordinal, quantitative[scalars, vectors, or tensors]
EAVE²	1	6	Whole, real, Boolean, character strings, OID, $\emptyset$
Polaris¹⁵	1	2	Ordinal, quantitative
Tableau¹⁷	1	5	Categorical [normal, dates, or geographical], quantitative [dep. or ind.]
ViSta⁶	1	7	Multivariate, categorical, classification, frequency table, frequency classifications, matrices and missing data
VizRec¹²	1	3	Categorical, temporal, numeric

Other systems utilize models other than Bertin’s. The CHART system, for example, only considers the quantitative domain. The BHARAT system uses up to five dimensions, but only identifies levels for the dichotomous variables “totality” and “continuity,” which necessitates the establishment of customized rules in accordance with the value of the cardinality (number of unique values), the units, and the range of the variables. EAVE uses a very different classification, since it considers the domains of whole numbers, real numbers, Booleans, character strings, unique object identifiers, and the empty set. Finally, ViSta considers up to seven levels; yet these do not refer solely to variable types, but also to different data structures like “frequency tables” and characteristics that can apply to all sorts of variable types, such as “missing data.”

A proposal for automated graphic selection

The proposed method of automated graphic selection is based on a multidimensional characterization of the variables that allows us to reduce the gamut of possible methods of graphical representation for a particular combination of variables. This requires a two-front approach to finding the appropriate graphic: first, a top-down approach that implies the characterization of the variables based on the qualities that impact the selection of one graphic over another; second, a bottom-up approach via the characterization of the graphics according to the number of variables and those characteristics of theirs that fit with the previous characterization of the variables. This results in the grouping and situating of the statistical graphics into sets the length of which is the number of variables and the elements of which are the characteristics of these variables. Each set of graphics corresponds to the user’s various information search tasks, making cognitive studies necessary in order to include the task to be performed among the variables to be incorporated for a more precise automated graphic selection.
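
The matching between characterized variables and a catalog of graphics can be pictured as a lookup keyed by the combination of characterizations. The following is a minimal sketch, not the authors’ implementation: the catalog entries are illustrative placeholders, and the names `combination_key`, `CATALOG`, and `suggest` are hypothetical.

```python
def combination_key(characterizations):
    """Build an order-independent key from the characterizations of the
    selected variables (one element per variable)."""
    return tuple(sorted(characterizations))


# Illustrative entries only; the actual catalog is not reproduced here.
CATALOG = {
    combination_key(["quantitative"]): ["histogram", "box plot"],
    combination_key(["quantitative", "quantitative"]): ["scatterplot"],
    combination_key(["unordered", "quantitative"]): ["bar chart", "dot plot"],
}


def suggest(characterizations):
    """Return the set of catalogued graphics for a combination of variables,
    or an empty list when no graphic is associated with it."""
    return CATALOG.get(combination_key(characterizations), [])


print(suggest(["quantitative", "unordered"]))
print(suggest(["ordered", "ordered", "ordered"]))
```

The top-down step produces the key from the data, while the bottom-up step is the work of filing each graphic under the keys it can represent; an empty result corresponds to a combination for which no graphic has been catalogued.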

The dimensions of the identified variables are qualities that correspond to different aspects of the variables, for example, their scale of measurement or the consideration of a variable as an index. The levels of each dimension are the various exclusionary values that a specified variable can have for each dimension. Variables may be recodified into different levels, however, in order to broaden the search for an acceptable graphic solution.

It is possible that a certain combination of variables cannot be associated with any graphic, or that the task to be performed is more efficiently executed with a different combination derived from the original. Keeping track of data provenance makes it possible to evaluate all the possible derivations as well as inform the user of the changes in the perception of the graphics.

Below we will describe a new approach to automating statistical graphics based on the characteristics of the variables. In this section, we will describe the dimensions and levels in which we propose to characterize the variables and provide examples of our proposal’s application.

Graphic measurement scale

The first variable-characterization dimension considers the relationship of order among the values of the variable and the possibility of quantifying it as greater or lesser, or a relationship that allows for the addition of the values. A first distinction should be made between qualitative and quantitative variables, since the values of the former cannot be summed up. Among the qualitative variables, we can distinguish ordered variables ${Or}$, which have values that maintain an ordered relationship, such as urban concentrations like “town,” “city,” and “metropolis,” from unordered variables ${Un}$ that do not maintain such a relationship, like the classification of vertebrate animals, such as “mammals,” “birds,” “reptiles,” “amphibians,” and “fishes.” A final consideration is that qualitative variables must be ordered one way or another if they are to be represented graphically. Thus, many graphical methods may be used for both types of qualitative variables. Nonetheless, it is worth making this distinction given that there are specific graphical methods that accentuate the different nature of these variables, such as a matrix with reorderable columns and rows as opposed to a matrix with fixed columns and rows.

Among the quantitative variables, we differentiate three levels based on the number of limits between which the values of the variable are bounded. The first level is composed of variables with arbitrarily referenced scalar values ${0 i}$, meaning those whose origin or zero is set by convention or is of no interest, such that the assignation of a number to a particular attribute determines that of another object with the same attribute once the origin or zero has been arbitrarily set. Such is the case of values measured according to an interval scale, like the temperature in degrees Celsius or a person’s year of birth. A second level is composed of scalars bounded at one end ${1 i}$, meaning those whose origin or zero is not arbitrary, which typically correspond to variables with values measured on a ratio scale, such as a person’s age based on the date of birth, or the volume of daily transactions in a stock exchange, which has zero as its minimum. The third level consists of variables composed of scalars bounded at both ends ${2 i}$, which typically correspond to values measured according to an absolute scale, such as the probability that a certain event will occur, which is bounded between zero and one.

The graphic measurement scales used for the variables are susceptible to being changed under certain circumstances. For instance, when the variable scales are bounded on two ends, it may be convenient to recodify them as bounded on one end. This is the case when all the values to be represented are concentrated close to one of the ends. An example of this is a dataset of the probability of three independent events ($P_{a} = 0.132$, $P_{b} = 0.117$, $P_{c} = 0.125$) for which it would be difficult to distinguish the variation between the values in a graphic with a scale bounded between zero and one (see Figure 1, top); in this case, they are better distinguished using a bar graph with a baseline of zero and an upper limit determined by the highest observed value (see Figure 1, bottom). Similarly, it may be beneficial to recodify variable scales that are bounded on one end as unbounded if what we are interested in illustrating is the variation between observed values instead of their relation to an origin or zero. A case in which just such a recodification makes sense is with the values of the dataset on Lake Huron’s water level between 1875 and 1972 (see Figure 2, top). When codified on a scale bounded on the low end, the resulting graphic makes it difficult to appreciate the variations in water level. But if recodified on an unbounded quantitative scale, the resulting graphic accentuates this variation (see Figure 2, bottom). In an analogous way, the unbounded quantitative variables may be recodified by grouping the values in ordered categories by intervals; for example, participants in an experiment may be categorized by year of birth (born before 1990, born from 1990 to 2000, and born after 2000). Finally, ordered qualitative variables may be transformed into unordered qualitative variables if the type of analysis dispenses with the ordered relationship. For instance, the days of the week may be recodified as work days and non-work days regardless of order or sequence. Table 4 summarizes the levels and possible recodifications between graphic measurement scales.
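The effect of recodifying a quantitative scale on the axis of a graphic can be illustrated with a minimal Python sketch. The function name `axis_limits` and the convention that the fixed end of a one-end-bounded scale is the lower bound are assumptions made only for this illustration; the probabilities are the three values quoted above.

```python
from typing import Sequence, Tuple


def axis_limits(values: Sequence[float], scale: str,
                bounds: Tuple[float, float] = (0.0, 1.0)) -> Tuple[float, float]:
    """Return (lower, upper) axis limits for a quantitative graphic scale.

    scale  -- "2i" (bounded on both ends), "1i" (bounded on one end)
              or "0i" (unbounded), using the codes from the text.
    bounds -- theoretical bounds of the scale; only bounds[0] is used for "1i"
              (assumption: the fixed end is the lower one).
    """
    if scale == "2i":
        return bounds                      # full theoretical range
    if scale == "1i":
        return (bounds[0], max(values))    # keep the origin, fit the top
    if scale == "0i":
        return (min(values), max(values))  # fit both ends to the data
    raise ValueError(f"not a quantitative scale code: {scale!r}")


probabilities = [0.132, 0.117, 0.125]
print(axis_limits(probabilities, "2i"))  # (0.0, 1.0)
print(axis_limits(probabilities, "1i"))  # (0.0, 0.132)
print(axis_limits(probabilities, "0i"))  # (0.117, 0.132)
```

Each recodification narrows the displayed range, which is why small differences between values become easier to read.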

Figure 1.

Scalar bounded on both ends and recoded to bounded on one end.

Figure 2.

Scalar bounded on one end and recoded to unbounded.

Table 4.

First dimension: graphic measurement scales.

Code	Graphic scale	Recodifications
Un	Unordered qualitative	Un
Or	Ordered qualitative	Or
$0 i$	Unbounded quantitative	$0 i$
$1 i$	One-way bounded quantitative	$1 i$
$2 i$	Two-way bounded quantitative	$2 i$
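The one-way recodifications described in the text (two-way bounded to one-way bounded, one-way bounded to unbounded, unbounded to ordered qualitative, ordered to unordered qualitative) can be sketched as a simple ordered chain. The names `SCALE_ORDER`, `reachable_scales`, and `can_recode` are hypothetical, and treating the recodifications as transitive is an assumption of this sketch rather than something the table states.

```python
from typing import List

# Scale codes from Table 4, ordered from most to least constrained.
SCALE_ORDER: List[str] = ["2i", "1i", "0i", "Or", "Un"]


def reachable_scales(code: str) -> List[str]:
    """Scales a variable can be recodified to, including its own scale."""
    start = SCALE_ORDER.index(code)  # raises ValueError for unknown codes
    return SCALE_ORDER[start:]


def can_recode(source: str, target: str) -> bool:
    """True if `source` can be recodified as `target` along the chain."""
    return target in reachable_scales(source)


print(reachable_scales("0i"))  # ['0i', 'Or', 'Un']
print(can_recode("2i", "Or"))  # True
print(can_recode("Un", "Or"))  # False: order cannot be recovered
```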

Data aggregation method

The second variable-characterization dimension distinguishes between sequential and non-sequential variables. Non-sequential variables are further divided into two levels according to the difference between the cardinality of the variable (the number of unique values it takes in a dataset) and the cardinality of the variable scale (the number of unique values that could potentially be observed).

We have identified three levels in this dimension. The first level, which we call sequential variables ${seq}$, encompasses variables that are typically found in a column of data and that are presented in the order in which their values were acquired. Typical variables of this type are successions of years or days, but other examples include the order in which survey respondents were interviewed and the order in which a series of experiments was administered. Given that the order information may be fixed by the position of the values in a data column, the sequential variable may be any variable with values in a fixed position, and not necessarily a column of values with information on the order of acquisition of the values of the rest of the variables.

Additionally, we have non-sequential variables; in this case, the position of the values in a data column does not correspond to the order in which they have been acquired. Among non-sequential variables, we distinguish between variables of population type ${pop}$ and those of sample type ${sam}$ , which comprise the other two levels of this dimension. The population type variable is the second level. What distinguishes it is that the cardinality of the values in the data coincides or practically coincides with the cardinality of the variable’s scale; thus, if a certain value of the variable has not been observed, it may be beneficial to specify all the potentially observable values of the scale. Examples of this variable type are typically those that take values from a predetermined set, for example, the respondent’s sex, educational level, and age group (young, adult, and elderly, or other age groups defined by equidistant intervals).

On the other hand, our third level, sample type variables, is composed of variables with a scale cardinality that is so high with respect to the cardinality of the data that it does not make sense to predetermine the set of potential scale values; instead, the scale may be defined as the range between a minimum and a maximum value. Examples of this type of variable are a column of winning lottery numbers, a list of common names for newborns, and the weight in grams of students in a physical education course.

Sample type variables can be easily recodified as population type variables by grouping observed values in equidistant intervals or categories, such as age groups. The inverse recodification, however, is not possible because the grouped data lacks information about specific observations. It is also possible to recodify sequential variables as population and sample type variables if the sequence in which the values have been obtained does not interfere with the analysis and, consequently, does not need to be graphically represented. But here, too, the inverse recodification is not possible given that the population type and sample type variables do not contain information about the sequence in which their values were acquired. Table 5 summarizes the data aggregation levels and the possible recodifications between them.

Table 5.

Second dimension: data aggregation method.

Code	Aggregation method	Recodifications
sam	Sample type variable	sam
pop	Population type variable	pop
seq	Sequence type variable	seq
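Recodifying a sample type variable as a population type variable by grouping values into equidistant intervals can be sketched as follows. The function name `to_population`, the half-open interval convention, and the example ages are assumptions chosen for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def to_population(values: Sequence[float], width: float) -> List[Tuple[float, float]]:
    """Recode each observed value to its equidistant interval [lo, lo + width)."""
    intervals = []
    for v in values:
        lo = math.floor(v / width) * width  # lower edge of the interval holding v
        intervals.append((lo, lo + width))
    return intervals


ages = [23, 37, 41, 58, 64]                 # hypothetical sample type values
age_groups = to_population(ages, width=20)
print(age_groups)
# [(20, 40), (20, 40), (40, 60), (40, 60), (60, 80)]
print(sorted(set(age_groups)))              # the population type scale
# [(20, 40), (40, 60), (60, 80)]
```

Because 23 and 37 both map to the interval (20, 40), the original observations cannot be recovered from the grouped data, which is why the inverse recodification is not available.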

Cyclicality

The third data characterization dimension considers the possible cyclicality of the variables, which could suggest more specific graphical representations that use polar, cylindrical, or spherical coordinate systems. In this dimension, we differentiate between two levels: cyclic variables (cycl) and noncyclic variables (ncyc). Quantitative and ordered qualitative variables can be characterized as cyclical, but unordered qualitative variables cannot, precisely because their values lack an ordered relationship. Additionally, the cyclic or noncyclic character of the variables allows for two-way recodification thanks to the duality of some variables, such as the days of the week, that can be characterized as one or the other depending on the values and the type of analysis. Table 6 summarizes the cyclicality levels and their possible recodifications.

Table 6.

Third dimension: cyclicality.

Code	Periodicity	Recodifications
cycl	Cyclic	cycl
ncyc	Noncyclic	ncyc
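One reason cyclic variables suit polar coordinate systems is that the last level ends up adjacent to the first. The sketch below places the levels of an ordered variable at equal angles around a circle; the function name `polar_angles` and the choice of equal angular spacing are assumptions made for this illustration.

```python
import math
from typing import List


def polar_angles(n_levels: int) -> List[float]:
    """Angle in radians for each of n_levels ordered, equally spaced levels."""
    if n_levels < 1:
        raise ValueError("n_levels must be at least 1")
    step = 2 * math.pi / n_levels
    return [i * step for i in range(n_levels)]


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angles = polar_angles(len(days))
step = angles[1] - angles[0]
wrap_gap = 2 * math.pi - angles[-1]  # angular distance from Sun back to Mon

# On a circle, Sunday is as close to Monday as Monday is to Tuesday.
print(math.isclose(step, wrap_gap))  # True
```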

Explicitness

The fourth dimension considers the graphic possibility of representing variables on a non-explicit scale. This typically happens when the particular values of the variables are not an essential element in the analysis, but other characteristics of these variables are; an example is the cardinality of the data. Scatter graphs provide a clear illustration of how graphics utilize this dimension; they depict points without the possibility of our knowing the order of each observation or, for that matter, any other unique value identifier for any point. Here, we distinguish between two levels: explicit variables (exp) and ambiguous variables (amb). It is possible to recodify this characteristic of a variable in both directions given that this simply impacts whether or not the scale of the variable is represented. Table 7 summarizes the explicitness levels and the possible recodifications between them.

Table 7.

Fourth dimension: explicitness.

Code	Explicitness	Recodifications
exp	Explicitly represented	exp
amb	Ambiguously represented	amb

Length of variables

The fifth dimension is the length of variables as defined by Bertin,¹⁹ referring to the number of unique values that it is useful to identify, and, consequently, to represent. The length of a variable should not be confused with the cardinality of its values in the data, which is the number of unique values obtained for a variable in a dataset, nor with the cardinality of its scale, which is the number of unique values that can potentially be obtained for a variable. In the case of categorical variables, it may be beneficial to combine two categories into one when the difference between them is irrelevant for a given analysis. In the case of quantitative variables, it is possible to reduce their length by doing away with decimal values when the precision they provide is irrelevant for a given analysis. It is also possible to recodify variables from sample to population type using equidistant intervals or varying intervals if the precision required is not uniform throughout the domain. Finally, quality metrics can be used to establish optimal length, for example, the number of bins to be represented in a histogram.

The length of a variable is especially useful when selecting a graphical representation because each visual variable makes it possible to differentiate between a varying number of different levels. Additionally, a visual variable can have one defined length limit to identify the different categories of a variable characterized as ${U n^{pop}}$ (e.g. the use of color to distinguish different countries on a line graph) and another length limit for ${0 i^{sam}}$ type variables (e.g. the use of color on a heat map). The number of rows and columns on a table is also limited by screen size as well as the font size of the accompanying text. The length of a variable also limits the possible use of translations, rotations, and reflections to the point that there are graphical representations that are suited to variables with a determined length. For example, population pyramids are specific to dichotomous variables.

Although there are many rules on the length of variables that can be used to evaluate the suitability of a graphic, it is useful to differentiate between variable lengths equal to one and those equal to two. For variable lengths equal to one, there are well-known specific graphics, such as an analog barometer that shows the atmospheric pressure at a given time and a dichotomous pie chart that shows proportions relative to a whole. For variable lengths equal to two, it is important to examine logical variables and dichotomous factorial variables in many datasets, as they may suggest the use of reflections. For other variable lengths, we view it as preferable to use ad hoc limits with which to characterize the graphics.

The characterization of graphics based on the multidimensional characterization of the data

In this section, we show how the different data characterization dimensions make it possible to classify statistical graphics in increasingly smaller groups, thus limiting the gamut of possible graphics for a given dataset. To classify graphics based on the characteristics of the identifiable variables, it is necessary for the variables to be associated with at least one visual variable, even when they have gone through a transformation. For this reason, variables that have had their dimensions reduced via the application of a statistical method cannot be deduced from the graphic analysis. However, statistical methods that reduce dimensions produce new datasets with variables that have new characteristics and that can be associated with another gamut of suitable graphics.

Matrix of graphic measurement scales and data aggregation methods

The first two dimensions introduced in the previous section allow us to characterize the variables according to each of the cells in the matrix shown in Figure 3, in which the columns represent the graphic measurement scales and the rows represent the data aggregation methods. Figure 3 also indicates the possible recodifications of the levels of the variables based on these two dimensions.

Figure 3.

Matrix of graphic scales and data aggregation methods.

The matrix of graphic measurement scales and data aggregation methods produces a total of 15 combinations in which each variable in a dataset can be placed. Below we demonstrate different combinations for one, two, and three variables, with graphics that suit the number and characteristics of these variables.

One-variable combinations

As previously stated, examples of graphics that represent one single variable can be placed in each of the 15 combinations. A list of tasks to be performed or of winning lottery numbers, for example, can be classified as a set of values for a sample type unordered qualitative variable ${U n^{sam}}$ if the values lack an order and the data does not include all the tasks that could be performed nor all the lottery numbers that could have won. If the list of values or the winning numbers are ordered according to priority or prize amount, the characterization would be ${O r^{sam}}$ and they would be represented in order with the primary value in the first position.

The different observations of an unbounded scalar ${0 i^{sam}}$ can be represented in a uniaxial point graph. If the observations are with respect to a scalar bounded on one end ${1 i^{sam}}$, the representation can be a single bar chart or a multiple bar chart with added random jitter. If the scalar is bounded on both ends ${2 i^{sam}}$, a dichotomous pie chart is a possibility.

The reorderable matrix conceived by Bertin²⁰ is a possible representation of, for example, a column of responses to closed-ended questions in which the choices are a reduced number of unordered categories ${U n^{pop}}$ . In the case of responses to closed-ended questions where the choices are values on a Likert-type scale ${O r^{pop}}$ , a semi-orderable matrix is a possible representation. In the case of responses to closed-ended questions where the choices are unbounded scalars ${0 i^{pop}}$ , such as one’s favorite year during the 1990s, a dot plot is a possible representation. When the possible responses are among scalars bounded on one end ${1 i^{pop}}$ , then a bar chart may be used. And in the case of scalars bounded on both ends ${2 i^{pop}}$ , an example of a possible representation is a series of bar charts with bars ranging between 0% and 100%.

A data column can also fix the sequence in which values were acquired, and in the case of unordered attributes like ${U n^{seq}}$ , they can be represented as a sequentially ordered list or as an arc diagram with orderable nodes. In the case of ${O r^{seq}}$ , they can be represented through a checklist in a sequentially ordered Likert-type scale or an arc diagram. In the case of scalars ${0 i^{seq}}$ , ${1 i^{seq}}$ , ${2 i^{seq}}$ , possible representations include line graphs, area graphs, and area graphs bounded on both ends. Figure 4 lists the aforementioned graphics together with a small matrix that identifies the combinations to which they correspond (refer also to Figure 3).

Figure 4.

Examples of graphics suitable for one-variable combinations.
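The catalog idea behind the one-variable examples can be sketched as a lookup table keyed by the pair (graphic scale, aggregation method). The names `ONE_VARIABLE_CATALOG` and `suggest` are hypothetical. The entries are one reading of the examples given in the text; in particular, pairing line graphs, area graphs, and area graphs bounded on both ends with the three sequential scalar types in that order is an assumption.

```python
from typing import Dict, List, Tuple

# (scale code, aggregation code) -> candidate graphics named in the text.
ONE_VARIABLE_CATALOG: Dict[Tuple[str, str], List[str]] = {
    ("Un", "sam"): ["list"],
    ("Or", "sam"): ["ordered list"],
    ("0i", "sam"): ["uniaxial point graph"],
    ("1i", "sam"): ["single bar chart", "multiple bar chart with random jitter"],
    ("2i", "sam"): ["dichotomous pie chart"],
    ("Un", "pop"): ["reorderable matrix"],
    ("Or", "pop"): ["semi-orderable matrix"],
    ("0i", "pop"): ["dot plot"],
    ("1i", "pop"): ["bar chart"],
    ("2i", "pop"): ["series of bar charts ranging from 0% to 100%"],
    ("Un", "seq"): ["sequentially ordered list", "arc diagram with orderable nodes"],
    ("Or", "seq"): ["checklist on a sequentially ordered Likert-type scale",
                    "arc diagram"],
    ("0i", "seq"): ["line graph"],
    ("1i", "seq"): ["area graph"],
    ("2i", "seq"): ["area graph bounded on both ends"],
}


def suggest(scale: str, aggregation: str) -> List[str]:
    """Return the candidate graphics for a one-variable characterization."""
    return ONE_VARIABLE_CATALOG.get((scale, aggregation), [])


print(len(ONE_VARIABLE_CATALOG))  # 15
print(suggest("1i", "sam"))
# ['single bar chart', 'multiple bar chart with random jitter']
```

A full system would extend the key with the remaining dimensions (cyclicality, explicitness, and length), so that each added dimension shortens the returned list.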

Two-variable combinations

With two variables, the number of combinations increases to 120; in this section, therefore, we will only look at some possible graphics for a reduced number of combinations.

If we have two variables of type ${U n^{sam}}$ , one possible representation is two lists; for instance, a list of tasks to be performed and another of people to be contacted. But if the observations are paired, some possible representations include a list of paired observations and a mapping diagram. Two variables of type ${0 i^{sam}}$ can be graphically represented by two uniaxial point graphs, or in the case of paired observations, a scatterplot or a parallel coordinate diagram. In the case of two variables of type ${1 i^{sam}}$ , one representation could be a scatterplot with drop lines or two bar graphs with added random jitter. Two variables of type ${U n^{pop}}$ can be graphically represented in two reorderable matrices or a binary bidirectional reorderable matrix in the case of paired observations. And two variables of type ${O r^{pop}}$ can be represented in two semi-reorderable matrices or, in the case of paired observations, in a bidirectional binary matrix that does not allow for the changing of row or column positions (Figure 5).

Figure 5.

Examples of graphics suitable for two-variable combinations.

Combinations of type ${U n^{pop}, 0 i^{sam}}$ can be represented as simple point graphs or a list of uniaxial point graphs. For combinations of type ${U n^{pop}, 1 i^{sam}}$ or type ${O r^{pop}, 1 i^{sam}}$ , one possibility is to use a bar or pie chart if there is one observation per category, or a bar graph with added random jitter if there are multiple observations per category. Combinations of type ${U n^{pop}, 2 i^{sam}}$ can be represented using a pie chart, a matrix of pie charts, or a stacked or progressive bar graph.

For combinations of type ${0 i^{sam}, 1 i^{sam}}$ , it is possible to use a scatter graph with drop lines. For combinations ${0 i^{sam}, 2 i^{sam}}$ , a quantile graph or percentile graph can be used. In the case of an unbounded scalar population type variable and a bounded on one end scalar sample type variable ${0 i^{pop}, 1 i^{sam}}$ , a histogram or violin plot can be used.

Combinations that include one sequential variable, such as ${U n^{seq}, 0 i^{sam}}$, can be represented with a sequentially ordered list linked to a simple point graph or an arc diagram with reorderable nodes linked to a point graph. Combinations of type ${O r^{seq}, 0 i^{sam}}$ can be graphically represented with a checklist of responses on a Likert-type scale linked to a simple point graph or in an arc diagram with nodes linked to uniaxial point graphs. For combinations of type ${U n^{seq}, 1 i^{sam}}$, possible representations include a sequentially ordered list linked to a simple bar graph or an arc diagram with reorderable nodes linked to a bar graph. For combinations of type ${O r^{seq}, 1 i^{sam}}$, possible representations include a checklist of responses on a Likert-type scale linked to a simple bar graph or an arc diagram linked to a bar graph with added random jitter. Finally, the combination ${0 i^{seq}, 1 i^{sam}}$ goes well, for example, with a volume chart like those that tend to accompany the line graphs that show the evolution of stock prices. Figure 5 summarizes the graphics for two-variable combinations.

Three-variable combinations

Graphics that represent three or more variables usually use not just spatial variables, but also other retinal variables that make it possible to distinguish between a more limited range of values. The use of one retinal variable or another produces a great variety of possible graphics per dataset. With three variables, the number of combinations increases to 680; in this section, therefore, we will identify possible graphics for a reduced number of combinations.
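The counts quoted for one, two, and three variables (15, 120, and 680) are consistent with counting unordered selections with repetition from the 15 cells of the matrix, that is, $\binom{15 + k - 1}{k}$ for $k$ variables. The following sketch checks this by enumeration; that the order of the selected variables is irrelevant and that two variables may share a cell are assumptions inferred from the figures, and the names `CELLS` and `count_combinations` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

SCALES = ["Un", "Or", "0i", "1i", "2i"]  # graphic measurement scales
AGGREGATIONS = ["sam", "pop", "seq"]     # data aggregation methods
CELLS = [(s, a) for a in AGGREGATIONS for s in SCALES]  # 15 matrix cells


def count_combinations(k: int) -> int:
    """Number of unordered k-variable combinations, repetition allowed."""
    return sum(1 for _ in combinations_with_replacement(CELLS, k))


for k in (1, 2, 3):
    # Enumeration agrees with the closed-form multiset coefficient.
    assert count_combinations(k) == comb(len(CELLS) + k - 1, k)
    print(k, count_combinations(k))
# 1 15
# 2 120
# 3 680
```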

Combinations of type ${U n^{pop}, U n^{pop}, 0 i^{sam}}$ can be represented by a reorderable matrix with cells that have varying color densities, based on the values of sample type variables, if said value is unique for each pair of ${U n^{pop}, U n^{pop}}$ variables, as would be the case with the frequencies observed in each position. The case is similar for combinations of the type ${0 i^{pop}, 0 i^{pop}, 0 i^{sam}}$ for which it would be possible to use a heat map. When the three variables are characterized as ${0 i^{pop}, 0 i^{pop}, 0 i^{pop}}$ , a heat map with a reduced range of color densities based on equidistant intervals could be used. Figure 6 summarizes the graphics for three-variable combinations.

Figure 6.

Examples of graphics suitable for three-variable combinations.

Improving graphic selection with cyclicality

The third data characterization dimension is cyclicality. As previously stated, this dimension is applicable to ordered qualitative variables and quantitative variables, but not to unordered qualitative variables. Cyclicality helps narrow the gamut of possible graphics because cyclic variables can be more effectively represented in graphics with polar, cylindrical, or spherical coordinate axes. Below we will show different combinations that include this dimension and suggest graphics that could be used with them.

One-variable combinations of type ${0 i_{cycl}^{sam}}$ can use a compass chart if there is only one observation or a circular point graph if there are multiple observations. Combinations of the type ${{Or}_{cycl}^{seq}}$ can use a small-world network, and combinations of the type ${0 i_{cycl}^{seq}}$ can use a circular arc diagram.

Combinations of two variables of type ${{Or}_{cycl}^{pop}, 0 i^{sam}}$ can be represented in a radar graph, and type ${{Or}_{cycl}^{pop}, 1 i^{sam}}$ in a pie chart. For type ${0 i_{cycl}^{pop}, 0 i^{sam}}$ , circular line graphs can be used, and for type ${0 i_{cycl}^{pop}, 1 i^{sam}}$ , circular area graphs or circular silhouette graphs can be used. For combinations of two variables of sample type ${0 i_{cycl}^{sam}, 0 i^{sam}}$ , it is possible to use a scatter graph with polar coordinates.

Finally, for combinations of three variables of type ${0 i_{cycl}^{sam}, 0 i^{sam}, U n^{pop}}$ , a grouped scatter graph can be used. Figure 7 summarizes the graphics that are suitable for use with cyclic variables.

Figure 7.

Examples of graphics suitable for use with cyclic variables.

Improving graphic selection with explicitness

The fourth data characterization dimension is explicitness. This dimension can be attached at any level of the previously mentioned dimensions and facilitates graphic selection because, as with the previous dimensions, it limits the gamut of graphic possibilities in accordance with the characterization of the variables as explicit or ambiguous.

A classic example of a variable that tends to be represented as ambiguous is the order of respondents in an opinion survey. Usually, the order is irrelevant; what is relevant is that each respondent be unique, that the information be structured based on multiple responses from each respondent, and that the sample consist of a concrete number of respondents.

Graphics that contain ambiguous variables display information about these variables, but not their scale. For example, to see whether men’s and women’s ages are equally distributed in a population, a population pyramid is often employed, but it is not necessary to know which side represents the male and female populations to discern if there is symmetry or not. In this case, the characterization of the “gender” variable as ambiguous would result in a population pyramid that would not display the values of the scale for this variable.

Here is another example of how explicitness in the characterization of variables can suggest a more precise graphical representation. Previously, we indicated that two variables of type ${0 i^{sam}}$ can be graphically represented by two juxtaposed uniaxial point graphs or, in the case of paired observations, a scatter graph or a parallel coordinate graph. If to these two variables we add a third that indicates the order of respondents without this order being relevant (therefore characterized as ${0 i^{sam}, 0 i^{sam}, {Or}_{amb}^{pop}}$, as shown in Figure 8), it is no longer adequate to represent this combination of three variables with juxtaposed uniaxial point graphs because the points of both diagrams are not related to the third variable.

Figure 8.

Examples of graphics suitable for three-variable combinations that include ambiguous variables or having a specific length.

Improving graphic selection with variable length

The length of a variable is a crucial factor when defining the gamut of possible graphics, for several reasons. First, there are specific graphical representations for certain variable lengths. Examples include an analog clock with a single hand, which represents a variable of type ${^{(1)} 0 i_{cycl}^{sam}}$; graphics that make use of reflection, such as population pyramids of type ${0 i^{pop}, 1 i^{sam}, {Un}_{(2)}^{pop}}$ (also included in Figure 8), which are specific to combinations that include a dichotomous variable like “gender”; and the fourfold display, which is specific to combinations of two dichotomous variables. In other cases, the length of the variable can suggest the use of translations to avoid the overlapping of marks in a single panel and can also suggest the use of one visual variable or another based on their number of distinct values.

Examples of graphic types presented to the user based on a small dataset

Having described the characterization of graphics based on the multidimensional characterization of data, we now present the results that an automated statistical graphics system might suggest based on this framework. For this purpose, we used the Loblolly dataset, limited to four variables, that relates the growth of loblolly pine trees in 84 plantations. For each plantation, the dataset includes the average “height” of the trees measured in feet, the “age” of the plantation in years, and the source “seed” for the trees. Despite this dataset’s limited number of variables, the results would also be valid for combinations of variables with the same characteristics in other datasets.

Characterization of variables

The plantation “Id” variable is composed of unordered categories. The aggregation mode of the data, assuming that all the values of interest are present, can be characterized as population type, but given that the order in which these numeric codes appear is not strictly ascending, it is preferable to characterize the variable as sequential type in order to not lose information that could be of interest. In terms of the other dimensions, this variable is characterized as noncyclic and explicit (because its values are to be presented graphically), and its length is 84.

The variable “height” is composed of scalars bounded on one end of sample type (given that this variable’s 84 values represent a small sample of the potentially observable values). Its domain is noncyclic, the scale is explicit, and its length is nearly 650 if we consider that a tenth of a foot is sufficiently precise for the graphic’s decodification.

The variable “age” is also composed of scalars bounded on one end. The number of unique values is six, each with a frequency of 14, such that this variable is characterized as population type, noncyclic, explicit, and with a length of 6.

The variable “seed” is composed of 14 qualitative categories ordered according to the results obtained in the variable “height.” It is a variable of population type, the domain is noncyclic, the scale is explicit, and the length is 14 (the number of categories).
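The characterization of the four variables described above can be captured in a small record type with one field per dimension. This is a minimal sketch: the class name `Characterization`, its field names, and the `code` method are hypothetical, and the length of “height” is entered as 650 although the text gives it only approximately.

```python
from dataclasses import dataclass

SCALES = ("Un", "Or", "0i", "1i", "2i")
AGGREGATIONS = ("sam", "pop", "seq")


@dataclass(frozen=True)
class Characterization:
    name: str
    scale: str        # graphic measurement scale (first dimension)
    aggregation: str  # data aggregation method (second dimension)
    cyclic: bool      # cyclicality (third dimension)
    explicit: bool    # explicitness (fourth dimension)
    length: int       # number of values worth distinguishing (fifth dimension)

    def __post_init__(self) -> None:
        if self.scale not in SCALES:
            raise ValueError(f"unknown scale: {self.scale!r}")
        if self.aggregation not in AGGREGATIONS:
            raise ValueError(f"unknown aggregation: {self.aggregation!r}")
        # Unordered qualitative variables cannot be cyclic (no ordered relationship).
        if self.cyclic and self.scale == "Un":
            raise ValueError("unordered qualitative variables cannot be cyclic")

    def code(self) -> str:
        """Compact scale^aggregation code, for example '1i^sam'."""
        return f"{self.scale}^{self.aggregation}"


loblolly = [
    Characterization("Id", "Un", "seq", cyclic=False, explicit=True, length=84),
    # Length is approximate ("nearly 650" at a precision of a tenth of a foot).
    Characterization("height", "1i", "sam", cyclic=False, explicit=True, length=650),
    Characterization("age", "1i", "pop", cyclic=False, explicit=True, length=6),
    Characterization("seed", "Or", "pop", cyclic=False, explicit=True, length=14),
]

print([v.code() for v in loblolly])
# ['Un^seq', '1i^sam', '1i^pop', 'Or^pop']
```

Validating the characterization at construction time keeps impossible combinations, such as a cyclic unordered variable, from reaching the graphic selection step.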

In order to organize the possible graphics based on the selected variables and their possible recodifications, we first describe the graphics that can represent each variable separately. Then we combine two or more variables characterized a priori. Finally, we identify other possible graphical representations from a selection of specific variables in which a recodification is applied to at least one of the variables. The resulting graphics, together with the combination of selected variables, are listed in Figures 9–11; the number that follows the name of each graphic in these figures refers to the number of the figure in the supplementary materials.

Figure 9.

One-variable combinations.

Figure 10.

Combinations with more than one variable.

Figure 11.

Combinations with one or more recodified variables.

Combinations

One-variable combinations

If we select the “Id” variable separately, given that the variable length is 84, one possible representation is a list of these codes ordered by their position in the dataset. The list could be presented as an array of values ordered by rows and columns, a single row or column with a scroll bar, or several ordered panels that the user can click through. If we select the “height” variable, the data could be presented on a jittered point graph with drop lines or a point graph with drop lines (see Figure 12). If we select the “age” variable, the data could be presented on a superposed area graph with rectangles that have one dimension proportional to age and the other to the frequency count of the values or a point graph with drop lines similar to the one in Figure 12. Finally, if we only select the “seed” variable, the data can again be presented in an ordered list, a dot chart with drop lines or a bar graph with the length of each line or bar proportional to the frequency count, and ordered according to the order assigned to this qualitative variable.

Figure 12.

Point graph with drop lines.

Combinations with more than one variable

Any combination of the “Id” variable with the others can be represented via a semi-graphic table. If we combine the variable “Id” with “height” or with “age,” we can present an ordered list in which the column that corresponds to “height” or “age” can be represented via a simple bar graph with bars proportional to “height” or “age.” Given the length of the “Id” variable, the same aforementioned techniques can be used to display the list. If we include the “seed” variable, each row can include a mark filled with different color densities in a sequential increase in accordance with the “seed” variable’s 14 ordered values. If the four variables are selected, the table can include all of the aforementioned columns.

The combination “height” and “age” can be represented via a superposed area graph with rectangles that have dimensions that are proportional to these two variables (see Figure 13). Given that the “age” variable is described as a population type, the rectangles can be grouped by age and then represented in an array of superposed area graphs. The combination “height” and “seed” can be represented with an array of jittered point graphs with drop lines ordered by seed type. The combination “age” and “seed” can be represented by an array of superposed area graphs also ordered by seed. If the variables “height,” “age,” and “seed” are selected, these can be represented by an array of superposed area graphs ordered by seed type.

Figure 13.

Superposed area graph.

Combinations with recodified variables

One possible recodification is to consider “height” as an unbounded scalar variable. This makes sense if we are interested more in the relationship between the observed values than in their relationship with the origin or zero. If in this case only this variable is selected, it could be represented in a uniaxial point graph that can use various techniques to avoid point collisions, or in a stripe graph. If selected together with the “Id” variable, the semi-graphic table column that corresponds to the variable “height” would display a simple point graph instead of a simple bar graph. A possible representation of the “height” variable combined with the “age” variable is a graph that plots age on its x-axis and height on its y-axis and connects the points to the y-axis with lines (see Figure 14). Finally, the combination of the “height” and “seed” variables would now produce a series of point graphs or stripe graphs ordered by seed type according to the order assigned for this variable.

Figure 14.

Point graph with drop lines.

If, in addition to the aforementioned recodification, we also regard the “age” variable as an unbounded scalar variable, the selection of the “height” and “age” variables would result in a scatter graph that would not necessarily include zero on either of its axes. If we then recodify “height” as a population type variable and select only this variable, we would end up with a violin plot, a box plot, or a histogram. If we combine it with “age,” we would end up with a succession of any of the aforementioned diagrams ordered by age group.

Recodifying any of the variables as ambiguous would result in the scale for that variable being omitted from the graphic. This includes the superposition of panels instead of their juxtaposition, the exclusion of scale tags on the axes in the case of spatial variables, and the exclusion of the legend in the case of retinal variables. For example, the combination of the variables “height” and “age” recodified as unbounded scalars, together with the variable “seed” recodified as ambiguous, would produce a spaghetti plot (see Figure 15) instead of the aforementioned array of line graphs ordered by seed type.

Figure 15. Spaghetti plot.
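The recodification examples in this subsection can be read as lookups in a catalog that maps a combination of variable levels to a set of candidate graphics. The following is only a minimal sketch of that idea. The names `CATALOG` and `suggest_graphics` are hypothetical, each variable is reduced to a single level label (the full characterization is multidimensional), and the catalog holds only four combinations taken from the text above.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog: a sorted tuple of variable levels -> candidate graphics.
# Only combinations explicitly described in the text are listed; a real catalog
# would be far larger and would use the full multidimensional characterization.
CATALOG: Dict[Tuple[str, ...], List[str]] = {
    ("unbounded scalar",): ["uniaxial point graph", "stripe graph"],
    ("population",): ["violin plot", "box plot", "histogram"],
    ("unbounded scalar", "unbounded scalar"): ["scatter graph"],
    ("ambiguous", "unbounded scalar", "unbounded scalar"): ["spaghetti plot"],
}


def suggest_graphics(selection: Dict[str, str]) -> List[str]:
    """Return candidate graphics for a selection of variables.

    `selection` maps each selected variable name to the level it is currently
    (re)codified as. The order of selection does not matter, so the levels are
    sorted before the lookup. Unknown combinations yield an empty list.
    """
    key = tuple(sorted(selection.values()))
    return list(CATALOG.get(key, []))


# Recodifying "height" changes the candidates offered for the same variable.
print(suggest_graphics({"height": "unbounded scalar"}))
print(suggest_graphics({"height": "population"}))
print(suggest_graphics({
    "height": "unbounded scalar",
    "age": "unbounded scalar",
    "seed": "ambiguous",
}))
```

Keying the catalog on the levels rather than on the variable names reflects the point of the framework: the same dataset yields a different set of candidate graphics as soon as one of its variables is recodified.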

Discussion

In this section, we will compare our variable characterization proposal with the other solutions we reviewed and show how our new approach generates different and somewhat more accurate results. It should be noted that the systems we are comparing our framework to are in some way pioneering systems with limitations in terms of the set of graphics presented to users. The CHART system, for instance, only displays bar graph matrices that can have varying shading and circular graphic matrices of varying sizes. The BHARAT system only presents pie charts, bar graphs, and line graphs, as well as combinations of these. The APT, SAGE, BOZ, and EAVE systems display diagrams and networks, but only with two spatial dimensions. The NSP and Vista systems also include three-dimensional graphics, while the Polaris and Tableau systems include maps. Yet the catalog of graphic types in these four systems is also limited. Finally, the ViSta system emphasizes dynamic interaction with dynamically linked graphics, but the number of graphic types it offers is also limited.

Previously, we described the different strategies that make it possible to refine the selection of graphics based on the characteristics of the data, the user, the hardware, and the representation models. The characterization of the data presented follows the functional strategy based on the characteristics of the data. This allows graphical representations to be characterized based on the characterization of the data in order to present the user with a gamut of possible graphics for a given dataset. The double characterization of data and graphics has been implemented by systems like SAGE with SageBook and Tableau with Show Me.

Mackinlay⁷ considered this approach to be overly simplified because there was no guarantee that an appropriate design existed for such a great variety of situations. Therefore, it was necessary to consider the full list of ad hoc solutions, even though only a few alternatives might be acceptable. From our point of view, the argument that there is no guarantee of finding an appropriate method for a great variety of combinations serves, first, as a challenge to find these combinations and, second, as an opportunity for the creators of visualizations to propose appropriate graphical methods for these combinations. With respect to the need to consider the full list of ad hoc solutions, we believe that it is necessary to classify the greatest number of graphical methods precisely in order to discard those alternatives that are not acceptable.

The presentation of graphics without previously determining the task to be performed results in graphics that are suitable for a certain task with varying degrees of effectiveness. Systems like APT, Vista, and EAVE do not inquire as to the task, and provide only one, supposedly optimal, graphical representation. Conversely, the strategy we propose presents the user with several graphic possibilities for a dataset, as does the ViSta system,⁶ which also includes other considerations for suggesting graphics, such as the theoretical distribution, which it compares with the empirical, and the type of statistical analysis selected. In terms of the strategy presented, it has a drawback, though, in that it offers a limited gamut of graphics, selected ad hoc, for the user to choose from. In order to improve the automatic selection of graphics in accordance with the strategy presented, it would be beneficial to undertake cognitive studies that classify sets of possible graphics for each combination of variables based on the ease with which they make it possible to execute a series of perceptual tasks.

The characterization of the data presented derives from the work of Jacques Bertin, who, however, did not consider the different measurement scales for the quantitative variables; consequently, his characterization of data groups together, in one single combination, graphics as diverse as bar charts, pie charts, and stacked bar charts. Additionally, at the level of ordered variables, Bertin also mixes in qualitative variables that maintain a greater-to-lesser relationship as well as sequential variables, which results in a single combination with graphics as diverse as a Gantt chart and a semi-reorderable matrix.

With the CHART system, it is only possible to graph quantitative variables. Because of this, and because it was a pioneering system for automating statistical graphics, its gamut of graphics is very limited. Other systems, like APT, NSP, BOZ, Vista, EAVE, Polaris, Tableau, and VizRec, consider between two and six levels in a single dimension. With two levels, there are nine possible combinations of up to three variables. With three levels, this number increases to 19, with four to 34, with five to 55, and with six to 83. The SAGE system uses a bi-dimensional characterization, but the second dimension, the domain of membership, is unconvincing given that it does not consider other fundamental physical magnitudes, such as the intensity of an electrical current or of a light source, nor magnitudes derived from fundamental physical magnitudes. The BHARAT system has up to five dimensions; the first two are dichotomous, but the system does not establish predetermined dimensions for the others, and it appears the algorithm is forced to use ad hoc limits when evaluating each possible graphic, which makes it impossible to know the number of combinations this characterization enables. The characterization presented, considering only the first two dimensions, enables a total of 815 combinations of up to three variables, and therefore, the gamut of possible graphics for each combination is necessarily reduced.
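The counts quoted above are consistent with counting unordered selections of one, two, or three variables in which a level may repeat, that is, multisets of size 1 to 3 drawn from n levels: C(n, 1) + C(n + 1, 2) + C(n + 2, 3). The sketch below checks this arithmetic; the function name `count_combinations` is hypothetical. The same formula gives 815 for n = 15, but reading the first two dimensions as yielding 15 combined levels is an inference from the arithmetic, not something stated in this section.

```python
from math import comb


def count_combinations(levels: int, max_vars: int = 3) -> int:
    """Count unordered combinations of 1..max_vars variables.

    Each variable takes one of `levels` levels, and several variables may
    share the same level, so each term is a multiset coefficient
    C(levels + k - 1, k) for k selected variables.
    """
    return sum(comb(levels + k - 1, k) for k in range(1, max_vars + 1))


# Reproduce the figures quoted in the text for two to six levels.
for n in range(2, 7):
    print(n, count_combinations(n))

# Assumption: 15 combined levels; this reproduces the 815 quoted in the text.
print(15, count_combinations(15))
```

Raising `max_vars` above three shows how quickly the count grows with the number of selected variables, which is the reason the examples here stop at three.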

Limitations

Although this framework for classifying and automatically presenting graphics is valid for graphics that represent a great number of variables, this study is limited to graphic representations of a maximum of three variables. This is because, as the number of variables selected from a dataset increases, the number of possible combinations and the gamut of possible graphics for each combination increase exponentially. Each variable in a dataset can be represented in various forms (as points, lines, or areas), with various visual variables, and in various coordinate systems. Additionally, a juxtaposition or superposition of panels can be used. The study also does not consider variables composed of vector and tensor type values.

Conclusion and future research

We have presented a multidimensional characterization for individual variables that can serve as a framework for the classification of statistical graphics and make it possible to notably reduce the gamut of graphic possibilities for a given dataset. The proposed method can be used to automate the presentation of statistical graphics based on the characteristics of the data and also to find new combinations that do not presently have graphical methods associated with them, thus creating new opportunities to design novel visualizations.

The next step in this line of work would be to create a database of graphics that are characterized according to the typology of the data source that each graphic is compatible with. This database can be built from the graphics mentioned, for instance, in the scientific literature. For each combination of variables, we can create a tree of compatible graphics that also considers the possible recodifications between levels for each variable. A second complementary task would be the creation and distribution of an R package that would implement this characterization and make possible the presentation of a gamut of graphics that are compatible with a dataset. Finally, to improve the selection of adequate graphics for each situation, it would be necessary to include the perceptual task as a criterion in the selection of the graphical representation; this would require cognitive studies to evaluate the effectiveness of the graphics associated with each combination with a taxonomy of perceptual tasks.

Footnotes

Acknowledgements

The authors recognize the kindness, generosity, and valuable feedback of Michael Friendly.

Conflict of interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Friendly, Denis DJ. Milestones in the history of thematic cartography, statistical graphics, and data visualization, 2001, http://www.datavis.ca/milestones/ (accessed 21 August 2016).

2. Kamps. Diagram design: a constructive theory. Berlin, Heidelberg: Springer, 1999.

3. Schulz, Nocke, Heitzler. A systematic view on data descriptors for the visual analysis of tabular data. Information Visualization. Epub ahead of print 19 October 2016. DOI: 10.1177/1473871616667767, http://ivi.sagepub.com/cgi/content/abstract/1473871616667767v1; http://ivi.sagepub.com/cgi/rapidpdf/1473871616667767v1

4. Benson, Kitous. Interactive analysis and display of tabular data. New York: ACM.

5. Gnanamgari. Information presentation through default displays. PhD dissertation, University of Pennsylvania, Philadelphia, PA, 1981.

6. Valero-Mora, Ledesma, Friendly. The history of ViSta: the visual statistics system. Wires Comput Stat 2012; 4(3): 295–306.

7. Mackinlay. Automating the design of graphical presentations of relational information. ACM T Graphic 1986; 5(2): 110–141.

8. Roth, Mattis. Data characterization for intelligent graphics presentation. In: Proceedings of the SIGCHI conference on human factors in computing systems, Seattle, WA, 1–5 April 1990, pp. 193–200. New York: ACM.

9. Casner SM. Task-analytic approach to the automated design of graphic presentations. ACM T Graphic 1991; 10(2): 111–151.

10. Lam, Bertini, Isenberg. Seven guiding scenarios for information visualization evaluation. Technical report 992-04, 2011. Calgary, AB, Canada: University of Calgary.

11. Bertini, Tatu, Keim. Quality metrics in high-dimensional data visualization: an overview and systematization. IEEE T Vis Comput Gr 2011; 17(12): 2203–2212.

12. Mutlu, Veas, Trattner. VizRec: a two-stage recommender system for personalized visualizations. In: Proceedings of the 20th international conference on intelligent user interfaces companion (IUI Companion ’15), Atlanta, GA, 29 March–1 April 2015, pp. 49–52. New York: ACM.

13. Engelhardt. The language of graphics: a framework for the analysis of syntax and meaning in maps, charts and diagrams (ILLC dissertation series: instituut voor Taal, Logica en Informatie). PhD dissertation, Institute for Logic, Language and Computation, Universiteit van Amsterdam, Amsterdam, 2002.

14. Roth, Kolojejchick, Mattis. Interactive graphic design using automatic presentation knowledge. In: Proceedings of the SIGCHI conference on human factors in computing systems, Boston, MA, 24–28 April 1994, pp. 112–117. New York: ACM.

15. Stolte, Tang, Hanrahan. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE T Vis Comput Gr 2002; 8(1): 52–65.

16. Senay, Ignatius. A knowledge-based system for visualization design. IEEE Comput Graph 1994; 14(6): 36–47.

17. Mackinlay, Hanrahan, Stolte. Show me: automatic presentation for visual analysis. IEEE T Vis Comput Gr 2007; 13(6): 1137–1144.

18. Robertson. A methodology for scientific data visualisation: choosing representations based on a natural scene paradigm. In: Proceedings of the first IEEE conference on visualization, San Francisco, CA, 23–26 October 1990, pp. 114–123. New York: IEEE.

19. Bertin. Sémiologie graphique. Paris: Mouton, 1967.

20. Bertin. La graphique et le traitement graphique de l’information (Nouvelle Bibliothèque Scientifique). Paris: Flammarion, 1977.
