Abstract
Data quality assessment outcomes are essential to ensure that analytical processes produce useful results. Relevant computational approaches support this assessment, especially for data defects that can be described by precise rules. However, data defects that depend more heavily on data context knowledge challenge data quality assessment, since the process then involves human supervision. Visualization systems belong to a class of supervised tools that can make data defect structures visible. Despite the considerable design knowledge already encoded for such systems, there is little design support for the visual quality assessment of data defects. Therefore, this work reports a case study that explored which visualization properties facilitate the visual detection of data defects, and how. Its outcomes offer a first set of implications for designing visualization systems that permit data quality visual assessment.
Introduction
New technologies enable industry and scientific organizations to collect, store, and distribute large relational databases to address their analytical processes. More than data processing power, these knowledge-intensive processes depend on reliable data to provide useful outcomes.
Improving data quality and keeping it at proper levels requires choosing an alternative among several methods, techniques, procedures, processes, and technological approaches. The data quality assessment process (DQAp) provides relevant and practical inputs for choosing the most suitable alternative through a data defect mapping. A data defect denotes a non-conformity between a data instance and its contextual meaning, which may arise at any point of the data life cycle. 1
Significant computational approaches support this process, especially for data defects that present more precise rules such as functional dependency violation. 2 These approaches apply quantitative 3 or assertion-based 4 methods and usually limit the human interpretation of their outcomes. 5
However, the DQAp strongly depends on data context knowledge, since it is impossible to confirm or refute a defect based only on data.5,6 The context determines the structure of meaning between data and the environment (e.g. an organization department). Hence, human supervision is essential throughout this process, even more so for data defects whose rules are difficult to specify (e.g. false tuple and missing reference 2 ) and that therefore require high human supervision. Visualization systems belong to a class of supervised approaches that combine computational capability with the pattern-finding and semantic-distinction abilities innate to human beings to permit data quality visual assessment.
The term “data quality visual assessment” denotes a nonlinear analytical process of comprehending the current data quality state, mediated by visualization systems. Through interactive visual representations, data quality appraisers search, extract, correlate, and understand meanings (patterns, relationships, and metrics) at different granularities until they integrate enough semantic evidence to confirm or refute a data defect. The nonlinear nature of this flow means that what is learned in one step provides inputs to refine and reapply that step or the next one. This concept of successive visual analysis to reach a synthesis has been expressed from different perspectives, including the information-seeking mantra, 7 the visual analytics mantra, 8 and sense making. 9
Much literature has encoded design knowledge regarding visualization systems. Certain relevant literature offers high-level, 7 spatial-driven, 10 or architectural-driven 11 design perspectives. Related to data quality assessment, this knowledge has been encoded through implementations, 12 reference models, 13 or evaluation studies. 14
However, an analysis of this literature mostly reveals concerns about communicating quality metrics measured on data with physical reference. There is little concern with how to facilitate the visual comprehension and assessment of data defect structures that occur within abstract data (e.g. sales or billing in relational databases) and that require high human supervision. This situation hampers the selection of the most appropriate visualization properties to support the visual assessment of data quality.
In order to explore this situation, this work reports an exploratory and observational case study that investigated which visualization properties would enable quality visual assessment on different data resolutions, according to data defect structures. In this article, the term “property” denotes visualization technique elements (including marks, visual variables, dimensions) and interactive techniques which allow the specification of the data of interest and its visual appearance. The main contributions of this work are as follows:
Characterization of the relationship between visualization properties and the structures of defects that occur within abstract data and require high human supervision.
Basic implications for designing visualization systems to support data quality visual assessment.
The work reported here is organized as follows: section “Related works” reviews related work. Section “Case study methodology design” describes the case study methodology and its support elements. Section “Case study results” describes the case study results, while section “Case study limitations” presents its limitations. Finally, section “Discussion” discusses the key findings of the case study and section “Conclusion” presents the conclusions of this work.
Related works
Visualization systems designed to support the DQAp are based on quality-aware or visual diagnosis–driven approaches. The first denotes the use of computational resources to extract data quality metrics which are visually communicated through highlights, glyphs, or labels. The latter denotes intensive visual analysis of meanings to detect data defects.
Knowledge concerning visualization system design is encoded in works with different perspectives and purposes. Taxonomies identify and organize the core concepts regarding visualization systems. 7 Guidelines describe recommendations to design visualization systems in given conditions. While certain guidelines provide directions for particular issues, 15 others consider an architectural standpoint. 11
In contrast, implementations offer design examples through the description of visualization systems for a specific domain. In the data quality assessment domain, implementations apply the quality-aware approach (except for Tennekes et al. 16 ) and provide little information about design decisions or about the strengths and weaknesses of their visual representations.12,16–18
Evaluations denote comparisons 19 or perceptual-cognitive studies20,21 that analyze certain techniques (e.g. visualization techniques) to identify their strengths and limitations within a certain problem domain. An evaluation may apply different techniques, including experiments, heuristics, and ethnographic studies. Strongly related to this work, certain evaluation studies14,22–24 discuss the capability of a restricted set of visualization techniques to detect atypical values (named here as “atypical tuple”). However, their results are mostly based on visualization techniques with overlapped visual properties (especially position), provide very little discussion about interaction support, apply very low data resolution (up to 4000 tuples), and cover only the simplest variants of atypical tuple, except for Ward and Theroux, 24 who address atypical values in interposed data categories (section “Study unit—atypical tuple”).
Reference models offer the most robust support resources for visualization system design and differ in purpose and theoretical basis. Certain models are based on psychophysics, 25 visual perception, 26 or cognitive psychology theories. 27 In terms of purpose, particular models are concerned with automatic visualization generation28,29 and design space modelling.10,30 Closer to this work, the latter provide relevant conceptual models that organize visualization properties according to a comprehension order with regard to data or elementary task characteristics.
Also related to this work, the spatial data quality (SDQ) and the uncertainty visualization (UV) explore the relationship between the data quality and the decision-making processes. Both research areas systematically describe models and classifications which combine data characteristics, space, and time to determine the appropriate visual elements to expose uncertainties in data with physical reference.13,31
Table 1 synthesizes the characteristics of state-of-the-art knowledge about visualization system design with regard to the DQAp domain. Despite its breadth and relevance, this knowledge does not address the structures of data defects and does not indicate which visualization properties are most useful to enable the visual detection of these structures (as seen, it communicates data quality metrics, which is the aim of the quality-aware approach). A data quality appraiser searches for data patterns and defective data relationships related to the data defect structures in order to take a course of action. In other words, these are the meanings to be extracted from the visual stimuli for subsequent cognitive processing (the very nature of the visual diagnosis–driven approach). Hence, visualization system design must consider the properties that facilitate meaning extraction to enable data quality assessment.26,27,32
Design knowledge of visualization systems—summary.
SDQ: spatial data quality; UV: uncertainty visualization.
This missing perspective makes it difficult to answer questions that are relevant to the design of visualization systems supporting the visual assessment of high human supervision defects: Which visual properties better reveal defects according to their structure? Which auxiliary properties can mitigate the occlusion effect at high data resolutions and still preserve the revealing capability of the visual properties? Which interactive properties are more appropriate for data quality visual assessment?
The following sections describe the exploratory case study conducted to delineate a first answer to these questions.
Case study methodology design
The case study method has been adopted to analyze technical, social, perceptual, and cognitive issues of visualization system based on different approaches (e.g. qualitative observational studies). 33
This work applied this method to answer the following question: which visualization system properties may facilitate the data quality visual assessment of defects that require high human supervision, and how? Furthermore, the case study adopted an exploratory approach since the relevance of visualization properties for the visual assessment of each data defect is unknown in advance. 33
This section characterizes the case study design used to answer the aforementioned question in terms of the tasks, procedure, artifacts, and analytical activities applied to comprehend and report the results.
Framework of tasks
The visualization system design is a complex process which requires an alignment between several resources and task requirements. 26 The “task” is an abstraction that guides the use of a visualization system. In this article, this abstraction represents the conscious effort to perceive and relate meanings to determine the data quality level.
This work uses a framework to model high-level tasks of data defect assessment to support the execution of the case study. The framework is based on the data defect structures, 2 certain data characteristics, the notations by Schulz et al. 30 and Andrienko and Andrienko, 34 and the task hierarchy derived from activity theory 47 and levels of interaction, 35 denoted by
The analytic task represents the highest analysis process for a particular domain. In turn, the sub-analytic task represents more concrete analysis goals to achieve the corresponding analytic task goal. The third level (interactive task) involves atomic analytic steps performed upon a visual representation, while the operational task denotes actions that have little semantic value.
The assessment of any data defect is denoted by analytic task
Sub-analytic tasks have two purposes or types: search and correlate. The first denotes the action of identifying and firming features that are close to the data defect structure being assessed. In turn, the latter represents analytical techniques applied to compare and establish relationships among these features. These task types may manipulate all data or specific regions.
Feature specifies the meanings of interest of a sub-analytic task type. Search tasks observe data patterns (Table 2) in high data resolutions or particular data regions. A data pattern is an essential characteristic of data behavior which represents a phenomenon inherent to data that may not be detected. 34 In contrast, correlation tasks observe data patterns, data regions, or specific objects to derive relationship structures such as cyclic, discrepant, alternate differences, opposite magnitude, random, similar, and smooth change, among others.
Interaction dynamically modifies the visual representation to enable data interpretation according to the task goal. This case study uses the interactions observed in Table 3.
DataSpace denotes the attribute sets g and b required for a sub-analytic task, represented by
Result represents the informational entities created by a task. An empty result determines an absent or insufficient visual perception of a defect.
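The task elements defined above can be sketched as a small data structure. This is only an illustrative reading of the framework, not the notation used in the case study: the class names, the `render` format, and the example pattern "outlier" are hypothetical. The only conventions taken from the text are the two task types, the Feature, Interaction, DataSpace, and Result elements, and the rule that an empty result means absent or insufficient perception.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubAnalyticTask:
    """One sub-analytic task with its Feature, Interaction, and DataSpace elements."""
    kind: str                 # "search" or "correlate"
    features: List[str]       # meanings of interest, e.g. a data pattern
    interactions: List[str]   # interactions applied to the visual representation
    data_space: List[str]     # attributes involved in the task
    result: List[str] = field(default_factory=list)  # informational entities created

    def perceived(self) -> bool:
        # An empty result denotes absent or insufficient visual perception.
        return len(self.result) > 0

    def render(self) -> str:
        # Hypothetical textual form: "|" separates multiple parameters of an element.
        elements = (self.features, self.interactions, self.data_space)
        return f"{self.kind}({', '.join('|'.join(e) for e in elements)})"


@dataclass
class AnalyticTask:
    """Assessment of one data defect as an ordered pipeline of sub-analytic tasks."""
    defect: str
    steps: List[SubAnalyticTask] = field(default_factory=list)

    def detected(self) -> bool:
        # The defect counts as detected only if every step produced a result.
        return bool(self.steps) and all(s.perceived() for s in self.steps)
```

For example, a search step built with `SubAnalyticTask("search", ["outlier"], ["filter", "geometric zoom"], ["g", "*"])` reports `perceived()` as false until the appraiser records at least one result entity.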
Pattern classes and patterns.
This pattern list is based on Andrienko and Andrienko 34 and it is not a definitive one.
Interactions per classes.
Task 1 connects the three-tuple notation (derived from Schulz et al. 30 and Andrienko and Andrienko 34 ) and the previous definitions to illustrate the analytic task composition. It shows the relationship among Feature, Interactions, and DataSpace elements, the pipeline within the analytic task, and introduces the symbols “|” and “*.” The first symbol is the separator of multiple parameters for any task element, while the second denotes any reference attributes (including none).
According to the previous notation, Task 2 exemplifies the analytic assessment task for inclusion dependency violation (Also known as reference violation, such defect occurs when a tuple
Based on such pattern, the analytic task of such defect is composed of four sub-analytic tasks: it starts from searching (1) and understanding (2) the defect pattern all over the data until it concentrates (3) the analysis on specific data regions regarding the suspect cases (4).
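For comparison with the visual task, the rule usually associated with a reference violation can be checked computationally. The sketch below assumes the common definition, in which a tuple of a referencing relation holds a value with no match in the referenced relation; the relation and attribute names (`orders`, `customers`, `customer_id`) are hypothetical and do not come from the case study data.

```python
def reference_violations(referencing, referenced, fk, key):
    """Return the tuples of `referencing` whose `fk` value has no match in `referenced`."""
    known_keys = {row[key] for row in referenced}
    # Null references are skipped here; whether they count as defects is context dependent.
    return [row for row in referencing
            if row[fk] is not None and row[fk] not in known_keys]


# Hypothetical relations for illustration only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},     # no such customer
    {"order_id": 12, "customer_id": None},  # null reference, not flagged
]

violations = reference_violations(orders, customers, "customer_id", "customer_id")
```

A check like this confirms only the rule itself. Deciding whether an existing reference is the correct one still depends on context knowledge, which is where the visual assessment applies.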
Data artifacts
Each data defect had five defective relations with resolutions from 1000 to 10 million tuples. Algorithms generated up to 1% of defective data according to the defect structure and ensured that the remaining data complied with a set of consistency, integrity, and completeness rules. The defective data for each relation were placed in different data regions to avoid biased data analysis. Table 4 exhibits the relation schema and the generation criteria for one of the data defects covered by this case study.
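A minimal sketch of this kind of controlled generation follows. It is not the algorithm used in the study: the schema, the normal distribution of the clean values, and the way an atypical value is produced (multiplying by 50) are assumptions made only to show how a fixed share of defective tuples can be scattered over random positions.

```python
import random


def generate_relation(n_tuples, defect_rate=0.01, seed=42):
    """Build a relation and inject `defect_rate` defective tuples at random positions."""
    rng = random.Random(seed)  # fixed seed keeps the artifact reproducible
    rows = [
        {"id": i, "amount": round(rng.gauss(100.0, 10.0), 2), "defective": False}
        for i in range(n_tuples)
    ]
    n_defects = int(n_tuples * defect_rate)
    # Sampling positions without replacement spreads defects over different regions.
    for i in rng.sample(range(n_tuples), n_defects):
        rows[i]["amount"] = round(rows[i]["amount"] * 50, 2)  # assumed atypical value
        rows[i]["defective"] = True
    return rows


relation = generate_relation(1000)
```

Keeping the `defective` flag in the generated data gives a ground truth against which the number of tuples marked by a participant can later be compared.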
Generated data characteristics by data defect—a fragment.
Participant
This case study relied on an experienced participant because its procedure depends on know-how in data quality assessment. Engaging experienced participants prevents discrepant observations about detection and quantification issues 36 and provides more adequate results to validate or refute the case study goal.14,22 The participant is a data quality analyst with 6 years of experience who applies different resources for daily analysis in a data warehouse environment, including programmable features (SQL), OLAP cubes, descriptive statistics, and traditional graphics such as scatter plots, histograms, and line graphs.
However, experienced participants are not always available in reasonable numbers or for a reasonable amount of time. In this case study, even though the number of participants may seem low, the findings were based on about 400 data quality assessment cycles (section “Procedure”).
Conditions and apparatus
The participant stayed in a reserved room equipped with a workstation, a 21-in monitor with 1920 × 1080 pixel resolution, data artifact metadata, descriptions of the assessment tasks, a notepad, and some pens. An observer followed all the tasks performed and took notes on the observations verbalized by the participant with regard to detecting and quantifying defects.
This work built a visualization tool named Visualization for Defect Detection (Vis4DD) to gather only the visualization system properties aligned with the case study goal. The tool is based on the R-Project 3.1.2 environment because of its portability and analytic-driven architecture. 48 Vis4DD (Figure 1) provided a regular visual arrangement and the same color scale for all visual representations, whose settings were kept as similar as possible to avoid the influence of particular properties. Each visual representation contained a single visualization technique without data diffusion to avoid screen scrolling. Moreover, the tool recorded all interactions (including the corresponding parameters and time ranges), allowed the participant to mark data regions with defects (according to the participant’s judgment), and saved the current visual representation as a PNG file.

Visual spaces of Vis4DD.
Procedure
The case study followed the stages of artifacts and environment arrangement, work procedure planning, participant preparation, case study execution for each study unit (data defect), and final analysis. Once the arrangement stage was completed, a pilot was used to refine and establish the work procedure observed in Figure 2. In the preparation stage, the participant received all the information and contextualization related to the case study, the data defect structures, the assessment tasks, and the semantics of the defective data artifacts. Additionally, the participant received 30 min of training to become familiar with Vis4DD and the case study procedure.

Work procedure steps for the case study with multiple study units (BPMN representation).
Sessions of 60 min were used to perform the assessment task for each data defect, except for the atypical tuple and heterogeneous granularity defects, which required three and two sessions, respectively. In each session, the relations corresponding to the data defect in question were submitted one by one to an ordered set of visual representations, according to the Vis4DD tool. In each isolated visual representation, the participant performed the data analysis based on the assessment tasks, marked the regions of defective data, determined the number of defects, and saved the current visual representation. In contrast, those visual representations which did not enable defect detection were marked as inadequate. There were no time restrictions. After the completion of each session, all the notes and observations related to the study unit were organized and synthesized.
Once all case study units were finished, the final analysis stage used data quality assessment cycles as analysis units to perform two steps in sequence. A cycle denotes the occasion during which the participant arranged the attributes and then detected and quantified each variant of a specific defect in a certain data resolution. It also represents the occasion when the visual representation did not enable assessment of a certain defect. Each cycle has a set of notes, observations, and interaction records used by the analysis step. There were about 400 cycles in this case study. The first step developed the configuration concept, which gathers the Vis4DD properties (as described below) that had causal participation in the detection or quantification of data defects. The configurations were grouped by the visual properties used to encode the target attribute, as observed in Table 5.
Basis property denotes a visual variable that encodes a target attribute for assessment purpose. This work addressed size, saturation/lightness, hue, position, shape, and connection.10,29 The hue property was based on segmented color scale, while saturation/lightness used the unsegmented one. Additionally, the trellis technique (also called small multiples) was considered because of its capability to promote a juxtaposing view of data by means of panel. 10
Description type denotes how the values of a target attribute are encoded by the basis property.
Dimension determines the maximum number of represented attributes.
Continuous and discrete reference denote which data types are encoded at intervals or distinct units. These encoding modes determine the capability to represent extensive or discrete attribute domains, respectively.
Trellis indicates whether a configuration offers an additional encoding through a discrete panel.
Superimposing strategy represents objects in the same coordinate system, leading to overlapping.
Juxtaposing strategy represents objects separately in space and time.
Filter allows choosing the data of interest according to simplification criteria defined by a widget control.
Motion uses different parameters to promote apparent data movement.
Geometric zoom enlarges or shrinks objects while zooming in or out, respectively.
Ordering provides data arrangement according to an attribute of interest.
Compaction is an approach of compressing a certain number of objects into a small visual area.
Opacity change allows setting the opacity degree (by alpha blending) of the visually represented data.
Point displacement allows adding random variations to the represented data by means of jittering.
Matrix of configurations and their properties.
Q: quantitative; D: date; C: categorical.
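Two of the properties listed above, point displacement and opacity change, have simple numerical readings that help explain why they mitigate overlap. The sketch below is not taken from Vis4DD; the function names and the uniform jitter distribution are assumptions. The opacity formula is the standard result of compositing identical semi-transparent marks over one another.

```python
import random


def jitter(values, amount, seed=0):
    """Point displacement: add a bounded random offset to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


def stacked_opacity(alpha, n_marks):
    """Opacity where n_marks marks of opacity alpha overlap (source-over blending)."""
    return 1.0 - (1.0 - alpha) ** n_marks


# Five tuples sharing the same value no longer occupy a single screen position.
spread = jitter([10.0] * 5, amount=0.3)

# With low alpha, darkness grows with the number of overlapping tuples,
# so dense regions stay distinguishable from sparse ones.
sparse = stacked_opacity(0.1, 1)
dense = stacked_opacity(0.1, 20)
```

The jitter amount trades legibility for positional accuracy, so it is usually kept small relative to the distance between adjacent categories or values.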
In this study, the term “detection” represents the capability to reveal the data defect structure and enable its analysis. The capability criteria considered (1) full or partial identification of the defect structure, (2) computational cost of rendering not exceeding 60 s, and (3) feasible analysis in high-density data representations. In turn, the term “quantification” denotes the degree of accuracy in counting tuples involved in the identified defect, being defined as 80%, between 30% and 80% included, less than 30% or unquantifiable 0%.
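The quantification levels can be read as bands over a counting accuracy. The sketch below is one possible interpretation, stated under assumptions: accuracy is taken as the share of the true defective tuples left after subtracting the counting error, the first band is read as above 80%, and the band labels are hypothetical names introduced here.

```python
def quantification_band(counted, actual):
    """Classify counting accuracy into the bands described in the text (assumed reading)."""
    if actual <= 0:
        raise ValueError("actual number of defective tuples must be positive")
    if counted is None:
        return "unquantifiable"
    # Assumed accuracy measure, in percent: penalize both under- and over-counting.
    error = abs(counted - actual)
    accuracy = 100.0 * max(0, actual - error) / actual
    if accuracy == 0:
        return "unquantifiable"
    if accuracy > 80:
        return "high"
    if accuracy >= 30:
        return "medium"   # between 30% and 80%, bounds included
    return "low"
```

Under this reading, counting 95 of 100 defective tuples falls in the highest band, while counting exactly 80 of 100 falls in the middle band because its upper bound is inclusive.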
The second step analyzed the relationship among different properties and data characteristics that led to detection and quantification for all the study units. Descriptive statistics, contrast, and evidence construction 37 were the main techniques used by the analytical procedure to build the final report discussed next.
Case study results
This section reports the main findings regarding each study unit. Due to space restrictions and extent of the analysis, only one of the study units (incorrect and missing reference) is fully described since these defects are poorly explored by visualization implementations that support data quality assessment. Refer to Borovina Josko et al. 2 for further details of data defect structures mentioned below.
Study unit—atypical tuple
An atypical tuple deviates from the behavior of the remaining tuples of a relation for different reasons. 2 This study considered four variants to analyze the visualization system properties.
The first two variants denote an attribute with different amounts of atypical values (about 0.1% and 1%, respectively). Almost all configurations enabled the visual detection of these variants, but with different quantification capability and participant effort. For example, configuration d2 (based on hue) required filtering and point displacement interactions to detect both variants. In contrast, configurations based on the position property and compaction were more suitable because the defect pattern was noticeable in all data resolutions and required very few interactions, as observed in Figure 3(a). The findings related to the position property complement those in Marghescu, 14 Grinstein et al., 22 Hoffman, 23 and Ward and Theroux 24 with regard to data resolution and quantification issues, as well as to the use of compaction.

Scene fragments of atypical tuple assessment: (a) detection and quantification with frequency in hue in resolution of 10^7 tuples (second variant) and (b) detection with position and line supported by filtering, trellis, and opacity change in resolution of 10^4 tuples (fourth variant).
A third variant represents atypical values interposed among data categories with certain superimposition (about 15%). This structure influenced the detection or quantification capability of all configurations. Certain configurations, such as c1, did not enable the comparison between the target attribute and data categories. Those based on position were more suitable for the same aforementioned reasons, in spite of requiring more filtering, point displacement, and opacity change to assess this variant. These findings complement the ones in Ward and Theroux, 24 who did not consider interaction techniques and defect quantification for this variant.
The fourth variant denotes an unusual combination of values across multiple attributes. This work combined three or four attributes out of the five available. Few configurations permitted the detection of this variant, and those only partially. Those with three dimensions (including a1, a2, and b1) required several data arrangement and filter interactions to complete the task. This cumbersome operation led the participant to mark some false positives. In contrast, configurations a7 and c5 were more appropriate because of their N-dimensional nature, although with limitations. The first was appropriate up to 10^5 tuples (Figure 3(b)) because of visual occlusion, while the latter required intensive use of zooming interactions (up to 10) to isolate the defective cases. The result for configuration c5 complements the one in Tennekes et al., 16 who did not discuss this variant.
Study unit—false tuple
A false tuple is an instance that fulfills all active restrictions of a given relation R, but is not representative of a universe of discourse (UoD). 2 This study unit analyzed temporal gaps in historical business facts; alternatives involving historical transaction records and reference attributes were not considered.
The configuration based on saturation/lightness facilitated the assessment task and enabled longitudinal analysis of time-oriented data, as observed in Figure 4(a). Its juxtaposed data presentation was more explicit through all resolutions than that of the remaining configurations which detected the temporal gap. Additionally, the filter interaction provided a valuable resource for slicing time-framed data.
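To make the notion of a temporal gap concrete, the sketch below lists the periods with no recorded facts between the first and last observed period. The monthly granularity and the function names are assumptions for illustration; the study does not state the granularity of its time-oriented data.

```python
from datetime import date


def month_index(d):
    """Map a date to a running month number so consecutive months differ by 1."""
    return d.year * 12 + (d.month - 1)


def index_to_year_month(i):
    """Inverse of month_index, returned as a (year, month) pair."""
    return (i // 12, i % 12 + 1)


def temporal_gaps(fact_dates):
    """Return (first_missing, last_missing) month pairs between observed months."""
    present = sorted({month_index(d) for d in fact_dates})
    gaps = []
    for prev, nxt in zip(present, present[1:]):
        if nxt - prev > 1:  # at least one month without any fact
            gaps.append((index_to_year_month(prev + 1), index_to_year_month(nxt - 1)))
    return gaps


# Hypothetical business facts: nothing recorded in March and April 2020.
facts = [date(2020, 1, 5), date(2020, 2, 9), date(2020, 5, 1), date(2020, 6, 30)]
gaps = temporal_gaps(facts)
```

Such a listing shows where facts are absent, but not whether an absence is legitimate, which is the judgment the longitudinal visual analysis supports.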

Scene fragments of false tuple and heterogeneous measurement unit assessment: (a) temporal gap detection with saturation/lightness supported by filter in resolution of 10^6 tuples and (b) detection without quantification with size proportional to frequency supported by filter and trellis in resolution of 10^7 tuples.
Motion-based configuration (c4) also enabled longitudinal analysis through an expansion and contraction pulse of constant direction. However, it led the participant to an imprecise overall comprehension of temporal gaps and to fatigue due to the need of mentally connecting facts between animated transitions (velocity of 1.8 s). The extensive domain of the assessed attributes (increasing from 28 to 450, according to data resolution) may have blocked the perception of the changes’ nature and rate. These findings complement the studies in Robertson et al. 20 and Ware. 26
Study unit—heterogeneous measurement unit
This defect denotes attribute values represented in physical quantities with different magnitudes. 2 This study examined one variant of heterogeneity factor (the distance between the required and heterogeneous magnitudes) of 1/5. Such structure led to heterogeneous and homogeneous tuples overlapping in some data regions.
The absence of reference elements (attributes or relationships) hampered the perception or introspection of heterogeneous instances among homogeneous instances, restricting the detection of defective cases in data regions with reduced or no overlapping. Very few position-based configurations (a1 and a2) provided a moderate support through point displacement and filter combination. In turn, certain size-based configurations (c2 and c5) led to an intensive data exploration until the isolation of some defective cases, as observed in Figure 4(b). This corroborates the limitation of size property on providing accurate perception of smooth fluctuations of quantitative data. 21
Study unit—heterogeneous granularity
Heterogeneous granularity denotes attribute values represented by different abstraction levels. 2 This study covered two variants of random behavior and disparate heterogeneity levels (the distance between the required and heterogeneous abstraction levels).
Several configurations detected the first variant due to its high heterogeneity factor (about 12). For instance, the size property enabled an easy perception of great disparities (Figure 5(a)) and was as suitable as position (corroborating earlier studies14,21). In turn, the second variant was set with a less pronounced heterogeneity factor (1/2), which led to certain overlapping among defective and non-defective tuples, as observed in Figure 5(b). This situation reduced the quantification capability and increased the analysis effort of certain configurations. As an example, the participant detected very few cases of this variant in higher data resolutions through the saturation-based configuration.

Scene fragments of heterogeneous granularity assessment: (a) heterogeneous granularity detection with proportion whole-part restricted to the resolution of 10^3 tuples (first variant) and (b) visual occlusion with size proportional to absolute value in resolution of 10^3 tuples (second variant).
Study unit—incorrect and missing reference
Inconsistent relationships are the analysis goal shared by the incorrect and missing reference defects, as observed in Task 3. The first denotes the representation of nonexistent relationships in the UoD, while the second expresses the opposite. 2 Only certain configurations of the class graph (Figure 6) were able to expose the defective relationship structures by means of imposing network, confirming the studies in Bertin 10 and Ware. 26

Configuration results for the incorrect and missing relationships: (a) resolution detection and (b) resolution quantification.
The absence of reference attributes in the configuration g2 (unlabeled node-link) inhibited the extraction of meanings. Equally inadequate, the undirected rectilinear node-link (configuration g3) led the participant to perform several visual transitions because the data representation was limited to 100 tuples.
In contrast, Figure 7(a) illustrates the radial directed node-link (configuration g1), which combines reference attributes and the Fruchterman-Reingold 38 technique to enhance the spatial arrangement. This combination eased the detection and quantification of defective relationships because of its fluidity and absence of interruptions, as it follows the continuation principle (the tendency of human vision to follow a certain direction determined by an object to reach another one 42 ). However, detection and quantification were restricted to the lowest resolution as the visual occlusion increased. Moreover, the absence of reference attributes blocked insights related to marriage relationships which did not exist in the UoD.
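The force-directed technique mentioned above can be sketched in a few lines. This is a minimal, unoptimized rendition of the general scheme (repulsion between all node pairs, attraction along edges, displacement capped by a decreasing temperature), not the implementation used in Vis4DD; the frame size, iteration count, and linear cooling schedule are assumptions.

```python
import math
import random


def fruchterman_reingold(nodes, edges, width=1.0, height=1.0, iterations=50, seed=0):
    """Return {node: [x, y]} positions inside a width x height frame."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(0, width), rng.uniform(0, height)] for v in nodes}
    k = math.sqrt(width * height / len(nodes))  # ideal distance between nodes
    temperature = width / 10.0                  # caps movement per iteration
    cooling = temperature / (iterations + 1)    # assumed linear cooling

    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsive force k^2 / d between every pair of nodes.
        for i, v in enumerate(nodes):
            for u in nodes[i + 1:]:
                dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[v][0] += dx / dist * force
                disp[v][1] += dy / dist * force
                disp[u][0] -= dx / dist * force
                disp[u][1] -= dy / dist * force
        # Attractive force d^2 / k along every edge.
        for u, v in edges:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[v][0] -= dx / dist * force
            disp[v][1] -= dy / dist * force
            disp[u][0] += dx / dist * force
            disp[u][1] += dy / dist * force
        # Move each node by at most the current temperature, clamped to the frame.
        for v in nodes:
            dx, dy = disp[v]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[v][0] = min(width, max(0.0, pos[v][0] + dx / length * step))
            pos[v][1] = min(height, max(0.0, pos[v][1] + dy / length * step))
        temperature -= cooling
    return pos


# Hypothetical relationship graph with four tuples.
layout = fruchterman_reingold(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
```

The pairwise repulsion loop is quadratic in the number of nodes, which is consistent with node-link layouts becoming costly and occluded as the data resolution grows.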

Scene fragments of incorrect and missing relationships’ assessments: (a) detection and quantification with directed radial connection in resolution of 10^3 tuples and (b) defective relationships detection with four-level hierarchy in resolution of 10^7 tuples.
The division and filling space approach of configuration h1 (Figure 7(b)) provided an overall view of data in all resolutions and facilitated the perception of defective relationships. Nevertheless, this configuration was less accurate than the directed graph, required some approximations (even in low data resolution) and did not allow any quantification.
Study unit—defects without visual evidence
Certain data defects were not related to visualization properties because their structures are not visually perceptible. In the case of missing tuple and overloaded tuple defects, the corresponding facts from the UoD are not in the assessed relation. Duplicated tuples have several intricate combinations that may denote potential duplicates, whose detection is hard even for computational approaches. 39 Other defects such as synonymous values, homonymous values, and incorrect value could be visually assessed only at very low data resolutions and on reduced domains (fewer than 10^3 tuples and up to 10 distinct elements, respectively). Therefore, such cognitive-based assessment is infeasible at large data resolutions.
Case study limitations
This section describes the main limitations of this case study design. One of them is related to the applied set of visualization and interaction techniques. Although representative, this set did not cover certain visual variables and marks (e.g. texture and pixels, respectively), multiple coordinated views, embedded visual metaphors, coordinated screens, normalization of ratings, or data derivation based on user expressions or relationship models (for example, linear regression). Furthermore, the color preference setting was unavailable and the applied color model was not one that approximates human perception, such as CIELab. 40
Influences related to the degree of familiarity with visualization techniques, aesthetics, visual acuity, and cognitive style were not measured since these issues are beyond the scope of this work. Additional resources such as collaboration, knowledge base, provenance metadata, and learning effects represent future research issues.
Relations with isolated defects are uncommon in several UoD. Hence, the outcomes of this case study are subject to the influence of a complex interdependency among certain data defects. However, the controlled data generation strategy was very relevant to the goals of this work for the following reasons: it ensured the participant’s attention to the assessment task of each defect, eliminated the dispersion caused by complex interdependence among defects, permitted control over data characteristics and semantics, and made it possible to gather particular qualitative notes for each data defect over each visual representation.
Discussion
The case study findings revealed relationships between certain visualization system properties and data defects that require high human supervision, as characterized in section “Property–defect relationship.” These findings allowed this work to answer the questions related to data quality visual assessment (section “Related works”) which are not addressed by the state-of-the-art visualization literature. The relevance of the property–defect relationship is discussed in section “Practical use of property–defect relationship.”
Property–defect relationship
Property–defect relationship exposes sets of interactive and visual properties which facilitate visual assessment of different defects up to certain data resolutions, as observed in Table 6. In this table, the column “resolution level” denotes the highest data resolution (in 10^N) where defect detection was possible. Each data defect column (and corresponding variant) is related to a quantification capability factor. Such factor is based on
Summary of property–defect relationship based on configurations.
Quantification capability factor: ++ for [0.75, 1], + for [0.5, 0.75), − for [0.3, 0.5), −− for [0, 0.3), · (single point) for not detected, blank space for not studied.
Restricted to N-dimensional visualization techniques.
Restricted to 10^4 data resolution.
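The interval notation of the table legend can be read as a simple mapping from a factor value to its symbol. The sketch below encodes only the intervals stated in the legend. How the factor itself is computed is not covered here, and the function name `capability_symbol` and the use of `None` for the not detected case are assumptions made for illustration.

```python
def capability_symbol(factor):
    """Map a quantification capability factor in [0, 1] to its legend symbol.

    None stands for a defect that was not detected (an assumption of this
    sketch). The "not studied" case is a blank cell and is not modeled here.
    """
    if factor is None:
        return "·"
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    if factor >= 0.75:
        return "++"   # [0.75, 1]
    if factor >= 0.5:
        return "+"    # [0.5, 0.75)
    if factor >= 0.3:
        return "−"    # [0.3, 0.5)
    return "−−"       # [0, 0.3)
```

Note that the lower bound of each interval is inclusive, so a factor of exactly 0.5 maps to + rather than −.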
The relationships discussed below are not prescriptive, but provide a closer orientation for visualization system design according to the needs of data quality visual assessment. It is worth noticing that the characteristics of assessed data (data type, distribution, and domain size) have direct impact on the capabilities of the properties and therefore must be considered in the selection of these properties. Moreover, most relationships depend on the availability of reference elements (attributes, relationships, or historical records) to facilitate meaning extraction; otherwise, a data defect may not be revealed.
The property position provided the most appropriate support for data quality visual assessment because it spatially preserved the defect patterns in data, particularly in cases of quantitative analysis. Moreover, position combined with compaction permitted analysis at volumes greater than the limit of 10^5 tuples provided by spatial distortion and opacity change (a limit that corroborates previous findings23,41). Among the position-based configurations with no compaction, the description by points was superior to the one by lines. The latter showed fast visual occlusion, which on many occasions was not reduced by opacity change. This situation complicates the judgment of orientation and slope, corroborating the results in Grinstein et al., 22 Ward and Theroux, 24 and Unwin et al. 41
The properties hue and trellis obstructed almost all assessment tasks of target attributes with extensive domains due to the high density of distinct colored descriptors and trellis panels, respectively. Some case study observations suggested that visual search for defective cases encoded in these properties was unusual. In contrast, these properties were well suited for categorization (attributes up to 10 and 70 distinct domain elements, respectively), which leveraged correlations and comparisons. These results corroborate the strengths and weaknesses reported in previous studies related to hue 40 and trellis.19,41
The property saturation/lightness made it more difficult to interpret absolute values or numerical relations from a hue-magnitude reference map, as pointed out by Bertin. 10 For example, this property led the participant to mark different false positives on those data defects with little disparity between defective and non-defective values. In contrast, this property surpassed the position property in longitudinal analysis, partially due to juxtaposition. In fact, this visual data arrangement seems to have allowed the participant to detect meanings about data defects easily, although his preferences were not collected. It is possible that the very nature of data quality visual assessment requires visual representations better suited to data comparison.
The quantitative interpretation through the size property was also difficult. Both compaction-based configurations imposed intensive use of zoom, filter, and resizing intervals, according to task complexity and data resolution. This situation was partially caused by distortion of the arithmetic mean (description based on proportional to average) and value range partitioning (description based on frequency). Those without compaction experienced visual occlusion at 10^3 data resolution (c3 and c4) or were inappropriate for magnitude comparisons (c6), mainly due to little disparity between defective and non-defective values. The latter corroborates the comparison studies in Yang et al. 21
The shape-based configuration was inadequate for all data defects. It demanded intense data navigation due to its visual representation limit of 150 tuples. Moreover, the shape length calculation considers only the exhibited data and not the data as a whole, which prevented the participant from completing the assessment tasks. This finding deviates from the results in Ward and Theroux 24 and Rusu et al. 49 which considered shapes (icons) suitable for the first two variants of atypical tuples (section “Study unit—atypical tuple”).
The properties recursive hierarchy and connection of graphs were useful for defective relationship comprehension, but with different capabilities. Among connection-based configurations, the node-link directed graph with one dimension (g1) provided the most intuitive assessment of relationships, limited to 10^3 tuples. The remaining connection-based graphs were inadequate due to the absence of reference attributes or directed edges.
In contrast, the recursive hierarchy graph expressed defective relationships at every data resolution through recursive division of visual space based on the attributes. However, this graph showed severe occlusion for attributes with extensive domains, inhibited the detection of unidirectional relationships, and demanded more effort to detect defective cases than the node-link graph. Although this work did not gather data on reasoning effects, the results of this property may have been influenced by the adoption of a line of thinking based on flows.
For all defects, the filter and geometric zoom were relevant for an easy and continuous refinement of the data regions that are the object of quality analysis. Observations recorded from different study units suggested the need for a filter with multiple predicates. Such flexibility was especially relevant to those assessment tasks with intense visual explorations or attributes with extensive domains. These findings complement the ones in Marghescu, 14 Grinstein et al., 22 and Ward and Theroux 24 regarding interaction techniques. The attribute arrangement was also relevant since it facilitated assessment tasks through the adjacent arrangement of the attributes of interest, corroborating previous work. 43
The alignment between the number of attributes (or dimensions) available in a visual representation and the assessment task (data space) was equally essential for all data defects. The misalignment (i.e. more task attributes than visualization dimensions) affected the data assessment effectiveness since it imposes a cognitive burden to maintain and integrate facts between data arrangement transitions. These issues corroborate previous works by Ware, 26 Mackinlay, 29 and Card et al. 44
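A filter with multiple predicates can be sketched as a conjunction of independent conditions over a tuple, so that the appraiser can add or remove conditions while refining a data region. The example below is only an illustration of this idea. The helper `make_filter`, the attribute names, and the sample rows are hypothetical and do not come from the case study tool or its data.

```python
def make_filter(*predicates):
    """Combine predicates so that a tuple passes only if all of them hold."""
    def combined(row):
        return all(predicate(row) for predicate in predicates)
    return combined


# Hypothetical tuples of an assessed relation.
rows = [
    {"id": 1, "age": 34, "state": "SP"},
    {"id": 2, "age": 150, "state": "SP"},
    {"id": 3, "age": 160, "state": "RJ"},
]

# Two predicates applied together: an implausible age within one state.
suspicious = make_filter(
    lambda r: r["age"] > 120,
    lambda r: r["state"] == "SP",
)
selected = [r["id"] for r in rows if suspicious(r)]
```

Keeping each predicate separate, instead of writing one monolithic condition, mirrors the incremental refinement observed in the study units: a predicate can be dropped or replaced without rebuilding the whole filter.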
Practical use of property–defect relationship
Building visualization systems involves the complex combination of properties selected among numerous possibilities. Since the use of these systems affects the outcomes of analytical processes, only the combination of certain properties provides appropriate support to such processes.26,32,35 Hence, human factors are key elements in the design of visualization systems.
One of the ways to consider human factors is to select the properties of the visualization systems based on specificities of domain tasks, such as the ones of data quality visual assessment.10,26,32 The absence of correspondence between a visualization and the task purpose (e.g. perception of meanings related to defect structures) prevents the data quality appraiser from answering questions through visualizations.27,32,45 Therefore, every visualization must ease the perception of the meanings required for task completion.
The purpose of the DQAp is to provide analyses concerning data quality. These analyses comprise the following steps: comprehension of data defect structure, the search for its structure, and the diagnosis of the search outcomes. In this process domain, visualization systems provide support to the search and diagnosis steps (the assessment tasks) through two significant approaches (section “Related works”): quality-aware and visual diagnosis–driven.
The quality-aware approach denotes the semi or non-supervised use of computational resources based on assertions or statistical models to extract conformity patterns or mathematical signature from data, respectively. Visual representations communicate this quality information through visual highlights, glyphs, or labels. Such approach is extensively used for defects that require low-to-moderate human supervision or for structures that are not visually perceptible (section “Study unit—defects without visual evidence”).
This informative approach requires semantic and technical knowledge to select and set the appropriate resource to detect each defect. 5 Naturally, human appraisal of the potential quality problems computed by these resources is needed. Since this procedure may include several adjustments, exposing the results of such resources in an arbitrary visualization inhibits their evaluation and, subsequently, the data quality assessment itself.
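As an illustration of the quality information that an assertion-based resource may compute before it is communicated visually, the sketch below flags the tuples involved in a functional dependency violation, a defect with a precise rule. It is a minimal example under stated assumptions: the function `fd_violations`, the attributes `zip` and `city`, and the sample rows are hypothetical, and the returned indexes stand for the tuples that a visualization would mark with highlights, glyphs, or labels.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return indexes of rows involved in violations of the dependency lhs -> rhs."""
    values_by_key = defaultdict(set)
    for row in rows:
        values_by_key[row[lhs]].add(row[rhs])
    # A key that maps to more than one value violates the dependency.
    return [i for i, row in enumerate(rows) if len(values_by_key[row[lhs]]) > 1]


# Hypothetical tuples: the same zip code appears with two different cities.
rows = [
    {"zip": "01310", "city": "Sao Paulo"},
    {"zip": "01310", "city": "Campinas"},
    {"zip": "20040", "city": "Rio de Janeiro"},
]
flags = fd_violations(rows, "zip", "city")  # rows 0 and 1 are flagged
```

The flags indicate only that the two tuples conflict, not which one is defective. That judgment still depends on the appraiser, which is why the visual representation chosen to expose the flags matters.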
Figure 8 shows two visualization systems that exemplify the aforementioned circumstance. The one observed in Figure 8(b) facilitates the analysis of potentially duplicated tuples (lanes 2 and 4) by detailing their corresponding relationships to other tuples. As discussed in section “Study unit—incorrect and missing reference,” defects that refer to relationships among tuples are better observed in graphs.

In contrast, the system in Figure 8(a) represents the potential duplicates in an inconclusive way. It is not correct to say that every exposed tuple is duplicated, and the applied spacing also obstructs the correlation between the tuples, i.e. a violation of the proximity principle (the proximity principle states that elements placed close together are processed as a group rather than separately; therefore, proximity benefits a quicker understanding of information 42 ).
Since the visualization properties determine which meanings may be derived,10,44 the example above corroborates the relevance of the property–defect knowledge to the quality-aware approach. Indeed, the alignment between properties that facilitate the perception of a defect structure and the related computational-based quality information can leverage the support to the data quality assessment.
The visual diagnosis–driven approach denotes extensive use of visual analysis of meanings to determine defective data. It is possible to shape, explore, criticize, or confirm data quality hypotheses in a direct and interactive way. Therefore, this approach depends on experience, data context knowledge, and attentive effort of a data quality appraiser supported by a proper visualization system. For this reason, knowledge about property–defect relationship is naturally essential to select the proper interactive and visualization techniques.
The supervised nature of this visual approach is important for those defects whose analysis strongly depends on human supervision and when contributions from computational approaches are restricted (when available).
The discussion above shows that the property–defect relationships (section “Property–defect relationship”) are relevant inputs to the design of visualization systems concerning data quality assessment, regardless of the approach. These relationships provide a basis mapping that reduces subjectivity and identifies whether certain visualization properties are adequate to assess a data defect in a given data resolution. Moreover, this mapping is the starting point to evaluation studies related to more specific factors (including cognitive style, personal preferences, experience level, learning effects, collaboration) with regard to data quality visual assessment. The knowledge regarding property–defect relationships is partially covered in visualization literature, as discussed in section “Related works.”
The main limitation of property–defect knowledge refers to the manipulation of very high data resolutions, an intrinsic issue of the DQAp. Few visualization techniques can support significant data resolutions (greater than or equal to 10^8) and possibly facilitate the assessment of the defect set covered by this work. Not all data are required to learn about quality issues. Therefore, the segmentation of data into representative subsets may be a feasible solution. Here, the term “representative” denotes data subsets that necessarily contain instances of the desired defects. However, the segmentation method remains a matter of open research.
Conclusion
This work reports a set of property–defect relationships that establish that certain visual and interactive properties are more suitable for visual assessment of certain data defects in a given data resolution. Additionally, data defects which cannot be visually detected and require the support of computational approaches to be assessed are also discussed. An exploratory case study provides these findings based on several qualitative observations, notes, and interaction logs. This case study consists of multiple study units (data defects with high human supervision) and it is supported by a tool built for this purpose only.
The property–defect relationships offer a first set of basis implications to design visualization systems more appropriate to the needs of data quality visual assessment. Moreover, these relationships are also relevant to the visualization systems based on the quality-aware approach. The alignment between computational-based quality information and the properties that facilitate data defect perception does leverage data quality assessment. Nevertheless, the property–defect relationships do not cover certain properties (e.g. pixels and multiple coordinated views) and do not provide information about learning or reasoning issues. Future work intends to conduct a longitudinal case study and experiments with a reasonable number of less experienced participants, analyze the property–defect relationships for time-oriented data defects, and identify segmentation methods to permit data quality visual assessment in very large databases.
Footnotes
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by CNPq (Brazilian National Research Council) grant number 141647/2011-6.
