Abstract
Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The numbers of initial mixture components and final combined groups quantify visual cluster patterns in scatterplots. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.
Introduction
Cluster discovery is a typical task in visual data analysis.1,2 Clusters can have various shapes, densities, and other characteristics, 3 and may exist in different data subspaces. Fully automated clustering techniques are not always satisfying and might not match the end-user expectations. 4 Hence, end-users often use data visualization, usually in scatterplots, to find or validate clusters of interest. 5
To support these users, several pipelines for visual cluster analysis have been proposed in Visual Analytics. 6 One way to visually discover clusters in multidimensional (HD) spaces is to use multidimensional projection techniques, 7 RadViz, 8 or star coordinate plots. 9 Examining the resulting scatterplots allows for detecting grouping patterns that could support the existence of their multidimensional counterpart. But these two-dimensional projections generate artifacts,7,9 and often one view is not enough to reliably discover all the multidimensional cluster structures.10,11 Moreover, clusters may exist only in subspaces of the data. Hence, visual cluster analysis requires generating projections from possibly many (weighted) combinations of the initial features and different tuning of the parameters of projection techniques.1,12–16 Eventually, these techniques allow the analyst to spot the most interesting visual cluster patterns for further investigation.
With increasing data dimensionality, however, this process often becomes tedious and cumbersome due to the large number of projections to explore visually. Visual Quality Measures 17 (VQM) can support users in such situations by automatically detecting and quantifying visual patterns.18–20 Ranking and arranging visualizations by order of interest concerning a specific type of pattern10,21 lets analysts focus their limited-time budget on the most promising views. We are primarily interested in VQMs for cluster and grouping patterns in our work. Several VQMs have been proposed for that purpose.22–24 Of these, the ClustMe method 24 is based on merging and counting components of a Gaussian Mixture Model (GMM) 25 of the points in the scatterplot. It was the first GMM-based VQM for quantifying visual cluster patterns in scatterplots. ClustMe has shown the most accurate performance in ranking human perceptual judgment benchmark data among all competitors. Still, its agreement with these human perceptual rankings is in the [60% - 80%] range, a relatively low score which we expect to improve by replacing the merging component of ClustMe with a new data-driven model, forming ClustML.
GMM-based VQMs like ClustMe and the proposed ClustML are made of three main stages illustrated in Figure 1:

A visual quality measure (VQM) based on a Gaussian Mixture Model (GMM) for cluster patterns in scatterplots is made of three stages: (1) a data-driven process estimates the parameters of a GMM of the data points density in the scatterplot; (2) the degree of overlapping of each pair of GMM components is computed to provide additional characteristics of interest to quantify cluster patterns; (3) The data points, the GMM parameters, and the pairwise quantities are aggregated to compute the visual quality measure. ClustMe and ClustML are both GMM-based VQMS, differing in the way they quantify pairwise overlap of GMM components (Stage 2).
In ClustMe, the overlapping evaluation (Stage 2) is based on a computational heuristic called Demp that decides when two GMM components overlap too much; they are merged or linked together to represent a single cluster instead of two symbolically. The ClustMe VQM score is a linear combination of the number of GMM components and the number of connected components of the graph formed by the Demp links, with more weight given to the latter.
In this work, in contrast to ClustMe, we set out to develop ClustML, a new GMM-based VQM whose overlapping evaluation and merging decision (Stage 2) is learned from human judgments of cluster patterns in scatterplots rather than using the Demp heuristic.
We demonstrate the superiority of ClustML against ClustMe, its main competitor, in terms of agreement with human judgments on two perceptual studies datasets, S1 and S2, previously collected for the development and evaluation of ClustMe 24 :
In the ClustMe paper, S1 is used to select the best merging decision model among a finite set of 7 heuristics. In contrast, in this work, S1 is used to train an automatic classifier to mimic human merging decisions. In both the ClustMe paper and this work, S2 is used to evaluate the resulting GMM-based VQM for a pairwise ranking task. Using the same datasets, S1 and S2 allows a fair and objective comparison between ClustMe and ClustML.
We also propose qualitative comparisons between ClustMe and ClustML and a usage scenario in the domain of genome-wide association studies (GWAS). In this domain, interesting cluster patterns can be missed because the analysts explore only the scatterplots spanning the leading principal components of the data. 11 We show that ClustML can help detect cluster patterns hidden in subspaces spanned by low-variance principal components without requiring an exhaustive search among all pairs of components.
Finally, we discuss the challenges in developing hybrid computational-perceptual VQMs for cluster patterns and argue for creating perceptual-study-based benchmark datasets for evaluating and designing new VQMs.
R codes and datasets S1 and S2 are publicly available. 26
Related work
We review related work on visual quality measures (VQMs) designed to detect and quantify cluster patterns, VQMs built from data rather than heuristics, and merging decision techniques used in Gaussian Mixture Models specific to our GMM-based VQM approach.
Visual quality measures for clustering
Visual cluster patterns have been taxonomized 27 and empirically studied. 2 These works show various characteristics, demonstrating how challenging it is to develop VQMs for such loosely defined pattern types. Several approaches have been proposed to design VQMs for grouping patterns, each focusing on some specific definition. The Clumpiness measure 22 detects clumps in a scatterplot. It is part of the Scagnostics scatterplot descriptors. 18 Other VQM approaches are based on CLIQUE clustering. 23 Existing VQMs are mostly heuristics loosely related to human perceptual data. For instance, Pandey et al. 2 showed that Scagnostics are not well-related to their participants’ judgments (they were never explicitly designed for that, though).
In contrast, ClustML is a data-driven VQM directly optimized to mimic human judgments.
Data-driven VQMs
Beyond heuristics-based approaches, data-driven approaches like ScatterNet 28 or perception-based VQMs 29 get trained on human judgment data. Recent work on data-driven approaches has shown that fine-tuning a VQM on a specific pattern to mimic perceptual judgments outperforms heuristic techniques, for instance, in the case of class separation measures for class color-coded scatterplots.30,31 These data-driven VQMs led to new applications in supervised dimensionality reduction of labeled data 32 and color optimization for scatterplots. 33
Regarding cluster patterns (i.e. no color for class labels in the scatterplot), the X-means, 34 DBSCAN, 35 and CLIQUE 36 clustering techniques, and the Clumpiness 18 VQM have been compared to the ClustMe data-driven VQM 24 on human judgment benchmark datasets S1 and S2. ClustMe outperformed all others in terms of Vanbelle kappa 37 agreement index.
Among all these approaches, only ScatterNet 28 relies on a data-driven parametric model (auto-encoder) of human judgments rather than a predefined heuristic. Parameters of the model are optimized to predict pairwise similarity judgments between monochrome scatterplots. However, no such model exists for quantifying grouping patterns.
GMM-based VQM approach
Closest to our work is ClustMe, 24 a VQM for grouping patterns based on Gaussian Mixture Models (GMMs). ClustMe builds a GMM whose components are merged to detect more complex, non-Gaussian, grouping patterns (See Refs.25,38 for an overview). Human-subject data S1 has been used to evaluate and select the best merging criterion (Demp) among seven heuristics, 39 resulting in 60% to 80% agreement between ClustMe and human perceptual judgments. However, these heuristics are designed by data analysts grounded on mathematical principles rather than directly from perceptual judgments. A more recent work 40 uses an approach similar to ClustMe but considers cluster ambiguity measured with Shannon entropy of human judgments S1 instead of cluster separation. It also uses feature engineering to generate various aggregate heuristics of the GMM parameters and analyze factors at play in visual perception of cluster ambiguity. Due to the success of machine learning approaches in many domains, we hypothesized that applying such an approach instead of heuristics or feature engineering could benefit GMM-based VQM.
ClustML uses the same GMM-based VQM architecture as ClustMe (Figure 2(a)); however, human perceptual judgment in dataset S1 are directly used to model the merging decision function by training an automatic classifier (Figure 2(b)). As a result, the merging model in ClustML reaches more than 96% agreement (almost perfect agreement) with human-judgment evaluation data. This study presents the detailed architecture and training process of the merging function that makes ClustML outperform ClustMe on a second human-judgment benchmark dataset 24 S2 designed to evaluate VQMs by ranking scatterplots based on their grouping patterns.

ClustMe and ClustML are GMM-based VQMs for cluster patterns. (a) The VQM pipeline of ClustMe uses a heuristic (Demp) as a merging decision function for each pair of GMM components. (b) ClustML follows the same pipeline as ClustMe but uses an automatic classifier as a merging decision function (green) trained on 1000 monochrome scatterplots from a previous study.
24
These scatterplots were generated in study S1 from varying the parameters
ClustML: Principle and design
We give a more technical view of the ClustMe pipeline (Figure 2(a)), then we present the main principle of ClustML’s merging function and its pre-processing and training protocols on data S1 (Figure 2(b)).
ClustMe VQM for grouping patterns
In the following, we consider a set of
The previously proposed ClustMe 24 follows the three GMM-based VQM stages illustrated in Figure 2(a):
with
The parameter vector
Stage 2: GMM components pairwise characterization.
Each pair of components
In contrast to ClustMe,
24
the main idea of ClustML is to use an automatic binary classifier trained on human judgment data to realize the merging function
ClustML merging from human judgment data
The scatterplot stimuli used in study S1 of ClustMe
24
were generated from a bivariate GMM made of two

ClustML measures the amount of grouping in scatterplots based on a classifier trained on human judgments: A bivariate Gaussian Mixture Model (Stage 1) models the distribution of the points in the scatterplot to evaluate. Each possible pair of its
Based on S1 data, we can assign label 1 or 0 to a vector
We follow the below protocol to train this classifier:
Now, we justify and detail each step of this protocol.
Summarizing human judgments
Study S1 gives several human judgments for each scatterplot. Still, we need a single judgment (class label) per scatterplot to train a binary classifier, so we summarize these judgments using a majority vote in the following way.
We form the labeled dataset
Parameter space alignment
The space
Consider a single pair
where

Parameters
Following,
38
the parameters
The data
Initial data S1 from Abbas et al.
24
In this work, we ignore
Moreover, given the points
Finally, we obtain the input data
Regarding training data
The data set
Data augmentation
The way we parameterize the pairs of Gaussian components and the way the data S1 were generated lead to a possible lack of data to cover
Data augmentation 42 is a process to enrich the data space with new data in areas where they are lacking to ensure a better prediction by the model, but without requiring additional human labeling. It relies on symmetries to justify that existing labeled data can be replicated in other places of the data space.
We consider the symmetries arising in the parametric representation of a pair of Gaussian components

Data augmentation process: (a) We expect that each set of parameters of a pair of GMM components corresponds to a unique scatterplot up to the sampling variability and vice-versa. But there are symmetries for some settings of these parameters or some scatterplots. (b) Parameters of a pair of components (A, B, C, D) can be different while they represent the exact same cluster pattern in the scatterplot respectively (A’, B’, C’, D’) due to symmetry or rotation of the group of points in the scatterplot. (c) Data augmentation involves exploiting these known symmetries to generate additional data (A’, B’, C’, D’) with labels corresponding to their symmetrical version (A, B, C, D), enriching the dataset and improving classifier generalizability. (d) GMMs can model the same scatterplot with different parameters, leading to different locations in the feature space. (e) We generate new data in the feature space leading to the same scatterplot, hence the same label. In all cases (c and e), we need to cover the feature space with labeled examples better to support the training of the classifier; otherwise, the classifier will generalize poorly in these areas (Left side, (b and d)). The human judgment dataset S1 does not contain such symmetries because it has been designed to avoid showing twice the same scatterplot to human subjects. Therefore, we need to augment these data in the feature space by duplicating labeled scatterplots considering these symmetries (Right side, (c and e)).
In contrast, we need to cover extensively the parameter space
For any aligned data
We generate the following replica to account for y-axis symmetry:
We account for the non-identifiability of the Gaussian components by swapping components
We also generate replicas to account for the cases of isotropic covariance, where
We generate replicas
The initial data and all its replicas form the extended dataset
Then we filter out any duplicate data from that set to avoid over-sampling of some data and get the final set used to train the classifier:
Training merging models
Finally, training on
The training process uses a standard approach in data-driven estimation of parameters of supervised classifiers (see details in Experiment 1).
Experiments
ClustML and ClustMe are both GMM-based VQMs. We first demonstrate our claim that the merging decision of ClustML, being trained on perceptual data, is better than the one from ClustMe based on heuristics. ClustMe merging decision Demp was the best over six other merging heuristics assessed on the benchmark dataset S1.
24
We use the same dataset S1 to get the optimal merging decision
Second, ClustMe VQM has already been proven more accurate than competitors at ranking scatterplots based on cluster patterns on the benchmark dataset S2. 24 We use the same benchmark S2 to show that ClustML VQM improves accuracy over ClustMe VQM and, hence, over previous competitors.
Finally, we propose a usage scenario of ClustML with real genomic data.
Experiment 1: training the ClustML merger on perceptual data
To get the ClustML classifier for merging, we first align and augment the data and human judgments from the available dataset, then, we present the classification techniques and protocols, and finally, select the best among the trained classifiers.
Human judgments data
The initial human judgment data from study S1
24
is summarized in Table 1. We ignore the
Classifiers and training protocol
The 996 initial scatterplots, although chosen to cover the space of parameters uniformly, were assigned unequally to the two classes by the 34 S1’s subjects. Therefore, 81.5% of the data ended up with a merging decision of
The 16,181 scatterplots
We used the R-package
Note xgbTree and gbm only used None and C pre-processing.
To evaluate and select the best classifier on the test data, we computed the Matthews Correlation Coefficient (MCC), which is regarded as immune to large class imbalance.47,48
ClustML merger is better than Demp
Table 3 lists the best setting for each classification technique. The overall best combination to realize the ClustML merging function
The best of each classification technique based on Matthew’s correlation coefficient (MCC) is given together with its specific class balancing and pre-processing compounds (See Table 2).
Treebag with upsampling and all pre-processing options is the most accurate on the 3236 test data of Experiment 1.
Vanbelle’s kappa
There are two ways to compare

ClustML merger
In case 1 favoring
Experiment 2: ClustML is better at ranking scatterplots
In this experiment, we compare ClustML and ClustMe. Both are GMM-based VQMs, as illustrated in Figure 2(a). We use them to rank pairs of scatterplot projections of real and synthetic multidimensional data from the dataset S2. 24 None of these scatterplots has been used in the training process of ClustML, nor in determining parameters of ClustMe. S2 is made of all 435 possible pairs of 30 monochrome scatterplots selected among the dataset composed of 257 scatterplots from an earlier study. 27 31 subjects have judged each pair to rank the scatterplots by the perceived group structure complexity of the displayed point patterns on a 3-category scale: “<”, “=”, “>”.
We use the
ClustML gets
Qualitative comparison of ClustML and ClustMe
ClustMe and ClustML are used to rank the 257 scatterplots from. 27 Their scores are compared in Figure 7. Scatterplots (SPs) at the bottom show details of the dots in the top view. Numbers indicate identifiers of the SPs in the dataset. The caption of the figure gives detailed observations. ClustML seems more sensitive to cluster sharpness, while ClustMe seems more sensitive to cluster numerosity.

Top: Comparison of ClustML and ClustMe scores of 257 scatterplots from. 27 Bottom: 16 selected scatterplots (SPs) from the top view with corresponding colors, numbers, and approximate locations. Both ClustMe and ClustML give equally low scores to SPs 238, 244, 169 with no strong cluster patterns and similarly high scores to SPs 221, 229, 56 with sharp and numerous cluster patterns. ClustML gives high scores to SPs 28, 36, 102, medium scores to SPs 72, 114, 130, and low scores to SPs 108, 216, while ClustMe gives them all a medium score. ClustML seems better than ClustMe at distinguishing sharp cluster patterns from slightly noisy and very noisy ones. On the other hand, ClustMe seems more sensitive to the cluster numerosity, distinguishing low numerosity clusters in SPs 30 and 240 from medium numerosity in SPs 72, 114, 130, and high numerosity in SPs 229, 56 while ClustML gives to all of them a medium-high score.
Interpretation of Vanbelle kappa with a worst-case analysis
To better understand the meaning of these ranking scores, we compute the Vanbelle index when altering 10,000 times,

Distribution of Vanbelle kappa when altering 10,000 time
Altering a decision occurs whenever the order of two of the scatterplots is changed. For instance, on average, the difference between ClustML and ClustMe is equivalent to moving a single scatterplot down or up by 20 positions in the total ordering or changing the rank of more scatterplots by a total of 20 rank alterations.
Let us consider a realistic usage scenario where the user has a time budget so they can afford to explore only the top-K scatterplots in depth in search of new insights. Moving
We can extrapolate this observation to any dataset size. For a dataset with
ClustMe alters in the worst case about 11% of all pairs of
Usage scenario with genomic data
To illustrate ClustML’s potential utility to real-world analyses, we provide a usage scenario with genomic data. In many domains of micro-biology, analysts rely on data visualization to spot interesting patterns that deserve further detailed analysis. Automatic clustering of single-cell data is known to be challenging. 51 As such, biologists often resort to dimensionality reduction and visualizing scatterplots to decide about clusters of cells and their features. 52 Alternatively, scatterplot matrices (SPLOMs) are used, for instance, to visually identify interesting groups of cells in scatterplots determined by pairs of eigengenes (axes), each eigengene coding a group of coexpressed genes. 53 In genome-wide association studies, analysts project the genetic data into principal components space for visual inspection.54,55 In all these situations, the numerous projection methods and their parameters lead to possibly hundreds of scatterplots representing different facets of the same multidimensional data, similar to the type of data used in study S2.
In this usage scenario, we consider the data from the 1000 Genome Project phase 3 dataset 56 composed of genetic data of 26 populations of about 100 individuals each. We measure kinship between individuals of each population separately, computing identity-by-descent. 55 We project these data using Multidimensional Scaling into 30 dimensions and compute ClustML on each possible pair of principal components for each of the 26 populations separately. The top view in Figure 9 shows the 11,310 SPs in the space of the ClustML score and the proportion of variance explained. The bottom view shows several SPs found exploring the highest ClustML scores in search of complex patterns that could relate to subgroups of individuals in each population.

Top: Distribution of 11,310 scatterplots from all pairs of top 30 principal components (PCs) of 1000 Genome Project kinship data in the space of ClustML score and percentage of variance explained. Dots are color-coded by the axis with the most variance in the scatterplots, showing the ones directed mainly by the first (orange), second (red), third (purple), or fourth (blue) PC (black otherwise). Solid orange dots with thick edges are SPs directed by the first and second PCs. Analysts typically limit their exploration to SPs, explaining most of the variance at the top of the summary scatterplot. Those scatterplots mostly involve top PCs only. Bottom: we show top-level SPs directed by PC1-PC2 of LWK, MSL, PJL, and STU populations (left side of each pair), and lower-level SPs directed by PC5-PC17 for LWK, PC8-PC13 for MSL, PC17-PC18 for PJL, and PC9-PC22 for STU (right side of each pair). The lower-level SPs have about the same or even a higher ClustML score than the PC1-PC2 SPs of the same population. ClustML allows the analyst to detect a cluster pattern in each population (manually lassoed blue dots, right side), which would have been missed exploring only the PC1-PC2 SP of that same population (same blue dots, left side) because the same data points do not form a cluster pattern therein. Notice pairs of SPs are displayed at the same scale, showing that cluster patterns on their right side are of similar importance to those on their left side in terms of within and between variance.
Analysts typically rely on exploring SPs spanning pairs of the top-most principal components only (orange dots with a thick edge), possibly missing essential patterns as pointed out in a recent work. 11 Thanks to ClustML, we can discover SPs spanning lower order components (e.g. down to the 17th PC for the PJL population) containing cluster patterns of potential interest to the analyst which cannot be detected in the SPs directed by the top two principal components (See Figure 9 bottom). ClustML can guide the analyst, avoiding a very costly exhaustive exploration of the 11,310 SPs.
Discussion and future work
We proposed a new data-driven, GMM-based VQM for cluster patterns. ClustML’s main novelty is to use a merging component fully trained on human judgment data.
Options for improving ClustML
The ClustML merging component uses a majority vote to transform the collective judgments of 34 participants into a binary value, losing the richness of the human judgments more akin to a probability value. The GMM also restricts the type of cluster patterns that can be quantified to a mixture of Gaussians while other types of distributions and mixtures could be explored. At last, GMM-based VQMs act at the geometric encoding stage of the visualization pipeline, ignoring the aesthetic aspects of the scatterplot like color, opacity, size, and shape of the marks; other parameters which can also impact the perception of cluster patterns.57,58 All these aspects leave room for further study and improvements.
Toward hybrid computational-perceptual models of cluster patterns
It is typically challenging to learn a model for usually unsupervised tasks such as cluster pattern quantification: there is a lack of available representative and human-annotated scatterplot data to train supervised models due to an extreme variation of the cluster patterns,2,27 and a lack of a relevant representation space common to all these data. A related approach uses a deep network model
28
trained on scatterplot images to model the human-perceived similarity between patterns in monochrome scatterplots; working with the image pixels as common representation space is an ecologically valid option but still requires collecting enough human-annotated data to cover the vast amount of possible patterns in visualization images. Other heuristic-based techniques use a binning process to reduce the dimension of the image space where to look for visual patterns.57,58 In contrast, in this work, we transformed a typically unsupervised cluster pattern quantification problem into a supervised one, observing that the GMM (Stage 1) acts as a representation model, embedding the underlying points
The development of such hybrid models raises the question of how to collect a sufficient amount and quality of perceptual data in the first place. Pattern recognition models have long been studied and trained on natural images annotated by experts or crowdsourcing. 59 But only a few studies use perceptual-data-driven approaches for pattern recognition in visualization images.28,30,31 We advocate for driving new research in that area to develop data-driven perceptual-based VQM for clusters and other visual patterns in scatterplots, parallel coordinate plots, and other visualization idioms.20,60
Beyond user study evaluations
As stated in Refs.,61,62 new algorithms like ClustML should typically be evaluated for accuracy and computing resources. But VQMs algorithms are designed to support humans by replacing them in repetitive perceptual tasks. 17 Thus, accuracy is measured by comparing VQM scores to human judgments on the same visual stimuli. Hence, the design of new VQMs naturally relies on collecting perceptual judgment data from quantitative user studies. However, when the same judgment data can be re-used for comparing different VQM algorithms because they target the same perceptual task, it is unnecessary to run a new user study for each new VQM variant. Re-using study data was first achieved successfully for the design and evaluation of data-driven VQMs for class separation in scatterplots,30,31 with data from an earlier project. 3 The present paper is a renewed demonstration of that approach, comparing ClustML with ClustMe on S1 and S2 previously collected study data, relieving us of the need to run another user study to evaluate ClustML.
The use of benchmark data for algorithmic assessment is standard in computer science 61 and benefits the replicability, fairness, and objectivity of the comparison while scaling up the design process of new techniques. 63 Benchmark data also enables the data-driven design of new models using machine-learning techniques. Benchmarking in visualization is not new for comparing algorithmic approaches, 64 but it is pretty novel when considering human judgment data. Once a benchmark of judgment data is set, it avoids investing unnecessary expert resources to design user studies and collect similar data, and it prevents the additional risk of failure in doing so. By being able to re-use previously collected judgment data S1 and S2, our work demonstrates that it is possible to generate such benchmark data once and use them multiple times for the evaluation and the design of new VQMs. Hence, we advocate for including in the design process of quantitative user studies a reflection on the possibility to re-use the collected data beyond evaluation, to enable generating and training new models. How to develop such benchmark judgment data to facilitate their re-use in visualization design is a challenging research topic worthy of investigation.
