ClustML: A measure of cluster pattern complexity in scatterplots learnt from human-labeled groupings

Abstract

Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The numbers of initial mixture components and final combined groups quantify visual cluster patterns in scatterplots. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.

Keywords

Visual quality measure, cluster pattern, data-driven, Gaussian mixture model, perceptual data

Introduction

Cluster discovery is a typical task in visual data analysis.^1,2 Clusters can have various shapes, densities, and other characteristics,³ and may exist in different data subspaces. Fully automated clustering techniques are not always satisfying and might not match the end-user expectations.⁴ Hence, end-users often use data visualization, usually in scatterplots, to find or validate clusters of interest.⁵

To support these users, several pipelines for visual cluster analysis have been proposed in Visual Analytics.⁶ One way to visually discover clusters in multidimensional (HD) spaces is to use multidimensional projection techniques,⁷ RadViz,⁸ or star coordinate plots.⁹ Examining the resulting scatterplots allows for detecting grouping patterns that could support the existence of their multidimensional counterpart. But these two-dimensional projections generate artifacts,^7,9 and often one view is not enough to reliably discover all the multidimensional cluster structures.^10,11 Moreover, clusters may exist only in subspaces of the data. Hence, visual cluster analysis requires generating projections from possibly many (weighted) combinations of the initial features and different tuning of the parameters of projection techniques.^1,12–16 Eventually, these techniques allow the analyst to spot the most interesting visual cluster patterns for further investigation.

With increasing data dimensionality, however, this process often becomes tedious and cumbersome due to the large number of projections to explore visually. Visual Quality Measures¹⁷ (VQM) can support users in such situations by automatically detecting and quantifying visual patterns.^18–20 Ranking and arranging visualizations by order of interest concerning a specific type of pattern^10,21 lets analysts focus their limited time budget on the most promising views. In this work, we are primarily interested in VQMs for cluster and grouping patterns. Several VQMs have been proposed for that purpose.^22–24 Of these, the ClustMe method²⁴ is based on merging and counting components of a Gaussian Mixture Model (GMM)²⁵ of the points in the scatterplot. It was the first GMM-based VQM for quantifying visual cluster patterns in scatterplots. ClustMe has shown the most accurate performance in ranking human perceptual judgment benchmark data among all competitors. Still, its agreement with these human perceptual rankings is in the 60% to 80% range, a relatively low score which we expect to improve by replacing the merging component of ClustMe with a new data-driven model, forming ClustML.

GMM-based VQMs like ClustMe and the proposed ClustML are made of three main stages illustrated in Figure 1:

Stage 1: Gaussian Mixture Modeling of the data points density in the scatterplot. Each Gaussian component of the mixture represents a local subset of the data. The model assumes the data are independently sampled from isolated Gaussian distributions or clusters. A data-driven process estimates the mean, the covariance matrix, and the relative contribution of each component to the global density distribution of the data points.

Stage 2: GMM components pairwise characterization of overlap. When two Gaussian components overlap too much, it is assumed that they likely belong to the same local cluster. Hence, evaluating these pairwise overlaps from the data and the parameters of the GMM provides additional characteristics of interest to quantify cluster patterns.

Stage 3: Visual quality measure computation. All previous quantities are aggregated to form the final VQM score that quantifies visual cluster patterns in the scatterplot.

Figure 1.

A visual quality measure (VQM) based on a Gaussian Mixture Model (GMM) for cluster patterns in scatterplots is made of three stages: (1) a data-driven process estimates the parameters of a GMM of the data points density in the scatterplot; (2) the degree of overlap of each pair of GMM components is computed to provide additional characteristics of interest to quantify cluster patterns; (3) the data points, the GMM parameters, and the pairwise quantities are aggregated to compute the visual quality measure. ClustMe and ClustML are both GMM-based VQMs, differing in the way they quantify pairwise overlap of GMM components (Stage 2).

In ClustMe, the overlap evaluation (Stage 2) is based on a computational heuristic called Demp that decides when two GMM components overlap too much; such components are then merged, or symbolically linked together, to represent a single cluster instead of two. The ClustMe VQM score is a linear combination of the number of GMM components and the number of connected components of the graph formed by the Demp links, with more weight given to the latter.

In this work, in contrast to ClustMe, we set out to develop ClustML, a new GMM-based VQM whose overlapping evaluation and merging decision (Stage 2) is learned from human judgments of cluster patterns in scatterplots rather than using the Demp heuristic.

We demonstrate the superiority of ClustML against ClustMe, its main competitor, in terms of agreement with human judgments on two perceptual studies datasets, S1 and S2, previously collected for the development and evaluation of ClustMe²⁴:

Dataset S1 is a set of binary judgments from 34 subjects tasked to decide if they can see one or more-than-one clusters within each of 1000 scatterplots generated by sampling two-component GMMs with various parameters.

Dataset S2 is independent of S1. It is a set of ternary judgments from 31 subjects tasked to decide for each of 435 pairs of scatterplots if one or the other shows the most complex cluster pattern or if both are equally complex.

In the ClustMe paper, S1 is used to select the best merging decision model among a finite set of 7 heuristics. In contrast, in this work, S1 is used to train an automatic classifier to mimic human merging decisions. In both the ClustMe paper and this work, S2 is used to evaluate the resulting GMM-based VQM for a pairwise ranking task. Using the same datasets S1 and S2 allows a fair and objective comparison between ClustMe and ClustML.

We also propose qualitative comparisons between ClustMe and ClustML and a usage scenario in the domain of genome-wide association studies (GWAS). In this domain, interesting cluster patterns can be missed because the analysts explore only the scatterplots spanning the leading principal components of the data.¹¹ We show that ClustML can help detect cluster patterns hidden in subspaces spanned by low-variance principal components without requiring an exhaustive search among all pairs of components.

Finally, we discuss the challenges in developing hybrid computational-perceptual VQMs for cluster patterns and argue for creating perceptual-study-based benchmark datasets for evaluating and designing new VQMs.

The R code and the datasets S1 and S2 are publicly available.²⁶

Related work

We review related work on visual quality measures (VQMs) designed to detect and quantify cluster patterns, VQMs built from data rather than heuristics, and merging decision techniques used in Gaussian Mixture Models specific to our GMM-based VQM approach.

Visual quality measures for clustering

Visual cluster patterns have been taxonomized²⁷ and empirically studied.² These works show various characteristics, demonstrating how challenging it is to develop VQMs for such loosely defined pattern types. Several approaches have been proposed to design VQMs for grouping patterns, each focusing on some specific definition. The Clumpiness measure²² detects clumps in a scatterplot. It is part of the Scagnostics scatterplot descriptors.¹⁸ Other VQM approaches are based on CLIQUE clustering.²³ Existing VQMs are mostly heuristics loosely related to human perceptual data. For instance, Pandey et al.² showed that Scagnostics are not well-related to their participants’ judgments (they were never explicitly designed for that, though).

In contrast, ClustML is a data-driven VQM directly optimized to mimic human judgments.

Data-driven VQMs

Beyond heuristics-based approaches, data-driven approaches like ScatterNet²⁸ or perception-based VQMs²⁹ are trained on human judgment data. Recent work on data-driven approaches has shown that fine-tuning a VQM on a specific pattern to mimic perceptual judgments outperforms heuristic techniques, for instance, in the case of class separation measures for class color-coded scatterplots.^30,31 These data-driven VQMs led to new applications in supervised dimensionality reduction of labeled data³² and color optimization for scatterplots.³³

Regarding cluster patterns (i.e. no color for class labels in the scatterplot), the X-means,³⁴ DBSCAN,³⁵ and CLIQUE³⁶ clustering techniques, and the Clumpiness¹⁸ VQM have been compared to the ClustMe data-driven VQM²⁴ on human judgment benchmark datasets S1 and S2. ClustMe outperformed all others in terms of Vanbelle kappa³⁷ agreement index.

Among all these approaches, only ScatterNet²⁸ relies on a data-driven parametric model (auto-encoder) of human judgments rather than a predefined heuristic. Parameters of the model are optimized to predict pairwise similarity judgments between monochrome scatterplots. However, no such model exists for quantifying grouping patterns.

GMM-based VQM approach

Closest to our work is ClustMe,²⁴ a VQM for grouping patterns based on Gaussian Mixture Models (GMMs). ClustMe builds a GMM whose components are merged to detect more complex, non-Gaussian, grouping patterns (See Refs.^25,38 for an overview). Human-subject data S1 has been used to evaluate and select the best merging criterion (Demp) among seven heuristics,³⁹ resulting in 60% to 80% agreement between ClustMe and human perceptual judgments. However, these heuristics are designed by data analysts grounded on mathematical principles rather than directly from perceptual judgments. A more recent work⁴⁰ uses an approach similar to ClustMe but considers cluster ambiguity measured with Shannon entropy of human judgments S1 instead of cluster separation. It also uses feature engineering to generate various aggregate heuristics of the GMM parameters and analyze factors at play in visual perception of cluster ambiguity. Due to the success of machine learning approaches in many domains, we hypothesized that applying such an approach instead of heuristics or feature engineering could benefit GMM-based VQM.

ClustML uses the same GMM-based VQM architecture as ClustMe (Figure 2(a)); however, the human perceptual judgments in dataset S1 are directly used to model the merging decision function by training an automatic classifier (Figure 2(b)). As a result, the merging model in ClustML reaches more than 96% agreement (almost perfect agreement) with human-judgment evaluation data. This study presents the detailed architecture and training process of the merging function that makes ClustML outperform ClustMe on a second human-judgment benchmark dataset²⁴ S2 designed to evaluate VQMs by ranking scatterplots based on their grouping patterns.

Figure 2.

ClustMe and ClustML are GMM-based VQMs for cluster patterns. (a) The VQM pipeline of ClustMe uses a heuristic (Demp) as a merging decision function for each pair of GMM components. (b) ClustML follows the same pipeline as ClustMe but uses an automatic classifier as a merging decision function (green) trained on 1000 monochrome scatterplots from a previous study.²⁴ These scatterplots were generated in study S1 by varying the parameters $ϕ_{uv}$ of a GMM with 2 components and were labeled by 34 subjects (H₁, …, H₃₄) seeing one (H_n = 0) or more-than-one (H_n = 1) clusters.

ClustML: Principle and design

We give a more technical view of the ClustMe pipeline (Figure 2(a)), then we present the main principle of ClustML’s merging function and its pre-processing and training protocols on data S1 (Figure 2(b)).

ClustMe VQM for grouping patterns

In the following, we consider a set of $N$ data points $X = \{x_{1}, \dots, x_{N}\} \in (R^{2})^{N}$ in a 2-dimensional real space, represented graphically as a scatterplot $SP (X)$ .

The previously proposed ClustMe²⁴ follows the three GMM-based VQM stages illustrated in Figure 2(a):

Stage 1: Gaussian Mixture Modeling. The probability density of the data points is modeled with a Gaussian Mixture Model $M (X, ϕ, K)$ ^25,38 with $K$ bivariate Gaussian distribution components $g$ . The probability density at any point $x \in R^{2}$ is estimated given model parameter $ϕ = (π_{1}, \dots, π_{K}, μ_{1}, \dots, μ_{K}, Σ_{1}, \dots, Σ_{K})$ by:

p (x | ϕ, K) = \sum_{k = 1}^{K} π_{k} g (x, μ_{k}, Σ_{k})

(1)

with $g (x, μ, Σ) = det (2 π Σ)^{- \frac{1}{2}} e^{- \frac{1}{2} {(x - μ)}^{⊤} Σ^{- 1} (x - μ)}$ and $\sum_{k} π_{k} = 1$ .

The parameter vector $ϕ_{K}$ controls the location $μ_{k}$ , shape $Σ_{k}$ and weight $π_{k}$ of each component of $M$ . The Bayesian Information Criterion defined by Schwarz⁴¹ is maximized to determine the best model $M^{*}$ , with the number of components $K^{*}$ and parameter $ϕ^{*}$ .
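As a minimal illustration of equation (1), the following sketch evaluates the mixture density at a point using only the Python standard library. The function names `gaussian_pdf_2d` and `mixture_density` and the nested-tuple layout of the covariance matrices are illustrative assumptions; they are not part of the ClustMe or ClustML code, and the sketch does not cover parameter estimation or BIC-based model selection.

```python
import math


def gaussian_pdf_2d(x, mu, sigma):
    """Bivariate Gaussian density g(x, mu, Sigma); sigma is ((a, b), (c, d))."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    # In two dimensions, det(2*pi*Sigma)^(-1/2) = 1 / (2*pi*sqrt(det(Sigma))).
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_density(x, weights, means, covariances):
    """Density p(x | phi, K) of equation (1): weighted sum of K Gaussian components."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_pdf_2d(x, m, s)
        for w, m, s in zip(weights, means, covariances)
    )
```

For example, `mixture_density((0.0, 0.0), [0.5, 0.5], [(0.0, 0.0), (3.0, 0.0)], [((1.0, 0.0), (0.0, 1.0))] * 2)` evaluates an equal-weight mixture of two unit-covariance components at the center of the first one.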

Stage 2: GMM components pairwise characterization.

Each pair of components $(u, v) \in \{1, \dots, K^{*}\}^{2}, u \neq v$ of $M^{*}$ is independently screened by the Demp merging heuristic $G_{Demp}$ to decide if it forms a single cluster locally. $G_{Demp} (X, ϕ^{*}, u, v) \in \{0, 1\}$ takes the binary decision to merge (1) or not (0) the two components $u$ and $v$ based on the optimal parameter $ϕ^{*}$ and data $X$ .

Stage 3: Score computation. The adjacency matrix ${(G)}_{u, v}$ forms a graph whose vertices are the $K^{*}$ components of $M^{*}$ ; we denote by $M (\leq K^{*})$ its number of connected components. Finally, the pair $VQ M_{ClustMe} (X) = (M, K^{*})$ quantifies the complexity of the visual cluster pattern in $SP (X)$ : scatterplots $SP (X_{i})$ are ranked first by $M_{i}$ then by $K_{i}^{*}$ for equal $M_{i}$ . In other words, ClustMe considers that a scatterplot $SP (X_{h})$ displays a more complex cluster pattern than $SP (X_{l})$ if

M_{l} < M_{h} or (M_{l} = M_{h} and K_{l}^{*} < K_{h}^{*})

$VQ M_{ClustMe} (X_{i}) = M_{i} + \frac{K_{i}^{*}}{1 + K_{\max}^{*}}$ can be used instead, with $K_{\max}^{*}$ the maximum number of GMM components obtained across all $SP (X_{i})$ to be compared.
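Stage 3 can be sketched in a few lines, assuming the merging decisions of Stage 2 are already available as a list of merged component pairs. The helper names (`count_connected_components`, `vqm_pair`, `vqm_scalar`) and the 0-based component indices are assumptions made for this illustration. The number $M$ of connected components is obtained with a small union-find structure.

```python
def count_connected_components(k, merge_pairs):
    """Number M of connected components of the merge graph on k GMM components.

    merge_pairs lists the pairs (u, v), with 0 <= u, v < k, that Stage 2 decided to merge.
    """
    parent = list(range(k))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u, v in merge_pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return len({find(i) for i in range(k)})


def vqm_pair(k, merge_pairs):
    """The pair (M, K*); Python compares tuples lexicographically, as in the ranking rule."""
    return (count_connected_components(k, merge_pairs), k)


def vqm_scalar(m, k, k_max):
    """Scalar form M + K* / (1 + K*_max), which preserves the ordering of the pairs."""
    return m + k / (1 + k_max)
```

Because $K^{*} / (1 + K_{\max}^{*}) < 1$ , the scalar form never lets a larger $K^{*}$ outweigh a difference in $M$ , so sorting by either form gives the same ranking.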

In contrast to ClustMe,²⁴ the main idea of ClustML is to use an automatic binary classifier trained on human judgment data to realize the merging function $G_{ClustML}$ instead of $G_{Demp}$ in Stage 2. All other processes in the above stages are identical for ClustMe and ClustML. However, for the same data $X$ and GMM $M^{*}$ (Stage 1), the different $G_{ClustML}$ merging function (Stage 2) can lead to a different value of $M$ and finally, a different $VQ M_{ClustML}$ score (Stage 3). We detail the design of $G_{ClustML}$ and its training protocol in the next sections.

ClustML merging from human judgment data

The scatterplot stimuli used in study S1 of ClustMe²⁴ were generated from a bivariate GMM made of two $(K = 2)$ Gaussian components $u$ and $v$ (Figure 2(b)), varying parameters $ϕ_{uv}$ . The parameter space $S$ spanned by the vectors $ϕ_{uv}$ contains all possible mixtures of two bivariate Gaussian distributions. A point $ϕ_{uv}^{[i]}$ in that space determines a unique mixture distribution $M_{ϕ_{uv}^{[i]}}$ from which one can randomly sample $N$ points $X^{[i]}$ to generate a 2D scatterplot $SP (X^{[i]})$ with a unique cluster pattern up to sampling variation. As illustrated in Figure 3, in some regions of this multidimensional parameter space $S$ , the generated scatterplots will show two clearly separated Gaussian clusters (top left blue area), while in other regions the scatterplots will show a single blob of two strongly overlapping Gaussian distributions (bottom right red area). How can we decide about merging two components $u$ and $v$ depending on the position in that space, that is, depending on the values of the parameter $ϕ_{uv}$ ?

Figure 3.

ClustML measures the amount of grouping in scatterplots based on a classifier trained on human judgments: A bivariate Gaussian Mixture Model (Stage 1) models the distribution of the points in the scatterplot to evaluate. Each possible pair of its $K^{*}$ Gaussian components is assessed for merging (Stage 2). For that purpose, and as the main novelty of this work, a binary classifier $G$ has been trained in the parameter space $Φ_{uv}$ of component pairs $(u, v)$ (red and blue dots on the right; actually, this space has 8 dimensions). Scatterplots (solid red and blue frames) generated by 1000 pairs have been labeled in a previous experiment²⁴ by 34 subjects tasked to decide whether each scatterplot shows one (red) or more-than-one (blue) clusters. Five such “Training” scatterplots with plain-line blue or red frames are displayed, and four others in the right column with the percentage of subjects seeing more-than-one cluster. After training, the classifier $G_{ClustML}$ automatically predicts the merging decision (green solid line separating blue and red areas) that humans would take for yet unseen 2-Gaussian scatterplots (dashed green frames). This GMM component pairwise merging decision generates a set of $M$ connected components (purple frame). Finally, the ClustML VQM (Stage 3) of the evaluated scatterplot is given by the pair $(M, K^{*})$ ; the higher the score, the more complex the grouping pattern.

Based on S1 data, we can assign label 1 or 0 to a vector $ϕ_{uv}^{[i]}$ for which most participants judged the scatterplot $SP (X^{[i]})$ was showing one or more-than-one clusters respectively. Then 1 codes for the merge decision, while 0 codes for the do-not-merge decision. ClustML uses such data $ϕ_{uv}^{[i]}$ to train a binary classifier $G_{ClustML}$ in the space $S$ to model this human judgment. Finally, the classifier computes a merging decision $G_{ClustML} (X, ϕ, u, v)$ for any possible pair $(u, v)$ of components in $M$ projected in $S$ .

We follow the protocol below to train this classifier:

Summarize human judgments from the study S1 to form the labeled dataset $X_{uv}$ ;

Align space $S$ of the dataset $X_{uv}$ with the parameter space of the density model obtained at Stage 1;

Augment the dataset $X_{uv}$ to ensure better generalization of the classifier $G_{ClustML}$ ;

Train the classifier $G_{ClustML}$ with the augmented data to get the optimal merging decision at Stage 2.

Now, we justify and detail each step of this protocol.

Summarizing human judgments

Study S1 gives several human judgments for each scatterplot. Still, we need a single judgment (class label) per scatterplot to train a binary classifier, so we summarize these judgments using a majority vote in the following way.

We form the labeled dataset $X_{uv} = \{(input, label)\}_{i} = \{(ϕ_{uv}^{[i]}, H^{[i]})\}_{i}$ by pairing the summary $H^{[i]}$ of 34 perceptual judgments $(H_{1}^{[i]}, \dots, H_{34}^{[i]})$ of cluster patterns in scatterplot stimuli $SP (X^{[i]})$ collected from study S1 together with the parameters $ϕ_{uv}^{[i]}$ of the 2-dimension 2-component GMM from which the points $X^{[i]}$ were sampled. We summarize the 34 human judgments into a binary class $H^{[i]} \in \{0, 1\}$ by applying a majority vote. Label $H^{[i]} = 0$ (do not merge) is assigned to input $ϕ_{uv}^{[i]}$ if most of the judgments on $SP (X^{[i]})$ are more-than-one cluster. Label $H^{[i]} = 1$ (merge) is assigned otherwise. A training sample $χ_{i}$ is a pair $(ϕ_{uv}^{[i]}, H^{[i]}) \in X_{uv}$ . We denote by $Φ_{uv} = \{ϕ_{uv}^{[i]}\}_{i}$ the unlabeled part of these data.
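The majority vote can be written as a short function. This sketch assumes that individual judgments are coded as in Figure 2(b) (0 for one cluster, 1 for more-than-one), and it reads "assigned otherwise" as sending ties, which are possible with 34 subjects, to the merge label. The function name `summarize_judgments` is hypothetical.

```python
def summarize_judgments(judgments):
    """Summarize individual judgments into a merge label H.

    judgments: iterable of 0/1 values, with 1 meaning "more-than-one cluster"
    (assumed coding). Returns 0 (do not merge) if a strict majority saw
    more-than-one cluster, and 1 (merge) otherwise, ties included.
    """
    votes = list(judgments)
    if not votes:
        raise ValueError("at least one judgment is required")
    more_than_one = sum(1 for h in votes if h == 1)
    return 0 if more_than_one > len(votes) / 2 else 1
```

Note that the summary label is a merge decision, so it takes the opposite value of the majority individual judgment under this coding.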

Parameter space alignment

The space $S$ spanned by vectors $Φ_{uv}$ of the GMM used to generate scatterplots in S1 does not match the space spanned by the parameters $ϕ$ of the GMM (Stage 1) of $X$ . We need to transform $ϕ$ into $ϕ_{uv}$ to get the labeled data $X_{uv}$ .

Consider a single pair $(u, v)$ of components of $M^{*}$ . The space spanned by the parameters related to $u$ and $v$ only, $ϕ_{uv} = (π_{u}, π_{v}, μ_{u}, μ_{v}, Σ_{u}, Σ_{v}) \subseteq ϕ$ has a fixed dimension $(| ϕ_{uv} | = 14)$ independent of $K^{*}$ , which makes it suitable for standard vector-based machine learning. $ϕ_{uv}$ can be further reduced to a set of 8 independent parameters (Figure 4) due to cross-dependencies:

ϕ_{uv} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v}) \in [0, 1] \times (R^{+})^{5} \times [0, π / 2]^{2}

(2)

where $τ = π_{u} / (π_{u} + π_{v})$ , and $μ = | | μ_{v} - μ_{u} | |$ . In the sequel, $S$ is the space spanned by these 8-dimension vectors $ϕ_{uv}$ .

Figure 4.

Parameters $ϕ_{uv} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v})$ of a pair of Gaussian components $(u, v)$ of $M^{*}$ control the direction $(θ)$ , the probability $(τ)$ , the extent $(σ)$ , and the distance $(μ)$ of the two component distributions, hence the (perceptual) overlap of their sampled data. These parameter vectors span the feature space $S$ (Figure 3), which is the input space of the classifier $G_{ClustML}$ that decides whether to merge $u$ and $v$ .

As in previous work,³⁸ the parameters $σ$ and $θ$ in $ϕ_{uv}$ come from the Singular Value Decomposition of the covariance $Σ_{i}$ $(i \in \{u, v\})$ into the diagonal “scaling” matrix $S_{i}$ (whose squared entries are the eigenvalues of $Σ_{i}$ ) and the “rotation” matrix of eigenvectors $R_{i}$ : $Σ_{i} = R_{i} S_{i}^{2} R_{i}^{T}$ . $S_{i}$ holds the independent scales $σ_{i}^{x}$ and $σ_{i}^{y}$ along the orthogonal axes $x$ and $y$ respectively. This gives an elliptic shape to the mixture components, with width and length driven by $σ_{i}^{x}$ and $σ_{i}^{y}$ , whenever $σ_{i}^{x} \neq σ_{i}^{y}$ . $R_{i}$ is a rotation matrix of angle $θ_{i}$ which orients the elliptic shape with respect to the $x$ -axis:

S_{i} = (\begin{matrix} σ_{i}^{x} & 0 \\ 0 & σ_{i}^{y} \end{matrix}) R_{i} = (\begin{matrix} \cos θ_{i} & - \sin θ_{i} \\ \sin θ_{i} & \cos θ_{i} \end{matrix})

(3)

The data $X$ of any scatterplot in study S1 were generated with a GMM by specifying rotation $(θ_{i} \in [0, π / 2])$ and scaling $(σ_{i}^{x, y})$ parameters to get the covariance matrix $Σ_{i} = R_{i} S_{i}^{2} R_{i}^{T} = f (θ_{i}, σ_{i}^{x}, σ_{i}^{y})$ (see equation (3), Table 1, Figure 4). Now consider a scatterplot $SP (Y)$ to be scored with ClustML, and $M^{*} (Y)$ the best GMM modeling the density of its points $Y$ . The estimated covariance matrix ${\hat{Σ}}_{i}$ of each component $i$ of $M^{*} (Y)$ must be decomposed using SVD into ${\hat{S}}_{i}$ and ${\hat{R}}_{i}$ (equation (3)), from which we get the angle ${\hat{θ}}_{i}$ and the scaling parameters ${\hat{σ}}_{i}^{x, y}$ . Unfortunately, the estimated angle ${\hat{θ}}_{i}$ lies in the range $[- π / 2, π / 2]$ , whereas the training angles $θ_{i}$ lie in $[0, π / 2]$ . To align the angles $θ_{i}$ of the training data with the estimated angles ${\hat{θ}}_{i}$ , we pass each triplet $(θ_{i}, σ_{i}^{x}, σ_{i}^{y})$ of all training data $ϕ_{uv}$ through the SVD composition-decomposition process: $({θ'}_{i}, {σ'}_{i}^{x}, {σ'}_{i}^{y}) = SVD (f (θ_{i}, σ_{i}^{x}, σ_{i}^{y}))$ .
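The composition-decomposition process can be sketched for the 2x2 case with closed-form expressions instead of a library SVD routine. Two conventions here are assumptions of this illustration and may differ from the authors' R implementation: the larger scale is returned first (as an SVD sorted by decreasing singular values would do), and the angle is that of the major axis, taken in $(- π / 2, π / 2]$ . The function names are hypothetical.

```python
import math


def compose_covariance(theta, sx, sy):
    """f(theta, sx, sy) = R S^2 R^T, returned as ((a, b), (b, d))."""
    c, s = math.cos(theta), math.sin(theta)
    a = c * c * sx * sx + s * s * sy * sy
    b = c * s * (sx * sx - sy * sy)
    d = s * s * sx * sx + c * c * sy * sy
    return ((a, b), (b, d))


def decompose_covariance(cov):
    """Return (theta', sx', sy') with sx' >= sy' and theta' in (-pi/2, pi/2]."""
    (a, b), (_, d) = cov
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam_max = mean + radius
    lam_min = max(mean - radius, 0.0)  # guard against tiny negative round-off
    # Orientation of the eigenvector associated with the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    return theta, math.sqrt(lam_max), math.sqrt(lam_min)
```

Under these conventions, the triplet $(π / 8, 1, 2)$ comes back as $(- 3 π / 8, 2, 1)$ : the same ellipse, described by its major axis first and with an angle folded into the estimated range.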

Table 1.

Initial data S1 from Abbas et al.²⁴ $ϕ_{uv}$ are 1000 unique parameter sets $ϕ_{uv}^{[i]}$ picked randomly among the following values.

Param.	Description	Values
$τ$	Prior proba. of $u$	{0.1, 0.2, 0.3, 0.4, 0.5}
$μ$	$u$ to $v$ Euclid. dist.	{0, 1, 2, 3, 5, 8, 13, 21}
$σ_{u, v}^{x, y}$	Scaling factors	{0.5, 1, 1.5, 2, 2.5, 3}
$θ_{u}$ , $θ_{v}$	Rotation angles	{0, $π /$ 8, $π /$ 4, 3 $π /$ 8, $π /$ 2}
$α$	Rot. angle of $SP (X)$	{0, $π /$ 2, 5 $π /$ 4}
$N$	Num. of points $X$	{100, 1000}

In this work, we ignore $α$ and $N$ . This leaves 996 unique sets of parameters forming $X_{uv}^{align}$ .

Moreover, given the points $Y$ of a new scatterplot, the optimal parameters $ϕ_{uv}^{*} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v})$ , obtained from a pair $(u, v)$ of components of the best model $M^{*} (Y)$ , need to be scaled. Indeed, it is likely that the scale of the points $Y$ is orders of magnitude bigger or smaller than that of the points $X$ in the scatterplots of S1. This scaling factor impacts parameters $μ$ and $σ$ . We must also correct the angles $θ_{u}$ and $θ_{v}$ : in the training data they are defined relative to the axis orthogonal to $(μ_{u} - μ_{v})$ , whereas the rotation matrix $R_{i}$ of the SVD decomposition of the inferred $Σ_{i}$ is relative to the vector space of the points $Y$ . Therefore, for any parameter vector $ϕ_{uv}^{*}$ inferred from a new scatterplot $SP (Y)$ , we first compute the correcting angle $β = ∠ (\vec{μ_{u} μ_{v}}, \vec{y})$ between the vector joining the two components’ centers and the y-axis of points $Y$ . We add $β$ to all $θ$ angles. Then we rescale the parameters $σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}$ and $μ$ by dividing them by their maximum $s = \max (\{μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}\})$ .

Finally, we obtain the input data $ϕ_{uv}^{* align}$ to the merging function (classifier) $G_{ClustML}$ :

\begin{matrix} ϕ_{uv}^{* align} = align (ϕ_{uv}^{*}) \\ = (τ, \frac{μ}{s}, \frac{σ_{u}^{x}}{s}, \frac{σ_{u}^{y}}{s}, \frac{σ_{v}^{x}}{s}, \frac{σ_{v}^{y}}{s}, θ_{u} + β, θ_{v} + β) \end{matrix}

(4)
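Equation (4) can be sketched as follows, starting from the two component centers and the decomposed scales and angles. The text does not fix the sign convention of $β$ ; this sketch assumes $β$ is the rotation that brings the vector from $μ_{u}$ to $μ_{v}$ onto the positive y-axis, and it does not wrap the corrected angles back into a fixed range. The function name `align` mirrors the notation of equation (4), but its signature is an assumption.

```python
import math


def align(tau, mu_u, mu_v, sig_ux, sig_uy, sig_vx, sig_vy, theta_u, theta_v):
    """Aligned feature vector of equation (4) for one pair of components (u, v)."""
    dx, dy = mu_v[0] - mu_u[0], mu_v[1] - mu_u[1]
    mu = math.hypot(dx, dy)  # Euclidean distance between the two centers
    # Assumed convention: rotating the scene by beta puts mu_u -> mu_v along +y.
    beta = math.pi / 2 - math.atan2(dy, dx) if mu > 0 else 0.0
    # Common scale factor: the largest of the distance and the four scales.
    s = max(mu, sig_ux, sig_uy, sig_vx, sig_vy)
    return (
        tau,
        mu / s,
        sig_ux / s,
        sig_uy / s,
        sig_vx / s,
        sig_vy / s,
        theta_u + beta,
        theta_v + beta,
    )
```

After this step, the largest of the five rescaled quantities equals 1, which makes pairs inferred from scatterplots of very different numerical ranges comparable in $S$ .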

Regarding training data $Φ_{uv}$ from study S1, we first apply the composition-decomposition process $SVD \circ f$ to get ${θ'}_{i}$ , then we rescale $μ$ and $σ$ . However, correcting the rotation by $β$ is unnecessary as the y-axis is, by definition, directed by the components’ centers $(β = 0)$ (see Figure 4):

\begin{array}{l} X_{uv}^{align} = \{ (τ, \frac{μ}{s}, \frac{{σ'}_{u}^{x}}{s}, \frac{{σ'}_{u}^{y}}{s}, \frac{{σ'}_{v}^{x}}{s}, \frac{{σ'}_{v}^{y}}{s}, {θ'}_{u}, {θ'}_{v}, H_{i}) \\ \mid (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v}, H_{i}) \in X_{uv}, \\ ({θ'}_{j}, {σ'}_{j}^{x}, {σ'}_{j}^{y}) = SVD (f (θ_{j}, σ_{j}^{x}, σ_{j}^{y})), \forall j \in \{u, v\}, \\ s = \max (\{μ, {σ'}_{u}^{x}, {σ'}_{u}^{y}, {σ'}_{v}^{x}, {σ'}_{v}^{y}\}) \} \end{array}

(5)

The data set $X_{uv}^{align}$ forms the aligned data to be augmented before training the classifier $G_{ClustML}$ .

Data augmentation

The way we parameterize the pairs of Gaussian components and the way the data S1 were generated lead to a possible lack of data to cover $S$ sufficiently and get a more generalizable classifier $G_{ClustML}$ .

Data augmentation⁴² is a process to enrich the data space with new data in areas where they are lacking to ensure a better prediction by the model, but without requiring additional human labeling. It relies on symmetries to justify that existing labeled data can be replicated in other places of the data space.

We consider the symmetries arising in the parametric representation of a pair of Gaussian components $(u, v)$ in $S$ (see Figure 5). Indeed, the parameters used to generate the data samples in study S1 were restricted to a part of $S$ so that the same stimulus would not be shown to participants under different parameter values. For instance, the scatterplot $SP (ϕ_{uv})$ generated by $ϕ_{uv} = (τ, \dots, θ_{u}, θ_{v})$ is identical, up to sampling variation, to the one generated by $ϕ'_{uv} = (τ, \dots, θ_{u}, θ_{v} + π)$ despite $ϕ_{uv} \neq ϕ'_{uv}$ .

Figure 5.

Data augmentation process: (a) We expect that each set of parameters of a pair of GMM components corresponds to a unique scatterplot up to the sampling variability and vice-versa. But there are symmetries for some settings of these parameters or some scatterplots. (b) Parameters of a pair of components (A, B, C, D) can be different while they represent the exact same cluster pattern in the scatterplot (A’, B’, C’, D’ respectively) due to symmetry or rotation of the group of points in the scatterplot. (c) Data augmentation involves exploiting these known symmetries to generate additional data (A’, B’, C’, D’) with labels corresponding to their symmetrical version (A, B, C, D), enriching the dataset and improving classifier generalizability. (d) GMMs can model the same scatterplot with different parameters, leading to different locations in the feature space. (e) We generate new data in the feature space leading to the same scatterplot, hence the same label. In all cases (c and e), we need to better cover the feature space with labeled examples to support the training of the classifier; otherwise, the classifier will generalize poorly in these areas (left side, (b and d)). The human judgment dataset S1 does not contain such symmetries because it has been designed to avoid showing the same scatterplot twice to human subjects. Therefore, we need to augment these data in the feature space by duplicating labeled scatterplots considering these symmetries (right side, (c and e)).

In contrast, we need to cover extensively the parameter space $S$ with labeled data to get the best possible generalization from the classifier $G_{ClustML}$ , that is, predicting accurately human judgments for yet unseen scatterplots. Indeed, two scatterplots with perceptually very similar point distributions $X_{A}$ and $X_{B}$ will likely get the same human judgment $H_{A} = H_{B}$ . However, $ϕ_{A}$ can end up very close in $S$ to a training data $ϕ_{T} \in Φ_{uv}$ , while $ϕ_{B}$ can end up far from it due to the inference process to get $M^{*}$ . Thus, a classifier trained on $(ϕ_{T}, H_{T})$ will be able to predict $H_{A} \approx H_{T}$ but will not be good at predicting $H_{B}$ . Therefore, we propose to augment the data $X_{uv}$ by replicating some of the training data $χ_{i} = (ϕ_{i}, H_{i})$ in different locations $ϕ_{i'}$ of $S$ to better cover it, getting new data $χ_{i'} = (ϕ_{i'}, H_{i})$ with same label.

For any aligned data $χ_{i} = (ϕ_{i}, H_{i}) \in X_{uv}^{align}$ :

χ_{i} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v}, H_{i})

(6)

We generate the following replica to account for y-axis symmetry:

\begin{matrix} (6) \Rightarrow χ_{i}^{-} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}^{x}, σ_{v}^{y}, - θ_{u}, - θ_{v}, H_{i}) \end{matrix}

(7)

We account for the non-identifiability of the Gaussian components by swapping components $u$ and $v$ in cases (6) and (7) above:

\begin{matrix} (6) \Rightarrow χ_{i}^{swap} = (1 - τ, μ, σ_{v}^{x}, σ_{v}^{y}, σ_{u}^{x}, σ_{u}^{y}, θ_{v}, θ_{u}, H_{i}) \\ (7) \Rightarrow χ_{i}^{- swap} = (1 - τ, μ, σ_{v}^{x}, σ_{v}^{y}, σ_{u}^{x}, σ_{u}^{y}, - θ_{v}, - θ_{u}, H_{i}) \end{matrix}

(8)

We also generate replicas to account for the cases of isotropic covariance, where $σ_{u} = σ_{u}^{x} = σ_{u}^{y}$ or $σ_{v} = σ_{v}^{x} = σ_{v}^{y}$ . So, for any data

χ_{i} = (τ, μ, σ_{u}, σ_{u}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v}, H_{i})

(9)

or χ_{i} = (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}, σ_{v}, θ_{u}, θ_{v}, H_{i})

(10)

We generate replicas

\begin{matrix} (9) \Rightarrow χ_{i}^{σ_{u}} = \{ (τ, μ, σ_{u}, σ_{u}, σ_{v}^{x}, σ_{v}^{y}, θ_{u}, θ_{v}, H_{i}) \\ \mid θ_{u} \in \{ - \frac{π}{2}, - \frac{3 π}{8}, - \frac{π}{4}, - \frac{π}{8}, 0, \frac{π}{8}, \frac{π}{4}, \frac{3 π}{8}, \frac{π}{2} \} \} \\ (10) \Rightarrow χ_{i}^{σ_{v}} = \{ (τ, μ, σ_{u}^{x}, σ_{u}^{y}, σ_{v}, σ_{v}, θ_{u}, θ_{v}, H_{i}) \\ \mid θ_{v} \in \{ - \frac{π}{2}, - \frac{3 π}{8}, - \frac{π}{4}, - \frac{π}{8}, 0, \frac{π}{8}, \frac{π}{4}, \frac{3 π}{8}, \frac{π}{2} \} \} \end{matrix}

(11)

The initial data and all its replicas form the extended dataset $X_{uv}^{all}$ :

\begin{matrix} X_{uv}^{all} = {χ_{i}, χ_{i}^{-}, χ_{i}^{swap}, χ_{i}^{- swap}, χ_{i}^{σ_{u}}, χ_{i}^{σ_{v}} | χ_{i} \in X_{uv}^{align}} \end{matrix}

(12)

Then we filter out any duplicate data from that set to avoid over-sampling of some data and get the final set used to train the classifier:

\begin{matrix} X_{uv}^{uni} = Unique (X_{uv}^{all}) \end{matrix}

(13)
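The augmentation of equations (7) to (13) can be sketched in a few lines of Python. This is only an illustrative sketch, not the released implementation: the tuple layout follows equation (6), while the function names and the rounding used to detect duplicates are assumptions introduced here.

```python
import math

# The nine orientations listed in equation (11): -pi/2, -3pi/8, ..., pi/2.
ANGLES = [k * math.pi / 8 for k in range(-4, 5)]

def mirror(chi):
    """Equation (7): negate both orientations (y-axis symmetry)."""
    tau, mu, sux, suy, svx, svy, thu, thv, h = chi
    return (tau, mu, sux, suy, svx, svy, -thu, -thv, h)

def swap(chi):
    """Equation (8): exchange the roles of components u and v."""
    tau, mu, sux, suy, svx, svy, thu, thv, h = chi
    return (1 - tau, mu, svx, svy, sux, suy, thv, thu, h)

def isotropic_replicas(chi):
    """Equation (11): an isotropic component may take any listed orientation."""
    tau, mu, sux, suy, svx, svy, thu, thv, h = chi
    out = []
    if sux == suy:  # component u is isotropic, case (9)
        out += [(tau, mu, sux, suy, svx, svy, a, thv, h) for a in ANGLES]
    if svx == svy:  # component v is isotropic, case (10)
        out += [(tau, mu, sux, suy, svx, svy, thu, a, h) for a in ANGLES]
    return out

def canonical(chi, ndigits=12):
    """Round the eight parameters so float noise does not hide duplicates
    (the rounding precision is an assumption of this sketch)."""
    return tuple(round(x, ndigits) for x in chi[:8]) + (chi[8],)

def augment(aligned):
    """Equations (12)-(13): build all replicas, then keep unique rows."""
    unique = set()
    for chi in aligned:
        rows = [chi, mirror(chi), swap(chi), swap(mirror(chi))]
        rows += isotropic_replicas(chi)
        unique.update(canonical(r) for r in rows)
    return unique
```

Under these assumptions, a datum with two anisotropic components and non-zero orientations yields four unique rows, whereas a datum with one isotropic component yields twelve, because one of its nine orientation replicas coincides with the original.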

Training merging models

Finally, training on $X_{uv}^{uni}$ , we can obtain the ClustML merger $G_{ClustML}^{*}$ optimal at predicting the labels $H_{i}$ from the input $ϕ_{i} \in Φ_{uv}^{uni}$ , and use it to predict the label $\hat{H}$ of the current input $ϕ_{uv}^{* align}$ (4):

\hat{H} = G_{ClustML}^{*} (ϕ_{uv}^{* align}) \in {0, 1}

(14)

$\hat{H}$ estimates the unobserved aggregated judgments humans would make for the scatterplot $SP (X ~ M (ϕ_{uv}^{*}))$ .

The training process uses a standard approach in data-driven estimation of parameters of supervised classifiers (see details in Experiment 1).

Experiments

ClustML and ClustMe are both GMM-based VQMs. We first demonstrate our claim that the merging decision of ClustML, being trained on perceptual data, is better than the one from ClustMe based on heuristics. ClustMe merging decision Demp was the best over six other merging heuristics assessed on the benchmark dataset S1.²⁴ We use the same dataset S1 to get the optimal merging decision $G_{ClustML}^{*}$ for ClustML, and we show that this merging decision is better than $Demp$ on S1, hence, also better than the six other merging heuristics.

Second, ClustMe VQM has already been proven more accurate than competitors at ranking scatterplots based on cluster patterns on the benchmark dataset S2.²⁴ We use the same benchmark S2 to show that ClustML VQM improves accuracy over ClustMe VQM and, hence, over previous competitors.

Finally, we propose a usage scenario of ClustML with real genomic data.

Experiment 1: training the ClustML merger on perceptual data

To get the ClustML classifier for merging, we first align and augment the data and human judgments from the available dataset; then we present the classification techniques and protocols; and finally, we select the best among the trained classifiers.

Human judgments data

The initial human judgment data from study S1²⁴ is summarized in Table 1. We ignore the $α$ parameter, which gave an additional random rotation to the whole scatterplot for each trial, and we ignore the number $N \in {100, 1000}$ of points generated in the scatterplot, as neither of these parameters appears in the GMM modeling the density of the scatterplot. The dataset $Φ_{uv}$ is a sample of 1000 of these scatterplots. We discovered that four of them are duplicates: four scatterplots were either generated twice with the same set of parameters but a different number of sampled points ( $N = 100$ and $N = 1000$ ), or turned out to have the same parameters after data alignment. These four duplicates were removed. We ended up with 996 unique scatterplots, each with 34 human judgments that we summarized by majority vote. The final alignment and augmentation processes (equation (13)) led to 16,181 scatterplots in the set $X_{uv}^{uni}$ forming the Benchmark dataset 1.

Classifiers and training protocol

The 996 initial scatterplots, although chosen to cover the space of parameters uniformly, were assigned unequally to the two classes by the 34 subjects of S1: 81.5% of the data ended up with a merging decision of $H^{[i]} = 1$ . This class imbalance requires a specific process for training classifiers to avoid bias favoring the majority class. Another issue is the relative scale of the parameters; for instance, the parameter $μ$ scales up to two orders of magnitude larger than $τ$ . Correlated features must also be dealt with. Both issues require pre-processing steps.

The 16,181 scatterplots $X_{uv}^{uni}$ were stratified by class, each subset being randomly split into 80% training and 20% testing to finally get 12,945 training and 3236 test points preserving the (imbalanced) class distribution. Notice that the 3236 test data points correspond to 709 of the 996 unique scatterplots while the 12,945 training data points correspond to 991 of them. Still, none is duplicated in the parameter space $S$ after augmentation, forming valid independent training and test sets for learning the automatic classifiers in that space.
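A class-stratified 80/20 split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `label_of` accessor, and the fixed seed are hypothetical, and the study itself used R rather than Python.

```python
import random

def stratified_split(rows, label_of, train_frac=0.8, seed=0):
    """Split rows into train/test sets while preserving class proportions."""
    rng = random.Random(seed)
    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    # Shuffle and cut each class separately, then pool the parts.
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        cut = round(train_frac * len(members))
        train += members[:cut]
        test += members[cut:]
    return train, test
```

With a toy set of 1000 rows of which 81.5% belong to class 1, the test set keeps the same 81.5% share, which is the property the protocol relies on.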

We used the R-package CARET⁴³ for training 12 classification techniques, trying 4 methods to deal with class imbalance, and 4 pre-processing methods for scaling and removing correlated features, all summarized in Table 2. This process resulted in 320 different classification models. We used 10-fold cross-validation on the training set, with 10 repetitions of the training with random initialization.

Table 2.

Methods used from the R-package CARET.^43,44

	Method	Description
Pre-processing	None	No pre-processing
	Center+Scale (C)	Zero mean and unit variance
	C+BoxCox (CB)	Box-Cox transformation
	C+PCA (CP)	Principal Component Analysis
	C+B+P (CBP)
	CBP+spatialSign (CBPS)	Dividing by norm (unit sphere)
Balancing	None	No balancing
	upSample	Rand. replica of minor. class
	downSample	Rand. sampling of major. class
	ROSE	Rand. over-sampling⁴⁵
	Smote	Synth. minor. class NN⁴⁶
Classification technique	nb	Naive Bayes
	knn	k-Nearest Neighbors
	rf	Random Forest
	treebag	Bagged Classif. and Reg. Tree
	blackBoost	Boosted Reg. Tree
	gbm	Gene. Boosted Reg. Model
	xgbTree	Extreme Gradient Boosting Tree
	earth	Multivar. Adap. Reg. Spline
	svmRadial	Radial Kernel Sup. Vec. Mach.
	mlpWeightDecay	Multi-Layer Perceptron
	glm	Generalized Linear Model
	glmnet	GLM penal. max. lik.

Note xgbTree and gbm only used None and C pre-processing.
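The count of 320 models is consistent with a full grid over the rows of Table 2, counting the "None" options and applying the restriction on xgbTree and gbm. The sketch below enumerates that grid; treating the search as a plain Cartesian product is an assumption made here for illustration.

```python
# Option labels copied from Table 2.
PREPROCESSING = ["None", "C", "CB", "CP", "CBP", "CBPS"]
BALANCING = ["None", "upSample", "downSample", "ROSE", "Smote"]
CLASSIFIERS = ["nb", "knn", "rf", "treebag", "blackBoost", "gbm", "xgbTree",
               "earth", "svmRadial", "mlpWeightDecay", "glm", "glmnet"]

# xgbTree and gbm only used the None and C pre-processing options.
RESTRICTED = {"gbm": ["None", "C"], "xgbTree": ["None", "C"]}

configs = [(clf, bal, pre)
           for clf in CLASSIFIERS
           for bal in BALANCING
           for pre in RESTRICTED.get(clf, PREPROCESSING)]

n_models = len(configs)  # 10 * 5 * 6 + 2 * 5 * 2 = 320
```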

To evaluate and select the best classifier on the test data, we computed the Matthews Correlation Coefficient (MCC), which is regarded as immune to large class imbalance.^47,48
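For reference, the MCC of a binary classifier can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A zero denominator means one row or column of the matrix is empty.
    return num / den if den else 0.0

# A classifier that always predicts the majority class on 1000 items,
# 815 of which are positive, is 81.5% accurate but has no skill.
majority_accuracy = 815 / 1000
majority_mcc = mcc(tp=815, fp=185, fn=0, tn=0)
```

This illustrates why accuracy alone would be misleading with 81.5% of the data in one class: the trivial majority classifier scores an MCC of 0 despite its high accuracy.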

ClustML merger is better than Demp

Table 3 lists the best setting for each classification technique. The overall best combination to realize the ClustML merging function $G_{ClustML}$ is a bagged Classification and Regression Tree (CART) model (treebag) with up-sampling of the minority class (upSample) and running all pre-processing methods (Center+Scale+BoxCox+PCA+spatialSign). In study S1,²⁴ $Demp$ is the best among seven merging heuristics and is used to form $ClustMe$ . Following that study, we use Vanbelle’s kappa $κ_{v}$ agreement index³⁷ to compare both merging techniques with the 34 human judgments.

Table 3.

The best of each classification technique based on the Matthews correlation coefficient (MCC) is given together with its specific class balancing and pre-processing components (see Table 2).

Classification tech.	Balancing	Pre-Processing	MCC
treebag	upSample	CBPS	0.970
rf	None	None	0.959
gbm	upSample	None	0.953
mlp	None	CBPS	0.893
knn	None	CB	0.888
earth	None	CBPS	0.876
blackBoost	Smote	C	0.875
xgbTree	None	None	0.868
svmRadial	None	CBPS	0.866
glmnet	None	CBPS	0.832
glm	None	None	0.831
nb	None	CB	0.828

Treebag with upsampling and all pre-processing options is the most accurate on the 3236 test data of Experiment 1.

Vanbelle’s kappa $κ_{v}$ considers both the agreement between the group of human raters and the VQM and the within-group inter-rater agreements. The $κ_{v}$ values are interpreted using a standard scale⁴⁹: $< 0$ poor, $] 0, 0.2]$ slight, $] 0.2, 0.4]$ fair, $] 0.4, 0.6]$ moderate, $] 0.6, 0.8]$ substantial, and $] 0.8, 1]$ almost perfect agreement. We run 10,000 evaluations on bootstrap samples⁵⁰ of the test data to estimate the average score the two mergers would have obtained when varying the distributions of scatterplot parameters and to better quantify their difference.

There are two ways to compare the $G_{ClustML}$ and $Demp$ mergers. In case 1, we compute $κ_{v}$ on the 3236 augmented test data, which are unique for $G_{ClustML}$ but duplicates of some of the $709$ $Demp$ merging decisions, biasing the comparison toward the duplicate cases (Figure 6(a)). In case 2, we compute $κ_{v}$ on the 709 scatterplots from the test set, which is fair for $Demp$ , but forces us to summarize the $G_{ClustML}$ predictions by a majority vote over the duplicated data (Figure 6(b)).

Figure 6.

ClustML merger $(G_{ClustML})$ is noticeably better than ClustMe merger ( $Demp$ ) based on Vanbelle Kappa score on the 3236 augmented data of the test set in Experiment 1 (a) and on the 709 test scatterplots with class computed by majority vote of $G_{ClustML}$ predictions on augmented data (b). ClustML VQM is noticeably better than ClustMe VQM at ranking the 435 pairs of scatterplots in Experiment 2 (c). However, as expected, scores are lower than in Experiment 1 as these scatterplots display more complex patterns and involve the full VQM pipeline. All box plots are based on 10000 bootstrap samples.

In case 1 favoring $G_{ClustML}$ , it gets $κ_{v} = 0.986$ , 16% greater than $Demp$ ’s $κ_{v} = 0.848$ , both being in Almost perfect agreement with human judgments. In case 2, favoring $Demp$ , $G_{ClustML}$ gets $κ_{v} = 0.962$ (Almost perfect agreement), a 22% improvement over $Demp$ ’s $κ_{v} = 0.786$ (Substantial agreement; consistent with the state-of-the-art score $κ_{v} = 0.788$ computed over the full $1000$ dataset²⁴). The ClustML merger $(G_{ClustML})$ is better than the ClustMe merger $(Demp)$ by a large margin, with more than 15% accuracy improvement in both cases.

Experiment 2: ClustML is better at ranking scatterplots

In this experiment, we compare ClustML and ClustMe. Both are GMM-based VQMs, as illustrated in Figure 2(a). We use them to rank pairs of scatterplot projections of real and synthetic multidimensional data from the dataset S2.²⁴ None of these scatterplots has been used in the training process of ClustML, nor in determining parameters of ClustMe. S2 is made of all 435 possible pairs of 30 monochrome scatterplots selected among the dataset composed of 257 scatterplots from an earlier study.²⁷ Thirty-one subjects judged each pair, ranking the two scatterplots by the perceived group structure complexity of the displayed point patterns on a 3-category scale: “<”, “=”, “>”.

We use the mclust R-package with $BIC$ model selection to fit the GMM. We run the ClustML and ClustMe merging functions on each pair of components identified by the GMM to get the respective VQM score for each of the 30 scatterplots. Finally, following the ClustMe study,²⁴ we use this VQM score to rank the scatterplots, and we compare the ranking with that of human judgments on all 435 pairs using Vanbelle’s kappa index.
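The step from per-scatterplot VQM scores to the 435 pairwise decisions can be sketched as below. The function name and the tolerance `tol` used to declare a tie are hypothetical; the study may define ties differently.

```python
from itertools import combinations

def pairwise_decisions(scores, tol=0.0):
    """Map VQM scores {id: score} to a '<', '=' or '>' decision per pair."""
    decisions = {}
    for a, b in combinations(sorted(scores), 2):
        diff = scores[a] - scores[b]
        if abs(diff) <= tol:
            decisions[(a, b)] = "="   # scores considered tied
        elif diff < 0:
            decisions[(a, b)] = "<"   # a shows a less complex pattern than b
        else:
            decisions[(a, b)] = ">"   # a shows a more complex pattern than b
    return decisions
```

For 30 scatterplots this yields 30 × 29 / 2 = 435 decisions, which can then be compared with the human judgments collected for each pair.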

ClustML gets $κ_{v} = 0.727$ , improving over ClustMe’s $κ_{v} = 0.671$ , the top score to date. Figure 6(c) shows the distribution of scores over 10,000 bootstrap samples of the 435 pairs of scatterplots for the two VQMs. ClustML is still noticeably better than ClustMe on this data. However, the score difference is lower than in the previous experiment with only 8% improvement, and both scores are within the Substantial agreement range.
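The relative improvements quoted in Experiments 1 and 2 follow directly from the reported $κ_{v}$ values, as the short check below illustrates. The helper name is introduced here for illustration only.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Experiment 1, case 1 and case 2 (merger comparison).
exp1_case1 = relative_gain(0.986, 0.848)
exp1_case2 = relative_gain(0.962, 0.786)
# Experiment 2 (ranking comparison).
exp2 = relative_gain(0.727, 0.671)
```

Rounded to the nearest percent, these give the 16%, 22%, and 8% improvements reported in the text.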

Qualitative comparison of ClustML and ClustMe

ClustMe and ClustML are used to rank the 257 scatterplots from an earlier study.²⁷ Their scores are compared in Figure 7. Scatterplots (SPs) at the bottom show details of the dots in the top view. Numbers indicate identifiers of the SPs in the dataset. The caption of the figure gives detailed observations. ClustML seems more sensitive to cluster sharpness, while ClustMe seems more sensitive to cluster numerosity.

Figure 7.

Top: Comparison of ClustML and ClustMe scores of 257 scatterplots from an earlier study.²⁷ Bottom: 16 selected scatterplots (SPs) from the top view with corresponding colors, numbers, and approximate locations. Both ClustMe and ClustML give equally low scores to SPs 238, 244, 169 with no strong cluster patterns and similarly high scores to SPs 221, 229, 56 with sharp and numerous cluster patterns. ClustML gives high scores to SPs 28, 36, 102, medium scores to SPs 72, 114, 130, and low scores to SPs 108, 216, while ClustMe gives them all a medium score. ClustML seems better than ClustMe at distinguishing sharp cluster patterns from slightly noisy and very noisy ones. On the other hand, ClustMe seems more sensitive to cluster numerosity, distinguishing low numerosity clusters in SPs 30 and 240 from medium numerosity in SPs 72, 114, 130, and high numerosity in SPs 229, 56, while ClustML gives all of them a medium-high score.

Interpretation of Vanbelle kappa with a worst-case analysis

To better understand the meaning of these ranking scores, we compute the Vanbelle index after randomly altering $k$ ClustML decisions, repeating this 10,000 times for each $k \in {1, \dots, 435}$ . By alteration, we mean changing any of the <, =, or > order relations to a different order relation from the same set. The resulting distribution of the Vanbelle Kappa for each value of $k$ is displayed in Figure 8. The ClustML score decreases in proportion to the number of alterations. It requires between $r_{\min} = 9$ $(2\%)$ and $r_{\max} = 49$ $(11\%)$ alterations, with $r = 20$ $(4.6\%)$ on average, to get down to the ClustMe score.
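The alteration procedure can be sketched as follows. This only illustrates how $k$ decisions are perturbed; the computation of the Vanbelle index on the altered decisions is omitted, and the function name and the choice of sampling positions without replacement are assumptions.

```python
import random

RELATIONS = ("<", "=", ">")

def alter(decisions, k, rng):
    """Return a copy of `decisions` with exactly k entries changed."""
    altered = list(decisions)
    # Pick k distinct positions, then replace each relation by a different one.
    for idx in rng.sample(range(len(altered)), k):
        others = [r for r in RELATIONS if r != altered[idx]]
        altered[idx] = rng.choice(others)
    return altered
```

Repeating `alter` many times for each $k$ and scoring every altered list against the human judgments gives one distribution of scores per value of $k$.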

Figure 8.

Distribution of Vanbelle kappa when altering $k$ values randomly chosen among the ClustML decisions ( $k \in {1, \dots, 435}$ ), 10,000 times for each $k$ , over the 435 pairs of 30 scatterplots in Experiment 2. The dark gray area shows one standard deviation above and below the average value (black line). Light gray extends between the minimum and maximum values of the 10,000 samples. This serves to evaluate how much ClustMe would worsen the ClustML ordering.

Altering a decision occurs whenever the order of two of the scatterplots is changed. For instance, on average, the difference between ClustML and ClustMe is equivalent to moving a single scatterplot down or up by 20 positions in the total ordering or changing the rank of more scatterplots by a total of 20 rank alterations.

Let us consider a realistic usage scenario where the user has a time budget so they can afford to explore only the top-K scatterplots in depth in search of new insights. Moving $n$ elements out of the top-K set $(K \geq n)$ requires at least $r = n^{2}$ rank permutations if we pick the bottom $n$ of that set. For instance, if $abcdef | ghijkl . . . z$ is an ordered set of $26$ scatterplots and the user has a time budget to explore only the top $K = 6$ ( $a$ to $f$ delimited by |), then moving $n = 2$ scatterplots out of the top-6, say $e$ and $f$ to get $abcd gh | ef ij . . . z$ , requires altering at least $r = 4$ pairwise rankings ( $g \leftrightarrow e$ , $g \leftrightarrow f$ , $h \leftrightarrow e$ , $h \leftrightarrow f$ ). Pushing any other set of $n$ items out of the top-K also requires at least $n^{2}$ rank permutations. Hence, in the worst case, given a ClustML ordering of the 30 scatterplots, ClustMe, in comparison, may down-rank between $n_{\min} = \sqrt{r_{\min}} = 3$ and $n_{\max} = \sqrt{r_{\max}} = 7$ scatterplots off the top- $K$ most potentially insightful ones. It is also possible that most or all the alterations created by ClustMe occur outside of the top-K, so they would not impact the time-budgeted insight gathering, ClustMe and ClustML having identical top-K sets.

We can extrapolate this observation to any dataset size. For a dataset with $N$ scatterplots, a simple calculation shows that if $p$ is the percentage of ranking alterations over the $N (N - 1) / 2$ pairs, with $p \leq 50$ , then the number of ranking alterations is $r = N (N - 1) p / 200$ . Moreover, in the worst case, the percentage of down-graded scatterplots $q$ is $100 \sqrt{r} / N$ . Hence, $q = 10 \sqrt{(N - 1) p / (2 N)} \approx \sqrt{50 p}$ for large $N$ .

ClustMe alters in the worst case about 11% of all pairs of the $N = 30$ scatterplots $(p = 100 \times r_{\max} / (N (N - 1) / 2) = 100 \times 49 / 435 \approx 11)$ . Therefore, it could downgrade up to 23.5% of the scatterplots ordered by ClustML in a worst-case scenario.
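The worst-case figures above can be reproduced numerically. The sketch below implements the two formulas of the previous paragraph and their large-$N$ approximation; the function names are introduced here for illustration.

```python
import math

def alterations(n_plots, p):
    """Number r of altered pairs when p percent of the N(N-1)/2 pairs change."""
    return n_plots * (n_plots - 1) * p / 200

def worst_case_downgraded_pct(n_plots, p):
    """Worst-case percentage q of scatterplots pushed out of the top-K."""
    return 100 * math.sqrt(alterations(n_plots, p)) / n_plots

def large_n_limit(p):
    """Approximation q = sqrt(50 p), valid for large N."""
    return math.sqrt(50 * p)

p_max = 100 * 49 / 435                           # about 11.3 percent of pairs
r_max = alterations(30, p_max)                   # 49 alterations, up to rounding
q_exact = worst_case_downgraded_pct(30, p_max)   # 7 of 30 plots, about 23.3
q_approx = large_n_limit(11)                     # about 23.5
```

For $N = 30$ the exact worst case is 7 of 30 scatterplots (about 23.3%), and the large-$N$ approximation with $p = 11$ gives the 23.5% quoted in the text.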

Usage scenario with genomic data

To illustrate ClustML’s potential utility to real-world analyses, we provide a usage scenario with genomic data. In many domains of micro-biology, analysts rely on data visualization to spot interesting patterns that deserve further detailed analysis. Automatic clustering of single-cell data is known to be challenging.⁵¹ As such, biologists often resort to dimensionality reduction and visualizing scatterplots to decide about clusters of cells and their features.⁵² Alternatively, scatterplot matrices (SPLOMs) are used, for instance, to visually identify interesting groups of cells in scatterplots determined by pairs of eigengenes (axes), each eigengene coding a group of coexpressed genes.⁵³ In genome-wide association studies, analysts project the genetic data into principal components space for visual inspection.^54,55 In all these situations, the numerous projection methods and their parameters lead to possibly hundreds of scatterplots representing different facets of the same multidimensional data, similar to the type of data used in study S2.

In this usage scenario, we consider the data from the 1000 Genomes Project phase 3 dataset⁵⁶ composed of genetic data of 26 populations of about 100 individuals each. We measure kinship between individuals of each population separately, computing identity-by-descent.⁵⁵ We project these data using Multidimensional Scaling into 30 dimensions and compute ClustML on each possible pair of principal components for each of the 26 populations separately. The top view in Figure 9 shows the 11,310 SPs in the space of the ClustML score and the proportion of variance explained. The bottom view shows several SPs found exploring the highest ClustML scores in search of complex patterns that could relate to subgroups of individuals in each population.
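The number of scatterplots follows from the number of populations and of component pairs, as this short check shows. The variable names are introduced here for illustration.

```python
from math import comb

n_populations = 26
n_components = 30

# One scatterplot per unordered pair of components, per population.
pairs_per_population = comb(n_components, 2)
n_scatterplots = n_populations * pairs_per_population
```

Each population contributes 435 pairs of components, giving 26 × 435 = 11,310 scatterplots in total.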

Figure 9.

Top: Distribution of 11,310 scatterplots from all pairs of top 30 principal components (PCs) of 1000 Genomes Project kinship data in the space of ClustML score and percentage of variance explained. Dots are color-coded by the axis with the most variance in the scatterplots, showing the ones directed mainly by the first (orange), second (red), third (purple), or fourth (blue) PC (black otherwise). Solid orange dots with thick edges are SPs directed by the first and second PCs. Analysts typically limit their exploration to SPs explaining most of the variance, at the top of the summary scatterplot. Those scatterplots mostly involve top PCs only. Bottom: we show top-level SPs directed by PC1-PC2 of LWK, MSL, PJL, and STU populations (left side of each pair), and lower-level SPs directed by PC5-PC17 for LWK, PC8-PC13 for MSL, PC17-PC18 for PJL, and PC9-PC22 for STU (right side of each pair). The lower-level SPs have about the same or even a higher ClustML score than the PC1-PC2 SPs of the same population. ClustML allows the analyst to detect a cluster pattern in each population (manually lassoed blue dots, right side), which would have been missed exploring only the PC1-PC2 SP of that same population (same blue dots, left side) because the same data points do not form a cluster pattern therein. Notice that pairs of SPs are displayed at the same scale, showing that cluster patterns on their right side are of similar importance to those on their left side in terms of within and between variance.

Analysts typically rely on exploring SPs spanning pairs of the top-most principal components only (orange dots with a thick edge), possibly missing essential patterns, as pointed out in a recent work.¹¹ Thanks to ClustML, we can discover SPs spanning lower order components (e.g. down to the 17th PC for the PJL population) containing cluster patterns of potential interest to the analyst, which cannot be detected in the SPs directed by the top two principal components (see Figure 9, bottom). ClustML can guide the analyst, avoiding a very costly exhaustive exploration of the 11,310 SPs.

Discussion and future work

We proposed a new data-driven, GMM-based VQM for cluster patterns. ClustML’s main novelty is to use a merging component fully trained on human judgment data.

Options for improving ClustML

The ClustML merging component uses a majority vote to transform the collective judgments of 34 participants into a binary value, losing the richness of the human judgments, which are more akin to a probability value. The GMM also restricts the type of cluster patterns that can be quantified to mixtures of Gaussians, while other types of distributions and mixtures could be explored. Lastly, GMM-based VQMs act at the geometric encoding stage of the visualization pipeline, ignoring aesthetic aspects of the scatterplot such as the color, opacity, size, and shape of the marks, which can also impact the perception of cluster patterns.^57,58 All these aspects leave room for further study and improvements.

Toward hybrid computational-perceptual models of cluster patterns

It is typically challenging to learn a model for usually unsupervised tasks such as cluster pattern quantification: there is a lack of representative, human-annotated scatterplot data to train supervised models, due to the extreme variation of cluster patterns,^2,27 and a lack of a relevant representation space common to all these data. A related approach uses a deep network model²⁸ trained on scatterplot images to model the human-perceived similarity between patterns in monochrome scatterplots; working with the image pixels as a common representation space is an ecologically valid option but still requires collecting enough human-annotated data to cover the vast number of possible patterns in visualization images. Other heuristic-based techniques use a binning process to reduce the dimension of the image space in which to look for visual patterns.^57,58 In contrast, in this work, we transformed a typically unsupervised cluster pattern quantification problem into a supervised one, observing that the GMM (Stage 1) acts as a representation model, embedding the underlying points $X$ of a scatterplot $SP (X)$ into the GMM’s parameter space. By considering only pairs of GMM components, this representation space additionally gets a fixed and reduced dimensionality, not only enabling the use of standard supervised classifiers but also drastically limiting the variety of cluster patterns to be learned (two-Gaussian-based distributions only), and hence the amount of data to be collected. Finally, it happened that the scatterplot stimuli of the S1 dataset were also generated by sampling such a space; hence, they could be used to train such classifiers. This option was not technically straightforward, as demonstrated by the data pre-processing, cleaning, and augmentation steps required to train ClustML’s merging function.
Overall, our work opens the door to introducing human-perceptual judgment data in originally unsupervised models, developing new hybrid computational-perceptual models for pattern recognition in visualization and pursuing pioneering work in that area.^4,32,33

The development of such hybrid models raises the question of how to collect a sufficient amount and quality of perceptual data in the first place. Pattern recognition models have long been studied and trained on natural images annotated by experts or crowdsourcing.⁵⁹ But only a few studies use perceptual-data-driven approaches for pattern recognition in visualization images.^28,30,31 We advocate for driving new research in that area to develop data-driven, perception-based VQMs for clusters and other visual patterns in scatterplots, parallel coordinate plots, and other visualization idioms.^20,60

Beyond user study evaluations

As stated in previous work,^61,62 new algorithms like ClustML should typically be evaluated for accuracy and computing resources. But VQM algorithms are designed to support humans by replacing them in repetitive perceptual tasks.¹⁷ Thus, accuracy is measured by comparing VQM scores to human judgments on the same visual stimuli. Hence, the design of new VQMs naturally relies on collecting perceptual judgment data from quantitative user studies. However, when the same judgment data can be re-used for comparing different VQM algorithms because they target the same perceptual task, it is unnecessary to run a new user study for each new VQM variant. Re-using study data was first achieved successfully for the design and evaluation of data-driven VQMs for class separation in scatterplots,^30,31 with data from an earlier project.³ The present paper is a renewed demonstration of that approach, comparing ClustML with ClustMe on the previously collected S1 and S2 study data, relieving us of the need to run another user study to evaluate ClustML.

The use of benchmark data for algorithmic assessment is standard in computer science⁶¹ and benefits the replicability, fairness, and objectivity of the comparison while scaling up the design process of new techniques.⁶³ Benchmark data also enables the data-driven design of new models using machine-learning techniques. Benchmarking in visualization is not new for comparing algorithmic approaches,⁶⁴ but it is pretty novel when considering human judgment data. Once a benchmark of judgment data is set, it avoids investing unnecessary expert resources to design user studies and collect similar data, and it prevents the additional risk of failure in doing so. By being able to re-use previously collected judgment data S1 and S2, our work demonstrates that it is possible to generate such benchmark data once and use them multiple times for the evaluation and the design of new VQMs. Hence, we advocate for including in the design process of quantitative user studies a reflection on the possibility to re-use the collected data beyond evaluation, to enable generating and training new models. How to develop such benchmark judgment data to facilitate their re-use in visualization design is a challenging research topic worthy of investigation.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Michael Sedlmair is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 251654672 – TRR 161.

ORCID iD

Michael Aupetit

References

Cavallo

Demiralp

. Clustrophile 2: Guided visual clustering analysis. IEEE Trans Vis Comput Graph 2018; 25(1): 267–276.

Pandey

Krause

Felix

, et al. Towards understanding human similarity perception in the analysis of large sets of scatter plots. In: Proc. ACM Conf. on human factors in computing systems (CHI), 2016, pp. 3659–3669. DOI: 10.1145/2858036.2858155.

Sedlmair

Munzner

Tory

. Empirical guidance on scatterplot and dimension reduction technique choices. IEEE Trans Vis Comput Graph 2013; 19(12): 2634–2643.

Aupetit

Sedlmair

Abbas

, et al. Toward perception-based evaluation of clustering techniques for visual analytics. In: 2019 IEEE visualization conference, VIS 2019, Vancouver, BC, Canada, 20–25 October 2019, pp.141–145. New York, NY: IEEE. DOI: 10.1109/VISUAL.2019. 8933620.

Brehmer

Sedlmair

Ingram

, et al. Visualizing dimensionally-reduced data: interviews with analysts and a characterization of task sequences. In: Proc. Workshop on beyond time and errors on novel evaluation methods for visualization (BELIV), 2014, pp.1–8. ACM.

Wenskovitch

Crandell

Ramakrishnan

, et al. Towards a systematic combination of dimension reduction and clustering in visual analytics. IEEE Trans Vis Comput Graph 2018; 24(1): 131–141.

Nonato

Aupetit

. Multidimensional projection for visual analytics: linking techniques with distortions, tasks, and layout enrichment. IEEE Trans Vis Comput Graph 2019; 25: 2650–2673.

Hoffman

Grinstein

Pinkney

. Dimensional anchors: A graphic primitive for multidimensional multivariate information visualizations. In: Proc of the NPIV 99, 1999, pp.9–16. ACM Press.

Rubio-Sánchez

Raya

Díaz

, et al. A comparative study between radviz and star coordinates. IEEE Trans Vis Comput Graph 2016; 22(1): 619–628.

10.

Tatu

Maass

Farber

, et al. Subspace search and visualization to make sense of alternative clusterings in high-dimensional data. In: Proc. IEEE symp. on visual analytics science & technology, 2012, pp.63–72. New York, NY: IEEE. DOI: 10.1109/VAST.2012.6400488.

11.

Elhaik

. Principal component analyses (pca)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep 2022; 12(1): 14683.

12.

Jeong

Ziemkiewicz

Fisher

, et al. iPCA: an interactive system for pca-based visual analytics. In: Proceedings of the 11th Eurographics/IEEE -VGTC conference on visualization, EuroVis09, Chichester, 2009, p.767–774, GBR: The Eurographs Association & John Wiley & Sons, Ltd. DOI: 10.1111/j.1467-8659.2009.01475.x.

13.

Bruneau

Pinheiro

Broeksema

, et al. Cluster sculptor, an interactive visual clustering system. Neurocomputing 2015; 150(Part B): 627–644.

14.

Wang

Mueller

. The subspace voyager: Exploring high-dimensional data along a continuum of salient 3d subspaces. IEEE Trans Vis Comput Graph 2018; 24(2): 1204–1222.

15.

Friedman

Tukey

. A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 1974; C-23(9): 881–890.

16.

Buja

Cook

Asimov

, et al. Theory of dynamic projections in high-dimensional data visualization, http://stat.wharton.upenn.edu/∼buja/PAPERS/paper-dyn-proj-math.pdf (2004, accessed 9 January 2024)

17.

Bertini

Santucci

. Visual quality metrics. In: Proc. workshop on beyond time and errors on novel evaluation methods for visualization (BELIV), 2006, pp.1–5. ACM. DOI: 10.1145/1168149.1168159.

18.

Wilkinson

Anand

Grossman

. Graph-theoretic scagnostics. In: Proc. IEEE Information Visualization Symp. (INFOVIS) (eds Stasko

Ward

), 2005, p.21. New York, NY: IEEE Computer Society. DOI: 10.1109/INFOVIS.2005.14.

19.

Matute

Telea

Linsen

. Skeleton-based scagnostics. IEEE Trans Vis Comput Graph 2018; 24(1): 542–552.

20.

Bertini

Tatu

Keim

. Quality metrics in high-dimensional data visualization: an overview and systematization. IEEE Trans Vis Comput Graph 2011; 17(12): 2203–2212.

21.

Dang

Wilkinson

. Scagexplorer: Exploring scatterplots by their scagnostics. In Proc. IEEE Paciﬁc visualization symp. (PaciﬁcVis), 2014, pp.73–80. New York, NY: IEEE. DOI: 10.1109/ PaciﬁcVis.2014.42.

22.

Tukey

. Computer graphics and exploratory data analysis: an introduction. In Proc. the sixth annual conference and exposition: computer graphics, Vol. III, technical sessions, 1985, pp.773–785. Fairfax, VA: National Computer Graphics Association.

23.

Johansson

. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Trans Vis Comput Graph 2009; 15(6): 993–1000.

24.

Abbas

Aupetit

Sedlmair

, et al. Clustme: A visual quality measure for ranking monochrome scatterplots based on cluster patterns. Comput Graph Forum 2019; 38: 225–236.

25.

Fraley

Raftery

. Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 2002; 97: 611–631.

26.

Mostafa

Abbas

. Datasets and code for ClustMe and ClustML visual quality measures of grouping patterns in monochrome scatterplots. Zenodo data repository 2023; DOI: 10.5281/zenodo.10208143.

27.

Sedlmair

Tatu

Munzner

, et al. A taxonomy of visual cluster separation factors. Comput Graph Forum 2012; 31(3pt4): 1335–1344.

28.

Tung

AKH

Wang

, et al. Scatternet: a deep subjective similarity model for visual analysis of scatterplots. IEEE Trans Vis Comput Graph 2020; 26: 1562–1576.

29.

Albuquerque

Eisemann

Magnor

. Perception-based visual quality measures. In Proc. IEEE symp. on visual analytics science & technology, 2011, pp.13–20. New York, NY: IEEE. DOI: 10.1109/VAST.2011.6102437.

30.

Sedlmair

Aupetit

. Data-driven evaluation of visual quality measures. Comput Graph Forum 2015; 34(3): 201–210.

31.

Aupetit

Sedlmair

. Sepme: 2002 new visual separation measures. In: Proc IEEE Paciﬁc visualization symp (PaciﬁcVis), 2016, pp.1–8. New York, NY: IEEE. DOI: 10.1109/PACIFICVIS. 2016.7465244.

32. Wang, Feng, Chu, et al. A perception-driven approach to supervised dimensionality reduction for visualization. IEEE Trans Vis Comput Graph 2018; 24(5): 1828–1840.

33. Wang, Chen, et al. Optimizing color assignment for perception of class separability in multiclass scatterplots. IEEE Trans Vis Comput Graph 2018; 25(1): 820–829.

34. Pelleg, Moore. X-means: extending k-means with efficient estimation of the number of clusters. In: Proc. int. conf. on machine learning (ICML), 2000, pp.727–734. Morgan Kaufmann.

35. Ester, Kriegel, Sander, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD’96: Proceedings of the second international conference on knowledge discovery and data mining, 1996, pp.226–231. AAAI Press.

36. Agrawal, Gehrke, Gunopulos, et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM int. conf. on management of data (SIGMOD), 1998, pp.94–105. ACM Press.

37. Vanbelle, Albert. Agreement between an isolated rater and a group of raters. Stat Neerl 2009; 63(1): 82–100.

38. Bensmail, Celeux, Raftery, et al. Inference in model-based cluster analysis. Stat Comput 1997; 7(1): 1–10.

39. Hennig. Methods for merging Gaussian mixture components. Adv Data Anal Classif 2010; 4(1): 3–34.

40. Jeon, Quadri, Lee, et al. Clams: a cluster ambiguity measure for estimating perceptual variability in visual clustering. IEEE Trans Vis Comput Graph 2024; 30: 770–780.

41. Schwarz. Estimating the dimension of a model. Ann Stat 1978; 6: 461–464.

42. Shorten, Khoshgoftaar. A survey on image data augmentation for deep learning. J Big Data 2019; 6: 60.

43. Kuhn. caret: Classification and regression training. R package version 6.0-82. 2019, https://CRAN.R-project.org/package=caret

44. Kuhn. Building predictive models in R using the caret package. J Stat Softw 2008; 28(5): 1–26.

45. Menardi, Torelli. Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 2014; 28: 92–122.

46. Chawla, Bowyer, Hall, et al. Smote: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.

47. Bekkar, Djemaa, Alitouche. Evaluation measures for models assessment over imbalanced datasets. J Inf Eng Appl 2013; 3(10): 27–38.

48. Boughorbel, Jarray, El-Anbari. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS One 2017; 12(6): 1–17.

49. Landis, Koch. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.

50. Efron, Tibshirani. An introduction to the bootstrap. New York, NY: Chapman & Hall, 1993.

51. Kiselev, Andrews, Hemberg. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 2019; 20(5): 273–282.

52. Feng, Liu, Zhang, et al. Dimension reduction and clustering models for single-cell RNA sequencing data: a comparative study. Int J Mol Sci 2020; 21(6): 2181.

53. Han, Johnson, Zhang, et al. Functional virtual flow cytometry: a visual analytic approach for characterizing single-cell gene expression patterns. Biomed Res Int 2017; 2017: 9.

54. Aupetit, Ullah, Rawi, et al. A design study to identify inconsistencies in kinship information: The case of the 1000 genomes project. In: 2016 IEEE Pacific Visualization Symposium, PacificVis 2016 (eds Hansen, Viola, Yuan), Taipei, Taiwan, 19–22 April 2016, pp.254–258. New York, NY: IEEE Computer Society. DOI: 10.1109/PACIFICVIS.2016.7465281.

55. Ullah, Aupetit, Das, et al. Kinvis: a visualization tool to detect cryptic relatedness in genetic datasets. Bioinformatics 2019; 35(15): 2683–2685.

56. Auton, Brooks, Durbin, et al. A global reference for human genetic variation. Nature 2015; 526(7571): 68–74.

57. Quadri, Rosen. Modeling the influence of visual density on cluster perception in scatterplots using topology. IEEE Trans Vis Comput Graph 2021; 27(2): 1829–1839.

58. Quadri, Nieves, Wiernik, et al. Automatic scatterplot design optimization for clustering identification. IEEE Trans Vis Comput Graph 2023; 29(10): 4312–4327.

59. Irshad, Montaser-Kouhsari, Waltz, et al. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. Pac Symp Biocomput 2015; 294–305.

60. Dasgupta, Kosara. Pargnostics: Screen-space metrics for parallel coordinates. IEEE Trans Vis Comput Graph 2010; 16(6): 1017–1026.

61. Munzner. A nested model for visualization design and validation. IEEE Trans Vis Comput Graph 2009; 15(6): 921–928.

62. Meyer, Sedlmair, Munzner. The four-level nested model revisited: blocks and guidelines. In: Workshop on beyond time and errors: novel evaluation methods for visualization, 2012, p. 6. New York, NY, USA: Association for Computing Machinery.

63. Sim, Easterbrook, Holt. Using benchmarking to advance research: a challenge to software engineering. In: 25th International conference on software engineering, 2003. Proceedings, Portland, OR, 2003, pp.74–83. New York, NY: IEEE. DOI: 10.1109/ICSE.2003.1201189.

64. Espadoto, Martins, Kerren, et al. Toward a quantitative survey of dimension reduction techniques. IEEE Trans Vis Comput Graph 2021; 27(3): 2153–2173.