A Mapper Algorithm with Implicit Intervals and Its Optimization

Abstract

The Mapper algorithm is an essential tool for visualizing complex, high-dimensional data in topological data analysis and has been widely used in biomedical research. It outputs a combinatorial graph whose structure encodes the shape of the data. However, the need for manual parameter tuning, together with fixed intervals and fixed overlapping ratios, may impede the performance of the standard Mapper algorithm. Variants of the standard Mapper algorithm have been developed to address these limitations, yet most of them still require manual tuning of parameters. Additionally, many of these variants, including the standard version found in the literature, were built within a deterministic framework and overlooked the uncertainty inherent in the data. To relax these limitations, we introduce a novel framework that implicitly represents intervals through a hidden assignment matrix, enabling automatic parameter optimization via stochastic gradient descent (SGD). Specifically, we develop a soft Mapper framework based on a Gaussian mixture model for flexible and implicit interval construction. We further illustrate the robustness of the soft Mapper algorithm by introducing the Mapper graph mode as a point estimate of the output graph. Moreover, an SGD algorithm with a specific topological loss function is proposed for optimizing the model parameters. Both simulation and application studies demonstrate its effectiveness in capturing the underlying topological structures. In addition, the application to an RNA expression dataset obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank successfully identifies a distinct subgroup of Alzheimer’s Disease. The implementation of our method is available at https://github.com/FarmerTao/Implicit-interval-Mapper.git.

1. INTRODUCTION

The Mapper algorithm is a powerful tool in topological data analysis used to explore the shape of datasets. Initially introduced for 3D object recognition (Singh et al., 2007), the Mapper algorithm has demonstrated remarkable efficiency in extracting topological information across various data types. Its application spans various areas, notably in biomedical research (Skaf and Laubenbacher, 2022). For example, it successfully identified new breast cancer subgroups with superior survival rates (Nicolau et al., 2011). In single-cell RNA-Seq analysis, a Mapper-based algorithm was used to study unbiased temporal transcriptional regulation (Rizvi et al., 2017). The variants of Mapper algorithms were also used to uncover higher-order structures of complex phenomics data (Kamruzzaman et al., 2021). Moreover, many efforts have been made to integrate the Mapper algorithm with other machine learning techniques. For instance, Bodnar et al. (2021) enhanced the performance of a graph neural network by incorporating the Mapper algorithm as a pooling operation. When combined with autoencoders, the Mapper algorithm can be used as a robust classifier (Cyranka et al., 2019), remedying the shortcomings of traditional convolutional neural networks, such as susceptibility to gradient-based attacks. Recently, a visualization method called shapevis, which was inspired by the Mapper algorithm, was proposed to produce a more concise topological structure than the standard Mapper graph (Kumari et al., 2020). In addition to advances in the Mapper algorithms in applications, there have also been new advancements in theory. Carriere et al. (2018) analyzed the Mapper algorithm from a statistical perspective and introduced a novel method for parameter selection. Other studies have explored the algorithm’s convergence to Reeb graphs (Brown et al., 2021) and examined its topological properties (Dey et al., 2017).

The standard Mapper algorithm generates a graph that captures the shape of the data by dividing the support of the filtered data into several overlapping intervals with a fixed length (Singh et al., 2007). Two major limitations of the standard Mapper algorithm are the requirement of a fixed interval length and a fixed overlapping rate. Many algorithms have been proposed to address these limitations. For instance, F-Mapper (Bui et al., 2020) utilized the fuzzy clustering algorithm to allow flexible interval partitioning by introducing an additional threshold parameter to control the degree of overlap. However, the introduction of this parameter also increased the algorithm’s computational complexity. In a different vein, ensemble versions of Mapper were proposed by Kang and Lim (2021) and Fitzpatrick et al. (2023), which can recover the topology of datasets without parameter tuning. However, this approach has a significantly high computational cost due to the execution of multiple instances of the Mapper algorithm. By contrast, the Ball Mapper (Dłotko, 2019) directly constructed a cover on a given dataset, avoiding the difficulty of choosing filter functions. Nevertheless, its construction process is somewhat arbitrary and could inadvertently introduce extra parameters, increasing computational complexity. The D-Mapper, introduced by Tao and Ge (2025), developed a probabilistic framework to construct intervals of the projected data based on a mixture distribution. This approach offers flexibility in interval construction. Additionally, the soft Mapper, a smoother version of the standard Mapper algorithm, was proposed by Oulhaj et al. (2024). This algorithm focused on the optimization of filter functions by minimizing the topological loss.

In this study, our objective is to enhance the Mapper algorithm by incorporating a probability model that requires fewer parameter selections and enables flexible interval partitioning. Inspired by the D-Mapper (Tao and Ge, 2025), our approach also utilizes a mixture distribution, but it represents intervals implicitly through a hidden assignment matrix, eliminating the need for the interval threshold $α$. This framework optimizes the mixture distribution by integrating stochastic gradient descent (SGD), facilitating automatic parameter tuning. This research contributes three key innovations. First, we introduce a novel framework for constructing Mapper graphs based on mixture models, enabling flexible and implicit interval definitions. Second, we consider the uncertainty in a Mapper graph and introduce a mode-based estimation for the Mapper graph. Finally, we develop an SGD algorithm to optimize these intervals, resulting in an improved Mapper graph.

In addition, the simulation results show that our method is comparable to the standard Mapper algorithm and is more robust. Our model is also applied to an RNA expression dataset and successfully identifies a distinct subgroup of Alzheimer’s Disease (AD).

2. PRELIMINARY

2.1. Mapper algorithm

The Mapper algorithm has been widely used to output a graph that represents the topological structure of a given dataset (Dey et al., 2017); the output graph can be regarded as a discrete version of the Reeb graph. The core concept of the Mapper algorithm is as follows. First, it projects a given high-dimensional dataset into a lower-dimensional space. Then, it constructs a reasonable cover for the projected data, pulls back the original data points according to each interval, and clusters the points within each preimage (or pullback set). Finally, each cluster is treated as a node of the output graph, and an edge is added between two clusters if they share any data points.

Consider a dataset with n data points, $\{x_{1}, \dots, x_{n}\}$, $x_{i} \in R^{d}$, $i = 1, \dots, n$. Let $X_{n}$ be the support of this dataset. Denote by $f (\cdot) : X_{n} \to R^{m}$, $1 \leq m < d$, a single-valued filter function, which maps each data point from $X_{n}$ to a lower-dimensional space $R^{m}$. As in most of the literature (Bui et al., 2020; Kang and Lim, 2021; Fitzpatrick et al., 2023), here m is set to 1. We use a simple example to illustrate the Mapper algorithm. The example dataset has a cross shape, as shown in Figure 1a. The Mapper algorithm first projects the data onto the real line using a filter function, $f (x) = y, x \in R^{2}, y \in R$, as shown in the upper panel of Figure 1b. Here the filter function is set to $f (x) = \frac{1}{n} \sum_{i = 1}^{n} ‖ x - x_{i} ‖$, i.e., the mean distance from $x$ to all points $x_{i}$, $i = 1, \dots, n$. Set $a = \min \{f (x_{i}), i = 1, \dots, n\}$ and $b = \max \{f (x_{i}), i = 1, \dots, n\}$. Then divide $[a, b]$ into K equal-length intervals with overlapping ratio p between any two adjacent intervals, denoted as $I_{j} = [a_{j}, b_{j}]$, $j = 1, \dots, K$, as shown in the lower panel of Figure 1b. These intervals satisfy $[a, b] = \cup_{j = 1}^{K} I_{j}$ and $| b_{j} - a_{j + 1} | = p | b_{1} - a_{1} |$, $j = 1, \dots, K - 1$, where $a = a_{1} < a_{2} < b_{1} < \dots < a_{K} < b_{K - 1} < b_{K} = b$.
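The equal-length cover described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `uniform_cover` is hypothetical, and the closed form for the interval length is derived here from the two stated conditions (the K intervals cover $[a, b]$, and adjacent intervals overlap by a fraction p of the common length).

```python
def uniform_cover(a, b, K, p):
    """Return K equal-length intervals [a_j, b_j] that cover [a, b].

    Adjacent intervals overlap by p * length, so the common length solves
    K * length - (K - 1) * p * length = b - a.
    """
    if K < 1 or not 0.0 <= p < 1.0:
        raise ValueError("need K >= 1 and 0 <= p < 1")
    length = (b - a) / (K - (K - 1) * p)
    step = length * (1.0 - p)  # distance between consecutive left endpoints
    return [(a + j * step, a + j * step + length) for j in range(K)]


# Setting of Figure 1b (K = 6, p = 0.33), on a hypothetical range [0, 1].
intervals = uniform_cover(0.0, 1.0, K=6, p=0.33)
```

Because both K and p must be chosen by hand here, this construction makes concrete the tuning burden that the implicit-interval approach aims to remove.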

FIG. 1.

A demonstration of the Mapper algorithm applied to a dataset with a cross structure. (a) A visualization of the dataset. (b) The projected data and its overlapped intervals when $K = 6$ , $p = 0.33$ . (c) The output graph of the Mapper algorithm. The clustering algorithm implemented here is the density-based spatial clustering of applications with noise (DBSCAN) $(ϵ = 0.6, minPts = 5)$ (Schubert et al., 2017). The output Mapper graph presents a cross shape, which is consistent with the shape of the dataset.

Each interval $I_{j}$ is then pulled back to the original space through the inverse mapping $f^{- 1} (I_{j}), j = 1, \dots, K$. A clustering algorithm is applied to the original data points that fall into each $f^{- 1} (I_{j})$, partitioning the points into $r_{j}$ disjoint clusters, represented as $C_{j, i}$, $i = 1, \dots, r_{j}$, such that $\cup_{i = 1}^{r_{j}} C_{j, i} \subset f^{- 1} (I_{j})$, where $j = 1, \dots, K$. We then obtain the pullback cover, $C = \{C_{1, 1}, \dots, C_{1, r_{1}}, \dots, C_{K, 1}, \dots, C_{K, r_{K}}\}$. The Mapper graph is constructed from this cover, where each cluster in C corresponds to a node in the graph. An edge is added between any two nodes if their corresponding clusters intersect, i.e., if $C_{j, i} \cap C_{j^{'}, i^{'}} \neq \emptyset$. As shown in Figure 1c, the resulting Mapper graph accurately captures the shape of the data.
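The pullback, clustering, and edge-construction steps can be sketched as follows. To keep the example dependency-free, DBSCAN is replaced by a simple stand-in (connected components of the $ϵ$-neighborhood graph), so the clustering here is an assumption and not the method used in the paper; the names `eps_components` and `mapper_graph` are hypothetical.

```python
import math
from itertools import combinations


def eps_components(points, idx, eps):
    """Stand-in for DBSCAN: connected components of the eps-neighborhood graph."""
    idx = list(idx)
    parent = {i: i for i in idx}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(idx, 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    groups = {}
    for i in idx:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def mapper_graph(points, filter_values, intervals, eps):
    """Nodes are clusters of each pullback set; edges join clusters sharing points."""
    nodes = []
    for lo, hi in intervals:
        pullback = [i for i, y in enumerate(filter_values) if lo <= y <= hi]
        nodes.extend(eps_components(points, pullback, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


# Ten points on a line; the filter is the x-coordinate.
points = [(float(x), 0.0) for x in range(10)]
ys = [pt[0] for pt in points]
nodes, edges = mapper_graph(points, ys, [(0, 4), (3, 7), (6, 9)], eps=1.0)
```

On this toy input the three overlapping intervals yield three nodes joined in a path, which matches the line shape of the data.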

2.2. Soft mapper

Traditional Mapper graphs are constructed through intervals based on the filtered data. The soft Mapper constructs Mapper graphs with a hidden assignment matrix without requiring fixed intervals (Oulhaj et al., 2024). The hidden assignment matrix, an $n \times K$ matrix, describes the allocation of the n given data points to K groups (i.e., the K implicit intervals). This matrix is denoted as $H = {(H_{i j})}_{n \times K}$, in which $H_{i j} = 1$ if the i-th point belongs to the j-th group (implicit interval), and $H_{i j} = 0$ otherwise, for $i = 1, \dots, n, j = 1, \dots, K$. With a hidden assignment matrix H, a Mapper graph is constructed directly through the standard process of pulling back and clustering, which we refer to as a Mapper function, defined as follows.

Definition 1 (Mapper function). A Mapper function $ϕ : H \mapsto G$ , is a map from a hidden assignment matrix H to a Mapper graph G. The function is defined by pulling back and clustering operations in the standard Mapper algorithm.

When a hidden assignment matrix H is a random matrix, the soft Mapper can be viewed as a stochastic version of the Mapper, parametrized by a Mapper function $ϕ$ and a probability density function defined over a hidden assignment matrix H. The resulting random graph $G = ϕ (H)$ is a function of a hidden assignment matrix H. The simplest example of a soft Mapper is obtained by assigning a Bernoulli distribution to each element $H_{i j}$ of a hidden assignment matrix H.

Definition 2 (Soft Mapper with a Bernoulli distribution). Suppose $H_{i j}$ follows a Bernoulli distribution with parameter $Q_{i j}$, $H_{i j} \sim B (Q_{i j}), i = 1, \dots, n, j = 1, \dots, K ,$

where $Q = {(Q_{i j})}_{n \times K}$ is the probability matrix of the Bernoulli distributions. Each element of the hidden assignment matrix H is drawn independently from a Bernoulli distribution with probability of success $Q_{i j}$, $0 \leq Q_{i j} \leq 1$, for $i = 1, \dots, n, j = 1, \dots, K$.

With this definition, model inference reduces to estimating the probability matrix Q, which yields the distribution of Mapper graphs. However, explicitly estimating Q is challenging. In this work, we address this issue by making certain modifications to the soft Mapper approach.

3. METHODS

3.1. Gaussian mixture model soft Mapper

In the standard Mapper algorithm, a cover is constructed on the projected data; the cover consists of several intervals, and any two adjacent intervals overlap. These intervals can be constructed by partitioning the projected data into multiple groups. The data points are then pulled back by allocating each projected value to the intervals that contain it. This allocation of projected data points to the intervals of a cover is analogous to the allocation of labels in a mixture probability model. In this work, we focus on a Gaussian mixture model (GMM) to fit the projected data due to its simplicity and flexibility. By incorporating a GMM, we can naturally define a hidden assignment matrix H through soft clustering, and a probability matrix Q can be easily derived from the GMM parameters. We fit a GMM to the projected data $\{y_{1}, \dots, y_{n}\}$. As a result, $Q_{i j}$ represents the probability that a data point $y_{i}$ is assigned to the j-th class within the GMM framework.

Definition 3 (GMM soft Mapper). Assume that the projected data $\{y_{1}, \dots, y_{n}\}$ follow a Gaussian mixture distribution. We define the weights, means, and variances of the components as $π = \{π_{1}, \dots, π_{K}\}$, $μ = \{μ_{1}, \dots, μ_{K}\}$, $σ^{2} = \{σ_{1}^{2}, \dots, σ_{K}^{2}\}$, respectively. The set $θ = \{π, μ, σ^{2}\}$ denotes all model parameters. Conditional on $θ$, the distribution of each data point $y_{i}$ is given by: $y_{i} | θ \sim \sum_{k = 1}^{K} π_{k} N (y_{i} | μ_{k}, σ_{k}^{2}) .$

Then, for each point $y_{i}$, it is natural to set the probability of $y_{i}$ belonging to the j-th class as $Q_{i j} (y_{i}) = \frac{π_{j} N (y_{i} | μ_{j}, σ_{j}^{2})}{\sum_{k = 1}^{K} π_{k} N (y_{i} | μ_{k}, σ_{k}^{2})} ,$ where $i = 1, \dots, n, j = 1, \dots, K$.
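As a minimal sketch of how Q can be computed from the GMM parameters, the following uses only the Python standard library. The function names are hypothetical, and the sketch assumes no density underflows to zero (a production version would work in log space).

```python
import math


def normal_pdf(y, mu, var):
    """Density of N(mu, var) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def probability_matrix(ys, weights, means, variances):
    """Q[i][j] = pi_j N(y_i | mu_j, var_j) / sum_k pi_k N(y_i | mu_k, var_k)."""
    Q = []
    for y in ys:
        joint = [w * normal_pdf(y, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        Q.append([p / total for p in joint])
    return Q


# Two symmetric components; the point y = 0 lies exactly between them.
Q = probability_matrix([-1.0, 0.0, 1.0], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Each row of `Q` sums to one by construction, and the midpoint is split evenly between the two components.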

This GMM-based assignment scheme imposes an implicit constraint on each row of the probability matrix Q: the probabilities for each data point sum to one, $\sum_{j = 1}^{K} Q_{i j} = 1$. However, the independent Bernoulli assignment scheme for $H_{i j}$ in the soft Mapper may lead to extreme assignments, in which a point is not assigned to any cluster or is assigned to all clusters. For instance, if $K = 4$, a data point $y_{i}$ might have the assignment probability vector $Q_{i \cdot} = [0.3, 0.2, 0.4, 0.1]$. The independent Bernoulli scheme may then produce the assignment vector $H_{i \cdot} = [0, 0, 0, 0]$, which means that $y_{i}$ is not assigned to any cluster, or $H_{i \cdot} = [1, 1, 1, 1]$, which means that $y_{i}$ belongs to all clusters. The former case violates the Mapper requirement that each data point belongs to at least one cluster. The latter case conflicts with the fact that a data point in the Mapper algorithm typically falls into at most two adjacent intervals. Therefore, we propose a GMM-multinomial soft Mapper approach. Instead of sampling each element of H independently from Bernoulli distributions, we sample each row of the assignment matrix H from a multinomial distribution whose event probabilities are the corresponding row of the probability matrix Q. The multinomial distribution guarantees that each point is assigned to at least one cluster and to at most m clusters, as defined below.

Definition 4 (Soft Mapper with a multinomial distribution). Let $Q_{i \cdot} = (Q_{i 1}, \dots, Q_{i K})$ represent the i-th row of Q, and $H_{i \cdot} = (H_{i 1}, \dots, H_{i K})$ represent the i-th row of the hidden assignment matrix H, $i = 1, \dots, n$. Assume each row $H_{i \cdot}$ follows a multinomial distribution with total number of events $m = 2$ and event probability vector $Q_{i \cdot}$, $H_{i \cdot} \sim Multi (m, Q_{i \cdot}), i = 1, \dots, n .$

Here $H_{i j}$ may take values in $\{0, 1, 2\}$, and $H_{i j} \geq 1$ indicates that the i-th point is assigned to the j-th implicit interval (group); otherwise $H_{i j} = 0$. For ease of notation, we write $H \sim Multi (m, Q)$ to indicate that each row of H independently follows a multinomial distribution with event number parameter $m = 2$ and event probabilities given by the corresponding row of Q.

Our proposed soft Mapper samples each row $H_{i \cdot}$ independently from a multinomial distribution $Multi (2, Q_{i \cdot}), i = 1, \dots, n$. The total number of events m of the multinomial distribution is set to 2, ensuring that each data point is assigned to at least one group and to no more than two groups. This setting is reasonable because, in the standard Mapper algorithm, a data point typically falls within at least one and at most two intervals.
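The difference between the two sampling schemes can be illustrated with a small simulation. This is a sketch using the standard library's `random.choices`; the function names and the seed are hypothetical choices for illustration, and the probability vector is the example row $[0.3, 0.2, 0.4, 0.1]$ from the text.

```python
import random


def sample_row_multinomial(q_row, m=2, rng=random):
    """Draw one row of H from Multi(m, q_row); entries are counts summing to m."""
    counts = [0] * len(q_row)
    for j in rng.choices(range(len(q_row)), weights=q_row, k=m):
        counts[j] += 1
    return counts


def sample_row_bernoulli(q_row, rng=random):
    """Draw each entry independently from Bernoulli(q); rows may be all zeros."""
    return [1 if rng.random() < q else 0 for q in q_row]


rng = random.Random(0)  # fixed seed for reproducibility
q = [0.3, 0.2, 0.4, 0.1]
multi_rows = [sample_row_multinomial(q, rng=rng) for _ in range(1000)]
bern_rows = [sample_row_bernoulli(q, rng=rng) for _ in range(1000)]

# Number of groups each point is assigned to under each scheme.
multi_groups = [sum(1 for c in row if c >= 1) for row in multi_rows]
bern_groups = [sum(row) for row in bern_rows]
```

Under the multinomial scheme every sampled row assigns its point to one or two groups, whereas the Bernoulli scheme leaves a point unassigned with probability $0.7 \times 0.8 \times 0.6 \times 0.9 \approx 0.30$ for this row.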

Our approach differs from the soft Mapper described in Oulhaj et al. (2024) in two key ways. First, by using the GMM, our $Q_{i j}$ can be specified as a function of the parameters $θ$ . Thus, updating the GMM parameters $θ$ is equivalent to updating Q. In contrast, Q in Oulhaj et al. (2024) was determined by the parameters in the filter functions. Second, we use an independent multinomial distribution assignment scheme to sample each row of the hidden assignment matrix H. The multinomial distribution avoids extreme cases where a data point is unassigned to any clusters or assigned to too many clusters.

3.2. The Mapper graph mode

The soft Mapper is a probabilistic version of the Mapper algorithm, where a random hidden assignment matrix H can result in varying Mapper graphs. Although the topological structures of these soft Mapper graphs differ in detail, they often share some common features, as they are derived from the same underlying distribution. We aim to obtain a Mapper graph that encapsulates these common structures as its final representation. Since H is a discrete random matrix whose rows are independently distributed, we take the matrix whose i-th row is the mode of the distribution of the i-th row of H as the mode representation of H. Applying the Mapper function to this matrix gives a Mapper graph, which we call the Mapper graph mode. The mode of a multinomial distribution has an explicit formula, allowing us to compute the mode of the Mapper graph directly and efficiently.

Definition 5 (The mode of a soft Mapper with a multinomial distribution). For a soft Mapper with a multinomial distribution, each row follows an independent multinomial distribution. Given the i-th row of Q, we define $\begin{array}{l} Q_{i}^{*} = \max \{Q_{i 1}, \dots, Q_{i K}\}, \\ i^{*} = \underset{j \in \{1, \dots, K\}}{\arg \max} \, Q_{i j}, \\ Q_{i}^{* *} = \max \{Q_{i j} : j \neq i^{*}\}, \\ i^{* *} = \underset{j \in \{1, \dots, K\} \setminus \{i^{*}\}}{\arg \max} \, Q_{i j} . \end{array}$

Here, the $Q_{i}^{*}$ and $Q_{i}^{* *}$ are the largest and the second largest elements of i-th row of Q, respectively. $i^{*}$ and $i^{* *}$ are their corresponding indices. For $i = 1, \dots, n$ , the mode of the i-th row of the hidden assignment matrix is then determined as follows (see Supplementary Appendix A1 for the theoretical derivation).

If $\frac{1}{2} Q_{i}^{*} > Q_{i}^{* *}$, then $H_{mode} (i, j) = \begin{cases} 1 & \text{if } j = i^{*}, \\ 0 & \text{otherwise} . \end{cases}$ If $\frac{1}{2} Q_{i}^{*} \leq Q_{i}^{* *}$, then $H_{mode} (i, j) = \begin{cases} 1 & \text{if } j \in \{i^{*}, i^{* *}\}, \\ 0 & \text{otherwise} . \end{cases}$

The mode of the Mapper graph can be defined by a Mapper function and the mode of the hidden assignment matrix $H_{mode}$ , $G_{mode} = ϕ (H_{mode}) .$
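The rule for $H_{mode}$ translates directly into code. The sketch below is illustrative only (the function name `mode_row` is hypothetical) and assumes $K \geq 2$. The threshold follows from comparing the two candidate outcomes of $Multi (2, Q_{i \cdot})$: both events in group $i^{*}$ has probability $(Q_{i}^{*})^{2}$, while one event in each of $i^{*}$ and $i^{* *}$ has probability $2 Q_{i}^{*} Q_{i}^{* *}$.

```python
def mode_row(q_row):
    """Mode-based assignment for one row of Q under Multi(2, q_row).

    Assign the point to its most probable group, and also to the second most
    probable group when 0.5 * Q* <= Q**.
    """
    order = sorted(range(len(q_row)), key=lambda j: q_row[j], reverse=True)
    first, second = order[0], order[1]
    row = [0] * len(q_row)
    row[first] = 1
    if 0.5 * q_row[first] <= q_row[second]:
        row[second] = 1
    return row


H_mode = [mode_row(q) for q in ([0.3, 0.2, 0.4, 0.1], [0.8, 0.1, 0.05, 0.05])]
```

For the first row, $0.5 \times 0.4 \leq 0.3$, so the point is shared by groups 3 and 1; for the second row, $0.5 \times 0.8 > 0.1$, so the point belongs to group 1 only.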

The explicit form of $H_{mode}$ significantly reduces the computational burden of the algorithm. In the following sections, we will adopt the mode of the Mapper graph as the representation of the graph and present several Mapper graph samples to illustrate the inherent uncertainty of these graphs.

3.3. Loss function

The probability matrix Q is derived from a GMM fitted to the projected data. Therefore, optimizing the Mapper graph is equivalent to optimizing the parameter $θ$ of the GMM. However, the maximum likelihood estimation of parameter $θ$ only accounts for the distributions of the projected data, neglecting the topological information of the Mapper graph. To construct an appropriate Mapper graph, it is essential to consider both the likelihood of the projected data and the topology of data simultaneously. We design a loss function that incorporates both the likelihood of projected data and the topological information of a Mapper graph. The log-likelihood of the projected data is given by $\log L (Y_{n} | θ) = \sum_{i = 1}^{n} \log \sum_{k = 1}^{K} π_{k} N (y_{i} | μ_{k}, σ_{k}^{2}) .$

We adopt a formula similar to that of Oulhaj et al. (2024) to measure the topological information loss. The topological information of a Mapper graph can be encoded by the extended persistence diagram (Carriere and Oudot, 2018). A significant advantage of the extended persistence diagram is its ability to capture the branches of a Mapper graph, which are crucial in downstream data analysis. We denote the extended persistence diagram as D, where $D = \{(b_{i}, d_{i}) \mid i = 1, \dots, M\}$ is a multiset containing M points. We use a function $Pers : G \mapsto D$ to represent the process of computing the extended persistence diagram for a given Mapper graph.

To compute the extended persistence diagram, we need to define a filtration function for each node of a Mapper graph. Typically, the filtration function of a node on a Mapper graph is defined as the average of the filtered data within that node. However, to integrate the model parameters into the optimization framework, we introduce a novel node filtration function for computing the extended persistence diagram. For each node c on a Mapper graph, the filtration function on this node is $f_{M} (c) = \frac{\sum_{y_{i} \in c} \log L (y_{i} | θ)}{card (c)} .$

This function calculates the averaged log-likelihood of the data points assigned to node c, where the number of points at node c is denoted as $card (c)$ . The edge filtration is the maximum value of the corresponding paired nodes.
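A minimal sketch of the node and edge filtration values follows. It assumes that $\log L (y_{i} | θ)$ denotes the per-point mixture log-density $\log \sum_{k} π_{k} N (y_{i} | μ_{k}, σ_{k}^{2})$, consistent with the log-likelihood defined earlier; nodes are represented as sets of point indices, and all function names are hypothetical.

```python
import math


def point_log_likelihood(y, weights, means, variances):
    """log sum_k pi_k N(y | mu_k, var_k) for a single projected value y."""
    density = sum(
        w * math.exp(-(y - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
    return math.log(density)


def node_filtration(node, ys, weights, means, variances):
    """Average log-likelihood of the points (given by index) in a node."""
    total = sum(point_log_likelihood(ys[i], weights, means, variances)
                for i in node)
    return total / len(node)


def edge_filtration(f_u, f_v):
    """An edge enters the filtration at the larger value of its two end nodes."""
    return max(f_u, f_v)


# Single standard normal component, two overlapping nodes.
ys = [0.0, 1.0, 3.0]
theta = ([1.0], [0.0], [1.0])
f_a = node_filtration({0, 1}, ys, *theta)
f_b = node_filtration({1, 2}, ys, *theta)
f_edge = edge_filtration(f_a, f_b)
```

Because the filtration values depend on $θ$ through the log-density, changes in the GMM parameters move the points of the extended persistence diagram, which is what allows the topological term to enter the optimization.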

To represent the topological information on the extended persistence diagram, a function that maps the diagram to a real number is defined, denoted as $l : D \mapsto R$. This function is called the persistence-specific function, which measures the topological information captured in the extended persistence diagram. Various functions are available to represent the overall topological information, such as persistence landscapes (Bubenik et al., 2015) or the bottleneck distance from a target extended persistence diagram (Bauer et al., 2024). Not all points on a persistence diagram are meaningful; some points may be generated by noise in the dataset. Typically, points with short persistence are considered noise, whereas those with long persistence are regarded as meaningful signals. An effective diagram should primarily consist of signal points, reflecting the significant topological structures of the dataset. The Mapper graph mediates between the data and its persistent homology, so a good Mapper graph has not merely (relatively) small noise but in fact few features attributable to noise. To obtain a more robust Mapper graph, in this work we adopt the averaged persistence, which measures the average persistence of the points in the diagram, as the persistence-specific loss. Assuming that there are M features in the diagram of the Mapper graph mode, the topological loss is defined as: $l (D) = \frac{1}{M} \sum_{i = 1}^{M} | d_{i} - b_{i} | .$

This value offers a reasonable summary of the extended persistence diagram. The denominator M in this term is used to avoid encouraging spurious structures in the Mapper graph: because the total persistence is divided by the number of points, adding short-lived points lowers the average rather than raising it.
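The averaged persistence is straightforward to compute once a diagram is available. In the sketch below the diagrams are made-up lists of (birth, death) pairs, not output from the paper's experiments, and returning 0 for an empty diagram is an assumption of this sketch.

```python
def averaged_persistence(diagram):
    """l(D) = (1/M) * sum_i |d_i - b_i| for a diagram of (birth, death) pairs."""
    if not diagram:
        return 0.0  # assumption: an empty diagram carries no persistence
    return sum(abs(d - b) for b, d in diagram) / len(diagram)


signal_only = [(0.0, 1.0), (0.2, 1.0)]
with_noise = signal_only + [(0.5, 0.55), (0.7, 0.72)]

l_signal = averaged_persistence(signal_only)
l_noise = averaged_persistence(with_noise)
```

The two short-lived points reduce the average from 0.9 to about 0.47, which illustrates how dividing by M penalizes diagrams dominated by noise.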

The extended persistence diagram D is derived from a Mapper graph G, while G is generated by the Mapper function of the hidden assignment matrix H. For notational simplicity, we write $\begin{aligned} l (H) & = l \circ Pers \circ ϕ (H) \\ & = l \circ Pers (G) \\ & = l (D) . \end{aligned}$

Since the hidden assignment matrix H is random, the resulting $l (H)$ is also random. The conventional way to deal with the random loss function is to take the expectation $E (l (H))$ . However, this method is computationally expensive when using Monte Carlo sampling. Instead, we use the persistence-specific loss of the Mapper graph mode defined in Definition 5 to represent the overall topological loss concerning H, denoted as $l (H_{mode})$ . The Mapper graph mode can be computed directly without sampling, which significantly reduces the computational cost.

The total loss function of the GMM soft Mapper can be defined as a weighted average of the negative log-likelihood of the filtered data and the topological loss concerning H, $Loss (θ | X_{n}, Y_{n}) = - λ_{1} \frac{\log L (Y_{n} | θ)}{n} - λ_{2} l (H_{mode}) .$

Here, the first term is the negative averaged log-likelihood, representing the information carried by the filtered data. By averaging, we ensure that the log-likelihood is comparable across datasets of varying sample sizes. The parameters $λ_{1}$ and $λ_{2}$ control the relative weights of the likelihood of the projected data and the topological loss of the Mapper graph.
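The total loss can be sketched as below, taking the value of $l (H_{mode})$ as an input so that the example stays independent of any persistence library. The function names and the default weights are hypothetical, and the log-likelihood is evaluated with a log-sum-exp for numerical stability, which is an implementation choice of this sketch rather than a detail stated in the text.

```python
import math


def gmm_log_likelihood(ys, weights, means, variances):
    """sum_i log sum_k pi_k N(y_i | mu_k, var_k), computed via log-sum-exp."""
    total = 0.0
    for y in ys:
        terms = [
            math.log(w) - 0.5 * math.log(2.0 * math.pi * v)
            - (y - m) ** 2 / (2.0 * v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def total_loss(ys, weights, means, variances, topo_value, lam1=1.0, lam2=1.0):
    """Loss = -lam1 * log L / n - lam2 * l(H_mode)."""
    n = len(ys)
    log_lik = gmm_log_likelihood(ys, weights, means, variances)
    return -lam1 * log_lik / n - lam2 * topo_value


ys = [0.0]
theta = ([1.0], [0.0], [1.0])  # one standard normal component
loss_low_topo = total_loss(ys, *theta, topo_value=0.1)
loss_high_topo = total_loss(ys, *theta, topo_value=0.5)
```

With the likelihood held fixed, a larger averaged persistence gives a smaller loss, so minimizing the loss favors Mapper graph modes with more persistent features.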

3.4. SGD parameter estimation

In this study, we use the SGD algorithm (Bottou, 2010) to optimize the loss function defined above. Our goal is to find parameters $θ$ for a GMM that minimize the total loss function. The optimization process aims to improve the topological structure of a Mapper graph while ensuring a high likelihood of the filtered data. Once the optimal parameters are determined, we can either sample from the GMM to create a soft Mapper graph or directly compute the Mapper graph mode. Notably, our method does not require the derivation of an explicit gradient expression. This simplification is facilitated by automatic differentiation frameworks such as PyTorch (Paszke et al., 2019) or TensorFlow (Abadi et al., 2016), which effectively implement SGD.

To apply SGD to the log-likelihood of a GMM, it is crucial to handle the constraints on the parameters. For the weights $π$, each SGD update step must ensure that $\sum_{i = 1}^{K} π_{i} = 1$ and $π_{i} > 0, i = 1, \dots, K$. To solve this problem, we adopt the method from Gepperth and Pfülb (2021), which introduces free parameters $ξ_{i}, i = 1, \dots, K$. Let $π_{i} = \frac{e^{ξ_{i}}}{\sum_{k = 1}^{K} e^{ξ_{k}}}, i = 1, \dots, K .$

Instead of updating $π_{i}$ directly, this approach updates $ξ_{i}$. This transformation guarantees that $π_{i}$ complies with the constraints at each step. Similarly, to satisfy the standard deviation constraint $σ_{i} > 0$, we take the transformation $\log σ_{i} = ξ_{i}^{'}$ and update $ξ_{i}^{'}$ instead of $σ_{i}$ at each step. The detailed SGD algorithm for the GMM soft Mapper model is presented in Algorithm 1. In this work, we implement the proposed algorithm in Python version 3.10. The extended persistence diagram and bottleneck distance are computed with the Python package GUDHI version 3.8.0, and the stochastic gradient descent is implemented with PyTorch 1.13.
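The two reparameterizations can be sketched without any deep learning framework. The function names are hypothetical and the free-parameter values are arbitrary; in the authors' setting these quantities would be PyTorch tensors updated by the optimizer, which this sketch does not attempt to reproduce.

```python
import math


def weights_from_free(xi):
    """Softmax: pi_i = exp(xi_i) / sum_k exp(xi_k), shifted by max(xi) for stability."""
    top = max(xi)
    exps = [math.exp(x - top) for x in xi]
    total = sum(exps)
    return [e / total for e in exps]


def sigmas_from_free(xi_prime):
    """sigma_i = exp(xi'_i), i.e. log sigma_i = xi'_i."""
    return [math.exp(x) for x in xi_prime]


# Any real-valued free parameters map to valid GMM parameters.
xi = [2.0, -1.0, 0.5]
xi_prime = [-3.0, 0.0, 1.2]
pi = weights_from_free(xi)
sigma = sigmas_from_free(xi_prime)
```

Since every real vector maps to admissible weights and standard deviations, an unconstrained gradient step on the free parameters can never leave the feasible region.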

4. SYNTHETIC DATASETS

In this section, we compare our proposed algorithm to the standard Mapper algorithm and D-Mapper on synthetic datasets to demonstrate its effectiveness. The output graphs indicate that our proposed method performs comparably to or better than the standard Mapper and D-Mapper algorithms with respect to the topological structures. To quantitatively compare these algorithms, we calculate the Silhouette Coefficient (SC) (Han et al., 2022) and the adjusted SC ( $S C_{adj}$ ) (Tao and Ge, 2025) for each dataset in Section 4.4 (see Table 2). To illustrate the uncertainty in the soft Mapper graph, we also output a few samples from the optimized GMM soft Mapper distribution of each synthetic dataset (see Supplementary Appendix A2, Figs. A1–A7). These figures indicate that, for each dataset, graphs sampled with the same probability matrix can still differ considerably in subtle features, which reflects the inherent uncertainty of the graphs. We also provide summary statistics (mean, confidence interval, median, and mode) of several graph metrics (degree, number of connected components, and number of loops) over 1000 samples from the optimized distribution for each synthetic dataset in Supplementary Table A1 in Supplementary Appendix A2. The training loss trace plot of the GMM soft Mapper while optimizing parameters for each synthetic dataset is given in Supplementary Figure A12a–f in Supplementary Appendix A5.

4.1. Uniform dataset

We start with two uniformly sampled datasets. A total of 1000 data points are sampled from two separate circles and from two overlapping circles, respectively. Figure 2a and f provide a visualization of these two datasets and the (implicit) intervals produced by each algorithm. To visualize the implicit intervals, we take the minimum value within each cluster as the starting point of the interval and the maximum value within each cluster as the ending point of the interval. In both datasets, the filter function is the x-coordinate. For the disjoint circles dataset, we set the number of (implicit) intervals to $K = 6$ and apply the DBSCAN clustering algorithm with parameters $ϵ = 0.3$ and $minPts = 5$. For the intersecting circles dataset, we take $K = 5$ and the same clustering algorithm with parameters $ϵ = 0.2$ and $minPts = 5$. Full details of the parameters for this and the remaining synthetic datasets are listed in Table 1. The Mapper graph modes without and with optimization are depicted in Figure 2d, i and Figure 2e, j, respectively. The output Mapper graphs of the standard Mapper algorithm and the D-Mapper algorithm are given in Figure 2b, g and Figure 2c, h, respectively. The output graphs demonstrate that all the Mapper graphs effectively approximate the topological structures of these datasets.
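The visualization rule for implicit intervals (minimum and maximum projected value within each group) can be sketched as follows. The function name and the toy inputs are hypothetical; returning `None` for a group with no members is an assumption of this sketch.

```python
def implicit_intervals(ys, H):
    """For each group j, return (min, max) of the projected values assigned to it.

    H is an n x K assignment matrix; H[i][j] >= 1 means point i is in group j.
    """
    K = len(H[0])
    intervals = []
    for j in range(K):
        members = [y for y, row in zip(ys, H) if row[j] >= 1]
        intervals.append((min(members), max(members)) if members else None)
    return intervals


# Four projected values and a toy assignment with m = 2 events per row.
ys = [0.1, 0.4, 0.5, 0.9]
H = [[2, 0], [1, 1], [1, 1], [0, 2]]
drawn = implicit_intervals(ys, H)
```

Here the two recovered intervals overlap on $[0.4, 0.5]$, which is exactly the range of the points shared by both groups.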

FIG. 2.

Comparison of the standard Mapper algorithm, D-Mapper algorithm and our proposed algorithm on a two disjoint circles dataset and a two intersecting circles dataset. (a, f) A visualization of the datasets. Colored lines represent the (implicit) intervals produced by the proposed algorithm with optimization. (b, g) The output graphs of the standard Mapper algorithm with $K = 6, p = 0.33$ for the two disjoint circles and $K = 5, p = 0.2$ for the two intersecting circles. (c, h) The output graphs of the D-Mapper algorithm with $α = 0.08$ for the disjoint circles and $α = 0.31$ for the intersecting circles. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.

Table 1.

Model Parameters Setting for Synthetic Datasets

Dataset	K	Clustering	Learning rate	N
TCs	6	DBSCAN(0.3,5)	0.005	200
TICs	5	DBSCAN(0.2,5)	0.01	300
Unequal-sized TCs	5	DBSCAN(0.35,5)	0.001	300
Unequal-sized TICs	5	DBSCAN(0.2,5)	0.001	400
TICs with small noises	6	DBSCAN(0.2,5)	0.002	250
TICs with big noises	6	DBSCAN(0.2,5)	0.001	300
3D human	8	DBSCAN(0.1,5)	0.0001	200

DBSCAN, density-based spatial clustering of applications with noise; TCs, two disjoint circles; TICs, two intersecting circles.

As the underlying data structure of these two small datasets is relatively simple, all three algorithms can easily reveal it. The topological structure of the data can even be captured well by the Mapper graph mode without optimization, given an appropriate choice of the GMM initial values.

We make slight modifications to the datasets above, resulting in two disjoint and two intersecting circles with different radii, each dataset containing 1000 points. Figure 3a and f provide a visualization of these two datasets and the (implicit) intervals produced by each algorithm. The same filter function is applied. For the unequal-sized disjoint circles, the DBSCAN parameters are set to $ϵ = 0.35$ and $minPts = 5$ , and for the unequal-sized intersecting circles, the parameters are $ϵ = 0.2$ and $minPts = 5$ . The results of the three algorithms are shown in Figure 3b–e for the unequal-sized disjoint circles, and in Figure 3g–j for the unequal-sized intersecting circles. The standard Mapper struggles to capture the correct topological structure due to its fixed interval setting (Fig. 3b and g), whereas the D-Mapper (Fig. 3c and h) and our method (Fig. 3d, e, and j) can adjust intervals based on the data distribution, facilitating the accurate capture of the data’s underlying shape. In the case of the unequal-sized intersecting circles dataset, the Mapper graph mode without optimization fails to capture the data structure correctly (Fig. 3i). However, after optimization, the implicit intervals (groups) are slightly adjusted, leading to more accurate results (Fig. 3f and j).
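To make the "fixed interval setting" of the standard Mapper concrete, the sketch below builds a uniform cover of the filter range from a number of intervals and an overlap ratio. It follows one common convention, in which the overlap ratio is the fraction of each interval shared with its neighbour; the function names are hypothetical and other implementations may define the overlap differently.

```python
def uniform_cover(f_min, f_max, n_intervals, overlap):
    """Equal-length intervals covering [f_min, f_max].
    `overlap` is the fraction of each interval shared with its neighbour
    (an assumed convention; definitions vary between implementations)."""
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must lie in [0, 1)")
    # Choose the length so that the last interval ends exactly at f_max.
    length = (f_max - f_min) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    return [(f_min + k * step, f_min + k * step + length) for k in range(n_intervals)]

def members(filter_values, interval):
    """Indices of the points whose filter value falls inside the interval."""
    start, end = interval
    return [i for i, v in enumerate(filter_values) if start <= v <= end]

cover = uniform_cover(0.0, 1.0, n_intervals=5, overlap=0.2)
for interval in cover:
    print(interval)
```

Every interval has the same length regardless of where the data are dense or sparse, which is the limitation that data-adaptive and implicit intervals are meant to relax.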

FIG. 3.

Comparison of the standard Mapper algorithm, the D-Mapper algorithm, and our proposed algorithm on a two unequal-sized disjoint circles dataset and a two unequal-sized intersecting circles dataset. (a, f) A visualization of the datasets. Colored lines represent the (implicit) intervals produced by the proposed algorithm with optimization. (b, g) The output graphs of the standard Mapper algorithm with $K = 6, p = 0.3$ for the disjoint circles and $K = 5, p = 0.2$ for the intersecting circles. (c, h) The output graphs of the D-Mapper algorithm with $α = 0.06$ for the disjoint circles and $α = 0.05$ for the intersecting circles. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.

4.2. Noisy dataset

In this section, we assess the robustness of our proposed algorithm by introducing noise into the intersecting circles dataset. The noise is drawn from a Gaussian distribution and is added to each point’s coordinates. Initially, we introduce a small amount of noise, drawn from a Gaussian distribution with a mean of zero and a standard deviation of 0.1. The dataset is shown in Figure 4a. With this noise setting, the standard Mapper, the D-Mapper, and our Mapper graph mode can all accurately represent the correct topological structure. The results are shown in Figure 4b–e.
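The noise model can be sketched as follows. The circle radius and centre offset are assumptions made for the example (the text does not restate them here); only the zero-mean Gaussian noise with a standard deviation of 0.1 applied to each coordinate is taken from the description above.

```python
import math
import random

def sample_two_intersecting_circles(n, radius=1.0, offset=1.0, seed=0):
    """Sample n points uniformly (by angle) from two overlapping circles.
    The radius and centre offset are hypothetical choices for illustration."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        cx = 0.0 if i % 2 == 0 else offset  # alternate between the two centres
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((cx + radius * math.cos(theta), radius * math.sin(theta)))
    return points

def add_gaussian_noise(points, sigma, seed=1):
    """Add independent N(0, sigma^2) noise to every coordinate of every point."""
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in points]

clean = sample_two_intersecting_circles(1000)
noisy = add_gaussian_noise(clean, sigma=0.1)
```

Calling `add_gaussian_noise(clean, sigma=0.3)` would give the analogue of the larger noise level considered later in this section.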

FIG. 4.

Comparison of the standard Mapper algorithm, the D-Mapper algorithm, and our proposed algorithm on a two intersecting circles dataset with noise. (a, f) A visualization of the datasets. Colored lines represent the (implicit) intervals produced by the proposed algorithm with optimization. (b, g) The output graphs of the standard Mapper algorithm with $K = 6, p = 0.4$ for the small noise dataset and $K = 6, p = 0.4$ for the big noise dataset. (c, h) The output graphs of the D-Mapper algorithm with $α = 0.025$ for the small noise dataset and $α = 0.07$ for the big noise dataset. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.

Subsequently, we raise the noise level by increasing the standard deviation to 0.3. The dataset is shown in Figure 4f. At this noise level, the standard Mapper algorithm is unable to output meaningful structures, as illustrated in Figure 4g. The D-Mapper algorithm can capture the two main loops of the data, although its output still has some extra branches and an isolated node, as shown in Figure 4h. Without optimization, the Mapper graph mode does not capture the two circles, as shown in Figure 4i. After optimization, our Mapper graph mode can capture the two primary loops, as shown in Figure 4j, though with some additional branches. These results highlight the efficacy of flexible implicit interval partitioning in uncovering the data’s inherent structure, even in the presence of significant noise. Parameter optimization plays a crucial role in accurately determining the implicit intervals, which boosts the algorithm’s performance.

4.3. 3D human dataset

We finally test our proposed algorithm on a 3D human dataset from Oulhaj et al. (2024). In this example, the number of (implicit) intervals is set to $K = 8$ and DBSCAN clustering with parameters $ϵ = 0.1, minPts = 5$ is used. The filter function is the mean distance between each sample and all other samples. The standard Mapper outputs an asymmetrical skeleton, while both the D-Mapper and the Mapper graph mode without optimization produce a symmetrical skeleton. After optimization, the Mapper graph mode outputs a more concise structure (a symmetric skeleton with fewer nodes) than the standard Mapper graph and the D-Mapper graph, as shown in Figure 5. These results indicate that the proposed soft Mapper algorithm visualizes topological structures better than the standard Mapper, and that the optimization process further refines the resulting structure.
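The filter function used here can be sketched in a few lines. The example assumes the Euclidean distance, which the text does not state explicitly for this dataset, and the function name is hypothetical.

```python
import math

def mean_distance_filter(points):
    """Filter value of each point: its mean distance to all other points.
    Euclidean distance is an assumption made for this sketch."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    values = []
    for i, p in enumerate(points):
        total = sum(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(total / (n - 1))
    return values

# Three collinear points: the middle one is, on average, closest to the others.
print(mean_distance_filter([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
# -> [1.5, 1.0, 1.5]
```

Points near the centre of the shape receive small filter values and points at the extremities (for a human figure, the hands, feet, and head) receive large ones. The double loop costs $O(n^{2})$ distance evaluations, which is acceptable for small datasets but would call for a vectorized implementation on larger ones.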

FIG. 5.

Comparison of the standard Mapper algorithm, the D-Mapper algorithm, and our proposed algorithm on a 3D human dataset. (a) A visualization of the 3D human dataset. (b) The projected data and the resulting (implicit) intervals of each algorithm. (c) The output graph of the standard Mapper algorithm; the number of intervals is 8 and the overlap rate is 0.1. (d) The output graph of the D-Mapper algorithm with parameter $α = 0.071$ . (e) The Mapper graph mode without optimization. (f) The Mapper graph mode with optimization.

4.4. Quantitative summary of synthetic datasets

Through a series of synthetic dataset experiments, we find that the optimized Mapper graph mode generally captures dataset shapes more accurately. It should be noted that the quantitative evaluation of Mapper graphs is a challenging problem, as the topological information of given data is often not available (Bui et al., 2020; Carriere et al., 2018). As in most of the literature, we compute the SC; this metric accounts for clustering performance only (Bui et al., 2020; Chalapathi et al., 2021; Dłotko, 2019). The $S C_{adj}$ accounts for both the clustering performance and, to some extent, the topological information (Tao and Ge, 2025).

As shown in Table 2, in all cases but the two intersecting circles dataset and the 3D human dataset, the D-Mapper performs comparably to or better than the standard Mapper, and the Mapper graph mode performs best in terms of the metric $S C_{adj}$ . For the two intersecting circles dataset, the $S C_{adj}$ of the D-Mapper is lower than that of the standard Mapper graph. This is mainly due to the lower topological signal rate (TSR) value of the D-Mapper. This result indicates that the D-Mapper algorithm may be unstable in some cases when the number of intervals is small ( $K = 5$ ); when $K$ is set to a larger number ( $K = 8$ ), denoted as D-Mapper* in Table 2, a larger TSR is obtained. The graph of D-Mapper* is given in Supplementary Figure A8, and more explanations are provided in Supplementary Appendix A3. For the 3D human dataset, the $S C_{adj}$ of the Mapper graph mode is slightly lower than that of the standard Mapper and D-Mapper, but as the output graphs indicate (see Fig. 5), the Mapper graph mode has a more condensed structure.

Table 2.
Quantitative Comparison of the Mapper, D-Mapper, and Mapper Graph Mode Across Different Datasets

Datasets	Methods	SC_norm	TSR	SC_adj
TCs	Mapper	0.59	1	0.8
	D-Mapper	0.68	1	0.84
	Mapper mode	0.73	1	0.86
TICs	Mapper	0.58	1	0.79
	D-Mapper	0.59	0.33	0.46
	D-Mapper*	0.59	1	0.79
	Mapper mode	0.65	1	0.83
3D human	Mapper	0.51	1	0.76
	D-Mapper	0.5	1	0.75
	Mapper mode	0.44	1	0.72
Unequal-sized TCs	Mapper	0.61	1	0.8
	D-Mapper	0.63	1	0.82
	Mapper mode	0.7	1	0.85
Unequal-sized TICs	Mapper	0.64	1	0.82
	D-Mapper	0.63	1	0.82
	Mapper mode	0.68	1	0.84
TICs with small noises	Mapper	0.52	0.29	0.40
	D-Mapper	0.55	0.33	0.44
	Mapper mode	0.5	1	0.77
TICs with big noises	Mapper	0.43	0.25	0.34
	D-Mapper	0.47	0.25	0.36
	Mapper mode	0.6	0.48	0.54

Generally, the D-Mapper has a higher value of metric $S C_{adj}$ than the standard Mapper and the Mapper graph mode performs best concerning metric $S C_{adj}$ . The D-Mapper* refers to the D-Mapper but with a larger number of intervals ( $K = 8$ ).

TCs, two disjoint circles; TICs, two intersecting circles.
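As a reading aid for Table 2, the following sketch checks how the three columns relate. It assumes, as an inference from the tabulated values and not as a definition restated in this section, that $S C_{adj}$ is the simple average of $S C_{norm}$ and TSR; the precise definition is given in Tao and Ge (2025).

```python
# (dataset, method, SC_norm, TSR, reported SC_adj), copied from Table 2.
rows = [
    ("TCs", "Mapper", 0.59, 1, 0.8),
    ("TCs", "D-Mapper", 0.68, 1, 0.84),
    ("TCs", "Mapper mode", 0.73, 1, 0.86),
    ("TICs", "Mapper", 0.58, 1, 0.79),
    ("TICs", "D-Mapper", 0.59, 0.33, 0.46),
    ("TICs", "D-Mapper*", 0.59, 1, 0.79),
    ("TICs", "Mapper mode", 0.65, 1, 0.83),
    ("3D human", "Mapper", 0.51, 1, 0.76),
    ("3D human", "D-Mapper", 0.5, 1, 0.75),
    ("3D human", "Mapper mode", 0.44, 1, 0.72),
    ("Unequal-sized TCs", "Mapper", 0.61, 1, 0.8),
    ("Unequal-sized TCs", "D-Mapper", 0.63, 1, 0.82),
    ("Unequal-sized TCs", "Mapper mode", 0.7, 1, 0.85),
    ("Unequal-sized TICs", "Mapper", 0.64, 1, 0.82),
    ("Unequal-sized TICs", "D-Mapper", 0.63, 1, 0.82),
    ("Unequal-sized TICs", "Mapper mode", 0.68, 1, 0.84),
    ("TICs with small noises", "Mapper", 0.52, 0.29, 0.40),
    ("TICs with small noises", "D-Mapper", 0.55, 0.33, 0.44),
    ("TICs with small noises", "Mapper mode", 0.5, 1, 0.77),
    ("TICs with big noises", "Mapper", 0.43, 0.25, 0.34),
    ("TICs with big noises", "D-Mapper", 0.47, 0.25, 0.36),
    ("TICs with big noises", "Mapper mode", 0.6, 0.48, 0.54),
]

TOLERANCE = 0.006  # table entries are rounded to two decimals

def averaged_score(sc_norm, tsr):
    """Assumed form of SC_adj: the mean of the clustering and topology terms."""
    return (sc_norm + tsr) / 2.0

# Rows whose reported SC_adj differs from the assumed average by more than rounding.
mismatches = [
    (dataset, method)
    for dataset, method, sc_norm, tsr, reported in rows
    if abs(averaged_score(sc_norm, tsr) - reported) > TOLERANCE
]
print(mismatches)
```

Under this assumption, every row agrees with the reported $S C_{adj}$ up to rounding except the Mapper mode on the TICs with small noises dataset (0.75 computed against 0.77 reported), which may reflect digits not shown in the table. The check also makes the role of TSR visible: a low TSR pulls $S C_{adj}$ down even when the clustering term is comparable across methods.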

5. APPLICATION

We also apply our method to an RNA expression dataset to assess its ability to identify subgroups within AD patients. The dataset is obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB) (Wang et al., 2018) and includes RNA gene expression profiles from four distinct brain regions: the frontal pole (FP) in Brodmann area 10, the superior temporal gyrus (STG) in area 22, the parahippocampal gyrus (PHG) in area 36, and the inferior frontal gyrus (IFG) in area 44. In this application study, we focus on brain area 36, which includes 215 patient samples, each with over 20,000 gene expression values. Each patient is given a Braak AD staging score, ranging from 0 to 6, with higher scores indicating more severe disease stages (Braak et al., 2003). The Braak score is treated as a label for each sample. Our goal is to identify subgroups whose distribution of Braak scores differs significantly from that of the rest of the population, given the gene expression profiles. These subgroups can appear as branches or isolated nodes in the hidden topological structure of the gene expression profiles.

As indicated in Zhou and Sharpee (2021), gene expression data could exhibit a hyperbolic structure, especially as the number of genetic sites increases. To capture the complex and hierarchical relationships between samples based on gene expression more effectively, we use the Lorentzian distance, a standard hyperbolic distance, to measure the similarity between samples. The hyperbolic distance can preserve the hierarchical structures inherent in gene expression data and is considered particularly suitable for high-dimensional data (Liu et al., 2024). In the Lorentzian distance, the centroid is crucial because of its ability to reveal the hierarchical structure of datasets. In this study, we take each data point in the dataset as the centroid in turn, compute the corresponding distance matrix, and use the average of these distance matrices as the final distance matrix. The filter function is set to the mean distance between each sample and all other samples. Due to the high computational cost of calculating the distance matrix for over 20,000 genes, we select the top 37 gene sites as listed in Wang et al. (2016). These gene sites are from the top 50 probes ranked in association with disease traits in 19 brain regions, excluding those unmatched. The selected gene sites are then used to compute the hyperbolic distances.

We set the number of (implicit) intervals to $K = 15$ , the learning rate to $γ = 0.001$ , and the number of steps to $N = 300$ . We apply agglomerative clustering with a threshold of 3.13 (Müllner, 2011). In addition, we initialize the parameters by fitting a GMM to the projected data. The graph modes of the GMM soft Mapper algorithm without and with optimization are given in Supplementary Figures A9 and A10 in Supplementary Appendix A4. The training loss trace plot of the GMM soft Mapper while optimizing the parameter is given in Supplementary Figure A12g, h in Supplementary Appendix A5. The resulting optimized Mapper graph mode offers valuable insights into AD progression. The $χ^{2}$ test is used to determine if there is a statistically significant difference between the distribution of the identified subgroup and the rest of the population (Mannan and Meslow, 1984).
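For readers less familiar with mixture models, the sketch below shows the standard posterior membership probabilities of a one-dimensional GMM evaluated at a projected (filter) value. This is generic GMM arithmetic and not the full assignment scheme of the soft Mapper, which is described in the method sections; the parameter values in the example are hypothetical.

```python
import math

def log_gaussian(x, mean, std):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given the value x.
    Computed in log space so that far-away points do not underflow to 0/0."""
    log_joint = [
        math.log(w) + log_gaussian(x, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    shift = max(log_joint)
    unnormalized = [math.exp(v - shift) for v in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical three-component mixture over the filter values.
weights = [0.3, 0.4, 0.3]
means = [0.0, 1.0, 2.0]
stds = [0.4, 0.4, 0.4]
for x in (0.0, 0.5, 1.0):
    print(x, [round(r, 3) for r in responsibilities(x, weights, means, stds)])
```

A value lying between two component means receives non-negligible probability from both, which is the mechanism that lets a point belong to more than one implicit interval.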

In brain area 36, we find a distinct branch that is associated with high Braak scores, as shown in the red dashed box in Figure 6a. The bar charts in Figure 6b reveal distribution differences between this branch and the other nodes. The p value of the $χ^{2}$ test is 0.0047, which is less than 0.05, indicating a significant difference in distribution. In total, 73% of patients in this unique group have severe AD (Braak scores of 5 or 6), while the average ratio in the remaining nodes is 36.8%. This indicates that the gene expression patterns in this cohort differ from the rest and merit further investigation. Figure 6c provides a visualization of the projected data and the corresponding optimized GMM probability density.
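The kind of comparison made by the $χ^{2}$ test can be illustrated with a simplified two-by-two version (severe versus non-severe, subgroup versus rest). The counts below are purely hypothetical and are not the study data, and the test reported above may have been run on the full distribution of Braak scores, so this sketch is not expected to reproduce the reported p value.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a positive total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: (severe, non-severe) in the subgroup and in the rest.
statistic, p_value = chi_square_2x2(8, 2, 20, 30)
print(statistic, p_value)
```

With small expected cell counts, an exact test or a continuity correction may be preferable to the plain Pearson statistic used in this sketch.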

FIG. 6.

Results for brain area 36. (a) The output Mapper graph mode of a GMM soft Mapper with optimization. The pie chart at each node illustrates the distribution of Braak scores within the node, and the number next to each node gives the sample size of the node. Nodes within the red dashed box form a cluster that exhibits a distinct distribution compared with samples in the remaining nodes. (b) The bar chart compares the distribution of Braak scores of the nodes within the red dashed box with that of the remaining nodes. (c) A visualization of the projected data along with the optimized GMM probability density. In this figure, the values of the major y-axis represent the GMM density and the values of the minor y-axis represent the Braak score of each data point. Colored markers indicate the implicit intervals to which a point belongs. GMM, Gaussian mixture model.

We also run the standard Mapper and the D-Mapper on this dataset, and the results are shown in Supplementary Figure A11 in Supplementary Appendix A4. The standard Mapper exhibits a distinct branch (in the red dashed box) that demonstrates a significant distributional difference compared with the other nodes (81% severe conditions vs. 37.1% severe conditions, with a $χ^{2}$ test p value of 0.0024), as indicated in Supplementary Figure A11a in Supplementary Appendix A4. However, this branch has more nodes and sub-branches, which indicates a more fragmented structure than that of the Mapper graph mode. As shown in Supplementary Figure A11b in Supplementary Appendix A4, no similar branch is observed in the D-Mapper graph, which contains many scattered nodes (or connected components). This further supports that one of the advantages of our algorithm is its ability to produce a more concise topological structure of the data.

6. DISCUSSION

In this work, we introduce a novel approach for flexibly constructing a Mapper graph based on a probability model. We develop implicit intervals for soft Mapper graphs using a Gaussian mixture model and a multinomial distribution. With a given number of implicit intervals, our algorithm automatically assigns each data point to an implicit interval based on the allocation probability. Then, based on these implicit intervals, we derive the concept of the Mapper graph mode as a point estimation. Additionally, we design an optimization approach that enhances the topological structure of Mapper graphs by minimizing a specific loss function. This function considers both the likelihood of the projected data with respect to the GMM and the topological information of the Mapper graph mode. This optimization process is particularly suited for complex and noisy datasets, as the standard Mapper algorithm can be sensitive to noise and may fail in certain situations. Both simulation and application studies demonstrate its effectiveness in capturing the underlying topological structures. The application of our proposed algorithm to the gene expression of brain area 36 of the MSBB successfully identifies a unique subgroup whose distribution of Braak scores differs significantly from the rest of the nodes.

It is worth noting that the multimodal nature of the objective function makes it challenging to find a globally optimal Mapper graph. We suggest performing multiple optimization runs with careful parameter tuning and incorporating domain knowledge to generate high-quality Mapper graphs. In addition, the mean persistence loss function used in this work is a simple representation of the information in the extended persistence diagram. Numerous alternative approaches may be available; for example, one could consider distinguishing signals from noise in a Mapper graph using confidence sets (Fasy et al., 2013). Finding a more robust persistence-specific loss function will be one direction of future work. The filter function is also important for constructing a meaningful Mapper graph. In this work, covers are constructed based on the filtered data, as in most of the literature. Optimizing the filter function and the (implicit) intervals simultaneously would be a challenging but meaningful direction. Furthermore, an appropriate quantitative assessment of the Mapper algorithms remains a significant challenge. We adopt $S C_{adj}$ in this work, but this measure still has several limitations, for example, how to balance the trade-off between clustering performance and topological information. In addition, the SC tends to favor fine clusterings, which may also limit the usefulness of the $S C_{adj}$ . Developing a more robust and objective quantitative evaluation metric for the Mapper algorithms constitutes a critical open problem in the field. Lastly, using a Mapper algorithm to carry out association studies between genetic sites and phenotypes simultaneously is a promising direction that would further broaden the applications of the Mapper algorithms in computational biology.

Footnotes

ACKNOWLEDGMENTS

The authors thank ShanghaiTech University for supporting this work through the startup fund and the High Performance Computing (HPC) platform.

AUTHORS’ CONTRIBUTIONS

Y.T.: Conceptualization, data collection, formal analysis, methodology, coding, visualization, writing—original draft. S.G.: Conceptualization, data curation, formal analysis, methodology, coding evaluation, project administration, supervision, writing—review and editing. All authors read and approved the final article.

AVAILABILITY OF DATA AND MATERIALS

The article employs publicly accessible datasets in the 3D human synthetic and application experiments. The 3D human dataset was originally created by Oulhaj et al. (2024), and is accessible at https://github.com/ZiyadOulhaj/Mapper-Optimization/tree/main/Datasets. The RNA expression sequences used in the application are obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (Wang et al., 2018) and are available at https://www.synapse.org/ with SynID 20801188. The rest of the synthetic data, the results of all experiments, and the code are available at .

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests and no personal interests to disclose.

FUNDING INFORMATION

This project was supported by the startup fund of ShanghaiTech University, Shanghai Science and Technology Program (No. 21010502500) and National Natural Science Foundation of China (12401383).

SUPPLEMENTARY MATERIAL

References

1. Abadi, Barham, Chen, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

2. Bauer, Botnan, Fluhr. Universal distances for extended persistence. J Appl and Comput Topology, 2024; 8(3):475–530.

3. Bodnar, Cangea, Liò. Deep graph mapper: Seeing graphs through the neural lens. Front Big Data, 2021; 4:680535.

4. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris France, August 22–27, 2010 Keynote, Invited and Contributed Papers, pages 177–186. Springer, 2010.

5. Braak, Del Tredici, Rüb, et al. Staging of brain pathology related to sporadic Parkinson’s disease. Neurobiol Aging, 2003; 24(2):197–211.

6. Brown, Bobrowski, Munch, et al. Probabilistic convergence and stability of random mapper graphs. J Appl and Comput Topology, 2021; 5(1):99–140.

7. Bubenik, et al. Statistical topological data analysis using persistence landscapes. J Mach Learn Res, 2015; 16(1):77–102.

8. Bui Q-T, Vo, Do H-AN, et al. F-mapper: A fuzzy mapper clustering algorithm. Knowledge-Based Systems, 2020; 189:105107.

9. Carriere, Oudot. Structure and stability of the one-dimensional mapper. Found Comput Math, 2018; 18(6):1333–1396.

10. Carriere, Michel, Oudot. Statistical analysis and parameter selection for mapper. Journal of Machine Learning Research, 2018; 19(12):1–39.

11. Chalapathi, Zhou, Wang. Adaptive covers for mapper graphs using information criteria. In 2021 IEEE International Conference on Big Data (Big Data), pages 3789–3800. IEEE, 2021.

12. Cyranka, Georges, Meyer. Mapper based classifier. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1099–1106. IEEE, 2019.

13. Dey, Mémoli, Wang. Topological analysis of nerves, Reeb spaces, mappers, and multiscale mappers. arXiv Preprint arXiv, 2017.

14. Dłotko. Ball mapper: A shape summary for topological data analysis. arXiv Preprint arXiv, 2019.

15. Fasy, Lecci, Rinaldo, et al. Confidence sets for persistence diagrams. arXiv Preprint arXiv, 2013.

16. Fitzpatrick, Jurek-Loughrey, Dłotko, et al. Ensemble learning for mapper parameter optimization. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), pages 129–134. IEEE, 2023.

17. Gepperth, Pfülb. Gradient-based training of Gaussian mixture models for high-dimensional streaming data. Neural Process Lett, 2021; 53(6):4331–4348.

18. Han, Pei, Tong. Data Mining: Concepts and Techniques. Morgan Kaufmann; 2022.

19. Kamruzzaman, Kalyanaraman, Krishnamoorthy, et al. Hyppo-x: A scalable exploratory framework for analyzing complex phenomics data. IEEE/ACM Trans Comput Biol Bioinform, 2021; 18(4):1535–1548.

20. Kang, Lim. Ensemble mapper. Stat, 2021; 10(1):e405.

21. Kumari, Rupela, Gupta, et al. Shapevis: High-dimensional data visualization at scale. In Proceedings of The Web Conference 2020, pages 2920–2926, 2020.

22. Liu, Lubold, Raftery, et al. Bayesian hyperbolic multidimensional scaling. J Computational Graphical Statistics, 2024; 33(3):869–882.

23. Mannan, Meslow. Bird populations and vegetation characteristics in managed and old-growth forests, northeastern Oregon. J Wildlife Management, 1984; 48(4):1219–1238.

24. Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv Preprint arXiv, 2011.

25. Nicolau, Levine, Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc Natl Acad Sci U S A, 2011; 108(17):7265–7270.

26. Oulhaj, Carrière, Michel. Differentiable mapper for topological optimization of data representation. arXiv Preprint arXiv, 2024.

27. Paszke, Gross, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv Preprint arXiv, 2019; 10.

28. Rizvi, Camara, Kandror, et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol, 2017; 35(6):551–560.

29. Schubert, Sander, Ester, et al. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans Database Syst, 2017; 42(3):1–21.

30. Singh, Mémoli, Carlsson, et al. Topological methods for the analysis of high dimensional data sets and 3D object recognition. PBG Eurographics, 2007; 2:91–100.

31. Skaf, Laubenbacher. Topological data analysis in biomedicine: A review. J Biomed Inform, 2022; 130:104082.

32. Tao, Ge. A distribution-guided mapper algorithm. BMC Bioinformatics, 2025; 26(1):73.

33. Wang, Roussos, McKenzie, et al. Integrative network analysis of nineteen brain regions identifies molecular signatures and networks underlying selective regional vulnerability to Alzheimer’s disease. Genome Med, 2016; 8(1):104.

34. Wang, Beckmann, Roussos, et al. The Mount Sinai cohort of large-scale genomic, transcriptomic and proteomic data in Alzheimer’s disease. Sci Data, 2018; 5:180116–180185.

35. Zhou, Sharpee. Hyperbolic geometry of gene expression. iScience, 2021; 24(3):102225.
