A REsampling and Visual EvALuation Method to Detect and Map Local Model Violations During Biomolecular Sequence Analysis

Abstract

A fundamental assumption in phylogenetics and phylogenomics is that a single, global evolutionary model can adequately characterize the substitution processes operating across all sites in a molecular sequence alignment. However, this assumption is frequently violated in practice due to heterogeneity in evolutionary processes, leading to local model mis-specification and potential bias in downstream inference. While a variety of statistical and machine learning-based approaches have been developed to address this issue, these methods often rely on restrictive model assumptions or are designed for narrowly scoped applications, limiting their generalizability across diverse datasets and evolutionary contexts. Here, we present REVEAL (“REsampling and Visual EvALuation”), a general-purpose statistical framework for detecting and localizing model mis-specification in biomolecular sequence data. REVEAL operates without introducing additional assumptions beyond those inherent to standard global model-based analyses. It employs sequence-aware statistical resampling to construct a local support matrix along the sequence alignment, facilitating the identification of site-level model violations. Through extensive simulation experiments, we demonstrate that REVEAL achieves robust control of both type I and type II errors, with precision of 90% or greater and recall of 85% or greater across diverse evolutionary scenarios involving different sources of model heterogeneity, varying dataset sizes in terms of sequence length and number of taxa, and other experimental factors. We further apply REVEAL to genomic data from mouse and mosquito, uncovering localized model violations that are consistent with previously reported biological signals. These results establish REVEAL as a flexible and effective tool for evaluating model adequacy in phylogenetic and phylogenomic analyses.

Keywords

biomolecular sequence analysis; model mis-specification; phylogenetic estimation; statistical resampling

1. INTRODUCTION

Statistical phylogenetic methods typically rely on parametric evolutionary models to reconstruct and analyze evolutionary histories. These models are designed to account for the complex evolutionary processes that shape genetic data over time. For instance, maximum likelihood estimation methods require a model of sequence evolution to describe changes in biomolecular sequences over time. However, a given parametric model can fail to adequately describe underlying evolutionary processes and bias statistical estimation—an outcome referred to as a model violation. Model violations generally fall into two categories: “global” and “local” model violations.

Global model violations occur when the evolutionary model fails to adequately represent evolutionary processes that apply to all sites in the locus or loci under study. For example, substitution model violations can occur in traditional phylogenetic analyses, and computational methods have been proposed to detect and address these model violations (Shepherd and Klaere, 2019; Burgstaller-Muehlbacher et al., 2023).

Local model violations occur when a given evolutionary model may be suitable for some sites and/or regions in a biomolecular sequence dataset but not for other sites/regions. Complex evolutionary processes such as recombination, region-specific selection pressures, incomplete lineage sorting (ILS), and horizontal gene transfer can lead to deviations from the global modeling assumption that all sites under study evolved in an independent and identically distributed (i.i.d.) manner (Felsenstein, 1985). Local model violations can manifest as various types of heterogeneity within sequences, including evolutionary rate heterogeneity across sites and lineages (Jayaswal et al., 2014), base frequency heterogeneity, and local topological heterogeneity across sites. Many studies have utilized parametric model-based methods to examine biomolecular sequence patterns associated with specific evolutionary processes. For example, local genealogical variation due to genetic drift and ILS (Dutheil et al., 2009), recombination and recombination hotspots (Hobolth et al., 2007), and the complex interplay between substitution, recombination, and gene conversion (Gao and Liu, 2021) have been investigated in genetic and genomic sequence data. In parallel, machine learning-based methods have emerged as powerful alternatives to explicitly parametric models. For example, supervised machine learning and deep neural networks have been used to identify genomic regions that evolved via introgression (Ray et al., 2024), map recombination breakpoints and recombination hotspots (Adrion et al., 2020; Li et al., 2022), and detect genomic signatures of selective sweeps (Zhao et al., 2023). 
While parametric model-based and machine learning-based methods can be effective in certain contexts, they are typically designed for specific evolutionary processes that require a priori modeling assumptions, are specialized to particular biomolecular sequence analysis tasks, and often require labeled training data, which may not always be available or accurately annotated.

In this study, we aim to detect local model violations without making any assumptions about sequence evolution (beyond those made in a global model-based analysis) or imposing restrictions on the types of local model violations. The problem formulation under study and our algorithmic solution for the problem point to an automated, data-driven alternative to the traditional approach of formulating a priori modeling assumptions for specific biomolecular sequence analysis tasks and then performing iterative model refinement.

2. METHODS

A primary contribution of this study is a new general-purpose statistical method for detecting and mapping local model mis-specification during biomolecular sequence analysis. A key requirement is that the new method requires no additional modeling assumptions beyond those used for global data analysis (i.e., the traditional simplifying assumption that a single “global” statistical model adequately captures the underlying processes that generated all parts of the input dataset). Stated another way, the new method does not utilize any additional parametric models for subsets of the dataset or the entire dataset (beyond the single model used for global inference and learning over the entire dataset).

We now define the computational problem under study. The input consists of a multiple sequence alignment (MSA) A with N aligned sequences and K sites, a global model $θ$ , and a global model-based estimation method $f_{θ}$ . The output is a classification of each site $a_{i} \in A$ for $1 \leq i \leq K$ into one of z model classes. The most basic task for detecting local model violation utilizes $z = 2$ classes, where a “background” class corresponds to the $θ$ model and a “locally variable” class corresponds to local model(s) (which violate the assumption that $θ$ suffices as a single global model for all sites in A). As a proof of concept, our study’s experiments focus on a particular model and task: finite-site models of nucleotide substitution [i.e., the GTR model (Rodriguez et al., 1990) and nested models of nucleotide substitution] and their use in maximum likelihood estimation of phylogenetic trees. We note that the specific computational problem under study, our algorithmic solution to this problem, and the study design can be adapted to other statistical models and inference/learning tasks. We expand on this point in the Discussion and Conclusions sections.

To address this problem, we introduce REVEAL—a “REsampling and Visual EvALuation” framework to detect and map local model violations during biomolecular sequence analysis. REVEAL consists of a three-stage computational pipeline. A schematic overview of the workflow is shown in Figure 1, and the corresponding pseudocode for local model violation detection is provided in Algorithm 1. The pseudocode for RAWR-based local resampling is provided in Algorithm 2 in the Supplementary Data. In stage one, local resampling is performed on the input MSA A to generate a set of local replicates. Each local replicate is then re-estimated using the model-based method $f_{θ}$ under the global model $θ$ . In stage two, repeatability/agreement of the set of local re-estimates is assessed using statistical calculations. The resulting site-level statistics are packaged into a 2-D (two dimensional) matrix to facilitate the final stage of analysis. In stage three, regions in the 2-D matrix with similar site-level local statistics that suggest similar local re-estimation repeatability or lack thereof are identified. The identified regions delineate local model variation that can result in model mis-specification if not properly accounted for (e.g., as in a traditional model-based sequence analysis that assumes a single global model suffices for all data under study). The 2-D input matrix can be readily visualized as an image, which naturally lends itself to unsupervised machine learning approaches for image processing. We now provide technical details for each stage of the REVEAL algorithm.

FIG. 1.

Illustrated overview of REVEAL, a “REsampling and Visual EvALuation” framework for global-model-agnostic and local-model-free mapping of local model violations during biomolecular sequence analysis. REVEAL consists of a computational pipeline with three stages. (1) The first stage performs local resampling and re-estimation along an input multiple sequence alignment A. (2) The second stage quantitatively assesses agreement/disagreement among local re-estimates from the first stage. Statistics used for quantitative assessment are referred to as “local support”. Reduced local support values (and reproducibility of local re-estimation) provide a key indicator of local model variation that can confound biomolecular sequence analysis if not accounted for properly. Per-site local support values are aggregated into a 2-D matrix. The illustration includes an example matrix C. (3) The 2-D matrix can be visualized as an image and also lends itself well to image processing techniques. With this insight in mind, the final stage of REVEAL uses unsupervised clustering to estimate regions with similar local model variation in the site-level 2-D matrix. The illustration includes an example with REVEAL-estimated regions shown in green (and compared against ground truth in orange).

2.1. REVEAL algorithm

2.1.1. REVEAL stage 1: local resampling and re-estimation

The inputs to REVEAL consist of a MSA A, where each of the N rows represents an aligned sequence corresponding to a specific taxon, and each of the K columns represents a site, as well as the global model $θ$ and estimation method $f_{θ}$ for performing a global analysis of the entirety of A. The first stage of REVEAL performs local resampling and re-estimation on A.

A sliding-window approach is used to perform local resampling and re-estimation. We define a local window $s_{i} = A [:, i - \frac{w}{2} : i + \frac{w}{2}]$ to be the subset of columns in A centered at the $i^{th}$ site, where w is the window length. To systematically resample local sequences, local resampling is performed at regular intervals along A. The process begins at the first site and proceeds with a fixed step size p. The window centers are located at $i = 0, p, 2 p, 3 p, \dots, ⌊ \frac{K}{p} ⌋ p$ , where K is the total length of A. Consequently, the total number of extracted local sequences is $b = ⌊ \frac{K}{p} ⌋ + 1$ .
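As an illustration of the window layout described above, the following Python sketch enumerates window centers and their column ranges. The function name `window_bounds` and the clipping of windows at the two ends of the alignment are assumptions made for this example; the text does not state how windows that extend past the first or last site are handled.

```python
def window_bounds(K, w=300, p=50):
    """Return (center, start, stop) column ranges for sliding windows.

    Centers follow i = 0, p, 2p, ..., floor(K/p) * p. Clipping each window
    to [0, K) is an assumption of this sketch, not documented behavior.
    """
    half = w // 2
    bounds = []
    for i in range(0, (K // p) * p + 1, p):
        start = max(0, i - half)   # clip at the first site
        stop = min(K, i + half)    # clip at the last site
        bounds.append((i, start, stop))
    return bounds


windows = window_bounds(K=2000)
print(len(windows))   # number of windows b = floor(K/p) + 1
print(windows[:2])    # first two (center, start, stop) triples
```

With the default step size, the number of windows grows linearly with the alignment length, so the re-estimation cost in stage one scales with both b and the number of replicates per window.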

Local resampling and re-estimation consist of performing $τ$ iterations of non-parametric sequence resampling on a local sequence window s, yielding a collection of $τ$ resampled replicates denoted as ${\hat{s}}_{1}, {\hat{s}}_{2}, \dots, {\hat{s}}_{τ}$ . Each replicate ${\hat{s}}_{i}$ is subsequently subjected to re-estimation using the inference method $f_{θ}$ under the same evolutionary model $θ$ employed during the global sequence analysis. In the context of this study, we focus on phylogenetic tree estimation/re-estimation using unaligned sequence inputs. Since both MSAs and trees are estimated and re-estimated, an added benefit for detecting and mapping local model violations is that MSA and tree reconstruction uncertainty can offer more signal versus either re-estimated MSAs or re-estimated trees alone. Re-estimation proceeds in two phases: first, unaligned sequences within each replicate are aligned to produce a re-estimated MSA; second, a phylogenetic tree is inferred from the re-estimated MSA. This process yields a collection of re-estimated phylogenetic trees, denoted $Σ_{t} = \{t_{1}, t_{2}, \dots, t_{τ}\}$ .

The non-parametric sequence resampling method used in this study is RAWR (Wang et al., 2021). RAWR is a sequence-aware statistical resampling technique that avoids the simplifying assumption of i.i.d. input data—unlike standard bootstrap resampling and other widely used non-parametric resampling techniques. Here, we briefly recap the RAWR resampling procedure [cf. Algorithm 1 in Wang et al. (2021)], and detailed pseudocode is provided in Algorithm 2 in the Supplementary Appendix. RAWR resampling takes the form of a random walk conducted on the input MSA A, resulting in a resampled RAWR replicate: (1) to begin, a starting site and walk direction are chosen uniformly at random, (2) sites are resampled as the walk proceeds along the initial walk direction, with walk reversals occurring with certainty at the first and last site of A and with probability $γ$ elsewhere, (3) resampling concludes once the resampled replicate length equals the length of A, and (4) the resampled sequences are unaligned to obtain the replicate set of unaligned sequences. The experiments in our study utilize a RAWR reversal probability of $γ = 0$ . REVEAL also uses default settings of $p = 50$ , $w = 300$ , and $τ = 20$ .
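The four RAWR steps recapped above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the function names are hypothetical, the alignment is assumed to be a list of equal-length strings with `-` as the gap character, and the boundary rule is implemented by forcing the walk to head inward at the first and last sites.

```python
import random


def rawr_column_indices(K, gamma, rng):
    """Sample K column indices by a random walk with reversals."""
    if K < 2:
        raise ValueError("need at least two sites")
    pos = rng.randrange(K)          # (1) uniform random starting site
    step = rng.choice((-1, 1))      # (1) uniform random walk direction
    indices = []
    while len(indices) < K:         # (3) stop at the original length
        indices.append(pos)         # (2) resample the current site
        if pos == 0:
            step = 1                # at the first site the walk can only head right
        elif pos == K - 1:
            step = -1               # at the last site the walk can only head left
        elif rng.random() < gamma:
            step = -step            # random reversal elsewhere
        pos += step
    return indices


def rawr_replicate(msa, gamma, rng):
    """Resample aligned rows column-wise, then strip gaps (step 4)."""
    cols = rawr_column_indices(len(msa[0]), gamma, rng)
    return ["".join(row[c] for c in cols).replace("-", "") for row in msa]


replicate = rawr_replicate(["AC-GT", "A-CGT"], gamma=0.1, rng=random.Random(1))
print(replicate)
```

Because consecutive resampled columns are always neighbors in the original alignment, local sequence context is preserved within each replicate, which is the property that distinguishes this walk from column-wise bootstrap resampling.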

2.1.2. REVEAL stage 2: calculating local support values and their 2-D image matrix representation

The second stage of REVEAL calculates local support values to assess repeatability of local tree re-estimation. The local support values quantify phylogenetic agreement/disagreement between the local phylogenetic trees and the global phylogenetic tree $T_{G}$ inferred from A using the method $f_{θ}$ under model $θ$ .

REVEAL utilizes three different classes of local support values. The first class is the topological branch support $p \in R^{1 \times (N - 3)}$ , which quantifies the occurrence frequency of branches in the global tree $T_{G}$ within $Σ_{t}$ . For a given branch l in the internal edge set of $T_{G}$ , we define $Σ_{t | l}$ as the subset of $Σ_{t}$ containing branch l. Consequently, the element in p corresponding to branch l in $T_{G}$ is given by $\frac{| Σ_{t | l} |}{| Σ_{t} |}$ . The second and third classes concern re-estimated branch lengths (rather than re-estimated topologies). The second class is the mean re-estimated branch length $m \in R^{1 \times (2 \times N - 3)}$ , which is the average length of each branch in $T_{G}$ across the re-estimated tree set $Σ_{t}$ . The third class is the standard deviation of re-estimated branch lengths $d \in R^{1 \times (2 \times N - 3)}$ , i.e., the standard deviation of each branch length in $T_{G}$ as observed in the re-estimated tree set $Σ_{t}$ .
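The three support statistics for one window can be illustrated with the short Python sketch below. It assumes, purely for illustration, that each tree is represented as a dictionary mapping a bipartition (a `frozenset` of the taxon names on one consistently chosen side of the branch) to that branch's length. How branch lengths are summarized when a branch of $T_{G}$ is absent from a re-estimated tree is not specified in the text; this sketch averages only over the replicates that contain the branch.

```python
from statistics import fmean, pstdev


def local_support(global_tree, replicate_trees):
    """Compute per-branch support, mean length, and length std. dev.

    Each tree is a dict mapping a bipartition (frozenset of taxon names on
    one side of the branch) to that branch's length.
    """
    support, mean_len, sd_len = {}, {}, {}
    for split in global_tree:
        lengths = [t[split] for t in replicate_trees if split in t]
        # Fraction of re-estimated trees that contain this branch.
        support[split] = len(lengths) / len(replicate_trees)
        # Assumption: length statistics use only replicates with the branch.
        mean_len[split] = fmean(lengths) if lengths else 0.0
        sd_len[split] = pstdev(lengths) if lengths else 0.0
    return support, mean_len, sd_len


AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
global_tree = {AB: 0.1, ABC: 0.2}
replicates = [{AB: 0.2, ABC: 0.1}, {AB: 0.4, ABD: 0.3}]
support, mean_len, sd_len = local_support(global_tree, replicates)
print(support[AB], support[ABC])
```

In this toy example the branch `AB` appears in both replicates and `ABC` in only one, so their supports differ. Restricting the support vector to internal branches, as in the definition of p, is left to the caller.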

The local support values are calculated within each window as part of REVEAL’s sliding-window analysis. Let $P \in R^{b \times (N - 3)}$ be the matrix of topological branch support values p across all windows in sequence order; similarly, the matrices $M \in R^{b \times (2 \times N - 3)}$ and $D \in R^{b \times (2 \times N - 3)}$ contain the mean m and standard deviation d of locally re-estimated branch lengths across all windows in sequence order, respectively. The three matrices are combined into a single matrix $C = [P; M; D]$ that can be naturally visualized as a 2-D image.
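Since P, M, and D share the row dimension b, each row of C holds $(N - 3) + 2 (2 N - 3) = 5 N - 9$ values when the three matrices are joined side by side. The Python sketch below checks this arithmetic; the side-by-side layout, the function name, and the placeholder values are assumptions made for illustration.

```python
def assemble_support_matrix(P, M, D):
    """Join per-window rows of P, M, and D into one row of C per window."""
    if not (len(P) == len(M) == len(D)):
        raise ValueError("P, M, and D must have one row per window")
    return [list(p) + list(m) + list(d) for p, m, d in zip(P, M, D)]


N, b = 10, 5                                   # taxa count; hypothetical window count
P = [[1.0] * (N - 3) for _ in range(b)]        # placeholder topological supports
M = [[0.1] * (2 * N - 3) for _ in range(b)]    # placeholder mean branch lengths
D = [[0.0] * (2 * N - 3) for _ in range(b)]    # placeholder branch length std. devs.
C = assemble_support_matrix(P, M, D)
print(len(C), len(C[0]))                       # b rows, 5N - 9 columns
```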

2.1.3. REVEAL stage 3: mapping higher-level regions with local model variation in the lower-level 2-D support value matrix

The final step of the REVEAL framework applies clustering analysis to the 2-D image representation of the concatenated local support value matrix C. As a preprocessing step before clustering, values in the matrix C are normalized to the unit interval.
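One way to carry out the normalization step is min-max scaling, sketched below in Python. The text does not say whether scaling is applied per column or over the whole matrix; this example assumes per-column scaling so that support frequencies and branch lengths, which live on different scales, each span the unit interval. The function name is hypothetical.

```python
def normalize_columns(C):
    """Min-max scale each column of C to [0, 1]."""
    scaled_cols = []
    for col in zip(*C):
        lo, hi = min(col), max(col)
        span = hi - lo
        # Constant columns carry no contrast; map them to 0.0.
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


print(normalize_columns([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]))
```

A global variant would instead compute a single minimum and maximum over all entries of C; which variant matches the reported experiments is not determined by the description above.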

The goal is to classify each site in the input MSA, where each class corresponds to one of z different site models. In our study, we focus on $z = 2$ classes where one class corresponds to the global model $θ$ , and the other class corresponds to a local model that can cause model mis-specification if not accounted for. REVEAL uses the K-means algorithm (Hartigan and Wong, 1979) to perform unsupervised clustering on the 2-D matrix C. The output is an assignment of each window to one of z clusters, where each cluster represents a distinct site model class. The site at the center of each window is assigned the window’s cluster, and cluster assignments for all other sites are based on nearest neighbor interpolation.
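The clustering and interpolation steps can be sketched in Python as follows. This is a simplified stand-in under stated assumptions: it uses a plain Lloyd-style K-means iteration with deterministic farthest-point initialization rather than the Hartigan and Wong (1979) procedure, and the function names and tie-breaking rule (ties go to the earlier window) are choices made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, z=2, max_iter=100):
    """Assign each row (window) to one of z clusters."""
    # Assumption: deterministic farthest-point initialization.
    centroids = [list(rows[0])]
    while len(centroids) < z:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(z), key=lambda k: sq_dist(r, centroids[k]))
                      for r in rows]
        if new_labels == labels:
            break                   # assignments are stable
        labels = new_labels
        for k in range(z):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def labels_to_sites(window_labels, centers, K):
    """Give every site the label of the window whose center is nearest."""
    site_labels = []
    for site in range(K):
        nearest = min(range(len(centers)), key=lambda j: abs(centers[j] - site))
        site_labels.append(window_labels[nearest])
    return site_labels


rows = [[0.9, 0.1]] * 3 + [[0.1, 0.8]] * 2 + [[0.9, 0.1]] * 2
labels = kmeans(rows)
sites = labels_to_sites(labels, centers=list(range(0, 301, 50)), K=300)
print(labels)
```

Cluster indices are arbitrary, so deciding which cluster represents the background class requires a separate rule, for example treating the larger cluster as background; that rule is not part of this sketch.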

2.2. Simulation study

2.2.1. Simulation conditions

To evaluate the performance of the REVEAL framework, we consider a variety of evolutionary processes that can cause local model violations during biomolecular sequence analysis and utilize model-based simulations for performance benchmarking purposes. In our simulations, sequences evolve under a mixture model consisting of a background model $θ_{B}$ and one or more variable region models. The background model consists of the traditional multi-species coalescent (MSC) model (Hein et al., 2004). In contrast, the variable region models evolved under evolutionary processes that are not captured by the $θ_{B}$ model and induce local model violations during $θ_{B}$ model-based sequence analysis. We investigate four distinct types of variable region models—each representing a different evolutionary process—to comprehensively evaluate the performance of the REVEAL framework.

The first type of variable region model builds upon the traditional MSC model but exhibits greater evolutionary divergence than the background model. We denote this variable-divergence variable region model as $θ_{H}$ . Such local model violations can arise from various causes such as natural selection and mutation hotspots.

The second type of variable region model is the multi-species coalescent with recombination (MSCwR) model, which extends the traditional multi-species coalescent (MSC) model by incorporating recombination events, and we denote this model as model $θ_{R}$ . This model accounts for local model violations resulting from variations in recombination rates across the genome, a common phenomenon observed in many species. Some genomic regions experience high recombination rates (recombination hotspots), while others have low or negligible recombination rates (recombination cold spots).

The third type of variable region model follows the multispecies network coalescent with recombination (MNSCwR) model, which incorporates both recombination and reticulation events. The reticulations capture non-tree-like evolutionary processes such as introgression, hybridization, and horizontal gene transfer. These processes often create discordant patterns from the background evolutionary model, resulting in local model violations. We denote this variable region model as model $θ_{I}$ .

The last type of variable region model is the natural selection model, denoted as model $θ_{S}$ . Natural selection is a well-studied driver of local model violations, particularly when it exerts differential selective pressure on different genomic regions. For instance, positive selection accelerates the fixation of advantageous mutations, creating divergence from the background evolutionary model. Conversely, purifying selection eliminates deleterious mutations, which can also contribute to reduced variation in certain regions. These selective processes result in genomic regions that deviate significantly from the expected neutral evolution, often leading to local model violations, which the $θ_{S}$ model is designed to capture.

In this study, we construct the mixture model comprising one background model and either one or two variable region models. When a single variable region model is included, we examine four distinct types, each under a separate simulation model condition named after the corresponding variable region model: model conditions H, R, I, and S correspond to mixture models where the single variable region model consists of $θ_{H}$ , $θ_{R}$ , $θ_{I}$ , and $θ_{S}$ , respectively. Mixture models with two variable region models are designated by model condition M, where a pair of variable region models is randomly selected from the set of $θ_{H}$ , $θ_{R}$ , and $θ_{I}$ models.

2.2.2. Simulation procedures

Loci and sites evolving in the background region are simulated under the MSC model. First, random birth-death model trees with a tree height of 1.0 coalescent units and $N = 10$ taxa were sampled using r8s version 1.7 (Sanderson, 2003). Then, local coalescent histories and gene trees were sampled under the MSC model using ms (Hudson, 2002).

To simulate local coalescent histories and gene trees under the variable region model $θ_{H}$ , we follow the same procedures as the background model, except for the final local tree height h. The value of h is progressively increased to define the model conditions H.1, H.2, and H.3, each representing a higher level of local divergence.

To simulate local coalescent histories and gene trees under the variable region model $θ_{R}$ , the same procedures are used as in the background model with one change: the MSCwR model is enabled by specifying the -r switch in ms. The finite-sites model of recombination is parameterized by a recombination rate r. To assess the impact of different recombination rates, we define the model conditions R.1, R.2, and R.3, each characterized by an increasing recombination rate r.

For the variable region model $θ_{I}$ , we begin by constructing a phylogenetic network. First, a random birth-death model tree with a height of 1.0 coalescent unit is sampled using r8s. Then, a single reticulation event is added by selecting a time $t_{M}$ uniformly at random from the interval $(0, 1 / 4)$ , following the method outlined in Wuyun et al. (2019). Next, msmove (Garrigan and Geneva, 2014) is used to simulate local coalescent histories and gene trees under the MNSCwR model. To explore varying levels of introgression or gene flow, we construct the model conditions I.1, I.2, and I.3 by increasing the admixture probability $β$ .

The variable region model $θ_{S}$ is simulated using SFS_CODE (Hernandez, 2008), a simulation software package designed for modeling sequence evolution in populations under selection. The selection type was set to positive, indicating that mutations are beneficial with a probability of 1.0. To assess varying levels of selection intensity, we construct the model conditions S.1, S.2, and S.3, which are quantified by increasing selection coefficients $γ$ and therefore progressively stronger selection pressure.

For the background model $θ_{B}$ and all variable region models other than $θ_{S}$ , local gene trees were deviated from ultrametricity using the method of Liu et al. (2009) with a deviation factor of $c = 2.0$ . Sequences were then simulated under a finite-sites nucleotide substitution and insertions/deletions (indels) model along these trees using INDELible v1.03 (Fletcher and Yang, 2009), with branch lengths converted following equation 3.1 from Hein et al. (2004). Substitutions are modeled under the GTR model using parameter values from Gao and Liu (2021), while sequence insertion and deletion events are modeled with the medium gap–length distribution from Liu et al. (2012) with an indel rate of 0.02. For the variable region model $θ_{S}$ , sequence evolution was simulated using SFS_CODE (Hernandez, 2008), applying the same substitution model and indel rate but with a different gap–length distribution. Full simulation commands are provided in the Supplementary Data.

Sequence regions evolving under variable region models are denoted V, and those evolving under the background model are denoted as B. In our simulation procedure, regions V are randomly positioned within simulated sequences. For mixture models with a single variable region model, the total root sequence length L is set to 2000; for those with two variable region models, L is set to 4000. For all simulation conditions other than H.3, the root sequence length of a variable region V, denoted by $L_{v}$ , follows a Gaussian distribution with a mean of 500 and a standard deviation of 100. To help maintain comparable background-to-variable sequence length, the simulation condition H.3 utilizes root sequence length that is Gaussian distributed with a mean of 300 and a standard deviation of 50. Additionally, the position of the variable region is randomly selected within the range of 100 to $L - L_{v} - 100$ .
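The placement rule for a single variable region can be sketched in Python as shown below. The rounding of $L_{v}$ to an integer, the rejection of draws that would not fit between the margins, and the function name are assumptions made for this example; the actual simulation commands are those in the Supplementary Data.

```python
import random


def place_variable_region(L=2000, mean=500.0, sd=100.0, margin=100, rng=None):
    """Sample a (start, length) pair for one variable region V."""
    rng = rng or random.Random()
    while True:
        Lv = round(rng.gauss(mean, sd))      # Gaussian-distributed length
        if 0 < Lv <= L - 2 * margin:         # assumption: redraw if V cannot fit
            break
    # Start position is uniform on [margin, L - Lv - margin], both inclusive.
    start = rng.randint(margin, L - Lv - margin)
    return start, Lv


start, Lv = place_variable_region(rng=random.Random(7))
print(start, Lv)
```

For condition H.3, the same function would be called with `mean=300.0` and `sd=50.0`, matching the shorter variable region described above.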

For each model condition, the simulation and experimental procedures are repeated to obtain 20 independent replicates. Model condition parameters and summary statistics for the simulated datasets are presented in Table 1. Model parameters for the background model $θ_{B}$ are set to $h = 1.0$ , $r = 0$ , $β = 0$ , and $γ = 0$ . Only parameters that differ from the background model are reported in Table 1.

Table 1.
Model Condition Parameters and Summary Statistics for Simulated Datasets

Model Cond.	Parameter settings	Global MSA			MSA in region V
Model Cond.	Parameter settings	Len.	ANHD	Gap.	Len.	ANHD	Gap.
H.1	$h_{1} = 2.0$	2938.6	0.521	0.314	828.8	0.622	0.416
H.2	$h_{2} = 4.0$	3281.0	0.537	0.387	1194.3	0.696	0.599
H.3	$h_{3} = 8.0$	3428.6	0.526	0.409	927.3	0.731	0.748
R.1	$r_{1} = 0.01$	2818.6	0.518	0.286	702.4	0.510	0.288
R.2	$r_{2} = 0.05$	2826.8	0.494	0.284	794.1	0.503	0.333
R.3	$r_{3} = 0.1$	2939.5	0.510	0.307	818.5	0.500	0.347
I.1	$β_{1} = 0.4$	2797.4	0.504	0.280	743.2	0.510	0.279
I.2	$β_{2} = 0.5$	2804.7	0.507	0.284	757.2	0.500	0.269
I.3	$β_{3} = 0.6$	2835.3	0.514	0.291	778.2	0.498	0.279
S.1	$γ_{1} = 20$	3151.3	0.405	0.357	875.5	0.510	0.446
S.2	$γ_{2} = 50$	3220.4	0.409	0.367	977.0	0.528	0.465
S.3	$γ_{3} = 100$	3227.8	0.397	0.369	913.4	0.535	0.490
M.1	$m_{1}, m_{2} \in \{h_{1}, r_{1}, β_{1}\}$	5677.8	0.515	0.293	1517.2	0.540	0.337
M.2	$m_{1}, m_{2} \in \{h_{2}, r_{2}, β_{2}\}$	6132.4	0.522	0.336	1920.2	0.576	0.455
M.3	$m_{1}, m_{2} \in \{h_{3}, r_{3}, β_{3}\}$	6739.3	0.524	0.385	2584.0	0.556	0.510

All model conditions used a fixed number of taxa ( $N = 10$ ).

Each $θ_{H}$ -based model condition is named H.1 through H.3, reflecting a generally increasing order of evolutionary divergence and local model violation intensity, as explained in the text. The other model conditions are named similarly. The average normalized Hamming distance (“ANHD”), gappiness (“Gap”), and multiple sequence alignment (MSA) length (“Len”) are reported for both the entire MSA and the local variable regions under each model condition. All results presented are averages calculated across 20 experimental replicates per model condition.
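For readers who want to compute comparable summary statistics on their own alignments, the Python sketch below gives one plausible reading of gappiness and ANHD. The exact definitions are not spelled out here, so the choices in this example are assumptions: gappiness is taken as the fraction of gap cells in the MSA, and ANHD as the mean over all sequence pairs of the proportion of differing characters among columns where neither sequence has a gap.

```python
from itertools import combinations


def gappiness(msa, gap="-"):
    """Fraction of alignment cells that are gap characters."""
    cells = sum(len(row) for row in msa)
    return sum(row.count(gap) for row in msa) / cells


def anhd(msa, gap="-"):
    """Average normalized Hamming distance over all sequence pairs."""
    dists = []
    for a, b in combinations(msa, 2):
        # Assumption: compare only columns where both sequences are ungapped.
        pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
        if pairs:
            dists.append(sum(x != y for x, y in pairs) / len(pairs))
    return sum(dists) / len(dists) if dists else 0.0


msa = ["AC-T", "AG-T", "ACGT"]
print(gappiness(msa), anhd(msa))
```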

To further evaluate the robustness of the REVEAL framework, we also investigated the impact of additional experimental factors: (i) root sequence length L, (ii) the number of taxa N, and (iii) the gap–length distribution of the sequence insertion/deletion model. All three are known to play a role in the statistical and computational difficulty of phylogenetic reconstruction and analysis (Liu et al., 2012; Warnow, 2012; Mirarab et al., 2014). In each experimental setup, all other parameters were set to their default values, allowing us to isolate the effects of individual variables and assess the specific impact of the variable of interest on detection performance. Table 2 presents summary statistics for the simulated datasets with increased numbers of taxa ( $N = 20$ and $N = 50$ ). Summary statistics for simulations using alternative root sequence lengths and gap–length distributions are provided in Supplementary Tables S1 and S2.

Table 2.

Summary Statistics for Additional Simulations with Varying Numbers of Taxa

Model Cond.	N = 20			N = 50
Model Cond.	Len.	ANHD	Gap.	Len.	ANHD	Gap.
H.1	3571.3	0.532	0.438	4718.1	0.521	0.564
H.2	3945.9	0.524	0.485	5731.3	0.53	0.639
H.3	4614.3	0.535	0.562	7032.9	0.538	0.709
R.1	3288.6	0.504	0.391	4609.1	0.50	0.556
R.2	3475.0	0.501	0.413	4698.0	0.499	0.567
R.3	3521.0	0.498	0.419	4830.7	0.502	0.574
I.1	3298.1	0.502	0.387	4476.8	0.510	0.547
I.2	3344.9	0.504	0.395	4372.4	0.498	0.534
I.3	3340.1	0.520	0.400	4491.1	0.501	0.547
S.1	3612.5	0.406	0.449	4925.9	0.368	0.584
S.2	3732.8	0.400	0.459	5000.2	0.376	0.595
S.3	3799.9	0.410	0.465	5062.3	0.373	0.599

The additional simulations included either 20 or 50 taxa (rather than 10 taxa as in the rest of the simulation study). The following summary statistics for simulated MSAs are reported as an average across 20 experimental replicates per model condition: average normalized Hamming distance (“ANHD”), gappiness (“Gap”), and MSA length (“Len”). Table layout and description are otherwise similar to Table 1.

2.2.3. Performance evaluation criteria

REVEAL’s performance was assessed in terms of both type I and type II errors using precision and recall. To construct the confusion matrix for calculating precision and recall, we compared REVEAL’s site class prediction against the ground truth. The confusion matrix consists of four elements. True positives (TP) consist of sites within true variable regions V that are correctly identified. False positives (FP) consist of sites within true background regions B that are incorrectly classified as part of V. True negatives (TN) consist of sites within true background regions B that are correctly identified. False negatives (FN) consist of sites within the true variable regions V that are mistakenly classified as part of B. Precision is then calculated as $\frac{T P}{T P + F P}$ , and recall is defined as $\frac{T P}{T P + F N}$ .
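The site-level precision and recall calculation can be sketched in Python as follows. The label encoding (`"V"` for variable and `"B"` for background sites), the function name, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(predicted, truth, positive="V"):
    """Compute site-level precision and recall for the variable class."""
    tp = sum(p == positive and t == positive for p, t in zip(predicted, truth))
    fp = sum(p == positive and t != positive for p, t in zip(predicted, truth))
    fn = sum(p != positive and t == positive for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0   # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0      # TP / (TP + FN)
    return precision, recall


truth = "BBVVVVBB"       # ground-truth site classes
predicted = "BVVVVBBB"   # predicted region shifted one site to the left
print(precision_recall(predicted, truth))
```

In this toy case the predicted region has the correct length but is offset by one site, which costs one false positive and one false negative, so both metrics drop by the same amount.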

Computational runtime and peak memory usage were also reported for REVEAL. All experiments were conducted on the MSU Institute for Cyber-Enabled Research High-Performance Computing Center (HPCC). We utilized HPCC computing nodes equipped with Intel Xeon Gold 6148 CPUs running at 2.40 GHz and featuring between 5 and 10 GiB of memory.

2.3. Empirical study

We also performed REVEAL analyses of two empirical datasets. The first dataset consists of genomic sequence data for wild-derived strains of house mouse (Mus musculus) and M. spretus. The clade is an emerging model of adaptive interspecific introgression, and genomic maps of adaptive introgression have been reported in past studies (Liu et al., 2015). We downloaded whole genome sequences and genome-wide SNP data for classical and wild-derived mouse strains from the Mouse Genomes Project, where the house mouse genome version GRCm39 served as the reference genome. We used bcftools to filter out non-biallelic variants and retain only those with a missing genotype call rate of less than 10%. Haplotype phasing was then performed using SHAPEIT (Browning et al., 2021). From the SHAPEIT output, we extracted SNP haplotype alignments along with corresponding genomic coordinate information. We focused exclusively on wild-derived inbred strains, as they better reflect natural genomic variation compared to classical mouse strains. The dataset includes wild-derived strains from five mouse species and/or subspecies: M. spretus, M. musculus musculus, M. musculus castaneus, M. musculus molossinus, and M. musculus domesticus. Summary statistics for the mouse dataset are displayed in Table 3. Additional sample metadata is listed in Supplementary Table S3.

Table 3.
Summary Statistics for the Mouse Dataset

Chromosome	MSA len	ANHD
1	5708738	0.330
2	4903298	0.330
3	4726156	0.329
4	4385242	0.331
5	4350420	0.330
6	4409028	0.332
7	3960179	0.332
8	3741240	0.330
9	3476435	0.332
10	3840392	0.334
11	3367825	0.331
12	3392005	0.330
13	3424971	0.330
14	3427412	0.330
15	3052631	0.331
16	2854028	0.332
17	2782971	0.331
18	2585271	0.331
19	1723903	0.337
X	3241441	0.316

MSA length (“MSA len”) and average normalized Hamming distance (“ANHD”) are reported for each chromosome.
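For readers unfamiliar with the ANHD summary statistic, the following sketch computes an average normalized Hamming distance for a toy alignment. It assumes that the normalized Hamming distance between two aligned sequences is the proportion of alignment columns at which they differ, and that ANHD is the mean of this quantity over all sequence pairs; how gaps or missing characters are treated in the reported values is not specified here, so the sketch simply compares characters as-is. Function names and the toy alignment are illustrative.

```python
from itertools import combinations
from typing import Sequence


def normalized_hamming(a: str, b: str) -> float:
    """Proportion of alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def anhd(alignment: Sequence[str]) -> float:
    """Mean normalized Hamming distance over all unordered sequence pairs."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(normalized_hamming(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical): pairwise distances are 0.25, 0.5, and 0.25.
toy_alignment = ["ACGT", "ACGA", "TCGA"]
toy_anhd = anhd(toy_alignment)
```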

The second dataset is the whole genome sequence (WGS) alignment from Fontaine et al. (2015)’s study of adaptive introgression in mosquitoes. It includes six members of the Anopheles gambiae species complex (AGC) that were sequenced at high depth (Anopheles gambiae, A. coluzzii, A. arabiensis, A. quadriannulatus, A. merus, and A. melas), as well as a reference genome for A. gambiae PEST. We further filtered the WGS alignment to remove columns with missing data, gaps, and fixed sites. Summary statistics for the resulting genomic SNP alignments are presented in Table 4.

Table 4.

Summary Statistics for the Mosquito Dataset

Chromosome	MSA length	ANHD
2.L	9919726	0.457
2.R	14042557	0.469
3.L	8433270	0.483
3.R	11719222	0.470
X	1258222	0.456

Table layout and description are otherwise identical to Table 3.
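The column filtering applied to the mosquito WGS alignment (removing columns with missing data, gaps, or fixed sites) can be sketched as follows. The character codes used for missing data ('N', '?') and gaps ('-'), as well as the function name, are assumptions for illustration; the actual filtering pipeline may differ.

```python
from typing import List, Sequence, Tuple


def filter_columns(
    alignment: Sequence[str], missing: str = "N?", gap: str = "-"
) -> Tuple[List[str], List[int]]:
    """Keep columns with no missing data, no gaps, and at least two states.

    Returns the filtered sequences and the 0-based indices of kept columns.
    The missing-data and gap symbols are assumptions of this sketch.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all sequences must have the same length")
    excluded = set(missing) | set(gap)
    kept = []
    for j in range(n_cols):
        column = [seq[j].upper() for seq in alignment]
        if any(c in excluded for c in column):
            continue  # column contains missing data or a gap
        if len(set(column)) < 2:
            continue  # fixed (invariant) site
        kept.append(j)
    filtered = ["".join(seq[j] for j in kept) for seq in alignment]
    return filtered, kept


# Toy alignment (hypothetical): only columns 0 and 5 survive filtering.
toy = ["ACG-TA", "ACGTNA", "TCGTTC"]
filtered, kept = filter_columns(toy)
```

Returning the kept column indices alongside the filtered sequences makes it possible to map SNP alignment positions back to genomic coordinates.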

3. RESULTS

3.1. Simulation study

3.1.1. REVEAL’s performance on simulation conditions with different types of local model violations

Figure 2 shows precision and recall of REVEAL’s site-level classification across all model conditions with a single variable region model, using default settings for REVEAL’s method parameters. Each model condition includes a local region that evolved under an evolutionary process that is un-modeled during estimation/re-estimation (i.e., a local model mis-specification): either variable gene tree height (or elevated local mutation rate, equivalently), genetic recombination, non-tree-like evolution in the form of introgression, or natural selection. REVEAL’s overall performance is high across the model conditions, with some minor variability. These results highlight the robustness and effectiveness of the framework under various evolutionary processes that can cause local model violations. Precision consistently ranges between 0.9 and 1.0, with recall maintaining values above 0.9. Across all model conditions, neither precision nor recall exhibits a strong or consistent monotonic trend as each variable region’s model parameter value increases. Both remain relatively stable, with only mild parameter-dependent fluctuations. In particular, under model conditions H and S, we observe small changes in precision and recall as parameters h and $γ$ vary, respectively. However, these variations are modest and do not indicate a clear upward or downward trend.

FIG. 2.

Estimation performance of REVEAL across model conditions with a single variable region model. REVEAL was run with default settings for its method parameters. Mean and standard error of precision and recall for REVEAL estimation are shown for each model condition across 20 experimental replicates per model condition.
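The per-condition summaries reported in the figures (mean and standard error across replicates) can be computed as in the sketch below. It assumes the standard error is the sample standard deviation divided by the square root of the number of replicates; the input values are placeholders chosen for illustration and are not results from this study.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_se(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and standard error of the mean.

    The standard error is computed as the sample standard deviation
    (n - 1 denominator) divided by sqrt(n); this is an assumed convention.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se


# Placeholder per-replicate precision values (not study data).
placeholder_precision = [0.90, 0.92, 0.94, 0.96]
mean, se = mean_and_se(placeholder_precision)
```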

We also present precision and recall results of the REVEAL framework for model conditions with two-variable region models in Figure 3. Precision remains consistently high across all model conditions, exceeding 0.9, and shows a slight increase with greater evolutionary divergence within the variable region models. While recall drops below 0.8 for model condition M.1, it improves as evolutionary divergence increases, reaching values above 0.85 for model conditions M.2 and M.3. This trend highlights REVEAL’s ability to recover detection performance as sequence variability grows. Despite the added complexity of detecting and mapping two variable region models, REVEAL continues to demonstrate reliable detection performance, maintaining high precision and recall.

FIG. 3.

Type I and type II error of REVEAL on model conditions with two-variable region models. REVEAL was run using default settings for its method parameters. Mean and standard error of REVEAL’s precision and recall are reported across 20 experimental replicates for each model condition. Whereas each simulation experiment with a single variable region in Figure 2 varies a single experimental factor (i.e., a simulation model parameter) and explores its impact on REVEAL’s performance ceteris paribus, the two-variable region simulations are inherently heterogeneous and preclude a similar comparison. This difference is emphasized by presenting results for the M.1/M.2/M.3 model conditions along the horizontal axis in a discrete manner without any line connections.

3.1.2. Additional experiments on REVEAL method parameters and algorithmic design

In Figure 4, we compare the estimation performance of REVEAL across different values of $τ$ (i.e., the number of local resampling and re-estimation iterations for local sequences). As the value of $τ$ increases, the estimation performance is stable or improves slightly across the different model conditions. To evaluate the computational demands of the REVEAL framework, we systematically profiled runtime and memory usage across varying model conditions and different settings for the number of resampled local replicates $τ$. As shown in Figure 5, increasing the value of $τ$ from 10 to 50 leads to a corresponding increase in runtime across all model conditions. The growth in runtime is approximately linear, with the highest cost observed at $τ = 50$, approaching 0.8 hours. This trend reflects the expected trade-off between estimation robustness and computational cost. However, the performance improvements beyond $τ = 20$ are relatively minor, indicating that setting $τ = 20$ provides a practical trade-off between estimation accuracy and computational efficiency. In contrast, memory usage remains relatively stable at all values of $τ$, fluctuating narrowly around 80 MB regardless of the model condition.

FIG. 4.

Estimation performance of REVEAL across model condition types with a single variable region, under varying values of $τ$. All parameters for REVEAL use default values except for $τ$. The mean and standard error of precision and recall are reported across 20 experimental replicates and averaged over model conditions within the same model condition type.

FIG. 5.

Runtime and memory usage of REVEAL across model conditions with a single variable region model, under varying values of $τ$ (the number of local resampling iterations). All parameters for REVEAL use default values except for $τ$. For each model condition type, mean and standard error of both runtime and peak memory usage are reported across all model conditions of that type and 20 experimental replicates per model condition.

In the Supplementary Data, we also investigate how the performance of the REVEAL framework is influenced by two additional parameters: the step size p and the window length w. We further performed ablation experiments to assess the influence of an alternative option for the local support value matrix on the performance of the REVEAL framework.

3.1.3. Additional sensitivity experiments on other experimental factors

REVEAL’s precision and recall are shown on simulation conditions with root sequence lengths of 2000 and 4000 in Figure 6. In the latter case, the mean length of the local variable region $L_{v}$ represents only one-eighth of L, compared to one-fourth in the default setting. REVEAL exhibits stable precision and recall values irrespective of root sequence length, with the following exception. Under model conditions I and S, we observe a slight decline in precision (approximately 3–4%) when L increases from 2000 to 4000. This reduction is likely attributable to the relative shrinkage of the local variable region V, which potentially limits detectable signal and impairs detection performance.

FIG. 6.

Estimation performance of REVEAL across model conditions with a single variable region, under varying root sequence lengths (L). The layout and axis settings are identical to those in Figure 2.

In Figure 7, we evaluate the impact of varying the number of taxa ($N = 10$, 20, and 50) on the detection performance of REVEAL. We observe that, as the number of taxa increases, the detection performance of REVEAL remains consistently strong or even improves. Specifically, when the number of taxa reaches $N = 50$, we observe a notable improvement in precision (up to approximately 4%) under model conditions R, I, and S. Additionally, REVEAL’s recall improves by 5% to 9% on model conditions H and S with $γ = 50$. These performance improvements likely result from increased evolutionary information and enhanced statistical power provided by richer taxon sampling. Together, these results demonstrate that although both longer sequences and more taxa could theoretically complicate detection due to signal shrinkage in the local variable region or greater phylogenetic complexity, REVEAL’s performance remains robust. Neither doubling the root sequence length to $L = 4000$ nor increasing the number of taxa to $N = 50$ produces a detrimental effect on performance.

FIG. 7.

Estimation performance of REVEAL across model conditions with a single variable region, under varying numbers of taxa (N). The layout and axis settings are identical to those in Figure 2.

In Figure 8, we evaluate the impact of increasing the number of taxa (N) on the computational requirements of the REVEAL framework. The observed trend suggests that runtime increases approximately exponentially with the number of taxa. Specifically, runtime increases from under 0.5 hours at $N = 10$ to over 5 hours at $N = 50$ for model conditions R and I. For model conditions S and H, runtime exceeds 8 hours, which is also affected by their longer MSA lengths, as shown in Table 2. Memory usage also shows an increasing pattern with N, rising from approximately 75 MB at $N = 10$ to over 110 MB at $N = 50$. We note that the observed peak memory usage is modestly sized and well within the capabilities of modern commercial computing hardware. These findings indicate that the number of taxa has a more pronounced impact on computational demands than $τ$. This is attributable not only to the increase in dataset size but also to the greater evolutionary complexity introduced by a larger taxon set, resulting in longer and more gapped MSAs that pose a greater challenge to phylogenetic reconstruction.

FIG. 8.

Runtime and memory usage of REVEAL across model conditions with a single variable region model and varying numbers of taxa N. All REVEAL parameters were set to default values. For each model condition type, mean and standard error of runtime and peak memory usage are reported across all model conditions of that type and 20 experimental replicates per model condition.

In our default simulation setup, evolutionary sequences were generated with indel events, using a medium gap–length distribution as described in Liu et al. (2012), and an indel rate of 0.02. In Figure 9, we examine how variations in gap–length distributions for indel events influence the detection performance of the REVEAL framework. Specifically, we compare detection performance on datasets simulated under the short, medium, and long gap–length distributions as characterized by Liu et al. (2012). Since model condition S did not use INDELible to simulate sequences with customized indel events, in contrast to model conditions H, R, and I, our comparison is limited to the latter three. Overall, our results indicate that REVEAL achieves robust precision and recall across all three examined gap–length indel scenarios, with only slight performance variations. Notably, for model conditions R and I, datasets simulated with the long gap–length distributions show a modest improvement in precision, ranging from 2% to 4%.

FIG. 9.

Estimation performance of REVEAL across model conditions with a single variable region, for different gap–length distributions. The layout and axis settings are identical to those in Figure 2.

In contrast, recall values are stable and remain relatively unaffected by changes in gap–length distribution across all model conditions. The robustness in recall, coupled with the incremental improvement in precision under longer gap distributions, highlights REVEAL’s consistent performance in local variable region detection and genomic sequence analysis.

3.2. Empirical study

Figure 10 presents the results of REVEAL’s analysis of the mouse dataset, showing the local support value matrix C produced by the analysis. As a point of comparison, the figure also includes a visualization of introgression patterns identified by PhyloNet-HMM [adapted from Liu et al. (2015)], a statistical method specifically designed for mapping introgression patterns in genomes. PhyloNet-HMM performs inference and learning under a bespoke model that combines a multi-species coalescent model, a finite-sites substitution model, and a hidden Markov model. In REVEAL’s local support value matrix C, we observe a marked decrease in local support values in and around the PhyloNet-HMM-inferred introgression regions, as compared to the neighboring regions. We present the complete local support value matrix C for all mouse chromosomes in the Supplementary Data.

FIG. 10.

A comparison of introgression patterns identified by the PhyloNet-HMM method in Liu et al. (2015) and REVEAL’s local support value matrix C. (a) and (c) illustrate the introgression regions identified across 20 M. m. domesticus samples using PhyloNet-HMM [reproduced from Figure 4 of Liu et al. (2015)]. Red squares along the x-axis indicate the locations of genes within these introgression regions. (b) and (d) show the local support value matrix C from REVEAL’s analysis, focusing on the regions where the PhyloNet-HMM model detected introgression.

Figure 11 shows the local support value matrix C generated by REVEAL for chromosome 2L of the mosquito dataset, alongside the 2La inversion region (approximately 20–41 Mb) as identified by Fontaine et al. (2015). Notably, the matrix C displays visibly lower support values across the approximately 23–49 Mb interval, which overlaps with the annotated 2La inversion region. This region has been shown to exhibit introgression patterns that diverge from genome-wide expectations during interspecific gene flow (White et al., 2007; Cheng et al., 2012). For example, the capacity of An. arabiensis to tolerate desiccating environments has been attributed to the introgression of the 2La inversion region from An. gambiae and An. coluzzii (Fontaine et al., 2015). The complete set of local support value matrices C for all mosquito chromosomes is provided in the Supplementary Data.

FIG. 11.

REVEAL analysis of chromosome 2L in the mosquito dataset. Panel (a) is reproduced from Figure 4 of Fontaine et al. (2015) and shows the results of a sliding-window D-statistic analysis along chromosome 2L. Panel (b) presents the local support value matrix C derived from REVEAL’s analysis of chromosome 2L.

4. DISCUSSION

A key advantage of the REVEAL framework is that it operates without imposing any assumptions or restrictions on the evolutionary processes that cause model violations in genomic sequences. Unlike traditional approaches that are designed to detect one specific evolutionary process, it can detect a diverse range of local model violations without being constrained by predefined evolutionary models. Furthermore, it does not depend on supervised learning techniques and does not require manually labeled data, making it a more flexible and scalable solution for local model violation detection.

The simulation study validates the robustness and effectiveness of the REVEAL framework across a range of simulation and experimental conditions. REVEAL successfully detected local model violations caused by divergence heterogeneity, recombination, introgression, and natural selection. Even in more complex cases with multiple locally variable regions, REVEAL’s precision remained consistently above 0.9, although recall experienced a slight decline. Notably, as the intensity of local model violations increases—whether through more extreme deviations in local evolutionary parameters or through longer local model violation regions—both detection accuracy and site-level mapping performance improve. Moreover, REVEAL demonstrates enhanced performance when the input MSA contains a greater number of taxa or exhibits increased gappiness, likely due to the richer phylogenetic signal and alignment structure that these conditions provide. Collectively, these findings underscore REVEAL’s robustness and adaptability across a wide range of evolutionary scenarios, highlighting its ability to generalize effectively without relying on prior knowledge of the underlying local model violations.

We acknowledge that a fully parametric model-based analysis is expected to outperform a model-agnostic method like REVEAL under specific assumptions: (1) the correct model or a very accurate model is available a priori, and (2) only one or a few closely related sequence analysis tasks are under study. However, these assumptions are strong and constraining. If the assumed local model(s) are incorrect or inadequate, then model mis-specification can impair detection and mapping of local model variation. If different tasks are performed on a dataset in a study, then purpose-built models and model-based methods must be developed and applied for each task.

Rather, REVEAL points to a different and practical alternative. Instead of requiring strong a priori modeling assumptions, REVEAL provides an automated, data-driven, global-model-agnostic, and local-model-free approach to detect and pinpoint local model variation as part of a biomolecular sequence analysis. The results of a REVEAL analysis can then (1) inform practitioners that the original global analysis is insufficient and requires modeling improvement and (2) precisely map regions in the dataset that require locally variable models, resulting in improved model-based sequence analysis.

5. CONCLUSIONS

In this study, we introduce REVEAL, a general-purpose framework for detecting and mapping local model violations during biomolecular sequence analysis. Unlike existing approaches, REVEAL is readily adapted to different biomolecular sequence analysis tasks and requires no additional modeling assumptions beyond those required for traditional global sequence analysis. Our simulation experiments validate the robustness and effectiveness of REVEAL across various evolutionary scenarios and causes of local model violations. Furthermore, we apply REVEAL to two empirical datasets and identify widespread local model violation regions across chromosomes. These results corroborate findings in past empirical studies that were inferred using parametric model-based algorithms for narrowly specialized inference and learning tasks.

While this study provides an initial proof of concept, future research promises to unlock further algorithmic advances. For example, we anticipate that REVEAL’s algorithmic formulation will generalize to other biomolecular sequence analysis tasks such as genome rearrangement mapping and structural variant analysis. Additional future experimentation is needed in this regard.

AUTHORS’ CONTRIBUTIONS

M.G.: Methodology, software, validation, formal analysis, investigation, data curation, writing—original draft, writing—review and editing, visualization. K.J.L.: Conceptualization, supervision, project administration, funding acquisition, methodology, investigation, writing—original draft, writing—review and editing.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive feedback. This publication is an extended version of an article that appeared in the proceedings of the 22nd RECOMB-CG conference. Computational experiments and analyses were performed on the MSU High Performance Computing Center.

AUTHOR DISCLOSURE STATEMENT

The authors declare no potential conflicts of interest with respect to the research, authorship, and publication of this article.

FUNDING INFORMATION

This work has been supported by the NSF (DBI-2144121, DBI-2214038, and CCF-1714417 to K.J.L.).

References

1. Adrion, Galloway, Kern. Predicting the landscape of recombination using deep learning. Mol Biol Evol, 2020; 37(6):1790–1808.

2. Browning, Tian, Zhou, et al. Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 2021; 108(10):1880–1890.

3. Burgstaller-Muehlbacher, Crotty, Schmidt, et al. ModelRevelator: Fast phylogenetic model estimation via deep learning. Mol Phylogenet Evol, 2023; 188:107905.

4. Cheng, White, Kamdem, et al. Ecological genomics of Anopheles gambiae along a latitudinal cline: A population-resequencing approach. Genetics, 2012; 190(4):1417–1432.

5. Dutheil, Ganapathy, Hobolth, et al. Ancestral population genomics: The coalescent hidden Markov model approach. Genetics, 2009; 183(1):259–274.

6. Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 1985; 39(4):783–791.

7. Fletcher, Yang. INDELible: A flexible simulator of biological sequence evolution. Mol Biol Evol, 2009; 26(8):1879–1888.

8. Fontaine, Pease, Steele, et al. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science, 2015; 347(6217):1258524.

9. Gao, Liu. Statistical analysis of GC-biased gene conversion and recombination hotspots in eukaryotic genomes: A phylogenetic hidden Markov model-based approach. In: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM; 2021; pp. 1–24.

10. Garrigan, Geneva. msmove: A modified version of Hudson’s coalescent simulator ms allowing for finer control and tracking of migrant genealogies. 2014.

11. Hartigan, Wong. Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc Ser C (Appl Stat), 1979; 28(1):100–108.

12. Hein, Schierup, Wiuf. Gene genealogies, variation and evolution: a primer in coalescent theory. Oxford University Press: New York, NY; 2004.

13. Hernandez. A flexible forward simulator for populations subject to selection and demography. Bioinformatics, 2008; 24(23):2786–2787.

14. Hobolth, Christensen, Mailund, et al. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet, 2007; 3(2):e7.

15. Hudson. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 2002; 18(2):337–338.

16. Jayaswal, Wong, Robinson, et al. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol, 2014; 63(5):726–742.

17. […], Chen, Rapakoulia, et al. Deep learning identifies and quantifies recombination hotspot determinants. Bioinformatics, 2022; 38(10):2683–2691.

18. Liu, Raghavan, Nelesen, et al. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 2009; 324(5934):1561–1564.

19. Liu, Warnow, Holder, et al. SATe-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol, 2012; 61(1):90–106.

20. Liu, Steinberg, Yozzo, et al. Interspecific introgressive origin of genomic diversity in the house mouse. Proc Natl Acad Sci U S A, 2015; 112(1):196–201.

21. Mirarab, Nguyen, Warnow. PASTA: Ultra-large multiple sequence alignment. In: International conference on research in computational molecular biology. Springer; 2014; pp. 177–191.

22. Ray, Flagel, Schrider. IntroUNET: Identifying introgressed alleles via semantic segmentation. PLoS Genet, 2024; 20(2):e1010657.

23. Rodriguez, Oliver, Marin, et al. The general stochastic model of nucleotide substitution. J Theor Biol, 1990; 142(4):485–501.

24. Sanderson. r8s: Inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics, 2003; 19(2):301–302.

25. Shepherd, Klaere. How well does your phylogenetic model fit your data? Syst Biol, 2019; 68(1):157–167.

26. Wang, Hejasebazzi, Zheng, et al. Build a better bootstrap and the RAWR shall beat a random path to your door: Phylogenetic support estimation revisited. Bioinformatics, 2021; 37(Suppl_1):i111–i119.

27. Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLoS Curr, 2012; 4:RRN1308.

28. White, Hahn, Pombi, et al. Localization of candidate regions maintaining a common polymorphic inversion (2La) in Anopheles gambiae. PLoS Genet, 2007; 3(12):e217.

29. Wuyun, VanKuren, Kronforst, et al. Scalable statistical introgression mapping using approximate coalescent-based inference. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2019; pp. 504–513.

30. Zhao, Souilljee, Pavlidis, et al. Genome-wide scans for selective sweeps using convolutional neural networks. Bioinformatics, 2023; 39(39 Suppl 1):i194–i203.
