TRACE: Open-source software for quantifying somatic variation of tandem repeats by capillary electrophoresis

Abstract

Expanded short tandem DNA repeats are implicated in over 60 human disorders. In many, somatic instability (SI) of the repeat plays a critical role in disease pathogenesis. For example, SI in vulnerable neurons is a key driver of clinical symptoms in Huntington's disease. Quantifying SI has traditionally relied on PCR followed by capillary electrophoresis, with metrics describing the shape of repeat size distributions, such as the expansion index. However, current tools often require costly proprietary software, are time-consuming, and rely on custom pipelines that vary between labs. To address these challenges, we developed Tandem Repeats Analysis by Capillary Electrophoresis (TRACE), an open-source software that processes fragment analysis data end-to-end, from raw files to SI metrics. Additionally, we created an associated web app TRACE-shiny (https://traceshiny.mgh.harvard.edu), for interactive usage. Instability metrics from TRACE benchmarked against published datasets confirm its utility for studying genetic and pharmacological modifiers of SI. TRACE eliminates the need for proprietary software or custom pipelines, making advanced tools for analysis of somatic repeat expansion widely accessible.

Keywords

cag repeat instability repeat expansion dynamics and biochemistry repeat instability cag repeat expansions genetics

Introduction

Expanded short tandem repeats (STRs) in DNA are implicated in over 60 human disorders, a number that continues to grow with advancements in sequencing technologies.^1,2 In many STR disorders, somatic instability (SI) of the repeat plays a critical role in disease pathogenesis. For example, in Huntington's disease (HD), CAG repeat expansions in vulnerable neurons are somatically unstable,^3–5 and genetic association studies implicate DNA repair genes in modifying this expansion,³ directly linking somatic repeat instability to clinical symptoms. With growing academic and therapeutic activities in this area across >60 STRs, tools for quantifying SI metrics are increasingly necessary to enable automation, reproducibility, and high-throughput data analysis.

Quantifying SI has traditionally relied on polymerase chain reaction amplification (PCR) followed by capillary electrophoresis, which generates a distribution whose sequential peaks differ by one repeat unit, and then applying metrics that describe the shape of repeat distribution. For example, the instability index was devised to quantify the mean change in CAG length relative to the inherited repeat length.⁶ The approach can be applied to both contraction and expansion and has been widely adopted for quantifying the average repeat increase in mice or human tissues.^7,8

Despite this, widely used tools for processing raw capillary electrophoresis data rely on expensive proprietary software and substantial manual intervention. For example, GeneMapper implements a semi-automated workflow in which ladder reference sizes are used to infer fragment lengths. A known limitation of fragment analysis is that repeat-containing fragments migrate faster than non-repetitive ladder fragments, leading to systematic underestimation of repeat length and compressed base-pair spacing between peaks.⁹ In GeneMapper, this is addressed by assigning peaks to manually defined bins, typically calibrated using control samples of known repeat length in each run. Because of this manual curation, the workflow remains time-consuming, and additional processing is required after fragment length tables have been exported for downstream analysis.

Alternatively, researchers have assembled custom pipelines¹⁰ that incorporate tools like the Fragman R package.¹¹ While effective, these tools were not designed specifically for STR analysis, often require extensive tuning, and do not correct run-to-run variation. Such ad hoc solutions complicate cross-study comparisons and are challenging to adapt or customize for specific SI applications.

To address these challenges and enhance throughput and reproducibility, we present the Tandem Repeats by Capillary Electrophoresis (TRACE) R package. TRACE introduces new algorithms for processing raw capillary electrophoresis data, incorporates features tailored to somatic instability (SI) experiments by enabling sample grouping for instability metric calculation, and provides frameworks to correct for the systematic underestimation of repeat length when converting from base pairs. We demonstrate that discordance between base pair length and repeat length varies across fragment analysis runs but can be mitigated using shared control samples through a simple batch correction approach or by applying an accurate repeat sizing method, which automates a concept similar to the manual binning strategy commonly used in GeneMapper. We also developed TRACE-shiny (https://traceshiny.mgh.harvard.edu/), a web-based platform for code-free interfacing with the TRACE package. TRACE and TRACE-shiny eliminates the need for proprietary software or custom pipelines, making advanced tools for consistent analysis of somatic repeat expansion widely accessible.

Material and methods

TRACE R package

The underlying code for processing fragment analysis data and generation of metrics is built upon our R package TRACE (https://zachariahmclean.github.io/trace/). This aggregates our algorithms described previously^3,12 into a unified framework. Users interact primarily with the high-level trace() function, which processes raw fragment analysis data and produces structured output. This output can then be supplied to calculate_instability_metrics() function to generate metrics and visualizations of the data. The pipeline is built on custom R classes and is designed to work efficiently with the TRACE-Shiny application.

Processing signal, fitting ladder and assigning base pair

The FSA files (Applied Biosystems fragment analysis file format with .fsa extension) are imported into memory in R using read.abif from the seqinr package.¹³ To fit the ladder, we test combinations of observed peaks for their linear correlation to the ladder. However, testing all possible combinations becomes computationally prohibitive as the number of identified peaks increases relative to the number of ladder reference sizes. We developed an optimization to reduce the total number of combinations based on the assumption that simultaneous testing of all possible combinations is unnecessary. Since ladder reference sizes follow a predictable size-ordered pattern, we can process them sequentially in windows (Supplementary Figure 1). When assigning smaller reference sizes, we only consider combinations up to the point where all remaining observed peaks would line up with the remaining number of ladder reference sizes. This eliminates vast numbers of irrelevant combinations from consideration, while also accounting for the non-linearity of capillary electrophoresis data.^9,14 For each window of reference sizes, we test combinations from a constrained window of candidate observed peaks (the remaining observed peaks would fit one-to-one with the remaining size standard), progressively moving this window as we assign larger reference sizes. At each window, the algorithm branches by a user-defined top-N, with non-promising branches pruned by an R-squared threshold. Base-pair sizes are interpolated using a generalized additive model with cubic regression splines using the mgcv package,¹⁵ which provides a flexible, accurate sizing model by accounting for slight non-linearities in the data. Peaks are identified in the ladder and trace data with findpeaks from pracma (https://CRAN.R-project.org/package = pracma).

Allele calling

In this software, metrics are calculated from a single allele. This simplifies analysis in many samples, such as i) mouse models with expanded repeats in the context of human gene sequence, where there is a single amplifiable expanded repeat allele, such Hdh^Q111 (Q111)¹⁶ and R6/2¹⁷ models, or ii) human cell models where the expanded repeat of interest is typically >100 units larger than the normal allele(s), such that the normal allele(s) can be ignored.^12,18 For HD human samples with an allele in the inherited adult-onset range, both wild-type and expanded alleles can be identified. But for simplicity, metrics are only calculated from the longer of the two alleles. The modal allele is identified within each sample by clustering peaks into “peak regions” based on height and size thresholds, then calling the tallest in the cluster as the main allele.

Repeat calling

Repeat sizes for each peak are estimated based on the size of the PCR product, the number of nucleotides in the repeat unit, and the number of non-repeat nucleotides flanking the repeat. We intentionally do not report repeat sizes as integers, since they are derived from continuous raw signal with base-pair positions predicted relative to a size ladder. Additionally, exact integer calls are unnecessary for calculating instability metrics, so repeat sizes are therefore retained as continuous values to avoid inaccuracies introduced by rounding. However, we do include an optional setting to force whole repeat units between peaks, providing cross-compatibility with data processed using traditional pipelines (e.g., GeneMapper).

Since repeat-containing amplicons migrate non-linearly relative to the ladder, their base-pair sizes may be underestimated and show batch-to-batch variation. To address this issue, we also provide several important features including algorithms for correcting batch effects through a simple correction or by accurate repeat sizing. The simple batch correction algorithm lines up common samples across runs and adds a correction factor to address run-to-run variability. First, a smoothing algorithm is applied to locate the most representative modal size. Second, batch effects are estimated via mixed modeling using the lme4 package,¹⁹ providing flexibility for unbalanced data and handling non-overlapping sample sets. For accurate repeat sizing, a predefined repeat length provided in the metadata is assigned to the modal allele and adjacent repeat units. This creates a linear model for the relationship between base-pair size and repeat length. For both batch and repeat correction, a correction factor is applied to the peaks of all samples.

To evaluate the accuracy of the repeat-calling algorithm, we compared the modal allele called in samples processed manually using GeneMapper (v5.0) with those processed by TRACE. A total of 872 HD knock-in Q111 mouse tissue samples across 32 fragment analysis runs were analyzed (Supplementary Figure 2). Each fragment analysis submission had a sample with a known, validated repeat length (historically validated through Sanger sequencing, and more recently through short or long read amplicon sequencing of the HTT repeat) for best practice accurate repeat sizing. For the GeneMapper analysis, bins of repeat sizes of HTT CAG-containing PCR amplicons, that incremented by one CAG repeat, were manually adjusted to line up with the repeat length of the modal peak of the validated reference sample. For processing the data in TRACE, the run and sample information was recorded in the metadata file using the categories batch_run_id (date of fragment analysis run), batch_sample_id (id of the validated reference sample that allows linking across runs for quality control), and batch_sample_modal_repeat (repeat length of the modal peak of the validated reference sample) and the correction parameter was set to repeat.

The underestimation of repeat size and the spacing between consecutive peaks was calculated using the same dataset as above. To calculate the spacing in base pairs between consecutive peaks (Supplementary Figure 2(a)), for each sample, the data were subsetted to peaks ranging from 110 to 130 to select only robust peaks and the median taken of the difference between consecutive peaks. This metric helps demonstrate the consistent narrow difference between peaks which should theoretically be exactly three base pairs. Next, the modal peak underestimation for each sample was found by dividing the repeat length calculated directly from base-pair size to the repeat length accurately determined with TRACE. This helps quantify the overall shortfall in modal repeat length when relying just on base-pair size. For Supplementary Figure 2(b), the median value for each metric across samples was computed to enable correlation across runs. This helps demonstrate how the batch-to-batch differences in spacing between consecutive peaks might help explain the variability of allele repeat length.

Instability metrics and benchmarking

The main instability metrics of interest are calculated as described previously⁶ but also explained briefly below. To help with clarity, we introduced a new term “index peak”, defined as the reference peak within a fragment analysis trace. This may correspond to the inherited repeat length in a mouse or the time zero sample in a cell culture experiment. It is primarily used for instability index and related calculations and may or may not be the modal peak in a trace.

The expansion ratio (also known as peak proportional sum or somatic expansion ratio) focuses on the proportion of signal contributed by expansion or contraction peaks relative to the index peak, making it less sensitive to noise and well-suited for analyzing human samples with adult-onset ranges.^3,20 We benchmarked the expansion ratio from the TRACE software to a previous pipeline we recently reported for 733 HD blood samples of the Registry dataset³) with patients with adult onset repeat lengths.

The expansion index as originally described⁶ measures expanded repeat lengths relative to the index peak, weighted by peak signal and distance from the index peak. It provides insight into the spread or diversity of repeat sizes by finding the average repeat change and is predominantly used for samples where there are mixed cell populations with differing rates of repeat expansion, which can produce a bimodal distribution of non-repeat-expanding cells and expanding cells (e.g., mouse disease model brain tissue). For benchmarking, we reanalyzed FSA files from 879 Q111 mouse (with a single expanded CAG repeat) tissues across 32 fragment analysis runs, with a 5% relative peak height threshold and the TRACE parameter force_whole_repeat_units and accurate repeat sizing as described above, to match the repeat level GeneMapper data.

Average repeat change calculates the difference in weighted mean repeat size between the sample and the index samples.¹² This metric is particularly intuitive for studying population-level shifts in cell models, where repeat distributions often remain approximately normally distributed. It is almost identical to the “instability index change“ (instability index of the sample subtracted by the instability index of the index sample),¹⁸ which uses the instability index formula rather than weighted mean, so relies more on accurate index peak assignment. We compared the average repeat change from the TRACE software to values generated with a custom pipeline that included FSA file processing with GeneMapper.¹² These 53 samples were from the RPE1-AAVS1-CAG115 cell model testing the effect of cell division and dox-induced transcription.

Finally, the contraction index is the inverse of the expansion index⁶ and measures the contracted repeat lengths relative to the index peak, weighted by peak signal and distance from the index peak. For benchmarking, we used 138 samples with a CCCTCT repeat associated with X-linked dystonia parkinsonism (XDP), with samples taken from various regions in post-mortem brain tissue.²¹

TRACE web server

We developed a standalone web-server application, TRACE-shiny (https://traceshiny.mgh.harvard.edu). TRACE-shiny was built using the shiny package in R (https://CRAN.R-project.org/package = shiny) and leverages shiny reactivity and plotly²² visualizations to interactively guide the user through the fragment analysis pipeline. The focal point of the web server is to enable users without experience in R to progress through the TRACE pipeline. The interactivity element of TRACE-shiny enables additional features including; 1) The ability to dynamically adjust ladder peaks that were incorrectly assigned (see processing raw signal). 2) Interactive assignment of the index repeat used in instability metrics calculations. 3) Tailored plotting tools designed to enhance visualization and emphasize major differences between raw signal traces. These include the option to normalize traces between samples by the highest peak or the sum of all peaks. The raw data can also be aggregated and the average found using best fit curve by local polynomial regression (loess).

Results

Here we present R package TRACE and associated web server TRACE-shiny. The TRACE R package provides functions in R for processing and analyzing raw fragment analysis data (FSA files) (Figure 1). The TRACE-shiny web server adds a user-friendly interface to guide analysis without previous R programming knowledge. TRACE-shiny enhances the analysis with a layer of interactivity and provides quickly accessible summary data at each step of the analysis. Below we highlight some of the unique features within this software and provide benchmarking of repeat instability metrics across a range of contexts.

Figure 1.

Data processing pipeline. (a) pipeline underlying the TRACE R package to process FSA files through to instability metrics. (b) An example of the raw signal for the size ladder (black) and sample (green).

Ladder assignment algorithm

A significant challenge in capillary electrophoresis is assigning the reference ladder of known base-pair sizes to the corresponding peaks in the raw signal. In TRACE, we test combinations of observed peaks for their fit to the ladder using an optimization that minimizes the total number of combinations to test. Briefly, the observed peaks are tested for linear fit to ladder reference sizes in consecutive windows, which simultaneously overcomes non-linearity of capillary electrophoresis data^9,14 while constraining combinations to prevent fruitless searches where insufficient peaks remain to be assigned. We tested this approach using challenging ladders that display contamination in the ladder channel, which introduces more peaks than are usually expected and makes the assignment difficult by combinatorics (Supplementary Figure 3). The TRACE approach reduces the total number of combinations to test by 10-1000-fold, depending on the sample and window size used (Figure 2(a)). The accuracy also depends on the window size used, with window sizes too small or large resulting in incorrect assignments (Figure 2(a)). In our tests using this challenging dataset, this approach reduced the time from a mean of 21 s per sample to test all possible combinations to 0.07 s with a window size of five. To prevent the algorithm from becoming trapped in local optimal fits, we implemented branching at each step, which only moderately increased combinations but was required for correct ladder fitting in several samples (Figure 2(b)).

Figure 2.

TRACE ladder assignment algorithm reduces the computational complexity of ladder-to-peak fitting. (a) Impact of linear size window on the number of combinations tested relative to exhaustive search (branch number = 5). Window sizes >13 all have the same number of combinations because the algorithm doesn’t allow subsequent iterations to have a window of less than 4. Relative combinations are log10 transformed; lines represent the nine individual samples; black dots indicate correct final ladder assignments, red dots incorrect assignments. (b) Branch retention during iterative fitting relative to no branching (size window = 5). At each window step, the top N branches are retained to avoid local optima. (c) Benchmarking the speed of the TRACE ladder assignment algorithm compared to Fragman R package for the same nine samples. Default settings used for each package (TRACE: size window = 5, branch number = 5) with the fits plotted in supplementary 3.

Figure 3.

Correcting batch effects to standardize fragment analysis runs. (a) In the first approach, batch correction factors (red and blue indicate the same sample but different runs) are determined by smoothing (black line) the trace and identifying maxima (vertical dotted lines), aligning traces even when they differ in their modal peaks. In this example, the top panel's modal peak is one repeat unit smaller than that of the bottom panel when converted from base pairs. (b) The same sample from different runs is overlaid, with colors indicating different fragment analysis runs. (c) The same samples as in b, shown after applying the batch correction factor. (d) In the second approach, samples of known repeat lengths are used to assign the modal repeat length, and neighboring peaks are identified sequentially by jumping from peak to peak. (e) A linear model is created using the assigned repeats and their corresponding base-pair sizes, which is then used to predict the modal repeat length of all samples within a run. (f) The difference between the modal repeat lengths called by GeneMapper and TRACE for 872 samples across 32 runs.

Benchmarking against the R package Fragman validates this approach. In contrast to TRACE, Fragman fits polynomial models to initial peak assignments and predicts where subsequent peaks should appear. When utilized on our dataset of challenging ladders, Fragman's assignments were not correct (Supplementary Figure 3), while also taking 2.5-fold longer (Figure 2(c)).

Fragment analysis batch effects

Another challenge in fragment analysis is that repeat-containing amplicons do not migrate linearly with the ladder fragments. The distance, measured in base pairs, between consecutive amplicon peaks is frequently shorter than predicted based on the repeat length. In our mouse model dataset, robust peaks within the 110–130 CAG repeat range had a median spacing of 2.85 base pairs, leading to systematic underestimation of repeat sizes when inferred directly from base-pair sizes. Across the 34 capillary electrophoresis runs, this spacing varied significantly (ANOVA, p < 0.001; Supplementary Figure 2(a)) with medians ranging from 2.83 to 2.87. These differences introduce batch effects that make FSA files from different fragment analysis runs not directly comparable. The gold standard for accurate repeat sizing, GeneMapper, addresses these batch effects by relying on validated size standard samples for manual calibration.

To address batch effects and/or the underestimation of repeat lengths inferred from base-pair sizes, we provide two approaches: (i) a correction factor applied across runs to bring the traces into alignment or (ii) accurate repeat sizing using validated size standard samples with known repeat lengths. The first approach involves smoothing the signal trace and aligning the positions of the smoothed peak maxima across runs (Figure 3(a)), enabling the application of a batch correction factor to standardize differences (Figure 3(b) and (c)). This method is straightforward to implement since it requires only shared samples across runs and is useful when the accurate repeat length is not required (e.g., in a cell line experiment where the average repeat change is tracked over time¹²). However, it does not resolve the underestimation of repeat sizes, and comparisons between experiments without shared samples are not possible.

The second approach accurately predicts repeat size within each run, which also corrects the batch effects. A linear model is constructed between base-pair sizes and repeats using size standards with known repeat lengths for each run. This model can then predict repeat sizes for all other samples in the run (Figure 3(d) and (e)). For the HTT CAG repeats, the linear model (R² of 0.9999, base pair coefficient p < 2e-16) remained accurate up to 50 repeat units (Supplementary Figure 2(c)), with residual analysis highlighting increasing non-linearity beyond this range (Supplementary Figure 2(d)). Benchmarking this method against repeat sizes called in GeneMapper using the same size standards with known repeat lengths showed high concordance, with the modal peak of 868/872 (99.5%) samples falling within the range of −0.5 to 0.5 repeat unit difference (Figure 3(f)). Across runs, the percent difference in modal repeat length was negatively correlated with the base-pair spacing between peaks (R² = 0.44, p < 0.001; Supplementary Figure 2(b)). This supports the interpretation that run-to-run variation in amplicon migration rates is a major source of batch effects.

TRACE-shiny provides interactive tools that enhance the analysis and interpretation of batch correction methods. For both approaches, users can visualize the batch correction samples both before and after correction to validate the adjustments made. In the second approach, users can refine their analysis by selecting the validated repeat size of the modal peak. This is a critical feature when the modal peak shifts by chance from batch to batch. Interactivity enables users to tailor the analysis to their specific datasets, ensuring more precise and reliable repeat size predictions.

Index assignment and metrics

In many experiments that follow the fate of an initial repeat allele, the inherited or starting repeat length (referred to here as the index peak) is a critical reference point for downstream metrics. By default, the index peak corresponds to the modal peak of the chosen index sample. For samples where the modal repeat length may have changed after experimentation, the index sample's modal repeat length provides the baseline for comparison. In the simplest scenario, the modal peak remains the same in both samples, so the index peak is also the modal in each. In other cases, however, the modal peak of the sample of interest may have shifted due to repeat expansion or contraction; in these situations, the software still assigns the index peak based on its correspondence to the modal peak of the index sample. This allows, for example, an expanded CAG repeat knock-in mouse liver sample to define the inherited repeat length used to assess expansion in the isolated hepatocytes,⁷ or a time-zero sample to define the starting repeat length for a cell line monitored over time (Figure 4(a)).¹²

Figure 4.

Index peak assignment and benchmarking of instability metrics. (a) Example of index peak assignment in mouse tissues (top) and cell lines (bottom). In the mouse liver sample (red), the index (I) peak corresponds to the modal (M) peak, reflecting the inherited repeat length. This index peak is then applied to the isolated hepatocyte sample (blue), where somatic expansion shifts the modal peak toward larger repeat sizes. In the cell line experiment (bottom), the modal peak of the time-zero sample (red) is used as the index peak for the 28-day sample (blue). (b) Expansion ratio in HD blood samples calculated with TRACE compared to Gem-HD Consortium data (2025). (c) Correlation of expansion index in HD mouse model samples calculated using GeneMapper with the traditional pipeline versus TRACE. (d) Average repeat change in the RPE1-AAVS1-CAG115 cell model measured with GeneMapper and the traditional pipeline versus TRACE. (e) Contraction index of CCCTCT XDP associated repeat in brain samples processed via GeneMapper and a custom pipeline compared to the values calculated by TRACE.

Finally, we benchmarked SI metrics generated by TRACE against previous pipelines. For HD blood samples, the expansion ratio calculated by TRACE was nearly identical to values from the previous custom pipeline (R² = 0.999; Figure 4(b)). Likewise, results previously generated using GeneMapper for expansion index in HD mouse model samples (R² = 0.997; Figure 4(c)) and average repeat gain in an in vitro cell model (R² = 0.997; Figure 4(d)) showed very strong concordance with TRACE. Rare deviations in the expansion index arose because TRACE was more inclusive of peaks above threshold at the extreme of the expansion distribution, whereas the traditional pipeline excluded them once any had dropped below the threshold.

By default, TRACE does not enforce integer repeat spacing between peaks. As we demonstrated, repeat-containing fragments exhibit slightly compressed base-pair spacing, resulting in peak intervals smaller than a full repeat unit. In GeneMapper, integer spacing is effectively imposed through manual peak binning. To assess the impact of this feature on instability metrics, we evaluated the option to enforce integer spacing between consecutive peaks using TRACE's force_whole_repeat_units argument (Supplementary Figure 4). Enforcing integer spacing increased the slope of the relationship between TRACE and GeneMapper expansion index values, bringing it closer to unity (the y = x relationship). By contrast, omitting this step resulted in a modestly reduced slope, consistent with slight underestimation of repeat length between peaks. Notably, the R² was unchanged, indicating that enforcing integer spacing affects absolute scaling but does not alter relative comparisons between samples or experimental conditions.

TRACE's pipeline is versatile and accommodates diverse repeat types by incorporating amplicon size and repeat unit length into its calculations. As an example, we applied TRACE to the hexameric CCCTCT repeat associated with X-linked Dystonia-Parkinsonism (XDP),²¹ which showed near identical results for the contraction index across brain tissue (R² = 0.997; Figure 4(e)).

Discussion

Traditional methods for analysis of tandem repeat sizes in DNA by capillary electrophoresis have relied on proprietary software that required extensive manual curation and custom pipelines that vary between research groups. Here, we present an automated, open-source tool, TRACE, and associated web server TRACE-shiny. Benchmarking against data processed using the widely used standard analysis tool, GeneMapper, demonstrated near-identical instability metrics. We also introduce unique features such as batch correction, accurate repeat sizing, and index assignment for instability metrics calculation. Crucially, TRACE enables automated high-throughput pipelines, which are key to accelerating basic research and therapeutic development for SI. The TRACE-shiny web app requires no coding, lowering the barrier of entry to this tool.

A significant complication of fragment analysis is that repeat length cannot be directly inferred from base-pair size derived from reference ladder sizes. We demonstrated that base pairs between consecutive peaks varied across runs and correlated with the underestimation of repeat length. This variability likely arises from differences in migration rates between repeat-containing amplicons and the ladder size standard.⁹ The resulting batch effects may influence SI metrics under specific conditions, such as (i) assigning the index peak across samples analyzed in separate runs or (ii) directly comparing metrics dependent on repeat size, such as the average repeat gain (the difference in average repeat sizes) or the difference in modal repeat size.

To address the underestimation of repeat length, we provide two correction approaches. The simple batch correction method is straightforward to implement and only requires shared samples across runs. This simplicity comes with the limitations that absolute repeat sizes may not be directly comparable across analyses and repeat lengths will still be underestimated. For more accurate repeat sizing, we introduced a second approach that resolves these issues but requires samples with validated repeat lengths. The effectiveness of this method depends on the quality of the reference samples, which is also true for the manual genotyping approach using GeneMapper. Ideally, the modal peaks should have clear separation in height from neighboring peaks to ensure their consistent identification. This is particularly critical for repeat lengths exceeding 100, where the signal flattens significantly due to inherent mosaicism resulting in a lack of a single predominant repeat size and/or PCR-induced errors. Overall, when carrying out SI experiments, careful planning should be taken to determine if batch correction is required and how to achieve that using these tools.

Underestimation of repeat length also influences instability metrics, primarily through scaling effects. Given the inherent uncertainty in capillary electrophoresis-based sizing, TRACE does not enforce integer repeat spacing by default. This design choice preserves the underlying signal structure when transforming continuous fragment sizes into discrete peak calls. For certain metrics, such as expansion ratio, this has no effect because they depend only on relative peak signal. By contrast, metrics based on repeat-length differences, such as the expansion index, are sensitive to non-integer peak spacing. TRACE addresses this through the optional force_whole_repeat_units parameter, which enforces integer repeat spacing relative to the modal peak within each sample. Enforcing integer spacing alters the absolute value of the expansion index but primarily introduces a scaling factor. Because this effect is multiplicative, relative comparisons between samples or experimental conditions are preserved. Thus, enforcing integer repeat spacing facilitates direct comparison with metrics derived from GeneMapper, but does not improve sensitivity for detecting differences between samples. This underscores the need for caution when comparing expansion index values across methodologies and the importance of explicitly reporting these choices.

The primary goal of TRACE is to quantify somatic instability using relative metrics rather than to provide precise genotyping of repeat lengths. Accordingly, TRACE is not intended as a replacement for clinical or diagnostic repeat genotyping workflows, which require accurate allele–specific repeat determination. Although TRACE’s modal repeat–calling approach showed high concordance with GeneMapper in a controlled benchmarking context, this evaluation was limited to datasets with a single expanded allele and relatively homogeneous repeat sizes. Accurate genotyping in more complex scenarios including multiple alleles of similar length, broader repeat distributions, and varying sequence composition remains challenging for electrophoresis–based approaches. In such contexts, sequencing–based methods are better suited for robust repeat genotyping.

It is possible that sequencing technologies will eventually supersede fragment analysis for quantifying SI.^23,24 Short-read and long-read sequencing has proven highly effective for certain applications, such as detecting instability at adult-onset repeat lengths in human blood samples.³ However, model systems often contain repeat tracts exceeding 100 units to enable quantification of metrics within realistic timeframes, which require long-read technologies. These long-read approaches are significantly more expensive than fragment analysis and typically require pooling large numbers of samples to bring costs into a comparable range.^23,24 Such designs require large-batch analyses and careful balancing of read depth to maintain accurate representations of repeat distributions. By contrast, fragment analysis remains a cost-effective and reliable method for capturing repeat distributions and continues to serve as the standard approach for calculating SI metrics across diverse systems.

Overall, TRACE and TRACE-shiny democratize access to essential tools for SI analysis, fostering reproducibility and broadening accessibility for researchers. They represent a timely resource, given the rapid growth of the repeat expansion disorder field, the increasing recognition of SI's clinical importance, and the ongoing development of therapies targeting this process.

Supplemental Material

sj-docx-1-hun-10.1177_18796397261448958 - Supplemental material for TRACE: Open-source software for quantifying somatic variation of tandem repeats by capillary electrophoresis

Supplemental material, sj-docx-1-hun-10.1177_18796397261448958 for TRACE: Open-source software for quantifying somatic variation of tandem repeats by capillary electrophoresis by Andrew Jiang, Kevin Correia, Tammy Gillis, Esaria L Oliver, Benjamin P Jones, Branduff McAllister, Alan Mejia Maza, Marcy E MacDonald, Ricardo Mouro Pinto, Vanessa C Wheeler, James F Gusella and Zachariah L McLean in Journal of Huntington's Disease

Footnotes

Acknowledgements

We thank the many individuals at the Massachusetts General Hospital Center for Genomic Medicine for rigorously testing TRACE and TRACE-shiny and for their valuable input toward its improvement.

ORCID iDs

Andrew Jiang

Benjamin P Jones

Zachariah L McLean

Ethical considerations

Use of HD³ and XDP²¹ samples for benchmarking was approved by the Mass General Brigham Institutional Review Board (MGB IRB).

Author contributions

AJ: Conceptualization, Methodology, Software, Writing – review & editing. KC: Conceptualization, Methodology, Software. TG: Resources. ELO: Data curation, Resources. BPJ: Data curation. BM: Conceptualization, Software. AMM: Resources. MEM: Methodology, Funding acquisition, Writing – review & editing. RMP: Conceptualization, Resources, Methodology, Writing – review & editing. VCW: Funding acquisition, Resources, Methodology, Supervision, Writing – review & editing. JFG: Funding acquisition, Project administration, Methodology, Supervision, Writing – review & editing. ZLM: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by National Institutes of Health grants NS091161 (JFG), NS126420 (RMP), NS049206 (VCW), 1K99NS144514-01 (AMM), the CHDI Foundation (JFG, MEM), Hereditary Disease Foundation Fellowship (ZLM), and the Huntington's Disease Society of America Human Biology Project (ZLM).

Declaration of conflicting interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: J.F.G. and V.C.W. were founding scientific advisory board members with a financial interest in Triplet Therapeutics Inc. Their financial interests were reviewed and are managed by Massachusetts General Hospital (MGH) and Mass General Brigham (MGB) in accordance with their conflict-of-interest policies. V.C.W. is a scientific advisory board member of LoQus23 Therapeutics Ltd and has provided paid consulting services to Acadia Pharmaceuticals Inc., Alnylam Inc., Biogen Inc., Passage Bio, Rgenta Therapeutics and Ascidian Therapeutics. J.F.G. consults for Harness Therapeutics, Inc. (dba Harness Therapeutics) and has previously provided paid consulting services to Wave Therapeutics USA Inc., Biogen Inc. and Pfizer Inc.

Data availability

TRACE is available on CRAN (https://cran.r-project.org/web/packages/trace/index.html) and TRACE-shiny is accessible at traceshiny.mgh.harvard.edu. Alternatively, a docker version is accessible through the Docker Hub under the name ‘trace/traceshiny’ or it can be installed directly from github (). Data are available upon request.

Supplemental material

Supplemental material for this article is available online.

References

Depienne

Mandel

J-L

. 30 Years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet 2021; 108: 764–785.

Chen

, et al. Repeat expansion disorders. Pract Neurol 2025; 25: 204–216.

Genetic Modifiers of Huntington’s Disease (GeM-HD) Consortium Genetic modifiers of somatic expansion and clinical phenotypes in Huntington's disease highlight shared and tissue-specific effects. Nat Genet 2025; 57: 1426–1436.

Mätlik

, et al. Cell-type-specific CAG repeat expansions and toxicity of mutant huntingtin in human striatum and cerebellum. Nat Genet 2024; 56: 383–394.

Swami

, et al. Somatic expansion of the Huntington's disease CAG repeat in the brain is associated with an earlier age of disease onset. Hum Mol Genet 2009; 18: 3039–3047.

Lee

, et al. A novel approach to investigate tissue-specific trinucleotide repeat instability. BMC Syst Biol 2010; 4: 29.

Mouro Pinto

, et al. In vivo CRISPR–Cas9 genome editing in mice identifies genetic modifiers of somatic CAG repeat instability in Huntington’s disease. Nat Genet 2025; 57: 314–322.

Mouro Pinto

, et al. Patterns of CAG repeat instability in the central nervous system and periphery in Huntington's disease and in spinocerebellar ataxia type 1. Hum Mol Genet 2020; 29: 2551–2567.

Chastain

2nd , et al. Anomalous rapid electrophoretic mobility of DNA containing triplet repeats associated with human disease genes. Biochemistry 1995; 34: 16125–16131.

10.

McAllister

, et al. Exome sequencing of individuals with Huntington's disease implicates FAN1 nuclease activity in slowing CAG expansion and disease onset. Nat Neurosci 2022; 25: 446–457.

11.

Covarrubias-Pazaran

, et al. Fragman: an R package for fragment analysis. BMC Genet 2016; 17: 62.

12.

McLean

, et al. Splice modulators target PMS1 to reduce somatic expansion of the Huntington’s disease-associated CAG repeat. Nat Commun 2024; 15: 3182.

13.

Charif

Lobry

. Seqinr 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In: Bastolla

, et al. (eds) Structural approaches to sequence evolution: molecules, networks, populations. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp.207–232.

14.

Southern

. Measurement of DNA length by gel electrophoresis. Anal Biochem 1979; 100: 319–323.

15.

Wood

. Fast stable restricted Maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc Ser B Stat Methodol 2010; 73: 3–36.

16.

Wheeler

, et al. Length-Dependent gametic CAG repeat instability in the Huntington's disease knock-in mouse. Hum Mol Genet 1999; 8: 115–122.

17.

Menalled

Chesselet

M-F

. Mouse models of Huntington's disease. Trends Pharmacol Sci 2002; 23: 32–39.

18.

Ferguson

, et al. Therapeutic validation of MMR-associated genetic modifiers in a human ex vivo model of Huntington disease. Am J Hum Genet 2024; 111: 1165–1183.

19.

Bates

, et al. Fitting linear mixed-effects models using lme4. J Stat Softw 2015; 67: 1–48.

20.

Genetic Modifiers of Huntington’s Disease

(GeM-HD) Consortium

. CAG Repeat not polyglutamine length determines timing of Huntington's disease onset. Cell 2019; 178: 887–900.e14.

21.

Mejia Maza

, et al. MSH3 Is a genetic modifier of somatic repeat instability in X-linked dystonia parkinsonism. Am J Hum Genet 2026; 113: 83–99.

22.

Shalabh. Interactive web-based data visualization with R, plotly, and shiny. J R Stat Soc Ser A Stat Soc 2021; 184: 1150–1150.

23.

Ciosi

, et al. Approaches to sequence the HTT CAG repeat expansion and quantify repeat length variation. J Huntingtons Dis 2021; 10: 53–74.

24.

Maestri

, et al. Navigating triplet repeats sequencing: concepts, methodological challenges and perspective for Huntington’s disease. Nucleic Acids Res 2025; 53(1): gkae1155.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.62 MB

0.00 MB