BPscore: An Effective Metric for Meaningful Comparisons of Structural Chromosome Segmentations

Abstract

Studying the three-dimensional structure of chromosomes is an emerging field flourishing in recent years because of rapid development of experimental approaches for studying chromosomal contacts. This has led to numerous studies providing results of segmentation of chromosome sequences of different species into so-called topologically associating domains (TADs). As the number of such studies grows steadily and many of them make claims about the perceived differences between TAD structures observed in different conditions, there is a growing need for good measures of similarity (or dissimilarity) between such segmentations. We provide here a bipartite (BP) score, which is a relatively simple distance metric based on the bipartite matching between two segmentations. In this article, we provide the rationale behind choosing specifically this function and show its results on several different data sets, both simulated and experimental. We show that not only the BP score is a proper metric satisfying the triangle inequality, but also that it is providing good granularity of scores for typical situations occurring between different TAD segmentations. We also introduce local variant of the BP metric and show that in actual comparisons between experimental data sets, the local BP score is correlating with the observed changes in gene expression and genome methylation. In summary, we consider the BP score a good foundation for analyzing the dynamics of chromosome structures. The methodology we present in this study could be used by many researchers in their ongoing analyses, making it a popular and useful tool.

1. Introduction

Scientific studies of the cell nucleus and its structure started >100 years ago when embryologists were performing their experiments on the fertilized eggs and subsequent rapid cell divisions. In the past 50 years, after discovery of the DNA structure and its role, the topic of chromatin structure was considered to be of slightly lower importance, as the researchers have been mostly preoccupied with studying the DNA sequence of the genomes to uncover its function.

In recent years, genome conformation studies gained significant attention again, for at least two reasons. On the technical side, this is mainly due to development of Chromosome Conformation Capture, or 3C technique (Dekker et al., 2002), and its derivative methods such as Hi-C (Lieberman-Aiden et al., 2009), 4C (Simonis et al., 2006; Zhao et al., 2006), or 5C (Dostie et al., 2006). In contrast, a perhaps more important reason why there is so much interest in the newly acquired data on chromosomal conformation is that current efforts in explaining the function of noncoding regulatory elements without the information on the contacts between enhancers and promoters are proving it to be an extremely difficult problem (Wilczynski and Furlong, 2010). The problem of gene regulation is difficult even given that the positions of such elements can be identified by DNAse Seq or related techniques (Song et al., 2011). This leads to the situation where many researchers are probing chromosome structure using 3C-related techniques and the research in the field is progressing rapidly. This, as always is the case in blooming scientific fields, leads to new technical challenges in describing the chromosomal structures and comparing them between cell types and experimental conditions.

Going back to the technical side of the chromosome capture techniques, in the Hi-C assay DNA is cross-linked, shared into pairs of fragments, and ligated. Ligated fragments are then pulled down and merged with adapters. Finally, fragments are paired-end sequenced to produce library of paired reads, indicating pairs of genomic regions that are in proximity in the three-dimensional (3D) space of the nucleus. Even though the method is not exact in the sense of mapping the 3D distance to the probability of ligation followed by successful identification of a pair, it has been proven to be very reproducible provided that we are interested in the average contact frequency over a large pool of cells statistically sampled by deep sequencing (Rao et al., 2014). Single cell Hi-C techniques are much more recently developed, but there is significant progress there as well and we should expect soon to be able to compare chromosomal contact structures also between individual cells (Nagano et al., 2013, 2017). Also, other methods, which are not ligation dependent, have been developed for probing chromosomal contacts, most notably the genome architecture mapping results corroborate the Hi-C-derived chromosomal structures (Beagrie et al., 2017).

Once the sequencing is completed and contacting pairs of segments are identified, usually the analysis of interactions is based on construction of contact maps: symmetric matrices that summarize contacts between bins of a predefined size. If we are interested in cis-chromosomal contacts, which is usually the case, such large genome-wide matrix is usually broken down into separate matrices for each chromosome. In these matrices, each cell contains the number of interactions between two particular regions of the same chromosome. Such matrices usually need to be normalized, to account for several technical issues, such as GC content bias and diverse mappability across genomic regions that lead to nonuniform read coverage along chromosomes. There are methods that are specialized to a given Hi-C protocol (Yaffe and Tanay, 2011) as well as more generic normalization methods based on the assumption of uniform coverage (Imakaev et al., 2012).

In recent years, many studies have suggested that in addition to the well-established chromosomal compartments representing the division of chromatin into active euchromatin and inactive heterochromatin, there exists a finer structure of topologically associating domains (TADs) (Dixon et al., 2012; Dekker and Heard, 2015). The presence of TAD-like structures was confirmed in almost all metazoan Hi-C data, with notable exclusion of some chromosomes in Caenorhabditis elegans (Crane et al., 2015), whereas there appears to be no sign of TADs in yeast and plants (Feng et al., 2014). The majority of animals, including all studied mammals, show significant concentrations of contacts into TADs and these appear to be largely conserved between species and conditions (Dixon et al., 2012). Even though, mostly due to variability in Hi-C protocols and TAD-calling methods, it still remains a challenge to assess physical properties and biological function of TADs in individual cells, it was shown beyond doubt that regions of genome demarcated by TADs are indeed correlated with gene regulation (Flyamer et al., 2017). In particular, it was shown that a deletion of TAD boundary regions may lead to different developmental disorders (Nora et al., 2012; Andrey et al., 2013; Lupiáñez et al., 2015; Bianco et al., 2018).

Numerous studies exploiting Hi-C to compare genome conformation between different cell types or conditions were conducted (Filippova et al., 2014; Le Dily et al., 2014; Chandra et al., 2015; Dixon et al., 2015). The purpose of such analyses is to capture differences in spatial organization of chromatin and its relationships with gene expression or epigenetic modifications. One approach is to spot and quantify regions of chromosome where TAD arrangement is (dis)similar between two experiments. A common method for this purpose is to count number or overlapping TAD boundaries that was used among others by Dixon et al. (2012), Barutcu et al. (2015), and Fraser et al. (2015).

However, this approach suffers one serious disadvantage that large domains with relatively small shift in their boundaries contribute to lowering the overall dissimilarity as much as small domains with relatively large shifts. Another approach that is not as boundary-centric is by measuring variation of information (VI) between different TAD segmentations as used in Filippova et al. (2014). This approach, however, suffers from the fact that it does not take into account the linear structure of the genome and usually overestimates the deviation from the randomized control and, therefore, overestimates the TAD structure conservation. Given the rapid advancement of the experimental research in the field and constantly growing flow of new data with TAD structures, it is important to have a proper metric to compare different experimental results as well as different computational methods for TAD detection. In perfect scenario, such dissimilarity measure should be a proper metric satisfying triangle inequality.

In this study, we present new distance measure for comparing TAD sets and prove that our measure fulfills metric properties. We use simulated and real data to show that our metric is able to capture interesting properties of different domain sets. We also compared the bipartite (BP) score with a widely used approach that measures similarity between chromosome segmentations as a function of the number of overlapping boundaries. The outcome demonstrates smoother behavior of the BP score allowing for more precise analysis of chromosome segmentations. Finally we present how our metric can be used to estimate (dis)similarity of subchromosomal regions, enabling for more in depth analysis.

2. Results

2.1. BP distance

In this section we give the definition of BP distance and explain notation. Precise definitions can be found in Supplementary Material.

Let A and B define two intervals of equal length N. Each of them is partitioned on subintervals \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A [ i ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B [ j ]$$ \end{document} (Fig. 1 and Supplementary Fig. S1). A pair of subintervals \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A [ i ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B [ j ]$$ \end{document} may induce segment \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$o [ k ]$$ \end{document} , which is a nonempty intersection between those subintervals. BP distance between domain sets A and B is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} d \left( { A , B } \right) = 1 - \mathop \sum \limits_ { i = 1 } ^n { \frac { \vert o [ i ] \vert } { \max \left( { \vert { f_ { A , B } } ( i ) \vert , \vert { f_ { B , A } } ( i ) \vert } \right) } } \cdot \frac { { \vert o [ i ] \vert } } { N } \end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} = 1 - \frac { 1 } { N } \mathop \sum \limits_ { i = 1 } ^n { \frac { \vert o [ i { { ] \vert } ^2 } } { \max \left( { \vert { f_ { A , B } } ( i ) \vert , \vert { f_ { B , A } } ( i ) \vert } \right) } } , \end{align*} \end{document}

FIG. 1.

Chromosome partitioning and the induced segmentation. Red domains \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A [ 2 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B [ 1 ]$$ \end{document} induce segment \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$o [ 2 ]$$ \end{document} marked with gray dashed lines. In this setting, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_{A , B}} ( 2 ) = A [ 2 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_{B , A}} ( 2 ) = B [ 1 ]$$ \end{document} .

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_{A , B}} ( i )$$ \end{document} is a function mapping \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$o [ i ]$$ \end{document} to subinterval in A, which induced \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$o [ i ]$$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\vert \cdot \vert$$ \end{document} denotes interval length, and n is the number of segments induced by partition A, B. The mentioned formulation of our metric results from two concerns: 1.

The ratio of overlap (segment) length divided by the length of longer subinterval (domain) that induced this overlap expresses the extent of rearrangement between a pair of domains.

The ratio of overlap length divided by the overall length of interval (N) expresses the weight of given match/mismatch—larger regions contribute more to overall score than smaller regions.

Theorem 1. The above-defined function is a metric.

The proof can be found in Supplementary Material. Hereunder we present some properties of function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d ( A , B )$$ \end{document} : 1.

Every change that is introduced has proportional impact to the length of the interval affected.

Inclusions, deletions, and shifts are treated equally.

We can compare domain sets of different sizes.

The mentioned properties help avoid problems, which arise in alternative TADs similarity measures, where one assesses similarity based on the number of TAD boundary overlaps across chromosome (see the following section, “Comparison with existing methods”).

2.2. Comparison with existing methods

To demonstrate advantage of our metric over boundaries overlap count approach, we simulated artificial sets of TADs. We wanted to ensure that TAD length resembles real data distribution. To achieve this, we used negative binomial distribution to fit TAD length distribution of human chromosome 1 embryonic stem cell (ESC) TADs determined using dynamic programming (DP) approach (Filippova et al., 2014). Then we used the fit to produce 600 TADs—the approximate number of TADs in a real data set. Next we generated match sets according to three scenarios:

FIG. 2.

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} -Matching scenarios (a) Boundaries in B are moved at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} bins left or right. (b) New boundary is inserted into B at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} bins to the left of existing boundary. (c) Domains in B are partitioned according to binary interval partitioning scheme, where n is the number of partitions of each initial TAD. TAD, topologically associating domain.

New boundary at \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} —new boundary is introduced at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} bins to the left of each TAD boundary, except of the first one (Fig. 2b).

Binary boundary additions— \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${2^n} - 1$$ \end{document} boundaries are introduced into every TAD according to binary interval partitioning scheme (Fig. 2c).

Moreover, each of the mentioned cases is replicated three times with increasing parameter \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} or n. Scenario 1 can represent a situation where structural organization of two chromosomes is almost identical; however, due to noisy character of data generating process (the Hi-C protocol), coordinates of detected TADs do not overlap. Scenario 2 can be considered either a noisy match (just like 1) or large mismatch. The former case may occur when a little interval such as non-TAD region [often produced when using directionality index method (Dixon et al., 2012) or DP approach (Filippova et al., 2014)] is located next to a large domain and both of them overlap with a large domain in corresponding TAD set. The latter is possible if two neighboring TADs of similar length overlap with a single large TAD—a situation indicating poor match. Scenario 3 can be considered as mismatch except for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$n = 0$$ \end{document} . In particular, we can imagine complete mismatch if a TAD in experiment A has length N and in experiment B there are N TADs, each of length 1.

In addition, we compare our metric with VI metric adopted to assess similarity of segmentation as described in Filippova et al. (2014).

For each scenario and possible parameter value, we report three numbers (Fig. 3): our distance (referred to as BP), Jaccard index distance (referred to as JI), and VI distance. The JI distance is calculated by dividing the number of TAD boundary overlaps between A and B by the number of all boundaries (overlaps are counted once) and subtracting the resulting fraction from 1. Calculation of VI distance is described in Supplementary Material.

FIG. 3.

Comparison with JI and VI distance. The captions report the type of metric used to compare partitions. The VI distance is scaled by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mathop { \log } \nolimits_2 N$$ \end{document} , so it ranges from 0 to 1. Both BP and VI distance capture the difference between match scenarios as well as between intensities within scenario. JI distance fails to recover true similarity of partitions indicating highest dissimilarities in case of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} shift. BP, bipartite distance; JI, Jaccard index; VI, variation of information.

When the true location of TAD boundaries is shifted ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} -shift and new boundary at \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} ), the BP score distance, similarly to the VI distance, outperforms the JI distance in quantifying the true chromosome segmentation similarity. The case of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} -shift matching presents a situation where the little shift in location of boundaries between two partitions is a result of randomness in Hi-C data protocol (like Hi-C data sets of two technical replicates of the same cell line). In contrast, binary boundary additions case illustrates a complete mismatch of two domain data sets. Therefore, higher distance of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} -shift than binary boundary additions scenario as is the case for JI exhibits the limitations of this method and supports the choice of BP or VI metric.

2.3. Differentiating technical from biological replicates data

To demonstrate usefulness of our metric, we present a case where one might be interested in assessing structural similarities genomewide for multiple samples. We selected publicly available sets of Hi-C data consisting of six cell lines: ESC, mesenchymal stem cell (MSC), mesendoderm cell (MDC), neural progenitor cell (NPC), trophoblast-like cell (TDC), and IMR90 (Dixon et al., 2012, 2015). ESC line was available in two samples from separate experiments (Dixon et al., 2012, 2015), each of which (as well as remaining cell lines) in two technical replicates. We processed each data set using Armatus software (Filippova et al., 2014) to produce sets of TADs. Finally, we made pairwise comparisons for every pair of TAD sets separately for each chromosome. Figure 4a illustrates comparisons divided on two categories: within cell type and between cell types for BP metric (left), JI metric (center), and VI metric (right). As can be seen, all metrics discriminate between two groups under consideration, that is, the scores of within cell type comparisons are consistently lower than between cell type comparisons.

FIG. 4.

Pairwise comparisons of real Hi-C data sets with different metrics. Each boxplot represents a distribution of distances between pairs of partitions of every chromosome (calculated individually) assigned to different groups. (a) Pairs of partitions are assigned to either within replicates or between cells group. (b) Same as in (a) but different laboratories. ESC replicate pairs as well as ESC vs MDC pairs were pulled out to separate groups. Shown is scaled VI distance. ESC, embryonic stem cell; MDC, mesendoderm cell.

However, the distributions are substantially overlapping. We have looked more deeply into this issue and it seems that the majority of unexpected values can be attributed to ESC lines from different laboratories (higher variability) and to ESC versus MDC comparisons (lower variability than expected). After putting these pairs in separate groups (Fig. 4b), it seems more plausible to put them into different categories than initially, since ESC versus MDC comparisons seem to be drawn from the within cell type distribution, whereas ESC cells between laboratories exhibit much higher variance. This is consistent with the reports on the differential chromatin organization between cell lines presented in Dixon et al. (2015).

Although all metrics give significant discrimination, we note that the significance of the difference, as well as the separation between the two groups, is better for the VI and BP scores than for the JI. Of the two metrics with better separation, BP score is giving slightly lower p value; however, this is unlikely to be a difference of practical importance. Similar behavior is observed when comparing individual chromosomes (Supplementary Fig. S2).

2.4. Local BP score

Another potentially important application of comparing chromosomal segmentations is the search for locally reorganized fragments induced by two partitions for selected fragment of a chromosome. For example, to check whether differential gene expression or methylation pattern correlates with segmentation pattern. For this kind of analysis, we suggest to use local BP score—a measure associated with our metric. Local BP score of segment i is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} d_ { A , B } ^ { { \bf { BP } } } \left( i \right) = 1 - { \frac { \vert o [ i ] \vert } { \max \left( { \vert { f_ { A , B } } ( i ) \vert , \vert { f_ { B , A } } ( i ) \vert } \right) } } . \tag { 1 } \end{align*} \end{document}

Local BP score satisfies \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$0 \le {d_{A , B}} \left( i \right) < 1$$ \end{document} . The closer it is to 0, the larger is the overlap between \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_{A , B}} ( i )$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_{B , A}} ( i )$$ \end{document} . Conversely, the closer it is to 1, the higher the mismatch. Value of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$0.5$$ \end{document} may represent a boundary situation between match and mismatch case. Figure 5a illustrates segmentation of human chr1: 4000–12,840 kb between ESC and MSC cells. Bottom axis ticks represent bins, top triangles are TAD boundaries from MSC, and bottom triangles are TAD boundaries belonging to ESC. Black, dashed vertical lines mark partitions, which increase matching (TAD boundaries in both sets overlap), whereas gray dashed vertical lines denote partitions increasing mismatch.

FIG. 5.

Local BP and MI scores. (a) Illustration of partitioning between human ESC and MSC TADs of chr1: 4000–12,840 kbp region (40 kb resolution) with local BP score. Green indicates match between two domains, whereas red illustrates mismatch. (b) Same as in (a), but color scale quantifies match between two domains and the segment length.

Although Filippova et al. (2014) only consider the VI measure for global segmentation comparisons and do not discuss any local version of this measure, we thought that a natural measure of local segmentation related to VI would be local mutual information (MI) expressed with the following formula: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} d_ { A , B } ^ { { \bf { MI } } } \left( i \right) = - { p_ { A , B } } ( i ) \cdot \mathop { \log } \nolimits_2 \left( { { \frac { { p_ { A , B } } ( i ) } { { p_A } ( i ) \cdot { p_B } ( i ) } } } \right) , \tag { 2 } \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$ { p_A } ( i ) = \displaystyle \frac { { \vert { f_ { A , B } } ( i ) \vert } } { N } $$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$ { p_B } ( i ) = \displaystyle \frac { { \vert { f_ { B , A } } ( i ) \vert } } { N } $$ \end{document} , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$ { p_ { A , B } } ( i ) = \displaystyle \frac { { \vert { o_ { A , B } } [ i ] \vert } } { N } $$ \end{document} . In local MI score, segments tend to be first ordered by their size (descending) and second by similarity of domains, which induce them (from match to mismatch), see Figure 5b. In our view, the behavior of the local BP score is more readily interpretable, as, for example, the perfectly matching segments 22–28 shown in Figure 5a are consistently scored by the local BP score, whereas the local MI scores some of them (e.g., segment 25 shown in Fig. 5b) as similar to rearranged segments (like those in the 14–20 range).

Based on hypothesis that TADs define regulatory landscapes and may limit activity of certain regulatory elements (Andrey et al., 2013; Nora et al., 2012), we decided to check whether there is a correlation between local score and differential gene expression or methylation pattern genome wide. To validate our prediction, we derived differential gene expression and methylation data from publicly available resources. We then created all by all pairing of Hi-C data sets for available cells, and for each pair we assigned local BP and MI scores and fold change of gene expression or methylation for every gene. Owing to large number of genes, we aggregated genes into 30 quantiles based on the local score value to reduce noise. Then for each group, we plotted its median quantile local score versus median fold change. This revealed correlations between gene expression or methylation fold change and the local domain rearrangement score (Supplementary Figs. S3 and S4).

As can be seen, the relationship is not strong, although visible and significant most of the time. In 8 out of 15 cases when using local BP score and 12 out of 15 cases when using local MI score, the relationship between local score and gene expression fold change (measured with Pearson correlation) has p value <0.05 (Supplementary Table S1). The trend is even more pronounced between methylation fold change versus local score, in which situation all 20 cases have significant relationship when using local BP score, and there are 16 significant cases when using local MI score (Supplementary Table S2).

3. Discussion

In this article, we have considered the problem of comparisons between different segmentations of chromosomes into TADs. Currently, such comparisons are mainly based on comparing the locations of TAD boundaries and quantification of discrepancies between two sets of such locations. This is usually done using the Jaccard coefficient or some analogous function. We show in this study that under a number of realistic scenarios, where the differences between TAD segmentations are driven by experimental noise, this is not an effective strategy. By utilizing different \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} perturbations of TAD segmentations, we can show that JI is a poor technique for discerning significantly changed segmentations from those with minor perturbations.

We provide an alternative scoring scheme, so called BP score, deriving its name from the bi-partite segment overlap graph. This score has several advantages over the JI. First, it is naturally scaling with the size of chromosomes and the resolution at which the segmentation is done, that is any change in the TAD structure will contribute to the metric proportionally to its size. In addition, the method treats equally all kinds of changes (insertions, deletions, and shifts of domain boundaries) and can be meaningfully used to compare domain sets with different granularity, that is, containing significantly different number of domains. On the technical level, the BP score is a proper distance metric, which has advantages for applications that need symmetry or triangle inequality, including clustering methods. The global BP metric is similar in the applicability to the VI metric suggested for the task earlier (Filippova et al., 2014). Not only are these methods giving more sensible results than JI in the three simulated scenarios of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\varepsilon$$ \end{document} -matchings, but are also giving us a better discernibility between segmentations derived from experimental Hi-C replicates and actual segmentations originating from different experimental conditions. This shows that the researchers trying to obtain a meaningful score for TAD segmentation divergence should have more granular results with BP and VI scores than with JI or similar approaches.

In addition to the global VI and BP scores, we propose the analysis of local versions of the two: local BP and local MI scores. The usefulness these local measures of domain rearrangement is supported by our analysis of experimental data. We show that in actual comparisons between different TAD segmentations performed on different cell lines, the regions with high local BP and local MI scores, that is, those identified by the BP and MI scores as the most rearranged, show the highest fold changes in gene expression and methylation changes. This might be very useful in uncovering the mechanisms connecting chromosome structure rearrangement to functional changes such as gene expression.

We provide an implementation of the BP, VI, and JI scores that allows all researchers to test them on their own data. The implementation of the BP score is relatively simple as it can be computed as a sum of local scores for each of the atomic segment induced from the two segmentations being compared. This allows also to identify local contributions to the BP score, therefore, leading to the identification of fragments of chromosomes that yield the most divergent parts of the chromosomes between the two segmentations under consideration. This might be specifically useful to the researches studying chromosomal domain dynamics, allowing them to focus their attention on parts of chromosomes that undergo the most dynamic TAD rearrangements.

4. Conclusion

All these results together show that the BP score is a valid metric for comparisons of TAD segmentations. It is superior to the widely used JI, and can be useful in many comparative studies of chromatin domains. Not only is the BP score a better metric than the JI for the global comparison of segmentations, but its local version can also give us an interesting measure of domain rearrangement that we show to be correlated with functional measurements.

We have focused our analyses of the BP score solely on its application to the TAD segmentations originating from Hi-C experiments. This is, however, not an exclusive application of this measure. Mathematically, there is nothing in the BP score that would limit its adoption in the field of other chromosome segmentation situations, such as haplotype inference, epigenetic domains, and many others.

Footnotes

Acknowledgment

This study was supported by the Polish National Science Centre grant decision No. (DEC 2015/16/W/NZ2/00314).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Andrey

, Montavon

, Mascrez

, et al. 2013. A switch between topological domains underlies HoxD genes collinearity in mouse limbs. Science. 340, 1234167.

Barutcu

A.R.

, Lajoie

B.R.

, McCord

R.P.

, et al. 2015. Chromatin interaction analysis reveals changes in small chromosome and telomere clustering between epithelial and breast cancer cells. Genome Biol. 16, 214.

Beagrie

R.A.

, Scialdone

, Schueler

, et al. 2017. Complex multi-enhancer contacts captured by genome architecture mapping. Nature. 543, 519.

Bianco

, Lupiáñez

D.G.

, Chiariello

A.M.

, et al. 2018. Polymer physics predicts the effects of structural variants on chromatin architecture. Nat. Genet. 50, 662–667.

Chandra

, Ewels

P.A.

, Schoenfelder

, et al. 2015. Global reorganization of the nuclear landscape in senescent cells. Cell Rep. 10, 471–483.

Crane

, Bian

, McCord

R.P.

, et al. 2015. Condensin-driven remodelling of x chromosome topology during dosage compensation. Nature. 523, 240.

Dekker

, and Heard

2015. Structural and functional diversity of topologically associating domains. FEBS Lett. 589(20 Part A), 2877–2884.

Dekker

, Rippe

, Dekker

, et al. 2002. Capturing chromosome conformation. Science. 295, 1306–1311.

Dixon

J.R.

, Jung

, Selvaraj

, et al. 2015. Chromatin architecture reorganization during stem cell differentiation. Nature. 518, 331.

10.

Dixon

J.R.

, Selvaraj

, Yue

, et al. 2012. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 485, 376.

11.

Dostie

, Richmond

T.A.

, Arnaout

R.A.

, et al. 2006. Chromosome conformation capture carbon copy (5c): A massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, 1299–1309.

12.

Feng

, Cokus

S.J.

, Schubert

, et al. 2014. Genome-wide Hi-C analyses in wild-type and mutants reveal high-resolution chromatin interactions in arabidopsis. Mol. Cell. 55, 694–707.

13.

Filippova

, Patro

, Duggal

, et al. 2014. Identification of alternative topological domains in chromatin. Algorithms Mol. Biol. 9, 14.

14.

Flyamer

I.M.

, Gassler

, Imakaev

, et al. 2017. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature. 544, 110.

15.

Fraser

, Ferrai

, Chiariello

A.M.

, et al. 2015. Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Mol. Syst. Biol. 11, 852.

16.

Imakaev

, Fudenberg

, McCord

R.P.

, et al. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods. 9, 999.

17.

Le Dily

, Baù

, Pohl

, et al. 2014. Distinct structural transitions of chromatin topological domains correlate with coordinated hormone-induced gene regulation. Genes Dev. 28, 2151–2162.

18.

Lieberman-Aiden

, Van Berkum

N.L.

, Williams

, et al. 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 326, 289–293.

19.

Lupiáñez

D.G.

, Kraft

, Heinrich

, et al. 2015. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell. 161, 1012–1025.

20.

Nagano

, Lubling

, Stevens

T.J.

, et al. 2013. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 502, 59.

21.

Nagano

, Lubling

, Várnai

, et al. 2017. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature. 547, 61.

22.

Nora

E.P.

, Lajoie

B.R.

, Schulz

E.G.

, et al. 2012. Spatial partitioning of the regulatory landscape of the x-inactivation centre. Nature. 485, 381.

23.

Rao

S.S.

, Huntley

M.H.

, Durand

N.C.

, et al. 2014. A 3 d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 159, 1665–1680.

24.

Simonis

, Klous

, Splinter

, et al. 2006. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4c). Nat. Genet. 38, 1348.

25.

Song

, Zhang

, Grasfeder

L.L.

, et al. 2011. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757–1767.

26.

Wilczynski

, and Furlong

E.E.

2010. Challenges for modeling global gene regulatory networks during development: Insights from drosophila. Dev. Biol. 340, 161–169.

27.

Yaffe

, and Tanay

2011. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, 1059.

28.

Zhao

, Tavoosidana

, Sjölinder

, et al. 2006. Circular chromosome conformation capture (4c) uncovers extensive networks of epigenetically regulated intra-and interchromosomal interactions. Nat. Genet. 38, 1341.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

4.23 MB