Orthology and Paralogy Relationships at Transcript Level

Abstract

Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.

1. INTRODUCTION

The importance and extent of alternative splicing in transcriptome diversity were revealed in the post-genomic era (Harrow et al., 2012). It allows the production of distinct transcripts from a gene. Over the past decade, the number of alternatively spliced genes and alternative transcripts annotated in eukaryotic organisms has increased dramatically (Zerbino et al., 2018). It has now been established that alternative splicing was likely a feature of the eukaryotes' common ancestor.

Understanding how transcripts evolve helps in the study of the functional diversity of genes because transcript evolution is closely related to gene regulation. In this context, comparative transcriptomics is useful to trace the evolutionary relationships between transcripts, genes, and species. Understanding how transcripts diverge or remain conserved across species provides insights into the evolution of biological complexity. Moreover, identifying conserved transcripts and understanding their roles can aid in annotating genes and predicting the functions of uncharacterized genomic elements.

Recently, several methods have been developed to identify conserved transcripts annotated in orthologous genes. These methods identify splicing orthologous transcripts between genes, defined as alternative transcripts of orthologous genes composed of orthologous exons (Blanquart et al., 2016; Guillaudeux et al., 2022; Jammali et al., 2019; Ouangraoua et al., 2012; Zambelli et al., 2010). Other studies have proposed various models of transcript evolution with associated algorithms to reconstruct transcript phylogenies using parsimony-based tree search methods (Ait-Hamlat et al., 2020; Christinat and Moret, 2013; Christinat and Moret, 2012) or supertree methods (Kuitche et al., 2019; Kuitche et al., 2017b). However, several questions about the evolution of sets of alternative transcripts in a gene family remain open (Keren et al., 2010).

For example, does transcript conservation occur more often between orthologous genes than between paralogous genes? How are new transcript isoforms created during evolution? How can the conserved counterpart of a transcript be identified in other homologous genes? Where was a transcript gained or lost during evolution? Moreover, apart from methods that compare transcripts of orthologous genes, no method exists to compare all transcripts of all homologous genes in a gene family, including transcripts from paralogous genes, using similarity measures to identify conserved transcripts. Furthermore, there is a need for a framework to classify transcript homology types, analogous to the existing framework for classifying gene homology types. Beyond allowing a better understanding of alternative transcript evolution, the prediction of orthologous and paralogous transcripts has other important potential applications. It can be useful for gene orthology inference and gene tree correction, as well as gene function prediction.

In this study, we present a framework for the definition of orthology and paralogy relations between homologous transcripts, originating from the same ancestral transcript in a gene family. Note, however, that the transcripts of a gene family are not necessarily all related by a single transcript tree, but possibly by multiple transcript trees. Each tree then depicts the phylogeny of a set of homologous transcripts originating from an ancestral transcript present in one ancestral gene. Sets of homologous transcripts can be obtained by grouping the transcripts of a gene family into sets of similar transcripts. The orthology and paralogy model at the transcript level is an extension of the gene tree–species tree reconciliation model, which allows defining orthology, paralogy, and isoorthology relations at the gene level.

Isoorthologous genes are the least divergent orthologs that have retained the function of their lowest common ancestor (LCA) (Li et al., 2003; Swenson and El-Mabrouk, 2012). Based on our model, we devise an algorithm for the inference of groups of conserved isoorthologous and paralogous transcripts. The algorithm relies on transcript pairwise similarity scores to identify pairs of conserved recent paralogs and isoorthologs through a Reciprocal Best Hit (RBH) approach. The pairwise relations are then combined into ortholog groups comprising pairs of isoorthologs, recent paralog transcripts, or pairs of transcripts related by isoorthology and recent paralogy relations through transitivity.

In Section 2, we give the definitions and notations required in the article. In Section 3, we present our graph-based algorithm to infer ortholog groups. Section 4 presents the results of the application of the methods to transcripts of simulated data sets generated using SimSpliceEvol (Kuitche et al., 2019), and real gene families from the Ensembl-Compara database (Zerbino et al., 2018).

The results obtained for the gene families are available at https://github.com/UdeS-CoBIUS/TranscriptOrthology.

2. PRELIMINARIES

In this section, we introduce the definitions and notations required in the next sections. We start with general definitions of rooted trees. Species trees, gene trees, and transcript trees are then defined, followed by the definitions of reconciliations between two phylogenetic trees of different levels. Next, the different types of homology relationships at the gene and transcript levels are defined. The section ends with the definitions of the partition of a set of transcripts into ortholog groups and the decomposition of a transcript tree into ortholog subtrees.

All trees are considered rooted and binary. Given a tree P, $v (P)$ denotes the set of nodes of P, $l (P)$ its leafset, and $r (P)$ its root node. Let x be a node of P; the complete subtree of P rooted in x is denoted by $P [x]$ . Given another node y of P, x is an ancestor of y if y is a node of $P [x]$ . We denote by x_l and x_r the two children of x, if x is an internal node. Given a subset $L'$ of $l (P)$ , we denote by $l c a_{P} (L')$ the LCA in P of $L'$ . $P_{| L'}$ is the tree with leafset $L'$ obtained from the subtree $P [l c a_{P} (L')]$ by removing all leaves that are not in $L'$ , and then all internal nodes of degree 2, except the root. Given a tree $P'$ such that $l (P') \subseteq l (P)$ , we say that P displays $P'$ if and only if $P_{| l (P')}$ is isomorphic to $P'$ while preserving the same leaf-labeling. A tree for a set $Σ$ is a tree P such that $l (P) = Σ$ . Given a set of trees ${P_{i}, 1 \leq i \leq k}$ for subsets $Σ_{i}, 1 \leq i \leq k$ that form a partition of a leafset $Σ$ , a supertree of ${P_{i}, 1 \leq i \leq k}$ is a tree for $Σ$ that displays all trees $P_{i}, 1 \leq i \leq k$ .

We denote by $S$ a set of species, and by $G$ a set of homologous genes related by a single gene tree. $T$ denotes a set of homologous transcripts descending from the same ancestral transcript, that is, related by a single transcript tree. The following two functions relate the sets $S$ , $G$ , $T$ . $s : G \to S$ maps each gene to its corresponding species, and $g : T \to G$ maps each transcript to its corresponding gene such that ${g (t) : t \in T} = G$ and ${s (g) : g \in G} = S$ . The induced set function $g^{- 1}$ associates each gene with its set of corresponding transcripts. In the sequel, S denotes a species tree for $S$ whose internal nodes represent a partially ordered set of speciation events that have led to $S$ . G denotes a gene tree for $G$ whose internal nodes represent speciation and gene duplication events that have led to $G$ . T denotes a transcript tree for $T$ whose internal nodes represent speciation, gene duplication, and transcript creation events that have led to $T$ . We extend the mapping functions s from $v (G)$ to $v (S)$ , and g from $v (T)$ to $v (G)$ as follows: for any node g in $v (G)$ , $s (g) = l c a_{S} ({s (g') : g' \in l (G [g])})$ , and for any node t in $v (T)$ , $g (t) = l c a_{G} ({g (t') : t' \in l (T [t])})$ .

The Duplication–Loss model of gene evolution describes the evolutionary history of a set of homologous genes, with genes undergoing speciation events when the corresponding species diverge, duplication events when a gene is copied into a new one within a species, and loss events when a gene is lost in a species. Similarly, the Creation–Loss (CL) model of transcript evolution reflects the evolutionary history of a set of homologous transcripts. In the CL model, transcripts are subjected to speciation and duplication events at the same time as their corresponding genes, creation events when a new transcript is generated, and loss events when a gene ceases to produce a transcript. The reconciliation of a gene tree with a species tree allows for labeling the internal nodes of the gene tree as speciation or gene duplication events. Similarly, the internal nodes of a transcript tree can be labeled as creation, speciation, or gene duplication events based on the reconciliation of the transcript tree with a gene tree. Here, we consider the LCA-reconciliation based on the LCA mapping (Definition 1). A well-known reconciliation approach (Zhang, 1997) further allows the locations of the loss events implied by such a reconciliation history to be recovered in linear time.

Definition 1 (LCA-reconciliation at the gene and transcript levels). The LCA-reconciliation of G and S is a function $r e c_{G} : v (G) ∖ G \to {S p e, D u p}$ that labels any internal node g of G as a speciation (Spe) if $s (g) \neq s (g_{l})$ and $s (g) \neq s (g_{r})$ , and as a duplication (Dup) otherwise. The LCA-reconciliation cost of G and S is the number of gene duplications and losses underlined by $r e c_{G}$ .

Similarly, the LCA-reconciliation of T and G is a function $r e c_{T} : v (T) ∖ T \to {S p e, D u p, C r e}$ that labels any internal node t of T as a creation (Cre) if $g (t) = g (t_{l})$ or $g (t) = g (t_{r})$ , otherwise as a duplication (Dup) if $r e c_{G} (g (t)) = D u p$ , and as a speciation (Spe) if $r e c_{G} (g (t)) = S p e$ . The LCA-reconciliation cost of T and G is the number of transcript creations and losses underlined by $r e c_{T}$ .

Figure 1 provides an illustration for Definition 1. $r e c_{G}$ is a reconciliation between G and S that minimizes the number of gene duplications and losses (Chauve and El-Mabrouk, 2009). $r e c_{T}$ is also a reconciliation between T and G that minimizes the number of transcript creations and losses (Kuitche et al., 2017b).
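The gene-level part of Definition 1 can be illustrated with a minimal sketch. It assumes trees are encoded as nested tuples with string leaves, and it finds the LCA in the species tree as the smallest species cluster containing a given set of species. The function names and the example gene tree topology are hypothetical: the topology is chosen only to agree with the gene relations stated in the text for Figure 1, and is not read from the figure.

```python
def clusters(tree):
    """Return the leafset of every node of a nested-tuple binary tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            cluster = walk(node[0]) | walk(node[1])
        found.append(cluster)
        return cluster

    walk(tree)
    return found


def lca_reconcile(gene_tree, species_tree, species_of):
    """Label each internal node of gene_tree as 'Spe' or 'Dup' (LCA mapping)."""
    species_clusters = clusters(species_tree)

    def lca_in_species_tree(species_set):
        # The LCA is the smallest cluster of S containing all the species.
        return min((c for c in species_clusters if species_set <= c), key=len)

    labels = {}

    def walk(node):
        # Returns (species found below node, node of S that node maps to).
        if isinstance(node, str):
            leaf = frozenset([species_of[node]])
            return leaf, leaf
        left_species, left_map = walk(node[0])
        right_species, right_map = walk(node[1])
        below = left_species | right_species
        mapped = lca_in_species_tree(below)
        is_speciation = mapped != left_map and mapped != right_map
        labels[node] = "Spe" if is_speciation else "Dup"
        return below, mapped

    walk(gene_tree)
    return labels


# Hypothetical example on species {a, b} and genes {a1, a2, a3, b1, b2}.
species_tree = ("a", "b")
gene_tree = ((("a1", "b1"), "a2"), ("a3", "b2"))
species_of = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b"}
labels = lca_reconcile(gene_tree, species_tree, species_of)
n_duplications = sum(label == "Dup" for label in labels.values())
```

In this toy input, the nodes above (a1, b1) and (a3, b2) are speciations, and the two remaining internal nodes are duplications. The same scheme would apply at the transcript level by first testing the creation condition of Definition 1.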

FIG. 1.

A species tree S on $S = {a, b}$ , a gene tree G on $G = {a_{1}, a_{2}, a_{3}, b_{1}, b_{2}}$ , and a transcript tree T on $T = {a_{11}, a_{21}, a_{22}, a_{31}, a_{32}, b_{11}, b_{12}, b_{21}, b_{22}}$ such that for any species $x \in S$ , gene $x_{i} \in G$ , and transcript $x_{i j} \in T$ , $s (x_{i}) = x$ and $g (x_{i j}) = x_{i}$ . Round nodes represent speciations, square nodes gene duplications, and triangle nodes transcript creations. Fictive dashed lines that end with a cross symbol mark the locations of losses in the LCA-reconciliation of G and S, and in the LCA-reconciliation of T and G. Divergence edges after creation nodes are represented as dotted lines. The five isoortholog groups of $T$ are displayed using different colors: ${a_{11}, a_{21}, b_{11}, a_{31}}$ in red, ${a_{22}}$ in magenta, ${b_{12}}$ in brown, ${a_{32}, b_{21}}$ in blue, and ${b_{22}}$ in green. The LCA-reconciliation cost of G and S equals 2 (2 duplications), and the LCA-reconciliation cost of T and G is 6 (4 creations + 2 losses). LCA, lowest common ancestor.

Orthology and paralogy are relationships defined for pairs of homologous genes through gene tree–species tree reconciliation. Orthology is the relationship between two genes where any two distinct ancestors taken at a point in time are always orthogonal, that is, they belong to distinct ancestral species. Paralogy is the relationship between two genes where there exist two distinct ancestors taken at a point in time that are parallel, that is, they belong to the same ancestral species. By considering the co-occurrence of transcript ancestors in the same ancestral genes, the definitions of orthology and paralogy can be extended to pairs of transcripts.

Definition 2 (Orthology, paralogy at the gene and transcript levels). Two distinct genes g ₁ and g ₂ of $G$ are:

orthologs if their LCA in G is a speciation, that is, $r e c_{G} (l c a_{G} ({g_{1}, g_{2}})) = S p e$ ;

recent paralogs if $r e c_{G} (l c a_{G} ({g_{1}, g_{2}})) = D u p$ and

ancient paralogs otherwise.

Similarly, two distinct transcripts t ₁ and t ₂ of $T$ are (Kuitche et al., 2017b):

ortho-orthologs if their LCA in T is a speciation, that is, $r e c_{T} (l c a_{T} ({t_{1}, t_{2}})) = S p e$ ;

para-orthologs if $r e c_{T} (l c a_{T} ({t_{1}, t_{2}})) = D u p$ ;

recent paralogs if $r e c_{T} (l c a_{T} ({t_{1}, t_{2}})) = C r e$ and

ancient paralogs otherwise.

For example, in Figure 1, $a_{11}$ and $b_{12}$ are ortho-orthologs, $a_{11}$ and $b_{21}$ are para-orthologs, $b_{21}$ and $b_{22}$ are recent paralogs, and $a_{11}$ and $a_{22}$ are ancient paralogs. Note that if all gene pairs are orthologs, then necessarily all pairs of orthologous transcripts are ortho-orthologs.

The following lemma clarifies the link between the homology relationships defined at the levels of genes and transcripts.

Lemma 1 (Link between homology relationships at the gene and transcript levels). If two transcripts t ₁ and t ₂ are ortho-orthologs, then the genes $g (t_{1})$ and $g (t_{2})$ are orthologs, and if t ₁ and t ₂ are para-orthologs, then $g (t_{1})$ and $g (t_{2})$ are paralogs. If t ₁ and t ₂ are recent paralogs, then $g (t_{1}) = g (t_{2})$ . None of the converse statements are true.

Proof. Let $t = l c a_{T} ({t_{1}, t_{2}})$ . If t ₁ and t ₂ are ortho-orthologs or para-orthologs, then t is not a creation node, so $g (t) \neq g (t_{l})$ and $g (t) \neq g (t_{r})$ . Therefore, $l c a_{G} (g (t_{1}), g (t_{2})) = g (t)$ , and then $r e c_{G} (l c a_{G} (g (t_{1}), g (t_{2}))) = r e c_{G} (g (t))$ , which is a speciation if t ₁ and t ₂ are ortho-orthologs, and a duplication otherwise. If t ₁ and t ₂ are recent paralogs, then any leaf $t'$ of the subtree $T [t]$ must satisfy $g (t') = g (t)$ . Therefore, $g (t_{1}) = g (t_{2})$ .

Figure 1 provides an example where a₃ and b₂ (respectively, a₁ and a₂) are orthologous (respectively, paralogous) genes, yet their transcripts $a_{31}$ and $b_{21}$ (respectively, $a_{11}$ and $a_{22}$ ) are not ortho-orthologs (respectively, para-orthologs). Similarly, $g (a_{21}) = g (a_{22}) = a_{2}$ , but $a_{21}$ and $a_{22}$ are not recent paralogs.

The aim of introducing the definitions ortho-orthologs and para-orthologs is to distinguish the orthology relationship at the transcript level from the homology relationship between the host genes. Thus, ortho-orthologs are orthologous transcripts that belong to orthologous genes, while para-orthologs belong to paralogous genes. Introducing this difference allows us to assess the ortholog conjecture at the transcript level, that is, to compare the conservation of transcripts between orthologous genes and between paralogous genes. The ortholog conjecture assumes that orthologous genes evolve functions more slowly than paralogous genes. Thus, it assumes that the proportion of orthologous transcripts between orthologous genes should be higher than that between paralogous genes.

The following lemma shows that two recent paralogous transcripts in $T$ must have the same type of homology relationship with any other transcript in $T$ .

Lemma 2 (Link between recent paralogy and orthology). (1) If three transcripts t ₁, t ₂, and t ₃ are such that t ₁ and t ₂ are recent paralogs, and t ₁ and t ₃ are ortho-orthologs (respectively, para-orthologs), then t ₂ and t ₃ are ortho-orthologs (respectively, para-orthologs). (2) If t ₁ and t ₃ are recent paralogs, and t ₂ and t ₃ are recent paralogs, then t ₁ and t ₂ are also recent paralogs.

Proof. If t ₁, t ₂, and t ₃ are in the configuration of (1), then they must satisfy $l c a_{T} ({t_{2}, t_{3}}) = l c a_{T} ({t_{1}, t_{3}})$ . Therefore, t ₂ and t ₃ have the same relation as t ₁ and t ₃. (2) is trivial.

The key assumption of our method is that, after a transcript creation event, the newly created transcript tends to diverge from the original transcript from which it was derived, whereas the original transcript tends to remain conserved. This assumption mirrors the one used by graph-based methods for inferring gene ortholog groups, in which isoorthologous gene pairs are considered (Altenhoff et al., 2013; Lafond et al., 2018; Li et al., 2003; Swenson and El-Mabrouk, 2012). Note that the conservation or divergence between transcripts is assessed by comparing the exon content of the transcripts, not their nucleotide sequences.

Therefore, given a creation node t in the LCA-reconciliation of T with G, one of its edges descending to its children, say $(t, t_{l})$ without loss of generality, corresponds to the conserved original transcript, whereas the other edge $(t, t_{r})$ corresponds to the newly created divergent transcript. In this case, we call $(t, t_{l})$ a conservation edge, whereas $(t, t_{r})$ is called a divergence edge. For example, in Figure 1, the divergence edges after creation nodes appear as dotted lines. In the CL model of transcript evolution, if one child branch is lost after a speciation, duplication, or creation node, this node can be deleted by merging the remaining two edges into a single edge. The new edge must be a divergence edge if at least one of the two initial edges was a divergence edge. Distinguishing conservation edges from divergence edges after a creation node allows us to define a particular type of orthology relation between transcripts.

Definition 3 (Isoorthology at the transcript level). Two ortho- (respectively, para-) orthologous transcripts t ₁ and t ₂ of $T$ are ortho- (respectively, para-) isoorthologs if there are no divergence edges on the path between t ₁ and t ₂ in T.
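The path condition of Definition 3 can be checked with a short sketch. It assumes the transcript tree is given as a child-to-parent dictionary and that each divergence edge is identified by its child endpoint. Both encodings, the function names, and the toy tree are hypothetical. The sketch tests only the absence of divergence edges on the path. Whether the two transcripts are ortho- or para-orthologs must be checked separately from the label of their LCA.

```python
def path_edges(parent, t1, t2):
    """Return the child endpoints of the edges on the path between t1 and t2."""
    ancestors_of_t1 = []
    node = t1
    while node is not None:
        ancestors_of_t1.append(node)
        node = parent.get(node)
    seen = set(ancestors_of_t1)

    edges = []
    node = t2
    while node not in seen:          # climb from t2 up to the LCA
        edges.append(node)
        node = parent[node]
    lca = node
    for node in ancestors_of_t1:     # climb from t1 up to the LCA
        if node == lca:
            break
        edges.append(node)
    return edges


def no_divergence_on_path(parent, divergence, t1, t2):
    """True if no edge on the path between t1 and t2 is a divergence edge."""
    return not any(edge in divergence for edge in path_edges(parent, t1, t2))


# Hypothetical tree: root r has children x and b1; the creation node x has
# children a1 (conservation edge) and a2 (divergence edge).
parent = {"a1": "x", "a2": "x", "x": "r", "b1": "r", "r": None}
divergence = {"a2"}
conserved_pair = no_divergence_on_path(parent, divergence, "a1", "b1")
diverged_pair = no_divergence_on_path(parent, divergence, "a2", "b1")
```

Under these assumptions, a1 and b1 satisfy the path condition, whereas a2 and b1 do not because the edge above a2 is a divergence edge.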

Notice that the isoorthology relation is transitive, which allows $T$ to be partitioned into ortholog groups as follows:

Definition 4 (Ortholog groups at the transcript level). An ortholog group $O$ of $T$ is a subset of $T$ such that any two distinct transcripts t ₁ and t ₂ belonging to $O$ are isoorthologs (i.e., ortho- or para-isoorthologs), recent paralogs, or there exist two transcripts $t'_{1}$ and $t'_{2}$ in $O$ such that $t_{1} = t'_{1}$ or t ₁ and $t'_{1}$ are recent paralogs, $t_{2} = t'_{2}$ or t ₂ and $t'_{2}$ are recent paralogs, and $t'_{1}$ and $t'_{2}$ are isoorthologs. $O (T)$ denotes the partition of $T$ into a set of maximum inclusive-wise ortholog groups.

For example, in Figure 1, ${a_{11}, a_{21}, b_{11}, b_{12}, a_{31}}$ , ${a_{22}}$ , and ${a_{32}, b_{21}, b_{22}}$ are the maximum inclusive-wise ortholog groups of $T$ .

Based on the partition of $T$ into ortholog groups, the transcript tree T can be decomposed into a set of ortholog subtrees as follows:

Definition 5 (Decomposition of the transcript tree into ortholog trees). The set $O (T) = {O_{i}, 1 \leq i \leq k}$ of ortholog groups of $T$ defines a decomposition of the transcript tree T into a set of ortholog trees ${T_{i}, 1 \leq i \leq k}$ such that for each $1 \leq i \leq k$ , $T_{i} = T_{| O_{i}}$ .

For example, in Figure 2, a decomposition of the transcript tree T of Figure 1 into 3 ortholog trees $T_{1}, T_{2}, T_{3}$ is depicted. From Definition 5, it is easy to see that the transcript tree T is a supertree for the set of ortholog trees ${T_{i}, 1 \leq i \leq k}$ .

FIG. 2.

Decomposition of the transcript tree T into three ortholog trees T₁, T₂, and T₃ (respectively, illustrated in red, blue, and green colors) for a set of homologous transcripts ${a_{11}, a_{21}, a_{22}, a_{31}, a_{32}, b_{11}, b_{12}, b_{21}, b_{22}}$ . The tree T is a supertree of T₁, T₂, and T₃. Round internal nodes represent speciations, whereas square internal nodes represent gene duplications, and triangular internal nodes represent transcript creations. A dotted edge represents the link between two ortholog trees.

3. INFERRING ISOORTHOLOGY AND RECENT PARALOGY RELATIONS

We now present a graph-based method for the inference of isoorthology and recent paralogy relations between the transcripts of a set $T$ of homologous transcripts. The method is a heuristic that relies on a pairwise similarity measure between transcripts to infer the set $O (T)$ of maximum inclusive-wise ortholog groups of $T$ . We start by providing the definition of similarity scores between two transcripts based on a block-based representation of genes and transcripts. Next, we describe an algorithm for inferring ortholog groups through the construction of an orthology graph for $T$ .

3.1. Pairwise similarity score between transcripts

A gene $g \in G$ is a DNA sequence on the alphabet of nucleotides $Σ = {A, C, G, T}$ . A transcript t of g (i.e., $t \in g^{- 1} (g)$ ) is a subsequence of g obtained by concatenating an ordered set of substrings of g such that each substring is an exon of g that is present in t . The transcribed subsequence of a gene $g \in G$ , denoted by $\hat{g}$ , is the subsequence of g obtained by deleting from g any nucleotide that is absent from all transcripts of g . $\hat{G}$ denotes the set of transcribed subsequences of all genes in G . Figure 3 shows an illustration.

FIG. 3.

(A) A gene $g 1$ with its transcripts $t 11$ and $t 12$ . A gene $g 2$ with its transcripts $t 21$ and $t 22$ . The character “*” is used to represent the nucleotides that are located in an intron, an untranscribed or an untranslated region of the gene. (B) The transcribed subsequences $\hat{g} 1$ and $\hat{g} 2$ corresponding to genes $g 1$ and $g 2$ . (C) A multiple sequence alignment of $\hat{g} 1$ , $\hat{g} 2$ , $t 11$ , $t 12$ , $t 21$ , $t 22$ , decomposed into nine blocks.

A multiple sequence alignment $A$ of all the transcribed subsequences in $\hat{G}$ and all the transcripts in $T$ is obtained by first computing a multiple sequence alignment M of the transcribed subsequences in $\hat{G}$ , and then mapping each transcript $t \in T$ on its corresponding transcribed subsequence within M to obtain the resulting alignment $A$ .

In the remainder, we denote by $A$ a multiple sequence alignment of the transcribed subsequences in $\hat{G}$ and all the transcript sequences in $T$ . $A$ is represented as an $n \times m$ matrix such that $n = | \hat{G} | + | T |$ and m is the number of columns of the alignment.

Following the block-based model used in Ouangraoua et al. (2012) to represent transcripts, a multiple sequence alignment $A$ of $T$ and $\hat{G}$ is partitioned into a set of non-overlapping blocks of columns as follows:

Definition 6 (Decomposition of multiple sequence alignment). Let $A$ be a multiple sequence alignment of $T$ and $\hat{G}$ . $A_{b}$ denotes the binary matrix of same dimension as $A$ such that each nucleotide A, C, G, or T in $A$ is replaced by 1 in $A_{b}$ , and each gap character “-” is replaced by 0. A block of $A$ is a set of consecutive columns of $A$ which correspond to a maximum inclusive-wise set of consecutive columns of $A_{b}$ which are equal.

For any block B of $A$ , $α (B)$ denotes a positive number representing the weight of the block B.

For example, in Figure 3, the alignment is decomposed into 9 blocks.

The following lemma and definition allow us to define a block-based representation for transcripts and genes, such that a transcript or gene is represented as the chain of blocks composing it.

Lemma 3 (Aligned sequences in blocks). For any aligned sequence $t'$ in $A$ and any block B of $A$ , $t'$ contains either only nucleotides, or only gaps in B.

Proof. Trivial, by definition of blocks.

Definition 7 (Block-based representation of transcripts and genes). Given the ordered set of blocks defined by the partition of $A$ , for each transcript $t \in T$ , $ℬ (t)$ denotes the ordered subset of blocks in which the aligned sequence $t'$ corresponding to t contains nucleotides.

Similarly, for each gene $g \in G$ , $ℬ (g)$ denotes the ordered subset of blocks in which the aligned sequence $g'$ corresponding to $\hat{g}$ contains nucleotides.
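Definitions 6 and 7 can be sketched as follows, assuming the alignment is given as a list of equal-length strings that use "-" for gaps. The function names and the toy alignment are hypothetical and serve only to illustrate the grouping of consecutive columns that share the same presence/absence pattern.

```python
def decompose_into_blocks(alignment):
    """Split alignment columns into maximal runs with equal binary patterns.

    Returns a list of (start, end) column ranges, with end exclusive.
    """
    n_cols = len(alignment[0])
    # Column j of the binary matrix: True for a nucleotide, False for a gap.
    patterns = [tuple(row[j] != "-" for row in alignment) for j in range(n_cols)]
    blocks = []
    start = 0
    for j in range(1, n_cols + 1):
        if j == n_cols or patterns[j] != patterns[start]:
            blocks.append((start, j))
            start = j
    return blocks


def block_representation(alignment, blocks):
    """For each aligned sequence, list the indices of the blocks it occupies."""
    # By Lemma 3, testing the first column of a block is sufficient.
    return [
        [i for i, (start, _) in enumerate(blocks) if row[start] != "-"]
        for row in alignment
    ]


# Hypothetical toy alignment of three sequences over six columns.
alignment = [
    "ACGTAC",
    "AC--AC",
    "--GTAC",
]
blocks = decompose_into_blocks(alignment)
representation = block_representation(alignment, blocks)
```

On this toy input, the six columns fall into three blocks of two columns each, and the second sequence is represented by the first and third blocks only.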

The following lemma clarifies the link between the block-based representation of a gene and the block-based representations of its transcripts: the blocks composing the representation of a gene are the union of the blocks composing the representations of its transcripts.

Lemma 4 (Link between representations of transcripts and genes). For any gene $g \in G$ , $ℬ (g)$ contains all blocks in which at least one aligned sequence $t'$ corresponding to a transcript t of g contains nucleotides.

Proof. Let B be a block contained in $ℬ (g)$ . Then, the transcribed subsequence $\hat{g}$ contains a segment of nucleotides in B. Therefore, there exists at least one transcript t of g , which contains this segment, and then the corresponding aligned sequence $t'$ contains nucleotides in B. Conversely, any block containing nucleotides from a transcript of g belongs necessarily to $ℬ (g)$ .

We are now ready to provide the definition of two transcript similarity measures. The first similarity measure considers all blocks present in the representation of either gene. The second similarity measure only considers blocks that are present in the representations of both genes.

Definition 8 (Pairwise transcript similarity). Let t₁ and t₂ be two distinct transcripts in $T$ . Consider the set of blocks shared by t₁ and t₂, $ℬ ℐ (t_{1}, t_{2}) = ℬ (t_{1}) \cap ℬ (t_{2})$ , the set of blocks $ℬ U (t_{1}, t_{2}) = ℬ (t_{1}) \cup ℬ (t_{2})$ , and the set of blocks $ℬ U_{+} (t_{1}, t_{2}) = (ℬ (t_{1}) \cap ℬ (g (t_{2}))) \cup (ℬ (t_{2}) \cap ℬ (g (t_{1})))$ , which contains the blocks of t₁ shared with $g (t_{2})$ and the blocks of t₂ shared with $g (t_{1})$ . The similarity score between t₁ and t₂ equals: $t s m (t_{1}, t_{2}) = \frac{\sum_{B \in ℬ ℐ (t_{1}, t_{2})} α (B)}{\sum_{B \in ℬ U (t_{1}, t_{2})} α (B)}$

The corrected similarity score between t ₁ and t ₂ equals: $t s m_{+} (t_{1}, t_{2}) = \frac{\sum_{B \in ℬ ℐ (t_{1}, t_{2})} α (B)}{\sum_{B \in ℬ U_{+} (t_{1}, t_{2})} α (B)}$

If the weights associated with the blocks are unitary, the similarity score between two transcripts t₁ and t₂ is the ratio between the number of blocks shared by the two transcripts and the number of blocks present in at least one of the two transcripts. For the corrected similarity score, however, the blocks contained in the symmetric difference of $ℬ (t_{1})$ and $ℬ (t_{2})$ are counted in the denominator only if they belong to both $ℬ (g_{1})$ and $ℬ (g_{2})$ , where $g_{1} = g (t_{1})$ and $g_{2} = g (t_{2})$ . This correction accounts for differences at the transcript level only, while ignoring those at the gene level. So, if a block b is present in transcript t₁ but not in transcript t₂, this block b is considered in the formula only if it is present in gene g₂. In other words, both genes g₁ and g₂ have the block b, and the presence/absence difference arises only at the level of the transcripts t₁ and t₂. For instance, in the example provided in Figure 3, $t s m_{+} (t 11, t 21) = | {2, 3, 4, 6, 8} | ∕ | {2, 3, 4, 6, 8} | = 1$ even though the two transcripts do not share block 7. If the weight associated with a block is its length, the similarity score corresponds to the case where each column of the multiple sequence alignment $A$ is considered as a block.
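The two scores of Definition 8 can be sketched as follows. The sketch assumes that the corrected denominator is the set of blocks of t₁ shared with g(t₂) together with the blocks of t₂ shared with g(t₁), as described in the text. The function names and the block sets in the example are hypothetical: they are chosen to reproduce the ratio quoted for Figure 3 and are not read from the figure.

```python
def total_weight(blocks, weight=None):
    """Sum of block weights; every block weighs 1 when no weights are given."""
    return sum(1.0 if weight is None else weight[b] for b in blocks)


def tsm(blocks_t1, blocks_t2, weight=None):
    """Similarity score: shared blocks over blocks present in either transcript."""
    b1, b2 = set(blocks_t1), set(blocks_t2)
    union = b1 | b2
    if not union:
        return 0.0
    return total_weight(b1 & b2, weight) / total_weight(union, weight)


def tsm_plus(blocks_t1, blocks_t2, blocks_g1, blocks_g2, weight=None):
    """Corrected score: the denominator ignores blocks missing from a gene."""
    b1, b2 = set(blocks_t1), set(blocks_t2)
    corrected_union = (b1 & set(blocks_g2)) | (b2 & set(blocks_g1))
    if not corrected_union:
        return 0.0
    return total_weight(b1 & b2, weight) / total_weight(corrected_union, weight)


# Hypothetical block sets: block 7 belongs to t11 and g1 but not to g2.
t11 = {2, 3, 4, 6, 7, 8}
t21 = {2, 3, 4, 6, 8}
g1 = {1, 2, 3, 4, 6, 7, 8}
g2 = {2, 3, 4, 5, 6, 8, 9}
plain_score = tsm(t11, t21)
corrected_score = tsm_plus(t11, t21, g1, g2)
```

With unit weights, the uncorrected score penalizes block 7 (5 shared blocks out of 6), whereas the corrected score ignores it because the block is absent from the second gene altogether.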

3.2. Orthology graph construction and ortholog groups inference

We now describe the method to identify pairs of recent paralogs and putative pairs of isoorthologs based on pairwise similarity scores between transcripts. The method relies on the RBH approach. The underlying idea is that two isoorthologous transcripts (para- or ortho-isoorthologs) from homologous genes g ₁ and g ₂ should be the most conserved compared with other pairs of transcripts from the same genes. Likewise, two recent paralogous transcripts $t_{11}$ and $t_{12}$ from a gene g ₁ should be more similar to each other than to any transcript in another gene g ₂. This is because the lowest common ancestor (LCA) of $t_{11}$ and $t_{12}$ , which is a creation node in the transcript tree, must be more recent than any node from which any pair of transcripts from g ₁ and g ₂ diverged.

Definition 9 (Inferred recent paralogs). Two distinct transcripts t ₁ and t ₂ of $T$ such that $g (t_{1}) = g (t_{2})$ are inferred as recent paralogs if:

• or there exists a third transcript t ₃ of $T$ such that t ₁ and t ₃ are recent paralogs and, t ₂ and t ₃ are also recent paralogs.

Definition 10 (Putative isoorthologs). Two transcripts t ₁ and t ₂ of $T$ of two distinct genes are inferred as putative isoorthologs if: $t s m (t_{1}, t_{2}) = m a x {t s m (t_{1}, t) : t \in g^{- 1} (g (t_{2}))} = m a x {t s m (t_{2}, t) : t \in g^{- 1} (g (t_{1}))} .$
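The reciprocal best hit condition of Definition 10 can be sketched as follows. The sketch assumes a symmetric, deterministic scoring function and a dictionary that maps each gene to its list of transcripts. The function name, the input encoding, and the example scores are hypothetical. Ties are kept, so a transcript may appear in several pairs.

```python
from itertools import combinations


def putative_isoorthologs(transcripts_of, score):
    """Return reciprocal-best-hit pairs (t1, t2, score), best scores first."""
    pairs = []
    for g1, g2 in combinations(transcripts_of, 2):
        for t1 in transcripts_of[g1]:
            # Best score of t1 against all transcripts of the other gene.
            best_for_t1 = max(score(t1, t) for t in transcripts_of[g2])
            for t2 in transcripts_of[g2]:
                s = score(t1, t2)
                best_for_t2 = max(score(t2, t) for t in transcripts_of[g1])
                if s == best_for_t1 and s == best_for_t2:
                    pairs.append((t1, t2, s))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical similarity scores between the transcripts of two genes.
scores = {
    frozenset(("t11", "t21")): 1.0,
    frozenset(("t11", "t22")): 0.5,
    frozenset(("t12", "t21")): 0.6,
    frozenset(("t12", "t22")): 0.4,
}
transcripts_of = {"g1": ["t11", "t12"], "g2": ["t21", "t22"]}
rbh_pairs = putative_isoorthologs(
    transcripts_of, lambda a, b: scores[frozenset((a, b))]
)
```

In this example only (t11, t21) is retained: t21 is also the best hit of t12, but the best hit of t21 in the first gene is t11, so the relation is not reciprocal for t12.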

Using RBH to define putative isoorthologs makes the method more robust to transcript loss or incomplete transcript annotation in some genes. The next step is to define an orthology graph whose set of vertices is $T$ , edges represent inferred recent paralogs or putative isoorthologs, and connected components define ortholog groups.

Definition 11 (Orthology graph). An orthology graph for $T$ is a graph $G = (V, E)$ whose set of vertices $V = T$ and for any two distinct transcripts t ₁ and t ₂ of $T$ :

(1) if $(t_{1}, t_{2}) \in E$ , then t₁ and t₂ are either inferred recent paralogs or putative isoorthologs, and;

(2) if $g (t_{1}) = g (t_{2})$ , then t₁ and t₂ belong to the same connected component of G if and only if t₁ and t₂ are recent paralogs.

The aim of Algorithm 1 is to obtain an orthology graph for $T$ with a minimum number of connected components. Given a graph $(V, E)$ , $C C (V, E)$ denotes the set of all connected components of the graph. The function $c c_{(V, E)} : V \to C C (V, E)$ associates each vertex $x \in V$ with the connected component to which it belongs. Algorithm 1 follows a progressive heuristic approach to build an orthology graph for $T$ . The inputs are the set $ℛ ℙ$ of all pairs of inferred recent paralogs and the ordered set $ℙ O$ of all pairs of putative isoorthologs ordered by decreasing similarity. The algorithm starts with an empty set of edges. The edges corresponding to recent paralogs are then added. In the next step, the algorithm considers the edges corresponding to putative isoorthologs and adds them progressively. Each such edge is added if its addition does not break the property that the graph is an orthology graph.
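The progressive construction described above can be sketched with a disjoint-set structure. This is an illustrative reading of the prose description, not a reproduction of Algorithm 1: the function names, the input encoding, and the rule used to test whether an edge preserves the orthology graph property are assumptions. An isoorthology edge is accepted when the two components it would merge do not contain transcripts of the same gene belonging to different recent-paralogy classes.

```python
class DisjointSet:
    """Minimal union-find structure over hashable items."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def build_orthology_graph(transcripts, gene_of, recent_paralogs, isoorthologs):
    """isoorthologs must be ordered by decreasing similarity."""
    comp = DisjointSet(transcripts)
    edges = []
    # Step 1: add all recent paralogy edges.
    for t1, t2 in recent_paralogs:
        comp.union(t1, t2)
        edges.append((t1, t2))
    # Recent-paralogy class of each transcript, frozen after step 1.
    para_class = {t: comp.find(t) for t in transcripts}
    # For each component root: gene -> paralogy class present in the component.
    genes_in = {}
    for t in transcripts:
        genes_in.setdefault(comp.find(t), {})[gene_of[t]] = para_class[t]
    # Step 2: add isoorthology edges that keep the graph an orthology graph.
    for t1, t2 in isoorthologs:
        r1, r2 = comp.find(t1), comp.find(t2)
        if r1 == r2:
            edges.append((t1, t2))  # components are unchanged by this edge
            continue
        a, b = genes_in[r1], genes_in[r2]
        if all(b.get(gene, cls) == cls for gene, cls in a.items()):
            comp.union(r1, r2)
            genes_in[comp.find(r1)] = {**a, **b}
            edges.append((t1, t2))
    groups = {}
    for t in transcripts:
        groups.setdefault(comp.find(t), set()).add(t)
    return edges, list(groups.values())


# Hypothetical input: gene A has transcripts a1, a2; gene B has b1, b2.
transcripts = ["a1", "a2", "b1", "b2"]
gene_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
edges, groups = build_orthology_graph(
    transcripts, gene_of, [("b1", "b2")], [("a1", "b1"), ("a2", "b2")]
)
```

In this example the edge (a2, b2) is rejected, because accepting it would place a1 and a2 in the same component although they are not recent paralogs. The resulting ortholog groups are {a1, b1, b2} and {a2}.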

Lemma 5 (Correctness of Algorithm 1). Given $ℛ ℙ$ the set of all the pairs of inferred recent paralogs, and $ℙ O$ the ordered set of all the pairs of putative isoorthologs, Algorithm 1 computes an orthology graph for $T$ .

Proof. In Algorithm 1, after the first for loop, the set of edges contains only recent paralogy edges. Therefore, at this point, the graph is an orthology graph. In the remainder of the algorithm, an isoorthology edge is added if and only if it preserves the property that the graph is an orthology graph.

Given a multiple sequence alignment $A$ of dimension $n \times m$ such that n is the number of sequences and m is the number of columns, the decomposition of the multiple sequence alignment $A$ into blocks is computed in $O (n \times m)$ time complexity. The pairwise transcript similarity scores are computed in $O (n^{2} \times b)$ time complexity where b is the number of blocks in the decomposition of $A$ and $b ≪ m$ . The pairs of inferred recent paralogs and putative isoorthologs are computed in $O (n^{2})$ time complexity. Finally, Algorithm 1 runs in $O (n^{2})$ time complexity. Therefore, given $A$ , the whole method runs in $O (n \times m + n^{2} \times b)$ time complexity.

4. RESULTS AND DISCUSSION

This study introduces the notion of orthology and paralogy at the transcript level. Previous studies have primarily focused on identifying conserved transcripts between one-to-one orthologous genes. As a result, there has been little prior research on computing transcript ortholog groups. A notable exception is the work of Guillaudeux et al. (2022), who proposed a formal definition of splicing structure orthology and an algorithm used to predict transcript orthologs in human, mouse, and dog.

While our algorithm is based on the identification of putative isoorthologous transcripts, taking into account both structural similarity in terms of exon composition and the evolutionary context, the method of Guillaudeux et al. (2022) aims to identify structurally conserved transcripts without considering the evolutionary relationships between them. It starts by identifying conserved functional sites, such as donor and acceptor splice sites, start and stop codons, and exon blocks among three orthologous genes. Given the orthology relationships between functional sites, orthologous transcripts are then defined as transcripts sharing a conserved splicing structure.

In this section, we first compare the results of our ortholog group inference method with true ortholog groups obtained using a simulation tool (Kuitche et al., 2019). Next, we compare our method with the approach of Guillaudeux et al. (2022). Finally, we analyze the proportions of ortho-orthologs, para-orthologs, and recent paralogs predicted with our method.

4.1. Comparison with simulated true ortholog groups

We generated simulated ortholog groups using SimSpliceEvol (Kuitche et al., 2019). When provided with a guide gene tree with branch lengths as input, SimSpliceEvol produces a gene sequence and a set of transcript sequences at each leaf of the tree. These sequences are the result of a simulated evolution from an ancestral gene located at the root of the gene tree to extant genes located at the leaves of the tree, including their corresponding transcripts. The simulation results include the true evolutionary history of genes and transcripts, along with a precise multiple sequence alignment of all extant gene and transcript sequences. Therefore, this simulation provides us with the true ortholog groups for a given set of transcripts.

Using the default parameters of SimSpliceEvol, we generated 50 distinct data sets, which are described in Table 1. These data sets contain a total of 271 transcript ortholog groups, encompassing 735 transcripts.

Table 1.
Description of the 50 Data Sets Simulated Using SimSpliceEvol

Gene family	No. of genes	No. of transcripts	No. of ortholog groups	No. of ortho-orthologs	No. of para-orthologs	No. of recent paralogs
1	5	16	4	11	0	5
2	4	8	5	1	0	3
3	5	5	1	10	0	0
4	4	12	2	10	0	3
5	8	28	8	16	3	8
6	6	14	6	6	1	2
7	7	25	10	14	3	6
8	4	25	10	4	3	5
9	3	21	6	0	6	6
10	4	11	5	5	1	4
11	3	14	8	1	1	4
12	9	36	13	33	1	6
13	5	18	8	2	0	6
14	5	22	8	9	0	8
15	3	5	2	4	0	0
16	5	24	8	9	2	0
17	4	8	4	2	1	0
18	3	7	4	2	1	5
19	2	4	2	0	0	2
20	4	18	6	4	0	2
21	6	12	5	1	4	1
22	5	25	7	11	1	6
23	3	5	3	2	1	2
24	4	15	4	5	0	7
25	6	9	4	6	0	2
26	2	8	3	0	0	4
27	6	13	6	3	1	2
28	5	14	6	5	0	2
29	6	17	3	15	2	4
30	8	23	11	10	1	2
31	4	7	2	6	0	3
32	6	13	7	2	1	7
33	2	16	6	2	0	2
34	8	27	10	6	0	2
35	1	3	1	0	0	5
36	7	31	11	12	0	7
37	4	7	2	6	0	1
38	6	31	9	25	0	6
39	4	6	3	4	0	1
40	7	19	7	8	1	8
41	5	12	7	2	0	0
42	3	11	3	4	0	5
43	4	8	3	1	0	3
44	5	17	4	8	0	4
45	2	7	4	1	0	2
46	4	11	4	3	0	5
47	2	11	4	1	0	2
48	5	14	6	2	0	3
49	5	7	2	7	0	3
50	9	15	4	12	0	3
Total	237	735	271	313	35	179

For each gene family, the information contains the number of genes, the number of transcripts, the number of transcript ortholog groups, the number of pairwise ortho-ortholog relations, the number of pairwise para-ortholog relations, and the number of pairwise recent paralog relations.

We compared the 271 true transcript ortholog groups with ortholog groups obtained using 12 different settings of our method, considering the following variations: (1) Sequence alignments computed using MACSE (Ranwez et al., 2018), or the true alignment provided by SimSpliceEvol; (2) Unitary weights for alignment blocks, or weights corresponding to their lengths, or the mean of the two similarity scores; (3) The corrected or the uncorrected transcript similarity measure. For the assessment, we considered three performance measures: (i) The homogeneity score, which computes the ratio of pairs of transcripts predicted in the same group that are truly in the same group to pairs of transcripts predicted in the same group; (ii) The completeness score, which computes the ratio of pairs of transcripts predicted in the same group that are truly in the same group to pairs of transcripts that are truly in the same group; (iii) The V-measure, which is the harmonic mean between the scores of homogeneity and completeness.
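To make the three pair-based measures concrete, here is a minimal Python sketch that follows the definitions given above. Note that these are pairwise ratios, not the entropy-based scores of the same names found in some clustering libraries. The function names and the convention of returning 1.0 when a denominator has no pairs are assumptions of this example.

```python
from itertools import combinations


def co_clustered_pairs(groups):
    """Return the set of unordered transcript pairs placed in the same group."""
    return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}


def pairwise_scores(predicted_groups, true_groups):
    """Return (homogeneity, completeness, V-measure) as defined in the text."""
    predicted = co_clustered_pairs(predicted_groups)
    truth = co_clustered_pairs(true_groups)
    shared = len(predicted & truth)
    # Assumed convention: a score with no pairs in its denominator is 1.0.
    homogeneity = shared / len(predicted) if predicted else 1.0
    completeness = shared / len(truth) if truth else 1.0
    total = homogeneity + completeness
    # Harmonic mean of homogeneity and completeness.
    v_measure = 2 * homogeneity * completeness / total if total else 0.0
    return homogeneity, completeness, v_measure


# Toy example: the prediction merges too much, so homogeneity < completeness.
truth = [{"a", "b"}, {"c", "d"}]
prediction = [{"a", "b", "c"}, {"d"}]
h, c, v = pairwise_scores(prediction, truth)
```

In this toy example, one of the three predicted pairs is correct and one of the two true pairs is recovered, giving a homogeneity of 1/3, a completeness of 1/2, and a V-measure of 0.4.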

Figure 4 provides the details of the score distributions for all settings. It shows that the highest V-measure scores (0.8) are achieved with the two settings that use the true sequence alignments or MACSE alignments, the unitary weights for block alignments, and the uncorrected transcript similarity measure. The scores obtained using the true sequence alignments or MACSE alignments are very close, highlighting the quality of alignments obtained using MACSE. The figure also shows that the uncorrected transcript similarity measure consistently outperforms the corrected one. Finally, our method consistently displays a tendency to cluster transcripts more than the true ortholog groups, as evidenced by the homogeneity score being lower than the completeness score for all settings. However, the high V-measure score obtained with our best setting (MACSE + unitary weights + uncorrected similarity measure) demonstrates the capacity of our method to recover transcript ortholog groups that are highly similar to the true ortholog groups.

FIG. 4.

The distributions of homogeneity, completeness, and V-measure scores for our ortholog group predictions on the simulated data set of 50 gene families. The ortholog groups are obtained using the 12 different settings of our method, including 2 options for the transcript similarity measure (the uncorrected similarity score tsm or the corrected similarity score tsm+), 2 options for the sequence alignment (MACSE or true), and 3 options for the weights associated with blocks (unitary, length, or mean). The majority of mispredicted transcript ortholog groups (outliers) either fail to predict true recent paralogs, which are falsely placed in different ortholog groups, or predict false isoorthologs that are not structurally identical yet are very close and are thus supported by the transcript similarity measure.

4.2. Comparison with ortholog groups predicted in human, mouse, and dog one-to-one orthologous genes

We compared the results of Guillaudeux et al.'s (2022) approach with those of our method. Their data set is publicly available and includes 253 triplets of one-to-one orthologous genes from 236 gene families in the Ensembl-Compara database. Their method predicted 879 transcript ortholog groups, covering a total of 1896 transcripts.

We compared the 879 orthologous groups of Guillaudeux et al. (2022) with orthologous groups obtained using 12 different settings of our method. We used the same settings as in the comparison with true ortholog groups, except that we replaced the true sequence alignments, which are not available here, with alignments obtained using Kalign (Lassmann and Sonnhammer, 2005). For each version of our method, we used our results as the predictions, and Guillaudeux et al.'s (2022) 879 orthologous groups as the ground truth.

For the assessment, we used the homogeneity score, the completeness score, and the V-measure score.

Figure 5 provides the details of the score distributions for all settings. The high scores obtained show that our results are in agreement with those of Guillaudeux et al. (2022). In particular, the best scores are achieved again with the setting that uses the MACSE sequence alignments, unitary weights for block alignments, and the uncorrected transcript similarity measure. In concordance with the previous results obtained for the comparison with true ortholog groups, our method consistently displays an inclination to cluster transcripts more, as evidenced by the homogeneity score consistently being lower than the completeness score in all settings.

FIG. 5.

The distributions of homogeneity, completeness, and V-measure scores for our predictions on the 253 triplets of orthologous genes from Guillaudeux et al. (2022). Our ortholog groups are obtained using the 12 different settings of our method, including 2 options for the transcript similarity measure (the uncorrected similarity score tsm or the corrected similarity score tsm+), 2 options for the sequence alignment (MACSE or Kalign), and 3 options for the weights associated with blocks (unitary, length, or mean).

The detailed comparison of the groups obtained by the Guillaudeux et al. (2022) method and the best-performing setting of our method shows that 495 clusters of 166 families are exactly the same for the two methods. Figure 6 shows the number of recent paralogs per species and the number of ortho-isoorthologs per species pair predicted by the two methods. Our method predicts 1896 ortho-isoorthologs and 466 recent paralogs compared with Guillaudeux et al.'s (2022) method, which predicts 1408 ortho-isoorthologs and 395 recent paralogs. Our method also finds all recent paralogs inferred by Guillaudeux et al. (2022). This observation is consistent with the previous observation that our method tends to cluster transcripts more than Guillaudeux et al.'s (2022) method.

FIG. 6.

Comparison between the quantities of recent paralogs and ortho-isoorthologs inferred by our method and the method proposed by Guillaudeux et al. (2022). More inferred relations are observed using our method, which appears to be less stringent than the method by Guillaudeux et al. (2022).

4.3. Comparison of the proportions of ortho-orthologs, para-orthologs, and recent paralogs predicted

We randomly selected 20 gene families composed of genes from 6 species: human, mouse, dog, dingo, cow, and chicken. Table 2 describes the data set. We used the best-performing setting of our method, using MACSE to compute the multiple sequence alignment, unitary-weights similarity scores, and the uncorrected transcript similarity measure. From a total of 1402 transcripts, we identified 236 ortholog groups. Table 3 shows the ratio of isoorthologs and recent paralogs found between transcript pairs. In this experiment, we could identify para-isoorthologs because the data set contains paralogous genes. Table 3 also shows the ratio of ortho-isoorthology relations normalized by the ratio of gene orthology relations and the ratio of para-isoorthology relations normalized by the ratio of gene paralogy relations. When the normalized ratio of ortho-isoorthology (respectively, para-isoorthology) is greater than 1, more relations were predicted than expected given the ratio of gene orthology (respectively, paralogy) relations to all gene pairs.

Table 2.
Description of the Data Set of 20 Gene Families

Gene families ID	No. of genes	No. of transcripts
ENSGT00390000000715	6 = 1 + 1 + 1 + 1 + 1 + 1	17 = 6 + 4 + 4 + 1 + 1 + 1
ENSGT00390000003967	6 = 1 + 1 + 1 + 1 + 1 + 1	13 = 3 + 2 + 4 + 1 + 2 + 1
ENSGT00390000004965	6 = 1 + 1 + 1 + 1 + 1 + 1	8 = 2 + 1 + 2 + 1 + 1 + 1
ENSGT00390000005532	6 = 1 + 1 + 1 + 1 + 1 + 1	14 = 5 + 2 + 2 + 1 + 2 + 2
ENSGT00390000008371	12 = 2 + 2 + 2 + 2 + 2 + 2	32 = 11 + 9 + 3 + 3 + 3 + 3
ENSGT00530000063023	23 = 4 + 4 + 4 + 4 + 4 + 3	65 = 26 + 11 + 12 + 5 + 6 + 5
ENSGT00530000063187	17 = 3 + 3 + 3 + 3 + 3 + 2	34 = 8 + 6 + 7 + 6 + 4 + 3
ENSGT00530000063205	19 = 3 + 3 + 3 + 3 + 4 + 3	54 = 19 + 11 + 6 + 3 + 10 + 5
ENSGT00940000153241	14 = 2 + 2 + 3 + 3 + 2 + 2	23 = 3 + 4 + 7 + 4 + 3 + 2
ENSGT00940000157909	6 = 1 + 1 + 1 + 1 + 1 + 1	47 = 17 + 13 + 5 + 3 + 4 + 5
ENSGT00950000182681	43 = 8 + 7 + 7 + 7 + 7 + 7	104 = 26 + 24 + 19 + 8 + 16 + 11
ENSGT00950000182705	39 = 7 + 7 + 6 + 6 + 6 + 7	158 = 60 + 41 + 14 + 10 + 19 + 14
ENSGT00950000182727	30 = 5 + 5 + 5 + 5 + 5 + 5	121 = 23 + 15 + 21 + 26 + 23 + 13
ENSGT00950000182728	35 = 6 + 6 + 6 + 6 + 6 + 5	112 = 31 + 13 + 15 + 30 + 12 + 11
ENSGT00950000182783	29 = 5 + 5 + 4 + 5 + 5 + 5	97 = 40 + 22 + 9 + 9 + 7 + 10
ENSGT00950000182875	37 = 7 + 7 + 5 + 5 + 5 + 8	77 = 26 + 14 + 10 + 6 + 10 + 11
ENSGT00950000182931	24 = 4 + 4 + 4 + 4 + 4 + 4	116 = 57 + 15 + 18 + 9 + 9 + 8
ENSGT00950000182956	23 = 4 + 4 + 4 + 4 + 4 + 3	109 = 25 + 29 + 21 + 9 + 15 + 10
ENSGT00950000182978	24 = 4 + 4 + 4 + 4 + 4 + 4	116 = 34 + 20 + 19 + 11 + 18 + 14
ENSGT00950000183192	19 = 4 + 3 + 3 + 3 + 3 + 3	85 = 50 + 9 + 7 + 5 + 8 + 6

The total number of genes, the total number of transcripts, and the numbers per species are given.

Table 3.

Comparison of the Ratio of Isoorthologs and Recent Paralogs Found Between Transcript Pairs in the 20 Real Gene Families

Gene family ID	No. of ortholog groups	Ratio of isoorthologs	Ratio of recent paralogs	Normalized ratio of ortho-isoorthologs	Normalized ratio of para-isoorthologs
ENSGT00390000000715	4	0.4118	0.0882	0.7738	1.3393
ENSGT00390000003967	5	0.4103	0.0513	1.0	0.0
ENSGT00390000004965	2	0.7143	0.0357	1.0	0.0
ENSGT00390000005532	6	0.1758	0.0	0.75	1.125
ENSGT00390000008371	6	0.5423	0.0665	0.9574	1.145
ENSGT00530000063023	12	0.5029	0.0207	0.8293	1.034
ENSGT00530000063187	13	0.303	0.0071	0.9	1.0214
ENSGT00530000063205	5	0.5877	0.0307	0.6227	1.0389
ENSGT00940000153241	6	0.5257	0.0158	0.9431	1.039
ENSGT00940000157909	19	0.0722	0.0213	1.0	0.0
ENSGT00950000182681	11	0.6701	0.0149	1.1272	0.9925
ENSGT00950000182705	12	0.7817	0.0373	0.7965	1.011
ENSGT00950000182727	14	0.6354	0.0216	0.9795	1.0338
ENSGT00950000182728	11	0.772	0.0246	1.0074	0.9992
ENSGT00950000182783	9	0.561	0.0217	0.8978	1.017
ENSGT00950000182875	9	0.6548	0.0154	1.1482	0.9143
ENSGT00950000182931	33	0.2937	0.0127	0.98	1.0039
ENSGT00950000182956	24	0.4392	0.0221	0.9481	1.0557
ENSGT00950000182978	13	0.7217	0.036	0.9302	1.0962
ENSGT00950000183192	22	0.4031	0.0459	0.902	1.1559

For each gene family, the provided information includes: (1) the number of predicted ortholog groups, (2) the ratio of the number of isoortholog pairs to the total number of transcript pairs, (3) the ratio of the number of recent paralog pairs to the total number of transcript pairs, (4) the ratio of the number of ortho-isoortholog pairs to the total number of isoortholog pairs, divided by the ratio of the number of gene ortholog pairs to the total number of gene pairs, and (5) the ratio of the number of para-isoortholog pairs to the total number of isoortholog pairs, divided by the ratio of the number of gene paralog pairs to the total number of gene pairs.
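The normalized ratios in items (4) and (5) can be written as a small Python function. This is a sketch of the arithmetic described in the legend only: the function and parameter names are illustrative, and returning 0.0 when a denominator is zero (for example, a family without paralogous gene pairs) is an assumption of this example rather than a documented convention.

```python
def normalized_isoorthology_ratio(typed_iso_pairs, all_iso_pairs,
                                  typed_gene_pairs, all_gene_pairs):
    """Observed share of isoortholog pairs of one type (ortho or para),
    divided by the share of gene pairs of the same type."""
    if all_iso_pairs == 0 or typed_gene_pairs == 0 or all_gene_pairs == 0:
        return 0.0  # assumed convention for undefined ratios
    observed = typed_iso_pairs / all_iso_pairs
    expected = typed_gene_pairs / all_gene_pairs
    return observed / expected


# Hypothetical counts: 9 of 10 isoortholog pairs link orthologous genes,
# while 12 of 15 gene pairs are orthologs.
ratio = normalized_isoorthology_ratio(9, 10, 12, 15)
```

With these hypothetical counts the ratio is 0.9 / 0.8 = 1.125, meaning more ortho-isoorthology relations than expected from the share of orthologous gene pairs.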

In 14 of 20 families, the normalized para-isoorthology ratio is greater than 1 and greater than the normalized ortho-isoorthology ratio. Therefore, it seems that isoorthologous transcripts tend to be more present between paralogous genes than between orthologous genes. This is consistent with previous studies that have found evidence against the ortholog conjecture in the context of gene function prediction by transferring annotations between homologous genes (Stamboulian et al., 2020). The ortholog conjecture proposes that orthologous genes should be preferred when making such predictions because their functions diverge more slowly than those of paralogous genes. Our results support considering both orthologs and paralogs to achieve higher prediction accuracy. However, this interpretation should be taken with caution, as there may be errors in the inferred orthology/paralogy relationships between genes.
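The count of 14 families can be re-derived from the last two columns of Table 3. The short Python sketch below copies those values (normalized ortho-isoorthology ratio, normalized para-isoorthology ratio) and applies the stated criterion; the variable names are illustrative.

```python
# (normalized ortho-isoortholog ratio, normalized para-isoortholog ratio)
# per gene family, copied from Table 3.
table3 = {
    "ENSGT00390000000715": (0.7738, 1.3393),
    "ENSGT00390000003967": (1.0, 0.0),
    "ENSGT00390000004965": (1.0, 0.0),
    "ENSGT00390000005532": (0.75, 1.125),
    "ENSGT00390000008371": (0.9574, 1.145),
    "ENSGT00530000063023": (0.8293, 1.034),
    "ENSGT00530000063187": (0.9, 1.0214),
    "ENSGT00530000063205": (0.6227, 1.0389),
    "ENSGT00940000153241": (0.9431, 1.039),
    "ENSGT00940000157909": (1.0, 0.0),
    "ENSGT00950000182681": (1.1272, 0.9925),
    "ENSGT00950000182705": (0.7965, 1.011),
    "ENSGT00950000182727": (0.9795, 1.0338),
    "ENSGT00950000182728": (1.0074, 0.9992),
    "ENSGT00950000182783": (0.8978, 1.017),
    "ENSGT00950000182875": (1.1482, 0.9143),
    "ENSGT00950000182931": (0.98, 1.0039),
    "ENSGT00950000182956": (0.9481, 1.0557),
    "ENSGT00950000182978": (0.9302, 1.0962),
    "ENSGT00950000183192": (0.902, 1.1559),
}

# Families where the para ratio exceeds both 1 and the ortho ratio.
para_dominant = [
    family for family, (ortho, para) in table3.items()
    if para > 1 and para > ortho
]
print(len(para_dominant), "of", len(table3))  # prints: 14 of 20
```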

5. CONCLUSION AND DISCUSSION

Alternative splicing, alternative promoters, and other processing mechanisms can result in the production of multiple transcript isoforms from a single gene, each potentially having distinct functions. Thus, understanding how the sets of transcripts produced from genes evolve is crucial for the functional annotation of genes and genomes. In addition, changes in the structure and expression of transcripts can contribute to various diseases, including cancers and genetic disorders. Therefore, studying transcriptome changes helps identify potential disease markers and understand the molecular basis of pathologies. Furthermore, the evolution of transcripts is intricately related to the evolution of the genes from which they are produced. Therefore, the study of transcript evolution is also valuable in phylogenetic analyses to infer the evolutionary relationships between genes and species. This is particularly important for reconstructing the tree of life and understanding the evolutionary history of organisms.

A fundamental step in studying the evolution of alternative transcripts is categorizing homology relationships among homologous transcripts within gene families. The inferred relations can then be used as a basis to reconstruct a transcript phylogeny that agrees with the input relations. Moreover, identifying orthology relationships and paralogy relationships between transcripts is important for understanding the function of transcripts conserved between multiple genes and species, as well as comprehending the gain of functions in newly created transcripts. The definition of orthology and paralogy relations between transcripts also establishes a basis for correcting gene trees.

Indeed, as most current gene tree construction methods rely on incomplete data and consider a single reference transcript to represent each gene, they often produce trees that contain errors. Considering the whole set of transcripts produced from each gene, and orthology and paralogy relations between those transcripts, is likely to produce more accurate gene trees. Finally, defining orthology and paralogy relations between transcripts is crucial for understanding the relation between the evolution of transcripts under the creation and loss evolution model and the evolution of gene functions.

The work presented in this study formally revisits the concepts of orthology, paralogy, and isoorthology at the transcriptome level. It provides a greedy algorithm to compute clusters of transcript orthologs, consisting of transcripts that are isoorthologs, recent paralogs, or related through a path of isoorthology and recent paralogy relations. The results of the method are consistent with those obtained from methods that compare orthologous genes to identify conserved transcripts. Notably, when comparing the proportions of conserved transcripts between paralogous genes and between orthologous genes, it appears that the proportion between paralogous genes is substantial compared with that between orthologous genes. This contradicts the “ortholog conjecture” and argues for considering both orthologs and paralogs in gene function prediction, rather than using only orthologous genes. It also calls for further studies of the intertwined evolution of transcripts and genes within gene families.

The method offers numerous possibilities for improvement and extension. First, the quality of the multiple sequence alignment and the transcript similarity measure are crucial factors for the quality of orthology relations inference. In particular, the number of blocks obtained by decomposing the multiple sequence alignment depends strongly on the quality of the alignment and also increases in proportion to the alignment length. The results reveal that Kalign alignments tend to produce a larger number of blocks than MACSE alignments. In future work, we will explore various methods for comparing transcripts and revisit our approach for the decomposition of alignments into blocks. For instance, in the definition of a multiple sequence alignment decomposition into blocks, the rule that each gap opening in the alignment results in the start of a new block can be relaxed to allow short gaps to be included in blocks. This will lead to less fragmented alignments.

In addition, alternative algorithms for computing clusters of ortholog transcripts, provided putative pairs of isoorthologs, will be investigated. Moreover, as our method appears to be less stringent compared with existing methods, especially in the identification of recent paralogy relationships, it will be important to refine the method for computing recent paralogy relationships. For instance, more stringency in the definition of recent paralogs can be achieved by forbidding a transcript to be in a recent paralogy relationship if it has a putative isoortholog in another gene. As for the definition of putative pairs of isoorthologs, one can consider triplets of genes instead of two genes at a time to obtain a more consistent method.

Finally, we will extend the method to infer all pairwise relations between homologous transcripts originating from an ancestral transcript. To achieve this, the ortholog groups can be used as a basis to compute complete transcript phylogenies using supertree approaches. The goal will be to combine the ortholog subtrees into a complete transcript phylogeny that minimizes the reconciliation with the gene tree while preserving the orthology relationships between the transcripts of input ortholog trees. The resulting complete transcript tree can then be reconciled with the gene tree to label the internal nodes of the transcript phylogeny and deduce the orthology or paralogy relations between transcripts. The accuracy of the ortholog group predictions can then be assessed by comparing the annotated functions of the proteins corresponding to transcripts grouped in ortholog groups, as shown in Figure 7.

FIG. 7.

(Top) MSA of the transcripts from the JNK3 gene family. The ortholog groups inferred by our method are indicated in red and green, following the color code (respectively, yellow and green) of Ait-Hamlat et al. (2020). (Bottom) A graph in which the nodes representing GO terms are in black and the nodes representing the transcripts of the family are in yellow and green, corresponding to their color in the MSA, with an edge between a transcript node and a GO term node if the transcript is annotated with the GO term. We observe that the GO annotations of transcripts do not allow the transcripts of the gene family to be separated into two different functional groups, whereas the transcript sequence alignment clearly distinguishes two groups of transcripts. GO, Gene Ontology; MSA, multiple sequence alignment.

ACKNOWLEDGMENTS

We thank the reviewers for their valuable suggestions and the CoBIUS laboratory at the University of Sherbrooke for helpful and constructive discussions.

AUTHORS' CONTRIBUTIONS

W.Y.D.D.O. and A.O. conceived the study and the algorithms. W.Y.D.D.O. implemented the algorithm, generated the simulated data, collected the real data, analyzed and interpreted the results, and wrote a draft of the article. A.O. critically reviewed the article.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was supported by the Canada Research Chair (CRC Tier 2 Grant 950-230577) and the Natural Sciences and Engineering Research Council of Canada (NSERC Discovery Grant RGPIN-2023-05474).

References

1. Ait-Hamlat, Zea, Labeeuw, et al. Transcripts' evolutionary history and structural dynamics give mechanistic insights into the functional diversity of the JNK family. J Mol Biol, 2020; 432(7):2121–2140.

2. Altenhoff, Gil, Gonnet, Dessimoz. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One, 2013; 8(1):e53786.

3. Blanquart, Varré J-S, Guertin, et al. Assisted transcriptome reconstruction and splicing orthology. BMC Genomics, 2016; 17(10):157.

4. Chauve, El-Mabrouk. New perspectives on gene family evolution: Losses in reconciliation and a link with supertrees. In: Research in Computational Molecular Biology: Proceedings of the 13th Annual International Conference, RECOMB 2009, Tucson, AZ, USA, May 18–21, 2009. Springer; 2009; pp. 46–58.

5. Christinat, Moret. Inferring transcript phylogenies. BMC Bioinformatics, 2012; 13(9):S1.

6. Christinat, Moret. A transcript perspective on evolution. IEEE ACM Trans Comput Biol Bioinformatics, 2013; 10(6):1403–1411.

7. Guillaudeux, Belleannée, Blanquart. Identifying genes with conserved splicing structure and orthologous isoforms in human, mouse and dog. BMC Genomics, 2022; 23(1):1–14.

8. Harrow, Frankish, Gonzalez, et al. GENCODE: The reference human genome annotation for the ENCODE project. Genome Res, 2012; 22(9):1760–1774.

9. Jammali, Aguilar J-D, Kuitche, Ouangraoua. SplicedFamAlign: CDS-to-gene spliced alignment and identification of transcript orthology groups. BMC Bioinformatics, 2019; 20(3):133.

10. Keren, Lev-Maor, Ast. Alternative splicing and evolution: Diversification, exon definition and function. Nat Rev Genetics, 2010; 11(5):345–355.

11. Kuitche, Jammali, Ouangraoua. SimSpliceEvol: Alternative splicing-aware simulation of biological sequence evolution. BMC Bioinformatics, 2019; 20(20):640.

12. Kuitche, Lafond, Ouangraoua. Reconstructing protein and gene phylogenies by extending the framework of reconciliation. 2017a. In: Proceedings of the 9th International Conference on Bioinformatics and Computational Biology. ISBN: 9781510836679.

13. Kuitche, Lafond, Ouangraoua. Reconstructing protein and gene phylogenies using reconciliation and soft-clustering. J Bioinform Comput Biol, 2017b; 15(06):1740007.

14. Lafond, Meghdari Miardan, Sankoff. Accurate prediction of orthologs in the presence of divergence after duplication. Bioinformatics, 2018; 34(13):i366–i375.

15. Lassmann, Sonnhammer. Kalign: An accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 2005; 6(1):1–9.

16. Li, Stoeckert, Roos. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res, 2003; 13(9):2178–2189.

17. Ouangraoua, Swenson, Bergeron. On the comparison of sets of alternative transcripts. In: Bioinformatics Research and Applications: Proceedings of the 8th International Symposium, ISBRA 2012, Dallas, TX, USA, May 21–23, 2012. Springer; 2012; pp. 201–212.

18. Ranwez, Douzery, Cambon, et al. MACSE v2: Toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol, 2018; 35(10):2582–2584.

19. Stamboulian, Guerrero, Hahn, Radivojac. The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics, 2020; 36(Suppl. 1):i219–i226.

20. Swenson, El-Mabrouk. Gene trees and species trees: Irreconcilable differences. BMC Bioinformatics, 2012; 13:1–9.

21. Zambelli, Pavesi, Gissi, et al. Assessment of orthologous splicing isoforms in human and mouse orthologous genes. BMC Genomics, 2010; 11(1):1.

22. Zerbino, Achuthan, Akanni, et al. Ensembl 2018. Nucleic Acids Res, 2018; 46(D1):D754–D761.

23. Zhang. On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol, 1997; 4(2):177–187.