Abstract
Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.
INTRODUCTION
The importance and extent of alternative splicing in transcriptome diversity were revealed in the post-genomic era (Harrow et al., 2012). It allows the production of distinct transcripts from a gene. Over the past decade, the number of alternatively spliced genes and alternative transcripts annotated in eukaryotic organisms has increased dramatically (Zerbino et al., 2018). It has now been established that alternative splicing was likely a feature of the eukaryotes' common ancestor.
Understanding how transcripts evolve helps in the study of the functional diversity of genes because transcript evolution is closely related to gene regulation. In this context, comparative transcriptomics is useful to trace the evolutionary relationships between transcripts, genes, and species. Understanding how transcripts diverge or remain conserved across species provides insights into the evolution of biological complexity. Moreover, identifying conserved transcripts and understanding their roles can aid in annotating genes and predicting the functions of uncharacterized genomic elements.
Recently, several methods have been developed to identify conserved transcripts annotated in orthologous genes. These methods identify splicing orthologous transcripts between genes, defined as alternative transcripts of orthologous genes composed of orthologous exons (Blanquart et al., 2016; Guillaudeux et al., 2022; Jammali et al., 2019; Ouangraoua et al., 2012; Zambelli et al., 2010). Other studies have proposed various models of transcript evolution with associated algorithms to reconstruct transcript phylogenies using parsimony-based tree search methods (Ait-Hamlat et al., 2020; Christinat and Moret, 2013; Christinat and Moret, 2012) or supertree methods (Kuitche et al., 2019; Kuitche et al., 2017b). However, several questions about the evolution of sets of alternative transcripts in a gene family remain open (Keren et al., 2010).
For example, does transcript conservation occur more between orthologous genes than between paralogous genes? How are new transcript isoforms created during evolution? How can the conserved counterpart of a transcript be identified in other homologous genes? Where was a transcript gained or lost during evolution? Moreover, except for the comparison between transcripts of orthologous genes, no method exists to compare all transcripts of all homologous genes in a gene family, including the comparison of transcripts from paralogous genes, with measures of similarity to identify conserved transcripts. Furthermore, there is a need for a framework to classify transcript homology types, analogous to the existing framework for classifying gene homology types. Beyond allowing a better understanding of alternative transcript evolution, the prediction of orthologous and paralogous transcripts has other important potential applications. It can be useful for gene orthology inference and gene tree correction, as well as gene function prediction.
In this study, we present a framework for the definition of orthology and paralogy relations between homologous transcripts, originating from the same ancestral transcript in a gene family. Note, however, that the transcripts of a gene family are not necessarily all related by a single transcript tree, but possibly by multiple transcript trees. Each tree then depicts the phylogeny of a set of homologous transcripts originating from an ancestral transcript present in one ancestral gene. Sets of homologous transcripts can be obtained by grouping the transcripts of a gene family into sets of similar transcripts. The orthology and paralogy model at the transcript level is an extension of the gene tree–species tree reconciliation model, which allows defining orthology, paralogy, and isoorthology relations at the gene level.
Isoorthologous genes are the least divergent orthologs that have retained the function of their lowest common ancestor (LCA) (Li et al., 2003; Swenson and El-Mabrouk, 2012). Based on our model, we devise an algorithm for the inference of groups of conserved isoorthologous and paralogous transcripts. The algorithm relies on transcript pairwise similarity scores to identify pairs of conserved recent paralogs and isoorthologs through a Reciprocal Best Hit (RBH) approach. The pairwise relations are then combined into ortholog groups comprising pairs of isoorthologs, recent paralog transcripts, or pairs of transcripts related by isoorthology and recent paralogy relations through transitivity.
In Section 2, we give the definitions and notations required in the article. In Section 3, we present our graph-based algorithm to infer ortholog groups. Section 4 presents the results of the application of the methods to transcripts of simulated data sets generated using SimSpliceEvol (Kuitche et al., 2019), and real gene families from the Ensembl-Compara database (Zerbino et al., 2018).
The results obtained for the gene families are available at https://github.com/UdeS-CoBIUS/TranscriptOrthology.
PRELIMINARIES
In this section, we introduce the definitions and notations required in the next sections. We start with general definitions of rooted trees. Species trees, gene trees, and transcript trees are then defined, followed by the definitions of reconciliations between two phylogenetic trees of different levels. Next, the different types of homology relationships at the gene and transcript levels are defined. The section ends with the definitions of the partition of a set of transcripts into ortholog groups and the decomposition of a transcript tree into ortholog subtrees.
All trees are considered rooted and binary. Given a tree P,
We denote by
and for any node t in 
The Duplication–Loss model of gene evolution describes the evolutionary history of a set of homologous genes, with genes undergoing speciation events when the corresponding species diverge, duplication events when a gene is copied into a new one within a species, and loss events when a gene is lost in a species. Similarly, the Creation–Loss (CL) model of transcript evolution reflects the evolutionary history of a set of homologous transcripts. In the CL model, transcripts are subjected to speciation and duplication events at the same time as their corresponding genes, creation events when a new transcript is generated, and loss events when a gene ceases to produce a transcript. The reconciliation of a gene tree with a species tree allows for labeling the internal nodes of the gene tree as speciation or gene duplication events. Similarly, the internal nodes of a transcript tree can be labeled as creation, speciation, or gene duplication events based on the reconciliation of the transcript tree with a gene tree. Here, we consider the LCA-reconciliation based on the LCA mapping (Definition 1). A well-known reconciliation approach (Zhang, 1997) further allows recovering, in linear time, the location of the loss events implied by such a reconciliation history.
Similarly, the LCA-reconciliation of T and G is a function
Figure 1 provides an illustration for Definition 1.
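The gene tree–species tree case of the LCA-reconciliation can be sketched in a few lines. The code below is a minimal illustration, assuming the standard LCA-mapping rule (an internal gene node is a duplication when it maps to the same species node as one of its children, and a speciation otherwise). The `Node` class and the function names are hypothetical and introduced only for this example; creation events at the transcript level are not handled.

```python
class Node:
    """Node of a rooted binary tree (hypothetical helper for this sketch)."""
    def __init__(self, name, children=(), species=None):
        self.name = name
        self.children = list(children)
        self.species = species  # for gene leaves: name of the host species
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """Path from a node up to the root, the node itself included."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def lca(a, b):
    """Lowest common ancestor of two nodes of the same tree."""
    seen = {id(x) for x in ancestors(a)}
    for x in ancestors(b):
        if id(x) in seen:
            return x
    raise ValueError("nodes are not in the same tree")

def lca_reconcile(gene_root, species_leaves):
    """Map every gene-tree node to a species-tree node and label internal nodes."""
    mapping, label = {}, {}
    def visit(g):
        if not g.children:
            mapping[g] = species_leaves[g.species]
            return
        for child in g.children:
            visit(child)
        left, right = g.children
        mapping[g] = lca(mapping[left], mapping[right])
        # Assumed rule: duplication if a child maps to the same species node.
        is_dup = mapping[g] is mapping[left] or mapping[g] is mapping[right]
        label[g] = "duplication" if is_dup else "speciation"
    visit(gene_root)
    return mapping, label

# Toy example: species tree (A,B); gene tree ((a1,b1),a2).
A, B = Node("A"), Node("B")
S = Node("AB", [A, B])
a1, b1, a2 = Node("a1", species="A"), Node("b1", species="B"), Node("a2", species="A")
inner = Node("g1", [a1, b1])
root = Node("g0", [inner, a2])
mapping, label = lca_reconcile(root, {"A": A, "B": B})
```

In this toy example, the node above `a1` and `b1` is labeled a speciation, so `a1` and `b1` are orthologs, while the root is labeled a duplication, so `a2` is a paralog of both.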
A species tree S on 
Orthology and paralogy are relationships defined for pairs of homologous genes through gene tree–species tree reconciliation. Orthology is the relationship between two genes where any two distinct ancestors taken at a point in time are always orthogonal, that is, they belong to distinct ancestral species. Paralogy is the relationship between two genes where there exist two distinct ancestors taken at a point in time that are parallel, that is, they belong to the same ancestral species. By considering the co-occurrence of transcript ancestors in the same ancestral genes, the definitions of orthology and paralogy can be extended to pairs of transcripts.
orthologs if their LCA in G is a speciation, that is,
recent paralogs if
ancient paralogs otherwise.
Similarly, two distinct transcripts t1 and t2 of
ortho-orthologs if their LCA in T is a speciation, that is,
para-orthologs if
recent paralogs if
ancient paralogs otherwise.
For example, in Figure 1,
The following lemma clarifies the link between the homology relationships defined at the levels of genes and transcripts.
Proof. Let
Therefore,
Therefore, 
Figure 1 provides an example where a3 and b2 (respectively, a1 and a2) are orthologous (respectively, paralogous) genes, yet their transcripts
The aim of introducing the definitions ortho-orthologs and para-orthologs is to distinguish the orthology relationship at the transcript level from the homology relationship between the host genes. Thus, ortho-orthologs are orthologous transcripts that belong to orthologous genes, while para-orthologs belong to paralogous genes. Introducing this difference allows us to assess the ortholog conjecture at the transcript level, that is, to compare the conservation of transcripts between orthologous genes and between paralogous genes. The ortholog conjecture assumes that orthologous genes evolve functions more slowly than paralogous genes. Thus, it assumes that the proportion of orthologous transcripts between orthologous genes should be higher than that between paralogous genes.
The following lemma shows that two recent paralog genes in
Proof. If t1, t2, and t3 are in the configuration of (1), then they must satisfy
The key assumption of our method is that after a transcript creation event, the newly created transcript tends to diverge from the original transcript from which it was modified, whereas the original transcript tends to remain conserved. This is analogous to the inference of gene ortholog groups using graph-based methods, where isoorthologous gene pairs are considered (Altenhoff et al., 2013; Lafond et al., 2018; Li et al., 2003; Swenson and El-Mabrouk, 2012). Note that the conservation or divergence between transcripts is based on the comparison of the transcripts' exon content, and not on their nucleotide sequences.
Therefore, given a creation node t in the LCA-reconciliation of T with G, one of its edges descending to its children, say
Notice that the isoorthology relation is transitive, which allows
For example, in Figure 1,
Based on the partition of
For example, in Figure 2, a decomposition of the transcript tree T of Figure 1 into 3 ortholog trees
Decomposition of the transcript tree T into three ortholog trees T1, T2, and T3 (respectively, illustrated in red, blue, and green colors) for a set of homologous transcripts 
We now present a graph-based method for the inference of isoorthology and recent paralogy relations between the transcripts of a set
Pairwise similarity score between transcripts
A gene
Figure 3 shows an illustration.

A multiple sequence alignment
and all the transcripts in
, and then mapping each transcript
In the remainder, we denote by
and all the transcript sequences in
and m is the number of columns of the alignment.
Following the block-based model used in Ouangraoua et al. (2012) to represent transcripts, a multiple sequence alignment
is partitioned into a set of non-overlapping blocks of columns as follows:
.
For any block B of
For example, in Figure 3, the alignment is decomposed into 9 blocks.
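To make the block decomposition concrete, here is a minimal sketch under an assumed rule: a new block starts at every column where the presence/absence (gap) pattern of any sequence changes, so that each sequence either fully covers a block or is fully absent from it. This rule, the function names, and the toy alignment are assumptions made for illustration and may differ in detail from the formal definition above.

```python
def gap_pattern(alignment, j):
    """Presence (True) or gap (False) of every sequence at column j."""
    return tuple(row[j] != "-" for row in alignment)

def decompose_into_blocks(alignment):
    """Split columns into maximal runs with a constant gap pattern.

    Returns half-open column intervals (start, end).
    """
    if not alignment:
        return []
    m = len(alignment[0])
    if any(len(row) != m for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], 0
    for j in range(1, m + 1):
        # Close the current block at the end or when the pattern changes.
        if j == m or gap_pattern(alignment, j) != gap_pattern(alignment, j - 1):
            blocks.append((start, j))
            start = j
    return blocks

def block_representation(row, blocks):
    """Indices of the blocks in which an aligned sequence is present."""
    return [i for i, (start, _end) in enumerate(blocks) if row[start] != "-"]

# Toy alignment of three sequences (hypothetical data).
alignment = ["ATG---CC",
             "ATGAAACC",
             "---AAACC"]
blocks = decompose_into_blocks(alignment)
reps = [block_representation(row, blocks) for row in alignment]
```

Here the eight columns split into three blocks, and each sequence is represented by the list of block indices it covers, which is the kind of chain-of-blocks representation used in the rest of the section.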
The following lemma and definition allow us to define a block-based representation for transcripts and genes, such that a transcript or gene is represented as the chain of blocks composing it.
Proof. Trivial, by definition of blocks.
Similarly, for each gene
The following lemma clarifies the link between the block-based representation of a gene and the block-based representations of its transcripts: the blocks composing the representation of a gene are the union of the blocks composing the representations of its transcripts.
Proof. Let B be a block contained in
We are now ready to provide the definition of two transcript similarity measures. The first similarity measure considers all blocks present in the representation of either gene. The second similarity measure only considers blocks that are present in the representations of both genes.
which contains the blocks of t1 shared with
The corrected similarity score between t1 and t2 equals:
If the weights associated to the blocks are unitary, the similarity score between two transcripts t1 and t2 is the ratio between the number of blocks shared by the two transcripts and the number of blocks in at least one of the two transcripts. However, for the corrected similarity score, the blocks which are contained in the symmetric difference of
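The two measures can be illustrated with a short sketch in which a transcript or gene is a set of block identifiers. The uncorrected score follows the description above (weight of shared blocks over weight of blocks in at least one transcript). The corrected score is written in an assumed form: blocks that are not present in both gene representations are ignored before computing the same ratio. The names `tsm` and `tsm_corrected` and the toy data are hypothetical.

```python
def total_weight(blocks, weights=None):
    """Sum of block weights; unitary weights when `weights` is None."""
    if weights is None:
        return len(blocks)
    return sum(weights[b] for b in blocks)

def tsm(t1, t2, weights=None):
    """Uncorrected score: shared blocks over blocks in at least one transcript."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        return 0.0
    return total_weight(t1 & t2, weights) / total_weight(union, weights)

def tsm_corrected(t1, t2, g1, g2, weights=None):
    """Corrected score (assumed form): keep only blocks present in both genes."""
    common = set(g1) & set(g2)
    return tsm(set(t1) & common, set(t2) & common, weights)

# Toy example: block 0 exists only in gene g1, block 4 only in gene g2.
g1, g2 = {0, 1, 2, 3}, {1, 2, 3, 4}
t1, t2 = {0, 1, 2}, {1, 2, 4}
lengths = {0: 30, 1: 60, 2: 90, 3: 10, 4: 20}  # hypothetical block lengths

unit_score = tsm(t1, t2)                       # 2 shared / 4 in the union
length_score = tsm(t1, t2, lengths)            # 150 / 200
corrected_score = tsm_corrected(t1, t2, g1, g2)
```

With unitary weights the two transcripts score 0.5, but once the gene-specific blocks 0 and 4 are discarded they become identical, which shows how the corrected measure avoids penalizing differences that come from the genes rather than from the transcripts.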
We now describe the method to identify pairs of recent paralogs and putative pairs of isoorthologs based on pairwise similarity scores between transcripts. The method relies on the RBH approach. The underlying idea is that two isoorthologous transcripts (para- or ortho-isoorthologs) from homologous genes g1 and g2 should be the most conserved compared with other pairs of transcripts from the same genes. Likewise, two recent paralogous transcripts
• or there exists a third transcript t3 of
Using RBH to define putative isoorthologs makes the method more robust to transcript loss or incomplete transcript annotation in some genes. The next step is to define an orthology graph whose set of vertices is
(1) if (2) if
The aim of Algorithm 1 is to obtain an orthology graph for
Proof. In Algorithm 1, after the first for loop, the set of edges contains only recent paralogy edges. Therefore, at this point, the graph is an orthology graph. In the remainder of the algorithm, an isoorthology edge is added if and only if it preserves the property that the graph is an orthology graph.
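The RBH step and the grouping by transitivity can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it selects reciprocal best hits between the transcripts of two genes and then takes connected components, without the check that each added isoorthology edge preserves the orthology graph property and without recent paralogy edges. The function names, the Jaccard similarity used as a stand-in score, the positive-score requirement, and the toy data are all assumptions.

```python
def jaccard(a, b):
    """Stand-in similarity between two sets of blocks."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reciprocal_best_hits(ts1, ts2, sim):
    """Pairs (t1, t2) in which each transcript is a best hit of the other."""
    if not ts1 or not ts2:
        return []
    best1 = {t1: max(sim(t1, t2) for t2 in ts2) for t1 in ts1}
    best2 = {t2: max(sim(t1, t2) for t1 in ts1) for t2 in ts2}
    return [(t1, t2) for t1 in ts1 for t2 in ts2
            if sim(t1, t2) > 0 and sim(t1, t2) == best1[t1] == best2[t2]]

def connected_groups(vertices, edges):
    """Group vertices linked by a path of edges (union-find)."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# Toy data: two genes with two transcripts each (hypothetical block sets).
blocks = {"a1": {0, 1, 2}, "a2": {0, 2}, "b1": {0, 1, 2}, "b2": {2, 3}}
sim = lambda x, y: jaccard(blocks[x], blocks[y])
rbh = reciprocal_best_hits(["a1", "a2"], ["b1", "b2"], sim)
groups = connected_groups(["a1", "a2", "b1", "b2"], rbh)
```

In this example only `a1` and `b1` are reciprocal best hits: `a2` prefers `b1`, but `b1` prefers `a1`, so `a2` and `b2` remain in singleton groups. This illustrates why RBH tolerates a missing or unannotated counterpart without forcing a match.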
Given a multiple sequence alignment
and m is the number of columns, the decomposition of the multiple sequence alignment
RESULTS AND DISCUSSION
This study introduces the notion of orthology and paralogy at the transcript level. Previous studies have primarily focused on identifying conserved transcripts between one-to-one orthologous genes. Therefore, there has been no prior research on computing transcript ortholog groups. An exception is the work of Guillaudeux et al. (2022), who proposed a formal definition of splicing structure orthology and an algorithm used to predict transcript orthologs in human, mouse, and dog.
While our algorithm is based on the identification of putative isoorthologous transcripts, taking into account both structural similarity in terms of exon composition and the evolutionary context, the method of Guillaudeux et al. (2022) aims to identify structurally conserved transcripts without considering the evolutionary relationships between them. It starts by identifying conserved functional sites, such as donor and acceptor splice sites, start and stop codons, and exon blocks among three orthologous genes. Given the orthology relationships between functional sites, orthologous transcripts are then defined as transcripts sharing a conserved splicing structure.
In this section, we first compare the results of our ortholog group inference method with true ortholog groups obtained using a simulation tool (Kuitche et al., 2019). Next, we compare our method with the approach of Guillaudeux et al. (2022). Finally, we analyze the proportions of ortho-orthologs, para-orthologs, and recent paralogs predicted with our method.
Comparison with simulated true ortholog groups
We generated simulated ortholog groups using
Using the default parameters of
Description of the 50 Data Sets Simulated Using SimSpliceEvol
For each gene family, the information contains the number of genes, the number of transcripts, the number of transcript ortholog groups, the number of pairwise ortho-ortholog relations, the number of pairwise para-ortholog relations, and the number of pairwise recent paralog relations.
We compared the 271 true transcript ortholog groups with ortholog groups obtained using 12 different settings of our method, considering the following variations: (1) Sequence alignments computed using MACSE (Ranwez et al., 2018), or the true alignment provided by SimSpliceEvol; (2) Unitary weights for alignment blocks, or weights corresponding to their lengths, or the mean of the two similarity scores; (3) The corrected or the uncorrected transcript similarity measure. For the assessment, we considered three performance measures: (i) The homogeneity score, which computes the ratio of pairs of transcripts predicted in the same group that are truly in the same group to pairs of transcripts predicted in the same group; (ii) The completeness score, which computes the ratio of pairs of transcripts predicted in the same group that are truly in the same group to pairs of transcripts that are truly in the same group; (iii) The V-measure, which is the harmonic mean between the scores of homogeneity and completeness.
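The three measures, as defined in this paragraph in terms of co-clustered transcript pairs, can be computed as in the sketch below. The function names and the toy groups are hypothetical, and the values returned for empty pair sets are a convention chosen for this example only. Note that these pair-based definitions are not the same as the entropy-based scores that some libraries expose under the same names.

```python
from itertools import combinations

def co_clustered_pairs(groups):
    """Set of unordered pairs of items placed in the same group."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def pairwise_scores(predicted, truth):
    """Pair-based homogeneity, completeness, and V-measure."""
    pred = co_clustered_pairs(predicted)
    true = co_clustered_pairs(truth)
    shared = len(pred & true)
    # Convention for this sketch: an empty denominator gives a score of 1.0.
    homogeneity = shared / len(pred) if pred else 1.0
    completeness = shared / len(true) if true else 1.0
    total = homogeneity + completeness
    v_measure = 2 * homogeneity * completeness / total if total else 0.0
    return homogeneity, completeness, v_measure

# Toy example: the prediction merges t3 into the wrong group.
truth = [["t1", "t2"], ["t3", "t4"]]
predicted = [["t1", "t2", "t3"], ["t4"]]
h, c, v = pairwise_scores(predicted, truth)
```

Here one of the three predicted pairs is correct (homogeneity 1/3) and one of the two true pairs is recovered (completeness 1/2), so over-clustering shows up as homogeneity lower than completeness.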
Figure 4 provides the details of the score distributions for all settings. It shows that the highest V-measure scores (0.8) are achieved with the two settings that use the true sequence alignments or MACSE alignments, the unitary weights for block alignments, and the uncorrected transcript similarity measure. The scores obtained using the true sequence alignments or MACSE alignments are very close, highlighting the quality of alignments obtained using MACSE. The figure also shows that the uncorrected transcript similarity measure consistently outperforms the corrected one. Finally, our method consistently displays a tendency to cluster transcripts more than the true ortholog groups, as evidenced by the homogeneity score being lower than the completeness score for all settings. However, the high V-measure score obtained with our best setting (MACSE + unitary weights + uncorrected similarity measure) demonstrates the capacity of our method to recover transcript ortholog groups that are highly similar to the true ortholog groups.

The distributions of homogeneity, completeness, and V-measure scores for our ortholog group predictions on the simulated data set of 50 gene families. The ortholog groups are obtained using the 12 different settings of our method, including 2 options for the transcript similarity measure (the uncorrected similarity score tsm or the corrected similarity score tsm+), 2 options for the sequence alignment (MACSE or true), and 3 options for the weights associated with blocks (unitary, length, or mean). Most mispredicted transcript ortholog groups (outliers) either fail to recover true recent paralogs, which are wrongly placed in different ortholog groups, or contain false isoorthologs whose structures are not identical but are close enough to be supported by the transcript similarity measure.
We compared the results of Guillaudeux et al.'s (2022) approach with those of our method. Their data set is publicly available and includes 253 triplets of one-to-one orthologous genes from 236 gene families in the Ensembl-Compara database. Their method predicted 879 transcript ortholog groups, covering a total of 1896 transcripts.
We compared the 879 orthologous groups of Guillaudeux et al. (2022) with orthologous groups obtained using 12 different settings of our method. We used the same settings as in the comparison with true ortholog groups, except that we replaced the true sequence alignments, which are not available here, with alignments obtained using Kalign (Lassmann and Sonnhammer, 2005). For each version of our method, we used our results as the predictions, and Guillaudeux et al.'s (2022) 879 orthologous groups as the ground truth.
For the assessment, we used the homogeneity score, the completeness score, and the V-measure score.
Figure 5 provides the details of the score distributions for all settings. The high scores obtained show that our results are in agreement with those of Guillaudeux et al. (2022). In particular, the best scores are achieved again with the setting that uses the MACSE sequence alignments, unitary weights for block alignments, and the uncorrected transcript similarity measure. In concordance with the previous results obtained for the comparison with true ortholog groups, our method consistently displays an inclination to cluster transcripts more, as evidenced by the homogeneity score consistently being lower than the completeness score in all settings.

The distributions of homogeneity, completeness, and V-measure scores for our predictions on the 253 triplets of orthologous genes from Guillaudeux et al. (2022). Our ortholog groups are obtained using the 12 different settings of our method, including 2 options for the transcript similarity measure (the uncorrected similarity score tsm or the corrected similarity score tsm+), 2 options for the sequence alignment (MACSE or Kalign), and 3 options for the weights associated with blocks (unitary, length, or mean).
The detailed comparison of the groups obtained by the Guillaudeux et al. (2022) method and the best-performing setting of our method shows that 495 clusters of 166 families are exactly the same for the two methods. Figure 6 shows the number of recent paralogs per species and the number of ortho-isoorthologs per species pair predicted by the two methods. Our method predicts 1896 ortho-isoorthologs and 466 recent paralogs compared with Guillaudeux et al.'s (2022) method, which predicts 1408 ortho-isoorthologs and 395 recent paralogs. Our method also finds all recent paralogs inferred by Guillaudeux et al. (2022). This observation is consistent with the previous observation that our method tends to cluster transcripts more than Guillaudeux et al.'s (2022) method.

We randomly selected 20 gene families composed of genes from 6 species: human, mouse, dog, dingo, cow, and chicken. Table 2 describes the data set. We used the best-performing setting of our method, using MACSE to compute the multiple sequence alignment, unitary-weights similarity scores, and the uncorrected transcript similarity measure. From a total of 1402 transcripts, we identified 236 ortholog groups. Table 3 shows the ratio of isoorthologs and recent paralogs found between transcript pairs. In this experiment, we could identify para-isoorthologs because the data set contains paralogous genes. Table 3 also shows the ratio of ortho-isoorthology relations normalized by the ratio of gene orthology relations and the ratio of para-isoorthology relations normalized by the ratio of gene paralogy relations. When the normalized ratio of ortho-isoorthology (respectively, para-isoorthology) is greater than 1, it means more relations were predicted than expected given the ratio of gene orthology (respectively, paralogy) relations to all gene pairs.
Description of the Data Set of 20 Gene Families
The total number of genes, the total number of transcripts, and the numbers per species are given.
Comparison of the Ratio of Isoorthologs and Recent Paralogs Found Between Transcript Pairs in the 20 Real Gene Families
For each gene family, the provided information includes: (1) the number of predicted ortholog groups, (2) the ratio of the number of isoortholog pairs to the total number of transcript pairs, (3) the ratio of the number of recent paralog pairs to the total number of transcript pairs, (4) the ratio of the number of ortho-isoortholog pairs to the total number of isoortholog pairs, divided by the ratio of the number of gene ortholog pairs to the total number of gene pairs, and (5) the ratio of the number of para-isoortholog pairs to the total number of isoortholog pairs, divided by the ratio of the number of gene paralog pairs to the total number of gene pairs.
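The normalized ratios in items (4) and (5) are simple arithmetic, sketched below. The function name and all counts are made up for illustration and are not taken from the tables.

```python
def normalized_ratio(n_iso_of_type, n_iso_total, n_gene_pairs_of_type, n_gene_pairs_total):
    """Share of isoortholog pairs of one type, relative to the share of gene pairs of that type."""
    observed = n_iso_of_type / n_iso_total
    expected = n_gene_pairs_of_type / n_gene_pairs_total
    return observed / expected

# Hypothetical family: 15 gene pairs (12 orthologous, 3 paralogous) and
# 40 isoortholog pairs (30 ortho-isoorthologs, 10 para-isoorthologs).
ortho_ratio = normalized_ratio(30, 40, 12, 15)  # 0.75 / 0.80
para_ratio = normalized_ratio(10, 40, 3, 15)    # 0.25 / 0.20
```

In this made-up family the para-isoorthology ratio exceeds 1 while the ortho-isoorthology ratio is below 1, meaning that isoorthologs are found between paralogous genes more often than the proportion of paralogous gene pairs alone would suggest.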
In 14 of the 20 families, the normalized para-isoorthology ratio is greater than 1 and greater than the normalized ortho-isoorthology ratio. Therefore, it seems that isoorthologous transcripts tend to be more frequent between paralogous genes than between orthologous genes. This is consistent with previous studies that have found evidence against the ortholog conjecture in the context of gene function prediction by transferring annotations between homologous genes (Stamboulian et al., 2020). The ortholog conjecture proposes that orthologous genes should be preferred when making such predictions because they evolve functions more slowly than paralogous genes. Our results suggest that both orthologs and paralogs should be considered to achieve higher prediction accuracy. However, this interpretation should be taken with caution, as there may be errors in the orthology/paralogy relationships between genes.
Alternative splicing, alternative promoters, and other processing mechanisms can result in the production of multiple transcript isoforms from a single gene, each potentially having distinct functions. Thus, understanding how the sets of transcripts produced from genes evolve is crucial for the functional annotation of genes and genomes. In addition, changes in the structure and expression of transcripts can contribute to various diseases, including cancers and genetic disorders. Therefore, studying transcriptome changes helps identify potential disease markers and understand the molecular basis of pathologies. Furthermore, the evolution of transcripts is intricately related to the evolution of the genes from which they are produced. Therefore, the study of transcript evolution is also valuable in phylogenetic analyses to infer the evolutionary relationships between genes and species. This is particularly important for reconstructing the tree of life and understanding the evolutionary history of organisms.
A fundamental step in studying the evolution of alternative transcripts is categorizing homology relationships among homologous transcripts within gene families. The inferred relations can then be used as a basis to reconstruct a transcript phylogeny that agrees with the input relations. Moreover, identifying orthology relationships and paralogy relationships between transcripts is important for understanding the function of transcripts conserved between multiple genes and species, as well as comprehending the gain of functions in newly created transcripts. The definition of orthology and paralogy relations between transcripts also establishes a basis for correcting gene trees.
Indeed, as most current gene tree construction methods rely on incomplete data and consider a single reference transcript to represent each gene, they often produce trees that contain errors. Considering the whole set of transcripts produced from each gene, and orthology and paralogy relations between those transcripts, is likely to produce more accurate gene trees. Finally, defining orthology and paralogy relations between transcripts is crucial for understanding the relation between the evolution of transcripts under the creation and loss evolution model and the evolution of gene functions.
The work presented in this study formally revisits the concepts of orthology, paralogy, and isoorthology at the transcriptome level. It provides a greedy algorithm to compute clusters of transcript orthologs, consisting of transcripts that are isoorthologs, recent paralogs, or related through a path of isoorthology and recent paralogy relations. The results of the method are consistent with those obtained from methods that compare orthologous genes to identify conserved transcripts. Notably, when comparing the proportions of conserved transcripts between paralogous genes and between orthologous genes, it appears that the proportion between paralogous genes is substantial compared with that between orthologous genes. This contradicts the “ortholog conjecture” and argues for considering both orthologs and paralogs in gene function prediction, rather than using only orthologous genes. It also calls for further studies of the intertwined evolution of transcripts and genes within gene families.
The method offers numerous possibilities for improvement and extension. First, the quality of the multiple sequence alignment and the transcript similarity measure are crucial factors for the quality of orthology relations inference. In particular, the number of blocks obtained by decomposing the multiple sequence alignment is strongly reliant on the quality of the alignment and also increases in proportion to the alignment length. The results reveal that Kalign alignments tend to produce a larger number of blocks than MACSE alignments. In future work, we will explore various methods for comparing transcripts and revisit our approach for the decomposition of alignments into blocks. For instance, in the definition of a multiple sequence alignment decomposition into blocks, the rule that each gap opening in the alignment results in the start of a new block can be relaxed to allow short gaps to be included in blocks. This will lead to less fragmented alignments.
In addition, alternative algorithms for computing clusters of ortholog transcripts, provided putative pairs of isoorthologs, will be investigated. Moreover, as our method appears to be less stringent compared with existing methods, especially in the identification of recent paralogy relationships, it will be important to refine the method for computing recent paralogy relationships. For instance, more stringency in the definition of recent paralogs can be achieved by forbidding a transcript to be in a recent paralogy relationship if it has a putative isoortholog in another gene. As for the definition of putative pairs of isoorthologs, one can consider triplets of genes instead of two genes at a time to obtain a more consistent method.
Finally, we will extend the method to infer all pairwise relations between homologous transcripts originating from an ancestral transcript. To achieve this, the ortholog groups can be used as a basis to compute complete transcript phylogenies using supertree approaches. The goal will be to combine the ortholog subtrees into a complete transcript phylogeny that minimizes the reconciliation with the gene tree while preserving the orthology relationships between the transcripts of input ortholog trees. The resulting complete transcript tree can then be reconciled with the gene tree to label the internal nodes of the transcript phylogeny and deduce the orthology or paralogy relations between transcripts. The accuracy of the ortholog group predictions can then be assessed by comparing the annotated functions of the proteins corresponding to transcripts grouped in ortholog groups, as shown in Figure 7.

(Top) MSA of the transcripts from the JNK3 gene family. The ortholog groups inferred by our method are indicated in red and green, following the color code (respectively, yellow and green) of Ait-Hamlat et al. (2020). (Bottom) A graph in which the GO terms are shown as black nodes and the transcripts of the family as yellow and green nodes, corresponding to their colors in the MSA, with an edge between a transcript node and a GO term node if the transcript is annotated with the GO term. We observe that the GO annotations of the transcripts do not allow the transcripts of the gene family to be separated into two functional groups, whereas the transcript sequence alignment clearly distinguishes two groups of transcripts. GO, Gene Ontology; MSA, multiple sequence alignment.
ACKNOWLEDGMENTS
We thank the reviewers for their valuable suggestions and the CoBIUS laboratory at the University of Sherbrooke for helpful and constructive discussions.
AUTHORS' CONTRIBUTIONS
W.Y.D.D.O. and A.O. conceived the study and the algorithms. W.Y.D.D.O. implemented the algorithm, generated the simulated data, collected the real data, analyzed and interpreted the results, and wrote a draft of the article. A.O. critically reviewed the article.
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicting financial interests.
FUNDING INFORMATION
This work was supported by the Canada Research Chair (CRC Tier 2 Grant 950-230577) and the Natural Sciences and Engineering Research Council of Canada (NSERC Discovery Grant RGPIN-2023-05474).
