Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting

Abstract

This study assesses characteristics of the normalized compression distance (NCD) technique for building phylogenetic trees from molecular data. We examined results from a mammalian biological data set as well as a collection of simulated data with varying levels of incomplete lineage sorting. The implementation of NCD we analyze is a concatenation-based, distance-based, alignment-free, and model-free phylogeny estimation method, which takes concatenated unaligned sequence data as input and outputs a matrix of distances. We compare the NCD phylogeny estimation method with various other methods, including coalescent- and concatenation-based methods.

1. INTRODUCTION

Phylogenetic trees are hypotheses about evolutionary relationships among species. Molecular phylogenetics analyzes DNA or amino acid sequences to build these trees. The approaches used are broadly categorized as either model-based statistical estimation, based on a statistical model in which all sites evolve identically and independently (i.i.d.) down a single model tree, or distance-based method, based on pairwise differences among sequences (Warnow, 2018). The most common approach for phylogeny estimation is concatenated analysis using maximum likelihood (CA-ML), in which alignments for each locus are aggregated into a supermatrix and a species tree is estimated using a maximum likelihood (ML) method such as RAxML (Stamatakis, 2006).

In addition to concatenation-based analyses are coalescent-based analyses, which represent each gene by a single tree. Another commonly used model-based statistical method is Bayesian inference, as implemented in MrBayes (Ronquist and Huelsenbeck, 2003). Both ML and Bayesian inference work by calculating likelihoods of trees based upon explicit parametric mathematical models of evolution. Distance-based methods produce a dissimilarity matrix of pairwise difference scores, which is input to a program that builds the tree. An example of this is the “neighbor joining with p distance (NJp)” method, where the p-distance between two sequences is the normalized Hamming distance, and the neighbor joining (NJ) (Saitou and Nei, 1987) method translates the matrix into a tree (Yoshida and Nei, 2016).

Yoshida and Nei (2016) investigated NJ with uncorrected p distances, which means that the distances were not adjusted according to a substitution model like Jukes–Cantor. Their approach quickly constructed a reliable tree for large data sets. NJ trees are commonly constructed using corrected distance but, in the case of p distances and NJ (NJp), the standard error of corrected evolutionary distances is larger than those of uncorrected p distances.

Backed by increasingly more powerful computers, numerous studies (Guindon and Gascuel, 2003; Kuhner and Felsenstein, 1994) have claimed that ML and Bayesian inference are very effective at determining the true tree. However, Yohida and Nei (2016) concluded in their study that the uncorrected NJp distance-based method is generally better than ML and Bayesian methods in obtaining the correct tree topology. In the same vein, we claim that the distance-based method using normalized compression distance (NCD) is at least as good at deriving correct topologies as the model-based approaches.

For methods such as Bayesian inference and ML, one selects a particular evolutionary model to establish site substitution probabilities. This requirement forces assumptions to be made about biological sequence data. These assumptions can be problematic, affecting the accuracy of resulting trees, particularly when analyzing whole genomes or compositional genes where a single statistical model is unlikely to be applicable to the entire data set (Roch and Steel, 2015; Yoshida and Nei, 2016).

With an overwhelming amount of sequence data now available, analyzing such using the Bayesian Markov Chain Monte Carlo (MCMC) technique and reaching a stationary distribution takes very long running times. This limits such analyses to smaller sequence data sets, typically under a thousand sequences (Warnow, 2018). ML implementations such as FastTree-2 (Morgan et al, 2010) are capable of analyzing data sets of a thousand sequences in under an hour and much larger data sets in the order of days, but is not designed for longer sequences (Liu et al, 2011; Warnow, 2018).

Bayesian phylogenetic inference and ML programs rely on computing the Phylogenetic Likelihood Function, which dominates a vast majority of the execution time and memory requirements. The users of RAxML have found that memory shortages are the main limiting factor for large scale phylogenetic analysis (Izquierdo-Carrasco et al, 2011). For example, a ML analysis for 48 avian species, using a concatenated alignment of ∼41.8 million base pairs, took ∼200 CPU years and a terabyte of memory to find the local optimum (Jarvis et al, 2014; Warnow, 2017).

These two statistical phylogeny estimation methods are approximating solutions to NP-Hard problems and that is computationally expensive, quickly becoming infeasible as input sizes rise. This comes not only because more taxa are being sequenced, but also due to research that considers the evolution of compositional or mosaic genes, rather than considering only one homologous region of DNA for a set of species (Yoshida and Nei, 2016). Izquierdo-Carrasco et al (2011) discuss techniques for trading reduced memory for increased running time, but given the ever-growing amount of available sequence data, this trade-off is not a solution.

Most phylogeny estimation methodologies, whether statistical or distance based, require alignment of input sequence data through a multiple sequence alignment (MSA) tool. Methods such as NJp, ML, and Bayesian inference make use of such a tool. Many of the same problems that plague statistical estimation methods recur in alignments. The methodology of all sequence alignment software is founded on the same underlying algorithm, which attempts to look for correspondence between sites in the same ordering for two or more sequences.

There are inherent problems with this process. Zielezinski et al (2017) discuss five situations wherein alignment-based methods might not work well. For example, these methods assume low variability among homologous sequences and, therefore, produce tools ill-equipped to deal with genomes that exhibit great variation. They also struggle with those that share a low common identity, such as viral genomes.

In addition, the problem of finding a globally optimal MSA is NP-complete (Wang and Jiang, 1994). Consequently, MSA is extremely costly in terms of computational complexity, and can only be approximately solved using heuristic techniques, which can introduce noise and mistakes at the start of the alignment (Daugelaite et al, 2013). Liu et al (2011) report limitations of popular MSA tools and note that “[A]lignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates.”

Many additive approaches have been developed over time to mitigate the tradeoffs of time, space, and accuracy for MSA and phylogeny construction methods. We suggest an alternative approach to further develop distance-based alignment-free tree estimation methods.

The implementation of the NCD phylogeny estimation method used in this article is a concatenated, alignment-free, and distance-based analysis using uncorrected distances. Given n sequences, distance-based phylogeny estimation methods rely on computing all pairwise distances among the sequences. The distances are often normalized so the output of this computation is an $n \times n$ dissimilarity matrix with values in the range 0 to 1. Sometimes these distances are computed on aligned sequences but this is not necessary. This matrix is input to one of a number of very fast algorithms to produce a tree.

1.1. Normalized compression distance

Our dissimilarity matrices contain NCDs. The NCD technique is an algorithmic approximation of the normalized information distance (NID) measure. Informally, a measure of the similarity between two strings, over a finite alphabet, is how much their concatenation can be losslessly compressed relative to how much each string can be individually compressed. The idea is that similar strings share structure a compressor can use to minimize the number of bits needed to represent the concatenated strings.

The theoretical minimum compressed size of a string s is its Kolmogorov complexity (Li and Vitányi, 2008), which is the length of the shortest program p such that, when run, produces s as output. We denote this size $K (s)$ .

Based on this, Li et al (2001) defined NID: $N I D (s_{0}, s_{1}) = \frac{K (s_{0} s_{1}) - min (K (s_{0}), K (s_{1}))}{max (K (s_{0}), K (s_{1}))} .$

Because the Kolmogorov complexity of a string is uncomputable (Li and Vitányi, 2008), the NID cannot be computed exactly. But it can be approximated from above by selecting a compression algorithm A and replacing $K (s)$ by $C_{A} (s)$ , which is the size of the bit string resulting from applying A to s. This new distance is called the NCD Cilibrasi and Vitányi (2005): $N C D (s_{0}, s_{1}) = \frac{C (s_{0} s_{1}) - min (C (s_{0}), C (s_{1}))}{max (C (s_{0}), C (s_{1}))} .$

Note that the NCD is always non-negative and, in general, less than or equal to 1. A small NCD for two strings indicates that they are similar. We assume lossless compression so, for example, one may use a compressor based on the Huffman algorithm, the various Lempel–Ziv algorithms, or a compressor designed specifically for a particular kind of data.

This article is inspired by research from several studies, including the already mentioned study by Yoshida and Nei (2016), wherein they argue, in the context of compositional gene analyses, that the accuracy of trees produced from uncorrected NJp was essentially the same as ML and Bayesian trees while being significantly more efficient. With that, we are emboldened to bring the NCD technique into the same conversation.

A number of authors have used NCD to construct phylogenetic trees. Li et al (2001) analyzed the whole mitochondrial genomes of 20 mammalian species and produced trees that compare favorably with those generated using other techniques. Cilibrasi and Vitányi (2005) applied the technique to analyzing 24 mammalian whole mitochondrial genomes. A more recent analysis appears in Cilibrasi and Vitányi (2020), where they classify viral sequences from SARS-CoV-2 and other viral sequences.

The results already mentioned analyze sequences taken from nature. To our knowledge, the accuracy of tree estimation using NCD on simulated sequences has not yet been explored. One advantage of simulation is that one is able to generate the sequences to have varying levels of incomplete lineage sorting (ILS). In ILS, copies of a gene in two species coalesce further back in time than the most recent common ancestor. This is a particularly challenging problem for species tree estimation because this can lead to a species tree that differs from the gene tree.

We compare NCD with other methods on both simulated and biological data to address the following questions:

1.
How does NCD, with uncorrected distances, compare with statistically consistent methods?
2.
How does NCD compare with concatenation and coalescent-based methods?
3.
How does NCD perform in the presence of ILS?

We ran NCD using three different compressors to produce dissimilarity matrices. These matrices were input into FastME, which infers species trees using the BioNJ technique (Gascuel, 1997; Lefort, 2015). In addition, we compare the output species trees with previously published trees (Chou et al, 2015; Rusinko and McPartlon, 2017) computed using ASTRAL-2, NJst, RAxML, SVDquartets+PAUP*, and NJp on the same simulated data set. With all bifurcating species trees (i.e., fully resolved trees), we evaluate species tree estimation methods using the Robinson–Foulds (RF) (Robinson and Foulds, 1981) error rate, also known as the normalized bipartition distance.

For biological data, we used a well-studied mammalian biological data set with 37 taxa from Song et al (2012), used also in Chou et al (2015), and compared the resulting species tree with previously published phylogenies (Chou et al, 2015; Mirarab et al, 2014; Nikaido et al, 2020; Song et al, 2012). Note that where the previous articles used aligned versions of these sequences, we used unaligned sequences.

Chou et al (2015) compared the performance of various estimation methods on simulated sequences for which one can adjust the ILS level. We used their 11-taxon data sets so that we could compare our results against theirs.
2. METHODS

2.1. Data sets

In this study we examine the accuracy of NCD when using one of three different compressors: 7-Zip Pavlov, MFCompress (Pinho and Pratas, 2013), and PPMZ (Bloom, 1998). We explore results on two different data sets both of which are examined by Chou et al (2015), the first a 11-taxon simulated data set and the second a 37-taxon biological data set.

For the 11-taxon simulated data set, we compare the accuracy of species trees constructed using the NCD methodology with five other well-known tree reconstruction algorithms, as given in Table 1. The five additional methods that we compare NCD with are those examined by Chou et al (2015) and Rusinko and McPartlon (2017) on the same 11-taxon simulated data set. These include the methods ASTRAL-II, NJst, and SVD-Quartets+PAUP* that are statistically consistent under the multispecies coalescent model, as well as concatenation using NJp and ML, specifically RAxML (Stamatakis, 2006).

Table 1.
Comparison of the Mean Robinson Foulds Rates, Rounded to the Nearest Thousandth, for 50 Replicates

Model Sites/gene No. of genes AST-II RAxML SVD-Q NJst NJ + p 7ZIP MFC3 PPMZ9

M1 (15.5% ILS) 10 Sites 100 0.24 0.14^* 0.25 0.26 0.17 0.23 0.27 0.26

500 0.16 0.05^* 0.17 0.18 0.12 0.23 0.27 0.26

1000 0.13 0.04^* 0.13 0.14 0.12 0.23 0.27 0.27

200 Sites 100 0.05 0.03^* 0.08 0.06 0.12 0.23 0.27 0.27

500 0.04 0.02^* 0.07 0.05 0.11 0.23 0.27 0.27

1000 0.03 0.01^* 0.06 0.04 0.11 0.23 0.27 0.27

M2 (38.3% ILS) 10 Sites 100 0.27 0.22^* 0.33 0.31 0.20 0.26 0.29 0.29

500 0.16 0.10^* 0.23 0.17 0.13 0.28 0.31 0.30

1000 0.15 0.07^* 0.15 0.17 0.12 0.30 0.32 0.32

200 Sites 100 0.10 0.09^* 0.12 0.10 0.10 0.31 0.32 0.33

500 0.06 0.03^* 0.05 0.06 0.08 0.32 0.33 0.34

1000 0.06 0.03^* 0.05 0.05 0.08 0.33 0.34 0.34

M3 (66.3% ILS) 10 Sites 100 0.37 0.37 0.42 0.47 0.30^* 0.32 0.33 0.34

500 0.20 0.20 0.27 0.33 0.15^* 0.32 0.33 0.34

1000 0.13 0.14 0.20 0.22 0.11^* 0.32 0.33 0.34

200 Sites 100 0.15^* 0.19 0.20 0.15^* 0.16 0.32 0.33 0.33

500 0.07^* 0.1 0.15 0.07^* 0.07 0.31 0.33 0.33

1000 0.08 0.07^* 0.11 0.07^* 0.07^* 0.31 0.33 0.33

M4 (85.0% ILS) 10 Sites 100 0.66 0.68 0.75 0.78 0.56 0.31^* 0.32 0.33

500 0.37 0.44 0.46 0.57 0.32 0.30^* 0.32 0.32

1000 0.30 0.37 0.43 0.48 0.25^* 0.30 0.32 0.32

200 Sites 100 0.33 0.53 0.54 0.37 0.41 0.29^* 0.31 0.32

500 0.18^* 0.27 0.34 0.18^* 0.19 0.29 0.31 0.31

1000 0.14^* 0.22 0.28 0.17 0.14^* 0.29 0.31 0.31

Model	Sites/gene	No. of genes	AST-II	RAxML	SVD-Q	NJst	NJ + p	7ZIP	MFC3	PPMZ9
M1 (15.5% ILS)	10 Sites	100	0.24	0.14^*	0.25	0.26	0.17	0.23	0.27	0.26
		500	0.16	0.05^*	0.17	0.18	0.12	0.23	0.27	0.26
	1000	0.13	0.04^*	0.13	0.14	0.12	0.23	0.27	0.27
200 Sites	100	0.05	0.03^*	0.08	0.06	0.12	0.23	0.27	0.27
	500	0.04	0.02^*	0.07	0.05	0.11	0.23	0.27	0.27
	1000	0.03	0.01^*	0.06	0.04	0.11	0.23	0.27	0.27
M2 (38.3% ILS)	10 Sites	100	0.27	0.22^*	0.33	0.31	0.20	0.26	0.29	0.29
		500	0.16	0.10^*	0.23	0.17	0.13	0.28	0.31	0.30
	1000	0.15	0.07^*	0.15	0.17	0.12	0.30	0.32	0.32
200 Sites	100	0.10	0.09^*	0.12	0.10	0.10	0.31	0.32	0.33
	500	0.06	0.03^*	0.05	0.06	0.08	0.32	0.33	0.34
	1000	0.06	0.03^*	0.05	0.05	0.08	0.33	0.34	0.34
M3 (66.3% ILS)	10 Sites	100	0.37	0.37	0.42	0.47	0.30^*	0.32	0.33	0.34
		500	0.20	0.20	0.27	0.33	0.15^*	0.32	0.33	0.34
	1000	0.13	0.14	0.20	0.22	0.11^*	0.32	0.33	0.34
200 Sites	100	0.15^*	0.19	0.20	0.15^*	0.16	0.32	0.33	0.33
	500	0.07^*	0.1	0.15	0.07^*	0.07	0.31	0.33	0.33
	1000	0.08	0.07^*	0.11	0.07^*	0.07^*	0.31	0.33	0.33
M4 (85.0% ILS)	10 Sites	100	0.66	0.68	0.75	0.78	0.56	0.31^*	0.32	0.33
		500	0.37	0.44	0.46	0.57	0.32	0.30^*	0.32	0.32
	1000	0.30	0.37	0.43	0.48	0.25^*	0.30	0.32	0.32
200 Sites	100	0.33	0.53	0.54	0.37	0.41	0.29^*	0.31	0.32
	500	0.18^*	0.27	0.34	0.18^*	0.19	0.29	0.31	0.31
	1000	0.14^*	0.22	0.28	0.17	0.14^*	0.29	0.31	0.31

Each model contains the same 11 taxa (10 ingroup taxa and 1 outgroup taxon), and varies by level of ILS, ranging from very low (M1) to very high (M4).

A ^* on a given RF rate for a given entry of the table denotes the lowest rate achieved by the eight methods with respect to the model, number of genes, and number of sites per gene.

Standard error for every mean for each NCD method, when rounded to the thousandth, was 0.0.

ILS, incomplete lineage sorting; NJ, neighbor joining.

Table 2.

The Computing Time, in Minutes, of Neighbor Joining with p Distance and Each Normalized Compression Distance Analysis with the Compressors 7Zip, MFCompress, and PPMZ Used on the 37-Taxon Biological Data Set

	7ZIP	MFC3	PPMZ	NJ + p
Total time	15.16	21.43	297.97	0.34

The configuration for our current machine is AMD EPYC 7742 64-Core Processor 2.25 GHz with 2 Tb memory. NJp total time does not include MSA run-time.

NJp, neighbor joining with p distance.

The 11-taxon data set consists of 50 replicate species trees generated under each of the 4 different model conditions (M1, M2, M3, and M4) for a total of 200. The models vary by level of ILS, each of which contains gene sequence alignments for the same 11 taxa; the ILS levels between M1 and M4 range from very low (15.5% for M1) to very high (85.0% for M4). The ILS level of a data set (denoted by AD) was measured by the average topological distance between the model gene trees and the model species tree, where the distance between two trees is the RF rate, expressed as a percentage.

None of the data sets assume a strict molecular clock, and within each data set, gene sequence alignments differ by the number of sites sampled per gene, and the number of genes per sequence. For additional information on how the data were simulated, model specifications, as well as access to the original data used in this study, refer to Chou et al (2015).

The 37-taxon data set of mammals with 447 gene alignments was studied by Song et al (2012). In addition, it is used by Chou et al (2015), and as noted in Mirarab et al (2014), 23 of the 447 loci were removed because 21 contained mislabeled sequences and 2 were outliers. We ran a concatenated analysis on the 424 remaining unaligned gene sequences using each compressor with the NCD methodology. Of note, additional discrepancies within the data set are discussed in Gatesy and Springer (2016) and Springer and Gatesy (2016).

2.2. Evaluation metric

All the estimated species trees returned by the analyses are bifurcating (i.e., all internal nodes have degree 3). Hence, we report the RF (Robinson and Foulds, 1981) error rate, which is equal to the missing branch rate for bifurcating trees. The RF error rate was calculated along with the NCD phylogenies by the software package PhyloTools. The resulting replicate RF rates were then combined through a script. Both the software package and script are provided in our Supplementary Data S1. In addition, we assess branch confidence values of our computed dissimilarity matrices using the program REQ (gitlab.pasteur.fr/GIPhy/REQ).

This software estimates the rate of elementary quartets (REQs) for each branch of a given phylogenetic tree, from the associated distances d_ij, as described by Guénoche and Garreta (2001) and Gufienoche and Garreta (2001). The method calculates the proportion of four-leaf subtrees (i.e., quartets) induced by every internal branch that are supported by the four-point condition applied to the six corresponding pairwise evolutionary distances (Buneman, 1971; Zaretskii, 1965), as such the measure is not based on a random sampling. The closer this measure is to 1, the more the corresponding branch is fully supported by the pairwise evolutionary distances d_ij (Criscuolo, 2019).

3. RESULTS

3.1. Results on 11-taxon data sets

The advantage of the simulated data sets is that the ILS level could be controlled, with ILS level ranging from low (model M1) to very high (model M4). The results gathered from the simulated 11-taxon data set are given in Table 1. In previous studies (Chou et al, 2015; Rusinko and McPartlon, 2017), numerous tree estimation methods, based on both coalescent and concatenation, have been assessed using these data, including ASTRAL-II, RAxML, SVDquartets, NJst, and NJp.

The relative performance between methods did vary depending on the model conditions. Although, for all of the methods used to analyze these data in previous studies, the resulting tree estimation error rates decreased as the number of genes or sites increased, the error rates increased as the level of ILS increased, for all previously studied and NCD methods considered.

The results from Chou et al (2015) showed that CA-ML displayed excellent accuracy for the two lower ILS model conditions and then average accuracy on the two higher ILS model conditions. For models M1 and M2, NCD with each of the considered compressors has significantly higher error rates. However, for the highest levels of ILS, in models M3 and M4, and the smallest data sets, NCD produces comparable RF errors rates with the previously studied methods. For example, of the error rates from model M3 with 100 ten-site genes, only the NJ + p method achieved better accuracy than NCD.

Using the NCD technique, three different compressors were considered to produce results on the four simulated data sets. Each of the different compressors yields distinct results, but ultimately follow a similar pattern of relative constancy in the face of ILS and sequence size changes. For example, the largest range of distance values for the model M4 with a non-NCD method is [0.17–0.78] for the NJst method, whereas for an NCD method, it is [0.29–0.31] with the 7Zip compressor.

In addition, NCD with any of the three compressors out-performs the other non-NCD methods on the model with highest level of ILS (M4) for smaller sequence size data sets with the following site–gene combinations: (100 ten-site genes), (500 ten-site genes), and (100 two hundred-site genes). Among the three compressors used with NCD on the simulated data sets, all of them yielded similar results, with the 7Zip compressor resulting in the lowest error rates.

3.2. Results on biological data set

The phylogeny produced by NCD-7Zip on the unaligned 37-taxa mammalian biological data set is shown in Figure 1. This data set has been analyzed in prior studies, with trees computed using CA-ML, MP-EST, ASTRAL-2, and SVDquartets+PAUP* (Chou et al, 2015). Here we examine the differences and similarities between the trees produced by NCD and previously published trees. The phylogeny estimated by NCD-7Zip shows confidence values at each branch, representing the estimated REQs, a method for computing direct branch support for distance-based phylogeny estimation, as bootstrap-based methods cannot be used (Criscuolo, 2019).

FIG. 1.

The NCD-7Zip tree on the 37-taxon 424-gene mammalian data set from Song et al (2012). Labels on branches indicate rate of elementary quartets confidence value. NCD, normalized compression distance.

The majority of sister taxa branch confidence values are highly supported with an REQ of > 90. However, numerous clade placements in the phylogeny are not highly supported, with REQ ranging from 25, (Pteropus vampyrus, (Equus caballus, Canis familiaris)) to 90, (((((Pan troglodytes, Homo sapiens), Pongo pygmaeus), Macaca mulatta), Callithrix jacchus), Gorilla gorilla).

Among the previously studied methods, there are two topological differences between the output species trees. These concern the placement of Scandentia (represented by Tupaia belangeri), and the topology of the clade involving Cetartiodactyla, Chiroptera, and the clade ((Felis cactus, Canis familiaris), Equus caballus). The majority of previously studied methods place Scandentia sister to Glires, including CA-ML, whereas SVDquartets+PAUP* places Scandentia sister to Primates. In the case of NCD-7Zip, a clade of Primates and Scandentia forms a monophyletic group, whereas Scandentia also forms a polyphyletic group with a Glires clade. The placement of Scandentia is debated; as such, it is not clear which relationship is correct (Bayzid et al, 2015; Boussau et al, 2013; Janečka et al, 2007).

All three possible topologies for Cetartiodactyla, Chiroptera, and the clade ((Felis catus, Canis familiaris), Equus caballus) were obtained by the various methods. Whereas NCD-7Zip produced a distinct topology due to it splitting each of the relevant clades, resulting in separate monophyletic groups: (Felis catus, (Myotis lucifugus, (Sus scrofa, Vicugna pacos))) and ((Tursiops truncatus, Bos taurus), (Pteropus vampyrus, (Equus caballus, Canis familiaris))).

The NCD method introduces a third topological difference between the output species trees, regarding the grouping of the Glires clade. NCD-7Zip was the only method, of those discussed, that did not group all the species in the Glires clade together. Rather, the resulting tree splits Glires into two distinct clades forming a paraphyletic group. The Glires clades consist of (Muridae, Cavia Procellus) and ((Spermophilus tridecemlineatus, Dipodomys ordii), Lagomorpha).

4. DISCUSSION

For the methods used in past studies on the simulated data set, there were several trends noted. For example, species tree estimation error tends to decrease with decreases in the ILS level, increases in gene sequence lengths, and increases in number of genes. We examined how NCD compares with these results and found that species tree estimation error was fairly constant in comparison with the other methods, despite the variable data (ILS level, sequence length, gene count) among sets. In particular, species tree estimation error decreased a relatively small amount as a result of decreases in ILS level, whereas changes in the gene sequence length or number of genes did not have a significant impact.

The empirical results suggest that the uncorrected NCD technique might not be statistically consistent, although this would have to be proven. Rather, it seems to show constancy in its analysis, as the error rates for all NCD results vary significantly less, within and between data sets, than the other methods previously studied on these data.

All of the NCD methods used in this study had acceptable accuracy on these data. The 7Zip compressor yielded the most accurate results among the compressors. However, each of the NCD analyses, on average over all data sets, was not able to produce as accurate as results when compared with the previously studied methods, partially due to the lack of statistically consistent behavior. With that, there are numerous ways in which the NCD method can be expanded upon. Although, with its current implementation and the comparatively low error rates, it yielded on model conditions of high ILS, it suggests that NCD is inherently resistant to the incongruities of ILS.

5. CONCLUSION

The main motivation behind this study was to determine how well one can estimate trees directly from sequence data, without biological assumptions. The ability to fine tune an analysis using different MSA tools, models of evolution, taxon sampling, or the setting of priors, as is done for Bayesian inference, is beneficial in most cases. However, with too many options, the flexibility can become an obstacle to performing quick, accurate, and consistent analyses. In addition, with this flexibility to adjust analysis parameters, there is also the chance for errors in selection to be made, forcing assumptions that can negatively impact the accuracy of results (Grievink et al, 2010).

For example, in the case of unpartitioned concatenation-based ML analyses, a single model of evolution is assumed. Roch and Steel (2015) proved that a concatenated analysis of different aligned loci, using unpartitioned ML, can be statistically inconsistent and even positively misleading under the multispecies coalescent model (Warnow, 2015). In a similar vein, Roch et al (2019) show that under biologically realistic conditions, where the number of sites per locus is bounded, gene tree summary methods, such as ASTRAL and NJst, are not statistically consistent (Roch et al, 2019).

Tree estimation methods that are distance based and alignment free mitigate these issues, although they are not yet widely used in phylogenetics, as the standard analysis is done using aligned data and a model-based statistical tree estimation method.

The NCD technique maintains simplicity by minimizing its reliance on biological understanding while being modular or flexible in its uses. The implementation used in this study yielded results that do not demonstrate statistical consistency, that is, they do not converge to the true tree as input sequence length increases. The other methods used on the 11-taxon simulated data set have this advantage over this implementation of NCD, when working with large sequences. However, we question whether statistical consistency is a necessary property of a tree estimation method, particularly for NCD's use cases.

The choice for a model-based tree estimation method is really a trade-off between consistency and efficiency, an idea discussed in “Hobgoblin of Phylogenetics” (Hillis et al, 1994). In this article by Hillis et al (1994), they illustrate the distinction between consistency and efficiency, arguing that “knowing that a method will correctly reconstruct the phylogeny given an infinite amount of data ( = consistency) does not necessarily lead to preference for that method when only a finite amount of data are available.

If the method is inefficient, requiring large amounts of data before converging on the true tree, attaching too much importance to consistency may be misguided.” This concept, the amount of data a method requires to reconstruct the true tree with high probability, is referred as the “sample complexity” of an estimation method (Warnow, 2018). Although the study we have presented was limited in scope, some trends clearly emerge. As noted in the results, NCD performed comparatively best when model conditions were the highest level of ILS (M4) as well as showing average accuracies on the data set with minimal sites and genes, across all four models. With this, we believe that the NCD technique is a highly efficient methodology, which is resistant to ILS.

The results gathered from all NCD methods show that NCD proved to be, by comparison, the worst performer on low ILS model conditions. Despite this, due to NCD's efficiency in terms of running time and sample complexity, as well as its accuracy in the presence of ILS, the NCD technique as a tree estimation method has application in phylogenetics. A seemingly ideal use case that fits these parameters is viral or bacterial genomes. This includes the tradeoff between sampling more individuals per species and sampling more loci; exploring NCD's handling of different tree shapes, varying in breadth and depth; and examining NCD's ability to identify relationships in trees with minimal and maximal similarities.

When one considers the run-time of NCD and that this implementation of NCD makes no biological assumptions, it suggests that NCD can also serve as a baseline test that any algorithm should have to outperform to justify additional running time.

In this study, we evaluate NCD in its purest and simplest form, to establish a basis for more refined assessments of NCD. With that, as well as the results we produced on the mammalian biological data set, we can conclude that compression techniques, which NCD is founded upon, are able to successfully identify patterns in sequence data, producing trees that reflect relations that are backed by current accepted phylogenetic mammalian results.

There are numerous questions we intend to explore with regard to augmenting the NCD tree estimation implementation. The first of which is to apply models of evolution to NCD, incorporating biological understanding into the analysis and enabling the assessment of whether NCD can be statistically consistent under a model of evolution. As well as directly exploring the sample complexity of the NCD tree estimation method. The next avenue of research we intend to explore is the comparison between NCD implemented as a coalescent-based tree estimation method and the concatenation-based implementation.

We seek to examine this due to the discordance that can occur between gene trees as a result of recombination events (Chou et al, 2015), which the results from this study suggest NCD may be able to better accommodate. When one considers the efficacy of NCD on small amounts of sequence data with high ILS, NCD seems to be a suitable candidate for producing genes trees in coalescent-based analyses.

Lastly, we intend to thoroughly assess the taxon sampling conditions in which NCD tree estimation excels. To determine the situation or conditions that NCD can thrive within, we intend to explore numerous different sampling strategies. In addition, NCD's simplicity and run-time efficiency enables its use as a method for quickly and reliably constructing trees for any sized data set. This is similar to how the tree extimation method NJp is described in the study by Yoshida and Nei (2016) and in the study by Rusinko and McPartlon (2017).

Footnotes

ACKNOWLEDGMENTS

We thank three members of our research group. Nick Morris, Catherine Huber, and Thiru Ramaraj provided valuable comments on the research and the article as it was being written. In addition, we thank our reviewers for their valuable comments.

AUTHORs' CONTRIBUTIONS

D.W. contributed to writing—original draft (lead), review and editing (equal), conceptualization (supporting), and software (lead). J.D.R. was involved in conceptualization (lead), writing—review and editing (equal), and original draft (supporting).

SUPPLEMENTARY DATA

All relevant data and trees discussed in this article are available at (https://github.com/depaulbioinformatics/NCD-2021-data). A github repository containing all the source codes used in our experiments can be found at ().

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

Some of this study was supported by a 2020 grant (#602033) from DePaul University's Academic Growth and Innovation Fund (AGIF).

SUPPLEMENTARY MATERIAL

References

Adachi

, Hasegawa

Molphy, Programs for Molecular Phylogenetics, I: Protml, Maximum Likelihood Inference of Protein Phylogeny; 1992.

Arnason

, Adegoke

, Bodin

, et al. Mammalian mitogenomic relationships and the root of the eutherian tree. Proc Natl Acad Sci, 2002; 99(12):8151–8156, 2002; doi: 10.1073/pnas.102164299

Bayzid

, Mirarab

, Boussau

, et al. Weighted statistical binning: Enabling statistically consistent genome-scale phylogenetic analyses. PLoS One, 2015; 10(6):1–40; doi: 10.1371/journal.pone.0129183

Blackburne

, Whelan

Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol, 2013; 30(3):642–653.

Bloom

Solving the Problems of Context Modeling. Informally Published Report; 1998. Available from: www.cbloom. …

Bogusz

, Whelan

Phylogenetic tree estimation with and without alignment: New distance methods and benchmarking. Syst Biol, 2016; 66(2):218–231; doi: 10.1093/sysbio/syw074

Boussau

, Szollosi

, Duret

, et al. Genome-scale coestimation of species and gene trees. Genome Res, 2013; doi: 10.1101/gr.141978.112

Buneman

The Recovery of Trees from Measures of Dissimilarity. BT—Mathematics in the Historical and Archaeological Sciences; 1971.

Chan

, Ragan

. Next-generation phylogenomics. Biol Direct, 2013; 8:3.

10.

Chou

, Gupta

, Yaduvanshi

, et al. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics, 2015; 16 Suppl 10(Suppl 10):S2; doi: 10.1186/1471-2164-16-S10-S2

11.

Cilibrasi

, Vitányi

PMB

. Clustering by compression. IEEE Trans Inf Theory, 2005; 51:1523–1545.

12.

Cilibrasi

, Vitányi

PMB

. Fast whole-genome phylogeny of the COVID-19 virus SARS-CoV-2 by compression. bioRxiv, 2020;2020.07.22.216242; doi: 10.1101/2020.07.22.216242

13.

Criscuolo

A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Res Ideas Outcomes, 2019;5:e36178; doi: 10.3897/rio.5.e36178

14.

Criscuolo

Req. Available from: https://gitlab.pasteur.fr/GIPhy/REQ

15.

Daugelaite

, O’ Driscoll

, Sleator

. An overview of multiple sequence alignments and cloud computing in bioinformatics. ISRN Biomath, 2013; doi: 10.1155/2013/615630

16.

Edgar

Muscle: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004; 32(5):1792–97.

17.

Felsenstein

Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol, 1981; 17(6):368–376; doi: 10.1007/BF01734359

18.

Gascuel

BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol, 1997; 14(7):685–695.

19.

Goldman

Maximum likelihood inference of phylogenetic trees, with special reference to a poisson process model of DNA substitution and to parsimony analyses. Syst Biol, 1990; 39(4):345–361; doi: 10.2307/2992355

20.

Grievink

, Penny

, Hendy

, et al. Phylogenetic tree reconstruction accuracy and model fit when proportions of variable sites change across the tree. Syst Biol, 2010; doi: 10.1093/sysbio/syq003

21.

Gufienoche

, Garreta

Can we have confidence in a tree representation? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 2001; doi: 10.1007/3-540-45727-5_5.

22.

Guindon

, Gascuel

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol, 2003; 52(5):696–704.

23.

Hillis

, Huelsenbeck

, Swofford

. Hobgoblin of phylogenetics? Nature 1994; 369(6479):363–364; doi: 10.1038/369363a0.

24.

Izquierdo-Carrasco

, Gagneur

, Stamatakis

Trading Memory for Running Time in Phylogenetic Likelihood Computations. Heidelberg Institute for Theoretical Studies; 2011; doi: 10.5220/0003765600860095

25.

Janečka

, Miller

, Pringle

, et al. Molecular and genomic data identify the closest living relative of primates. Science, 2007; 18(5851):792–794; doi: 10.1126/science.1147555

26.

Jarvis

, Mirarab

, Aberer

, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science, 2014; 346(6215):1320–1331; doi: 10.1126/science.1253451

27.

Katoh

, Standley

. Mafft multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol, 2013; 30(4):772–780; doi: 10.1093/molbev/mst010

28.

Kuhner

, Felsenstein

A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol, 1994; 11(3):459–468.

29.

Larkin

, Blackshields

, Brown

, et al. Clustal W and Clustal X version 2.0. Bioinformatics, 2007; 23(21):2947–2948.

30.

Lefort

, Desper

RGO

. Fastme 2.0: A comprehensive, accurate and fast distance-based phylogeny inference program. Mol Biol Evol, 2015; 32(10):2798–2800.

31.

, Chen

, Li

, et al. The Similarity Metric. CoRR, cs.CC/0111054; 2001.

32.

, Vitányi

. An Introduction to Kolmogorov Complexity and Its Applications, 3rd Edition. Springer Publishing Company, Incorporated; 2008.

33.

Liu

, Linder

, Warnow

Multiple sequence alignment: A major challenge to large-scale phylogenetics. PLoS Curr, 2010; doi: 10.1371/currents.RRN1198

34.

Liu

, Linder

, Warnow

RAxML and FastTree: Comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS One, 2011; doi: 10.1371/journal.pone.0027731

35.

, Xin

, Zhang

A new quartet approach for reconstructing phylogenetic trees: Quartet joining method. J Comb Optimiz, 2008; 16(3):293–306; doi: 10.1007/s10878-008-9145-1

36.

Milanoi

, Ongus

, Gachara

, et al. Serotype and genetic diversity of human rhinovirus strains that circulated in Kenya in 2008. Influenza Other Respir Viruses, 2016; 10(3):185–191; doi: 10.1111/irv.12373

37.

Mirarab

, Bayzid

, Boussau

, et al. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science, 2014; doi: 10.1126/science.1250463

38.

Morgan

, Dehal

, Arkin

. Fasttree 2–approximately maximum-likelihood trees for large alignments. PLoS One, 2010; 5(3):e9490.

39.

National Center for Biotechnology Information. NCBI Home Page. Available from: https://www.ncbi.nlm.nih.gov/

40.

Nikaido

, Kondo

, Zhang

, et al. Comparative genomic analyses illuminate the distinct evolution of megabats within Chiroptera. DNA Res, 2020; 27(4):dsaa021; doi: 10.1093/dnares/dsaa021

41.

Pavlov

7-zip. Available from: https://7-zip.org

42.

Penn

, Privman

, Landan

, et al. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol, 2010; 27(8):1759–1767.

43.

Pinho

, Pratas

Mfcompress: A compression tool for FASTA and multi-FASTA data. Bioinformatics, 2013; 30(1):117–118.

44.

Rajarajeswari

, Apparao

Dnabit compress—Genome compression algorithm. Bioinformation, 2011; 5(8):350–360; doi: 10.6026/97320630005350

45.

Robinson

, Foulds

. Comparison of Weighted Labelled Trees. In: Combinatorial Mathematics VI. (Horadam AF, Wallis WD. Eds.), Berlin, Heidelberg: Springer Berlin Heidelberg; 1979; pp. 119–126.

46.

Robinson

, Foulds

. Comparison of phylogenetic trees. Math Biosci 1981;53(1):131–147; doi: 10.1016/0025-5564(81)90043-2

47.

Roch

, Nute

, Warnow

Long-branch attraction in species tree estimation: Inconsistency of partitioned likelihood and topology-based summary methods. Syst Biol, 2019; 68(2):281–297; doi: 10.1093/sysbio/syy061

48.

Roch

, Steel

Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol, 2015;100C:56–62; doi: 10.1016/j.tpb.2014.12.005

49.

Ronquist

, Huelsenbeck

. Mrbayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 2003; 19:1572–1574.

50.

Rusinko

, McPartlon

Species tree estimation using Neighbor Joining. J Theor Biol, 2017; 414:5–7; doi: 10.1016/j.jtbi.2016.11.005

51.

Saitou

, Nei

The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987; 4(4):406–425; doi: 10.1093/oxfordjournals.molbev.a040454

52.

Shu

, McCauley

Gisaid: Global initiative on sharing all influenza dataFrom vision to reality. Euro Surveill, 2017; 22(13):30494; doi: 10.2807/1560-7917

53.

Sievers

, Wilm

, Dineen

, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol Syst Biol, 2011; 7(1):539; doi: 10.1038/msb.2011.75

54.

Song

, Liu

, Edwards

, et al. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci U S A, 2012; 109(37):14942–14947; doi: 10.1073/pnas.1211733109

55.

Springer

, Gatesy

The gene tree delusion. Mol Phylogenet Evol, 2016;94(Pt A):1–33; doi: 10.1016/j.ympev.2015.07.018

56.

Stamatakis

RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 2006; 22(21):2688–2690; doi: 10.1093/bioinformatics/btl446

57.

Steel

MA.

The complexity of reconstructing trees from qualitative characters and subtrees. J Classif 1992; 9:91–116.

58.

Vinga

, Almeida

Alignment-free sequence comparison—A review. Bioinformatics, 2003; 19(4):513–523; doi: 10.1093/bioinformatics/btg005

59.

Vitányi

PMB

, Balbach

, Cilibrasi

, et al. Normalized Information Distance. CoRR abs/0809.2553; 2008. Available from: http://arxiv.org/abs/0809.2553

60.

Wang

, Jiang

On the complexity of multiple sequence alignment. J Comput Biol, 1994; 1(4):337–348; doi: 10.1089/cmb.1994.1.337

61.

Warnow

Concatenation analyses in the presence of incomplete lineage sorting. PLoS Curr, 2015; doi: 10.1371/currents.tol.8d41ac0f13d1abedf4c4a59f5d17b1f7

62.

Warnow

. Computational Challenges in Constructing the Tree of

Life

. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2017; 1; doi: 10.1109/IPDPS.2017.128

63.

Warnow

Computational Phylogenetics, 1st Edition. Cambridge University Press; 2018.

64.

Williams

, Zhaxybayeva

, Poptsova

, et al. Quartet decomposition server: A platform for analyzing phylogenetic trees. BMC Bioinformatics, 2012; 13:123; doi: 10.1186/1471-2105-13-123

65.

, Davis

, Christman

, et al. Evolutionary history and phylodynamics of influenza a and b neuraminidase (na) genes inferred from large-scale sequence analyses. PLoS One, 2012; 7(7):1–15; doi: 10.1371/journal.pone.0038665

66.

Yang

, Rannala

Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo Method. Mol Biol Evol, 1997; 14(7):717–724; doi: 10.1093/oxfordjournals.molbev.a025811

67.

Yoshida

, Nei

Efficiencies of the NJp, maximum likelihood, and Bayesian methods of phylogenetic construction for compositional and noncompositional genes. Mol Biol Evol, 2016; 33(6):1618–1624; doi: 10.1093/molbev/msw042

68.

Zeeh

The Lempel Ziv Algorithm; 2003.

69.

Zhang

, Wen

, Yau

SS-T

. Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics, 2019; 111(6):1298–1305; ; doi: 10.1016/j.ygeno.2018.08.010

70.

Zielezinski

, Vinga

, Almeida

, et al. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol, 2017; 18(1):186; doi: 10.1186/s13059-017-1319-7