QMix: An Efficient Program to Automatically Estimate Multi-Matrix Mixture Models for Amino Acid Substitution Process

Abstract

Single-matrix amino acid (AA) substitution models are widely used in phylogenetic analyses; however, they cannot properly model the heterogeneity of AA substitution rates among sites. Multi-matrix mixture models can handle this site rate heterogeneity and outperform single-matrix models. Estimating multi-matrix mixture models is a complex process, and no computer program is available for this task. In this study, we implemented a computer program, QMix, based on the algorithm used for LG4X and LG4M, with several enhancements to automatically estimate multi-matrix mixture models from large datasets. QMix employs the QMaker algorithm instead of the XRATE algorithm to accurately and rapidly estimate model parameters. It can estimate mixture models with different numbers of matrices and supports multi-threaded computing to efficiently estimate models from thousands of genes. We re-estimated the mixture models LG4X and LG4M from 1471 HSSP alignments. The re-estimated models (HP4X and HP4M) are slightly better than LG4X and LG4M in building maximum likelihood trees from HSSP and TreeBASE datasets. QMix required about 10 hours on a computer with 18 cores to estimate a mixture model with four matrices from 200 HSSP alignments. It is easy to use and freely available for researchers.

1. INTRODUCTION

Amino acid (AA) substitution models are used to analyze protein sequences. The AA substitution process is normally described by a Markov process that is time-reversible, time-continuous, time-homogeneous, and stationary (Durbin et al., 2006). Popular AA substitution models consist of one replacement matrix representing the substitution rates among amino acids for all sites (Minh et al., 2021). However, the evolution of amino acids varies among sites and depends on many factors such as the genetic code, solvent accessibility, and protein function (Le et al., 2012). The single-matrix model simply uses a discrete gamma distribution to model the heterogeneity of evolutionary rates (Yang, 1993); however, it cannot model the variability of substitution patterns among sites. For example, buried sites and exposed sites correspond to slow and fast substitution patterns and require very different AA replacement matrices. Using several replacement matrices, one for each rate category, is a better approach (Le et al., 2012).

Two 4-matrix mixture models, LG4X and LG4M, have been estimated from 1471 HSSP alignments (Le et al., 2012). The LG4M model has four replacement matrices, each corresponding to one rate category of the discrete gamma distribution (Yang, 1993). The LG4X model follows a distribution-free scheme for the site rates (Yang, 1995). The multi-matrix models outperform the single-matrix models in inferring maximum likelihood trees while requiring the same memory space and similar running times (Le et al., 2012). A multi-matrix model has more parameters than a single-matrix model; therefore, estimating a multi-matrix model is more complicated, is more time-consuming, and requires larger datasets.

The workflow to estimate LG4X and LG4M is very complicated for biologists (and even for bioinformaticians) and has not yet been implemented as a computer program. To ease this burden on researchers, we implemented the workflow in the QMix program with several improvements to automatically estimate multi-matrix models from large datasets.

2. METHODS

The training dataset used to estimate an amino acid substitution model includes N alignments denoted by $D = \{D^{1}, \dots, D^{N}\}$. Let $T = \{T^{1}, \dots, T^{N}\}$ be the set of trees and $R = \{R^{1}, \dots, R^{N}\}$ be the set of site-rate models for the N alignments. A 4-matrix mixture model $Q = \{Q_{1}, Q_{2}, Q_{3}, Q_{4}\}$ consists of four matrices corresponding to four site rate categories (i.e., Q₁ for very slow rate, Q₂ for slow rate, Q₃ for medium rate, and Q₄ for fast rate). We determine $Q^{*} = \{Q_{1}^{*}, Q_{2}^{*}, Q_{3}^{*}, Q_{4}^{*}\}$, T, and R to maximize the likelihood value $L(T, Q, R | D)$. Technically, $Q^{*} = \{Q_{1}^{*}, Q_{2}^{*}, Q_{3}^{*}, Q_{4}^{*}\} = \operatorname{argmax}_{Q = \{Q_{1}, Q_{2}, Q_{3}, Q_{4}\}} \prod_{i = 1}^{N} L(T^{i}, Q, R^{i} | D^{i})$.

Optimizing Q, T, and R simultaneously to determine $Q^{*}$ is computationally feasible only for small datasets. Previous studies show that $Q^{*}$ can be obtained using nearly optimal trees T and site-rate models R (Le et al., 2012; Le and Gascuel, 2008).

The QMix program consists of four main steps that iteratively optimize Q, T, and R, as follows:

Initial step: Initialize $Q = \{Q_{1} = M, Q_{2} = M, Q_{3} = M, Q_{4} = M\}$, where M is a replacement matrix such as LG; Q is considered the current best model.

Categorizing step: Estimate trees T and site-rate models R based on the current best model Q, i.e., for each alignment $D^{a} \in D$, estimate $T^{a}$ and $R^{a}$ with Q fixed using the IQ-TREE algorithm (Minh et al., 2020). Create four sub-alignment groups D¹, D², D³, and D⁴, each corresponding to one site rate category. To this end, each alignment $D^{a} \in D$ is separated into four sub-alignments corresponding to the four site rate categories. Technically, site i of $D^{a}$ is assigned to the sub-alignment with maximum posterior probability $c_{i} = \operatorname{argmax}_{k = 1, \dots, 4} \left( w_{k} L(T^{a}, \rho_{k} Q_{k} | D_{i}^{a}) \right)$, where $w_{k}$ and $\rho_{k}$ are the weight and rate of matrix $Q_{k}$ with constraints $\sum_{k = 1}^{4} w_{k} = 1$ and $\sum_{k = 1}^{4} w_{k} \rho_{k} = 1$. Model Q follows a discrete gamma distribution if the rates $\rho_{k}$ are estimated from a discrete gamma distribution with equal weights for all categories ($w_{1} = w_{2} = w_{3} = w_{4} = 1/4$). Otherwise, it follows a rate distribution-free scheme.
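The assignment rule of the categorizing step can be illustrated with a minimal Python sketch. It is not the QMix source code: the function names, the toy per-site likelihoods, and the example rates are assumptions made for illustration, and the per-site likelihoods $L(T^{a}, \rho_{k} Q_{k} | D_{i}^{a})$ are taken as given (in practice they would come from the tree inference software and would usually be handled in log space).

```python
from typing import List, Sequence


def assign_sites(site_likelihoods: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[int]:
    """Return, for each site, the category k that maximizes w_k * L_k(site).

    site_likelihoods[i][k] is assumed to hold the likelihood of site i
    under matrix Q_k scaled by rate rho_k (hypothetical input format).
    """
    categories = []
    for per_category in site_likelihoods:
        scores = [w * lik for w, lik in zip(weights, per_category)]
        categories.append(max(range(len(scores)), key=lambda k: scores[k]))
    return categories


def split_alignment(columns: Sequence[str], categories: Sequence[int],
                    ncat: int) -> List[List[str]]:
    """Group alignment columns into one sub-alignment per rate category."""
    groups: List[List[str]] = [[] for _ in range(ncat)]
    for column, category in zip(columns, categories):
        groups[category].append(column)
    return groups


def satisfies_constraints(weights: Sequence[float], rates: Sequence[float],
                          tol: float = 1e-9) -> bool:
    """Check sum(w_k) = 1 and sum(w_k * rho_k) = 1 up to a tolerance."""
    mean_rate = sum(w * r for w, r in zip(weights, rates))
    return abs(sum(weights) - 1.0) <= tol and abs(mean_rate - 1.0) <= tol


# Toy data: equal weights as in the discrete gamma case, made-up rates.
weights = [0.25, 0.25, 0.25, 0.25]
rates = [0.1, 0.4, 1.0, 2.5]
site_likelihoods = [
    [4e-3, 3e-3, 2e-3, 1e-3],  # best explained by the "very slow" category
    [1e-4, 1e-4, 2e-4, 9e-4],  # best explained by the "fast" category
    [1e-3, 5e-3, 2e-3, 1e-3],  # best explained by the "slow" category
]
columns = ["AAAA", "LIVM", "KKRK"]

categories = assign_sites(site_likelihoods, weights)
groups = split_alignment(columns, categories, ncat=4)
print(categories, groups, satisfies_constraints(weights, rates))
```

With equal weights the rule reduces to choosing the category with the highest site likelihood; under the distribution-free scheme the estimated weights also influence the assignment.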

Estimating parameters: Estimate $Q^{*} = \{Q_{1}^{*}, Q_{2}^{*}, Q_{3}^{*}, Q_{4}^{*}\}$ with four new matrices, each from one sub-alignment group (i.e., $Q_{i}^{*}$, $i = 1, \dots, 4$, is estimated from $D^{i}$), based on trees T and site-rate models R using the QMaker algorithm (Minh et al., 2021).

Stopping step: Compare new matrices of $Q^{*}$ and corresponding matrices of the current best model Q. If they are highly correlated (Pearson correlation ≥ 0.9999), stop and consider $Q^{*}$ as the final best model. Otherwise, assign Q = $Q^{*}$ and go to the Categorizing step to continue optimizing Q.
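The stopping rule and the surrounding iteration can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the QMix implementation: the correlation is computed here over the upper-triangle entries of each matrix, the `step` callable stands in for the categorizing and estimating steps together, and the `max_iter` safeguard is hypothetical.

```python
import math
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def upper_triangle(matrix: Matrix) -> List[float]:
    """Flatten the entries above the diagonal of a square matrix."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


def has_converged(old_model: Sequence[Matrix], new_model: Sequence[Matrix],
                  threshold: float = 0.9999) -> bool:
    """True if every new matrix correlates with its predecessor >= threshold."""
    return all(
        pearson(upper_triangle(old), upper_triangle(new)) >= threshold
        for old, new in zip(old_model, new_model)
    )


def iterate(model: Sequence[Matrix],
            step: Callable[[Sequence[Matrix]], Sequence[Matrix]],
            threshold: float = 0.9999,
            max_iter: int = 100) -> Sequence[Matrix]:
    """Repeat categorize-and-estimate (`step`) until the model stabilizes."""
    for _ in range(max_iter):
        new_model = step(model)
        if has_converged(model, new_model, threshold):
            return new_model
        model = new_model
    return model


# Toy 3x3 symmetric matrices standing in for 20x20 replacement matrices.
base = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
scaled = [[2.0 * value for value in row] for row in base]
shuffled = [[0.0, 3.0, 2.0], [3.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

print(has_converged([base], [scaled]))    # proportional entries
print(has_converged([base], [shuffled]))  # reversed ordering of entries
```

Because the Pearson correlation is invariant to scaling, two matrices that differ only by a constant factor count as identical under this criterion.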

We evaluated the performance of both single-matrix and multi-matrix models in inferring maximum likelihood trees. As models might have different numbers of free parameters (e.g., multi-matrix models have more parameters than single-matrix models), we used the AIC criterion (Akaike, 1974) to assess the performance of the models, i.e., $AIC(M, D^{a}) = -2 \times LL(M, T^{a} | D^{a}) + 2 \times \#FP(M)$, where $LL(M, T^{a} | D^{a})$ is the log-likelihood of alignment $D^{a}$ given model M and inferred tree $T^{a}$, and $\#FP(M)$ is the number of free parameters of model M. If $AIC(M_{1}, D^{a}) < AIC(M_{2}, D^{a})$, M₁ is considered better than M₂ for alignment $D^{a}$.
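As a worked illustration of the AIC comparison, the short Python sketch below evaluates the formula for two models on one alignment. The log-likelihoods and parameter counts are made-up numbers chosen only to show the arithmetic; they are not results from this study.

```python
def aic(log_likelihood: float, n_free_params: int) -> float:
    """AIC = -2 * LL + 2 * #FP; lower values indicate a better model."""
    return -2.0 * log_likelihood + 2.0 * n_free_params


def is_better(ll_1: float, fp_1: int, ll_2: float, fp_2: int) -> bool:
    """True if model 1 has a strictly lower AIC than model 2."""
    return aic(ll_1, fp_1) < aic(ll_2, fp_2)


# Hypothetical values: model 1 fits better but uses more free parameters.
aic_1 = aic(-10500.0, 10)  # -2 * (-10500) + 2 * 10 = 21020
aic_2 = aic(-10520.0, 2)   # -2 * (-10520) + 2 * 2  = 21044
print(aic_1, aic_2, is_better(-10500.0, 10, -10520.0, 2))
```

In this toy case the gain of 20 log-likelihood units outweighs the penalty for eight additional parameters, so model 1 is preferred.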

3. RESULTS

3.1. QMix validation

We employed the HSSP dataset that was used to estimate the four-matrix mixture models LG4M and LG4X (Le et al., 2012) to validate the QMix program. The dataset comprises 1771 alignments containing about 27 million amino acids (on average, each alignment consists of about 56 sequences and 254 sites). The dataset was divided into two parts: a training part of 1471 alignments and a testing part of the remaining 300 alignments. The training and testing parts in this study were the same as those used in the article on the LG4M and LG4X models (Le et al., 2012). We employed QMix to estimate the four-matrix mixture models HP4M (rate categories follow a discrete gamma distribution) and HP4X (rate categories follow a rate distribution-free scheme) from the 1471 training alignments. To evaluate the stability of QMix, we also estimated two 4-matrix models, HP4X.200 and HP4M.200, from 200 HSSP alignments.

Table 1 shows that the matrices for the “medium” and “fast” rates of the models are highly correlated, whereas the matrices for the “very slow” and “slow” rates are less correlated. The “very slow” and “slow” matrices of LG4X have notably low correlations with those of the other models. This deviation is perhaps due to the sensitivity of the XRATE algorithm to starting parameter values, as noted by the authors (Le et al., 2012). We observe high correlations between the matrices of HP4X and HP4X.200 and reasonable correlations between the matrices of HP4M and HP4M.200. The results indicate the stability of the QMix program.

Table 1.
The Pearson Correlations Between Matrices of LG4X, LG4M, HP4X, HP4M, HP4X.200, HP4M.200 Mixture Models

	Very slow	Slow	Medium	Fast
LG4X vs. HP4X	0.237	0.442	0.931	0.989
LG4X vs. LG4M	0.055	0.653	0.898	0.991
HP4X vs. HP4X.200	0.944	0.967	0.955	0.962
LG4M vs. HP4M	0.773	0.843	0.982	0.995
HP4M vs. HP4M.200	0.752	0.881	0.944	0.964
HP4X vs. HP4M	0.825	0.943	0.961	0.996

Figure 1 shows the relative difference between exchangeability coefficients in the matrices of HP4X and LG4X. The “fast” matrices of HP4X and LG4X are highly similar, with only two coefficients differing at least fivefold, while the “very slow” matrices differ more (36 coefficients differ at least fivefold). We observed similar results when comparing the matrices of HP4M and LG4M.

FIG. 1.

The relative difference between exchangeability coefficients in matrices of the HP4X and LG4X models. Notation: 2x (5x) indicates that an exchangeability coefficient differs at least twofold (fivefold) between the two models.

We compared the performance of the four-matrix models (i.e., LG4M, LG4X, HP4M, HP4X) and the general single-matrix models LG (Le and Gascuel, 2008) and Q.pfam (Minh et al., 2021) in inferring maximum likelihood trees on 300 HSSP testing alignments and 84 TreeBASE benchmark alignments (Sanderson et al., 1994) using the AIC criterion (see Table 2). The re-estimated models HP4M and HP4X had slightly better AIC values than their corresponding models LG4M and LG4X. Specifically, HP4M was better than LG4M on 160 (53.3%) HSSP alignments and 63 (75.0%) TreeBASE alignments. Similarly, HP4X was better than LG4X on 152 (50.7%) HSSP alignments and 48 (57.1%) TreeBASE alignments. The multi-matrix models HP4M and HP4X were much better than the single-matrix models LG and Q.pfam. For example, HP4X was better than LG on 299/300 HSSP testing alignments. The results confirm that the multi-matrix models outperform the single-matrix models in building maximum likelihood trees.

Table 2.

Model Comparison on 300 HSSP and 84 TreeBASE Testing Alignments. #M₁ > M₂: The Number of Alignments Where M₁ Has a Better AIC Value than M₂

M₁	M₂	#M₁ > M₂ on 300 HSSP alignments	#M₁ > M₂ on 84 TreeBASE alignments
HP4X	LG4X	152	48
HP4M	LG4M	160	63
HP4X	LG	299	84
HP4M	LG	299	84
HP4X	Q.pfam	291	80
HP4M	Q.pfam	275	65
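The counts in Table 2 can be converted to percentages with a few lines of Python. The sketch below only re-derives proportions from the table entries; the variable and function names are illustrative.

```python
# Rows of Table 2: (M1, M2, wins on 300 HSSP, wins on 84 TreeBASE).
TABLE_2 = [
    ("HP4X", "LG4X", 152, 48),
    ("HP4M", "LG4M", 160, 63),
    ("HP4X", "LG", 299, 84),
    ("HP4M", "LG", 299, 84),
    ("HP4X", "Q.pfam", 291, 80),
    ("HP4M", "Q.pfam", 275, 65),
]
N_HSSP, N_TREEBASE = 300, 84


def percent(wins: int, total: int) -> float:
    """Share of alignments (in percent, one decimal) where M1 beats M2."""
    return round(100.0 * wins / total, 1)


for m1, m2, hssp_wins, treebase_wins in TABLE_2:
    print(f"{m1} vs. {m2}: "
          f"{percent(hssp_wins, N_HSSP)}% HSSP, "
          f"{percent(treebase_wins, N_TREEBASE)}% TreeBASE")
```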

3.2. Usage and commands

The QMix program and the re-estimated models HP4X and HP4M are freely available at https://github.com/tinhnh2/Qmix. QMix can estimate a k-matrix mixture model from a set of protein alignments on a Linux computer. All training protein alignments must be placed in one folder. The estimation process can be accomplished with one command:

python Qmix.py -rate_model M -ncat 4 -init_model LG -corr_threshold 0.99 -nthread 18 -data hssp1471

to execute the QMix program with the following parameters:

-rate_model M: The type of the mixture model, i.e., M for the discrete gamma distribution and X for the distribution-free scheme
-ncat 4: The number of rate categories, e.g., 4
-init_model LG: The initial replacement matrix, e.g., LG
-corr_threshold 0.99: The Pearson correlation used to stop the estimation process, e.g., 0.99
-nthread 18: The number of computing threads, e.g., 18 threads
-data hssp1471: The full path to a folder of training alignments, e.g., hssp1471
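For scripted runs, the command line above can be assembled programmatically. The following Python sketch only builds the argument list from the options listed above and prints it; the helper name `qmix_command` and its defaults are illustrative assumptions, and nothing here is part of QMix itself.

```python
import shlex
from typing import List


def qmix_command(rate_model: str = "M", ncat: int = 4,
                 init_model: str = "LG", corr_threshold: float = 0.99,
                 nthread: int = 18, data: str = "hssp1471",
                 script: str = "Qmix.py") -> List[str]:
    """Build the QMix argument list from the documented options."""
    if rate_model not in ("M", "X"):
        raise ValueError("rate_model must be 'M' (gamma) or 'X' (free rates)")
    if ncat < 1 or nthread < 1:
        raise ValueError("ncat and nthread must be positive")
    return [
        "python", script,
        "-rate_model", rate_model,
        "-ncat", str(ncat),
        "-init_model", init_model,
        "-corr_threshold", str(corr_threshold),
        "-nthread", str(nthread),
        "-data", data,
    ]


command = qmix_command()
# shlex.join quotes arguments that contain spaces, e.g., folder paths.
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` from the folder that contains `Qmix.py`.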

The resulting mixture model contains k matrices named from Q.1 (the slowest rate) to Q.k (the fastest rate), which are created in the running folder. The resulting k-matrix models can be used to construct maximum likelihood trees with the IQ-TREE2 software (i.e., iqtree2 -s alignment.phy -m "MIX{Q.1,Q.2,Q.3,Q.4}"). We note that QMix is written in Python; however, it can be converted to C++ to be incorporated into other more complex software such as IQ-TREE.
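For a model with an arbitrary number of matrices, the IQ-TREE2 model string can be generated rather than typed by hand. The Python sketch below assumes the output files follow the Q.1 to Q.k naming described above; the helper names are illustrative.

```python
from typing import List


def mix_model_string(k: int, prefix: str = "Q.") -> str:
    """Return the IQ-TREE mixture specification, e.g., MIX{Q.1,Q.2} for k=2."""
    if k < 1:
        raise ValueError("k must be at least 1")
    names = ",".join(f"{prefix}{i}" for i in range(1, k + 1))
    return "MIX{" + names + "}"


def iqtree_command(alignment: str, k: int) -> List[str]:
    """Argument list for running iqtree2 with a k-matrix mixture model."""
    return ["iqtree2", "-s", alignment, "-m", mix_model_string(k)]


print(mix_model_string(4))
print(iqtree_command("alignment.phy", 4))
```

Passing the arguments as a list avoids the need to quote the braces for the shell.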
4. DISCUSSION AND CONCLUSION

We presented the QMix program for estimating k-matrix mixture models using the maximum likelihood approach. The rate heterogeneity among sites can be modeled using either the discrete gamma rate distribution or the distribution-free scheme. Experiments on both the HSSP and TreeBASE alignments showed that the QMix program was able to estimate reliable and robust 4-matrix mixture models.

The QMix program takes a set of alignments and automatically estimates a k-matrix mixture model from the training alignments in an acceptable time, e.g., about ten hours to estimate four-matrix mixture models from 200 HSSP alignments. It is easy to use and applicable for researchers to estimate k-matrix mixture models from their datasets.

In summary, QMix improves the workflow used to estimate LG4M and LG4X by employing IQ-TREE2 and QMaker instead of PhyML and XRATE to accurately and rapidly estimate model parameters. It also supports multi-threaded computing to efficiently estimate models from thousands of alignments. QMix can estimate mixture models with different numbers of matrices.

Footnotes

AUTHORS’ CONTRIBUTIONS

N.H.T.: Methodology (equal); Software (lead); Formal analysis (lead); Writing—original draft (lead); Visualization (equal). C.C.D.: Conceptualization (equal); Methodology (equal); Writing—review and editing (equal); Visualization (equal). L.S.V.: Conceptualization (equal); Methodology (equal); Writing—review and editing (lead); Supervision (lead).

DATA AVAILABILITY

QMix program and the datasets used in this article are available at .

AUTHOR DISCLOSURE STATEMENT

The authors declare that they have no conflicting financial interests.

FUNDING INFORMATION

No funding was received for this article.

References

Akaike. A new look at the statistical model identification. In: Selected Papers of Hirotugu Akaike. Springer Series in Statistics. (Parzen, Tanabe, Kitagawa. eds.) Springer: New York, NY; 1974; pp. 215–222; doi: 10.1007/978-1-4612-1694-0_16

Durbin, Eddy, Krogh, et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids. In: Computational Biology. Codon Publications; 2006; pp. 1–371.

Le, Dang, Gascuel. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol Biol Evol, 2012; 29(10):2921–2936; doi: 10.1093/molbev/mss112

Le, Gascuel. An improved general amino acid replacement matrix. Mol Biol Evol, 2008; 25(7):1307–1320; doi: 10.1093/molbev/msn067

Minh, Dang, Vinh, et al. QMaker: Fast and accurate method to estimate empirical models of protein evolution. Syst Biol, 2021; 70(5):1046–1060; doi: 10.1093/sysbio/syab010

Minh, Schmidt, Chernomor, et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol, 2020; 37(5):1530–1534; doi: 10.1093/molbev/msaa015

Sanderson, Donoghue, Piel, et al. TreeBASE: A prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am J Bot, 1994; 81:183.

Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol, 1993; 10(6):1396–1401; doi: 10.1093/oxfordjournals.molbev.a040082

Yang. A space-time process model for the evolution of DNA sequences. Genetics, 1995; 139(2):993–1005; doi: 10.1093/genetics/139.2.993