Abstract
Protease is one of three enzymes encoded within HIV's pol gene, responsible for the cleavage of viral Gag-Pol polypeptide into mature viral proteins and a target of current anti-retroviral therapy. Protease diversity analysis in Latin America has been lacking in spite of extensive studies of protease-inhibitor resistance mutations. We studied the diversity of 777 Mexican protease sequences and found that all were subtype B except one (CRF02_AG). Phylogenetic analysis suggested the existence of six different clades with geospecific contributions. Thirty-three percent of sites were conserved, 25% had conservative substitutions, and 41% exhibited physicochemical changes. The most conserved regions surrounded the active site, most of the flap domain, and a region between the 60's loop and C-terminal triad. A single sequence exhibited an active site mutation (T26S). Variable sites were mapped to a crystallographic structure, providing further insight into the distribution and functional relevance of variable sites among Mexican isolates.
During 2017, the World Health Organization (WHO) reported 36.9 million HIV-infected individuals around the globe, 3.4 million in the United States, and 230,000 ± 25,000 in Mexico. During the same year, 1.8 million new HIV infections were reported worldwide, 160,000 in the United States, and 15,000 ± 1500 in Mexico as well as 940,000, 56,000 and 4000 ± 1000 HIV-related deaths, respectively. 1 The introduction of antiretroviral therapy (ART) has been crucial at decreasing HIV morbidity and mortality to the point where life expectancy of recently infected individuals is close to that of the general population. 2 First-line adult ART currently relies on reverse transcriptase inhibitors. However, the preferred second-line regimen consists of a nucleoside reverse-transcriptase inhibitors plus a ritonavir-boosted protease inhibitor (PI). 3
HIV Protease is encoded within the pol region and belongs to the aspartic acid protease family. 4 Protease is responsible for the proteolytic cleavage of the Gag polyprotein into matrix, capsid, nucleocapsid, and p6 proteins as well as for Pol polyprotein processing into reverse transcriptase (p51 and p66), protease, and integrase. Env polyprotein processing into gp120 and gp41 relies on cellular protease activity, not on that of the viral protease. The protease homodimer consists of two 99 amino acid, 22 kDa monomers. 5 As in all aspartic acid proteases, its catalytic residues (D25, T26, G27) are highly conserved. 6,7 Protease sequence diversity is mostly a consequence of ART-driven selection for mutations that lower affinity for a drug, enabling viral replication in spite of continued ART use (ARVm). As such, genetic variation in viral enzyme-encoding regions poses a major challenge for HIV prevention and treatment. 8
Here, we present an analysis of the sequence diversity present in integrated HIV-1 sequences from 777 Mexican individuals, including 35 generated by our lab from patients residing in the state of San Luis Potosí. 9 In addition to our local sequences, 742 Mexican nucleotide sequences were retrieved from the “Geography Search Interface” tool provided by the Los Alamos National Laboratory database. 10 Samples bearing premature stop codons, indels, and evidence of APOBEC3 hypermutation were excluded as well as those having 7% or more of nucleotide ambiguity. Identification of ART-resistance mutations, frameshift mutations, indels, and stop codons used the HIVdb program for genotypic resistance interpretation. Alignments were prepared by using Clustal Omega v1.2.1 and reformatted for unanimity to the HXB2 reference by using a locally developed freely accessible tool (Sequence Unanimity Reformatting tool).
Translation of protein sequences relied on EMBOSS Transeq. HIV-1 consensus sequences for group M subtypes (A1, A2, B, C, D, F1, F2, H and G) along with outlier group O were obtained from Los Alamos National Laboratory HIV database. 11 Phylogenetic trees were produced from nucleotide sequences through a Markov chain Monte Carlo algorithm-based Bayesian analysis suite (BEAUti and BEAST) with a general time-reversible substitution model, estimated base frequencies, gamma plus invariant sites with four categories, and three codon partitions and using a strict clock with a coalescent (constant size) tree earlier. 12
The 50 million sample trees produced were simplified by using a 10% burn-in to generate a single maximum clade credibility tree using TreeAnnotator v1.10.4 (A. Rambaut and A. J. Drummond, Institute of Evolutionary Biology, University of Edinburgh) and reformatted to a radial cladogram by using FigTree v1.4.4 (A. Rambaut and A. J. Drummond) to highlight clade diversity and composition. Consensus amino acid frequency and Shannon entropy were calculated online by employing amino acid equivalents representing increases in entropy with regards to HXB2 reference. 13 Entropy levels were arbitrarily classified as high (>0.6), mid (between 0.6 and 0.2), and low (below 0.2).
Nucleotide sequences produced by our laboratory have been deposited in GenBank with accession nos. KT869026, KT869027, KT869029, KT869030, KT869034–KT869036, KT869041–KT869043, KT869045–KT869048, KT869051–KT869056, KT869059–KT869062, and KT869066–KT869076.
In all, 777 nucleotide (297 bp) and amino acid sequences (99 sites) were analyzed. Nucleotide sequence phylogenetic tree topology has 776 of the Mexican sequences forming six different clades within subtype B (clades I–VI in Fig. 1). A single isolate (MX460) representing circulating recombinant form CRF02_AG exhibited greater homology to subtype A and G consensus sequences. This isolate was obtained from a former United States resident Mexican mestizo. It should be noted, however, that the relatively short 297 bases encoding protease are barely sufficient to distinguish HIV-1 subtype D sequences from those of subtype B, a fact that would hinder proper identification of epidemiologically relevant sublineages. This warrants further study into the characterization of full Pol region sequences of Mexican origin.

Radial phylogram highlighting clades of Mexican HIV-1 protease-encoding nucleotide sequences. Phylogenetic trees reconstructed form nucleotide sequences using a Markov Chain Monte Carlo algorithm-based Bayesian analysis were reformatted by using FigTree to illustrate six clades (I–VI). The root of the tree harbors consensus O and other HIV-1 group M subtype consensus sequences (A1, A2, C, F1, F2, G, and H). Sample MX460 is identified as CRF02_AG recombinant, forming a subclade with A and G consensus sequences.
As seen from the map provided in Figure 2 and summarized in Table 1, the most extensively sampled states (Mexico City and the State of Mexico) were represented in all clades. Other states with important contribution to each of the clades include Morelos, Nuevo Leon, and Oaxaca to clade I; Jalisco, Oaxaca, and Veracruz to clade II; Jalisco, Oaxaca, and Puebla to clade III; Jalisco, Nuevo Leon, and Veracruz to clade IV; Jalisco and Morelos to clade V; and Jalisco and San Luis Potosí to clade VI.

Map depicting Mexican states for which HIV protease sequences were available (gray) at the time of the study; numbers in parentheses indicate number of sequences available.
Percent Contribution of Each Mexican State to the Total Number of Sequences Composing Each Clade (I–VI)
The distribution of amino acid variations among the 777 Mexican sequences is summarized in Figure 3. Most isolates (97%) bore V3I, which is typical of subtype B sequences. Six primary antiretroviral resistance mutations (ARVms) were observed (D30N, L33F, M46L, V82A, N88D, and L90M) as well as 12 secondary ARVms (L10I, L10V, V11I, V11L, K20R, K20M, K20I, K43T, G48R, Q58E, A71T, and A71V). Surveillance drug resistance mutations were identified in 12 sequences (MX071, KC168417, KC168143, KC169717, KC169595, KC168811, KC168469, KC169700, KC169346, KC169134, KC168724, and KC168376). Approximately 33% of the amino acid positions (33/99) were conserved, 25% sites (25/99) exhibited conservative substitutions, and physicochemical changes were present in 41% (41/99).

Mexican protease amino acid sequence alignment summary. Protein domains and functional regions are shaded to highlight functional domains and regions. These include N- and C-terminal dimerization regions of the terminal domains; the 10's loop, catalytic residues, 60's loop, and C-terminal triad of the core domains as well as the elbow and tip-of-the-flap regions of the Flap domain. Substitutions shown with a single or double diamond correspond to primary and secondary antiretroviral drug resistance mutations, respectively.
On average, individual amino acid sequences differed from HXB2 reference at 1–17 sites (7.4 ± 2.4 standard deviation). Protease regions that were found to be extremely conserved (i.e., less than eight percent protein sequences showing variations at a specific site, that is less than one percent of compiled sequences) included 5 LWQRP 9 in the N-terminal domain, positions E21–V32 surrounding the catalytic residues, residues M46–V56 surrounding the tip-of-the-flap region, 73GTVL76 and78GPTP81 in the central C-terminal core domain, N83–N88 surrounding the C-terminal triad, and 94G—F99 spanning the C- terminal dimerization region. An active site mutation (T26S) was present in a single sequence (KC168702). The tip of the flap and C-terminal triad regions were conserved, highlighting their known functional role.
The C-terminal triad located in the core domain is highly conserved in all known retroviral proteases and involved in proper folding and dimerization. 14 Only two Mexican sequences (KC168417 and KC169346) had mutations in this region (N88D) represented by a drastic physicochemical change (polar to acidic residue). Interestingly, this substitution is classified as a primary ARVm according to Stanford's database criteria. The tip of the flap region is glycine rich and very sensitive to amino acid changes with a profound impact on protease activity as described by other authors. 15 A single sequence (DQ631417) had a mutation affecting the properties of this region (G49E).
The most variable protease regions include 10 L—K20 spanning the 10's loop of the core domain, 33L—K43 of the elbow region, and 60D—I72 region surrounding the 60's loop. The amino terminal half of the elbow region is characterized by a stretch of polar residues followed by basic residues in the C-terminal half. A change from polar to acidic residues was the most common change observed in this region (S37D in 105 isolates, S37A in 7 isolates, and S37G in 25 isolates). Non-conservative substitutions within the 60's loop were observed in 48 sequences (C67 in 6, G48 in 6, and H69 in 36). Positions +12 and +63 located in the N-terminal halves of the core domains and position +37 in the flap domain had the greatest number of different amino acid changes seen (12, 15, and 14 different amino acids changes each, respectively).
Analysis of Shannon entropy further illustrates protein variation (Fig. 4). High levels of entropy were observed in residues surrounding the 10's loop, the elbow region, the 60's loop, and in the first half of the C-terminal domain. Five amino acid positions exhibited high levels of entropy (>0.6), position +12 flanking the 10's loop (hosting 12 different substitutions), position +37 located within the elbow region of the flap domain (hosting 14 substitutions), position +63 and +71 flanking the 60's loop (hosting 15 and 3 substitutions, respectively), as well as position +93 within the C-terminal domain (hosting a single substitution). Mid-levels of entropy were seen in 10 different sites.

Mexican protease protein sequence Shannon entropy map. The frequency of the most common amino acids observed at each site in the Mexican protease sequences is shown in the continuous line, whereas the Shannon entropy level for each site is shown in bar graphs. Boxes indicate protein regions, whereas domains are indicated in upper brackets.
One located in +10 at the interface of the N-terminal domain with the core domain (hosting 4 substitutions), positions +16 and +19 in the 10's loop (hosting 2 and 6 substitutions, respectively), positions +36 and +39 near the elbow region (4 substitutions each), as well as positions +64, +67, +69, +70, and +72 in or near the 60's loop (hosting 3, 5, 7, 9, and 7 substitutions, respectively).
The protein entropy levels were mapped into an X-ray diffraction crystal of HIV-1 protease dimer coupled to the antiretroviral drug saquinavir (PDB ID: 3OXC) by using PyMOL (The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.) and provided as a Supplementary Figure S1. In this figure, the protease dimer is represented as a blue cartoon with a surface mesh whereas saquinavir is shown in gray and the different sites of high and highest entropy are shown in a solid surface with orange and red coloring, respectively. Absolutely all sites exhibiting high and mid-entropy were located in the surface of the protease protein and confined to an area located between the elbow and the 60's loop. As expected, variable sites are more tolerated in regions not subjected to high selective pressure, such as catalytic sites and structurally important regions.
It remains unknown as to whether such distribution of variable sites provides HIV with a biological advantage having to do with Gag, Gag-Pol, and Nef precursor polyprotein processing, immune escape, or by interfering/enhancing interactions with other viral or human host proteins. However, recent evidence has shown that residues influencing PI docking are not necessarily located in or near the active site but located in distal regions such as the dimerization interface or flap. 16 Although previous publications have addressed the frequency of PI ARVms, a proper study into the structural distribution of mutations among Latin-American isolates had been lacking. Protease diversity in Mexico and other parts of Latin America is likely to expand in the near future due to current human migratory dynamics.
This study represents the first attempt to characterize the extent of protease diversity present in Mexican HIV-1 isolates and provides a structural overview of its distribution. Unprecedented efforts in the field of genomics, structural virology, and pharmacology have contributed to converting HIV into a chronic manageable disease with sustained control of viral replication.
The integration of these results with other structural, biophysical, molecular dynamics, and virologic data will, undoubtedly, further our understanding of HIV evolution, structure, and function as well as sponsor future research into vaccine development and antiretroviral therapy targeting HIV proteins.
Footnotes
Acknowledgments
The authors thank the physicians and patients attended by CAPASITS San Luis Potosí for providing the samples and understanding the scope and importance of this study.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This project was funded by the Mexican National Science and Technology Council through grants CONACYT I0017 (CB-2011-01) no. 167374 and SSA/IMSS/ISSSTE-CONACYT (FONSALUD-2009-01) no. 115226.
Supplementary Material
Supplementary Figure S1
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
