Abstract
The prevalence of HIV-1 drug resistance is increasing worldwide and monitoring its emergence is important for the successful management of populations receiving combination antiretroviral therapy. It is likely that pre-existing drug resistance mutations linked on the same viral genomes are predictive of treatment failure. Because of the large number of sequences generated by ultrasensitive single-genome sequencing (uSGS) and other similar next-generation sequencing methods, it is difficult to assess each sequence individually for linked drug resistance mutations. Several software programs exist to report the frequencies of individual mutations in large data sets, but they provide no information on linkage of resistance mutations. In this study, we report the HIV-DRLink program, a research tool that provides resistance mutation frequencies as well as their genetic linkage by parsing and summarizing the Sierra output from the Stanford HIV Database. The HIV-DRLink program should only be used on data sets generated by methods that eliminate artifacts due to polymerase chain reaction recombination, for example, standard single-genome sequencing or uSGS. HIV-DRLink is exclusively a research tool and is not intended to inform clinical decisions.
Introduction
The advent of combination antiretroviral therapy (cART) changed HIV/AIDS from a deadly disease in most individuals to one that can be managed with lifelong treatment. 1 However, owing to the high genetic diversity of HIV-1, low-frequency drug resistance mutations exist in patients even before cART initiation 2–4, and these low-frequency mutations may lead to the emergence of drug-resistant viral rebound and treatment failure. Recently, acquired and transmitted drug resistance have become quite prevalent in some parts of the world, 5,6 constituting a major barrier to successful treatment of HIV-1.
The Stanford HIV Database (Stanford HIVdb) is a reliable and accurate tool for interpreting HIV drug resistance in population genotypes 7 (
One difficulty in the interpretation of linked mutations is that bulk polymerase chain reaction (PCR) may result in artifactual recombination. 9 Therefore, linked mutations can only be identified in sequences obtained using methods that eliminate PCR recombination, in conjunction with pipelines that omit sequences resulting from PCR recombination and possible nucleotide mixtures. Such methods include single-genome sequencing (SGS), 10 ultrasensitive single-genome sequencing (uSGS), 11 and other similar next-generation sequencing (NGS) methods. 12,13 In brief, uSGS uses primer IDs like many other NGS approaches 14–17; however, the Illumina adapters are added by ligation rather than by PCR, significantly reducing the bias and recombination that are inherent to amplification with long PCR primers.
Although uSGS can only obtain sequence reads up to ∼500 base pairs from amplicons up to 1 Kb in length, it is possible to link protease inhibitor (PI) resistance mutations to reverse transcriptase (RT) mutations by obtaining 250 base pair reads from one end of the 1 Kb amplicon and 250 base pair reads from the other end.
Recent changes to the PacBio platform and chemistry allow for much longer sequence reads (up to the full-length HIV genome) with much higher accuracy than was previously possible with this approach. 18 PacBio technology may allow for future investigations of linkage between PI, RT, integrase (IN), and even Env drug resistance mutations 19,20 when applied to single-genome amplicons. Although many programs exist for the analysis of HIV sequencing data, including assembly, base-calling, and mutation frequencies, 21–28 these programs do not detect linkage of drug resistance mutations on the same viral genomes.
In this study, we describe a tool called HIV-DRLink that can quickly process thousands of HIV-1 sequences using the Stanford HIVdb server to report not only the frequencies of single drug resistance mutations in the population but also the frequency of mutations linked on the same viral genomes, which may be predictive of cART failure. HIV-DRLink is intended as a research tool that parses the output of Stanford HIVdb to report linked HIV-1 drug resistance mutations. It should only be used for analyzing high-quality sequences, from any platform, that are free of artifactual recombination, such as those generated by uSGS. HIV-DRLink is not intended for informing clinical decisions.
Materials and Methods
HIV-DRLink description
HIV-DRLink is based on the Stanford HIVdb genotypic resistance interpretation program using the Stanford command line program “Sierra Web Service 2.0: 2016—present” (
The output of HIVdb Sierra JSON files can be extensive. To simplify the output, a GraphQL protocol is used to select only the gene names (PR, RT, or IN), the mutation types (e.g., primary), and the specific mutations. The GraphQL protocol used in the pipeline is a simple text file called inputSequence { header, }, mutations { gene {name} primaryType text }
Two steps are used to run the pipeline:
Step 1: Submit FASTA-formatted sequences to Stanford HIVdb using the following command line after the Python client SierraPy is installed locally:
sierrapy fasta input_file.fasta
Here, output.json is an example name for the output file; any file name may be used.
Step 2: Run
It should be noted that mutations or polymorphisms at positions not associated with drug resistance are ignored, and thus sequences with the same pattern of drug resistance mutations may not be identical at nonresistance sites. The results of HIV-DRLink are reported in a tab-delimited text file.
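The grouping of sequences by their resistance pattern can be sketched in Python as follows. This is a minimal illustration, not the HIV-DRLink source code. It assumes that the Sierra JSON output is a list with one record per sequence, carrying only the fields named in the GraphQL fragment (`inputSequence.header`, and `mutations` with `gene.name`, `primaryType`, and `text`), and that mutations with a `primaryType` of `"Other"` are the non-resistance changes to be ignored. The example records and the helper names `resistance_pattern` and `count_patterns` are hypothetical.

```python
import json
from collections import Counter

# Invented example records shaped like the fields selected by the GraphQL fragment.
EXAMPLE_JSON = """
[
  {"inputSequence": {"header": "seq1"},
   "mutations": [{"gene": {"name": "RT"}, "primaryType": "NRTI", "text": "M41L"},
                 {"gene": {"name": "RT"}, "primaryType": "NRTI", "text": "M184V"},
                 {"gene": {"name": "RT"}, "primaryType": "Other", "text": "K122E"}]},
  {"inputSequence": {"header": "seq2"},
   "mutations": [{"gene": {"name": "RT"}, "primaryType": "NRTI", "text": "M184V"},
                 {"gene": {"name": "RT"}, "primaryType": "NRTI", "text": "M41L"}]},
  {"inputSequence": {"header": "seq3"},
   "mutations": [{"gene": {"name": "RT"}, "primaryType": "NNRTI", "text": "K103N"}]},
  {"inputSequence": {"header": "seq4"}, "mutations": []}
]
"""

# Assumption: this primaryType value marks changes at non-resistance positions.
NON_RESISTANCE_TYPES = {"Other"}


def resistance_pattern(record):
    """Return a sorted tuple of 'GENE:mutation' labels for one sequence."""
    labels = {
        f'{m["gene"]["name"]}:{m["text"]}'
        for m in record.get("mutations", [])
        if m.get("primaryType") not in NON_RESISTANCE_TYPES
    }
    return tuple(sorted(labels))


def count_patterns(records):
    """Count how many sequences share each resistance pattern."""
    return Counter(resistance_pattern(r) for r in records)


records = json.loads(EXAMPLE_JSON)
patterns = count_patterns(records)
total = len(records)

# One tab-delimited line per pattern: mutations, count, percent of all sequences.
for pattern, n in patterns.most_common():
    label = ", ".join(pattern) if pattern else "no DRM"
    print(f"{label}\t{n}\t{100 * n / total:.1f}")
```

Because the pattern is built from a sorted set, seq1 and seq2 fall into the same pattern even though their mutations are listed in a different order and seq1 carries an additional non-resistance change.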
Meta sequence data
The pipeline was tested for speed and performance using HIV-1 subtype B sequences from Los Alamos HIV Sequence Database (
Clinical sequence data
Although HIV-DRLink can be used to report linkage of drug resistance mutations in sequences obtained by technologies that omit artifactual recombination, Illumina MiSeq-based uSGS data were used here to validate the pipeline on sequences obtained from a clinical sample. 11 The clinically derived sequences were obtained from GenBank (Accession Nos. KY810858–KY812454). After filtering to remove low-quality reads, the paired-end FASTQ files were used for bioinformatics processing to generate HIV-1 sequences of 404 bases in length that covered RT from codons 59 to 131 and from 166 to 226. 11
All sequences used in this manuscript were obtained from published papers and public databases. No additional IRB approval was needed.
Results
Testing the accuracy of HIV-DRLink on sequences obtained from Los Alamos HIV Database
Table 1 gives the results of an HIV-DRLink run on 500 patient-derived HIV sequences obtained from the Los Alamos database. As stated in the methods, although the training data set contains sequences generated by bulk PCR and sequencing and, therefore, true linkage cannot be determined with such data, it is used here only to assess the ability of the pipeline to accurately report drug resistance mutations and to assess its rate of processing large data sets.
Drug Resistance Frequencies of HIV-1 Subtype B Reverse Transcriptase Sequences
Bulk DNA sequences from Los Alamos HIV Sequence Database (
DRM, drug resistance mutation; RT, reverse transcriptase.
The first column in Table 1 shows the number assigned to each drug resistance pattern; the second column shows the specific pattern identified, and the third column shows the number of sequence variants that share that particular pattern. Among the 500 sequences retrieved from the Los Alamos HIV Database, 56 had at least one drug resistance mutation (Table 1, bottom row).
Although some sequences had a single resistance mutation, for example, pattern 3 had only K101E, some others had two or more resistance-conferring mutations, for example, pattern 15 had M41L, D67N, K70R, M184V, L210W, T215F, and K219Q. The percentages of each resistance pattern in the population are shown in the fourth column, ranging from 0.2% to 1.4%. The remainder of the columns show the presence of each individual drug resistance mutation with the last row providing the frequency of each in the total population.
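The table layout described above can be sketched as follows: one row per pattern with its count and percentage, one presence column per mutation, and a final row giving the frequency of each mutation in the total population. This is an illustrative sketch only. The function name `build_table`, the `x` presence marker, and the counts in the example are invented and are not taken from Table 1.

```python
def build_table(pattern_counts, total):
    """Build table rows from a dict mapping mutation tuples to sequence counts.

    `total` is the number of sequences analyzed, including those without any
    drug resistance mutation.
    """
    mutations = sorted({m for pattern in pattern_counts for m in pattern})
    rows = [["Pattern", "DRMs", "Count", "Percent"] + mutations]

    # One numbered row per pattern, with a marker under each mutation it contains.
    for i, (pattern, n) in enumerate(sorted(pattern_counts.items()), start=1):
        flags = ["x" if m in pattern else "" for m in mutations]
        rows.append([str(i), ", ".join(pattern), str(n), f"{100 * n / total:.1f}"] + flags)

    # Final row: sequences with at least one DRM, then each mutation's frequency.
    with_drm = sum(pattern_counts.values())
    per_mutation = [
        f"{100 * sum(n for p, n in pattern_counts.items() if m in p) / total:.1f}"
        for m in mutations
    ]
    rows.append(["Total", "", str(with_drm), f"{100 * with_drm / total:.1f}"] + per_mutation)
    return rows


# Invented counts for illustration: 6 of 50 sequences carry a resistance pattern.
example_counts = {("K101E",): 2, ("M184V", "M41L"): 3, ("M184V",): 1}
table = build_table(example_counts, total=50)
for row in table:
    print("\t".join(row))
```

In this example M184V appears in two patterns, so its frequency in the last row (8.0) is the sum of both pattern counts divided by the total, which is how a per-mutation frequency can exceed the frequency of any single pattern containing it.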
Although Table 1 shows the drug resistance patterns in RT only, our program can reveal linkage of mutations in protease (PR) and IN and other genes without additional input options or parameters. In addition to the 500 sequences already described, we downloaded an additional 200 bulk pro-pol sequences from the Los Alamos Database to test in the pipeline for analysis of linked mutations in PR–RT–IN.
Supplementary Table S1 gives an HIV-DRLink output file demonstrating that some sequences, as in the first set of data, had only one resistance mutation, for example, pattern 1 included only S147R in IN whereas others had “linked” mutations, such as pattern 6 with M46I and N88D in PR, M41L, Y215Y in RT, and G163K in IN and pattern 34 with V32I, L33F, M46I, I47V, I54L, and I84V in PR, M41L, D67N, K70R, M184V, T215F, and K219Q in RT, and G140S and Q148H in IN. As already stated, because the training data used were generated by bulk sequencing, the mutational patterns described do not report true linkage of mutations on single genomes but demonstrate the accuracy of the program to report patterns of drug resistance mutations in a hypothetical large data set.
Drug resistance detection in clinical samples using HIV-DRLink
To evaluate true linkage of drug resistance mutations on single HIV-1 genomes, we tested the pipeline on plasma HIV-1 RNA sequences obtained by uSGS 11 (sequences available at KY810858–KY812454). The plasma sample was obtained from an HIV-infected donor with viremic failure on ART. uSGS yielded 1,597 high-quality single-genome pol sequences covering RT codons from 59 to 131 and from 166 to 226. 11
Table 2 gives 12 different resistance patterns that were detected using the HIV-DRLink program, all of which had linked mutations. Although some of the patterns were rare, for example, patterns 1 to 3 comprising 0.06% of the population, pattern 12 with four linked mutations comprised 73% of the population. HIV-DRLink also calculated the levels of individual drug resistance mutations in the sample. For example, 21.54% of the sequences had the D67N mutation and 99% had T215Y.
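As a quick check of how these percentages relate to sequence counts, the arithmetic can be sketched as below. The total of 1,597 sequences comes from the text. The counts of 1 and 344 are back-calculated assumptions used only to illustrate the calculation; they are not values reported in Table 2.

```python
def percent(count, total, digits=2):
    """Percentage of `total` represented by `count`, rounded to `digits` places."""
    return round(100 * count / total, digits)


TOTAL = 1597  # single-genome sequences obtained from the clinical sample

# Assumption: a pattern observed in a single sequence.
rare_pattern = percent(1, TOTAL)
# Assumption: 344 sequences carrying D67N (back-calculated, not reported).
d67n = percent(344, TOTAL)

print(rare_pattern)  # 0.06
print(d67n)          # 21.54
```

Under these assumptions, a pattern reported at 0.06% corresponds to a single sequence among the 1,597 analyzed.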
Frequencies of Linked Drug Resistance in HIV-1 Subtype B Reverse Transcriptase Sequences
Clinical sequences analyzed by uSGS. 11
uSGS, ultrasensitive single-genome sequencing.
Speed of HIV-DRLink
To evaluate HIV-DRLink speed and performance, 23,781 pol sequences from the Los Alamos HIV Sequence Database were submitted to the Stanford HIVdb through the Python client SierraPy (
Discussion
We developed and applied a program called HIV-DRLink that is capable of reporting linked and unlinked HIV-1 drug resistance mutations in large data sets of single-genome sequences, making use of the very well annotated and maintained Stanford HIV Drug Resistance Database (
To test the accuracy and speed of HIV-DRLink, HIV-1 pol sequences from the Los Alamos HIV Sequence Database were queried and submitted to Stanford HIVdb. It should be noted that most sequences stored in the Los Alamos HIV and Stanford databases are from population sequencing and, therefore, each likely represents a mixture of genomes. Such data sets are used here only to assess the accuracy and speed of the program, not to directly assess linkage of mutations on single genomes. The results show that HIV-DRLink accurately and rapidly reported the frequency of drug resistance mutations by parsing the results from the Stanford HIVdb. HIV-DRLink was further tested on uSGS data obtained from a clinical specimen where each sequence was known to have originated from a single viral template, 11 and the program reported accurate results in <1 minute.
In conclusion, we developed a tool, HIV-DRLink, that works in conjunction with the Stanford HIVdb to rapidly report linked and unlinked HIV-1 drug resistance mutations in large data sets generated by SGS methods, including the uSGS NGS approach, that eliminate PCR-based recombination and nucleotide mixtures. HIV-DRLink is a necessary tool to further investigate the effect of single versus linked pre-existing drug resistance mutations on the outcome of ART.
Footnotes
Acknowledgments
We acknowledge that a very important part of the pipeline is to obtain and parse information from Stanford HIVdb (
Author Disclosure Statement
No competing financial interests exist.
Funding Information
We acknowledge the funding sources for this study from NCI CCR, the Office of AIDS Research, NIH, NCI intramural funding to M.F.K., and NCI contract no. HHSN261200800001E. J.M.C. was a research professor of the American Cancer Society and was supported by Leidos contract 13XS110.
Supplementary Material
Supplementary Table S1
References
