A Workflow to Improve Variant Calling Accuracy in Molecular Barcoded Sequencing Reads

Abstract

Multiplexed molecular barcoded amplicon sequencing has been previously demonstrated to improve the sensitivity of low-frequency variant detection. Molecular barcoded reads can be used to identify and correct amplification biases introduced during library preparation and sequencing errors. We propose a generic workflow to improve variant calling accuracy that takes advantage of molecular barcoded sequencing reads by applying a base score correction method to duplicate or overlapping read pairs across the targeted panel. The workflow is able to reduce the false-positive rate of variant calls in homopolymer and repetitive regions where the sequencer commonly encounters phasing errors interfering with base calling. The analysis was focused on three specific regions within a custom QIAseq targeted DNA (Qiagen, USA) of 220 genes observed to have a high false-positive rate in a clinical validation study. Uncorrected and corrected datasets were compared at targeted regions to reference calls from NIST and exome data from a reference laboratory (Macrogen, USA). Base correction removed all false positives identified without the correction method and retained the true positive call in the dataset. We have shown that our workflow incorporating base correction of molecular barcoded sequencing data can be applied to germline sequencing to improve variant calling accuracy in genomic regions with certain repetitive sequence motifs.

1. Introduction

Advances in next-generation sequencing (NGS) technologies and the identification of risk associated genetic markers have paved the way for better diagnostic and treatment decisions. Targeted sequencing panels provide an economical option for both the patients and the laboratory to detect genetic variants and assess them for clinical action. Various enrichment strategies for NGS can be employed by the laboratory each with their own tradeoffs between costs, sensitivity, specificity, and uniformity (Mamanova et al., 2009). Hybridization capture techniques have been applied to a wide range of sequencing applications most commonly in whole exome sequencing, whereas the use of multiplex polymerase chain reaction (PCR)-based amplicon enrichment has been more suited in targeted panels requiring high coverage reported by Miya et al. (2015) and Chang and Li (2013).

Amplicon-based sequencing techniques have traditionally offered laboratories a cost-effective approach to high-throughput sequencing by reducing both library preparation time and sample input requirements (Ballester et al., 2016). However, these benefits oftentimes come at the cost of increasing the susceptibility of the sequencing data to PCR biases and amplification errors. The recent introduction of molecular barcodes to identify unique templates within the library of sequencing reads has mitigated these concerns while retaining the benefits of the PCR-based approach (Peng et al., 2015). Molecular barcodes or counter sequences that track the number of unique templates captured during the initial enrichment have previously been shown to reduce polymerase error and provide a more accurate estimation of allelic fraction (Casbon et al., 2011). Barcode aware variant calling has been shown to provide increased sensitivity and specificity for calling single-nucleotide polymorphisms (SNPs) in somatic tissue (Xu et al., 2017). We propose a general workflow to process molecular barcoded amplicon sequencing data in germline specimens to correct the quality scores of conflicting base calls and achieve greater sensitivity when used with any variant caller accepting binary alignment map files as input.

2. Methodology

Library preparation on cell line, blood, and saliva specimens was performed using a custom QIAseq targeted DNA sequencing kit (Qiagen, USA). The kit uses a specific primer for the enrichment of targeted regions and a universal primer containing unique molecular tags to distinguish enriched templates from subsequent PCR duplicates. Sequencing of prepared libraries was conducted on an Illumina NextSeq 500 (Illumina, USA) at 2 × 150 bp pair end sequencing.

2.1. Data analysis

A bioinformatics pipeline was developed to analyze the generated data from the targeted sequencing kit. Raw sequencing data were demultiplexed using Illumina's bcl2fastq software, and sequencing reads below 50 bases were removed from downstream analysis. Then the molecular tag was trimmed from each read and appended to the end of the read name for downstream processing. Alignment of the sequencing data to the human genome reference (GRCh37.p13) was carried out using the Burrow's Wheeler Aligner with default alignment parameters for paired end sequencing (Li and Durbin, 2010). Downstream processing of the aligned sequencing reads was followed.

2.2. Base correction and deduplication

Sequencing reads sharing an identical molecular tag and primer are derived from the same enriched template before PCR amplification. Sequencing reads in the alignment file were parsed and associated with the nearest matching primer based on alignment location. Reads without a mate that mapped within the primer search range were dropped. Soft clipping was applied to primer sequences for each sequencing read to mask the reference bases found on the primer from downstream analysis. Afterward, the pileup of base calls for each sequenced position was retrieved using Pysam, the python wrapper for SAMtools (Li et al., 2009). Sequencing reads were grouped according to primer and molecular barcode. Reads sharing the same molecular tag and primer with conflicting base calls for a given pileup position underwent base score correction.

The correction was performed by applying Bayes rule to determine the conditional probability of each observed base given their respective quality scores. The probability of observing a base k in set n of all observed bases for a group of duplicate reads is dependent on the Phred score of all observed calls for that base P(A_k), the Phred error of observing other base calls at that position P(B|A_k), and the total probability of each observed base in n. A minimum of two sequencing reads is required for correction. Reads can be from the same pair for any positions where overlapping read through is observed. Deduplication of the corrected reads was accomplished by identifying duplicate reads originating from the same primer and molecular tag and subsequently marking all sequencing reads as a duplicate except for the read with the highest average quality score across the sequenced bases. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} P { \rm { ( } } { A_k } { \rm { \vert } } B ) = \; { \frac { P \left( { { A_k } } \right) P ( B \vert { A_k } ) } { \sum P \left( { { A_n } } \right) P { \rm { ( } } B { \rm { \vert } } { A_n } ) } } \tag { 1 } \end{align*} \end{document}

2.3. Quality control (QC) metrics

Quality metrics were captured to evaluate the performance of the sequencing run and each individual specimen within the run. Run metrics were provided by the sequencer. Sequencing runs with metrics within the optimal range recommended by Illumina passed the quality control (QC) check. Slightly underclustered or overclustered sequencing run were evaluated on a case-by-case basis with respect to the quality of the specimens within the run. Sample-specific sequencing metrics were generated to assess the quality of enrichment and uniformity of the generated data.

2.4. Variant calling and copy number variant (CNV) analysis

After the postalignment processing, variant calling for SNPs and small indels was performed using Freebayes (Garrison and Marth, 2012). Variants were called for targeted bases achieving the minimum read depth required. In regions with repetitive sequence motifs such as the presence of homopolymers, dinucleotide repeats, or tandem duplications, a subset of sequencing reads spanning the full length of the motif was isolated and used. Identified variants were filtered according to quality scores, alternate read frequency, and minimum reportable read depth. Small indels were filtered according to a p-score based on Fisher's exact method, alternate read frequency, and minimum reportable read depth. Variant calls persisting after filtering criteria were submitted for classification.

Exon level copy number analysis (deletion and duplications) for targeted genes was performed using the BioDiscovery Nexus software suite (BioDiscovery, USA). Analysis followed the methodology specified by the developer. A reference was created using sequencing data of different samples on the same batch. Each specimen was then compared with the reference and significant differences in coverage between the interrogated specimen and the reference resulted in a gain or loss call by the software. Further analysis of calls in pseudogene or homologous regions was conducted by comparing the relative amplification efficiency of particular specific primers between the targeted gene and its pseudogene. Gene conversions between the targeted and pseudogene were sometimes detected in the initial analysis by the Nexus suite and further supported by the comparison of primer efficiencies.

3. Results

3.1. Base score correction

A subset of the sequencing data generated from a custom QIAseq targeted DNA Panel for 220 genes was processed. Three genomic regions targeted by the panel and observed to have a high false-positive rate due to repetitive stretches of homopolymers were evaluated with and without base score correction. These regions include splice sites and intronic regions in the genes BCKDHA, GAA, and PLEKHG5. A comparison of the variant calls obtained for the corrected and uncorrected datasets was made against one another for NIST standards, NA12878 and NA24385, and saliva draws against corresponding exome sequencing data from a CAP/CLIA reference laboratory [CAP (College of American Pathologists) accredited and CLIA (Clinical Laboratory Improvement Amendments) certified] (Macrogen, USA; Zook et al., 2014). The performance of method to resolve base inaccuracies after homopolymer repeat region was determined by the number of false-positive variant calls identified after correction. In the untranslated region (UTR), 20 bases downstream of the stop codon for BCKDHA, a repetitive sequence motif exists consisting of singular A base calls interspersed between repeats of 3–5 C base calls (Fig. 1). Phasing errors observed during sequencing in this region contribute to two erroneous polymorphisms (c.1338+15A>C and c.1338-21A>C) observed at ∼30% variant read frequency in all specimens sequenced. The false-positive variant c.1338-21A>C was called in NA24385 and 1122SL4 without the correction method. After applying the correction, all discordant positions identified previously were correctly called (Tables 1, 2).

FIG. 1.

This image shows a region of BCKDHA that contains a repetitive stretch of C bases. Base calls for each sequencing read are color coded based on matches (red) and mismatches (blue) to the reference. Base call quality is depicted as saturation with higher Phred scores producing a deeper color and lower scores yielding a lighter color. Incorrect base calls at the A positions are assigned a higher Phred score before correction (left) compared with postcorrection (right).

Table 1.

Variant Calling Accuracy of Base Score Corrected Datasets

Gene	Coordinate (Hg19)	REF	ALT	NA12878	NA24385	1122SL4	1122SL9	1122SL19
BCKDHA	chr19:41930534	A	C	TN	TN	TN	TN	TN
GAA	chr17:78086833	T	G	TN	N/A	TN	TN	TN
GAA	chr17:78086839	T	G	TN	N/A	TN	TN	TN
GAA	chr17:78086846	A	G	TP	N/A	TP	TP	TP
PLEKHG5	chr1:6535217	T	G	N/A	N/A	TN	TN	TN

Comparison of variant calls in homologous regions between the five evaluated samples with base score correction applied. Base score correction reduced the variant call QUAL score below the filtering threshold of 20 removing the FP while retaining the TP call.

FP, false positives; N/A, portions not reported by the references; TN, true negatives; TP, true positive.

Table 2.

Variant Calling Accuracy of Uncorrected Datasets

Gene	Coordinate (Hg19)	REF	ALT	NA12878	NA24385	1122SL4	1122SL9	1122SL19
BCKDHA	chr19:41930534	A	C	TN	FP	FP	TN	TN
GAA	chr17:78086833	T	G	FP	N/A	FP	FP	TN
GAA	chr17:78086839	T	G	FP	N/A	FP	FP	FP
GAA	chr17:78086846	A	G	TP	N/A	TP	TP	TP
PLEKHG5	chr1:6535217	T	G	N/A	N/A	TN	FP	TN

Comparison of variant calls in the same position and samples without base score correction. TNs were sourced from the NIST reference for NA12878 and NA24385 and exome sequencing for the other specimens. Positions not reported by the reference were assigned N/A. FP calls were determined by comparison to the reference data.

3.2. Sequencing error

Base call errors resulting from either PCR amplification or the sequencer were identified separately from the mismatch base calls in NA12878 and NA24385 for invariant regions. Sequencing errors were determined to be discordant base calls within a group of duplicates. PCR errors were determined to be mismatches in all reads within a group of duplicates at a given position. Only sequencing reads with at least one duplicate were used for the error rate calculations. Unique sequencing reads without duplicates could not be reliably evaluated for all mismatches to the genome.

Sequencer error rate varied depending on the composition of the targeted bases. A higher error rate was observed for regions with repetitive bases (upward of 4.2% of sequencing reads) compared with regions with greater base diversity (below 1% of sequencing reads). Overall, PCR errors were observed to be below 0.3% of sequencing reads for a given base position.

Mean base quality scores for mismatches resulting from sequencing errors and PCR amplification were determined from both corrected and uncorrected datasets. Mean base call quality score of corrected datasets for sequencing errors was below 1. Base call quality score for PCR amplification did not vary significantly before and after correction likewise for the quality scores of correct sequencing bases.

3.3. Correction rate

To assess the effect of duplication rate on quality correction, the mean quality score of base calls required to correct a mismatch from a single sequencing read with a resulting quality score of Q30 in correct read was calculated for 1, 2, 3, and 4 duplicate sequencing reads. This value was determined for each base call score based from the following equation: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { P_ { Q30 } } = { \frac { p_a^n \; { p_ { berror } } } { p_a^n \; { p_ { berror } } + \; { p_b } p_ { aerror } ^n } } , \tag { 2 } \end{align*} \end{document}

where p_a and p_b represent the probability that the base call p_a and p_b are correct respectively. n is the number of duplicate sequencing reads with the same base call. p_aerror and p_berror are the probability of errors for the base calls p_a and p_b, respectively.

Base correction when applied to a single duplicate read returned a corrected Phred score of 30 if the given call score for the incorrect base call was below 10. It was not possible to obtain a corrected base score ≥Q30 for incorrect base calls with a score greater than 10 with only one duplicate sequencing read for correction. For a single sequencing read, a minimum of two duplicate reads are necessary to correct an incorrect call at all possible Phred score values produced by the Illumina NextSeq 500 sequencer with a resulting corrected call score of at least 30 (Fig. 2). Additional duplicate reads reduce the necessary mean quality score required to achieve a Q30 call.

FIG. 2.

A graph of the minimum number of duplicate reads and their mean Phred score required to correct a discordant base called at a certain Phred score. The majority base call will have a quality score of Q30 after correction if they meet the mean correct base score. A minimum of two duplicate reads is necessary to correct a single base error at all possible base qualities on a Phred scale of 40. More duplicates reads supporting a particular base call will reduce the mean Phred score requirement to correct a discordant call.

4. Discussion

4.1. Primer affinity and sequence variants

One challenge from amplicon capture methods is the dependency on the primer to bind and sequence all alleles present in the genome uniformly. A commonly observed phenomenon is the presence of a sequence variant on a primer binding site which reduces the affinity of the primer for the target. In some cases, the affinity is significantly reduced and no sequencing reads are generated from the affected allele. As a limitation of amplicon sequencing, this poses challenges to adequately evaluate a target when allelic dropout caused by primer affinity could occur in a given individual.

4.2. Sequencing errors, phasing, and limitations

Homopolymer repeat regions pose many challenges for high-throughput sequencing technology. Enrichment and subsequent PCR amplification from these regions can introduce errors during library preparation that get carried over to sequencing where incorrect phasing of clonal clusters further reduce base call quality and result in additional sequencing errors. Base score correction for sequencing reads in homopolymer or repetitive regions in the genome improves variant calling by reducing the number of false-positive calls. Sequencing errors due to phasing interference can also be mitigated from downstream analysis by reevaluating the quality score of the base call and subsequent filtering applied to variant calls originating from positions with low alternate observation qualities.

The base score correction method described here relies on the assumption that a higher base call score equates to a more correct base call and the greater number of calls supporting one observation the more correct that observation is for the given position.

4.3. Strand bias

We have detected strand bias in GC-rich regions affecting either the forward or reverse read in libraries sequenced on the Illumina NextSeq 500. The contributing factor to the observed strand bias is due to DNA polymerase failing to sequence beyond the GC-rich segment on the affected read. In some cases, the affected reads can be identified from a nonaligning segment of reads comprised predominantly of G base calls. In other cases, the nonaligning segments comprised a string of G base calls followed by base calls that do not correspond to the reference. The G base calls result from the absence of either red or green fluorescence on the flow cell, suggesting that the segment lacks base incorporation at subsequent cycles. In the latter examples, we believe the interference from nearby clusters leads to erroneous base calls in the sequence as part of this stretch. Mismatched read length between the mates in the pair was observed in the dataset. The aligning mate in these reads was trimmed below the maximum read length due to sequencing through to the adapter sequence; however, the hompolymer tail in the nonaligning mate continues until the maximum read length. Unlike in the mate, the adapter sequence is never encountered further, suggesting a failure of the polymerase to incorporate DNA. Allelic bias has also been identified in the amplicon sequencing method consistent with the findings from Samorodnitsky et al. (2015).

4.4. Improved sensitivity

Base score correction depends on the number of duplicate reads available and the odds that the correct base call occurs more often or at a higher sequencing quality to perform the correction. This methodology benefits from increased sequencing depth allowing the algorithm to draw upon a larger number of duplicate reads as evidence for a base call. Although we have only tested this methodology to improve calling of germline variants, we believe that it is applicable to improve calling somatic variants as well. Low-frequency variants will benefit from increased accuracy and sensitivity of additional duplicate reads used for correction.

5. Conclusion

We have demonstrated the effect of base quality correction to improve variant calling accuracy in repetitive regions where phasing issues may exist. Additionally, base quality correction can be applied to reduce the impact of sequencing errors to improve sensitivity and detect low-frequency variants.

Footnotes

Acknowledgment

The authors would like to thank the Department of Molecular Genetics wet lab at True Health Diagnostics for providing the sequencing data used in our analysis.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Ballester

L.Y.

, Luthra

, Kanagal-Shamanna

, et al. 2016. Advances in clinical next-generation sequencing: Target enrichment and sequencing technologies. Expert Rev. Mol. Diagn. 16, 357–372.

Casbon

J.A.

, Osborne

R.J.

, Brenner

, et al. 2011. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 39, e81–e81.

Chang

, and Li

2013. Clinical application of amplicon-based next-generation sequencing in cancer. Cancer Genet. 206, 413–419.

Garrison

, and Marth

. 2012. Haplotype-based variant detection from short-read sequencing. http://arxiv.org/abs/1207.3907v2. Accessed January 20, 2017.

, and Durbin

2010. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26, 589–595.

, Handsaker

, Wysoker

, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25, 2078–2079.

Mamanova

, Coffey

A.J.

, Scott

C.E.

, et al. 2010. Target-enrichment strategies for next-generation sequencing. Nat. Methods. 7, 111–118.

Miya

, Kato

, Shiohama

, et al. 2015. A combination of targeted enrichment methodologies for whole-exome sequencing reveals novel pathogenic mutations. Sci. Rep. 5, 9331.

Peng

, Satya

R.V.

, Lewis

, et al. 2015. Reducing amplification artifacts in high multiplex amplicon sequencing by using molecular barcodes. BMC Genomics. 16, 589.

10.

Samorodnitsky

, Jewell

B.M.

, Hagopian

, et al. 2015. Evaluation of hybridization capture versus amplicon-based methods for whole-exome sequencing. Hum. Mutat. 36, 903–914.

11.

, Ranjbar

M.R.N.

, Wu

, et al. 2017. Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller. BMC Genomics, 18, 5.

12.

Zook

J.M.

, Chapman

, Wang

, et al. 2014. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251.