Abstract
Several tools have been developed for calling variants from next-generation sequencing (NGS) data. Although they are generally accurate and reliable, most of them have room for improvement, especially regarding calling variants in datasets with low read depth. In addition, the somatic variants predicted by several somatic variant callers tend to have very low concordance rates. In this study, we developed a new method (RDscan) for improving germline and somatic variant calling in NGS data. RDscan removes misaligned reads, repositions reads, and calculates RDscore based on the read depth distribution. With RDscore, RDscan improves the precision of variant callers by removing false-positive variant calls. When we tested our new tool using the latest variant calling algorithms and data from the 1000 Genomes Project and Illumina's public datasets, accuracy was improved for most of the algorithms. After screening variants with RDscan, calling accuracies increased for germline variants in 11 of 12 cases and for somatic variants in 21 of 24 cases. RDscan is simple to use and can effectively remove false-positive variants while maintaining a low computation load. Therefore, RDscan, along with existing variant callers, should contribute to improvements in genome analysis.
1. Introduction
Over the past decade, large-scale sequencing of human genomes has been carried out using next-generation sequencing (NGS) technologies (Yi et al., 2010; The 1000 Genomes Project Consortium, 2015; Lek et al., 2016). Variants in human genomes are intimately associated with the genetic causes of many human diseases, as well as the genetic diversity within and among human populations (Ng et al., 2010; Lee et al., 2014). Therefore, identifying variants from NGS data has become a key foundation of human genome analysis.
The accuracy of variant calls for single-nucleotide variants (SNVs) and insertions and deletions (indels) depends on artifacts from sequencing device error, DNA contamination, and read misalignment (Li, 2014; Do and Dobrovic, 2015). These artifacts can lead variant callers to call artificial unreal variants. To minimize false-positive variant calling frequencies, several algorithms have been developed based on key features such as read depth, base/mapping quality, strand bias, and haplotype (DePristo et al., 2011; Garrison and Marth, 2012; Koboldt et al., 2013; Rimmer et al., 2014; Lai et al., 2016; Cho et al., 2018; Kim et al., 2018).
Recently, deep learning models such as DeepVariant (Poplin et al., 2018) and NeuSomatic (Sahraeian et al., 2019) have been implemented for variant calling based on these features. However, variant callers for the variants with low read depth and low variant allele frequency (VAF) still have room for improvement in terms of accuracy (Krøigård et al., 2016; Kim et al., 2018; Sahraeian et al., 2019).
To improve variant call accuracy, we propose read depth distribution as a new feature. We expect that the depth distribution of the aligned read set containing a variant should not differ from the depth distribution of the whole read set mapped to the same region, as genomic DNAs are randomly broken into smaller fragments and reads are generated from these fragments by NGS devices regardless of whether a variant is included (King et al., 2006). Hence, this similarity of the read depth distributions can be used to eliminate artificial variants; the variant candidate with read depth distribution similar to the overall read depth distribution should be kept as the true-positive variant, whereas other candidates should be removed as false. We used a similar approach for haplotyping the major histocompatibility complex regions (Ka et al., 2017).
In this study, we developed RDscan, a novel variant filtering method based on read depth distribution that effectively removes artificial variants. This method can improve the accuracy of variant calls relative to any other method. For most variant calling algorithms that we tested, the accuracies of these algorithms were further improved by differentiating real variants from false-positive variants using RDscan.
2. Materials and Methods
2.1. WGS data from public genome datasets
To evaluate the accuracy of RDscan for germline variants, we used WGS data for two samples, HG001 (34 × ) and HG002 (25 × ), from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015); the detailed dataset information is provided in Supplementary Table S1. The reference standards, the Genome in a bottle (GIAB) truth sets for HG001 (v3.3.2) and HG002 (v4.1), were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/. The variant calling accuracy was evaluated within the high-confidence region suggested by GIAB.
2.2. In silico mixtures of unrelated germline samples
We used mixed DNA sequencing data from two unrelated individuals (NA12878 and NA12877) generated by Illumina (Kim et al., 2018) to evaluate the accuracy of somatic calls by RDscan. The dataset, consisting of three paired samples with different proportions of normal and tumor DNA purity (100% and 80%, 100% and 20%, and 90% and 80%), was downloaded from https://www.ncbi.nlm.nih.gov/sra/(Supplementary Table S1). The truth sets and the confidence region data for verification were downloaded from Illumina.
2.3. Collecting candidate variants
RDscan provides an additional filtering method for eliminating the false-positive variants called by the other variant callers. To collect the germline variants, we used GATK HaplotypeCaller (DePristo et al., 2011), Strelka2 (Kim et al., 2018), and DeepVariant (Poplin et al., 2018). For the somatic variants, we used Strelka2, VarDict (Lai et al., 2016), Mutect2 (Cibulskis et al., 2013), and NeuSomatic (Sahraeian et al., 2019). The collected variants are re-evaluated by the RDscan method, as described below.
2.4. Filtering misaligned reads in repeat region
RDscan prevents erroneous influence on variant calling by preliminarily removing misaligned reads in the repeat regions where one or more sequences appear more than three times repeatedly. Specifically, a read that starts or ends in a repeat region is considered misaligned. As an example, shown in Step 2 of Figure 1, a repeat region consists of nucleotide sequences that repeat TG four times.

Overview of RDscan based on read depth distribution. The workflow of RDscan is summarized in six steps. In Step 1, a variant caller produces a VCF file of candidate variants. Among the candidate variants, this figure deals with AT deletion variants occurring at locus i. For the reads mapped to locus i, in Step 2, RDscan filters the misaligned reads to ensure call accuracy. Reads that start or end in a repeat region are considered misaligned. In this example, the repeat region consists of a nucleotide sequence that repeats TG four times, and the top four reads and the bottom read are removed. In Step 3, all remaining reads are tagged as ALL, and all the reads containing the variant are tagged as VAR. If the variant is an indel, as shown in Step 4, RDscan adjusts the read alignment to eliminate alignment region length differences between ALL and VAR. Finally, in Step 5, RDscan estimates the RDscore for the AT deletion variant, which represents the correlation depths of ALL and VAR in the Comparison Region. The Comparison Region is the area in which the reads in ALL are aligned. For each candidate variant, the estimated RDscore is recorded in the original VCF file through iterations of Steps 2 through 5. VCF, Variant Call Format.
In this case, we can simultaneously observe two variations at locus i, one for having G instead of the reference sequence A, and the other for the AT deletion. However, the top four reads with their ends in the repeat region cannot differentiate the case of variation G from the case of AT deletion because they do not have sequence information to anchor after the repeat region. For reads extending beyond the repeat region, the sequence alignment before and after the repeat region will force the variant to be either G or deletion of AT; otherwise, the misidentification would lead to a positional shift of the sequences after the repeat region. Hence, if the start or end of a read is aligned in a repeat region, we consider the read as having a risk of misalignment, and therefore eliminate it.
2.5. A new variant calling method
In this section, we introduce a new method for determining the reliability of a variant call by comparing the read depth distribution with the variant to the whole read depth distribution aligned over the variant locus. For each candidate variant, RDscan first extracts all reads aligned to the locus of the variant in the corresponding Binary Alignment Map (BAM) file, and then removes misaligned reads in the repeat regions, as described in the previous section. This set of filtered reads is called ALL. Among the reads in ALL, RDscan extracts VAR, a new set of reads containing the variant (Step 3 in Fig. 1). If the variant is an indel, it causes a difference in the aligned area lengths of reads between ALL and VAR, which adversely affects the calculation of the correlation of the read depth distribution between the two groups.
To prevent this, RDscan adjusts alignment of the reads. As shown in Step 4 of Figure 1, the reads with the AT deletion have an aligned region longer than other reads by the length of the deleted sequence. In this example, RDscan decreases the aligned area of the reads with the AT deletion by the length of the deletion. Conversely, in the case of an insertion variant, the aligned region is increased by the length of the variant. To compare the read depth distributions between ALL and VAR, RDscan calculates the read depths for each group.
We denote the read depth of the group X at locus i as
The RDscore ranges from 0 to 1. The RDscore of a variant ∼1 indicates that the variant is reliable. We used ALGLIB library (ALGLIB; https://www.alglib.net/) to calculate the Pearson correlation coefficient (corr) between the depths of ALL and VAR. The exponent value at Eq. (1) is to maximize the difference between the corrs of true-positive and false-positive variants. Specifically, the reference corr for the true-positive variants was set to 0.86, which preserves a sensitivity of 95% and 99.9% for NA12878 sample with coverage of 34 × and 300 × , respectively (Fig. 2).

The cumulative distributions of the Pearson correlation coefficients between ALL and VAR of the true-positive variants (3,544,295) within the two NGS data with coverage of 34 × and 300 × for the NA12878 sample. NGS, next-generation sequencing.
The reference corr of false-positive variants was set from the average corr of the false-positive variants from each algorithm. For the NA12878 sample with coverage of 34 × , the average corrs of the false-positive variants were 0.231, 0.431, and 0.245 for Strelka2, GATK Haplotype Caller, and DeepVariant, respectively. The exponent values of the Rdscore, which maximize the distance between the reference corrs of true-positive and false-positive variants, are 1.7, 2.5, and 1.8, respectively. As a result, the average (2.0) of those values is used for the exponent value.
2.6. Variant calling method from paired tumor and normal samples
A variant that is present in a cancerous tissue, but not in matched normal tissue, is called a somatic variant. In practice, the normal sample may contain some tumor cells, or the tumor sample could be contaminated by normal cells; either case could cause real somatic variants to be discarded. To address this problem, RDscan considers the VAF of the tumor and normal samples. First, RDscan calculates the RDscores (RDtumor and RDnormal) of a variant from the tumor and matched normal samples, respectively, using Eq. (1) in the previous section. Then, RDscan uses the following score function to determine whether the variant is somatic:
The possible range of the RDscoresomatic is 0–1 for a variant; a score ∼1 means that the variant is somatic variant. For a given variant, VAFtumor and VAFnormal represent, respectively, the variant allele frequencies of tumor and normal tissues.
3. Results
3.1. Evaluations of germline variant calling accuracy
We ran RDscan with candidate variant sets from three germline variant callers (Strelka2, GATK HaplotypeCaller, DeepVariant) using two public datasets (HG001 and HG002). Germline variant callers were executed based on the best practices described by the authors (Supplementary Table S2). Germline variant calling accuracy was evaluated using “hap.py” (Krusche et al., 2019) relative to the GIAB truth set. In this study, we assumed that germline variants with RDscore > 0.5 were true-positive variants.
We first compared the original set of variants called by each algorithm with the variants subjected to further filtering with RDscan. The original set of variants consisted of variants that passed all filtering criteria provided by each variant caller. Overall, RDscan improved variant calling accuracy by decreasing the number of false-positive calls (FPs) while minimizing the reduction in true-positive calls (TPs). As shown in Table 1, for SNVs in HG001, the number of FPs was reduced by 69.5% for Strelka2 (from 438,358 to 133,573), 54.1% for GATK HaplotypeCaller (139,993 to 64,219), and 36.0% for DeepVariant (from 115,512 to 73,911), whereas the numbers of TPs were reduced by 0.65%, 0.74%, and 0.72%, respectively. After screening with RDscan, the accuracy (F-score) for SNVs was higher in all six cases, and the accuracy for indels was higher in five of six cases.
Germline Variant Calling Accuracy
Recall, precision, and F-score are expressed as Rec., Pre., and F-scr., respectively. PASS is the set of variants that passed all filtering criteria provided by each variant caller, and PASS+RD is the set of variants in PASS that passed additional filtering with RDscan. F-scores are calculated by the following equation:
indels, insertions and deletions; SNVs, single-nucleotide variants.
We also show the overall improvement in the performance of these algorithms achieved by RDscan. As shown in Figure 3, performance for indels was noticeably improved, whereas for SNVs RDscan slightly improved the variant calling performance relative to DeepVariant, even though it already had excellent variant calling accuracy. For both datasets, the overall performance of these three algorithms was improved when the variants were screened by RDscan (Supplementary Fig. S1).

Variant calling accuracy changes according to DeepVariant and RDscan scores. For two datasets (HG001 and HG002), the Receiver Operating Characteristic (ROC) curve shows the variation in recall and precision of the variant call with a scoring parameter. The scoring parameters used to generate the solid and dotted curves were Qual (DeepVariant) and RDscore (RDscan), respectively. The dashed lines, Qual (RD >0.5), show the results of applying RDscore > 0.5 to the sets of variants according to the Qual parameter of DeepVariant (solid line).
The additional burden of the quality checks should be relatively moderate. RDscan is simple to use and can effectively remove false-positive variants while still having a low computation load (∼2 hours using Intel Xeon E5620 2.4 GHz, 12 cores, and 64 GB memory for one WGS dataset) relative to the runtime of the entire variant call pipeline. The memory usage requirement for RDscan is 5–10 GB.
3.2. Evaluations of somatic variant calling accuracy for paired tumor and normal samples
Next, we evaluated the performance of RDscan for somatic variant calling using in silico mixed datasets and the truth set provided by Illumina (Kim et al., 2018). We used these data to run some of the known somatic variant callers (Strelka2, VarDict, Mutect2, and NeuSomatic), and then ran RDscan to further screen the results obtained with each caller. The somatic variant callers were executed using the best practices provided by the authors (Supplementary Table S2). We first tried running the latest version (v4.1.8.1) of Mutect2 using the best practice pipeline provided by bcbio-nextgen (Chapman et al., 2021).
This pipeline includes the step of removing known germline variants by reference to an external database. Consequently, higher somatic variant calling accuracy is expected for a typical Tumor-Normal dataset. However, for the in silico datasets generated based on the germline variants used for verification in this article, most of the variants were filtered out by this method, making analysis impossible. Therefore, we used Mutect2 (v4.1.1.0) as an alternative method to proceed only with basic somatic variant calling from the in silico datasets. We used “som.py” (Krusche et al., 2019) to evaluate somatic variant calling accuracy against the in silico germline mixture truth sets. In this study, variants with RDscoresomatic > 0.3 were considered true-positive somatic variants.
We first compared the original variants of the algorithms with the variants selected by RDscan. The original set of variants consisted of variants that passed all filtering criteria provided by each variant caller. Overall, variant calling accuracies of four algorithms were improved after the variants were screened using RDscan. As shown in Table 2, after screening with RDscan, the call accuracies for SNVs increased in 10 of 12 cases, and the call accuracies for indels increased in 11 of 12 cases.
Somatic Variant Calling Accuracy
Recall, precision, and F-score are expressed as Rec., Pre., and F-scr., respectively. PASS is the set of variants that passed all filtering criteria provided by each variant caller, and PASS+RD is the set of variants in PASS that passed additional filtering with RDscan. F-scores are calculated by the following equation:
Table 2 also shows that most variant callers had difficulty in producing high-accuracy results from samples with low-tumor purity. In the case of SNVs, the F-score of the samples with low-tumor purity (T:20%, N:100%) decreased by 3.9% for Strelka2, 15.9% for VarDict, 7% for NeuSomatic, and 4.6% for Mutect2 relative to samples with high-tumor purity (T:80%, N:100%). Despite the difficulties of variant calling in samples with low-tumor purity, RDscan achieved consistent improvement in accuracy over the variant callers we tested. In addition, even with normal samples contaminated by tumor cells (T:80%, N:90%), RDscan improved the accuracy of the variant callers.
Second, to show the overall improvement in the performance of these algorithms achieved by RDscan, we compared the changes in precision and recall of the variants according to the score of each algorithm with the changes resulting from application of RDscan to those variants. As shown in Figure 4, for both SNVs and indels, RDscan improved the performance of the variant calls from Strelka2. For all three datasets, the overall performance of these four algorithms was improved when the variants were screened with RDscan (Supplementary Fig. S2).

Changes in somatic variant calling accuracy according to Strelka2 and RDscan scores. For three datasets, ROC curve shows the change of recall and precision for the variant calls according to a scoring parameter. The scoring parameters used to generate the solid and dotted curves were SomaticEVS (Strelka2) and RDscore (RDscan), respectively. The dashed lines, SomaticEVS (RD >0.3), show the results of applying RDscoresomatic > 0.3 for the sets of variants according to SomaticEVS of Strelka2 (solid line).
3.3. Somatic variant calling accuracy of ensemble models
To compare RDscan with the same level methods, the results of using the well-known variant callers as filters were compared with the results of RDscan. We analyzed the accuracies of ensemble models that combined two somatic variant calling algorithms (one as a variant caller and the other as a filter). Six ensemble models were created by combining the four algorithms. For each model with two algorithms, we generated a set of variants that passed the default criteria of both algorithms’ filters, and then evaluated them.
We compared 14 results from six ensemble models, four existing algorithms, and four models analyzed with RDscan. Figure 5 shows the top three results for each dataset, in order of accuracy. In the case of SNVs, the combination of Strelka2 and RDscan had the highest accuracy in the two datasets with 80% tumor purity, and the combination of Mutect2 and RDscan had the highest accuracy in the dataset with 20% tumor purity. Indels exhibited similar results to SNVs, but in the case of the dataset with 80% tumor and 100% normal purity, the combination of Mutect2 and Strelka2 had the best accuracy, and the combination of Strelka2 and RDscan had the second highest accuracy.

Somatic variant calling accuracies of ensemble models. To evaluate the performance of ensemble models, 14 models (6 ensemble models, 4 models analyzed with RDscan, and 4 existing algorithms) were used along with three datasets with proportions of tumor and normal DNA purity of 20% and 100%, 80% and 100%, and 80% and 90%. This figure shows the top three models in order of SNV/indels call accuracy for each dataset. Indels, insertions and deletions; SNV, single-nucleotide variant.
Note that RDscan effectively removes artificial variants, significantly improving variant call accuracy for indels in samples with low-tumor purity, where it is difficult to distinguish true variants from artificial variants. The results for all models are shown in Supplementary Table S3.
3.4. Relationship between read depth coverage, indel length, RDscore, and variant calling accuracy
There are several parameters that influence call accuracy. The major factor would be depth coverage. The depth ranges of the current main technology are 30–70 × for WGS and 100–150 × for whole-exome sequence. To assess the accuracy of RDscan as a function of depth coverage, we ran RDscan with candidate variant sets from the three germline variant callers using an additional dataset, HG002 (300 × ), with high depth coverage. Similar to the analysis in the previous section, we compared the original set of variants called and passed by each algorithm with the variants further filtered by RDscan. Supplementary Table S4 shows that additional screening with RDscan is ineffective for datasets with a high depth coverage. We will address this issue in the Discussion section.
RDscan assumes that the depth distribution of the aligned read set containing a variant should not differ from the depth distribution of the whole read set mapped to the same region. However, the two distributions could be different for a long indel, because the sequencing quality and mapping quality scores of sequence reads with long indels may be lower than those of sequence reads without variants.
To evaluate the RDscore changes according to an indel length, we calculated the RDscores of all heterozygous indels in the GIAB truth sets. Among the 263,034 indels, we analyzed 220,097 indels with at least one sequence read in the HG001 (34 × ) BAM file. Supplementary Figure S3 shows that the RDscore slightly decreases as the length of indel increases. However, the medians of the RDscores are between 0.8 and 1.0 for all indel lengths, which can be distinguished from artificial variants (in this study, we assumed that germline variants with RDscore < 0.5 were artificial variants).
4. Discussion
Numerous methods have been developed for detecting variants from NGS data. Although the existing variant calling algorithms are generally accurate and reliable, most of them still have room for improvement in terms of accuracy for the variants with low VAF. RDscan removes misaligned reads and repositions reads, and then calculates RDscore based on the read depth distribution. By adopting this score, the accuracy of SNV/indel calls can be improved relative to the results obtained using existing variant callers.
Although RDscan can improve variant calling accuracy in most cases, it is important to use it with an understanding of the characteristics of RDscore. First, the RDscores of the true-positive variants are very densely distributed over a particular value. For example, 98.65% (3,076,963 of 3,118,927) of the true-positive variants called by GATK HaplotypeCaller for SNVs have RDscore > 0.64. Because variants with RDscore > 0.64 already have very high reliability in terms of the correlation (>0.8) of read depth distribution, variant filtering using RDscore > 0.64 can decrease the sensitivity of variant calls. Second, variant filtering with low RDscore increases the variant call precision while minimizing the reduction in sensitivity.
In particular, in the case of somatic variants in Figure 4, precision increases rapidly as RDscore increases (blue dots mean RDscore = 0, red dots mean RDscore = 0.3). Even with a loose RDscore criterion, RDscan can find many false calls not identified by existing methods. In this study, we used RDscore criteria of 0.5 and 0.3 for germline and somatic variants, respectively. The RDscore ranges from 0 to 1, but we recommend using an RDscore <0.64.
Third, Supplementary Table S4 shows that although RDscan can still increase the precision of variant calls, there is little room for precision improvement, resulting in a slightly decrease in the variant call accuracy for deep sequencing data. For example, the precision values of DeepVariant's SNV and indels calls were already very high, at 99.88% and 99.82%, respectively. Therefore, we recommend using RDscan for typical sequencing data with relatively low read depth coverage, rather than for deep sequencing data.
In the last decade, many algorithms have been introduced to achieve high performance using features of NGS data such as read depth, base and mapping quality score, strand bias, and haplotype. In this article, we propose a new feature, read depth distribution. To validate the effectiveness of the read distribution, we first developed our own variant caller (standalone) based on read depth distribution. The accuracy of the variant caller was comparable with, but not better than, the state-of-art variant callers. However, we demonstrated that read depth distribution can increase the accuracy of variant calls when used with other features. Read depth distribution can be used as a key feature of genomic analysis, along with the existing NGS data features.
5. Conclusion
RDscan is an SNV/indels filtering tool based on the read depth distribution of an NGS dataset. In this study, we showed that our method could improve the accuracy of germline and somatic variant calls from NGS data. In addition, RDscan is simple to use and can effectively remove false-positive variants while maintaining a low computation load. Therefore, RDscan, along with existing variant callers, will contribute to improvement in genome analysis. Future work should seek to develop new methods based on read depth distribution for sequence analysis other than variant calling and haplotyping.
Footnotes
Authors’ Contributions
S.L. contributed to conceptualization; methodology; software; writing—original draft; and writing—review and editing. S.H. contributed to software; visualization; writing—original draft; and writing—review and editing. J.W. and J.-H.L. performed formal analysis; writing—original draft. K.K. performed data curation and formal analysis. L.K. contributed to conceptualization and resources. K.P. provided conceptualization and methodology. J.J. contributed to conceptualization, methodology, and supervision.
Availability of Data and Implementation
Source code and binaries, implemented in C++ and supported on Linux, are freely available for download at https://github.com/satchellhong/RDscan. The sequencing data for HG001-HG002 and in silico mixtures used in this study are obtained from the 1000 Genomes Project (www.internationalgenome.org/) and the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/), respectively. Detailed descriptions about the data can be found in
.
Acknowledgments
The authors thank all members of the Precision Medicine Support Center at Inha University Hospital for their generous support in testing the NGS data.
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
This research was supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2019M3E5D4064683). K. Park was supported by NRF-2014M3C9A3063541.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
