EasyQC: Tool with Interactive User Interface for Efficient Next-Generation Sequencing Data Quality Control

Abstract

The advent of next-generation sequencing (NGS) technologies has revolutionized the world of genomic research. Millions of sequences are generated in a short period of time and they provide intriguing insights to the researcher. Many NGS platforms have evolved over a period of time and their efficiency has been ever increasing. Still, primarily because of the chemistry, glitch in the sequencing machine and human handling errors, some artifacts tend to exist in the final sequence data set. These sequence errors have a profound impact on the downstream analyses and may provide misleading information. Hence, filtering of these erroneous reads has become inevitable and myriad of tools are available for this purpose. However, many of them are accessible as a command line interface that requires the user to enter each command manually. Here, we report EasyQC, a tool for NGS data quality control (QC) with a graphical user interface providing options to carry out trimming of NGS reads based on quality, length, homopolymer, and ambiguous bases. EasyQC also possesses features such as format converter, paired end merger, adapter trimmer, and a graph generator that generates quality distribution, length distribution, GC content, and base composition graphs. Comparison of raw and processed sequence data sets using EasyQC suggested significant increase in overall quality of the sequences. Testing of EasyQC using NGS data sets on a standalone desktop proved to be relatively faster. EasyQC is developed using PERL modules and can be executed in Windows and Linux platforms. With the various QC features, easy interface for end users, and cross-platform compatibility, EasyQC would be a valuable addition to the already existing tools facilitating better downstream analyses.

1. Introduction

Ever since its inception more than a decade ago, the next-generation sequencing (NGS) technologies have created a significant impact in the field of genomic research. The high throughput provided by NGS platforms has enabled us to amass huge genomic information, thereby becoming an inevitable part of genome-related studies (Roy et al., 2016). NGS technologies continue to evolve in terms of increase in sequencing speed and data throughput and reduction in cost. The average read length has tremendously increased (Goodwin et al., 2016) and the cost of sequencing a whole genome has dropped rapidly (Caulfield et al., 2013). NGS is widely used in studies related to metagenomics (Verma et al., 2017), transcriptomics (Cao et al., 2017), exomics (Ashktorab et al., 2016), etc. NGS has its application in the field of clinical and diagnostics studies (Previati et al., 2013; Hoadley et al., 2014) as well, paving a gradual way for personalized medicine. NGS is also extensively used in genetic testing wherein the genetic disorders are studied at a greater resolution (Nigro and Piluso, 2012; Simbolo et al., 2015).

Bioinformatic analyses remain to be the vital and complex part of any NGS project, involving a plethora of heterogeneous sequence data (Pabinger et al., 2014). Considering the wide array of applications possessed by NGS, it becomes essential to be wary of the nature of data obtained from them. Artifacts such as low-quality reads and adapter contamination are commonly observed in NGS data. Presence of low quality sequences shall severely impact the downstream analyses, viz, assembly, annotation, SNP detection, etc. (Dai et al., 2010). There are multiple factors affecting the quality of reads across various NGS platforms and each one of them has their own quality control (QC) pipelines.

Despite the primary filtering of sequences, there exists a need to carry out an elaborate QC process to ensure that only high-quality sequences are carried further for subsequent analyses. Earlier reports (Gloor et al., 2010; Caporaso et al., 2011) suggest that there was a loss of 40–70% reads after the initial quality filtering of raw sequences. Many tools have come into existence since then for NGS data QC, such as NGS QC toolkit (Dai et al., 2010), FastQC (Andrews, 2010), ClinQC (Pandey et al., 2016), PRINSEQ (Schmieder and Edwards, 2011), FAST-X Toolkit (Gordon, 2010), Galaxy (Goecks et al., 2010), Tag Cleaner (Schmieder et al., 2010), and QC-Chain (Zhou et al., 2013). Few of them exist as online tools and others are standalone tools with their own dependencies and limitations. These dependencies may at times hinder data processing and not many tools have user interface to carry out seamless operations.

In this study, we have developed EasyQC, an integrated standalone tool for NGS data QC with interactive user interface and cross-platform compatibility. EasyQC facilitates the end user to prepare NGS data in FASTQ format for secondary analyses through QC pipeline and other associated features. Considering the ever growing need of infallible tool for NGS data QC, we believe that EasyQC will be of significant utility to the research fraternity.

2. Development and Implementation

All the scripts of EasyQC have been written using PERL programming language (PERL v5.22). Perl Tk package has been implemented for developing the user interface. The functionalities of EasyQC have been segregated into various modules that ensure better maintainability of the tool. The workflow of EasyQC is described in Figure 1. The user interface of the various utilities of EasyQC is depicted in Figure 2. Graphical outputs were generated using GD::Graphs and URI::Google Chart packages. Functioning of the tool has been exhaustively tested in Windows and Linux (Ubuntu v16.04) environments.

FIG. 1.

The workflow of EasyQC pipeline.

FIG. 2.

User interface of EasyQC containing multiple tabs enabling the user to execute required process. (A) QC pipeline that enables the user to set the threshold values for various filtering parameters. (B) Paired end merge interface that facilitates merging of two fastq files based on the user-specified thresholds. (C) The adapter trimmer module that enables the user to remove the adapter sequences from the 5′ and 3′ ends of the sequence. (D) Graph generator module of EasyQC that provides sequence quality, length and GC distributions, along with base composition graphs. GC, DNA nucleotides; QC, quality control.

3. Results and Discussion

EasyQC is an open source desktop tool with cross-operating system compatibility, facilitating the QC process of huge NGS data sets with great accuracy. EasyQC includes various modules such as format converter, adapter trimmer, and QC graph generator apart from the pipeline that carries out filtering based on length, quality, ambiguous bases, and homopolymer repeats as well. The key features of EasyQC are described hereunder.

3.1. EasyQC features and pipeline

3.1.1. Format converter

This module of EasyQC mediates the conversion of. fastq files to. fasta files. It verifies the accuracy of the input FASTQ file and converts it to FASTA file upon successful format check.

3.1.2. Adapter trimmer

Adapter or primer trimming is a vital step in NGS data QC. It is important to remove the forward and reverse adapter sequences from the data set. The adapter trimmer module of EasyQC takes the forward and reverse adapter sequences as input and trims the sequences accordingly.

3.1.3. Paired end merging

Many sequencing protocols that are in use today sequence the DNA fragments from both ends. Merging of these paired end reads is important in genome assembly and other analyses. The paired end merging module of EasyQC uses FLASH (Magoč and Salzberg, 2011) to merge the input FASTQ files.

3.1.4. Graph generator

This feature of EasyQC enables the user to obtain graphical outputs of length distribution, average quality distribution, ATGC composition, and GC content of the input sequence data set.

3.1.5. Quality-based trimming

Removal of low quality sequences is essential, since they can have a significant impact on the downstream analyses of NGS data. This module of EasyQC gets the quality threshold for each base along with the percentage threshold for each sequence as input. Sequences that contain lesser percentage of high-quality bases than the threshold will be removed from the data set.

3.1.6. Length-based trimming

At times, data sets possess sequences with a very wide range of read lengths and it is important to remove those sequences that fall short of the required length. This module trims those sequences that are shorter than the input threshold length value.

3.1.7. Ambiguous-based trimming

Presence of ambiguous bases has significant impact on the subsequent alignment process. This feature fetches the threshold percentage of the ambiguous bases that shall be permitted in a sequence and those with a greater percentage will be removed.

3.1.8. Homopolymer trimming

Homopolymer repeats in a sequence shall also cause misalignment of the sequences leading to false results. This module takes the cutoff value for the homopolymers to be present in a sequence and those that exceed this cutoff value will be taken off from the data set.

The trimming modules are designed as a single pipeline in EasyQC. The user shall upload the input sequence data set and specify the threshold values of all the modules at the same time and execute the pipeline.

3.2. Input and output of EasyQC

EasyQC takes FASTQ files generated from any NGS platform as input for all its modules. The outputs from EasyQC are as hereunder.

3.2.1. Paired end merging

The output FASTQ file containing the stitched paired end sequences based on user input using FLASH gets stored in a new folder with the statistics of the number of sequences that were merged.

3.2.2. QC statistics

The QC pipeline of EasyQC provides the statistics of the total sequences that got filtered in each step of the QC processes based on the filter parameters provided by the user (Fig. 3). The output sequences in FASTQ and FASTA format are generated at every step of the pipeline. These sequences can directly be used for downstream analyses. The comparison of quality of raw sequence data set with that processed by EasyQC with a quality cutoff of Q30 was carried out using FastQC, and the results indicated significant increase in the overall quality of the reads (Fig. 4).

FIG. 3.

Results of the trimming pipeline in EasyQC. The graph represents the number of sequences filtered at each stage of the trimming process.

FIG. 4.

Comparison of per base quality of (A) raw reads and (B) reads processed using EasyQC with a quality filter of Q30. The graph shows significant increase in quality of processed reads.

3.2.3. Graphical output

The graphs module generates four different graphs as described earlier (Fig. 5) and is stored in a separate folder in the same location from where the input files were uploaded.

FIG. 5.

Graphical output of an NGS data set obtained from EasyQC. (A) Base composition, (B) sequence quality distribution, (C) GC percentage, and (D) sequence length distribution. NGS, next-generation sequencing.

3.3. Comparison with existing tools

The comparison of various features available in EasyQC with that of other QC tools is described in Table 1. Although many toolkits are available, not many exist as desktop standalones with a user interface. EasyQC provides that interactive facility to the users through its interface with a flexibility to work across Windows and Linux platforms.

Table 1.

Comparison of Various Features in EasyQC with Other Quality Control Tools

Features	EasyQC	FastQC	ClinQC	FASTX-Toolkit	NGSQC Toolkit	QC-Chain	PRINSEQ	TagCleaner
Format conversion	Yes	No	Yes	Yes	Yes	Yes	No	No
Adapter trimming	Yes	No	Yes	Yes	Yes	Yes	Yes	Yes
Paired-end merging	Yes	No	No	No	No	No	No	No
Quality filtering	Yes	No	No	Yes	Yes	Yes	Yes	No
Length filtering	Yes	No	No	Yes	Yes	No	Yes	No
Ambiguous base filtering	Yes	No	No	Yes	Yes	Yes	Yes	Yes
Homopolymers filtering	Yes	No	No	No	Yes	No	No	No
Graphical output	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Graphical user interface	Yes	Yes	No	No	No	No	No	No

3.4. Performance evaluation

The performance of EasyQC was validated using various Illumina sequence data sets containing sequences ranging from 0.3 to 2.8 million. These evaluations were carried out on a standalone Dell precision T7600 workstation with Intel Xeon(R) E5-2560 processor and 32GB RAM (operating system: Windows 7). The time taken for the execution of QC pipeline for various sequence data sets is given in Table 2. Comparison of EasyQC with other tools such as NGSQC and PRINSEQ for filtering of sequences based on quality score and length was carried out using a data set containing 2.3 million sequences. This was carried out in HP Z620 Workstation with Intel(R) Xeon(R) CPU E5-2620 v2 processor and 32GB RAM (operating system: Ubuntu v16.04). The quality cutoff was set at Q30, whereas the length threshold was set to 100 bp.

Table 2.

Execution Time of the Quality Control Pipeline of EasyQC with Next-Generation Sequencing Data Sets

S. no.	Total sequences	Time (minutes)
1	3,19,095	2.08
2	10,53,732	5.15
3	20,44,728	9.51
4	28,56,099	14.39

In addition, the percentage of high-quality bases per sequence was set at 90 (which is the default value) in EasyQC. Comparison of the post-QC data sets (Fig. 6) indicated that the filtering of sequences by EasyQC was more stringent and there were no significant deviation in the quality distribution as well. The time taken for the QC process was 4 minutes 28 seconds for EasyQC, whereas it took 1 minute 20 seconds for NGSQC and 1 minute 40 seconds for PRINSEQ. The relatively higher time taken by EasyQC can be attributed to the execution of percentage of high-quality base filter module, which is not present in the other two tools, in addition to the overall quality filter option. Furthermore, EasyQC provides the output data set containing the sequences that passed and failed in each step of the QC process that adds to its execution time, which is not provided by the tools compared herewith.

FIG. 6.

Comparison of post-QC evaluation of sample data set using EasyQC (A) with PRINSEQ (B) and NGSQC (C). The number of sequences filtered during the QC process from each tool is indicated in the graph (D).

4. Availability and installation

EasyQC is an open source tool available for free download at (www.niot.res.in/EasyQC). Separate installer files are available for Windows and Linux platforms. Steps to be followed while installing EasyQC and the installation of the dependent PERL packages are given in detail in the readme file available along with the installer.

5. Conclusion

With an interactive user interface, QC pipeline, and other associated features with cross-platform compatibility, EasyQC would be a significant addition to the already existing QC tools providing enhanced downstream analyses.

Footnotes

Acknowledgments

The authors gratefully acknowledge the financial support provided by the Ministry of Earth Sciences, Government of India. The authors are thankful to Dr. M.A. Atmarand, Director, National Institute of Ocean Technology, for constant suggestions and encouragement to carry out this work.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Andrews

2010. FastQC: A quality control tool for high throughput sequence data. Available at: www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed June 15, 2017.

Ashktorab

, Azimi

, Nickerson

M.L.

, et al. 2016. Targeted exome sequencing outcome variations of colorectal tumors within and across two sequencing platforms. Next Gener. Seq. Appl. 3, 123.

Cao

, Agarwal

, Dituri

, et al. 2017. NGS-based transcriptome profiling reveals biomarkers for companion diagnostics of the TGF-β receptor blocker galunisertib in HCC. Cell Death Dis. 8, e2634.

Caporaso

J.G.

, Lauberm

C.L.

, Walters

W.A.

, et al. 2011. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl. Acad. Sci. U. S. A., 108, 4516–4522.

Caulfield

, Evans

, McGuire

, et al. 2013. Reflections on the cost of “low-cost” whole genome sequencing: Framing the health policy debate. PLoS Biol. 11, e1001699.

Dai

, Thompson

R.C.

, Maher

, et al. 2010. NGSQC: Cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics, 11, S7.

Gloor

G.B.

, Hummelen

, Macklaim

J.M.

, et al. 2010. Microbiome profiling by illumina sequencing of combinatorial sequence-tagged PCR products. PLoS One, 5, e15406.

Goecks

, Nekrutenko

, and Taylor

2010. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86.

Goodwin

, McPherson

J.D.

, and McCombie

W.R.

2016. Coming of age: Ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351.

10.

Gordon

2010. FASTX- toolkit. Available at: http://hannonlab.cshl.edu/fastx_toolkit/index.html. Accessed January 20, 2017.

11.

Hoadley

K.A.

, Yau

, Wolf

D.M.

, et al. 2014. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158, 929–944.

12.

Magoč

, and Salzberg

S.L.

2011. FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27, 2957–2963.

13.

Nigro

, and Piluso

2012. Next generation sequencing (NGS) strategies for the genetic testing of myopathies. Acta Myol. 31, 196.

14.

Pabinger

, Dander

, Fischer

, et al. 2014. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform. 15, 256–278.

15.

Pandey

R.V.

, Pabinger

, Kriegner

, et al. 2016. ClinQC: A tool for quality control and cleaning of Sanger and NGS data in clinical research. BMC Bioinformatics, 17, 56.

16.

Previati

, Manfrini

, Galasso

, et al. 2013. Next generation analysis of breast cancer genomes for precision medicine. Cancer Lett. 339, 1–7.

17.

Roy

, LaFramboise

W.A.

, Nikiforov

Y.E.

, et al. 2016. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch. Pathol. Lab. Med. 140, 958–975.

18.

Schmieder

, and Edwards

2011. Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27, 863–864.

19.

Schmieder

, Lim

Y.W.

, Rohwer

, et al. 2010. TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics, 11, 341.

20.

Simbolo

, Mafficini

, Agostini

, et al. 2015. Next-generation sequencing for genetic testing of familial colorectal cancer syndromes. Hered. Cancer Clin. Pract. 13, 18.

21.

Verma

, Raghavan

R.V.

, Jeon

C.O.

, et al. 2017. Complex bacterial communities in the deep-sea sediments of the Bay of Bengal and volcanic Barren Island in the Andaman Sea. Mar. Genomics., 31, 33–41.

22.

Zhou

, Su

, Wang

, et al. 2013. QC-Chain: Fast and holistic quality control method for next-generation sequencing data. PLoS One. 8, e60234.