Abstract
Given the wide variability in the quality of next-generation sequencing data submitted to public repositories, it is essential to identify methods that can perform quality control on these data sets when additional quality control data, such as mean tile data, are missing from public repositories. In this study, we present evidence that correlating counts of reads corresponding to pairs of motifs separated over specific distances on individual exons can be used as a proxy mean tile data in the data sets we analyzed and hence could be used when mean tile data are not available. As test data sets we use the Homo sapiens in vitro transcribed (IVT) data set, and a Drosophila melanogaster data set comprising wild and mutant types. We find that a FastQC analysis of the available parts of these data sets demonstrates that the per-tile sequencing quality is good for all the data sets apart from the mutant-type data where the mutant-r3 data are worse than the mutant-r2 data. Correspondingly, intra-exon motif correlations are reasonably large for all data sets except this latter case where the mutant-r2 correlations are low and the mutant-r3 correlations close to zero. We propose that these extremely low correlations are indicative of bias of technical origin, such as flowcell errors. In addition to this, the intra-exon motif correlations as a function of both guanosine-cytosine (GC) content parameters are somewhat higher and less dependent on the GC content parameters in the IVT-Plasmids messenger RNA (mRNA) selection free RNA-Seq sample (control) than in the other RNA-Seq samples that did undergo mRNA selection: both ribosomal depletion (IVT-Only) and PolyA selection (IVT-PolyA, wild type, and mutant).
INTRODUCTION
Next-generation sequencing (
These are typically fragmentation of fragments to the size appropriate for the target sequencing platform, amplification (e.g., PCR), and ligation of synthetic sequencing adapters for the sequencing platform. Such high-throughput sequencing technologies generate millions to billions of reads in a matter of days and generate large data sets (Metzker, 2010)—it has enabled a number of large-scale sequencing projects. These include the 100,000 genomes project in the United Kingdom (Siva, 2015; Gallagher, 2014), the National Institutes of Health Precision Medicine 1 million genomes project in the United States (Reuters, 2015), and a 1 million genome project by the Beijing Genome Institute in China (Cyranoski, 2016) to name just a few.
RNA-Seq, on which this work focuses, is a high-throughput NGS technique for estimating the contraction of messenger RNA (mRNA) transcripts in a transcriptome. It provides wider coverage of the transcriptome than microarrays as its methods involve the direct sequencing of transcripts of RNA found in the sample (Gerstein et al, 2009; Kukurba and Montgomery, 2015). RNA-Seq can be used to study various types of RNA present: total RNA, mRNA, pre-mRNA, and non-coding RNA (ncRNA), such as microRNA and long ncRNA enabling it to be used to study alternative splicing events (Dickerson et al, 2014; Park et al, 2013). Furthermore, RNA-Seq achieves this at a higher resolution (Kukurba and Montgomery, 2015) than other technologies. To quantify gene expression, a mapping process is combined with gene boundary information so as to count the number of transcripts that map to a given gene or exon region (Garber et al, 2011; Gerstein et al, 2009; McCue et al, 2008).
RNA-Seq has transformed our view of the extent and complexity of the transcriptome through deep sequencing (Gerstein et al, 2009), and also as a result of the increased precision, the technique offers over other methods. While recent developments in the RNA-Seq workflow, from sample preparation to sequencing, have furthered our understanding of the transcriptome, they have also required substantial effort for data analysis and computation, and given the complexity of RNA-Seq workflow necessitates study of the bias that can be introduced in the preparatory steps (Kapranov et al, 2011; Kukurba and Montgomery, 2015). Characterization of bias in RNA-Seq is especially incumbent, given that the method sequences and measures the transcriptome indirectly using reverse-transcribed complementary DNA (cDNA) (Bowers et al, 2009).
Bias introduced in the preparatory steps can have a profound effect on the raw data and typically manifest themselves as sequence-specific or positional biases, while bias introduced by the sequencing process itself is often systematic in nature (Meacham et al, 2011).
The main obstacle to obtaining accurate estimates of transcript expression from RNA-Seq data is nonuniformity in the distribution of mapped reads to the reference genome, which reduces the certainty that the measured counts of mapped reads reflect the true expression of the transcript within the cell's transcriptome. These bias have numerous sources such as wet-lab sample preparatory techniques, the sequencing process itself (Alnasir and Shanahan, 2015; Dohm et al, 2008), and the potential for errors in post-sequencing data processing. They perturb the uniformity of the distribution of mapped reads to a reference genome (Lahens et al, 2014), and such bias manifests itself as sequence-specific or positional (Donaghey et al, 2011). Also, positional biases can occur due to random hexamer priming in sample preparation (Hansen et al, 2010).
Large amounts of these raw RNA-Seq read data are deposited in public repositories, such as the Sequence Read Archive (SRA) (Leinonen et al, 2011) and Gene Expression Omnibus (GEO) (Edgar, 2002). Furthermore, the SRA, for example, does not require a quality check on submission (Nakazato et al, 2013) and has shown poor annotation of sequencing protocol steps—both at the top-level study and at individual experiment record level (Alnasir and Shanahan, 2015). Hence, it is critical that methods are developed to characterize and quantify bias in these data sets. Such methods can augment the analysis of guanosine-cytosine (GC) metadata in the data sets or can serve as an alternative measure when this metadata is not present (Andrews, 2010).
FastQC (Andrews, 2010) is often used to perform a number of quality control checks on the raw reads in high-throughput sequencing data. The per-tile sequence quality measure reported by FastQC is particularly important because it represents the deviation of each nucleotide from the average quality for each flowcell tile. The measure is used by FastQC to produce a heat map plot of this deviation—cold colors indicate base quality scores that are at, or above, the average quality scores for other tiles on the flowcell, and hotter colors indicate base quality scores that are below the average quality scores for other tiles on the flowcell.
Hence, a good heat map plot should be blue all over. Per-tile sequence quality can only be computed from Illumina sequencing data in which the original sequence identifiers are retained, that is, it is only present in raw Illumina reads. However, this information is not retained after processing Illumina reads into, for example, Binary Alignment Map (BAM) files after sequence alignment. This difficulty is compounded when using sequencing data sets from public repositories, such as the SRA, where not all submissions include the raw reads.
In this study, we propose that a method we have previously devised, which applies distributed-computing to quantify the sequence-specific deviations in the uniformity of mapped reads (Alnasir and Shanahan, 2017), can be used as a proxy measure when mean tile data are not available for short-read RNA-Seq data. Our method uses counts of reads overlapping motifs and works at the deep, read level. This approach is based on the assumption that 4-mers in short reads from one region on an exon will be correlated with 4-mers in short reads from another region of the same exon.
To provide the capacity to process the amounts of data typical in transcriptomic data sets (Campbell et al, 2015), our analysis employs parallel distributed computing algorithms and infrastructure using the Apache Spark platform—it is named Hercules and is available on GitHub (Alnasir, 2018) and at Zenodo (Alnasir and Shanahan, 2022; a message passing interface (MPI) version has now also been made available). We demonstrate this using a controlled in vitro transcribed (IVT) data set created by Lahens et al (2014) for the purpose of characterizing bias introduced in RNA-Seq library preparation, as well as a Drosophila melanogaster data set.
The first data set (IVT) was produced utilizing IVT in Escherichia coli to clone a pool of ∼1000 preselected human plasmids from the Mammalian Gene Collection (MGC) (Derge et al, 2009). Because the sequences and expression levels of these plasmids are known, and they do not undergo splicing, this allowed them to generate a highly controlled set of samples, and therefore a controlled data set, in which the source of biological variation in the samples is minimized. The samples in this data set were then subjected to different RNA-Seq preparatory protocols, specifically varying the step in which mRNA is selected.
This enables the study and quantification of the effect of these steps on coverage levels of the MGC transcripts when they were aligned to the human reference genome (hg19/grch37). They found that the mRNA selection methods employed in RNA-Seq protocols, poly-A and ribosomal depletion, both resulted in significant fold changes in the coverage of the IVT MGC plasmids when compared with sequencing the IVT MGC plasmids directly (without mRNA selection). Importantly, as the bias introduced in this data set is well characterized and attributed, we apply our analysis method to a selection of relevant samples.
In previous work, we applied our analysis method to replicates of two samples from a D. melanogaster data set produced from typical biological specimens using conventional RNA-Seq protocols (Alnasir and Shanahan, 2017). These are two small, but whole transcriptomes—those of the fruit fly species D. melanogaster wild type and mutant-r2, comprising ∼12.9 and 15.0 M reads, respectively. We will re-examine our analysis of these data with respect to sequencing tile means. While the IVT data set offers a set of samples that allow us to study the effect of different RNA selection methods, the D. melanogaster data sets have the same RNA selection method applied to all samples and vary only in the glass eye mutation—that is, the technical variation is fixed, and the biological variation should be minimal. Furthermore, the Drosophila species and its reference genome are extremely well studied and annotated, and the data have excellent provenance.
MATERIALS AND METHODS
Quantifying sequence-specific deviation in the distribution of mapped reads across exons
As explained in detail in Alnasir and Shanahan (2017), the uniformity of read distribution across an exon (Fig. 1) can be quantified by computing Pearson's correlations of the counts for the given motif pair in all exons within the data set by aggregating the motif pair counts at a given distance apart (motif-spacing) regardless of position within the exon (Fig. 2). We used motif-spacings of 10, 50, 100, and 200 base pairs (bp). An ideal data set would have perfect correlations for motif pairs (for instance, +1.0 for the Pearson correlation coefficient) for any given motif-pair and motif-spacing.

Typical distribution of RNA-Seq reads mapped to an exon (Alnasir and Shanahan, 2018).

Quantification of read coverage using short pairs of sequence motifs (4-mers) within the reads shown in light shading. Motif-pairs show variable correlations that are used to quantify sequence-specific deviations in the distribution of mapped reads across exons (Alnasir and Shanahan, 2017).
To thoroughly examine the affect of sequence-specific motifs on the uniformity of read distribution, we analyzed the correlation for all 4-mer motifs ranging from AAAA to GGGG (i.e., 44 combinations) in the RNA-Seq reads of an aligned Sequence Alignment Map (SAM) file. We verified this method by running our analysis on an artificially created transcriptome with in-built uniform distribution of reads (see Section A, Supplementary Table S1, and Supplementary Fig. S1 in the Supplementary Data).
The effect of extremes of GC content in RNA-Seq data has been discussed in numerous studies (Chen et al, 2013; Harrison et al, 2010), and we therefore also investigate the effect the mean GC content of reads within the exon, and the GC content of the 4-mer motif itself, has on the distribution of reads across the exon. To partition reads by mean GC content (of the exon), we define binned GC content ranges: 30%–40%, 40%–50%, 50%–60%, and 60%–70%. Since exons of such extreme mean GC values are not observed, that is, <30% or >70%, at least not at sufficient levels, we are not examining these. Given we are working with 4-mers, the motif GC is fixed to the range 0%, 25%, 50%, 75%, and 100%.
We have analyzed three Homo sapiens samples, which were prepared by IVT RNA-Seq. Lahens et al (2014) in their study used RNA that has been IVT from cDNA clones in E. coli. Their rationale for the development of this data set was that the “nucleotide sequence at every base was known, the splicing pattern established, and the expression the level coverage is uniform across the transcript.” This means that any bias occurring in the coverage of reads in these three samples must be as a result of technical rather than biological origin. In addition, these samples are known to demonstrate intra-exon coverage bias (we have used their H. sapiens data; Lahens et al, 2014) and hence is ideal to study from this perspective.
We explored known-bias in this RNA-Seq, by analyzing intra-exon motif pair correlation within the reads, we performed analysis on three samples from the H. sapiens data set, which applied different library preparation protocols to each sample during RNA-Seq. These samples from Lahens et al (2014) are known to demonstrate intra-exon coverage bias (we have used their H. sapiens data). Although the raw data deposited in GEO for this data set have not been aligned, the library preparation protocol for each sample and the alignment and post-processing strategy applied have been clearly documented. We, therefore, aligned the reads of the samples to the reference hg19 genome according to the documented parameters to generate the SAM files for analysis by our method.
The RNA in these samples has been transcribed from cDNA clones in E. coli DH5α cells. The data set comprises a pool of 1062 RNAs from a full-length human cDNA library sequenced using RNA-Seq. The first sample, IVT-Only, had its IVT RNA subjected to ribosomal RNA depletion before sequencing, while the second sample, IVT-PolyAsel, had polyadenylated selection applied instead of ribosomal depletion—these are two different, routinely used protocols for selecting specifically mature (mRNA) from RNA samples. The third sample, IVT-Plasmids, is our control as it was produced by direct sequencing of the Human IVT plasmids without RNA selection (i.e., neither ribosomal depletion nor PolyA selection was applied). These data are deposited at the GEO database with the ID GSE50445 (Black et al, 2012).
The distribution of protein-coding exon lengths in H. sapiens is shown in Supplementary Figure S2 of the Supplementary Data. We note that, although the median exon length in H. sapiens is 121 bp (shorter than Drosophila) and ∼80% of the exons are <200 bp (Sakharkar et al, 2004), the remaining 20% of exons will contribute to motif pair correlations at 200 bp apart.
D. melanogaster RNA-Seq data set
To investigate intra-exon motif pair correlation within the reads from a “typical data set,” we have used two Drosophila (species D. melanogaster) transcriptomics data sets, which differ by mutation gl[60j] in the eye–antennal disk. These are the full transcriptomes of the wild-type and mutant glass eye mutations, acquired from Stein Aerts Laboratory of Computational Biology at the University of Leuven, Belgium. These data are deposited at the GEO database with the ID GSE39781 (Aerts, 2012). The D. melanogaster data sets featured in a research publication by Naval-Sánchez et al (2013).
Supplementary Figure S3 shows the distribution of exon lengths in the Drosophila genome—the median exon length is 298 bp. This is important because it shows that most of the exons are longer than 200 bp and therefore can have data to compute correlations.
The information regarding the source of the biological samples, the sample preparatory protocols, and post-sequencing processing that were applied to samples in these two data sets is documented in Table 1.
The Sample and Library Preparation Protocols, Together with the Data Processing Steps, Applied to the RNA-Seq Data Sets Used in This Analysis
The Sample and Library Preparation Protocols, Together with the Data Processing Steps, Applied to the RNA-Seq Data Sets Used in This Analysis
Left: H. sapiens IVT RNA-Seq, Right: D. melanogaster.
cDNAs, complementary DNAs; IVT, in vitro transcribed; MGC, Mammalian Gene Collection.
Analysis of IVT RNA in H. sapiens
The first IVT sample we analyzed was the IVT-Plasmids sample, as this was produced from sequencing the Human IVT plasmids directly without applying ribosomal depletion or PolyA selection methods and therefore represents our control. The IVT-Plasmids sample, by virtue of not having RNA selection protocol steps applied, also reduces the technical sources of variation in read distribution. Table 2 shows a number of 4-mer motif-pairs with the highest or lowest correlations. We construct box plots of the correlations across the IVT-Plasmids sample, and we partitioned the results as a function of GC content of the motif and GC content of the exon and produced box and whisker plots (Fig. 3). We observe that the median correlation has a weak dependence on the motif GC content and is in the range of 0.6–0.8, with the exception of 100% motif GC content. The median correlation as a function of the GC content of the exons likewise lies in the range of 0.6–0.8.

Homo sapiens IVT-Plasmid replicates Pearson's outliers box and whisker plot of the IVT-Plasmid sample (control) correlations as a function of motif GC and mean exon GC content, for all spacings. GC, guanosine-cytosine; IVT, in vitro transcribed.
Homo sapiens Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings
Actual motif-pair counts are in parentheses.
To compare the effect of applying different RNA selection protocol methods to the IVT samples, specifically ribosomal depletion versus PolyA selection, we compared the IVT-Only and IVT-PolyA samples, respectively. Tables 3 and 4 present that the highest outliers for both these IVT-Seq samples show a number of 4-mer motif pairs that have very high correlations. High correlation outliers are observed across all spacings. We produced box and whisker plots of correlation as a function of GC content of the motif and GC content of the exon (Fig. 4). The trend of the data for IVT-Only and IVT-PolyA, as a function of motif GC content and mean GC content, is similar to that of the IVT-Plasmids data, although the median correlation is somewhat less (0.4–0.6).

Correlation (Pearson's) as a function of 4-mer motif and exon GC content, across all spacings, for Homo sapiens for the two IVT samples that had different RNA selection protocols applied: (Top) IVT-Only sample, which underwent ribosomal depletion, and (Bottom) IVT-PolyA, which underwent PolyA selection.
Homo sapiens In Vitro Transcribed-Seq Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: In Vitro Transcribed-Only Library Preparation
Sample sizes are given in parentheses.
Homo sapiens In Vitro Transcribed-Seq Pearson Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: In Vitro Transcribed with PolyA Library Selection
Sample sizes are given in parentheses.
We analyzed the whole transcriptomes of two Drosophila fruit fly (species D. melanogaster), which only differ by mutation gl[60j] in the eye–antennal disk (Naval-Sánchez et al, 2013). Overall, outliers for correlations for both wild type and mutant-r2 are given in Tables 5–8. We observe much lower correlations for the mutant-r2 than in wild type—the difference is especially marked when comparing the motifs with the highest 10 correlations in both samples for all separations.
Drosophila melanogaster Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: Wild Type
Drosophila melanogaster Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: Wild Type
Sample sizes are given in parentheses.
Drosophila melanogaster Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: Wild-Type-r2
Sample sizes are given in parentheses.
Drosophila melanogaster Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: Mutant-r2
Sample sizes are given in parentheses.
Drosophila melanogaster Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: Mutant-r3
Sample sizes are given in parentheses.
We show box plots of the intra-exon 4-mer motif-pair correlations as a function of both motif and mean exon GC content, for both D. melanogaster data sets (Fig. 5). We note in the wild-type data set that the median correlation shows a dependence on the GC content of the motif, with the median correlations having a range of 0.1–0.5.

Correlation (Pearson's) as a function of 4-mer motif and exon GC content in
However, in the mutant-r2 data set, the median correlations in all cases are <0.1 and independent of the GC content of the motif or the exon. In the mutant-r3 data set, the range of correlations is consistent with zero (within the first and third quartiles).
The presence of raw unaligned read data for the above data sets allows us to compare these results with more traditional measures of quality control.
With this in mind, we ran FastQC (Andrews, 2010) on all the data sets we analyzed in this research. We note that relevant sets of reads for the IVT-Plasmids are not available.
We also ran Qualimap to analyze the source BAM files that the SAM files were directly derived from the D. melanogaster data set to see if there were any alignment or coverage issues in the mutant data (see Section C, Supplementary Table S2, and Supplementary Fig. S4 in the Supplementary Data).
In Figure 6, we plot the heat maps of the Sequencing per-tile Means of the IVT-PolyA and IVT-NoSel reads. These depict deviations from the average tile quality within areas of flowcells on the sequencing apparatus (the explicit values underlying the heat maps are referred to as the per-tile Means; Babraham Institute, 2015). Both data sets demonstrate a low deviation from the average time quality of for each tile.

Heat maps of the Sequencing per-tile Means, obtained from FastQC analysis for the two RNA selection Homo sapiens IVT RNA-Seq samples analyzed—sequencing tile information was not present in the IVT-Plasmids sample. (Top) IVT-PolyA, (bottom) IVT-NoSel.
However, in comparing the FastQC analysis results for the data sets, we found that the per-tile sequencing quality heat maps were radically different for the mutant and wild-type data sets (Fig. 7). In the wild-type replicates, these heat maps were plain and uniform, whereas both of the mutant replicates showed considerable variation from the Means. There were no differences in any of the other FastQC analysis parameters measured between the wild-type and mutant data sets.

Heat maps of the Sequencing per-tile Means, obtained from FastQC analysis for all Drosophila melanogaster RNA-Seq samples and their replicates analyzed. (Top) wild-type replicates, (Bottom) mutant replicates.
To investigate this effect further, we plotted the distribution of per-tile Means as both a histogram and a density plot (Fig. 8) for all the samples in all the data sets we have analyzed. In Figure 8, we observe that the distributions in the IVT and wild-type data sets show the least deviation from the Mean tile qualities. The mutant D. melanogaster data sets show considerable deviation from the Mean tile qualities—which match the low correlations seen across the mutant replicates. By computing intra-exon motif-pair correlations, we have observed that sequencing errors occurring in flowcell tiles result in widespread deviations from the uniformity of mapped reads across exons and are independent of sequence or GC content.

Visualization of the distributions of the Sequencing per-tile Means, obtained from FastQC analysis for all RNA-Seq samples analyzed (species Drosophila melanogaster and Homo sapiens).
In the BamQC data, there is not much difference between the mutant and wild-type data in terms of the mapping quality, nucleotide frequencies, and GC content, although we note an increase in variance in the coverage of the mutant D. melanogaster data set as indicated by the slightly higher standard deviation.
We have analyzed RNA-Seq samples from two species: H. sapiens (MGC plasmids) and D. melanogaster (transcriptomes). The IVT data set was produced in a highly controlled manner—we have analyzed three H. sapiens samples from this, two were subjected to mRNA selection before RNA-Seq, and the other was not. The IVT samples contain fewer reads than the D. melanogaster samples in which the wild and mutant-r2 types comprise ∼12.9 and 15.0 M reads respectively, whereas the IVT-Only sample has 0.406 M reads, the IVT-PolyA sample has 0.397 M reads, and the IVT-Plasmids has 0.181 M reads. Although these are comparatively small numbers of reads, the IVT samples have the intra-exon bias within them, as indicated by coverage of reads mapping to the source MGC plasmids used, well quantified.
From the box and whisker plots of the two IVT samples that underwent mRNA selection RNA-Seq protocols (ribosomal depletion and Poly-A selection), we observe a dependence of intra-exon motif pair correlation on motif GC content. This effect is also seen in the wild-type replicates of the D. melanogaster samples, but not in the IVT-Plasmids sample which is not subjected to mRNA selection. The mutant D. melanogaster data have very low overall correlations, and hence, this pattern is not seen here. The median of the IVT-Plasmids correlations is higher than the IVT-PolyA and IVT-Only correlations and that the dependence on the motif GC content is stronger in the latter cases than the former. There is a much more noticeable difference in the median of the correlations between the wild-type and mutant Drosophila data.
Dependence of intra-exon correlation on GC content appears to be due to mRNA selection
We note that the intra-exon motif correlations as a function of both GC content parameters are much higher in the IVT-Plasmids mRNA selection free RNA-Seq sample than in the other RNA-Seq samples that we analyzed that did undergo mRNA selection: both ribosomal depletion (IVT-Only) and PolyA selection (IVT-PolyA and wild-type). Additionally, both the H. sapiens and wild-type D. melanogaster samples that underwent mRNA selection in the RNA-Seq process had slightly lower correlations than the H. sapiens IVT-Plasmids sample, which did not, suggesting that this is likely of technical origin. Importantly, all the samples from all of the data sets we analyzed were sequenced on the same platform—Illumina HiSeq 2000 (as detailed in Table 1).
As the dependence on overall GC concentration is not observed in the IVT-Plasmids control sample, but is observed in RNA-Seq samples that underwent mRNA selection, we can exclude platform-specific sequencing bias as a source of this effect in the IVT samples. This suggests that not only do mRNA selection protocols result in bias in the distribution of mapped RNA-Seq reads as Lahens et al (2014) demonstrated, but that mRNA selection is also responsible for the dependence of correlation on GC content—we have observed this in all the RNA-Seq samples that underwent mRNA selection across both H. sapiens and D. melanogaster species. Risso et al (2011) also noted motif-specific GC effects, manifesting as deviations from uniform read distribution, which they attribute these problematic motifs being underrepresented.
However, the GC effect we observe occurs in both the wild-type replicates, which have large sample sizes (Table 5), as well as in the IVT-PolyA and IVT-NoSel mRNA selection data sets, which have low sample sizes (due to the controlled way in which the samples were prepared; Tables 3 and 4). Furthermore, the numbers of counts for high GC content motifs, as indicated by the highest outliers (Tables 5–8), are of the same order as other motifs. The dependence of intra-exon correlations on the GC content of the motifs could be explained by mRNA selection methods, which are routinely employed in RNA-Seq experiments and are known to introduce bias (Cao et al, 2002; Lahens et al, 2014; Theurkauf et al, 2012).
CONCLUSION
In this work, we have applied our novel k-mer-based analysis method that allows an investigation into RNA-Seq read data at the deep exon level and quantifies sequence-specific deviations in the uniformity of the distribution of mapped reads to a reference genome. The method we have applied uses distributed computing to count reads overlapping 4-mer motifs exhaustively (AAAA through to GGGG) at different regions of the same exon. The assumption is that, when there is no deviation in the distribution of mapped reads, these counts should be correlated. We have demonstrated that the correlations we have computed correspond to mean tile data in the samples from the two data sets we analyzed.
We have shown that sequencing errors occurring in flowcell tiles result in widespread deviations from the uniformity of mapped reads across exons and are independent of sequence or GC content; hence, we propose that this be used when mean tile data are not available and therefore requires only the raw short-read data alone. We have also observed that extremely poor correlations are an indication of technical sources of bias, such as sequencing flowcell tile errors or batch effects. We note also the reduction in the above correlations when not using mRNA selection techniques in the IVT data sets.
This is potentially useful in assessing the quality of RNA-Seq data in public repositories where raw unaligned read data have not been deposited as the above approach can be carried out on a post hoc basis. In practice, if poor or unusual motif correlations are observed, this suggests bias resulting from flowcell errors or batch effects. We propose that this will help with data sets where FastQC cannot be done due to the lack of sequencing metadata such as read-id's and flowcell metadata. For this purposes, readers can use the code developed for this study [Hercules is available on GitHub (Alnasir, 2018) and at Zenodo (Alnasir and Shanahan, 2022)].
When the analysis is performed, plots such as Figure 6 (IVT-PolyA and IVT-NoSel plots) and Figure 7 (Wild-type plots, upper panels) indicate good quality sequencing data, whereas Figure 7 (Mutant-r and Mutant plots, lower panels) indicate potential bias and are a caveat with the data set. Furthermore, these flowcell tile plots can be augmented with plots of the distribution of tile-error means (Fig. 8), where tall narrow distributions suggest good quality data and short flatter distributions suggest bias and poor quality data.
The method and software we have developed are scalable and can, therefore, be applied to analyze the large data sets present in these vast data repositories—as of October 2021 (the time of writing), the SRA alone contains more than 57 peta bases (57.920 × 1015) of sequencing data, that is in excess of 7 petabytes (Sequence Read Archive, 2017). Application of this method would enable a consistent assessment of quality across such data sets.
Footnotes
ACKNOWLEDGMENTS
We thank Prof. Adrian Shepherd and Dr. Matthew Bashton for suggesting the data sets used in this study as a means to calibrate the method. The authors also wish to thank Eszter Ábrahám for proofreading the article.
AUTHORs' CONTRIBUTIONS
J.J.A. and H.P.S. conceived the study. J.J.A. developed the analysis code. H.P.S. supervised the research and contributed to the writing of the article. Both the authors read and approved the final article.
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicting financial interests.
FUNDING INFORMATION
J.A. was supported through direct funding from the Department of Computer Science at Royal Holloway, University of London.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
