Intra-Exon Motif Correlations as a Proxy Measure for Mean Per-Tile Sequence Quality Data in RNA-Seq

Abstract

Given the wide variability in the quality of next-generation sequencing data submitted to public repositories, it is essential to identify methods that can perform quality control on these data sets when additional quality control data, such as mean tile data, are missing from public repositories. In this study, we present evidence that correlating counts of reads corresponding to pairs of motifs separated over specific distances on individual exons can be used as a proxy for mean tile data in the data sets we analyzed and hence could be used when mean tile data are not available. As test data sets we use the Homo sapiens in vitro transcribed (IVT) data set, and a Drosophila melanogaster data set comprising wild and mutant types. We find that a FastQC analysis of the available parts of these data sets demonstrates that the per-tile sequencing quality is good for all the data sets apart from the mutant-type data, where the mutant-r3 data are worse than the mutant-r2 data. Correspondingly, intra-exon motif correlations are reasonably large for all data sets except this latter case, where the mutant-r2 correlations are low and the mutant-r3 correlations are close to zero. We propose that these extremely low correlations are indicative of bias of technical origin, such as flowcell errors. In addition, the intra-exon motif correlations as a function of both guanosine-cytosine (GC) content parameters are somewhat higher and less dependent on the GC content parameters in the IVT-Plasmids messenger RNA (mRNA) selection-free RNA-Seq sample (control) than in the other RNA-Seq samples that did undergo mRNA selection: both ribosomal depletion (IVT-Only) and PolyA selection (IVT-PolyA, wild type, and mutant).

1. INTRODUCTION

Next-generation sequencing (NGS) methods have revolutionized nucleic acid sequencing largely as a result of the employment of fluorescence-based nucleotide chemistry to generate a light signal on nucleotide incorporation (Baumeister et al, 1987; Connell et al, 1985; Soper, 1985), miniaturization, and massively parallel sequencing reactions (Mardis, 2006). Although these have, to a degree, simplified the core sequencing process allowing reactions to be performed in clusters to generate enough signal and in parallel to increase throughput, NGS technologies share the same complex preparatory procedures (Ji and Shendure, 2008).

These typically comprise fragmentation of the nucleic acid to the size appropriate for the target sequencing platform, amplification (e.g., PCR), and ligation of synthetic sequencing adapters for the sequencing platform. Such high-throughput sequencing technologies generate millions to billions of reads in a matter of days and produce large data sets (Metzker, 2010), which has enabled a number of large-scale sequencing projects. These include the 100,000 genomes project in the United Kingdom (Siva, 2015; Gallagher, 2014), the National Institutes of Health Precision Medicine 1 million genomes project in the United States (Reuters, 2015), and a 1 million genome project by the Beijing Genome Institute in China (Cyranoski, 2016), to name just a few.

RNA-Seq, on which this work focuses, is a high-throughput NGS technique for estimating the concentration of messenger RNA (mRNA) transcripts in a transcriptome. It provides wider coverage of the transcriptome than microarrays as its methods involve the direct sequencing of transcripts of RNA found in the sample (Gerstein et al, 2009; Kukurba and Montgomery, 2015). RNA-Seq can be used to study various types of RNA present: total RNA, mRNA, pre-mRNA, and non-coding RNA (ncRNA), such as microRNA and long ncRNA, enabling it to be used to study alternative splicing events (Dickerson et al, 2014; Park et al, 2013). Furthermore, RNA-Seq achieves this at a higher resolution (Kukurba and Montgomery, 2015) than other technologies. To quantify gene expression, a mapping process is combined with gene boundary information so as to count the number of transcripts that map to a given gene or exon region (Garber et al, 2011; Gerstein et al, 2009; McCue et al, 2008).

RNA-Seq has transformed our view of the extent and complexity of the transcriptome through deep sequencing (Gerstein et al, 2009), and also as a result of the increased precision the technique offers over other methods. While recent developments in the RNA-Seq workflow, from sample preparation to sequencing, have furthered our understanding of the transcriptome, they have also required substantial effort for data analysis and computation, and the complexity of the RNA-Seq workflow necessitates study of the bias that can be introduced in the preparatory steps (Kapranov et al, 2011; Kukurba and Montgomery, 2015). Characterization of bias in RNA-Seq is especially important, given that the method sequences and measures the transcriptome indirectly using reverse-transcribed complementary DNA (cDNA) (Bowers et al, 2009).

Bias introduced in the preparatory steps can have a profound effect on the raw data and typically manifests itself as sequence-specific or positional bias, while bias introduced by the sequencing process itself is often systematic in nature (Meacham et al, 2011).

The main obstacle to obtaining accurate estimates of transcript expression from RNA-Seq data is nonuniformity in the distribution of mapped reads to the reference genome, which reduces the certainty that the measured counts of mapped reads reflect the true expression of the transcript within the cell's transcriptome. These biases have numerous sources, such as wet-lab sample preparatory techniques, the sequencing process itself (Alnasir and Shanahan, 2015; Dohm et al, 2008), and the potential for errors in post-sequencing data processing. They perturb the uniformity of the distribution of mapped reads to a reference genome (Lahens et al, 2014), and such bias manifests itself as sequence-specific or positional (Donaghey et al, 2011). Positional biases can also occur due to random hexamer priming in sample preparation (Hansen et al, 2010).

Large amounts of these raw RNA-Seq read data are deposited in public repositories, such as the Sequence Read Archive (SRA) (Leinonen et al, 2011) and Gene Expression Omnibus (GEO) (Edgar, 2002). Furthermore, the SRA, for example, does not require a quality check on submission (Nakazato et al, 2013) and has shown poor annotation of sequencing protocol steps—both at the top-level study and at individual experiment record level (Alnasir and Shanahan, 2015). Hence, it is critical that methods are developed to characterize and quantify bias in these data sets. Such methods can augment the analysis of guanosine-cytosine (GC) metadata in the data sets or can serve as an alternative measure when this metadata is not present (Andrews, 2010).

FastQC (Andrews, 2010) is often used to perform a number of quality control checks on the raw reads in high-throughput sequencing data. The per-tile sequence quality measure reported by FastQC is particularly important because it represents the deviation of each nucleotide from the average quality for each flowcell tile. The measure is used by FastQC to produce a heat map plot of this deviation—cold colors indicate base quality scores that are at, or above, the average quality scores for other tiles on the flowcell, and hotter colors indicate base quality scores that are below the average quality scores for other tiles on the flowcell.

Hence, a good heat map plot should be blue all over. Per-tile sequence quality can only be computed from Illumina sequencing data in which the original sequence identifiers are retained, that is, it is only present in raw Illumina reads. However, this information is not retained after processing Illumina reads into, for example, Binary Alignment Map (BAM) files after sequence alignment. This difficulty is compounded when using sequencing data sets from public repositories, such as the SRA, where not all submissions include the raw reads.
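The dependence of the per-tile measure on the read identifier can be illustrated with a minimal Python sketch. It assumes Casava 1.8+ style Illumina identifiers of the form `@instrument:run:flowcell:lane:tile:x:y`, a Phred+33 quality encoding, and a single mean per tile; FastQC itself reports the deviation per base position, so this is a simplification and not a reimplementation of FastQC. The function names are hypothetical.

```python
from collections import defaultdict

def tile_from_header(header):
    """Return the tile number from a Casava 1.8+ style Illumina identifier.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [optional comment]
    """
    fields = header.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        # Typical of identifiers rewritten during processing or submission.
        raise ValueError("identifier does not carry tile information")
    return int(fields[4])

def mean_quality_per_tile(records, phred_offset=33):
    """records: iterable of (header, quality_string) pairs from a FASTQ file."""
    totals = defaultdict(lambda: [0, 0])  # tile -> [sum of scores, base count]
    for header, qual in records:
        tile = tile_from_header(header)
        totals[tile][0] += sum(ord(c) - phred_offset for c in qual)
        totals[tile][1] += len(qual)
    return {tile: s / n for tile, (s, n) in totals.items() if n}

def deviation_from_overall_mean(per_tile):
    """Positive values are above the mean over tiles, negative values below."""
    overall = sum(per_tile.values()) / len(per_tile)
    return {tile: q - overall for tile, q in per_tile.items()}
```

When the identifier has been replaced, for example by a bare accession-style name, `tile_from_header` raises an error, which mirrors why the per-tile measure cannot be recovered from such data.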

In this study, we propose that a method we have previously devised, which applies distributed-computing to quantify the sequence-specific deviations in the uniformity of mapped reads (Alnasir and Shanahan, 2017), can be used as a proxy measure when mean tile data are not available for short-read RNA-Seq data. Our method uses counts of reads overlapping motifs and works at the deep, read level. This approach is based on the assumption that 4-mers in short reads from one region on an exon will be correlated with 4-mers in short reads from another region of the same exon.

To provide the capacity to process the amounts of data typical in transcriptomic data sets (Campbell et al, 2015), our analysis employs parallel distributed computing algorithms and infrastructure using the Apache Spark platform. The software is named Hercules and is available on GitHub (Alnasir, 2018) and at Zenodo (Alnasir and Shanahan, 2022; a message passing interface (MPI) version has now also been made available). We demonstrate the method using a controlled in vitro transcribed (IVT) data set created by Lahens et al (2014) for the purpose of characterizing bias introduced in RNA-Seq library preparation, as well as a Drosophila melanogaster data set.

The first data set (IVT) was produced utilizing IVT in Escherichia coli to clone a pool of ∼1000 preselected human plasmids from the Mammalian Gene Collection (MGC) (Derge et al, 2009). Because the sequences and expression levels of these plasmids are known, and they do not undergo splicing, this allowed them to generate a highly controlled set of samples, and therefore a controlled data set, in which the source of biological variation in the samples is minimized. The samples in this data set were then subjected to different RNA-Seq preparatory protocols, specifically varying the step in which mRNA is selected.

This enables the study and quantification of the effect of these steps on coverage levels of the MGC transcripts when they were aligned to the human reference genome (hg19/grch37). They found that the mRNA selection methods employed in RNA-Seq protocols, poly-A and ribosomal depletion, both resulted in significant fold changes in the coverage of the IVT MGC plasmids when compared with sequencing the IVT MGC plasmids directly (without mRNA selection). Importantly, as the bias introduced in this data set is well characterized and attributed, we apply our analysis method to a selection of relevant samples.

In previous work, we applied our analysis method to replicates of two samples from a D. melanogaster data set produced from typical biological specimens using conventional RNA-Seq protocols (Alnasir and Shanahan, 2017). These are two small, but whole transcriptomes—those of the fruit fly species D. melanogaster wild type and mutant-r2, comprising ∼12.9 and 15.0 M reads, respectively. We will re-examine our analysis of these data with respect to sequencing tile means. While the IVT data set offers a set of samples that allow us to study the effect of different RNA selection methods, the D. melanogaster data sets have the same RNA selection method applied to all samples and vary only in the glass eye mutation—that is, the technical variation is fixed, and the biological variation should be minimal. Furthermore, the Drosophila species and its reference genome are extremely well studied and annotated, and the data have excellent provenance.

2. MATERIALS AND METHODS

2.1. Quantifying sequence-specific deviation in the distribution of mapped reads across exons

As explained in detail in Alnasir and Shanahan (2017), the uniformity of read distribution across an exon (Fig. 1) can be quantified by computing Pearson's correlations of the counts for the given motif pair in all exons within the data set by aggregating the motif pair counts at a given distance apart (motif-spacing) regardless of position within the exon (Fig. 2). We used motif-spacings of 10, 50, 100, and 200 base pairs (bp). An ideal data set would have perfect correlations for motif pairs (for instance, +1.0 for the Pearson correlation coefficient) for any given motif-pair and motif-spacing.
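The correlation step described above can be sketched as follows. This is a minimal illustration, not the Hercules implementation: it assumes that, for one motif and one motif-spacing, every motif pair found in any exon contributes one tuple of read counts (count at the first site, count at the second site), and that these tuples are pooled across exons before Pearson's coefficient is computed. Returning `None` when one of the two count vectors is constant is a choice made for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined when either side has no variance
    return sxy / math.sqrt(sxx * syy)

def motif_pair_correlation(pairs):
    """pairs: (count_first_site, count_second_site) tuples pooled over exons."""
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    return pearson(first, second)
```

Under these assumptions, perfectly uniform coverage, in which the count at the second site is always proportional to the count at the first, gives a coefficient of +1.0, the ideal value mentioned above.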

FIG. 1.

Typical distribution of RNA-Seq reads mapped to an exon (Alnasir and Shanahan, 2018).

FIG. 2.

Quantification of read coverage using short pairs of sequence motifs (4-mers) within the reads shown in light shading. Motif-pairs show variable correlations that are used to quantify sequence-specific deviations in the distribution of mapped reads across exons (Alnasir and Shanahan, 2017).

To thoroughly examine the effect of sequence-specific motifs on the uniformity of read distribution, we analyzed the correlation for all 4-mer motifs ranging from AAAA to GGGG (i.e., 4⁴ = 256 combinations) in the RNA-Seq reads of an aligned Sequence Alignment Map (SAM) file. We verified this method by running our analysis on an artificially created transcriptome with in-built uniform distribution of reads (see Section A, Supplementary Table S1, and Supplementary Fig. S1 in the Supplementary Data).
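As an illustration of the motif space, the sketch below enumerates all 256 4-mers and locates occurrences of one motif that lie a given spacing apart in a sequence. It is a hedged example only: the helper names are hypothetical, the enumeration order is arbitrary, and measuring the spacing from motif start to motif start is an assumption of this sketch.

```python
from itertools import product

def all_kmers(k=4, alphabet="ACGT"):
    """All k-mers over the alphabet; 4**4 = 256 motifs for k = 4."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def motif_positions(sequence, motif):
    """0-based start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions

def same_motif_pairs(sequence, motif, spacing):
    """Pairs of start positions of the motif that are `spacing` bases apart."""
    starts = set(motif_positions(sequence, motif))
    return sorted((p, p + spacing) for p in starts if p + spacing in starts)
```

For example, `same_motif_pairs(exon_sequence, "ACGT", 10)` would list the site pairs whose read counts could then be compared at the 10 bp spacing.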

The effect of extremes of GC content in RNA-Seq data has been discussed in numerous studies (Chen et al, 2013; Harrison et al, 2010), and we therefore also investigate the effect that the mean GC content of reads within the exon, and the GC content of the 4-mer motif itself, have on the distribution of reads across the exon. To partition reads by mean GC content (of the exon), we define binned GC content ranges: 30%–40%, 40%–50%, 50%–60%, and 60%–70%. Since exons with more extreme mean GC values, that is, <30% or >70%, are not observed at sufficient levels, we do not examine these. Given that we are working with 4-mers, the motif GC content is restricted to the values 0%, 25%, 50%, 75%, and 100%.
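The two GC parameters can be made concrete with a short sketch. The bin edges come from the text; how a value that falls exactly on an edge is assigned is not stated there, so the half-open intervals used here (with 70% included in the last bin) are an assumption, as are the function names.

```python
EXON_GC_BINS = [(30, 40), (40, 50), (50, 60), (60, 70)]

def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def exon_gc_bin(gc):
    """Label of the bin containing gc, or None if gc is outside 30%-70%.

    Intervals are treated as [low, high), except the last, which also
    includes its upper edge (an assumption of this sketch).
    """
    last = len(EXON_GC_BINS) - 1
    for i, (low, high) in enumerate(EXON_GC_BINS):
        if low <= gc < high or (i == last and gc == high):
            return f"{low}%-{high}%"
    return None
```

Because a 4-mer has only four positions, `gc_percent` applied to any motif can return only 0, 25, 50, 75, or 100, which is why the motif GC axis takes five discrete values.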

2.2. Homo sapiens IVT RNA-Seq data set

We have analyzed three Homo sapiens samples, which were prepared by IVT RNA-Seq. Lahens et al (2014) in their study used RNA that had been in vitro transcribed from cDNA clones in E. coli. Their rationale for the development of this data set was that the “nucleotide sequence at every base was known, the splicing pattern established, and the expression the level coverage is uniform across the transcript.” This means that any bias occurring in the coverage of reads in these three samples must be of technical rather than biological origin. In addition, these samples are known to demonstrate intra-exon coverage bias (we have used their H. sapiens data; Lahens et al, 2014) and hence are ideal to study from this perspective.

We explored known bias in this RNA-Seq data set by analyzing intra-exon motif pair correlation within the reads of three samples from the H. sapiens data set, each of which had a different library preparation protocol applied during RNA-Seq. Although the raw data deposited in GEO for this data set have not been aligned, the library preparation protocol for each sample and the alignment and post-processing strategy applied have been clearly documented. We, therefore, aligned the reads of the samples to the reference hg19 genome according to the documented parameters to generate the SAM files for analysis by our method.

The RNA in these samples has been transcribed from cDNA clones in E. coli DH5α cells. The data set comprises a pool of 1062 RNAs from a full-length human cDNA library sequenced using RNA-Seq. The first sample, IVT-Only, had its IVT RNA subjected to ribosomal RNA depletion before sequencing, while the second sample, IVT-PolyAsel, had polyadenylated selection applied instead of ribosomal depletion. These are two different, routinely used protocols for selecting mature mRNA specifically from RNA samples. The third sample, IVT-Plasmids, is our control as it was produced by direct sequencing of the human IVT plasmids without RNA selection (i.e., neither ribosomal depletion nor PolyA selection was applied). These data are deposited at the GEO database with the ID GSE50445 (Black et al, 2012).

The distribution of protein-coding exon lengths in H. sapiens is shown in Supplementary Figure S2 of the Supplementary Data. We note that, although the median exon length in H. sapiens is 121 bp (shorter than Drosophila) and ∼80% of the exons are <200 bp (Sakharkar et al, 2004), the remaining 20% of exons will contribute to motif pair correlations at 200 bp apart.

2.3. D. melanogaster RNA-Seq data set

To investigate intra-exon motif pair correlation within the reads from a “typical data set,” we have used two Drosophila (species D. melanogaster) transcriptomics data sets, which differ by the mutation gl[60j] in the eye–antennal disk. These are the full transcriptomes of the wild type and the glass eye mutant, acquired from the Stein Aerts Laboratory of Computational Biology at the University of Leuven, Belgium. These data are deposited at the GEO database with the ID GSE39781 (Aerts, 2012). The D. melanogaster data sets featured in a research publication by Naval-Sánchez et al (2013).

Supplementary Figure S3 shows the distribution of exon lengths in the Drosophila genome—the median exon length is 298 bp. This is important because it shows that most of the exons are longer than 200 bp and therefore can have data to compute correlations.

The information regarding the source of the biological samples, the sample preparatory protocols, and post-sequencing processing that were applied to samples in these two data sets is documented in Table 1.

Table 1.
The Sample and Library Preparation Protocols, Together with the Data Processing Steps, Applied to the RNA-Seq Data Sets Used in This Analysis

Homo sapiens (IVT-Plasmids, IVT-Only, and IVT-PolyAsel)	Drosophila melanogaster (wild type and mutant)
Replicates: 1×.sam file for each IVT sample (no replicates)	Replicates: 4×.sam files (two replicates each for wild type and mutant)
Sample preparation	Sample preparation
Glycerol stocks containing individual cDNAs (cloned into pCMV-Sport 6 plasmid) from the MGC (Derge et al, 2009) were produced. Plasmid DNA was extracted from these glycerol stocks and plated at 50 ng per well in 384-well plates. The contents of three 384-well plates (total of 1062 human transcripts) were collected. The plasmid library was then amplified by transferring 10 ng into Escherichia coli DH5α cells (cat. no. 18258012; Invitrogen, Life Technologies, Carlsbad, CA). The heat shock method was used to transform E. coli (see Lahens et al, 2014, for more details). The plasmids were then purified using Qiagen (Hilden, Germany) Maxiprep Kit (cat. no. 12163) according to the manufacturer's protocol. Samples were sequenced on the Illumina HiSeq 2000 platform	For the wild-type D. melanogaster, fly stocks (Canton-S and strain RAL-208) were obtained from the inbred collection of T. Mackay. For the mutant-r2 D. melanogaster, fly stocks (stock 507) were obtained from the Bloomington Stock Center. Eye–antennal and wing imaginal disks were dissected. RNA was extracted, yielding 3 mg of total RNA per sample. The samples were processed into libraries according to the Illumina TruSeq protocol with appropriate indices, pooled. Sequencing of the transcriptomes was performed on the Illumina HiSeq 2000 platform
RNA-Seq	RNA-Seq
After sequencing, raw reads from the samples were aligned to the human genome (GRCh37/hg19) using the RNA-Seq Unified Mapper (version 2.0.4) with default parameters. Only reads that mapped to a single location were used (selected from the RUM_Unique aligned reads file)	After sequencing, Fastx-clipper (Gordon and Hannon, 2010) was used to discard reads containing residuals of adapter sequences (FastX clipper version 0.0.13 with option -M 15). Quality control was applied to the raw sequence reads using the FastQC software (version 0.9; Andrews, 2010), checking for PHRED quality >20 and different primer contaminations. The reads were then aligned using TopHat version 2.0 (Pachter et al, 2009) with default parameters to the Flybase D. melanogaster genome version r5.45 (released March 2012; Antonazzo et al, 2017)

Left: H. sapiens IVT RNA-Seq, Right: D. melanogaster.

cDNAs, complementary DNAs; IVT, in vitro transcribed; MGC, Mammalian Gene Collection.

3. RESULTS

3.1. Analysis of IVT RNA in H. sapiens

The first IVT sample we analyzed was the IVT-Plasmids sample, as this was produced by sequencing the human IVT plasmids directly without applying ribosomal depletion or PolyA selection methods and therefore represents our control. The IVT-Plasmids sample, by virtue of not having RNA selection protocol steps applied, also reduces the technical sources of variation in read distribution. Table 2 shows the 4-mer motif pairs with the highest and lowest correlations. We partitioned the correlations across the IVT-Plasmids sample as a function of GC content of the motif and GC content of the exon and produced box and whisker plots (Fig. 3). We observe that the median correlation has a weak dependence on the motif GC content and is in the range of 0.6–0.8, with the exception of 100% motif GC content. The median correlation as a function of the GC content of the exons likewise lies in the range of 0.6–0.8.

FIG. 3.

Homo sapiens IVT-Plasmids sample (control): box and whisker plots of Pearson's correlations as a function of motif GC and mean exon GC content, for all spacings. GC, guanosine-cytosine; IVT, in vitro transcribed.

Table 2.

Homo sapiens Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings

IVT-Plasmids H. sapiens
R (10 bp)	R (50 bp)	R (100 bp)	R (200 bp)
Lowest 10 Pearson's correlation outliers and their motifs
ATCG = −0.0543 (23)	ACGA = −0.2715 (15)	TCGA = −0.2600 (16)	ACGA = −0.3050 (10)
CGTC = −0.0209 (44)	GACG = −0.0101 (39)	CGAA = −0.1953 (17)	TCGC = −0.1834 (13)
ACCG = −0.0129 (49)	AACG = 0.0000 (9)	CGTT = −0.1510 (10)	ACCG = −0.1608 (18)
CGTT = −0.0052 (21)	TTCG = 0.0000 (9)	GTCG = −0.0786 (19)	GTAC = −0.1465 (21)
CGTA = 0.0000 (7)	TCGA = 0.0000 (8)	CGGT = −0.0304 (45)	TCGA = −0.1274 (11)
CGAC = 0.0479 (43)	CGTT = 0.0000 (8)	CGTA = 0.0000 (5)	CGAG = −0.1055 (41)
ACGC = 0.1639 (40)	TCGT = 0.0000 (8)	ACGT = 0.0000 (9)	TGCG = −0.0935 (40)
GCGA = 0.1665 (62)	TACG = 0.0000 (7)	TACG = 0.0000 (4)	GTCT = −0.0547 (54)
ACCC = 0.1899 (260)	CGAT = 0.0000 (9)	CCGT = 0.0312 (37)	TAGG = −0.0541 (24)
GGGG = 0.1991 (649)	TCGC = 0.0884 (28)	CGAT = 0.0407 (18)	GCGA = −0.0482 (18)
Highest 10 Pearson's correlation outliers and their motifs
TAGA = 0.9985 (48)	TAGG = 0.9943 (19)	ATCG = 0.9822 (16)	TCAA = 0.8609 (57)
GTCG = 0.9984 (19)	ACTA = 0.9703 (21)	GTAT = 0.9581 (33)	ATCG = 0.8324 (10)
TTAG = 0.9983 (39)	CTAG = 0.9692 (19)	CGTC = 0.9429 (31)	AGGC = 0.7683 (137)
ACGT = 0.9978 (19)	CCTA = 0.9672 (26)	GTTA = 0.9233 (23)	ATCC = 0.7553 (49)
ATAG = 0.9976 (37)	CATT = 0.9644 (103)	AGTC = 0.9189 (59)	GGAT = 0.7430 (40)
ATTC = 0.9974 (85)	ACTT = 0.9594 (84)	GTCA = 0.8984 (67)	TGAC = 0.6765 (52)
GCAC = 0.9971 (110)	GATA = 0.9590 (27)	TCGT = 0.8971 (12)	TACC = 0.6553 (22)
TAGG = 0.9970 (31)	CTAA = 0.9572 (31)	GCAA = 0.8938 (72)	TGGT = 0.6435 (84)
TCGT = 0.9968 (18)	TATC = 0.9495 (27)	CATC = 0.8817 (117)	GGTG = 0.6410 (103)
ACAT = 0.9968 (114)	TTAC = 0.9389 (33)	ATTC = 0.8620 (78)	ACGG = 0.6184 (21)

Actual motif-pair counts are in parentheses.

To compare the effect of applying different RNA selection protocol methods to the IVT samples, specifically ribosomal depletion versus PolyA selection, we compared the IVT-Only and IVT-PolyA samples, respectively. Tables 3 and 4 show that, for both of these IVT-Seq samples, a number of 4-mer motif pairs have very high correlations. High correlation outliers are observed across all spacings. We produced box and whisker plots of correlation as a function of GC content of the motif and GC content of the exon (Fig. 4). The trend of the data for IVT-Only and IVT-PolyA, as a function of motif GC content and mean exon GC content, is similar to that of the IVT-Plasmids data, although the median correlation is somewhat lower (0.4–0.6).

FIG. 4.

Correlation (Pearson's) as a function of 4-mer motif and exon GC content, across all spacings, for Homo sapiens for the two IVT samples that had different RNA selection protocols applied: (Top) IVT-Only sample, which underwent ribosomal depletion, and (Bottom) IVT-PolyA, which underwent PolyA selection.

Table 3.

Homo sapiens In Vitro Transcribed-Seq Pearson's Correlation Coefficient Outliers (Top 10 and Lowest 10) for Different Intra-Exon 4-Mer Motif Sequence Pairs at 10, 50, 100, and 200 bp Spacings: In Vitro Transcribed-Only Library Preparation

IVT-Only H. sapiens
R (10 bp)	R (50 bp)	R (100 bp)	R (200 bp)
Lowest 10 Pearson's correlation outliers and their motifs
CGTT = −0.1726 (18)	CGAC = −0.2017 (22)	CGAC = −0.1421 (28)	CTAC = −0.2360 (30)
CGTA = −0.1470 (13)	GCGT = −0.1374 (25)	GCGA = −0.1255 (33)	ACGC = −0.2332 (14)
ACGC = −0.1337 (35)	CGAT = −0.1171 (11)	ACCG = −0.1224 (25)	TCGG = −0.2111 (21)
CCGT = −0.0979 (37)	CCTA = −0.1010 (34)	GACG = −0.0923 (29)	GTAT = −0.1806 (13)
GCGT = −0.0746 (44)	TAAC = −0.0980 (42)	CGAA = −0.0903 (18)	CCGA = −0.1790 (17)
GAGT = −0.0628 (101)	CGAA = −0.0884 (18)	GTTA = −0.0891 (32)	CGGT = −0.1735 (16)
CGTC = −0.0298 (44)	TAGG = −0.0698 (32)	CGTT = −0.0832 (11)	ACGG = −0.1720 (16)
CGGA = −0.0217 (75)	AACC = −0.0657 (81)	CGGT = −0.0666 (34)	ACTA = −0.1571 (23)
GCGA = −0.0156 (52)	CGCA = −0.0563 (40)	GTTC = −0.0627 (48)	GACG = −0.1540 (26)
TACG = 0.0000 (6)	GCTA = −0.0562 (26)	GTAG = −0.0583 (40)	TGCG = −0.1532 (34)
Highest 10 Pearson's correlation outliers and their motifs
TACA = 0.9989 (98)	TCGA = 0.9805 (15)	ATCC = 0.9609 (41)	TAGT = 0.9913 (20)
CTCA = 0.9983 (168)	GACT = 0.9713 (56)	AGTA = 0.9403 (31)	ATCC = 0.9860 (35)
AGTA = 0.9978 (50)	TGCA = 0.9674 (114)	TATA = 0.9261 (50)	GCAT = 0.9734 (40)
TACT = 0.9977 (47)	AGGT = 0.9667 (70)	CGTG = 0.9146 (39)	CTCG = 0.9417 (19)
TGAA = 0.9975 (229)	ATGC = 0.9626 (44)	TCAC = 0.8733 (77)	AGTC = 0.8869 (34)
GTCA = 0.9970 (81)	CCTT = 0.9551 (132)	CACC = 0.8694 (164)	TCAC = 0.8492 (47)
GCAT = 0.9969 (92)	TTGC = 0.9544 (48)	CACT = 0.8642 (109)	TGTG = 0.8017 (111)
AAGA = 0.9969 (257)	ATGG = 0.9543 (93)	CCCT = 0.8530 (275)	ATTC = 0.7894 (52)
AACT = 0.9965 (104)	CTCG = 0.9507 (30)	TGAT = 0.8525 (64)	CTGT = 0.7682 (114)
GTAT = 0.9963 (24)	CCGT = 0.9480 (17)	GTAC = 0.8448 (18)	TACC = 0.7581 (22)