REDO: RNA Editing Detection in Plant Organelles Based on Variant Calling Results

Abstract

RNA editing is a post-transcriptional or cotranscriptional process that changes the sequence of the precursor transcript by substitutions, insertions, or deletions. Almost all of the land plants undergo RNA editing in organelles (plastids and mitochondria). Although several software tools have been developed to identify RNA editing events, there has been a great challenge to distinguish true RNA editing events from genome variation, sequencing errors, and other factors. Here we introduce REDO, a comprehensive application tool for identifying RNA editing events in plant organelles based on variant call format files from RNA-sequencing data. REDO is a suite of Perl scripts that illustrate a bunch of attributes of RNA editing events in figures and tables. REDO can also detect RNA editing events in multiple samples simultaneously and identify the significant differential proportion of RNA editing loci. Comparing with similar tools, such as REDItools, REDO runs faster with higher accuracy, and more specificity at the cost of slightly lower sensitivity. Moreover, REDO annotates each RNA editing site in RNAs, whereas REDItools reports only possible RNA editing sites in genome, which need additional steps to obtain RNA editing profiles for RNAs. Overall, REDO can identify potential RNA editing sites easily and provide several functions such as detailed annotations, statistics, figures, and significantly differential proportion of RNA editing sites among different samples.

1. Introduction

RNA editing, a molecular process through changing the specific nucleotides within an RNA molecule (Brennicke et al., 1999), has been observed in the organelles (plastids and mitochondria) of most land plants (Hiesel et al., 1994; Freyer et al., 1997; Shikanai, 2006). Recent RNA-sequencing (RNA-seq) studies suggest that RNA editing shows tissue-specific and partial patterns in mitochondria (mt) and nuclear genome (Picardi et al., 2010; Zhang and Xiao, 2015; Khermesh et al., 2016). The editing events endow an extra complexity in functional regulation including the tRNA maturation and RNA splicing (Ichinose and Sugita, 2016). With the facilitating of next-generation sequencing, RNA-seq becomes a comprehensive, precise, and low-cost approach for transcriptome profiling and variant analysis. The growing of public RNA-seq data also provides an excellent opportunity to investigate the effect of RNA editing on organelle function and evolution (Smith, 2013).

Although several software have been developed to detect RNA editing events based on RNA-seq data, such as REDItools (Picardi and Pesole, 2013), RED (Sun et al., 2016), VIRGO (Distefano et al., 2013), rddChecker (http://ccb.jhu.edu/software/rddChecker/), RCARE (Lee et al., 2015), and GIREMI (Zhang and Xiao, 2015), there is still a great challenge to distinguish true RNA editing events from genome variation, sequencing errors, and other factors (Bass et al., 2012). Moreover, all of these tools are developed based on human transcriptome data and most of them have usage limitation. RED is complicated in installation and has relatively low sensitivity compared with REDItools. VIRGO is a web-based tool that only maps A->G mismatches between genomic and expression sequence. rddChecker only supports hg19 and mm9 genomes and needs both RNA-seq and DNA-sequencing data. RCARE is not able to identify novel RNA-editing sites. GIREMI is used to identify human A->I RNA editing sites. REDItools is the only effective software that can be used to identify RNA editing sites in organelles, but it lacks the detailed annotation for RNA editing sites and the comparative analysis of RNA editing sites among multiple samples. Compared with nuclear genome, the difference of mutant type, repair system, and copy number in organelle genome will affect the identification of true RNA editing events. Currently, there is no specific software for identification of RNA editing sites in organelles.

To improve RNA editing study in organelles using the next-generation sequencing data, we developed the package REDO for RNA editing detection in organelles based on raw standard variant calling results from popular variant callers GATK (McKenna et al., 2010; DePristo et al., 2011) and SAMtools (Li et al., 2009; Li, 2011). REDO can identify potential RNA editing sites by applying stringent rule-dependent filters and statistical filters. REDO also provides detailed annotations, statistics, and figures for RNA editing sites. Besides, REDO can detect RNA editing events in multiple samples simultaneously and identify significantly differential proportion of RNA editing events in different samples. The genome variation can be easily removed by a subprogram in our package. Moreover, REDO can be run easily on any operating system with Perl and R environment installed. REDO is freely available at Source forge (see first Reference).

2. Methods

In principle, the bases that differ between genomic DNA sequence and RNA-seq reads from transcripts are candidates of RNA editing events. However, to achieve accurate mapping of reads, it is critical to pay attention to removal of potential false positives that result from various artifacts or errors. For RNA-seq mapping, several outstanding software have been developed, for example, GSNAP (Wu and Nacu, 2010) and BWA (Li and Durbin, 2009). The concept of RNA editing detection is identical to the genome variant calling in GATK and SAMtools. Thus, we used the preliminary output from variant calling software to identify RNA editing sites.

REDO only uses the variant call format (VCF) files (records for all sites), the genome sequence file (FASTA format, for annotation), and the gene annotation file (for annotation, feature table file, www.ncbi.nlm.nih.gov/projects/Sequin/table.html) as input. Then the raw variants are filtered by rule-dependent filters and statistical filters to reduce the false positives as following steps.

(1) Quality control filter: the low quality sites are filtered according to the reads quality (DP = 0 in GATK [reads with MQ = 255 or with bad mates are filtered] and DP4 = 0 in SAMtools [number of high-quality ref-forward, ref-reverse, alt-forward, and alt-reverse bases]).

(2) Depth filter: three criterions are used in this filter, including total reads depth (<4), alt reads depth (<3), and the total reads depth variation between alt site and adjacent nonvariant sites in specific window (total reads depth <0.2 × average reads depth of adjacent sites).

(3) Alt proportion filter: two criteria are used in this filter, including alt proportion (<0.1) and the modified alt proportion (alt proportion minus error proportion obtained from adjacent nonvariant sites in specific window; [alt_proportion-error_proportion] <0.1).

(4) Multiple alt filter: only the variant with one alt allele is retained for RNA editing detection.

(5) Distance filter: the variant sites in short distance (<3 bp) are filtered due to the possible positional interference for RNA editing.

(6) Spliced junction filter: variants within short spliced anchor (<2) are removed.

(7) Indel filter: the indel variants are removed in default (indels can be kept if assigning “-i yes” option).

(8) Likelihood ratio (LLR) test filter: a LLR test is used for RNA editing sites identification (Chepelev, 2012; Sun et al., 2016). LLR test is a probabilistic test incorporating error probability of bases (error probability is obtained using adjacent nonvariant sites in specific window) for detecting RNA editing sites (LLR <10).

(9) Fisher exact test filter: we assessed the significance for a given RNA editing site (alt reads, ref reads) by comparing its expected levels (0, alt reads + ref reads) using the Fisher exact test and the p value is used for filter (p value >0.01).

(10) The complicated filter model: based on the statistics results for experiment validated RNA editing sites and the attributes of codon table, a complicated filter model was built according to five characteristics of RNA editing sites, which are RNA editing types (C->T, A->G, T->C, etc), alt proportion, amino acids change, codon phase, and hydrophobic/hydrophilic change.

Based on the complicated filter model, the candidate RNA editing sites were divided into five tiers. The first tier was RNA editing type C->T with at least 50% reads supported, then all candidates were kept. The second tier was RNA editing type C->T without amino acid change. Only the candidates located in any base of codon encoding hydrophobic amino acid were kept. The third tier was RNA editing type C->T with amino acid change in amino acid level. If the candidates located in second or first base of codon, all of them were kept. If the candidates located in third base of codon, they were kept if amino acid changed from hydrophilic amino acid to hydrophobic amino acid. The fourth tier was other RNA editing types (A->G, T->C, etc) without amino acid change. Only the candidates located in second base of codon or in codon encoding hydrophobic amino acid were kept. The fifth tier was other RNA editing types with amino acid change. If the candidates located in second base of codon, all of them were kept. If the candidates located in first base of codon encoding hydrophobic amino acid, these candidates were kept. If the candidates located in third base of codon, they were kept if amino acid changed from hydrophilic amino acid to hydrophobic amino acid.

REDO can provide detailed annotations, including gene name, genome position, nucleotide and amino acid position in gene, codon phase, ref and editing alleles, ref and editing codons, ref and editing amino acids, editing proportion, total reads depth, reads depth of ref and editing allele, LLR, p value of fisher exact test, GC (guanine and cytosine) content of gene and surrounding sequence of RNA editing site (20 bp), and gene function. Some useful information is also provided by REDO, such as the summary for RNA editing types (C->T, T->C, etc.), hydrophobic/hydrophilic changes, codon phases of editing loci, and silent/nonsilent changes. To illustrate these attributes of RNA editing sites, 16 figures are produced using R program.

Inspecting the significantly differential proportion of RNA editing events in multiple samples is a popular interest of researchers. REDO can detect RNA editing events in multiple samples simultaneously and identify RNA editing loci with significantly differential editing proportion in different samples (the pairwise comparison and the cluster analysis of RNA editing sites in multiple samples). A subprogram fish.pl is provided to find common or unique sites in two samples or remove the genome variants in REDO result by comparing with genome variant file. This program can deal with the most commonly used file formats *.fa, *.psl, and *.csv. It takes the “baits” (unique keys composed of one or several columns that are universal in both files) from the “bait” file and searches the “baits” in the “fish” file. By default, it prints the items in “fish” file that also exist in “bait” file.

3. Results

Huang et al. (2013) identified 75 C->T RNA editing sites in chloroplast (cp) of coconut (Cocos nucifera L.) by RT-PCR experiments, and several public RNA-seq data of coconut were deposited in NCBI SRA database. Thus, we used coconut cp (KX028884/KF285453) and related RNA-seq data for cp RNA editing test (Supplementary Table S1). Meanwhile, Philippe Giege et al. reported 455 C->T RNA editing sites by cDNA clones sequencing in Arabidopsis (Arabidopsis thaliana) mt (Giegé and Brennicke, 1999) and a lot of Arabidopsis RNA-seq data is available. Therefore, we used Arabidopsis mt (NC_001284) and related RNA-seq data (SRR2079784) for mt RNA editing test (Supplementary Table S1).

The cp RNA-seq data were mapped by GSNAP (version: 2014-12-19) whereas mt RNA-seq data were mapped by BWA (Version: 0.7.10-r789) and GSNAP. All variants were identified by SAMtools (Version: 1.2) and BCFtools (Version: 1.2) (Table 1). Since the size of the organelle genomes and VCF files was relatively small, REDO required very low memory consumption and runs <2 minutes for eight samples in coconut cp and <1 minute for one sample in Arabidopsis mt. Considering the time for SAMtools and BCFtools, REDO needed 9 minutes and 54 seconds for cp1and 4 minutes and 28 seconds for mt, whereas REDItools spent 33 minutes and 50 seconds for cp1 and 42 minutes and 22 seconds for mt.

Table 1.

Information of RNA Editing Identification in Test RNA-Sequencing Data

Species	Organelle	SRA accession No.	SAMtools variant	REDItools editing sites	Percent	REDO editing sites	Percent
Coconut	cp	SRR1063404	210	126	60.00	96	45.71
		SRR1063407	387	—	—	66	17.05
		SRR1137438	197	—	—	51	25.89
		SRR1173229	590	—	—	129	21.86
		SRR1265939	477	—	—	49	10.27
		SRR1273070	1426	375	26.30	175	12.27
		SRR1273180	229	—	—	88	38.43
		SRR606452	247	—	—	68	27.53
Arabidopsis	mt (GSNAP)	SRR2079784	885	532	60.11	473	53.45
	mt (BWA)	SRR2079784	1219	553	45.37	488	40.03

cp, chloroplast; mt, mitochondria.

Identification, annotation, and statistics of RNA editing sites were performed by Perl script, while figures were drawn using R program. All test data were run on a Linux system. Comparing the performance between two methods, REDO has higher precision rate (80.72% vs. 75.00% in cp1 and 88.32% vs. 83.37% in mt) and slightly higher specificity (99.95% vs. 99.93% in cp1 and 99.84% vs. 99.76% in mt), whereas REDItools has higher sensitivity (92.00% vs. 89.33% in cp1 and 82.01% vs. 81.31% in mt) (Table 2). Comparing the true RNA editing sites with the common RNA editing sites identified by two methods in cp and mt, the precision rate, specificity, and sensitivity are similar to the result obtained by REDO alone. The accuracy of REDO is underestimated due to the sample difference between experiment validated data and RNA-seq data (tissue specific of RNA editing). By comparing with the prediction of PREP software (Mower, 2009), REDO has higher precision rate (80.72% vs. 59.81%), sensitivity (89.33% vs. 85.33%), and specificity (99.95% vs. 99.87%) in testing data set of coconut chloroplast (SRR1063404).

Table 2.

RNA Editing Sites Identification Between REDO and REDItools in Three Analyzed RNA-Sequencing Data Sets

Samples	Software	Type	True	Total	True+	True−	False+	False−	Precision (%)	Sensitivity (%)	Specificity (%)
Coconut cp1 (SRR1063404)	REDO	CT	75	83	67	33349	16	8	80.72	89.33	99.95
	REDItools			92	69	33340	23	6	75.00	92.00	99.93
	Common			80	65	33352	15	10	81.25	86.67	99.96
	REDO	All^a		83	67	33349	16	8	80.72	89.33	99.95
	REDItools			98	69	33334	29	6	70.41	92.00	99.91
	Common			80	65	33352	15	10	81.25	86.67	99.96
Coconut cp2 (SRR1273070)	REDO	CT		44	22	33388	22	53	50.00	29.33	99.93
	REDItools			57	22	33375	35	53	38.60	29.33	99.90
	Common			39	19	29415	20	56	48.72	25.33	99.93
	REDO	All		71	22	29383	49	53	30.99	29.33	99.83
	REDItools			121	22	29333	99	53	18.18	29.33	99.66
	Common			65	19	29389	46	56	29.23	25.33	99.84
Arabidopsis mt (gsnap, SRR2079784)	REDO	CT	428^b	394	348	29060	46	80	88.32	81.31	99.84
	REDItools			421	351	29033	70	77	83.37	82.01	99.76
	Common			391	345	29063	46	83	88.24	80.61	99.84
	REDO	All		408	348	29046	60	80	85.29	81.31	99.79
	REDItools			439	351	29015	88	77	79.95	82.01	99.70
	Common			426	345	29028	81	83	80.99	80.61	99.72
Arabidopsis mt (bwa, SRR2079784)	REDO	CT	428^b	406	356	29048	50	72	87.68	83.18	99.83
	REDItools			430	362	29024	68	66	84.19	84.58	99.77
	Common			402	354	29052	48	74	88.06	82.71	99.84
	REDO	All		419	356	29035	63	72	84.96	83.18	99.78
	REDItools			448	362	29006	86	66	80.80	84.58	99.70
	Common			415	354	29039	61	74	85.30	82.71	99.79

REDItools was run with “-c 4” option (total reads coverage ≥4) and the result was filtered by p value ≤0.01. RED was run using “-a 0, -s 0” option (minimum alt distance = 0 and minimum junction anchor size = 0) for mt data and default for cp data with applying the complicated filter model. The true negative number was obtained using total inspected gene length minus total identified RNA editing sites.

All possible RNA editing types.

There are 455 C->T RNA editing sites in Arabidopsis by cDNA clones sequencing, whereas 428 C->T RNA editing sites are located in the CDS region. We only used these 428 C->T RNA editing sites for mt RNA editing sites test.

Precision = True/total, True is defined as experimental validated RNA-editing sites; Sensitivity = True+/(True+ + False−); Specificity = True−/(True− + False+).

CT, C->T type.

In general, REDItools only reports the genome position for each RNA editing site, whereas REDO shows gene name, genome position, gene position, amino acid position, codon phase, base change, codon change, amino acid change, and other information for RNA editing sites (Fig. 1). In addition, the result of REDO can be used for publication, whereas REDItools result needs additional process. REDO can also detect RNA editing events and identify the significant differential editing level in multiple samples simultaneously. Cluster analysis and heatmap plotting are also provided based on the RNA editing proportion (Fig. 2). A matrix for editing allele proportion of RNA editing sites in all samples was produced. The heatmap was plotted based on this matrix using heatmap function in R. In heatmap function, the distance matrix of different samples was calculated using “dist” function with the default Euclidean method, and the hierarchical clustering was computed using “hclust” function with the complete method.

FIG. 1.

The attributes of RNA editing sites in coconut cp1 (SRR1063404). The density of alternative allele proportion and sequencing error proportion, the density of total read depth and average read depth, the density of read proportion for alternative allele and reference allele, the density of distance to adjacent candidate editing site, the density of log₁₀ LLR, the density of p value, the density of GC content, the boxplot of alternative allele proportion, the boxplot of sequencing error proportion, the boxplot of total read depth, the boxplot of total read depth/average read depth, the boxplot of distance to adjacent candidate editing site, the boxplot of log₁₀ LLR, the boxplot of p value and the boxplot of GC content are shown for RNA editing sites. cp, chloroplast; GC, GC content; LLR, likelihood ratio.

FIG. 2.

The cluster of RNA editing sites in coconut cp based on eight RNA-seq data according to the RNA editing proportions.

4. Conclusion

In summary, by using REDO, users could identify RNA editing sites based on RNA-seq data simpler and faster. First, it provides major processes for RNA editing analysis, including RNA editing detection, annotation, statistics analysis, identification of differential proportion of RNA editing sites with higher precision rate, and more specificity. Second, it enables RNA editing analysis with multiple samples as well as cluster analysis. Third, REDO could generate publication-ready figures and other illustrations. Moreover, the subprogram fish.pl can be used to process the REDO result tables easily and quickly.

Footnotes

Acknowledgments

This study is supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA08020102), National Natural Science Foundation of China (Grant Nos. 31501042, 31271385, and 31200957), and King Abdulaziz City for Science and Technology (KACST) grant 1035-35 from KACST, Kingdom of Saudi Arabia. Technical supports were provided by the CAS Key Laboratory of Genome Science and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, People's Republic of China. We also thank all anonymous reviewers and Dr. Shenghan Gao, Dr. Sen Wong, and Dr. Zilong He for critical comments and helpful suggestions.

Author Disclosure Statement

No competing financial interests exist.

References

REDO is freely available at https://sourceforge.net/projects/redo.

Bass

, Hundley

, Li

J.B.

, et al. 2012. The difficult calls in RNA editing. Nat. Biotechnol., 30, 1207–1209.

Brennicke

, Marchfelder

, and Binder

1999. RNA editing. FEMS Microbiol. Rev., 23, 297–316.

Chepelev

2012. Detection of RNA editing events in human cells using high-throughput sequencing. Methods Mol. Biol., 815, 91–102.

DePristo

M.A.

, Banks

, Poplin

, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498.

Distefano

, Nigita

, Macca

, et al. 2013. VIRGO: Visualization of A-to-I RNA editing sites in genomic sequences. BMC Bioinformatics, 14, S5.

Freyer

, Kiefer-Meyer

M-C.

, and Kössel

1997. Occurrence of plastid RNA editing in all major lineages of land plants. Proc. Natl. Acad. Sci. U. S. A., 94, 6285–6290.

Giegé

, and Brennicke

1999. RNA editing in Arabidopsis mitochondria effects 441 C to U changes in ORFs. Proc. Natl. Acad. Sci. U. S. A., 96, 15324–15329.

Hiesel

, Combettes

, and Brennicke

1994. Evidence for RNA editing in mitochondria of all major groups of land plants except the Bryophyta. Proc. Natl. Acad. Sci. U. S. A., 91, 629–633.

10.

Huang

Y-Y.

, Matzke

A.J.

, and Matzke

2013. Complete sequence and comparative analysis of the chloroplast genome of coconut palm (Cocos nucifera). PLoS One, 8, e74736.

11.

Ichinose

, and Sugita

2016. RNA editing and its molecular mechanism in plant organelles. Genes, 8, 5.

12.

Khermesh

, D'Erchia

A.M.

, Barak

, et al. 2016. Reduced levels of protein recoding by A-to-I RNA editing in Alzheimer's disease. RNA, 22, 290–302.

13.

Lee

S.Y.

, Joung

J-G.

, Park

C.H.

, et al. 2015. RCARE: RNA sequence comparison and annotation for RNA editing. BMC Med. Genomics, 8, S8.

14.

2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.

15.

, and Durbin

2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.

16.

, Handsaker

, Wysoker

, et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.

17.

McKenna

, Hanna

, Banks

, et al. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303.

18.

Mower

J.P.

2009. The PREP suite: Predictive RNA editors for plant mitochondrial genes, chloroplast genes and user-defined alignments. Nucleic Acids Res. 37, W253–W259.

19.

Picardi

, Horner

D. S.

, Chiara

, et al. 2010. Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing. Nucleic Acids Res. 38, 4755–4767.

20.

Picardi

, and Pesole

2013. REDItools: High-throughput RNA editing detection made easy. Bioinformatics, 29, 1813–1814.

21.

Shikanai

2006. RNA editing in plant organelles: Machinery, physiological function and evolution. Cell. Mol. Life Sci., 63, 698–708.

22.

Smith

D.R.

2013. RNA-Seq data: A goldmine for organelle research. Brief Funct. Genomics, 12, 454–456.

23.

Sun

, Li

, Wu

, et al. 2016. RED: A Java-MySQL software for identifying and visualizing RNA editing sites using rule-based and statistical filters. PLoS One, 11, e0150465.

24.

T.D.

, and Nacu

2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 26, 873–881.

25.

Zhang

, and Xiao

2015. Genome sequence-independent identification of RNA editing sites. Nat. Methods, 12, 347–350.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.02 MB