The Rice Word Landscape: A Detailed Catalogue of the Rice Motif Content in the Non-coding Regions

Abstract

Alongside established fields of molecular biology such as genomics, proteomics, and metabolomics, new areas of study are emerging that analyze specific parts of the genome, for example by predicting genes or different kinds of transcription factor binding sites (TFBSs). The goal of this study was to construct and analyze a catalogue of all statistically relevant putative functional octamer words or motifs (which we have termed the “motifome” of a given organism) found within first introns, promoters, the 5′ and 3′ untranslated regions (UTRs), and the entire genome of japonica rice, and to compare them to results obtained from a previous analysis performed on the Arabidopsis genome. We found a number of novel motifs in the different non-coding rice sequence sets. The diversity of motifs in rice was higher than in Arabidopsis, implying a higher mutational turnover. While common motifs were found between the two species, no motif pairs were shared, pointing to differences in the regulatory machinery of rice and Arabidopsis.

Introduction

Thus far, the entire genome sequence is available for only a handful of plant species. One of the key steps in bioinformatics analysis is to make sense of the raw sequence data of different organisms available in public databases, such as predicting, finding, and annotating all possible genes within the whole genome sequence. A number of gene prediction and motif discovery algorithms (Sandve and Drabløs, 2006) exist, which have been fashioned to this end. Motifs have been analyzed for only a single organism, namely Arabidopsis thaliana (Lichtenberg, 2009a).

The goal of this study was to enumerate and catalogue all possible octamer motifs in different parts of the japonica rice genome, as well as in the genome itself. Rice was chosen because its entire genome sequence is already available and well annotated. In addition, consensus sequences of a number of high-scoring motifs have been defined, and co-occurring motifs have also been defined and analyzed.

The genome of rice can also be compared to that of Arabidopsis. A cross-species comparison was therefore done to draw conclusions about species differences and similarities in gene regulation. Finally, motifs marked as statistically significant in this analysis can be used in further experimental studies concerning gene regulation.

Materials and Methods

Selection of rice sequences

The 5′ untranslated region (UTR), 3′ UTR, intron, and whole genome sequences for rice were downloaded from the MSU Rice Genome Annotation Project website at ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/ (Ouyang, 2007). The files all.con, all.intron, and all.utr were downloaded, and the file all.utr was split into two separate files, which contained the 5′ UTR and 3′ UTR sequences.

The core (250 bp), proximal (1000 bp), and distal (3000 bp) promoter sequences were downloaded from the Osiris database (Morris, 2008; http://www.bioinformatics2.wsu.edu/cgi-bin/Osiris/cgi/home.pl). Transcript sequences were used, and were shortened to prevent overlapping with neighboring genes.

Word statistical measure

These calculations are based on the algorithm presented in Lichtenberg (2009a). The statistical significance of a given word w is sign(w)=S·ln(S/E_S), where S is the number of sequences the word w occurs in, and E_S is the number of sequences the word is expected to occur in. E_S is derived from the probability of the word's occurrence under the background base distribution in rice (p_A=p_T=28.2%, p_C=p_G=21.8%). The probability p_w can be calculated with the following formula: $$p_w = \prod\nolimits_{i = 1}^{n} p_{X,i}$$ , where n is the length of the motif (n=8 for octamers), i is a running variable from 1 to n, and p_X,i is the probability of the i^th base in the word, where X={A,C,G,T}. Only word motifs not containing ambiguous letters were retained.
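The two quantities defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation: the function names `word_probability` and `significance` are hypothetical, and the background frequencies are the ones quoted in the text.

```python
import math

# Background base frequencies quoted in the text for rice.
BACKGROUND = {"A": 0.282, "T": 0.282, "C": 0.218, "G": 0.218}


def word_probability(word, background=BACKGROUND):
    """Return p_w, the product of the background probabilities of each base."""
    p = 1.0
    for base in word:
        # A KeyError here signals an ambiguous letter such as N.
        p *= background[base]
    return p


def significance(s, e_s):
    """Return S * ln(S / E_S) for an observed count S > 0 and expected count E_S > 0."""
    return s * math.log(s / e_s)


if __name__ == "__main__":
    print(word_probability("AAAAAAAA"))
    print(significance(3439, 239.66))
```

With the values listed in Table 2 for AAAAAAAA in core promoters (S = 3439, E_S = 239.66), `significance` returns roughly 9160.5, in line with the tabulated score.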

In the whole genome, the expected occurrence of w is E_S(w)=N_genome·p_w, where N_genome is the size of the rice genome, and p_w is the occurrence probability of the word. In the case of the other six sequence sets, E_S is calculated somewhat differently. We assume that the occurrence of a given word follows a Poisson distribution. Hence, the number of sequences the word is expected to occur in is $$E_S(w) = N_{sequences} \cdot \left( 1 - e^{-(N_S \cdot p_w)} \right)$$ , where N_sequences is the length of all sequences belonging to sequence set S, N_S is the number of sequences within a given sequence set, and p_w is the occurrence probability of the word.
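A minimal sketch of both expectations follows. The roles of the two counts in the Poisson formula are an assumption of this sketch: it treats the prefactor as the number of sequences in the set and the exponent as a per-sequence length (for example the set mean) multiplied by p_w, which is the reading under which the result is a number of sequences. The function names are hypothetical.

```python
import math


def expected_in_genome(genome_length, p_w):
    """Expected occurrences of a word in the whole genome: N_genome * p_w."""
    return genome_length * p_w


def expected_sequences(n_sequences, sequence_length, p_w):
    """Poisson estimate of how many sequences contain the word at least once.

    ASSUMPTION: n_sequences is the number of sequences in the set and
    sequence_length is a per-sequence length such as the set mean.
    """
    # 1 - exp(-lambda) is the Poisson probability of at least one occurrence.
    return n_sequences * (1.0 - math.exp(-sequence_length * p_w))


if __name__ == "__main__":
    p_w = 0.282 ** 8  # AAAAAAAA under the background distribution
    print(expected_sequences(24209, 247.86, p_w))
```

As an illustrative check under that assumption, 24,209 core promoters with the mean length of 247.86 bp from Table 1 and p_w = 0.282^8 give about 238.8, close to the 239.66 listed for AAAAAAAA in Table 2.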

Word clustering

For all seven sequence sets we matched all of the top 100 highest scoring words with each other. Two words belonged to the same cluster if the Hamming distance was at most 1 bp. The two words were also allowed to slide 1 bp alongside each other.
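The matching rule described above can be sketched as follows. Two points are assumptions of this sketch and are not stated in the text: when the words are slid by 1 bp only the overlapping bases are compared, and clusters are built by single linkage (any chain of matching words ends up in one cluster). The function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def same_cluster(w1, w2, max_mismatch=1, max_shift=1):
    """True if two equal-length words agree within max_mismatch, either
    aligned directly or slid by up to max_shift bases (overlap only)."""
    for shift in range(max_shift + 1):
        n = len(w1) - shift
        if hamming(w1[shift:], w2[:n]) <= max_mismatch:
            return True
        if hamming(w2[shift:], w1[:n]) <= max_mismatch:
            return True
    return False


def cluster_words(words):
    """Group words by single linkage using a small union-find structure."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if same_cluster(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())
```

For example, CCTCCTCC and TCCTCCTC from Table 2 fall into the same cluster under this rule because they are identical once slid by 1 bp.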

Word location distribution

We searched for the top 100 motifs of each of six sequence sets (core, proximal, and distal promoters, introns, 5′ UTRs, and 3′ UTRs), and calculated their location in each sequence of each set. The number of occurrences of different motifs at all positions was also calculated and mapped to an interval of [−N,−1] in the case of the three promoter sets, where N=250, 1000, or 3000 (for core, proximal, or distal promoters). Since the length of the sequences differs in all other sequence sets, the positions were normalized to an interval of [1,100]. The position occurrence frequency was then plotted as a curve. For introns, 5′ UTRs, and 3′ UTRs, the position occurrence frequency was also plotted from the beginning and end of the individual sequences in addition to normalization.
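The position mapping described above can be illustrated with a short sketch. Several details are assumptions made here, since the text does not specify them: the start of a word is used as its position, promoter coordinates are counted back from the 3′ end of each promoter sequence (so shortened promoters still end at −1), and normalized positions are assigned by integer binning. All function names are hypothetical.

```python
from collections import Counter


def find_all(sequence, word):
    """Yield every 0-based start index of word in sequence, overlaps included."""
    start = sequence.find(word)
    while start != -1:
        yield start
        start = sequence.find(word, start + 1)


def promoter_position(index, length):
    """Map a 0-based index to [-length, -1], counted back from the promoter end."""
    return index - length


def normalized_position(index, length, bins=100):
    """Map a 0-based index in a sequence of the given length to a bin in [1, bins]."""
    return index * bins // length + 1


def positional_frequency(sequences, words, promoter=False):
    """Count word occurrences per mapped position over a whole sequence set."""
    counts = Counter()
    for seq in sequences:
        for word in words:
            for idx in find_all(seq, word):
                if promoter:
                    counts[promoter_position(idx, len(seq))] += 1
                else:
                    counts[normalized_position(idx, len(seq))] += 1
    return counts
```

Calling `positional_frequency(promoters, top_words, promoter=True)` on a list of promoter strings would give the counts behind a curve of the kind described here, again under the assumptions listed above.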

Word pair statistical measure

For a word pair w1;w2, the probability of finding such a pair is equal to the product of the individual word probabilities: p_w1;w2=p_w1·p_w2. The significance value for a word pair can also be calculated similarly with p_w1;w2 in place of p_w.

GO term analysis

For six of the sequence sets (core, proximal, and distal promoters, introns, 5′ UTRs, and 3′ UTRs), a list of genes was determined containing at least one of the top 10 octamer words. A list of these rice gene identifiers was entered into the GO Enrichment Analysis Tool at the Rice Array Database. MSU GOSlim terms were retrieved for biological, cellular, and molecular functions. GO terms were accepted whose p value was at most 0.01. The top 10 octamer words and the GO terms associated with them were visualized for all six sequence sets and all three functional categories using the matrix2png web application software at http://chibi.ubc.ca/matrix2png/bin/matrix2png.cgi.

Results

Principles of investigation

The total occurrence of all 65,536 possible octamer motif words was enumerated in the entire rice genome, as well as in the 5′ UTRs, 3′ UTRs, all introns, and core, proximal, and distal promoters, and a corresponding significance value was assigned to each word. Octamers were chosen for this study, since they are short enough to be classified as TFBSs without being too specific, and are long enough to be statistically robust and diverse (shorter words such as hexamers would appear roughly once every ∼4000 bp; about once in every distal promoter).
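The enumeration step can be sketched as follows. This is an illustration only, with hypothetical function names; it counts, for each word, the number of sequences in which the word occurs at least once, and skips words containing ambiguous letters.

```python
from collections import Counter
from itertools import product


def all_words(k=8, alphabet="ACGT"):
    """Return every possible word of length k over the alphabet (4**k words)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def sequence_counts(sequences, k=8):
    """For each word, count the sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each word is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Keep only unambiguous words made of A, C, G, and T.
        counts.update(w for w in seen if set(w) <= set("ACGT"))
    return counts


if __name__ == "__main__":
    print(len(all_words(8)))  # 4**8 octamers
    print(len(all_words(6)))  # 4**6 hexamers
```

The word counts follow directly from the alphabet size: 4^8 = 65,536 octamers and 4^6 = 4096 hexamers, the latter being the source of the "once every ∼4000 bp" estimate.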

Analysis of occurrences of word motifs for all seven sequence sets

For all seven sequence sets, the top 100 highest scoring motifs were studied in further detail. A statistical overview of these sequence sets can be seen in Table 1. All motifs are listed in the worksheets of Supplementary Material 1 (see online supplementary material at http://www.liebertonline.com). The top 10 motifs from all seven sequence sets can be seen in Table 2. To test whether the found motifs were truly biologically relevant, we checked whether the individual octamer motifs perfectly matched any experimentally verified motifs listed in the PLACE database. Quite a large number of the top motifs corresponded to motifs found in the database (core promoters: 66, proximal promoters: 40, distal promoters: 66, 5′ UTRs: 12, 3′ UTRs: 32, introns: 44, whole genome: 43). This information can be found in Supplementary Material 1 (see online supplementary material at http://www.liebertonline.com).

Table 1.

Statistics for Different Sets of Sequences in Rice

Sequence set type	No. of sequences in the set	Minimum sequence length	Maximum sequence length	Average sequence length	Standard deviation	Total no. of nucleotides	Percentage of the genome
Core promoters	24,209	0	250	247.86	15.89	6,000,499	1.61%
Proximal promoters	24,209	0	1,000	945.49	177.92	22,889,345	6.15%
Distal promoters	24,209	0	3,000	2534.38	859.43	61,354,768	16.48%
5′ UTRs	32,591	1	6,648	259.19	341.35	8,447,211	2.27%
3′ UTRs	34,160	1	6,072	465.49	405.27	15,900,997	4.27%
Introns	251,812	5	18,327	399.11	634.66	100,500,735	26.99%
Whole genome	12	23,011,239	43,268,879	31,347,305	5,923,101	372,317,567	100.00%

Table 2.

Top 10 Octamer Words from Each of the Seven Sequence Sets in Rice

Motif	Occurrence (S)	Expected occurrence (E_S)	Significance (S·ln(S/E_S))*
Core promoters
AAAAAAAA	3439	239.66	9160.52
CCTCCTCC	1613	50.8191	5577.07
GAAAAAAA	2236	185.221	5569.64
AAAAAAAT	2301	239.642	5204.75
TCCTCCTC	1569	65.8256	4975.59
TTTTTTTT	2216	239.519	4930.21
AAAAAAAG	1889	185.221	4386.74
TAAAAAAA	2031	239.642	4340.52
TCTCTCTC	1501	85.2555	4305.22
AAAGAAAA	1751	185.221	3933.44
Proximal promoters
AAAAAAAA	9477	901.54	22294.8
TTTTTTTT	8454	901.019	18927.4
AAAAAAAT	8045	901.475	17608.7
ATTTTTTT	7754	901.084	16689.4
TAAAAAAA	7413	901.475	15618.9
GAAAAAAA	6507	698.965	14517.3
TTTTTTTA	7013	901.084	14390.1
TTTTTTTC	5708	698.394	11991.6
TTAAAAAA	6063	901.41	11556.1
AAAAAAAG	5422	698.965	11107.6
Distal promoters
AAAAAAAA	15495	2341.6	29280.7
TTTTTTTT	15147	2340.29	28287.5
AAAAAAAT	14222	2341.43	25656.9
ATTTTTTT	14219	2340.45	25654.4
TAAAAAAA	13446	2341.43	23502.5
TTTTTTTA	13264	2340.45	23009.2
GAAAAAAA	11849	1828.42	22143.3
TTTTTTTC	11495	1826.96	21142.3
TTAAAAAA	11487	2341.27	18270.3
TTTTTTAA	11446	2340.61	18167.4
5′ UTRs
CGCCGCCG	6729	42.6707	34053.3
CCGCCGCC	6130	42.6573	30452.3
GCCGCCGC	5924	42.6707	29224.6
CCTCCTCC	5231	71.5372	22452.2
TCCTCCTC	5162	92.6603	20752
CGGCGGCG	4238	42.6975	19485.1
CTCCTCCT	4344	92.6603	16714
GCGGCGGC	3702	42.6975	16520.1
GGCGGCGG	3372	42.7109	14731.6
TCGCCGCC	3278	55.2601	13383.9
3′ UTRs
TTTTTTTT	5819	631.96	12918.6
TTTTTTTC	3953	488.77	8263.09
ATTTTTTT	3865	632.00	6998.84
TTTCTTTT	3509	488.77	6916.9
CTTTTTTT	3496	488.77	6878.3
TTTTTCTT	3412	488.77	6630.05
TTTTCTTT	3408	488.77	6618.28
TTCTTTTT	3388	488.77	6559.5
AAAAAAAA	3563	632.32	6160.27
TTTTTTTG	3157	488.93	5888.34
Introns
TTTTTTTT	40822	3999.52	94831.4
ATTTTTTT	31559	3999.81	65188.6
TTTTTTTC	27640	3092.41	60540.3
TTTCTTTT	25151	3092.41	52715.2
TTTTCTTT	24923	3092.41	52010.3
CTTTTTTT	24168	3092.41	49691.3
TTTTTTTA	25600	3999.81	47522.4
TTTTTCTT	23287	3092.41	47015.2
TTTTGTTT	22986	3093.37	46101.2
TTTTTTTG	22550	3093.37	44795
Whole genome
TTTTTTTT	236977	14935.6	655055
AAAAAAAA	234176	14944.5	644390
ATATATAT	232878	14940	639592
TATATATA	224164	14940	607111
CGCCGCCG	118631	1881.98	491571
CGGCGGCG	118636	1883.16	491522
GGCGGCGG	102492	1883.75	409611
GCCGCCGC	102222	1881.98	408359
CCGCCGCC	102135	1881.38	407957

The motifs CARGNCAT, CARGCW8GAT, MARTBOX, −314MOTIFZMSBE1, and ABRECE3ZMRAB28 were consistently the highest ranking motifs, which matched the top 100 from the seven different sequence sets in japonica rice. They also matched motifs from the top 100 sets a total number of 223, 219, 105, 59, and 41 times, respectively. These 5 motifs correspond respectively to two MADS protein-binding sites, a T-box in a scaffold attachment region, a sugar-responsive element, and a CE3-coupling element involved in stress response, as annotated in the PLACE database.

It is of interest to note that a number of (TC)_n and (TTC)_n motifs were found to be overrepresented in the region [−39,−26] which corresponds to where the TATA-box is located in many Arabidopsis and rice core promoters. According to Bernard and associates (2010), such motifs also contribute to the regulation of transcription. Motifs from the top 100 motifs found in japonica rice core promoters such as CTCTCTCT (score: 3796.82, rank: 12), TCTCTCTC (score: 4305.22, rank: 9), and TCTTCTTC (score: 1692.44, rank: 89) were found in this region.

We also studied how many sequence sets each top-scoring motif occurred in. Overall the algorithm found 311 distinct motifs: 121, 108, 20, 18, 33, and 11 motifs occurred in exactly one, two, three, four, five, or six sets, respectively (Fig. 1). No motifs were common to all seven sets. A list of these common motifs may be found in Supplementary Material 1 (see online supplementary material at http://www.liebertonline.com). For these 311 motifs we studied their distribution among different combinations of sequence sets. Overall 28 different set combinations were found, 13 of which occurred 10 or more times (Table 3). High-scoring motifs occurring in the top 100 sets of both the 3′ UTRs and introns were the most frequent, with 39 occurrences. A number of motifs were found to be unique to a specific set (34 to 5′ UTRs, 29 to 3′ UTRs, 25 to introns, 20 to core promoters, and 10 to the whole genome). Twenty-seven motifs were common to both the core promoters and 5′ UTRs, which suggests that there may be some overlap in the regulation between these two kinds of sequences. Twenty-one motifs were common to both proximal and distal promoters, meaning that these motifs might take part in the special regulation of genes. Another 17 motifs were common to the whole genome as well as 5′ UTRs. A small number of motifs were promiscuous, occurring in many different sets: 13 motifs were found in the set combination 12345, 11 in 13467, 11 in 123467, and 11 in 1234 (1: whole genome, 2: core promoters, 3: proximal promoters, 4: distal promoters, 5: 5′ UTRs, 6: 3′ UTRs, 7: introns). In total, 68 motifs were found in the whole genome as well as proximal and distal promoters (the set combination 134).

FIG. 1.

Number of top 100 words common to different numbers of sequence sets in rice.

Table 3.

Frequency of Different Sequence Set Combinations That 311 Motifs from the Top 100 Motifs Occur in (1: Whole Genome, 2: Core Promoters, 3: Proximal Promoters, 4: Distal Promoters, 5: 5′ UTRs, 6: 3′ UTRs, 7: Introns)

Sequence set combination	Frequency
67	39
5	34
6	29
25	27
7	25
34	21
2	20
15	17
12345	13
13467	11
123467	11
1234	11
1	10
12346	8
134	7
1347	6
235	5
47	3
467	2
347	2
3	2
125	2
4	1
234	1
145	1
1345	1
12347	1
12	1
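The tally in Table 3 can be expressed as a short sketch. The labelling helper shows how a combination such as 67 could be derived from per-set top 100 lists; the `top_sets` input structure and all function names are hypothetical. The dictionary `TABLE_3` is transcribed from the table above and is used only to cross-check the totals quoted in the text.

```python
from collections import Counter

# Frequencies transcribed from Table 3 (set combination -> number of motifs).
TABLE_3 = {
    "67": 39, "5": 34, "6": 29, "25": 27, "7": 25, "34": 21, "2": 20,
    "15": 17, "12345": 13, "13467": 11, "123467": 11, "1234": 11, "1": 10,
    "12346": 8, "134": 7, "1347": 6, "235": 5, "47": 3, "467": 2, "347": 2,
    "3": 2, "125": 2, "4": 1, "234": 1, "145": 1, "1345": 1, "12347": 1,
    "12": 1,
}


def combination_label(word, top_sets):
    """Concatenate the numbers of all sets whose top list contains the word."""
    return "".join(str(k) for k in sorted(top_sets) if word in top_sets[k])


def combination_frequencies(top_sets):
    """Count how many distinct words fall into each set combination."""
    words = set().union(*top_sets.values())
    return Counter(combination_label(w, top_sets) for w in words)


def motifs_by_set_count(table):
    """Sum the frequencies by the number of sets in each combination."""
    totals = Counter()
    for label, freq in table.items():
        totals[len(label)] += freq
    return dict(totals)


if __name__ == "__main__":
    print(sum(TABLE_3.values()))
    print(motifs_by_set_count(TABLE_3))
```

Summing the table this way gives 311 motifs in total and 121, 108, 20, 18, 33, and 11 motifs for one to six sets, matching the figures quoted in the text.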

What is interesting to note is that no nullmers (motifs that do not occur in a given sequence set) were found in any of the sequence sets in rice, from which we may infer that the motif diversity in rice is higher than that of Arabidopsis, for which sets of nullmers of different sizes were found in core promoters, introns, 5′ UTRs, and 3′ UTRs (Lichtenberg, 2009a).

The top 100 motifs of each of the seven individual sequence sets were also compared between rice and Arabidopsis. Overall, between Arabidopsis and japonica, 10, 11, 2, 5, 19, 27, and 27 motifs were common to both sets of core promoters, proximal promoters, distal promoters, 5′ UTRs, 3′ UTRs, introns, and genomes, respectively (Fig. 2). These data may be seen in Supplementary Material 1 (see online supplementary material at http://www.liebertonline.com). Here those PLACE motifs with hits to most of the top 100 motifs from all different sequence sets were the motifs CARGNCAT, CARGCW8GAT, MARTBOX, and −314MOTIFZMSBE1 (functions mentioned previously), with 25, 24, 9, and 6 hits, respectively.

FIG. 2.

Number of top 100 words common to the corresponding sequence set in Arabidopsis.

Word clusters

Clusters of words coming from the top 100 words were defined with a consensus sequence. A total of 10, 9, 6, 11, 7, 2, and 6 clusters were made for core, proximal, and distal promoters, 5′ UTRs, 3′ UTRs, introns, and the whole genome, respectively (51 in total). Each cluster had at least three members, and the largest cluster was found in introns containing 50 members, represented by the consensus sequence NNHYNNNTYNT. The average number of cluster members was 6.4, with a standard deviation of 7.36. The clustering data can be seen in Supplementary Material 2 (see online supplementary material at http://www.liebertonline.com).

Of particular interest are the consensus sequences KAAAAAAAW and KAAAAAAAAK, which correspond to the PLACE motif ATRICHPSPETE, annotated in the PLACE database as an A/T-rich sequence (Sandhu et al., 1998). The motifs HAAAATTTT, HAAAWTTTW, and AAAWTTWA correspond to the PLACE motifs CARGCW8GAT and CARGNCAT, which are poly A/T motifs involved in binding MADS proteins (Tang and Perry, 2003; Wang et al., 2004). The consensus sequences CCACCHCC and CCACCDCC form part of the P-box, which is necessary but not sufficient for elicitor or light responsiveness (Logemann, 1995). The motif GCCGCC (GCCCORE) matched our consensus sequences CGCCGCCGCC, GCCGCCGCCG, CCGCCGBC, KCGCCGCCN, KCCGCCTCC, and BCGCCGCCS. This motif is present in the promoter of a number of pathogen-responsive genes (Brown, 2003). The PLACE motif GRWAAW corresponded to our motifs BAAAAAAWN, KAAAAAAAAK, and KAAAAAAAW. This motif, the GT-1 binding site, functions in light responsiveness in many genes (Zhou, 1999). The PLACE motif GAAAAA (GT1GMSCAM14) matches our motifs BAAAAAAWN, KAAAAAAAAK, and KAAAAAAAW, and is also a GT-1 binding site just like GRWAAW, but also takes part in pathogen response and salt-induced stress (Park et al., 2004). The motif TGTCTC (ARFAT) is part of the auxin response factor (ARF), which is present in the promoter of a number of early auxin response genes (Hagen and Guilfoyle, 2003). The motif TTATTT (TATABOX5) corresponds to the well-known TATA box, matched by our motifs WWAKTTTTTTN and ADNTTTWTTWN.
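The correspondences listed above involve sequences written with IUPAC ambiguity codes. One way to test such a correspondence is sketched below, under an assumption that is not taken from the text: a shorter motif "matches" a consensus sequence if it can be aligned inside it so that every column shares at least one possible base. Only the given strand is checked, and the function names are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """True if two IUPAC symbols share at least one concrete base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def motif_hits(motif, consensus):
    """Return the offsets at which motif aligns inside consensus with
    every column compatible (ASSUMED definition of a match)."""
    width = len(motif)
    return [
        offset
        for offset in range(len(consensus) - width + 1)
        if all(
            compatible(m, c)
            for m, c in zip(motif, consensus[offset:offset + width])
        )
    ]


if __name__ == "__main__":
    print(motif_hits("GCCGCC", "CGCCGCCGCC"))
    print(motif_hits("GRWAAW", "KAAAAAAAAK"))
```

Under this definition, GCCGCC aligns inside CGCCGCCGCC and GRWAAW aligns at the start of KAAAAAAAAK, consistent with the correspondences described in the text, whereas TGTCTC does not align anywhere inside KAAAAAAAW.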

In order to check whether the consensus sequences themselves were real, we calculated the score value for each consensus sequence according to the algorithm (Table 4). As we can see, the score value for the consensus sequences is high except in only a few cases. For example, the score for the consensus sequence NNHYNNNTYNT is −15500.6. This consensus sequence is found in introns, from a cluster with 50 members. A possible reason for such a low score is that the sequence is very unspecific, and therefore occurs very frequently (235,282 times). Its expected occurrence is 251,305, with a difference of 16,023 (6.81% of 235,282). At such high occurrences, a small difference such as this is more significant. Another consensus sequence found in core promoters, GTGGGAAAM had a score value of −6.97284. These two consensus sequences were regarded as statistical artifacts not representing true binding sites.
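As a worked illustration of how the two negative scores arise, the significance formula S·ln(S/E_S) can be evaluated with the occurrence values quoted above. The results below are rounded; small deviations from the tabulated scores are to be expected because the expected occurrences are themselves rounded.

$$235{,}282 \cdot \ln\left(\frac{235{,}282}{251{,}305}\right) \approx 235{,}282 \cdot (-0.0659) \approx -1.55 \times 10^{4}$$

$$35 \cdot \ln\left(\frac{35}{42.7159}\right) \approx 35 \cdot (-0.1992) \approx -6.97$$

In both cases the observed occurrence is lower than the expected one, so the logarithm and therefore the score is negative; the large magnitude of the first score comes from the prefactor S rather than from a large relative deviation.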

Table 4.

Real and Hypothetical Occurrences and Score of Consensus Sequences in Seven Sequence Sets in Rice

Motif	Occurrence (S)	Expected occurrence (E)	Score value
Core promoters
BAAAAAAWN	5984	1195.15	9639.14
CCACCHCC	2043	140.656	5466.78
CGCCGCCGCC	447	1.4378	2565.53
CTCTCCTCY	959	32.9242	3233.44
GTGGGAAAM	35	42.7159	−6.97284
MCGGCCCAW	843	50.8418	2367.35
MTTTTTTTTH	1600	93.933	4536.28
TCCTCCTCY	1235	32.9242	4476.4
YCCCTCCCC	663	19.6169	2334.01
Proximal promoters
AAAWTTWA	8840	3408.98	8423.41
CGCCGCCGCC	1214	5.48413	6555.38
CGGCGGCGGC	891	5.4893	4534.78
CTCTCTCYC	2072	125.352	5812.26
HAAAATTTT	3680	707.887	6066.05
NWWNAAAAN	22002	20613.5	1434.26
TCCTCCTCCT	764	15.439	2980.88
WWAKTTTTTTN	4855	512.54	10915.9
Distal promoters
ADNTTTWTTWN	17450	6597.53	16972.7
ANWNAAAAANH	20009	10436.7	13023
GCCGCCGCCG	2502	14.702	12852.4
GCGGCGGCGG	2476	14.7158	12690.7
HAAAWTTTW	14788	6597.86	11935
TCCTCCTCCT	1382	41.3619	4849.34
5′ UTRs
AGGAGGAGGA	878	5.71144	4420.88
BCCTCCKCB	8536	298.995	28609.4
CCACCDCC	4719	197.999	14964.4
CCGCCGBC	7058	140.435	27647.4
CGGCGGCGV	3778	30.6579	18187.5
CTCTCTCYC	2722	46.3477	11086.6
KCCGCCTCC	1911	27.6277	8096.08
KCGCCGCCN	4914	97.8584	19244.8
NCCGCCGCK	3338	97.8584	11781.8
YCCTCCCCY	2574	63.3612	9535.03
YCTCCCTCC	1645	35.7764	6297.4
3′ UTRs
ATATATRTG	1084	245.36	1610.49
KAAAAAAAAK	937	159.128	1661.28
MTTCTTTTB	2494	620.49	3469.49
NTTKYYYYN	26176	23407.3	2926.32
RTTTGTTTS	1443	377.967	1933.15
TNTKTTSTTB	6476	1230.01	10757.1
Introns
KAAAAAAAW	20166	4000.93	32617.9
NNHYNNNTYNT	235282	251305	−15500.6
Whole genome
GVSGGCGGCGM	12314	251.758	47901.7
NTWTTTTTN	12481	134783	523683
BAAANAAAN	23423	2701.61	508364
BCGCCGCCS	130981	294.399	102512
TCCTCCTCCT	441426	105890	1.15E+06
AGGAGGAGGA	637877	251.21	48746.7

Localization of word motifs across sequences

In order to see how our top 100 words localize along the sequences in each sequence set, we performed a complete sequence set search for all top 100 words of all sets. The results can be seen in Figures 3–14. In the case of promoters, a number of regulatory motifs, such as TFBSs, are localized to a specific region of the promoter. In the core promoters, it is interesting to note that there is a hump between −80 and −35 bp, which is where the core transcription machinery binds to the DNA (Fig. 3). Such a bulge was also observed in the case of Arabidopsis core promoters (Lichtenberg, 2009a). Motif localization for proximal and distal promoters can be seen in Figures 4 and 5.

FIG. 3.

Word positional frequency across core promoters.

FIG. 4.

Word positional frequency across proximal promoters.

FIG. 5.

Word positional frequency across distal promoters.

FIG. 6.

Word positional frequency across introns.

FIG. 7.

Word positional frequency from the beginning of introns. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

FIG. 8.

Word positional frequency from the end of introns. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

FIG. 9.

Word positional frequency across 3′ UTRs.

FIG. 10.

Word positional frequency from the beginning of 3′ UTRs. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

FIG. 11.

Word positional frequency from the end of 3′ UTRs. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

FIG. 12.

Word positional frequency across 5′ UTRs.

FIG. 13.

Word positional frequency from the beginning of 5′ UTRs. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

FIG. 14.

Word positional frequency from the end of 5′ UTRs. In order to make the diagram more readable, only the first 500 bp were taken from the distribution. The positional frequency values all decrease after 500 bp.

Also interesting is the positional occurrence frequency of motifs in the intron sequences (Fig. 6). As we can see the curve is a concave one, with maxima at both ends of the curve. This is where we would expect the splicing machinery to form, therefore regulatory motifs are expected to be found at these positions. Goren and colleagues found that regulatory motifs were overabundant at the exon/intron boundaries in a number of species (Goren et al., 2010).

For the intron sequences and the 3′ and 5′ UTR sequences we studied the positional frequency of the top 100 motifs from the beginning of the sequence to the end, as well as from the end of the sequence to the beginning. In introns, there is a frequency peak at 12 bp from the beginning of the sequence with an occurrence of 16,026. There was also another peak at 22 bp from the end of the intron with an occurrence of 6608 (Figs. 7 and 8). This is because regulatory proteins do not bind exactly at the exon-intron junction, but further inside the intron itself.

The situation is the same for 3′ UTR sequences (Fig. 9), where there is a maximum occurrence of the top 100 motifs of 682 at 107 bp from the start of the sequence, and 660 at 104 bp from the end of the sequence (Figs. 10 and 11). Of these 660 sequences, 157 are poly(A/T) sequences (which correspond to the sequence WWWWWWWW), which occur frequently at the end of 3′ UTR sequences. A₈ occurs 16 times, A₇ 35 times, and A₆ 46 times. In 5′ UTRs (Fig. 12) there is a peak of 1789 occurrences 15 bp from the start of the sequence, and 2145 at 68 bp from the end of the sequence (Figs. 13 and 14).

Co-occurrences of words

Genes are usually regulated by more than one regulatory motif, and often transcription factors join together to form regulatory complexes, especially in the proximal and core promoter as well as introns. Therefore we studied the distribution of motif pairs in the core, proximal and distal promoters, introns, 5′ UTRs, and 3′ UTRs in japonica rice. We counted the number of sequences that a given motif pair occurs in, as well as the number of sequences in which it was expected to occur based on the background base distribution. A list of the top 10 highest scoring motif pairs from the previously mentioned six sequence sets from japonica can be seen in Table 5.
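The counting step for motif pairs can be sketched as follows. Two details are assumptions of this sketch, since the text does not spell them out: pairs are treated as unordered and reported in alphabetical order, and a pair made of the same word twice is counted only in sequences with at least two non-overlapping occurrences of that word. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def pair_sequence_counts(sequences, words):
    """Count, for each unordered word pair, the sequences containing both words."""
    counts = Counter()
    for seq in sequences:
        present = sorted({w for w in words if w in seq})
        for head, tail in combinations_with_replacement(present, 2):
            # ASSUMPTION: a self-pair needs two non-overlapping occurrences;
            # str.count counts non-overlapping matches.
            if head == tail and seq.count(head) < 2:
                continue
            counts[(head, tail)] += 1
    return counts


if __name__ == "__main__":
    demo = ["CGCCGCCGCC", "CGCCGCCGAAAA", "TTTT"]
    top = ["CGCCGCCG", "GCCGCCGC", "CCGCCGCC"]
    for pair, n in sorted(pair_sequence_counts(demo, top).items()):
        print(pair, n)
```

The resulting counts play the role of S for a word pair, to which the same significance formula used for single words can then be applied.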

Table 5.

Top 10 Octamer Word Pairs from Six Sequence Sets (Core, Proximal and Distal Promoters, Introns, 5′ UTRs, and 3′ UTRs) from Rice

Head motif	Tail motif	Occurrence (S)	Expected occurrence (E_S)	Significance (S·ln(S/E_S))*
Core promoters
AAAAAAAA	AAAAAAAA	2252	233.822	5100.81
CCTCCTCC	TCCTCCTC	1020	13.5778	4405.5
CTCTCTCT	TCTCTCTC	1162	29.5253	4267.62
CTCCTCCT	TCCTCCTC	1024	17.5913	4161.61
CCTCCTCC	CTCCTCCT	920	13.5778	3878.66
AAAAAAAA	GAAAAAAA	1679	180.704	3742.65
CGCCGCCG	GCCGCCGC	685	3.72585	3571.67
AAAAAAAA	AAAAAAAG	1502	180.704	3180.78
CCGCCGCC	GCCGCCGC	618	3.72468	3158.91
CCGCCGCC	CGCCGCCG	608	3.72468	3097.88
Proximal promoters
AAAAAAAA	AAAAAAAA	6301	879.879	12404.6
AAAAAAAA	AAAAAAAT	5826	879.815	11013.3
AAAAAAAA	GAAAAAAA	5085	682.1	10215.1
AAAAAAAA	TAAAAAAA	5375	879.815	9727.69
ATTTTTTT	TTTTTTTT	5317	878.925	9570.42
TTTTTTTT	TTTTTTTT	5294	878.861	9506.45
CGCCGCCG	GCCGCCGC	1919	14.2095	9413.94
AAAAAAAT	TAAAAAAA	5160	879.751	9128.31
CCGCCGCC	GCCGCCGC	1856	14.205	9043.52
CCGCCGCC	CGCCGCCG	1821	14.205	8838.31
Distal promoters
CGCCGCCG	GCCGCCGC	4242	38.0696	19994.1
AAAAAAAA	AAAAAAAT	11983	2286.92	19847.2
AAAAAAAA	AAAAAAAA	11936	2287.08	19721.7
CCGCCGCC	GCCGCCGC	4188	38.0577	19687.3
ATTTTTTT	TTTTTTTT	11776	2284.68	19310.7
CCGCCGCC	CGCCGCCG	4112	38.0577	19254.7
CGGCGGCG	GCGGCGGC	4070	38.1174	19009.9
TTTTTTTT	TTTTTTTT	11507	2284.52	18604.5
CGGCGGCG	GGCGGCGG	3956	38.1294	18363.8
GCGGCGGC	GGCGGCGG	3940	38.1294	18273.5
Introns
TTGTTTTC	TTGTTTTC	561	14104.8	−1808.97
TGTTCTTT	TGTTCTTT	563	14104.8	−1813.42
CTTTTGTT	CTTTTGTT	612	14104.8	−1920.17
TTCTGTTT	TTCTGTTT	623	14104.8	−1943.59
TGTTTTCT	TGTTTTCT	628	14104.8	−1954.17
TGTTGTTT	TGTTGTTT	650	14113.4	−2000.64
TTTGTTGT	TTTGTTGT	651	14113.4	−2002.72
TTTCTTGT	TTTCTTGT	678	14104.8	−2057.81
TTGTTTCT	TTGTTTCT	684	14104.8	−2070
TTTCTGTT	TTTCTGTT	699	14104.8	−2100.23
5′ UTRs
CGCCGCCG	GCCGCCGC	5309	7.06088	35159.3
CCGCCGCC	CGCCGCCG	5245	7.05866	34673.5
CCGCCGCC	GCCGCCGC	5024	7.05866	32996.3
CGCCGCCG	CGCCGCCG	3824	7.06088	24070.1
CCGCCGCC	CCGCCGCC	3592	7.05645	22387.2
GCCGCCGC	GCCGCCGC	3342	7.06088	20585.9
CCTCCTCC	TCCTCCTC	3895	25.7293	19552.2
CGGCGGCG	GCGGCGGC	3157	7.06976	19262.6
CTCCTCCT	TCCTCCTC	3909	33.3334	18624.3
CGGCGGCG	GGCGGCGG	2927	7.07198	17636.9
3′ UTRs
TTTTTTTT	TTTTTTTT	3828	866.345	5687.7
TTTTTTTC	TTTTTTTT	2946	670.59	4360.22
CTTTTTTT	TTTTTTTT	2542	670.59	3387.34
TTTCTTTT	TTTTCTTT	2162	518.721	3086.09
TTTTTTCT	TTTTTTTC	2111	518.721	2962.9
ATTTTTTT	TTTTTTTT	2635	866.408	2930.86
TTTTTTTG	TTTTTTTT	2185	670.798	2580.27
TTTTCTTT	TTTTTCTT	1914	518.721	2498.89
TTCTTTTT	TTTCTTTT	1898	518.721	2462.07
CTTTTTTT	TCTTTTTT	1847	518.721	2345.6

A list of the top 100 highest scoring motif pairs for each sequence set can be found in Supplementary Material 3 (see online supplementary material at http://www.liebertonline.com). Furthermore, we counted the number of motif pairs which were either unique to one sequence set, or which were common to two, three, or four sequence sets. As we can see in Figure 15, there were 316, 56, 24, and 25 such motif pairs. No motif pairs were found in common with Arabidopsis.

FIG. 15.

Number of top 100 word pairs common to different numbers of sequence sets in rice.

Functional categorization

In order to see what kinds of genes contained the highest scoring motifs, we studied the distribution of the top 10 words in six of the seven sequence sets (core, proximal and distal promoters, intron, 5′ UTRs, and 3′ UTRs). We searched for Gene Ontology (MSU GOSlim) terms used in connection with a given gene at the Rice Array Database, where we performed GO Enrichment Analysis (Jung, 2008). The GO terms from biological, cellular, and molecular functional categories associated with the top 10 motifs from the six sets were visualized with matrix2png software (Pavlidis and Noble, 2003). The result of this analysis can be seen in the figures associated with Supplementary Material 4 (see online supplementary material at http://www.liebertonline.com). Here GO terms strongly associated with a motif (low p value) are denoted with a green box, while GO terms loosely associated with a motif (higher p value) are denoted with a red box. Where the p value is not present or is higher than 0.01, the box is colored grey.

Discussion

A number of motifs found in our analysis match experimentally verified regulatory motifs. Besides this, a major finding of the application of the algorithm to rice is the prediction of a number of novel putative motifs of as yet unknown function. Of the top 100 motifs from different sequence sets, those common to the core promoter and 5′ UTR might prove to be of significant interest. These motifs can be used in further studies or verified experimentally. The top 100 highest-scoring motifs were also clustered and a consensus sequence defined for them. In contrast to the study in Arabidopsis, the statistical measure of these consensus sequences was also calculated; the majority were found to be statistically significant. We also predict a number of regulatory motif modules (pairs) in six of these non-coding sequence sets based on the co-occurrence of motifs from the top 100 motif sets. Another novel type of analysis compared to Arabidopsis was the position-specific analysis of the frequency of the octamer words at both ends of the intron sequences and the 3′ UTRs, allowing us to locate the regions where motifs bound by the splicing and transcription machinery occur.

Since the rice word landscape has been analyzed, it is now possible to compare it with the word landscape of Arabidopsis and draw novel insights therefrom. The importance of our findings lies in the fact that, out of the top 100 motifs found for each of the seven sequence sets, 190 were found in at least two of the sequence sets in rice, while 101 motifs were found to be common between rice and Arabidopsis within the corresponding sequence sets. However, when comparing word pair content between the two species, no pairs were found to be common to Arabidopsis and rice, suggesting that the overall molecular regulatory networks in these two species are different. The reason could be that one is a monocot and the other a dicot. Furthermore, we found no nullmers in rice, indicating that the sequence background in rice results in a larger variability in rice motifs, suggesting a higher mutational turnover of motifs. However, since the shared motifs are conserved between rice and Arabidopsis, we can infer that these are general regulatory motifs which could be conserved across a large number of species.

A genomic comparison of the total motif content of species such as Arabidopsis and japonica rice can serve as a basis for studying how the regulatory machinery found in non-coding sequences differs between species. Comparing total motif content between species could be used to measure how closely related they are. According to some views, changes in gene regulation are responsible for speciation (Ohno, 1970). Therefore, a motif that scores highly in one species but poorly in another may account for differences between the two. Our future plans include the analysis of the Oryza sativa indica and Brachypodium distachyon genomes. Comparison of the two Oryza sativa genomes would be of interest, as the two subspecies are very closely related, and the indica genome is therefore expected to share a very high number of motifs with the japonica genome.
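One simple way to turn a motif content comparison into a relatedness score is the Jaccard index of two motif sets. The sketch below is an assumption made for illustration, not a measure used in this study; the function name `jaccard_index` and the example motif lists are invented.

```python
def jaccard_index(motifs_a, motifs_b):
    """Fraction of motifs shared by two sets, relative to their union."""
    a, b = set(motifs_a), set(motifs_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Invented top-motif lists for two hypothetical genomes.
genome_1 = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
genome_2 = ["CCCCCCCC", "GGGGGGGG", "TTTTTTTT"]
print(jaccard_index(genome_1, genome_2))
```

A score near 1 would indicate nearly identical top-ranked motif content, as might be expected for closely related genomes, while a score near 0 would indicate little overlap. The index ignores motif scores and ranks, so a rank-aware measure may be more informative in practice.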

Lichtenberg et al. (2009b) studied the distribution of word motifs in unidirectional and bidirectional promoters involved in DNA repair pathways in humans and found a subtle, yet still discernible, signature for bidirectional promoters. Similar studies comparing differently regulated genes could show which specific motifs are responsible for the regulation of different biochemical pathways or physiological processes. Similar motif content may point to regulatory overlaps between different pathways or processes, such as the ABA-dependent and ABA-independent abiotic stress pathways in plants (Yamaguchi-Shinozaki and Shinozaki, 2005).

Conclusion

The total octamer motif content of rice has been enumerated and analyzed. Although the algorithm applied to the rice non-coding sequences is not new, a number of novel insights and improvements resulted from the analysis of the japonica genome and its comparison with the corresponding Arabidopsis sequence sets. The present study therefore serves as a follow-up analysis of both plant genomes. Studying and comparing the word landscapes of different species could open a new chapter in genomics: the analysis of an organism's “motifome,” which we define as the listing and statistical ranking of all possible motifs in a given genome.

Footnotes

Author Disclosure Statement

The authors declare that no conflicting financial interests exist.

References

Bernard, Brunaud, Lecharny. 2010. TC-motifs at the TATA-box expected position in plant genes: a novel class of motifs involved in the transcription regulation. BMC Genomics, 11:166.

Brown R.L., Kazan, McGrath K.C., Maclean D.J., Manners J.M. 2003. A role for the GCC-box in jasmonate-mediated activation of the PDF1.2 gene of Arabidopsis. Plant Physiol, 132:1020–1032.

Goren, Kim, Amit, Vaknin, Kfir, Ram, Ast. 2010. Overlapping splicing regulatory motifs—combinatorial effects on splicing. Nucleic Acids Res, 38:3318–3327.

Hagen, Guilfoyle. 2003. Auxin-responsive gene expression: genes, promoters and regulatory factors. Plant Mol Biol, 49:373–385.

Jung K.H., Dardick, Bartley L.E., et al. 2008. Refinement of light-responsive transcript lists using rice oligonucleotide arrays: evaluation of gene-redundancy. PLoS One, 3:e3337.

Lichtenberg, Jacox, Welch J.D., et al. 2009b. Word-based characterization of promoters involved in human DNA repair pathways. BMC Genomics, 10, Suppl 1:S18.

Lichtenberg, Yilmaz, Welch J.D., et al. 2009a. The word landscape of the non-coding segments of the Arabidopsis thaliana genome. BMC Genomics, 10:463.

Logemann, Parniske, Hahlbrock. 1995. Modes of expression and common structural features of the complete phenylalanine ammonia-lyase gene family in parsley. Proc Natl Acad Sci USA, 92:5905–5909.

Morris R.T., O'Connor T.R., Wyrick J.J. 2008. Osiris: an integrated promoter database for Oryza sativa L. Bioinformatics, 24:2915–2917.

Ohno. 1970. Evolution by Gene Duplication. Heidelberg: Springer-Verlag.

Ouyang, Zhu, Hamilton, et al. 2007. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res, 35, Database issue:D883–D887.

Park H.C., Kim M.L., Kang Y.H., et al. 2004. Pathogen- and NaCl-induced expression of the SCaM-4 promoter is mediated in part by a GT-1 box that interacts with a GT-1-like transcription factor. Plant Physiol, 135:2150–2161.

Pavlidis, Noble W.S. 2003. Matrix2png: a utility for visualizing matrix data. Bioinformatics, 19:295–296.

Sandhu J.S., Webster C.I., Gray J.C. 1998. A/T-rich sequences act as quantitative enhancers of gene expression in transgenic tobacco and potato plants. Plant Mol Biol, 37:885–896.

Sandve G.K., Drabløs. 2006. A survey of motif discovery methods in an integrated framework. Biol Direct, 1:11.

Tang, Perry S.E. 2003. Binding site selection for the plant MADS domain protein AGL15: an in vitro and in vivo study. J Biol Chem, 278:28154–28159.

Wang, Caruso L.V., Downie A.B., Perry S.E. 2004. The embryo MADS domain protein AGAMOUS-Like 15 directly regulates expression of a gene encoding an enzyme involved in gibberellin metabolism. Plant Cell, 16:1206–1219.

Yamaguchi-Shinozaki, Shinozaki. 2005. Organization of cis-acting regulatory elements in osmotic- and cold-stress-responsive promoters. Trends Plant Sci, 10:88–94.

Zhou D.X. 1999. Regulatory mechanism of plant gene transcription by GT-elements and GT-factors. Trends Plant Sci, 4:210–214.
