Simultaneously Learning DNA Motif Along with Its Position and Sequence Rank Preferences Through Expectation Maximization Algorithm

Abstract

Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses a pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveal potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.

1. Introduction

Motif finding is an important classical bioinformatics problem. Given a set of biopolymer sequences (DNA or proteins), the motif-finding problem aims to identify the recurring patterns (motifs) in them. Motif finders can generally be classified into two approaches: combinatorial searching and probabilistic modeling. The former approach enumerates the consensus patterns that are over-represented in the set of sequences. Using indexing data structures [e.g., suffix tree (Pavesi et al., 2001), suffix array (Lam et al., 2002), and hash table (Raphael et al., 2004)], it can efficiently identify short consensus motifs. Weeder (Pavesi et al., 2001), Trawler (Ettwiller et al., 2007), YMF (Sinha and Tompa, 2000), and DREME (Bailey, 2011) are a few examples representing this line of approach. On the other hand, (most) probabilistic modeling approaches represent motifs using position-weighted matrices (PWM) (Sinha, 2006). A PWM represents a length-w DNA motif as a 4×w matrix. It is more informative than a consensus pattern, but it is also more difficult to compute. The high computational complexity of the probabilistic modeling approach is a formidable bottleneck for its practical use. Expectation maximization (Bailey and Elkan, 1994) and Gibbs sampling (Roth et al., 1998) are the two most common approaches to finding a PWM, but they require long running times. Recently, some hybrid algorithms combined both approaches to get a good balance between accuracy and efficiency (e.g., Sharov and Ko, 2009; Linhart et al., 2008; and Kulakovskiy et al., 2010).

By only examining the over-representation of sequence patterns, the previous generation of motif finders often misses some real motifs and generates many false positives. On the other hand, additional information about the input sequences has been found to be helpful for improving motif finding. For example, some transcription factor (TF) binding motifs (e.g., TATA-box) are localized to certain intervals with respect to the transcription start site(s) (TSS) of the gene. In this case, the position information can help to filter spurious sites. In protein-binding microarray (PBM) (Berger and Bulyk, 2006) data, the de Bruijn sequences are ranked by their binding affinities, and we expect the correct motif to occur in the high-ranking sequences; such data has a rank preference. In the chromatin immunoprecipitation sequencing (ChIP-Seq) data (Valouev et al., 2008), the ChIPed TF's motif (the ChIPed TF is the TF pulled down in the ChIP experiment) prefers to occur in sequences with high ChIP intensity and also near the ChIP peak summits (thus having both position and rank preference). Hence, if we know the position preference and the sequence rank preference of the TF motifs in the input sequences, we can improve motif finding. In fact, many existing motif finders already utilize such additional information. MDscan (Liu et al., 2002) only considers high-ranking sequences to generate its initial candidate motifs. Other programs allow users to specify the prior distribution of position preference or sequence rank preference (Bailey and Elkan, 1994; Pavesi et al., 2001; Bailey, 2011; Kulakovskiy et al., 2010; Hu et al., 2010) or add such preferences as a prior knowledge component in their scoring functions (Chen et al., 2007; Narang et al., 2010; Linhart et al., 2008; Keilwagen et al., 2011; Frith et al., 2004). However, the users may not know the correct prior(s) to begin with. Even worse, different motifs may have different preferences. For example, in ChIP-Seq experiments, some motifs prefer to occur in high-ranking sequences and at the center of the ChIP peak summit while others do not.

To resolve such problems, we propose a novel motif-finding algorithm called SEME (sampling with expectation maximization for motif elicitation). SEME assumes the set of input sequences is a mixture of two models: a motif model and a background model. It uses an EM-based algorithm to learn the motif pattern (PWM), position preference, and sequence rank preference at the same time, instead of asking users to provide them as inputs. SEME does not assume the presence of both preferences but automatically detects them during the motif refinement process through statistical significance testing. We also observe that EM algorithms are generally slow in analyzing large-scale, high-throughput data. Speeding up EM using a suffix tree was recently proposed (Reid and Wernisch, 2011), but the technique cannot be applied when one also wants to learn the position and sequence rank preferences. To improve efficiency, SEME uses two EM procedures. They are based on the observations that correct motifs usually contain a short conserved pattern and that the majority of the sites in the input sequences are non-motif sites. The first EM procedure, called extending EM (EEM), starts by finding all over-represented short l-mers and then attempts to include and refine the flanking positions around the l-mers within the EM iterations. This way, SEME recovers the proper motif length within a single run, thus saving a substantial amount of time by avoiding multiple runs with different motif lengths, as done in many existing motif finders (Bailey and Elkan, 1994; Pavesi et al., 2001; Linhart et al., 2008; Kulakovskiy et al., 2010; Hu et al., 2010). The second EM procedure, called the resampling EM (REM), tries to further refine the motif produced by EEM. It is based on a theorem similar to importance sampling (Glynn and Iglehart, 1989), which states that the motif parameters can be learned unbiasedly from a biased subsample. By this principle, we can sample more sites that are similar to the EEM's motif and fewer sites from the background. This way, REM is able to learn the correct motifs using significantly fewer background sites. In our implementation, REM is capable of producing the correct TF motifs using approximately 1% of the sites normally considered in a typical EM procedure.

Using 75 large-scale synthetic datasets, we show that SEME is better in terms of both accuracy and running time when compared to MEME (Multiple EM for Motif Elicitation), a popular EM-based motif-finding program (Bailey and Elkan, 1994). We found that MEME is unable to find motifs with gap regions while SEME's EEM procedure can successfully extend the motifs to include them. On real experimental data, we performed comparisons using 32 metazoan compendium datasets and 164 ChIP-Seq libraries. SEME consistently outperformed the seven existing motif-finding programs that we compared it against. In general, we found that SEME not only finds more TF motifs but also gives more accurate results, as evaluated using either PWM divergence, AUC score, or STAMP's p-value (Mahony et al., 2007). Other TFs that bind nearby and function together with the ChIPed TF are called coTFs. When we compare the programs on finding coTF motifs from 15 ChIP-Seq datasets, the superior performance of SEME is more pronounced. We propose that SEME's ability to learn the underlying motif-binding preference is crucial to its performance. We further confirmed the correctness of the position and sequence-rank preferences of the coTF motifs learned by SEME on three ChIP-Seq datasets. The actual ChIP-Seq data of the predicted coTFs clearly show that SEME managed to infer the correct preferences. We also show that such preferences provide biological insights into the mechanism of the ChIPed TF–coTF interactions.

2. SEME Algorithm

SEME uses a probabilistic framework known as the two-component mixture model (TCM), which was first proposed in MEME (Bailey and Elkan, 1994). It assumes that the observed data is generated by two independent components: a motif model and a background model. Given an ordered list X of equal-length DNA sequences, each site X_i in X is associated with a DNA sequence $X_i^{(seq)}$ and two integers: the rank of the sequence containing X_i ($X_i^{(rank)}$) and the position of the site X_i in the sequence ($X_i^{(pos)}$). We use an indicator variable Z_i to indicate whether X_i is from the motif model or the background model, i.e., Z_i = 1 if X_i is from the motif model and 0 otherwise. The likelihood of an observed site X_i is written as:

$$Pr(X_i) = Pr(X_i \mid Z_i = 1) Pr(Z_i = 1) + Pr(X_i \mid Z_i = 0) Pr(Z_i = 0) \tag{1}$$

We use a naive Bayesian approach to combine three types of preferences (sequence, position, rank):

$$Pr(X_i \mid Z_i) = Pr(X_i^{(seq)} \mid Z_i) Pr(X_i^{(pos)} \mid Z_i) Pr(X_i^{(rank)} \mid Z_i) \tag{2}$$

For sequence preference, we model the motif-site sequence with a position weight matrix (PWM) Θ, and the background sequence with a 0-order Markov model θ₀. Θ is a 4×w matrix where Θ_{j,a} is the probability that the nucleotide a occurs at position j. For any length-w sequence X_i, the probabilities that X_i is generated from the motif model and from the background model are as follows:

$$Pr(X_i^{(seq)} \mid Z_i = 1) = Pr(X_i^{(seq)} \mid \Theta) = \prod_{j=1}^{w} \Theta_{j, X_{i,j}^{(seq)}} \tag{3}$$

$$Pr(X_i^{(seq)} \mid Z_i = 0) = Pr(X_i^{(seq)} \mid \overrightarrow{\theta_0}) = \prod_{j=1}^{w} \theta_{0, X_{i,j}^{(seq)}} \tag{4}$$

where $X_{i,j}^{(seq)}$ is the nucleotide in the j-th position of the site X_i.
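Equations 3 and 4 can be illustrated with a minimal Python sketch. The toy PWM, the uniform background, and the function names `motif_prob` and `background_prob` are assumptions made for illustration and are not taken from the SEME implementation; the PWM is stored here as w rows of (A, C, G, T) probabilities rather than as a 4×w matrix.

```python
from math import prod

NUC = "ACGT"

def motif_prob(site, pwm):
    """Eq. 3: product over positions j of pwm[j][nucleotide at position j]."""
    return prod(pwm[j][NUC.index(c)] for j, c in enumerate(site))

def background_prob(site, theta0):
    """Eq. 4: product of 0-order background frequencies of the site's nucleotides."""
    return prod(theta0[NUC.index(c)] for c in site)

# Hypothetical toy PWM of width 3 (rows are positions, columns are A, C, G, T).
pwm = [(0.7, 0.1, 0.1, 0.1),
       (0.1, 0.1, 0.7, 0.1),
       (0.1, 0.1, 0.1, 0.7)]
theta0 = (0.25, 0.25, 0.25, 0.25)  # assumed uniform background

print(motif_prob("AGT", pwm))          # 0.7 * 0.7 * 0.7
print(background_prob("AGT", theta0))  # 0.25 ** 3
```

In this toy case the site "AGT" is about 22 times more likely under the motif model than under the background, which is the ratio the mixture model exploits.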

The position and sequence-rank preferences are modeled using multinomial distributions. The position preference models the preference of the motif site for certain positions. Similarly, the sequence rank preference models whether the motif site prefers sequences within a certain range of ranks, assuming the input sequences are ordered by some criterion. To this end, we discretize both the positions and the sequence ranks into K bins. The probability that a binding site occurs in the k-th position bin is denoted as α_k, for $k = 1, \ldots, K$, while the background distribution is assumed to be uniform. Precisely, for every X_i, we have

$$Pr(X_i^{(pos)} = k \mid Z_i = 1) = \alpha_k; \quad Pr(X_i^{(pos)} = k \mid Z_i = 0) = \frac{1}{K}$$

Similarly, for sequence-rank preferences, the probability that a motif site occurs in the k-th sequence rank bin is denoted as β_k and

$$Pr(X_i^{(rank)} = k \mid Z_i = 1) = \beta_k; \quad Pr(X_i^{(rank)} = k \mid Z_i = 0) = \frac{1}{K}$$
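As a rough sketch of how the three preferences combine (Equations 1 and 2 with the binned position and rank terms), the Python fragment below computes the posterior probability that a site comes from the motif model. The helper names, the 0-based bin indices, and the equal-width binning rule are assumptions for illustration; the text does not specify how SEME assigns bins.

```python
def bin_index(value, max_value, K):
    """Map a 0-based position or rank in [0, max_value) to an equal-width bin 0..K-1."""
    return min(value * K // max_value, K - 1)

def posterior_motif(seq_m, seq_b, pos_bin, rank_bin, alpha, beta, lam):
    """Pr(Z_i = 1 | X_i) from Eqs. 1 and 2.

    seq_m, seq_b: sequence likelihoods under the motif and background models.
    alpha, beta:  motif-model bin probabilities (both of length K).
    lam:          prior Pr(Z_i = 1).
    The background uses 1/K for both the position bin and the rank bin.
    """
    K = len(alpha)
    motif = lam * seq_m * alpha[pos_bin] * beta[rank_bin]
    backg = (1 - lam) * seq_b * (1 / K) * (1 / K)
    return motif / (motif + backg)

uniform = [0.25] * 4
centered = [0.7, 0.1, 0.1, 0.1]  # hypothetical position preference
print(posterior_motif(0.343, 0.015625, 0, 0, uniform, uniform, 0.5))
print(posterior_motif(0.343, 0.015625, 0, 0, centered, uniform, 0.5))
```

With uniform α and β the position and rank terms cancel, so the posterior depends on the sequence alone; a non-uniform α raises the posterior for sites in the preferred bin and lowers it elsewhere.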

Let Pr(Z_i = 1) be λ. The parameters of the mixture model in SEME are $\Phi = (\lambda, \Theta, \theta_0, \{\alpha_1, \ldots, \alpha_K\}, \{\beta_1, \ldots, \beta_K\})$. We estimate these parameters by maximizing the log likelihood $\sum\nolimits_{i=1}^{n} \log Pr(X_i \mid \Phi)$ using the expectation maximization (EM) procedure. Given a set of sequences X, the classical EM algorithm is as follows. It first makes an initial guess of the parameters, Φ⁽⁰⁾. Then, it iteratively performs two steps: the E-step and the M-step. Given Φ^(t−1), the t-th iteration of the E-step estimates $Z_i^{(t)} = Pr(Z_i \mid \Phi^{(t-1)}, X)$. Then, given $Z_i^{(t)}$, the t-th iteration of the M-step computes $\Phi^{(t)} = \arg\max_{\Phi} \sum\nolimits_{i=1}^{n} \log Pr(X_i, Z_i^{(t)} \mid \Phi)$. The E-step and M-step are iterated until Φ^(t) converges.
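The E-step and M-step described above can be sketched for the sequence part of the mixture model alone. This is a simplified illustration, not the SEME implementation: it assumes the sites are pre-cut strings of length w, holds the background θ₀ fixed, omits the position and rank terms, runs a fixed number of iterations instead of testing convergence, and adds a small pseudocount (`pseudo`) so that no PWM entry becomes zero.

```python
from math import prod

NUC = "ACGT"

def em_tcm(sites, pwm, theta0, lam, n_iter=50, pseudo=0.01):
    """Classical EM for the sequence part of the two-component mixture."""
    w = len(pwm)
    for _ in range(n_iter):
        # E-step: posterior probability that each site comes from the motif model.
        z = []
        for s in sites:
            m = lam * prod(pwm[j][NUC.index(c)] for j, c in enumerate(s))
            b = (1 - lam) * prod(theta0[NUC.index(c)] for c in s)
            z.append(m / (m + b))
        # M-step: closed-form updates (Z-weighted frequencies).
        lam = sum(z) / len(sites)
        pwm = []
        for j in range(w):
            col = [pseudo] * 4
            for s, zi in zip(sites, z):
                col[NUC.index(s[j])] += zi
            total = sum(col)
            pwm.append([c / total for c in col])
    return pwm, lam, z

# Toy data: 20 copies of a planted 3-mer plus 10 unrelated sites.
sites = ["ACG"] * 20 + ["TTT", "GCA", "CAT", "TGC", "AAT",
                        "GGA", "CTA", "TAG", "GTC", "ATG"]
init = [[0.4, 0.2, 0.2, 0.2], [0.2, 0.4, 0.2, 0.2], [0.2, 0.2, 0.4, 0.2]]
pwm, lam, z = em_tcm(sites, init, [0.25] * 4, lam=0.5)
print([round(p, 2) for p in pwm[0]], round(lam, 2))
```

Starting from a weak initial guess, the posteriors of the planted sites rise toward 1 while those of unrelated sites fall toward 0, which is the behavior the classical procedure relies on.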

In this work, we developed four phases in the SEME pipeline (Fig. 1). To search for a good starting point, SEME first enumerates a set of over-represented short l-mers (phase 1) and extends each short l-mer to a PWM motif of proper length by the EEM procedure (phase 2). The PWM reported by the EEM procedure will approximate the true motif when its starting l-mer captures the conserved region of the motif. To further refine EEM's PWM motif, SEME applies the resampling EM procedure (phase 3). It is an importance sampling version of the classical EM algorithm that greatly speeds up the EM iterations. Finally, the refined PWM motifs are scored and filtered for redundancies (phase 4). Below, we briefly describe these four phases (see Supplementary Section 3, Supplementary Material available online at www.liebertonline.com/cmb).

FIG. 1.

Algorithm description for SEME Pipeline. AUC, area under the ROC curve; EM, expectation maximization; PWM, position-weighted matrices; REM, resampling EM; SEME, sampling with expectation maximization for motif elicitation.

Identifying over-represented l-mers. In the first phase, SEME computes the frequencies of all short l-mers (l = 5 by default) in the input sequences and the background. If no background sequences are provided, a first-order Markov model will be learned from the input sequences as the background model. We output all l-mers whose frequencies in the input sequences are higher than in the background to the next phase.
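The first phase can be sketched as a frequency comparison. The version below is a simplified illustration under stated assumptions: it requires explicit background sequences (the first-order Markov fallback is not implemented), counts l-mers on one strand only, and uses the hypothetical function names `lmer_freqs` and `overrepresented`.

```python
from collections import Counter

def lmer_freqs(seqs, l=5):
    """Relative frequency of every l-mer occurring in a list of sequences."""
    counts = Counter(s[i:i + l] for s in seqs for i in range(len(s) - l + 1))
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def overrepresented(inputs, background, l=5):
    """l-mers whose frequency in the inputs exceeds their background frequency."""
    fg, bg = lmer_freqs(inputs, l), lmer_freqs(background, l)
    return sorted(k for k, f in fg.items() if f > bg.get(k, 0.0))

inputs = ["AAGGTCAAA", "TTGGTCATT"]
background = ["AAGGTAAAA", "TTTTTCATT"]
print(overrepresented(inputs, background))
```

An l-mer that never occurs in the background counts as over-represented here because its background frequency is taken as zero.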

Extending EM. For each l-mer q obtained from the first phase, the aim of the extending EM procedure is to extend the l-mer to a longer motif that maximizes the likelihood of observing the sites with q. In this phase, the EEM procedure only needs to study the sites containing the l-mer q, i.e., $Y = \{X_i \in X \mid X_i^{(seq)} \text{ matches } (N)^{w-|q|}\, q\, (N)^{w-|q|}\}$ (“N” is a wildcard character), where w is the maximum length of a motif. For example, if the l-mer is “GGTCA” and the predefined longest possible motif length is 10, then EEM considers only those sites in X matching the string pattern “NNNNNGGTCANNNNN.” Figure 2 gives the pseudocode for the EEM procedure. The EEM procedure first initializes the parameters $\Phi^{(0)} = (\lambda^{(0)}, \Theta^{(0)}, \theta_0^{(0)}, \{\alpha_1^{(0)}, \ldots, \alpha_K^{(0)}\}, \{\beta_1^{(0)}, \ldots, \beta_K^{(0)}\})$, where λ⁽⁰⁾ is the estimated percentage of Y not from background, Θ⁽⁰⁾ is the PWM representing q, $\theta_0^{(0)}$ is the frequency of A, C, G, T in Y, excluding the conserved l-mer q, and $\alpha_i^{(0)} = \beta_i^{(0)} = 1/K$ for $i = 1, \ldots, K$ (uniform distribution). Then, it performs the E-step (expectation) and M-step (maximization) iteratively.
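The site set Y can be collected with a simple scan, sketched below in Python. The function name `sites_matching` and the decision to skip occurrences whose flanks run past the sequence ends are assumptions for illustration; only one strand is scanned.

```python
def sites_matching(seqs, q, w):
    """Collect windows matching (N)^(w-|q|) q (N)^(w-|q|).

    Returns (sequence rank, window start, window) triples. Occurrences of q
    that are too close to a sequence end to have full flanks are skipped.
    """
    flank = w - len(q)
    sites = []
    for rank, s in enumerate(seqs):
        start = s.find(q)
        while start != -1:
            lo, hi = start - flank, start + len(q) + flank
            if lo >= 0 and hi <= len(s):
                sites.append((rank, lo, s[lo:hi]))
            start = s.find(q, start + 1)
    return sites

seqs = ["ACGTAGGTCATTTTTAC", "GGTCAAAAAAAAAAAA"]
print(sites_matching(seqs, "GGTCA", 10))
```

With q = "GGTCA" and w = 10, each window is 15 nucleotides long, matching the "NNNNNGGTCANNNNN" example in the text; the second toy sequence is skipped because q sits at its left edge.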

FIG. 2.

Pseudocode for extending EM procedure.

In each iteration of the M-step, the EEM procedure will also try to include one additional column into Θ^(t) if such an extension improves the likelihood. Precisely, for each position $j = 1, \ldots, 2w - |q|$ not in Θ^(t), we show that the maximum increase in the log likelihood from including position j is G(j), where

$$G(j) = \sup_J \sum_{X_i \in Y} Z_i^{(t)} \log \bigg( \frac{Pr(X_{i,j}^{(seq)} \mid J)}{Pr(X_{i,j}^{(seq)} \mid \theta_0^{(t)})} \bigg) \tag{5}$$

where J is any probability distribution over the nucleotides {A,C,G,T}.
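Equation 5 has a closed form, because the supremum over J is attained at the Z-weighted nucleotide frequencies of column j (the standard maximum-likelihood estimate of a multinomial). The Python sketch below evaluates G(j) that way; the function name `column_gain` and the toy data are assumptions for illustration.

```python
from math import log

NUC = "ACGT"

def column_gain(sites, z, j, theta0):
    """G(j) of Eq. 5, evaluated at the maximizing distribution J.

    J is set to the Z-weighted frequencies of the nucleotides in column j.
    Terms with zero weighted count contribute nothing (0 * log 0 = 0).
    """
    counts = [0.0] * 4
    for s, zi in zip(sites, z):
        counts[NUC.index(s[j])] += zi
    total = sum(counts)
    return sum(n * log((n / total) / theta0[a])
               for a, n in enumerate(counts) if n > 0)

sites = ["AC", "AG", "AT", "AA"]
z = [1.0, 1.0, 1.0, 1.0]
theta0 = [0.25] * 4
print(column_gain(sites, z, 0, theta0))  # fully conserved column: 4 * log(4)
print(column_gain(sites, z, 1, theta0))  # column matching the background: 0
```

A fully conserved column yields the largest gain, while a column whose weighted frequencies equal the background yields zero, so ranking candidate columns by G(j) favors informative flanking positions.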

While the length of Θ^(t) is less than w, we extend the PWM Θ^(t) to include the position j that yields the largest G(j). To avoid over-fitting, the selected column must also be significantly different from the background frequency θ₀ (Chi-square test). The EEM procedure ends when the PWM Θ converges. Finally, the columns in Θ representing the l-mer q are further diluted (by setting all [1.0, 0.0, 0.0, 0.0] columns representing “A” to [$0.5, \frac{0.5}{3}, \frac{0.5}{3}, \frac{0.5}{3}$]; other nucleotides are handled similarly) before Θ is returned as the output of the EEM procedure. In Supplementary Section 1.1, we confirmed that EEM consistently recovers the correct motif length.
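The dilution step for the seed columns is small enough to show directly. In this sketch, the function name `dilute` and the `keep` parameter are assumptions for illustration; only the value 0.5 comes from the text.

```python
def dilute(column, keep=0.5):
    """Soften a one-hot PWM column: `keep` on the seed base, the rest split evenly.

    Columns that are not one-hot are returned unchanged.
    """
    if max(column) < 1.0:
        return list(column)
    return [keep if p == 1.0 else (1 - keep) / 3 for p in column]

print(dilute([1.0, 0.0, 0.0, 0.0]))  # column for "A" after dilution
print(dilute([0.0, 0.0, 1.0, 0.0]))  # column for "G" after dilution
```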

Resampling EM. The EEM procedure identifies an approximate motif model Θ^(EEM) with a proper motif length. This motif can be further refined using the classical EM algorithm to improve accuracy. However, when the input data X is large, this step will be slow. Using the idea of importance sampling, we propose the resampling EM (REM) procedure, which reduces the running time by running the EM algorithm on a subsample of the original data.

Let Q(·) be the sampler function, where Q(X_i) = 1 if the sequence X_i is sampled, and 0 otherwise. In Supplementary Theorem 1 (Supplementary Section 3.4), we show that the log likelihood function log Pr(X, Z|Φ) can be unbiasedly approximated by

$$\sum_{X_i \in X} \log Pr(X_i, Z_i \mid \Phi) = E_{X_Q} \bigg[ \sum_{X_i \in X_Q} \frac{\log Pr(X_i, Z_i \mid \Phi)}{Pr(Q(X_i) = 1)} \bigg] \tag{6}$$

where each sampled site is weighted by the factor $\frac{1}{Pr(Q(X_i) = 1)}$. The theorem implies that we need only run the EM algorithm on X_Q. Moreover, in the M-step of the original EM, instead of maximizing log Pr(X, Z|Φ), we maximize $\sum\nolimits_{X_i \in X_Q} \frac{\log Pr(X_i, Z_i \mid \Phi)}{Pr(Q(X_i) = 1)}$.

Although Equation 6 is true for any arbitrary sampler function Q(·), running EM using different Q(·) yields different sampling efficiencies. For example, we can use a uniform random sampler, i.e., Pr(Q(X_i)=1)=μ for every $X_i \in X$, where $\mu \in [0, 1]$ is the sampling ratio. This function is expected to cover only 100μ% of the correct motif sites from X, which prohibits the use of small μ. In our work, we employ the idea of importance sampling. Our sampling function Q(·) satisfies Pr(Q(X_i)=1)=min{4^w μ Pr(X_i|Θ^(EEM)), 1}, where w is the motif length. This sampling function gives higher probabilities to the sites that are more consistent with Θ^(EEM); thus, it is expected to sample more from the correct motif sites and less from the background (assuming Θ^(EEM) models more of the correct motif-site signal than the background signal). This strategy is useful since we avoid most of the background sites in X. In fact, our simulation reveals that the REM procedure can achieve a recall rate of nearly 60% (of the correct motif sites) at a sampling ratio as small as 2⁻¹⁰ (≈0.001) and a recall rate of 90% at a sampling ratio of 2⁻⁵ (≈0.031) (see Supplementary Section 1.2). We chose a default sampling ratio of 0.01 in all experiments of this article.
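The sampling rule and the weights of Equation 6 can be sketched as follows. The function names, the toy PWM, and the use of an independent draw per site are assumptions for illustration. Note that for a uniform PWM, 4^w Pr(X_i|Θ) equals 1, so the rule reduces to the uniform sampler with ratio μ.

```python
import random
from math import prod

NUC = "ACGT"

def sample_prob(site, pwm, mu):
    """Pr(Q(X_i) = 1) = min(4^w * mu * Pr(X_i | pwm), 1)."""
    p = prod(pwm[j][NUC.index(c)] for j, c in enumerate(site))
    return min(4 ** len(pwm) * mu * p, 1.0)

def resample(sites, pwm, mu, rng):
    """Keep each site independently with its sampling probability.

    Returns (site, weight) pairs with weight = 1 / Pr(Q(X_i) = 1), the
    factor by which each sampled site is weighted in Eq. 6.
    """
    kept = []
    for s in sites:
        p = sample_prob(s, pwm, mu)
        if rng.random() < p:
            kept.append((s, 1.0 / p))
    return kept

# Hypothetical sharp PWM for "AAA"; motif-like sites are kept, others rarely.
pwm = [(0.97, 0.01, 0.01, 0.01)] * 3
sites = ["AAA"] * 10 + ["CCC"] * 10
kept = resample(sites, pwm, mu=0.05, rng=random.Random(0))
print(len(kept), sample_prob("AAA", pwm, 0.05), sample_prob("CCC", pwm, 0.05))
```

Sites that score well under the PWM are retained with weight close to 1, while the few background sites that survive carry large weights, which is what keeps the weighted sum an unbiased estimate of the full sum.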

The position and sequence-rank preferences are assumed to be nonexistent at the beginning of the REM iterations (i.e., $Pr(X_i \mid Z_i) = Pr(X_i^{(seq)} \mid Z_i)$). The position and/or sequence-rank preferences are considered only when the position and/or sequence-rank distributions of $\{Z_i^{(t)}\}$ are significantly different from the uniform distribution (by Chi-square test). This strategy allows SEME to tell users which preference is really important for the predicted motif. Figure 3 is the pseudocode for this procedure.
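One way to picture the preference test is a Chi-square statistic computed on the Z-weighted bin counts against a uniform expectation. The sketch below is an assumption-laden illustration: the text does not state the significance level or the exact form of the test SEME uses, so the 0.05 level and the constant `CRIT_DF3_P05` (the standard critical value for 3 degrees of freedom) are chosen here only for the K = 4 example.

```python
def chi_square_uniform(bins, z, K):
    """Chi-square statistic of Z-weighted bin counts against a uniform distribution.

    bins: 0-based bin index of each site (position bin or rank bin).
    z:    posterior motif probability of each site.
    """
    observed = [0.0] * K
    for b, zi in zip(bins, z):
        observed[b] += zi
    expected = sum(observed) / K
    return sum((o - expected) ** 2 / expected for o in observed)

# Critical value for K - 1 = 3 degrees of freedom at the 0.05 level (assumed level).
CRIT_DF3_P05 = 7.815

bins = [0] * 40 + [1] * 20 + [2] * 20 + [3] * 20
stat = chi_square_uniform(bins, [1.0] * 100, K=4)
print(stat, stat > CRIT_DF3_P05)
```

In this toy case the first bin holds twice as many motif sites as each of the others, and the statistic exceeds the critical value, so the preference would be switched on.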

FIG. 3.

Pseudocode for the resampling EM procedure.

Sorting and Redundancy Filtering. The PWMs output by REM were evaluated and sorted by empirical ROC-AUC (the area under the receiver–operator characteristic curve) or over-representation Z-score (representing the motif abundance) with the input data (details on each scoring are in the Supplementary Section 4). The first score is preferred for the case when input sequences are short and most sequences contain at least one motif site (e.g., ChIPed-TF motif finding); for the other cases, we suggest using the Z-score. We eliminated redundant PWMs from the sorted list as follows. When the sites of a PWM motif overlap with those of another PWM motif by more than 10%, we treat the PWM motif with the lower score as redundant and remove it.
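The redundancy filter can be sketched as a greedy pass over the score-sorted motifs. The definition of overlap used here (the fraction of a motif's sites shared with an already kept, higher-scoring motif) and the representation of sites as sets of identifiers are assumptions for illustration; the text gives only the 10% threshold.

```python
def filter_redundant(motifs, max_overlap=0.10):
    """Greedy redundancy filter over (name, score, set_of_sites) triples.

    Motifs are visited from highest to lowest score. A motif is dropped when
    more than `max_overlap` of its sites are shared with a motif already kept.
    Site sets are assumed to be non-empty.
    """
    kept = []
    for name, score, sites in sorted(motifs, key=lambda m: m[1], reverse=True):
        redundant = any(len(sites & k_sites) / len(sites) > max_overlap
                        for _, _, k_sites in kept)
        if not redundant:
            kept.append((name, score, sites))
    return [name for name, _, _ in kept]

motifs = [
    ("m1", 0.9, set(range(1, 11))),               # sites 1..10
    ("m2", 0.8, {1, 2} | set(range(11, 19))),     # shares 2 of 10 sites with m1
    ("m3", 0.7, {1} | set(range(21, 30))),        # shares 1 of 10 sites with m1
]
print(filter_redundant(motifs))
```

Here m2 is removed because 20% of its sites overlap with m1, while m3 is kept because its overlap is exactly 10% and the rule requires more than 10%.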

3. Results

3.1. Profiling two novel EM procedures

SEME significantly outperforms MEME in recovering the planted PWM. To analyze SEME's performance, we extracted all 75 motifs of length >9 in the JASPAR (Wasserman and Sandelin, 2004) vertebrate core database. For each such motif, we generated a training dataset of 1000 random sequences of length 400 bp, 500 of which contain a motif instance. The instances are planted uniformly across all positions and sequences.
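A dataset of this kind can be generated along the following lines. This is a sketch under assumptions: the random sequences are drawn from a uniform nucleotide distribution, each planted instance is sampled column by column from the PWM, and the function name `plant_dataset` and its parameters are hypothetical.

```python
import random

NUC = "ACGT"

def plant_dataset(pwm, n_seqs=1000, n_planted=500, length=400, seed=0):
    """Random sequences, a subset of which carry one instance sampled from the PWM."""
    rng = random.Random(seed)
    w = len(pwm)
    seqs = []
    for i in range(n_seqs):
        s = rng.choices(NUC, k=length)            # uniform random background
        if i < n_planted:
            inst = [rng.choices(NUC, weights=col)[0] for col in pwm]
            start = rng.randrange(length - w + 1)  # uniform planting position
            s[start:start + w] = inst
        seqs.append("".join(s))
    rng.shuffle(seqs)                             # planted sequences in random order
    return seqs

# One-hot toy PWM so that every planted instance is the same string.
motif = "ACGTACGTAC"
toy_pwm = [[1.0 if c == b else 0.0 for b in NUC] for c in motif]
data = plant_dataset(toy_pwm, n_seqs=20, n_planted=10, length=50)
print(sum(motif in s for s in data))
```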

For each dataset, we ran SEME (EEM only), SEME (EEM+REM), and MEME (the classical EM-based motif finder) and obtained the top five predicted PWMs from each program. To test the goodness of the predicted PWMs, we compared the PWM divergence between the predicted PWMs and the actual planted PWMs. We also generated independent testing sequences with length 400 bp (1000 positive sequences with one implanted motif site, 1000 negative sequences without motif site), and computed the ROC-AUC value for each predicted PWM. Figure 4a shows the comparison result. As expected, the random PWMs have the worst AUC values, while the actual planted PWMs have the best AUC values. EEM's predicted PWMs have significantly better discriminative capability (AUC) and similarity (less PWM divergence) to the actual planted PWM as compared to random PWMs. This indicates that EEM's PWMs are good starting points for the subsequent REM procedure. REM's predicted PWMs further improve the AUC score and are similar to the actual planted PWM (as indicated by the small PWM divergence).
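The ROC-AUC evaluation can be sketched as follows. How SEME scores a sequence with a PWM is described in the Supplementary Material and is not reproduced here, so this sketch assumes the score of a sequence is its best single-strand log-odds window; the function names are hypothetical, and PWM entries are assumed to be strictly positive. The AUC is computed from pairwise comparisons, with ties counted as one half.

```python
from math import log

NUC = "ACGT"

def best_log_odds(seq, pwm, theta0):
    """Highest log-odds score of any length-w window in the sequence."""
    w = len(pwm)
    return max(sum(log(pwm[j][NUC.index(c)] / theta0[NUC.index(c)])
                   for j, c in enumerate(seq[i:i + w]))
               for i in range(len(seq) - w + 1))

def roc_auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count one half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pwm = [(0.7, 0.1, 0.1, 0.1), (0.1, 0.7, 0.1, 0.1)]  # toy PWM for "AC"
theta0 = (0.25, 0.25, 0.25, 0.25)
pos = [best_log_odds(s, pwm, theta0) for s in ["TTACTT", "GACGGG"]]
neg = [best_log_odds(s, pwm, theta0) for s in ["TTTTTT", "GGGGGG"]]
print(roc_auc(pos, neg))
```

An AUC of 1 means every positive sequence scores higher than every negative one, while 0.5 corresponds to a PWM that cannot tell them apart.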

FIG. 4.

The empirical performance of SEME on synthetic datasets. (a) The accuracy of SEME's PWMs (both the EEM's [unrefined] PWM and the REM's [final] PWM are listed). We quantify accuracy with the commonly used area-under-ROC-curve (AUC) score and PWM divergence (PD). EEM's predicted PWM is already significantly better than random, indicating that it is a good starting point for the subsequent REM step. The scores also show that SEME's PWMs are significantly better than MEME's. (b) Based on the performances of SEME and MEME on the Pax4 motif dataset, we observe that MEME has serious difficulties mining PWMs with long gap regions within them. (c) The running time of SEME against increasing input size. CUDA-MEME, the GPU-enabled version of MEME, still runs slower than SEME running on a normal CPU (it takes 1 day to handle ≈6000 sequences, while SEME takes around 1 hour for 10000 sequences).

Figure 4a also shows that SEME outperforms MEME. In fact, SEME is better than MEME in 42 out of 75 experiments (the cases with positive AUC differences in Fig. 4b). The cases in which SEME performed worse have relatively small AUC score differences (less than 0.04). We examined the Pax4 dataset, in which SEME gains the largest improvement over MEME. The implanted JASPAR Pax4 motif is a diverged PWM of length 30. SEME successfully extended and recovered the full Pax4 motif, thanks to the ability of its EEM procedure to handle long gaps in its extension step. In contrast, MEME failed to model the long gaps because of its starting-point finding procedure, which assumes that all of the PWM positions are equally important.

SEME is more suitable for handling large-scale data. We further generated seven large datasets to assess SEME's capability in handling large-scale data. The datasets contain different numbers of sequences (from 500 to 10000, each of length 400 bp). Figure 4c shows that the original MEME program cannot process more than 2000 sequences within one day; hence, we also used CUDA-MEME (Liu et al., 2009), the GPU-accelerated version of MEME (run on two Intel X5670 CPUs and two Fermi M2050 GPUs with 48 GB RAM). SEME was run as a normal CPU program. SEME is still around 60 times faster than CUDA-MEME, which runs on the highly parallelized GPU system. In addition, SEME can process up to 10000 sequences (a typical dataset size for ChIP-Seq experiments) in 1 hour, while CUDA-MEME took more than one day to process 6000 sequences.

3.2. Comparing TF motif finding in large-scale real datasets

We compared the performance of SEME with that of existing motif-finding programs on two large-scale TF-binding site datasets. We also studied the ability of SEME to uncover hidden position and/or sequence-rank preferences in the input dataset when they are present.

The metazoan compendium datasets. The first benchmark is a metazoan compendium dataset published by Linhart et al. (2008), consisting of 32 datasets based on experimental data from microarray, ChIP-chip, ChIP-DSL, and DamID as well as gene ontology data (Ashburner, 2000). A list of the promoter sequences of many target genes (1000 bp upstream and 200 bp downstream of the transcription start site [TSS]) was used as the positive input for each motif-finding program, and the promoter sequences of other non-target genes were used as background sequences. The performance of six existing motif-finding programs, namely AlignACE (Roth et al., 1998), MEME (Bailey and Elkan, 1994), YMF (Sinha and Tompa, 2000), Trawler (Ettwiller et al., 2007), Weeder (Pavesi et al., 2001), and Amadeus (Linhart et al., 2008), was compared in the original benchmark study (Linhart et al., 2008). Each program's predicted PWMs were evaluated by the PWM divergence. Only PWMs with a medium or strong match to the known motifs (PWM divergence <0.18) were considered to be successfully detected (Linhart et al., 2008).

The result of this comparison is shown in Figure 5a. We find that SEME successfully detected the correct motifs in 21 datasets, whereas the second-best program, Amadeus, succeeded in 18. Weeder and Trawler found correct PWMs in 11 and 12 datasets, respectively. SEME also found more accurate motifs than the rest; it found 12 motifs with PWM divergence <0.12. SEME further detected a significant position preference for the correct motifs for many datasets in this benchmark: most of them tend to bind nearer to the TSS position (see Supplementary Section 1.4 for details).

FIG. 5.

The performance of SEME compared to existing motif-finding programs on large-scale real data. (a) Comparison result on the metazoan compendium datasets. Four PWM motifs returned by each motif-finding program are compared to the known Transfac motifs using PD (as in Linhart et al., 2008) and further classified into three matching categories (strong, medium, weak) corresponding to different PD cutoffs (0.12, 0.18, 0.24). (b) Comparison result on 164 ChIP-Seq libraries over four different measurements: AUC, PPV (positive predictive value), ASP (average site performance), and SPC (specificity). The result shows that most motif finders perform similarly well in detecting the ChIPed TF (but SEME is consistently better than all of them). (c) Comparison result for coTF motif finding on 15 ChIP-Seq libraries. The quality of reported PWMs is classified into three categories (strong, medium, weak) corresponding to different STAMP p-value cutoffs (0.0001, 0.01, 0.05). SEME reported the largest number of coTF motifs that match the known PWM with STAMP p-value ≤ 0.0001 (strong match, blue bar). Overall, SEME also found the largest number of coTF motifs (61), compared with 48 for the second-best program, Amadeus.

ChIP-Seq experimental datasets: Discovery of the ChIPed TF motif from ChIP-Seq data. The second benchmark is a collection of large-scale ChIP-Seq experimental data consisting of 164 published ChIP-Seq libraries from the ENCODE project (Euskirchen et al., 2007) and our lab, covering different cell lines and TFs (Chen et al., 2008; Zhang et al., 2011; Kong et al., 2011). ChIP-Seq usually reports more than 10000 target sequences with narrower target regions (100 bp). We computed the area under the ROC curve, positive predictive value, average site performance, and specificity scores of each program's predicted PWM. The formulas for these scores are given in Supplementary Section 4. From each library, the 100 bp sequences around the top 10000 ChIP-Seq peaks (sorted by ChIP intensity) were selected as our input data. MEME and Weeder used only the top 2000 peaks because of their long running time. Peaks with odd-numbered ranks were used for training, while the even-numbered peaks were used as positive testing data. The negative dataset was generated by a first-order Markov model trained on the same number of 100 bp random sequences extracted from regions 1000 bp away from the ChIP-Seq peaks.
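The train/test split and the negative-set construction described above might look as follows in Python. This is a hedged sketch rather than the pipeline actually used: the function names are hypothetical, the add-one smoothing of the Markov counts is our own assumption, and the input is taken to be a list of peak sequences already sorted by decreasing ChIP intensity.

```python
import random
from collections import defaultdict


def split_by_rank(peaks_sorted):
    """Odd-numbered ranks (1, 3, 5, ...) go to training and even-numbered
    ranks (2, 4, 6, ...) to positive testing; rank 1 is the first element."""
    return peaks_sorted[0::2], peaks_sorted[1::2]


def train_markov1(seqs):
    """Count first-base and base-to-base transition frequencies."""
    start = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        if not s:
            continue
        start[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    return start, trans


def sample_markov1(start, trans, length, rng):
    """Sample one sequence from the first-order Markov model.
    Add-one smoothing (an assumption) avoids zero-probability transitions."""
    bases = "ACGT"
    seq = [rng.choices(bases, weights=[start[b] + 1 for b in bases])[0]]
    while len(seq) < length:
        prev = seq[-1]
        weights = [trans[prev][b] + 1 for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)
```

A first-order model preserves the dinucleotide composition of the flanking regions, so the negative sequences resemble genomic background more closely than uniformly random sequences would.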

We compared SEME with seven popular de novo motif-finding programs for ChIP data: MEME, Weeder, CisFinder, Trawler, Amadeus, ChIPMunk, and HMS. Each program's top five motifs were evaluated using the four statistical measures on the test data. For each measure, the best of the five motifs was used to represent the performance of a program. Figure 5b shows the average performances of the motif finders. Again, we find that SEME is consistently among the best programs (first rank in area under the ROC curve, positive predictive value, and specificity, and third rank in average site performance).

Discovery of coTF motifs from ChIP-Seq data. We noted that most motif finders show good performance in finding the ChIPed TF motifs. This is expected since the ChIPed TFs are highly enriched (Zhang et al., 2011). Compared to finding ChIPed TF motifs in ChIP-Seq datasets, the problem of finding coTF motifs in the ChIP-Seq datasets is much more challenging. The coTF motif instances are less abundant and most are not located exactly at the ChIP-Seq peaks. Nevertheless, finding the coTF(s) could potentially uncover previously unknown TF–TF interaction.

For coTF motif comparison, we used 15 ChIP-Seq libraries whose coTFs have been characterized (the list of coTFs for each ChIP-Seq library is in Supplementary Section 2.5). We extracted 400 bp sequences around the ChIP-Seq peaks and compared the top 20 de novo motifs of each program to the known coTF motifs in the JASPAR and Transfac databases; we could not use the previous statistical measures since coTFs may not occur in all ChIP-Seq peaks. Furthermore, the ChIPed TF binding sites need to be masked before coTF motif finding starts. SEME and ChIPMunk can do this automatically; for the other programs, which lack an auto-masking mode, the input sequences were masked using the top two motifs reported in their ChIPed motif-finding results.

The STAMP program (Mahony et al., 2007) was used to compute the p-value of the match between a predicted coTF motif and the known coTF motif. The STAMP p-value provides a better match measurement than PWM divergence since it removes the motif-length bias (Mahony et al., 2007). We separated the p-values of the PWM matches into three significance levels: (1) weak match (0.05 ≥ p-value > 0.01), (2) medium match (0.01 ≥ p-value > 0.0001), and (3) strong match (p-value ≤ 0.0001). Figure 5c shows the performance of the different motif-finding programs in finding the coTF motifs from the 15 datasets. SEME recovered 61 known coTF motifs, compared with 48 for Amadeus and 44 for MEME. Thirty-one of SEME's 61 coTF motifs belong to the strong-match category (Amadeus found only 20) and another 27 are in the medium-match category. This indicates that SEME's predicted coTF PWMs are highly accurate.
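The three significance levels amount to a simple binning of the STAMP p-values, which can be written out as below. The function names are hypothetical, the p-values are assumed to have been computed by STAMP beforehand, and the "none" label for p-values above 0.05 is our own addition.

```python
from collections import Counter


def classify_match(p_value):
    """Map a STAMP p-value to the match category used in the text."""
    if p_value <= 0.0001:
        return "strong"
    if p_value <= 0.01:
        return "medium"
    if p_value <= 0.05:
        return "weak"
    return "none"  # not counted as a recovered coTF motif


def tally_matches(p_values):
    """Count how many motifs fall into each match category."""
    return Counter(classify_match(p) for p in p_values)
```

For example, a program whose best matches to three known coTF motifs have p-values 0.00005, 0.003, and 0.2 would be credited with one strong and one medium match.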

To study the biological significance of the learnt preferences, we further studied in detail the output for three datasets, involving the estrogen receptor (ER), androgen receptor (AR), FoxA1, Oct4, and c-Myc TFs (Fig. 6). The real binding site of each TF was defined to be the site within ±100 bp of the TF's ChIP-Seq peak whose known PWM score is better than a cutoff that yields FDR = 0.01. If multiple matches occurred, only the best-scoring site was chosen. Comparison between SEME's learnt distributions (Fig. 6, middle columns) and the real binding-site distributions (Fig. 6, right-most columns) indicates that SEME is able to learn the correct coTF position and sequence-rank preferences. We also found that the motif positions of FoxA1, a known coTF of ER, are not enriched exactly at the ER ChIP-Seq peak in the MCF7 data; instead, they are found in the flanking regions near the ER peaks. Interestingly, in the LNCaP AR ChIP-Seq dataset (FoxA1 is also a known coTF of AR), we found that FoxA1 binds very closely to AR: it is enriched at the AR ChIP-Seq peak summits. This observation is consistent with the previous report that FoxA1 can physically interact with AR (Gao et al., 2003). It also indicates the different roles FoxA1 assumes when working with AR and ER (Sahu et al., 2011). In the ChIP-Seq data of Oct4 from mouse ES cells, SEME found the motif of c-Myc enriched within Oct4's low-intensity peak regions. We conjecture that, in these regions, Oct4 binds the DNA indirectly through c-Myc (hence explaining the low ChIP-Seq intensity). An earlier report showed that Oct4, along with Sox2, Nanog, and Stat3, forms an enhancer module while c-Myc, along with n-Myc, E2F1, and Zfx, forms a promoter module in the ES cell (Chen et al., 2008). In fact, interaction between these enhancer and promoter modules has also been reported previously (Wu and Ng, 2011).

FIG. 6.

Automatic learning of the position and sequence-rank preferences from the input data. Instead of requiring the user to input the expected coTF motif preference distribution (position and/or sequence-rank distribution), SEME learns such distributions directly from the input data. We show that, most of the time, SEME can learn the correct distributions of each TF (as compared with the real binding-site distributions in the rightmost column, defined by the ChIP-Seq data and the known PWM of the TF). For the position distribution, the x-axis is ±200 bp from the ChIP-Seq peak summit (the black dashed line), and the y-axis is the fraction of binding sites at a given position. For the rank distribution, the x-axis is the rank of the ChIP-Seq peak (left: high ChIP intensity, right: low ChIP intensity), and the y-axis is the fraction of binding sites at a given rank. The ChIP-Seq peak rank distributions of FoxA1 (MCF7 ER ChIP, LNCaP AR ChIP) and the position distribution of Myc were found to be insignificant by SEME.

These examples indicate that the position and sequence-rank distributions learnt by SEME are reasonably accurate, and users could use them to infer the nature of the interaction between the ChIPed TF and the coTF(s). In this manner, SEME can be used to generate biological hypotheses for further experimental validation. Moreover, the highly diverse preferences that we observe highlight how difficult it would be for users to provide the correct prior in the first place.

4. Conclusion

This article presented a novel algorithm called SEME for mining motifs using a mixture model and the EM algorithm. We made three important contributions: (1) automatic detection and learning of the position and sequence-rank preferences of a candidate motif; (2) the ability to estimate the correct TF motif length (with possible gaps within); and (3) the use of importance sampling for efficiency while still estimating the EM parameters without bias. As a result, we showed that SEME is substantially better, in terms of both accuracy and efficiency, than the existing motif-finding programs.

Moreover, in the task of finding coTF motifs in ChIP-Seq data, SEME not only reports more accurate coTF motifs than other programs but also correctly estimates the position and sequence-rank distributions of each coTF's motif. We showed that such information provides useful insights into the interaction between the ChIPed TF and the predicted coTFs. SEME does have a few limitations. First, it assumes that the target motif contains a conserved 5-mer region; for cases without such a 5-mer, SEME allows users to provide custom seeds. Second, SEME is more suitable for large-scale input (≥100 sequences) since it needs enough samples to determine whether it should perform extension (EEM) or include additional binding preferences (REM).

Acknowledgments

This work was supported in part by the MOE's AcRF Tier 2 funding R-252-000-444-112. Z.Z. Zhang is supported by the National University of Singapore research scholarship. C.W. Chang is supported by an A*STAR graduate scholarship.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Ashburner. 2000. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29.

Bailey T.L. 2011. DREME: Motif discovery in transcription factor ChIP-seq data. Bioinformatics, 27:1653.

Bailey T.L., Elkan. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proc. Int. Conf. Intell. Syst. Mol. Biol., 2:28–36.

Berger M.F., Bulyk M.L. 2006. Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods in Molecular Biology, 338:245.

Chen, Hughes T.R., Morris. 2007. RankMotif++: a motif-search algorithm that accounts for relative ranks of k-mers in binding transcription factors. Bioinformatics, 23:i72.

Chen, Xu, Yuan, et al. 2008. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133:1106–1117.

Ettwiller, Paten, Ramialison, et al. 2007. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nature Methods, 4:563–565.

Euskirchen G.M., Rozowsky J.S., Wei C.L., et al. 2007. Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Research, 17:898.

Frith M.C., Hansen, Spouge J.L., Weng. 2004. Finding functional sequence elements by multiple local alignment. Nucleic Acids Research, 32:189.

Gao, Zhang, Rao M.A., et al. 2003. The role of hepatocyte nuclear factor-3α (forkhead box A1) and androgen receptor in transcriptional regulation of prostatic genes. Molecular Endocrinology, 17:1484.

Glynn P.W., Iglehart D.L. 1989. Importance sampling for stochastic simulations. Management Science, 1367–1392.

Hu, Yu, Taylor J.M.G., et al. 2010. On the detection and refinement of transcription factor binding sites using ChIP-seq data. Nucleic Acids Research, 38:2154.

Keilwagen, Grau, Paponov I.A., et al. 2011. De-novo discovery of differentially abundant transcription factor binding sites including their positional preference. PLoS Computational Biology, 7:e1001070.

Kong S.L., Li, Loh S.L., et al. 2011. Cellular reprogramming by the conjoint action of ERα, FOXA1, and GATA3 to a ligand-inducible growth state. Molecular Systems Biology, 7.

Kulakovskiy I.V., Boeva V.A., Favorov A.V., Makeev V.J. 2010. Deep and wide digging for binding motifs in ChIP-seq data. Bioinformatics, 26:2622.

Lam T.W., Sadakane, Sung W.K., Yiu S.M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. Computing and Combinatorics, 21–26.

Linhart, Halperin, Shamir. 2008. Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets. Genome Research, 18:1180.

Liu X.S., Brutlag D.L., Liu J.S. 2002. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20:835–839.

Liu, Schmidt, Liu, Maskell D.L. 2009. CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognition Letters, 31:2170–2177.

Mahony, Auron P.E., Benos P.V. 2007. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Computational Biology, 3:e61.

Narang, Mittal, Sung W.K. 2010. Localized motif discovery in gene regulatory sequences. Bioinformatics, 26:1152.

Pavesi, Mauri, Pesole. 2001. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17:S207–S214.

Raphael, Liu L.T., Varghese. 2004. A uniform projection method for motif discovery in DNA sequences. IEEE Transactions on Computational Biology and Bioinformatics, 91–94.

Reid J.E., Wernisch. 2011. STEME: efficient EM to find motifs in large data sets. Nucleic Acids Research, 39:e126–e126.

Roth F.P., Hughes J.D., Estep P.W., Church G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology, 16:939.

Sahu, Laakso, Ovaska, et al. 2011. Dual role of FoxA1 in androgen receptor binding to chromatin, androgen signalling and prostate cancer. The EMBO Journal, 30:3962–3976.

Sharov A.A., Ko M.S.H. 2009. Exhaustive search for over-represented DNA sequence motifs with CisFinder. DNA Research, 16:261–73.

Sinha. 2006. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics, 22.

Sinha, Tompa. 2000. A statistical method for finding transcription factor binding sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 344–354.

Valouev, Johnson D.S., Sundquist, et al. 2008. Genome-wide analysis of transcription factor binding sites based on ChIP-seq data. Nature Methods, 5:829.

Wasserman W.W., Sandelin. 2004. Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5:276–287.

Wu, Ng H.H. 2011. Mark the transition: chromatin modifications and cell fate decision. Cell Research, 21:1388–1390.

Zhang, Chang C.W., Goh W.L., et al. 2011. CENTDIST: discovery of co-associated factors by motif distribution. Nucleic Acids Research, 39:W391.
