It has been shown that minimum free-energy structure for RNAs and RNA–RNA interaction is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the energy model. In contrast, ensemble-based quantities such as melting temperature and equilibrium concentrations can be more reliably predicted. Even structure prediction by sampling from the ensemble and clustering those structures by Sfold has proven to be more reliable than minimum free energy structure prediction. The main obstacle for ensemble-based approaches is the computational complexity of the partition function and base-pairing probabilities. For instance, the space complexity of the partition function for RNA–RNA interaction is O(n4) and the time complexity is O(n6), which is prohibitively large. Our goal in this article is to present a fast algorithm, based on sparse folding, to calculate an upper bound on the partition function. Our work is based on the recent algorithm of Hazan and Jaakkola (2012). The space complexity of our algorithm is the same as that of sparse folding algorithms, and the time complexity of our algorithm is O(MFE(n)ℓ) for single RNA and O(MFE(m, n)ℓ) for RNA–RNA interaction in practice, in which MFE is the running time of sparse folding and ℓ ≤ n (ℓ ≤ n + m) is a sequence-dependent parameter.
1. Introduction
Since the turn of the millennium and the advent of high throughput biology in the post-genome era, startling discoveries have redefined the role of RNA as a key player in the cellular arena. The ribosome and spliceosome are essentially two major RNA machines that together with other structural RNAs, such as microRNAs (miRNA), long intergenic non coding RNAs (lincRNA), small bacterial non coding RNAs, and many other categories of structural RNAs, run the cell at an extent that is comparable to that of protein machinery. For instance, lincRNAs have been recently shown to play sophisticated regulatory roles in mammalian cells, and miRNAs play a significant role in the development of cancer. These discoveries have put RNA together with proteins in the center of focus for research and therapeutic purposes, including personalized medicine. Humanity has just begun to unravel RNA's complicated roles in living cells. RNA is no longer considered a mere information medium from DNA to proteins, but it is rather in the center of attention in molecular and cellular biology research. RNA molecules often function through interaction with other RNAs. In the absence of high-throughput experimental assays to observe RNA structure and RNA–RNA interactions, the problems of RNA structure prediction and RNA–RNA interaction prediction gain the highest priority in bioinformatics.
RNA structure and RNA–RNA interaction prediction have recently received significant attention. The majority of algorithms that have been developed predict the minimum free energy structure (Alkan et al., 2006) or binding sites (Mückstein et al., 2008). However, it is a well-known fact that minimum free energy structure is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the energy model. On the other hand, it has been shown that thermodynamic quantities, such as melting temperature and equilibrium concentrations, that are derived from the partition function, which captures the properties of the whole Boltzmann ensemble rather than those of the most likely structure, can be more reliably predicted (Chitsaz et al., 2009). Even structure prediction by sampling from the ensemble and clustering those structures by Sfold (Ding and Lawrence, 2003) has proven to be more reliable than minimum free energy structure prediction (Ding et al., 2005, 2006).
The main obstacle for ensemble-based approaches is the computational complexity of the partition function and base-pairing probabilities. For instance, the space complexity of computing the partition function for RNA–RNA interaction is O(n4) and the time complexity is O(n6), which is prohibitively large (Chitsaz et al., 2009; Huang et al., 2009). On the other hand, recent progress in sparse folding algorithms has provided fast algorithms for the prediction of the most likely (minimum free energy) structure (Backofen et al., 2011; Mohl et al., 2010; Salari et al., 2010). Although the partition function cannot be calculated exactly using sparsification ideas, it may be approximated. Our goal in this article is to give a fast algorithm, based on sparse folding, to calculate an upper bound on the partition function. Our work is based on the recent algorithm of Hazan and Jaakkola (2012).
2. Related Work
Methods to approximate the partition function for interacting RNAs have not been proposed in the literature. Instead, methods for exact computation of the partition function have been developed, having high time and space complexity. Most notably, Chitsaz et al. (2009) developed an O(n6)–time and O(n4)–space dynamic programming algorithm that computes the partition function of RNA–RNA interaction complexes, thereby providing detailed insights into their thermodynamic properties. Huang et al. (2010) have developed a sampling algorithm that produces a Boltzmann weighted ensemble of RNA–RNA interaction structures for the calculation of interaction probabilities for any given interval on the target RNAs.
In the context of single RNA secondary structure prediction, Lou and Clote (2010) devised a Metropolis Monte Carlo algorithm, called “Wang and Landau” algorithm (Wang and Landau, 2001), to approximate the partition function as well as density of states. Although the computation of the partition function over all secondary structures and over all pseudoknot-free hybridizations can be done by efficient dynamic programming algorithms, the real advantage of Lou and Clote (2010) is in approximating the partition function where pseudoknotted structures are allowed; a context known to be NP-complete (Lyngso and Pedersen, 2000).
In the machine-learning community, there has been extensive research on obtaining nondeterministic and deterministic approximations of the log-partition function. Firstly, sampling and Monte Carlo methods (e.g., Gibbs Sampling and Monte Carlo Markov Chain) have been developed as nondeterministic approaches for estimating the partition function (cf. Koller and Friedman, 2009, and references therein). In high dimensions, obtaining independent samples from a given distribution is difficult since the mixing time is typically exponential in the size of the problem. Therefore, these methods are computationally very demanding, and in practice, they are rarely applied to large-scale problems. Secondly, variational techniques have been extensively developed as a deterministic approach to efficiently estimate the partition function in large-scale problems. In this approach, a simpler distribution is optimized as an approximation to the true distribution in a Kullback-Leibler (KL)-divergence sense. However, the difficulty of this approach comes from: (i) Nonconvexity of the set of feasible distributions (e.g., in “mean field” approximation) (Jordan et al., 1999), and/or (ii) hardness of computing the entropy embedded in the KL objective. Variational upper bounds on the other hand are convex, usually derived by replacing the entropy term in the KL objective with a simpler surrogate function and relaxing constraints on sufficient statistics, hence convexifying the set of feasible distributions (Wainwright et al., 2005).
The basis of our work is from Hazan and Jaakkola (2012), which provides a framework for approximating and bounding the partition function using maximum a posteriori (MAP) inference (i.e., prediction of the most likely structure) on randomly perturbed models. Particularly, they propose to estimate the partition function as the max-statistics of collections of random variables, which is a major topic in extreme value statistics (e.g., see Beirlant et al., 2004). More broadly, there is an emerging body of work on perturbation methods, showing the benefits of explicitly adding noise into the modeling, learning, and inference pipelines (Papandreou and Yuille, 2011; Tarlow et al., 2012).
3. Preliminaries
3.1. Notation
The input nucleic acid sequences are denoted by R and S throughout this article. Function L denotes the length of the input sequence, R is indexed from 1 to L(R), and S is indexed from 1 to L(S) in 5′ to 3′ direction. We refer to the ith nucleotide in R and S by iR and iS respectively. The subsequence from the ith nucleotide to the jth nucleotide in a strand is denoted by [i, j]. An intramolecular base pair between the nucleotides i and j in a strand is called an arc and denoted by a bullet i • j. An intermolecular base pair between the nucleotides iR and iS is called a bond and denoted by a circle iR ∘ iS. We denote an RNA (RNA–RNA interaction) secondary structure by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}, which is mathematically a set of constituent base pairs (arcs and bonds). The collection of all such feasible structures is denoted by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{S}$$\end{document}.
Throughout this article, we denote the partition function by
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}Q : = \sum_ { \mathfrak {s} \in \mathfrak{S}} e^ {- \frac {G (\mathfrak{s})} {RT}} , \tag {1} \end{align*}\end{document}
in which G(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}) is the free energy of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}, R is the gas constant, and T is temperature. For a more detailed presentation of the partition function, see Chitsaz et al. (2009) and McCaskill (1990). We use the Turner energy model for single RNAs (Mathews et al., 1999), and our energy model for RNA–RNA interaction (Chitsaz et al., 2009).
In this article, we consider only the canonical base-pairing system (i.e., each nucleotide is Watson-Crick paired with at most one nucleotide). We also assume there are no pseudoknots in individual secondary structures of R and S, and there are no crossing bonds and zigzags between R and S (Chitsaz et al., 2009). However, an extension of our ideas to non canonical base-pairing systems (Honer zu Siederdissen et al., 2011) and pseudoknotted (crossing and zigzagged) structures (Sheikh et al., 2012) is straightforward. The key requirement for such an extension is the existence of a fast minimum free energy structure prediction algorithm that can incorporate per base-pair energy contributions for the considered class of structures.
3.2. Energy perturbations
Following the Hazan-Jaakkola's approach (Hazan and Jaakkola, 2012), let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\{ \gamma_{i_R \bullet j_R} \} , \{ \overline{\gamma}_{i_R \bullet j_R} \} , \{ \gamma_{i_S \bullet j_S} \} , \{ \overline{ \gamma}_{i_S \bullet j_S} \} , \{ \gamma_{i_R \circ i_S} \} $$\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\{ \overline{ \gamma}_{i_R \circ i_S} \} $$\end{document} be six families of independent and identically distributed (i.i.d.) random variables, which are energy perturbations corresponding to presence and absence of base pairs, following the Gumbel distribution, whose cumulative distribution function is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}F ( x ) : = P [ \gamma \leq x ] = e^{ - e^{ - ( x + C ) }}. \tag{2}\end{align*}\end{document}
Above, C is the Euler's constant, so that the mean of our Gumbel distribution defined in Equation (2) is zero. For every structure \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s} \in \mathfrak{S}$$\end{document}, let the energy perturbation of a structure be
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\gamma (\mathfrak{s}) = \sum_{i \bullet j \in \mathfrak{s}} \gamma_{i \bullet j} + \sum_{i \bullet j \notin \mathfrak{s}} \overline { \gamma} _{i \bullet j} \tag{3}\end{align*}\end{document}
if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document} is an RNA–RNA interaction structure.
4. Upper Bound On The Partition Function
Corollary 1 in Hazan and Jaakkola (2012) states that
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\begin{split} \log Q & \leq E_ \gamma \left[ \max_{\mathfrak{s} \in \mathfrak{S}} \{ - G (\mathfrak{s} ) / RT + \gamma (\mathfrak{s} ) \} \right] \\ & = - E_ \gamma \left[ \min_ {\mathfrak{s} \in \mathfrak{S}} \{ G (\mathfrak{s}) / RT - \gamma (\mathfrak{s} ) \} \right] , \end{split}
\tag{5}\end{align*}\end{document}
in which E is the expectation with respect to γ's. The perturbations γ include terms that depend on base pairs (γ.), and terms that depend on lack of base pairs (\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$ \overline { \gamma}.$$\end{document}). To simplify the incorporation of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$ \overline { \gamma}$$\end{document} terms into the energy model, let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\lambda (\mathfrak{s} ) = \sum_{i \bullet j \in \mathfrak{s}} ( \gamma _{i \bullet j} - \overline { \gamma} _{i \bullet j} ) \tag{6}\end{align*}\end{document}
for RNA–RNA interaction structure. Since the difference between two random variables following the Gumbel distribution follows the logistic distribution, we can rewrite the single RNA λ(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}) as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\lambda (\mathfrak{s}) = \sum_{i \bullet j \in \mathfrak{s}} \lambda_{i \bullet j} \tag{8}\end{align*}\end{document}
and the RNA–RNA interaction one as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\lambda (\mathfrak{s}) = \sum_{i_R \bullet j_R \in \mathfrak{s}} \lambda_{i_R \bullet j_R} + \sum_{i_S \bullet j_S \in \mathfrak{s}} \lambda _{i_S \bullet j_S} + \sum_{i_R \circ i_S \in \mathfrak{s}} \lambda_{i_R \circ i_S} , \tag{9}\end{align*}\end{document}
where λ's are independent identically distributed random variables following the logistic distribution. In that case, Equations (3) and (6) imply
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\gamma {(\mathfrak{s})} = \sum_{i \bullet j} \overline{ \gamma}_{i \bullet j} + \lambda (\mathfrak{s} ) \tag{10}\end{align*}\end{document}
Since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$ \sum \nolimits_{i \bullet j} \overline{ \gamma}_{i \bullet j}$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$ \sum \nolimits_{i \circ j} \overline{ \gamma}_{i \circ j}$$\end{document} are constants inside the minimization in Equation (5), whose expectations are zero because of Equation (2), we can rewrite Equation (5) as follows, where λ follows the logistic distribution:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\log Q \leq - E_{ \lambda} \left[
\min_{\mathfrak{s} \in \mathfrak{S}} \{ G (\mathfrak{s} ) / RT -
\lambda (\mathfrak{s} ) \} \right]. \tag{12}\end{align*}\end{document}
Our algorithm computes the right-hand side of Equation (12), and calculates the upper bound \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$Q_{ub} = \exp ( - E_ \lambda \left[ \min \nolimits_{\mathfrak{s} \in \mathfrak{S}} \{ G (\mathfrak{s} ) / RT - \lambda(\mathfrak{s} ) \} \right] ) $$\end{document}. The minimization inside the expectation is essentially minimum free energy prediction, albeit with a perturbed energy. Recall that the energy perturbation −λ(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}) is the sum of individual base pair perturbations for all base pairs in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\mathfrak{s}$$\end{document}; therefore, incorporation of such a perturbation in fast minimum free energy prediction algorithms, such as Backofen et al. (2011), which exploits sparsity, is straightforward. Particularly, we only need to add −λi•j to the calculation of Lc(i, j) in Backofen et al. (2011) when i and j can form a base pair. Additionally, the scaling of energy by RT in Equation (12) has to be carefully applied to the sparse algorithm. Similarly, we only need to add −λiR∘iS to the calculation of hybrid components in Salari et al. (2010), in addition to proper handling of −λiR•jR and −λiS•jS in intramolecular base pairings.
4.1. Complexity analysis
Note that the perturbed energy is such that the triangle inequality in Property 1 of Backofen et al. (2011) still holds. Therefore, the running time of all sparse folding algorithms based on the triangle inequality (Backofen et al., 2011; Mohl et al., 2010; Salari et al., 2010) is not affected by our energy perturbation. To calculate the expectation above, we sample λ's independently from the logistic distribution until the estimation of the expectation by simple averaging converges. Our experimental results show that the number of samples needed is much lower than the size of the input sequence.
For a single RNA with length n, the time complexity of our upper bound algorithm is O([n2 + MFE(n)]ℓ), in which ℓ ≤ n is the number of samples needed for the expectation estimation to converge, and MFE(n) is the running time of minimum free energy prediction. In this case, the space complexity is O(n2 + MFES(n)), in which MFES(n) is the memory space needed for minimum free energy prediction.
For RNA–RNA interaction with lengths m and n, the time complexity of our algorithm is O([m2 + n2 + MFE(m, n)] ℓ), in which ℓ ≤ n + m is the number of samples needed for the expectation estimation to converge, and MFE(m, n) is the running time of minimum free energy prediction. In this case, the space complexity is O(m2 + n2 + MFES(m, n)), in which MFES(m, n) is the memory space needed for minimum free energy prediction. Usually the running time and space complexity of our upper bound are dominated by those of minimum free energy prediction; therefore, in practice, the time complexity of our upper bound is O(MFE(n)ℓ) for single RNA and O(MFE(m, n)ℓ) for RNA–RNA interaction, and its space complexity is often the same as that of minimum free energy prediction.
5. Results
We implemented the upper bound algorithm in our piRNA package (Chitsaz et al., 2009). To test the overestimation of partition function and also the number of samples needed for the algorithm to converge, we randomly selected 273,512 RNA sequences from Rfam 11.0 database (Gardner et al., 2011; Griffiths-Jones et al., 2003) and computed the upper bounds. Since the partition functions are often large numbers, we report the ensemble energies. The ensemble energy is proportional to the minus log of the partition function and an overestimation of the partition function becomes an underestimation of the ensemble energy. Figure 1 depicts the results of our experiment. The top plot shows the histogram of ensemble energy underestimation calculated by −RT log Q + RT log Qub. The middle plot shows the histogram of ensemble energy underestimation percentage calculated from the ratio log Qub/log Q − 1. This plot shows that for a vast majority of cases this ratio is below 40%. The bottom plot depicts the distribution of ensemble energy −RT log Q in the input dataset. Although this distribution exhibits multiple peaks, the middle distribution, which is the underestimation percentage, has a unimodal behavior. Out of 273,512 RNAs, for 249,622 of them the underestimation is less than 50% of their ensemble energy, and for about half of the sequences (148,762) this difference is smaller than 30%. The number of sequences for which this difference is negative is 2,356, which is less than 1% of the total number of RNAs.
Histograms of (A) ensemble energy underestimation −RT log Q + RT log Qub for all 273,512 sequences in our dataset, (B) percentage of ensemble energy underestimation log Qub / logQ − 1, and (C) ensemble energy −RT log Q in the input dataset.
Figure 2 shows the performance of our algorithm. Figure 2A is the histogram of the number of samples ℓ (iterations) needed for the algorithm to converge. Figure 2B and C show the histogram of the number of iterations per input size and per the log of input size. The vast majority of sequences required less than 15% of their length iterations to converge. The number of iterations starts with 7 and almost all the sequences need less than 40 samples. 243,119 or 89% of RNAs need no more than 20 iterations, and for more than half of the sequences (153,515), the experiment has been done with less than 15 iterations. Therefore, at most 40 iterations are enough for different Rfam RNAs with different lengths. The number of iterations per length ratio for most of the sequences is less than 30%. For 90% of the RNAs in this dataset, this number is between 7% and 25%. Clearly the relation between length and the number of iterations is not linear, and upper bounds for different RNAs with the same length require a different number of iterations. Recall that the space complexity of our algorithm is the same as that of sparse minimum free energy prediction. Therefore, our algorithm is both fast and space efficient in practice.
Histograms of (A) number of iterations ℓ needed for all sequences in our dataset, (B) number of iterations as a percentage of the input sequence length, and (C) number of iterations per the log of the input sequence length.
6. Conclusion
We gave a fast algorithm, based on the recent algorithm of Hazan and Jaakkola (2012), to iteratively compute an upper bound on the partition function of nucleic acids by perturbing energy. Our upper bound algorithm uses a fast minimum free energy prediction in each iteration. Our algorithm preserves the properties on which sparsification methods rely; therefore, we minimally modified sparse folding algorithms (Backofen et al., 2011; Mohl et al., 2010; Salari et al., 2010) to obtain the required fast minimum free energy prediction.
For the lower bound, one can trivially use the single term corresponding to the minimum free energy. The lower bound algorithm of Hazan and Jaakkola (2012) requires modification in order to be applicable to our problem. We leave such accurate lower-bound algorithms for future work.
Footnotes
Author Disclosure Statement
The authors declare that no competing financial interests exist.
References
1.
AlkanC., KarakocE., NadeauJ.H.et al.2006. RNA–RNA interaction prediction and antisense RNA target search. J. of Comp Biol., 13:267–282.
2.
BackofenR., TsurD., ZakovS., Ziv-UkelsonM. 2011. Sparse RNA folding: Time and space efficient algorithms. J. Discrete Algorithms, 9:12–31.
3.
BeirlantJ., GoegebeurY., SegersJ., TeugelsJ.2004. Statistics of Extremes: Theory and Applications. Wiley Series in Probability and Statistics: Wiley, New York.
4.
ChitsazH., SalariR., SahinalpS., BackofenR. 2009. A partition function algorithm for interacting nucleic acid strands. Bioinformatics, 25:i365–i373.
5.
DingY, LawrenceC.E.2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res., 31:7280–7301.
6.
DingY., ChanC.Y., LawrenceC.E. 2005. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11:1157–1166.
7.
DingY., ChanC.Y., LawrenceC.E. 2006. Clustering of RNA secondary structures with application to messenger RNAs. J. Mol. Biol., 359:554–571.
8.
GardnerP.P., DaubJ., TateJ.et al.2011. Rfam: Wikipedia, clans and the decimal release. Nucleic Acids Research, 39:D141–D145.
9.
Griffiths-JonesS., BatemanA., MarshallM.et al.2003. Rfam: an rna family database. Nucleic Acids Research, 31:439–441.
10.
HazanT, JaakkolaT. 2012. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the International Conference on Machine Learning, ICML, Edinburgh, Scotland.
11.
Honer zu SiederdissenC., BernhartS.H., StadlerP.F., HofackerI.L. 2011. A folding algorithm for extended RNA secondary structures. Bioinformatics, 27:i129–136.
12.
HuangF.W.D., QinJ., ReidysC.M., StadlerP.F. 2009. Partition function and base pairing probabilities for RNA-RNA interaction prediction. Bioinformatics, 25:2646–2654.
13.
HuangF.W.D., QinJ., ReidysC.M., StadlerP.F. 2010. Target prediction and a statistical sampling algorithm for RNA-RNA interaction. Bioinformatics, 26:175–181.
14.
JordanM.I., GhahramaniZ., JaakkolaT., SaulL.K. 1999. An introduction to variational methods for graphical models. Machine Learning, 37:183–233,
15.
KollerD, FriedmanN.2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press: Cambridge, Massachusetts.
16.
LouF, CloteP. 2010. Thermodynamics of RNA structures by Wang-Landau sampling. Bioinformatics (ISMB 2010), 26:i278–i286.
MathewsD., SabinaJ., ZukerM., TurnerD. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288:911–940.
19.
McCaskillJ. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119.
20.
MohlM., SalariR., WillS.et al.2010. Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol Biol., 5:39.
21.
MücksteinU., TaferH., BernhartS.H.et al.2008. Translational control by RNA-RNA interaction: Improved computation of RNA-RNA binding thermodynamics, 114–127. ElloumiM., KüngJ., LinialM.et al.BIRD13Communications in Computer and Information Science. Springer: New York.
22.
PapandreouG, YuilleA. 2011. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), 193–200Barcelona, Spain.
23.
SalariR., MöhlM., WillS.et al.2010. Time and space efficient RNA-RNA interaction prediction via sparse folding. 473–490BergerB.Research in Computational Molecular Biology6044Lecture Notes in Computer Science. Springer: Berlin / Heidelberg.
24.
SheikhS., BackofenR., PontyY. Impact of the energy model on the complexity of RNA folding with pseudoknots. 321–333KarkkainenJ., StoyeJ.CPM7354Lecture Notes in Computer Science. Springer: New York.
25.
TarlowD., AdamsR.P., ZemelR.S. 2013. Randomized optimum models for structured prediction. In Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS).
26.
WainwrightM.J., JaakkolaT., WillskyA.S.2005. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335. 2005.
27.
WangF, LandauD.P. 2001. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett, 86:2050–2053.