Maximum Stacking Base Pairs: Hardness and Approximation by Nonlinear Linear Programming-Rounding

Abstract

Maximum stacking base pairs is a fundamental combinatorial problem from ribonucleic acid (RNA) secondary structure prediction under the energy model. The basic maximum stacking base pairs problem can be described as follows: given an RNA sequence, find a maximum number of base pairs such that each chosen base pair has at least one parallel and adjacent partner (i.e., they form a stacking). This problem is NP-hard, whether the candidate base pairs follow the biology principle or are given explicitly as input. This article investigates a restricted version of this problem in which the base pairs are given as input and each base is associated with at most k (a constant) base pairs. We show that this restricted version is APX-hard, in both its unweighted and weighted forms. Moreover, by a nonlinear LP-rounding method, we present an approximation algorithm with a factor of $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ . On simulated data, the observed approximation factor of our algorithm is much better than this theoretical bound.

1. Introduction

A ribonucleic acid (RNA) is single stranded and can be viewed as a sequence of nucleotides (also known as bases, denoted by A, C, G, and U). It plays an important role in regulating genetic and metabolic activities according to the central dogma of biology. To understand the biological functions of RNAs in detail, we first need to know their structures. The primary structure of an RNA strand is formed by the order of the nucleotides. An RNA folds into a three-dimensional structure by forming hydrogen bonds between nonconsecutive bases that are complementary, such as the Watson–Crick pairs C-G and A-U and the wobble pair G-U. The three-dimensional arrangement of the atoms in the folded RNA molecule is the tertiary structure; the collection of base pairs in the tertiary structure is the secondary structure. The secondary structure tells us where there are additional connections between the bases and where the RNA molecule could be folded. Tinoco and Bustamante (1999) claimed that “the folding of RNA is hierarchical, since secondary structure is much more stable than tertiary folding”; as a result, the tertiary folding mostly obeys the secondary structure. Since the three-dimensional structure determines the function of the RNA to some extent, predicting the secondary structure of RNA becomes a key problem in studying RNA more broadly and deeply.

Nussinov et al. (1978) initiated the computational study of RNA secondary structure prediction, but the problem is still not well solved. The biggest impediment is the existence of pseudoknots, which are formed by two base pairs that interleave when the RNA sequence is arranged in linear order.

Lyngsø and Pedersen (2000) have proven that determining the optimal secondary structure possibly with pseudoknots is NP-hard (nondeterministic polynomial-hard) under special energy functions. Akutsu (2000) has shown that it remains NP-hard even if the secondary structure is required to be planar. There are many positive results when pseudoknots are not allowed. Nussinov et al. (1978), Nussinov and Jacobson (1980), Zuker and Stiegler (1981), Zuker and Sankoff (1984), Sankoff (1985), and Lyngsø et al. (1999) have computed the optimal RNA secondary structure in $O (n^{3})$ time and $O (n^{2})$ space by dynamic programming. Rivas and Eddy (1999), Uemura et al. (1999), and Akutsu (2000) have presented polynomial-time algorithms when the types of pseudoknots are limited.

To predict secondary structures with pseudoknots, most research focuses on the base pairs individually. The nearest neighbor energy model has been widely studied (Akutsu, 2000; Lyngsø and Pedersen, 2000; Ieong et al., 2003; Lyngsø, 2004): the energy of each base pair depends not only on its two bases but also on the adjacent base pairs. According to the Tinoco model (Tinoco et al., 1973), an RNA structure can recursively be decomposed into loops with independent free energy; the energy of each loop is an affine function of the number of unpaired bases and the number of interior base pairs. The only type of loop without unpaired bases is formed by two adjacent and parallel base pairs, which is called a stacking; the negative energy of such stackings stabilizes the RNA structure. Lyngsø (2004) initiated the study of the maximum stacking base pairs problem. He showed this problem to be NP-hard and devised a polynomial-time exact algorithm over a fixed-size alphabet Σ and with a subset of Σ × Σ of legal pair types. Unfortunately, this algorithm has very high complexities of $Ω (n^{80})$ time and $Ω (n^{80})$ space even for the canonical alphabet $\{A, C, G, U\}$ .

In all the above results, the base pairs are given implicitly, that is, under some fixed biology principle (for example, the Watson–Crick base pairs A-U and C-G), any two such bases can form a base pair. As an alternative, the set of candidate base pairs may be given explicitly as input, because there could be additional conditions from comparative analysis that prevent two bases from forming a pair. The version with explicit base pairs generalizes the maximum stacking base pairs problem, so it remains NP-hard. Jiang (2010) improved the approximation factor for the maximum stacking base pairs problem with explicit base pairs to 5/2. Zhou et al. (2017) further improved the approximation factor to 7/3 by a local search method. Once the candidate base pairs are taken as input, we can naturally restrict and generalize the problem. Similar to the research on graph problems, one basic restriction is to bound the degree of each base, that is, we require each base to be associated with at most a constant number k of candidate base pairs. This problem is called k-MSBP. In addition, since the optimal secondary structure has the minimum negative energy, we generalize this problem to a weighted version, where each base pair is given a weight (representing energy) and the goal is to compute a maximum-weight set of stacking base pairs; this problem is called k-MWSBP. As far as we know, there are no previous results on k-MWSBP.

The main contributions of this article are: (1) we show that k-MSBP is APX-hard for k ≥ 2; (2) we devise an approximation algorithm with a factor of $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ for k-MWSBP (and k-MSBP) by the nonlinear LP-rounding (linear programming-rounding) method. For k-MSBP, although the approximation factor in Jiang (2010) and Zhou et al. (2017) is better, the time complexity is as high as $O (n^{14})$ , while our algorithm takes linear time besides solving a linear program. Moreover, our simulations show a much better practical performance compared with this theoretical bound.

2. Preliminaries

Let $S = s_{1} s_{2} \dots s_{n}$ be an RNA sequence of n bases. A base pair is a pair of two nonconsecutive bases, say s_i, s_j, where $|i - j| > 1$ , and is denoted by (s_i, s_j). The degree of a base s_i is the number of base pairs that are associated with s_i. Two base pairs are compatible if they do not share a common base. A secondary structure of S is a set of mutually compatible base pairs $(s_{i_{1}}, s_{j_{1}}),$ $(s_{i_{2}}, s_{j_{2}}),$ $\dots,$ $(s_{i_{r}}, s_{j_{r}}) .$ Two base pairs, such as (s_i, s_j) and ( $s_{i + 1}$ , $s_{j - 1}$ ), are mutually adjacent. A stacking is constituted by two mutually adjacent base pairs. A feasible secondary structure FS(S) of an RNA sequence S fulfills that if (s_i, s_j) is a base pair in FS(S), then either ( $s_{i + 1}$ , $s_{j - 1}$ ) or ( $s_{i - 1}$ , $s_{j + 1}$ ) or both are base pairs in FS(S).
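The definitions above translate directly into a small feasibility check. The following Python sketch is illustrative only: the function name `is_feasible` and the representation of a base pair as a tuple of 1-based positions are assumptions made here, not notation from this article.

```python
def is_feasible(structure):
    """Check whether a set of base pairs is a feasible secondary structure.

    Each pair is a tuple (i, j) of 1-based positions in the sequence
    (an assumed representation).
    """
    pairs = {(min(i, j), max(i, j)) for i, j in structure}
    # A base pair joins two nonconsecutive bases: |i - j| > 1.
    if any(j - i <= 1 for i, j in pairs):
        return False
    # Mutual compatibility: no base occurs in two chosen pairs.
    bases = [b for pair in pairs for b in pair]
    if len(bases) != len(set(bases)):
        return False
    # Stacking: every pair needs an adjacent partner on the inside or outside.
    return all((i + 1, j - 1) in pairs or (i - 1, j + 1) in pairs
               for i, j in pairs)
```

For instance, `is_feasible({(1, 8), (2, 7)})` holds because the two pairs stack, whereas `is_feasible({(1, 8)})` does not, because the single pair has no adjacent partner.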

Now we present the formal definition of the problems studied in this article.

Definition 1. Maximum Stacking Base Pairs with Degree Bounded, k-MSBP.

Input: An RNA sequence S and a set of candidate base pairs BP, where the degree of each base is bounded by k.

Output: A feasible secondary structure FS(S) such that the number of the base pairs is maximized.

Definition 2. Maximum Weighted Stacking Base Pairs with Degree Bounded, k-MWSBP.

Input: An RNA sequence S and a set of candidate weighted base pairs BP, where the degree of each base is bounded by k.

Output: A feasible secondary structure FS(S) such that the total weight of the base pairs is maximized.

k-MSBP is a special case of k-MWSBP with all base pairs having a weight of 1. In the next section, we will prove that k-MSBP is APX-hard, which implies that k-MWSBP is also APX-hard. Note that an approximation algorithm for k-MWSBP also works on k-MSBP.

3. Hardness Results

In this section, we will show that k-MSBP is APX-hard by a reduction from the Maximum Independent Set Problem on Cubic Graphs (3-MIS).

Theorem 1. It is NP-hard to approximate 3-MIS within $\frac{140}{139} - \epsilon$ (Berman and Karpinski, 1999).

For the sake of simplicity, we just prove that 2-MSBP is NP-hard and then APX-hard, which means that k-MSBP is also APX-hard for all k ≥ 2.

Given a cubic graph $G = (V, E)$ as an input for 3-MIS, for each vertex $v \in V$ , we construct an RNA subsequence of 32 bases: $A_{v}^{1}, A_{v}^{2}, \dots, A_{v}^{18}, U_{v}^{1}, \dots, U_{v}^{14}$ , and 18 candidate base pairs: ( $A_{v}^{1}$ , $U_{v}^{5}$ ), ( $A_{v}^{2}$ , $U_{v}^{4}$ ), ( $A_{v}^{3}$ , $U_{v}^{2}$ ), ( $A_{v}^{4}$ , $U_{v}^{1}$ ), ( $A_{v}^{7}$ , $U_{v}^{8}$ ), ( $A_{v}^{8}$ , $U_{v}^{7}$ ), ( $A_{v}^{11}$ , $U_{v}^{14}$ ), ( $A_{v}^{12}$ , $U_{v}^{13}$ ), ( $A_{v}^{15}$ , $U_{v}^{11}$ ), ( $A_{v}^{16}$ , $U_{v}^{10}$ ), ( $A_{v}^{4}$ , $U_{v}^{9}$ ), ( $A_{v}^{5}$ , $U_{v}^{8}$ ), ( $A_{v}^{6}$ , $U_{v}^{4}$ ), ( $A_{v}^{7}$ , $U_{v}^{3}$ ), ( $A_{v}^{8}$ , $U_{v}^{12}$ ), ( $A_{v}^{9}$ , $U_{v}^{11}$ ), ( $A_{v}^{10}$ , $U_{v}^{7}$ ), ( $A_{v}^{11}$ , $U_{v}^{6}$ ). See Figure 1 for an example. There are two feasible secondary structures of this RNA subsequence: the first 10 base pairs (which is maximum, see the solid matching edges in Fig. 1) and the last 8 base pairs (see the dotted matching edges). The complete RNA sequence RG is the concatenation of all the subsequences, which are split by peg bases.
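As a sanity check on this gadget, the sketch below verifies that the first 10 and the last 8 candidate base pairs each form a feasible secondary structure and that no base has degree above 2. It assumes, as the listing of the subsequence suggests, that $A_{v}^{i}$ sits at position i and $U_{v}^{j}$ at position 18 + j; the helper names are hypothetical.

```python
# Candidate base pairs of one vertex gadget, written as (index of A, index of U).
SOLID = [(1, 5), (2, 4), (3, 2), (4, 1), (7, 8),
         (8, 7), (11, 14), (12, 13), (15, 11), (16, 10)]
DOTTED = [(4, 9), (5, 8), (6, 4), (7, 3),
          (8, 12), (9, 11), (10, 7), (11, 6)]


def to_positions(pairs):
    # Assumed layout: A^1..A^18 occupy positions 1..18, U^1..U^14 positions 19..32.
    return {(a, 18 + u) for a, u in pairs}


def is_feasible(pairs):
    # Compatible (no shared base) and every pair has an adjacent partner.
    bases = [b for pair in pairs for b in pair]
    compatible = len(bases) == len(set(bases))
    stacked = all((i + 1, j - 1) in pairs or (i - 1, j + 1) in pairs
                  for i, j in pairs)
    return compatible and stacked


def max_degree(pairs):
    # Largest number of candidate pairs that touch a single base.
    count = {}
    for pair in pairs:
        for b in pair:
            count[b] = count.get(b, 0) + 1
    return max(count.values())


solid, dotted = to_positions(SOLID), to_positions(DOTTED)
print(is_feasible(solid), len(solid))    # True 10
print(is_feasible(dotted), len(dotted))  # True 8
print(max_degree(solid | dotted))        # 2
```

The check covers only the two structures named in the text; it does not enumerate all feasible structures of the gadget.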

FIG. 1.

The RNA subsequence and base pairs corresponding to a vertex. RNA, ribonucleic acid.

To make use of the edges, we first orient the edges of G in such a way that each vertex has at most two incoming edges and at most two outgoing edges. This can be done as follows: iteratively find edge-disjoint cycles in G and, in each cycle, orient the edges to form a directed cycle, until no cycle remains. The remaining edges then form a forest. For each tree in the forest, choose one of its nodes of degree one to be the root and orient all edges in the tree away from the root. This orientation satisfies the desired property: a vertex on an oriented cycle has at least one incoming and one outgoing edge, while a vertex on no cycle has all three of its edges in the forest, hence one incoming edge from its parent and two outgoing edges.
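The orientation procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the graph is simple and given as a vertex list and an edge list, and the function name `orient_edges` and the cycle search by a search tree are choices made for this illustration rather than details taken from the article.

```python
def orient_edges(vertices, edges):
    """Orient an undirected simple graph by peeling cycles, then rooting trees.

    Returns a list of arcs (u, v), meaning the edge is directed from u to v.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def find_cycle():
        # Grow a search tree per component; any non-tree edge closes a cycle.
        seen = set()
        for root in adj:
            if root in seen:
                continue
            parent = {root: None}
            stack = [root]
            while stack:
                u = stack.pop()
                for w in adj[u]:
                    if w == parent[u]:
                        continue
                    if w in parent:
                        up = [u]  # tree path from u up to the root
                        while parent[up[-1]] is not None:
                            up.append(parent[up[-1]])
                        down, x = [], w  # climb from w until the paths meet
                        while x not in up:
                            down.append(x)
                            x = parent[x]
                        return up[:up.index(x) + 1] + down[::-1]
                    parent[w] = u
                    stack.append(w)
            seen.update(parent)
        return None

    arcs = []
    cycle = find_cycle()
    while cycle is not None:
        # Orient the cycle consistently and remove its edges.
        for u, v in zip(cycle, cycle[1:] + cycle[:1]):
            arcs.append((u, v))
            adj[u].discard(v)
            adj[v].discard(u)
        cycle = find_cycle()

    # The remaining edges form a forest: root each tree at a leaf.
    seen = set()
    for root in adj:
        if root in seen or len(adj[root]) != 1:
            continue
        seen.add(root)
        stack = [root]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    arcs.append((u, w))  # directed away from the root
                    stack.append(w)
    return arcs
```

On a cubic graph such as the complete graph on four vertices, every edge receives exactly one direction and each vertex ends up with at most two incoming and at most two outgoing arcs.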

For each vertex v, we name $A_{v}^{1}$ , $A_{v}^{3}$ as two incoming interfaces and $A_{v}^{12}$ , $A_{v}^{16}$ as two outgoing interfaces. Note that the degree of the interfaces is at most 1. Initially all interfaces are free. Let (u, v) be an edge from u to v in G.

Construct new base pairs as follows:

If $A_{u}^{12}$ is a free outgoing interface of u and $A_{v}^{1}$ is a free incoming interface of v, then delete the two base pairs: ( $A_{v}^{1}$ , $U_{v}^{5}$ ), ( $A_{v}^{2}$ , $U_{v}^{4}$ ), and make two new base pairs: ( $A_{u}^{12}$ , $U_{v}^{5}$ ), ( $A_{u}^{13}$ , $U_{v}^{4}$ ).

If $A_{u}^{12}$ is a free outgoing interface of u and $A_{v}^{3}$ is a free incoming interface of v, then delete the four base pairs: ( $A_{v}^{3}$ , $U_{v}^{2}$ ), ( $A_{v}^{4}$ , $U_{v}^{1}$ ), ( $A_{v}^{4}$ , $U_{v}^{9}$ ), ( $A_{v}^{5}$ , $U_{v}^{8}$ ), and make four new base pairs: ( $A_{u}^{12}$ , $U_{v}^{2}$ ), ( $A_{u}^{13}$ , $U_{v}^{1}$ ), ( $A_{u}^{13}$ , $U_{v}^{9}$ ), ( $A_{u}^{14}$ , $U_{v}^{8}$ ).

If $A_{u}^{16}$ is a free outgoing interface of u and $A_{v}^{1}$ is a free incoming interface of v, then delete the two base pairs: ( $A_{v}^{1}$ , $U_{v}^{5}$ ), ( $A_{v}^{2}$ , $U_{v}^{4}$ ), and make two new base pairs: ( $A_{u}^{16}$ , $U_{v}^{5}$ ), ( $A_{u}^{17}$ , $U_{v}^{4}$ ).

If $A_{u}^{16}$ is a free outgoing interface of u and $A_{v}^{3}$ is a free incoming interface of v, then delete the four base pairs: ( $A_{v}^{3}$ , $U_{v}^{2}$ ), ( $A_{v}^{4}$ , $U_{v}^{1}$ ), ( $A_{v}^{4}$ , $U_{v}^{9}$ ), ( $A_{v}^{5}$ , $U_{v}^{8}$ ), and make four new base pairs: ( $A_{u}^{16}$ , $U_{v}^{2}$ ), ( $A_{u}^{17}$ , $U_{v}^{1}$ ), ( $A_{u}^{17}$ , $U_{v}^{9}$ ), ( $A_{u}^{18}$ , $U_{v}^{8}$ ).

Lemma 1. Let G be a cubic graph on N vertices. Then, there exists an independent set of size l in G if and only if there exists a feasible secondary structure of size $8 N + 2 l$ .

Proof. From our construction, each RNA subsequence can have a feasible secondary structure with either 10 base pairs or 8 base pairs. The crucial observation is that, if there is an edge (u, v) between two vertices in G, then the RNA subsequences corresponding to u and v cannot both have feasible secondary structures with 10 base pairs.

So, if there is an independent set I of size l in G, for $u \in I$ , choose a feasible secondary structure with 10 base pairs; for $u \notin I$ , choose a feasible secondary structure with 8 base pairs. We then obtain a feasible secondary structure with $8 N + 2 l$ base pairs.

Conversely, if there is a feasible secondary structure FS(RG) of size f, let I consist of all vertices u such that the subsequence corresponding to u contributes 10 base pairs; it is obvious that I is an independent set, and $f = 8 (N - | I |) + 10 | I | = 8 N + 2 | I | .$

Theorem 2. 2-MSBP is APX-hard.

Proof. Note that the maximum stacking base pairs instance we construct from an instance of 3-MIS is actually an instance of 2-MSBP. Let I be an instance of 3-MIS and OPT(I) be its optimal solution. Let f(I) be the instance of 2-MSBP constructed from I and $OPT(f(I))$ be its optimal solution. Let $y'$ be some other solution of f(I) and $g(y')$ be the corresponding solution of I. The reduction is an L-reduction since it fulfills the following two conditions:

$|OPT(f(I))| = 8N + 2|OPT(I)| \leq 34 \cdot |OPT(I)|$ , since $|OPT(I)| \geq N/4$ .

$|OPT(I)| - |g(y')| = (|OPT(f(I))| - 8N)/2 - (|y'| - 8N)/2 = (|OPT(f(I))| - |y'|)/2$ .
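The two conditions can be read with explicit constants. The symbols $\alpha$ and $\beta$ below are the customary names for the two parameters of an L-reduction and are introduced here only for illustration. The bound $|OPT(I)| \geq N/4$ holds because a greedy independent set in a graph of maximum degree 3 discards at most four vertices per chosen vertex.

$\begin{matrix} |OPT(f(I))| = 8N + 2|OPT(I)| \leq 8 \cdot 4|OPT(I)| + 2|OPT(I)| = 34|OPT(I)|, \text{ so } \alpha = 34, \\ |OPT(I)| - |g(y')| = \frac{1}{2} (|OPT(f(I))| - |y'|), \text{ so } \beta = \frac{1}{2} . \end{matrix}$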

This completes the proof.

4. Approximation Algorithms for k-MWSBP by LP-Rounding

In this section, we will design an approximation algorithm for k-MWSBP by a nonlinear LP-rounding method. First, we formulate k-MWSBP as a 0–1 Integer Linear Program (ILP). Let $S = s_{1} s_{2} \dots s_{n}$ be an RNA sequence of n bases; let BP be the set of candidate base pairs. For each base pair (s_i, s_j), let $\omega_{i, j}$ denote its weight, and define a 0–1 variable $x_{i, j}$ : if (s_i, s_j) is chosen into the feasible secondary structure, then $x_{i, j}$ = 1; otherwise $x_{i, j}$ = 0.

ILP-(1):

$\max \sum_{(s_{i}, s_{j}) \in BP} \omega_{i, j} x_{i, j}$

subject to

$\sum_{j = 1}^{n} (x_{i, j} + x_{j, i}) \leq 1$ , for $i = 1, \dots, n$ (1)

$x_{i, j} - (x_{i - 1, j + 1} + x_{i + 1, j - 1}) \leq 0$ , for $i \leq j$ (2)

$x_{i, j} \in \{0, 1\}$

Constraints (1) guarantee that the chosen base pairs are mutually compatible. Constraints (2) require that each chosen base pair has at least one adjacent partner. Relaxing the integrality constraints of ILP-(1) yields the following linear program.

LP-(2):

$\max \sum_{(s_{i}, s_{j}) \in BP} \omega_{i, j} x_{i, j}$

subject to

$\sum_{j = 1}^{n} (x_{i, j} + x_{j, i}) \leq 1$ , for $i = 1, \dots, n$

$x_{i, j} - (x_{i - 1, j + 1} + x_{i + 1, j - 1}) \leq 0$ , for $i \leq j$

$0 \leq x_{i, j} \leq 1$
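To make the two constraint families concrete, the following Python sketch checks constraints (1) and (2) for an arbitrary assignment, integral or fractional, and solves ILP-(1) by exhaustive search on tiny instances. It is illustrative only: the names `satisfies_constraints` and `brute_force_ilp` and the dictionary representation are assumptions, variables of non-candidate pairs are treated as 0, and exhaustive search takes exponential time, so it is no substitute for an ILP or LP solver.

```python
from itertools import product


def satisfies_constraints(x, tol=1e-9):
    """Check constraints (1) and (2) for x = {(i, j): value}, with i < j."""
    load = {}
    for (i, j), value in x.items():
        load[i] = load.get(i, 0.0) + value
        load[j] = load.get(j, 0.0) + value
    # Constraint (1): the pairs touching any base sum to at most 1.
    if any(total > 1 + tol for total in load.values()):
        return False
    # Constraint (2): x_ij <= x_(i-1)(j+1) + x_(i+1)(j-1).
    return all(
        value <= x.get((i - 1, j + 1), 0.0) + x.get((i + 1, j - 1), 0.0) + tol
        for (i, j), value in x.items()
    )


def brute_force_ilp(weights):
    """Maximize the total weight over all 0-1 assignments (tiny inputs only)."""
    pairs = list(weights)
    best_value, best_x = 0.0, {p: 0 for p in pairs}
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        value = sum(weights[p] * x[p] for p in pairs)
        if value > best_value and satisfies_constraints(x):
            best_value, best_x = value, x
    return best_value, best_x


# Hypothetical instance: a stack of three pairs plus a heavy pair (3, 5)
# that conflicts with (3, 6) and has no adjacent candidate partner.
weights = {(1, 8): 1.0, (2, 7): 1.0, (3, 6): 1.0, (3, 5): 10.0}
value, chosen = brute_force_ilp(weights)
print(value)  # 3.0
```

In this instance the heavy pair can never be selected, because constraint (2) forces its variable to 0; the optimum is the stack of three unit-weight pairs.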

Algorithm 1 Nonlinear LP-rounding
1: Solve LP-(2) and obtain an optimal solution $x_{i, j} = x_{i, j}^{*}$ .
2: Rounding strategy: Pr( $x'_{i, j}$ = 1) = 1 − $e^{- a \sqrt{x_{i, j}^{*}}}$ , where $0 < a \leq 1$ is a parameter fixed in the analysis.
3: Choose the base pair (s_i, s_j) if and only if $x'_{i, j}$ = 1.
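The rounding step can be sketched in Python as follows. The sketch assumes that the optimal LP solution is already available from some LP solver and is passed in as a dictionary; the function name `round_lp_solution` and its parameters are hypothetical. After the independent rounding, it keeps only the pairs that conflict with no other rounded pair and that have such a pair as an adjacent partner, which is one reading of the events $A_{i, j}$ and $B_{i, j}$ used in the proof of Theorem 3, so the returned set is always a feasible secondary structure.

```python
import math
import random


def round_lp_solution(x_star, a, rng):
    """Round an LP solution x_star = {(i, j): value in [0, 1]}, with i < j.

    `a` is the rounding parameter and `rng` a random.Random instance.
    """
    # Independent rounding: Pr[x' = 1] = 1 - exp(-a * sqrt(x*)).
    picked = {pair for pair, value in x_star.items()
              if rng.random() < 1.0 - math.exp(-a * math.sqrt(value))}
    # Count how many picked pairs touch each base.
    use = {}
    for i, j in picked:
        use[i] = use.get(i, 0) + 1
        use[j] = use.get(j, 0) + 1
    # A picked pair is effective if no other picked pair shares a base with it.
    effective = {(i, j) for i, j in picked if use[i] == 1 and use[j] == 1}
    # Keep effective pairs that stack with an effective adjacent partner.
    return {(i, j) for i, j in effective
            if (i - 1, j + 1) in effective or (i + 1, j - 1) in effective}


# Hypothetical fractional solution on a stack of three candidate pairs.
x_star = {(1, 8): 0.5, (2, 7): 0.5, (3, 6): 0.5}
print(round_lp_solution(x_star, 0.5, random.Random(1)))
```

Apart from solving the LP, every step is a single pass over the candidate base pairs, in line with the linear running time mentioned in the Introduction.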

Theorem 3. Algorithm 1 is an approximation algorithm for 2-MWSBP with an expected factor of $\frac{32 e^{3}}{8 e - 1} .$

Proof. To obtain a feasible secondary structure, the chosen base pairs must be mutually compatible; we say that a chosen base pair is effective if it is compatible with every other chosen base pair. Let $A_{i, j}$ be the event that the base pair (s_i, s_j) is effective. Since each base has degree at most 2, at most two other candidate base pairs share a base with (s_i, s_j), say (s_k, s_i) and (s_j, s_l). To make (s_i, s_j) effective, it requires $x'_{i, j} = 1$ , $x'_{k, i} = 0$ , and $x'_{j, l} = 0 .$ Thus,

$\begin{matrix} P r (A_{i, j}) = (1 - e^{- a \sqrt{x_{i, j}^{*}}}) \times e^{- a \sqrt{x_{k, i}^{*}}} \times e^{- a \sqrt{x_{j, l}^{*}}} \\ \geq (1 - e^{- a \sqrt{x_{i, j}^{*}}}) \times e^{- 2 a \sqrt{1 - x_{i, j}^{*}}} \\ \geq c \sqrt{x_{i, j}^{*}} \end{matrix}$

where c is a constant to be determined later; the second line uses constraint (1), which gives $x_{k, i}^{*} \leq 1 - x_{i, j}^{*}$ and $x_{j, l}^{*} \leq 1 - x_{i, j}^{*}$ . For an effective base pair (s_i, s_j) to be chosen into the feasible secondary structure, at least one of ( $s_{i - 1}$ , $s_{j + 1}$ ) and ( $s_{i + 1}$ , $s_{j - 1}$ ) must also be effective. Let $B_{i, j}$ be the event that the base pair (s_i, s_j) takes part in constituting a stacking, and let $z_{i, j}$ be a 0–1 variable, where $z_{i, j}$ = 1 if $B_{i, j}$ happens and $z_{i, j}$ = 0 otherwise. Then $\begin{matrix} P r (z_{i, j} = 1) = P r (A_{i, j}) [1 - (1 - P r (A_{i - 1, j + 1})) (1 - P r (A_{i + 1, j - 1}))] \\ \geq c \sqrt{x_{i, j}^{*}} [1 - (1 - c \sqrt{x_{i - 1, j + 1}^{*}}) (1 - c \sqrt{x_{i + 1, j - 1}^{*}})] \\ \geq c \sqrt{x_{i, j}^{*}} \{1 - {[\frac{2 - c (\sqrt{x_{i - 1, j + 1}^{*}} + \sqrt{x_{i + 1, j - 1}^{*}})}{2}]}^{2}\} \end{matrix}$ , where the last step follows from the inequality of arithmetic and geometric means.

Since $\sqrt{x_{i - 1, j + 1}^{*}} + \sqrt{x_{i + 1, j - 1}^{*}} \geq \sqrt{x_{i - 1, j + 1}^{*} + x_{i + 1, j - 1}^{*}}$ and, by constraint (2), $x_{i - 1, j + 1}^{*} + x_{i + 1, j - 1}^{*} \geq x_{i, j}^{*}$ , we have $\sqrt{x_{i - 1, j + 1}^{*}} + \sqrt{x_{i + 1, j - 1}^{*}} \geq \sqrt{x_{i, j}^{*}}$ . Therefore, $\begin{matrix} P r (z_{i, j} = 1) \geq c \sqrt{x_{i, j}^{*}} \cdot [1 - (\frac{2 - c \sqrt{x_{i, j}^{*}}}{2})^{2}] \\ = c \sqrt{x_{i, j}^{*}} \cdot (c \sqrt{x_{i, j}^{*}} - \frac{c^{2} x_{i, j}^{*}}{4}) \\ \geq x_{i, j}^{*} \cdot (c^{2} - \frac{c^{3}}{4}) \end{matrix}$ , where the last step uses $\sqrt{x_{i, j}^{*}} \leq 1$ .

Let APP denote the total weight of the solution output by Algorithm 1, OPT denote the total weight of an optimal solution, which is also the optimal value of ILP-(1), and OPT(LP) denote the optimal value of LP-(2). Obviously, $OPT(LP) \geq OPT$ . Then we have $\begin{matrix} E (APP) = E (\sum_{(s_{i}, s_{j}) \in BP} \omega_{i, j} \cdot z_{i, j}) \\ = \sum_{(s_{i}, s_{j}) \in BP} \omega_{i, j} P r (z_{i, j} = 1) \\ \geq \sum_{(s_{i}, s_{j}) \in BP} [\omega_{i, j} \cdot x_{i, j}^{*} \cdot (c^{2} - \frac{c^{3}}{4})] \\ = (c^{2} - \frac{c^{3}}{4}) \cdot \sum_{(s_{i}, s_{j}) \in BP} \omega_{i, j} x_{i, j}^{*} \\ = (c^{2} - \frac{c^{3}}{4}) \cdot OPT(LP) \\ \geq (c^{2} - \frac{c^{3}}{4}) \cdot OPT \end{matrix}$

Let $t = \sqrt{x_{i, j}^{*}}$ . The function $F (t) = \frac{(1 - e^{- a t}) e^{- 2 a \sqrt{1 - t^{2}}}}{t}$ , with $0 < a \leq 1$ and $0 < t \leq 1$ , reaches its minimum value when t tends to 0: $\lim_{t \to 0} F (t) = \lim_{t \to 0} \frac{(1 - e^{- a t}) e^{- 2 a \sqrt{1 - t^{2}}}}{t} = \frac{a}{e^{2 a}} .$

By setting $a = \frac{1}{2}$ and $c = \frac{1}{2 e}$ , we obtain the best approximation ratio of $\frac{32 e^{3}}{8 e - 1}$ for 2-MWSBP.

Theorem 4. Algorithm 1 is an approximation algorithm for k-MWSBP with an expected factor of $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ .

Proof. The difference between k-MWSBP and 2-MWSBP is the degree of the bases. In a k-MWSBP instance, a base pair (s_i, s_j) is incompatible with up to k − 1 other base pairs at each of its two bases, say $(s_{t_{1}}, s_{i})$ , $(s_{t_{2}}, s_{i})$ , $\dots$ , $(s_{t_{k - 1}}, s_{i})$ and $(s_{j}, s_{l_{1}})$ , $(s_{j}, s_{l_{2}})$ , $\dots$ , $(s_{j}, s_{l_{k - 1}})$ . By constraint (1), $x_{t_{1}, i}^{*} + \dots + x_{t_{k - 1}, i}^{*} + x_{i, j}^{*} \leq 1$ and $x_{i, j}^{*} + x_{j, l_{1}}^{*} + \dots + x_{j, l_{k - 1}}^{*} \leq 1$ , so each of these 2k − 2 values is at most $1 - x_{i, j}^{*}$ . Then the probability that (s_i, s_j) is effective is

$\begin{matrix} P r (A_{i, j}) = (1 - e^{- a \sqrt{x_{i, j}^{*}}}) \cdot e^{- a \sqrt{x_{t_{1}, i}^{*}}} \dots e^{- a \sqrt{x_{t_{k - 1}, i}^{*}}} \cdot e^{- a \sqrt{x_{j, l_{1}}^{*}}} \dots e^{- a \sqrt{x_{j, l_{k - 1}}^{*}}} \\ \geq (1 - e^{- a \sqrt{x_{i, j}^{*}}}) \times e^{- (2 k - 2) a \sqrt{1 - x_{i, j}^{*}}} \\ \geq c \sqrt{x_{i, j}^{*}} \end{matrix}$

The probability that (s_i, s_j) takes part in constituting stacking is $\begin{matrix} P r (z_{i, j} = 1) = P r (A_{i, j}) [1 - (1 - P r (A_{i - 1, j + 1})) (1 - P r (A_{i + 1, j - 1}))] \\ \geq c \sqrt{x_{i, j}^{*}} \cdot [1 - (1 - c \sqrt{x_{i - 1, j + 1}^{*}}) (1 - c \sqrt{x_{i + 1, j - 1}^{*}})] \\ \geq c \sqrt{x_{i, j}^{*}} \{1 - {[\frac{2 - c (\sqrt{x_{i - 1, j + 1}^{*}} + \sqrt{x_{i + 1, j - 1}^{*}})}{2}]}^{2}\} \\ \geq c \sqrt{x_{i, j}^{*}} \cdot [1 - (\frac{2 - c \sqrt{x_{i, j}^{*}}}{2})^{2}] \\ = c \sqrt{x_{i, j}^{*}} \cdot (c \sqrt{x_{i, j}^{*}} - \frac{c^{2} x_{i, j}^{*}}{4}) \\ \geq x_{i, j}^{*} \cdot (c^{2} - \frac{c^{3}}{4}) \end{matrix}$

Let $t = \sqrt{x_{i, j}^{*}}$ . The function $F (t) = \frac{(1 - e^{- a t}) e^{- (2 k - 2) a \sqrt{1 - t^{2}}}}{t}$ , with $0 < a \leq 1$ and $0 < t \leq 1$ , reaches its minimum value, again when t tends to 0:

$\lim_{t \to 0} F (t) = \lim_{t \to 0} \frac{(1 - e^{- a t}) e^{- (2 k - 2) a \sqrt{1 - t^{2}}}}{t} = \frac{a}{e^{(2 k - 2) a}} .$

By setting $a = \frac{1}{2 k - 2}$ and $c = \frac{1}{(2 k - 2) e},$ we obtain the best approximation ratio of $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ for k-MWSBP.
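The choice of constants can be checked numerically. The short Python sketch below (function names are hypothetical) takes $a = \frac{1}{2 k - 2}$ , which is where the derivative of $a e^{- (2 k - 2) a}$ vanishes, computes $c = a e^{- (2 k - 2) a}$ , and evaluates the factor $1 / (c^{2} - c^{3} / 4)$ .

```python
import math


def best_a(k):
    # d/da [a * exp(-(2k - 2) a)] = 0 exactly at a = 1 / (2k - 2).
    return 1.0 / (2 * k - 2)


def approximation_factor(k):
    a = best_a(k)
    c = a * math.exp(-(2 * k - 2) * a)  # equals 1 / ((2k - 2) e)
    return 1.0 / (c ** 2 - c ** 3 / 4.0)


for k in (2, 3, 4):
    print(k, round(approximation_factor(k), 2))
```

The printed values are about 31, 121, and 270 for k = 2, 3, and 4, matching the closed form $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ .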

5. Simulations

In this section, we report experiments on randomly generated simulated data. The length of the RNA sequences ranges from n = 500 to n = 2000, and we choose three values for k: k = 2, k = 3, and k = 4. For comparison, besides running the LP-rounding approximation algorithm, we also solve ILP-(1) to obtain the optimal solutions (although the running time becomes very high as n gets large). The performance is summarized as follows.

5.1. Performance evaluation

For k = 2, the experimental results are shown in Table 1 and Figure 2. As stated in Theorem 3, the theoretical approximation factor for 2-MWSBP is about 31. From the experimental results in Table 1, the actual approximation factor is about 5.41 on average, which is much better than the theoretical bound.

FIG. 2.

The plot graph of the optimal solution and the approximate solution generated by Algorithm 1, when k = 2.

Table 1.

Values of Optimal Solution [OPT(I)], Approximation Solution [APP(I)], and the Approximation Factor, When k = 2

n	OPT(I)	APP(I)	Approximation ratio
500	311	59	5.27
600	379	73	5.19
700	429	98	4.38
800	499	91	5.48
900	556	107	5.20
1000	628	104	6.04
1100	679	126	5.39
1200	770	187	4.12
1300	781	142	5.50
1400	853	155	5.50
1500	924	167	5.53
1600	976	202	4.83
1700	1059	206	5.14
1800	1121	181	6.19
1900	1218	183	6.66
2000	1287	212	6.07
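The reported average can be reproduced from the table, assuming (as the figure 5.41 suggests) that it is the arithmetic mean of the per-instance ratios OPT(I)/APP(I). The variable names in this sketch are illustrative.

```python
# (OPT(I), APP(I)) for n = 500, 600, ..., 2000, copied from Table 1.
TABLE_1 = [(311, 59), (379, 73), (429, 98), (499, 91),
           (556, 107), (628, 104), (679, 126), (770, 187),
           (781, 142), (853, 155), (924, 167), (976, 202),
           (1059, 206), (1121, 181), (1218, 183), (1287, 212)]

# Per-instance approximation ratio, then its arithmetic mean.
ratios = [opt / app for opt, app in TABLE_1]
mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 2))  # 5.41
```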

For k = 3, the experimental results are shown in Table 2 and Figure 3. Similarly, the actual approximation factor is about 14.16. The theoretical approximation factor for 3-MWSBP is about 121. Again, the experimental results show much better performance compared with the theoretical results.

FIG. 3.

The plot graph of the optimal solution and the approximate solution generated by Algorithm 1, when k = 3.

Table 2.

Values of Optimal Solution [OPT(I)], Approximation Solution [APP(I)], and the Approximation Factor, When k = 3

n	OPT(I)	APP(I)	Approximation ratio
500	469	37	12.68
600	537	41	13.10
700	613	35	17.51
800	708	55	12.87
900	822	62	13.26
1000	915	67	13.66
1100	1060	84	12.62
1200	1146	81	14.15
1300	1188	99	12.00
1400	1271	90	14.12
1500	1394	100	13.94
1600	1483	97	15.29
1700	1530	93	16.45
1800	1660	110	15.09
1900	1778	109	16.31
2000	1898	141	13.46

For k = 4, the experimental results are shown in Table 3 and Figure 4. While the practical approximation factor fluctuates more in this case, the average approximation factor is about 26.38. The theoretical approximation factor for 4-MWSBP is about 270.

FIG. 4.

The plot graph of the optimal solution and the approximate solution generated by Algorithm 1, when k = 4.

Table 3.

Values of Optimal Solution [OPT(I)], Approximation Solution [APP(I)], and the Actual Approximation Factor, When k = 4

n	OPT(I)	APP(I)	Approximation ratio
500	623	53	11.75
600	732	30	24.40
700	861	33	26.09
800	979	48	20.40
900	1092	34	32.12
1000	1245	40	31.13
1100	1462	56	26.11
1200	1579	61	25.88
1300	1599	53	30.18
1400	1729	53	32.62
1500	1889	79	23.92
1600	1897	70	27.11
1700	2056	88	23.36
1800	2223	75	29.64
1900	2399	85	28.22
2000	2532	87	29.10

From our experimental results, we can conclude that the actual performance of our algorithm is much better than the corresponding theoretical bound, probably because the theoretical result is based on worst-case analysis.

5.2. Runtime analysis

As discussed above, solving ILP-(1) takes a long time when n grows larger. Hence, we compare the running time of solving the ILP with that of our LP-rounding approximation algorithm. The results are summarized in Table 4 and Figure 5. As shown in Figure 5, solving the ILP takes much more time as n increases, while the running time of our approximation algorithm grows steadily. This is probably because solving the ILP takes exponential time in the worst case, while the approximation algorithm takes polynomial time.

FIG. 5.

The plot graph of the running times of the ILP and LP-rounding algorithms. ILP, Integer Linear Program; LP, linear programming.

Table 4.

The Running Time (Seconds) of Solving the Integer Linear Program and the LP-Rounding Approximation Algorithm

n	ILP (k = 2)	LP-rounding (k = 2)	ILP (k = 3)	LP-rounding (k = 3)	ILP (k = 4)	LP-rounding (k = 4)
500	70.22	55.41	67.29	54.94	2684.34	54.87
600	112.78	86.63	341.41	86.60	2172.00	86.93
700	195.53	128.84	394.73	129.11	157306.57	129.13
800	167.45	137.94	622.53	136.24	9838.53	135.90
900	168.85	156.30	308.30	154.99	23961.93	154.99
1000	189.89	181.52	2949.96	182.16	4657.14	181.53

ILP, Integer Linear Program; LP, linear programming.

6. Concluding Remarks

In this article, we studied a restricted version of the maximum stacking base pairs problem, which originates from RNA secondary structure prediction. We showed that this problem is APX-hard when the degree of each base is bounded by a constant k, regardless of whether the base pairs are weighted. In addition, we designed the first approximation algorithm for k-MWSBP, with a factor of $\frac{32 {(k - 1)}^{3} e^{3}}{8 (k - 1) e - 1}$ , by a nonlinear LP-rounding method. Our experimental results indicate a much better performance than this theoretical approximation factor, and our algorithm is much faster than solving the ILP, which takes exponential time. How to improve the approximation factor for k-MWSBP is an interesting open problem.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This research is supported by NSF of China under Grant Nos. 61872427, 61732009, and 61628207, by NSF of Shandong Province under grant ZR201702190130. H.J. is also supported by Young Scholars Program of Shandong University. P.L. is also supported by Key Research and Development Program of Yantai City (No. 2017ZH065) and CERNET Innovation Project (No. NGII20161204).

References

Akutsu. 2000. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Appl. Math. 104, 45–62.

Berman, and Karpinski. 1999. On some tighter inapproximability results. Lecture Notes in Computer Science, 1644, LNCS. Pgs. 200–209.

Ieong, Kao, M.Y., Lam, T.W., et al. 2003. Predicting RNA secondary structure with arbitrary pseudoknots by maximizing the number of stacking pairs. J. Comput. Biol. 10, 981–995.

Jiang. 2010. Approximation algorithms for predicting RNA secondary structures with arbitrary pseudoknots. IEEE/ACM Trans. Comput. Biol. Bioinform. 7, 323–332.

Lyngsø, R.B., and Pedersen, C.N.S. 2000. RNA pseudoknot prediction in energy based models. J. Comput. Biol. 7, 409–427.

Lyngsø, R.B. 2004. Complexity of pseudoknot prediction in simple models. Lecture Notes in Computer Science, 3142, LNCS. Pgs. 919–931.

Lyngsø, R.B., Zuker, and Pedersen, C.N.S. 1999. Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics, 15, 440–445.

Nussinov, and Jacobson, A.B. 1980. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. U. S. A. 77, 6309–6313.

Nussinov, Pieczenik, Griggs, J.R., et al. 1978. Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68–82.

Rivas, and Eddy, S.R. 1999. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 285, 2053–2068.

Sankoff. 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45, 810–825.

Tinoco Jr, Borer, P.N., Dengler, et al. 1973. Improved estimation of secondary structure in ribonucleic acids. Nature, 246, 40–42.

Tinoco Jr, and Bustamante. 1999. How RNA folds. J. Mol. Biol. 293, 271–281.

Uemura, Hasegawa, Kobayashi, et al. 1999. Tree adjoining grammars for RNA structure prediction. Theor. Comput. Sci. 210, 277–303.

Zhou, Jiang, Guo, et al. 2017. A new approximation algorithm for the maximum stacking base pairs problem from RNA secondary structures prediction. Lecture Notes in Computer Science, 10627, LNCS. Pgs. 85–92.

Zuker, and Sankoff. 1984. RNA secondary structures and their prediction. Bull. Math. Biol. 46, 591–621.

Zuker, and Stiegler. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133–148.