Prediction of siRNA Potency Using Sparse Logistic Regression

Abstract

RNA interference (RNAi) can modulate gene expression at post-transcriptional as well as transcriptional levels. Short interfering RNA (siRNA) serves as a trigger for the RNAi gene inhibition mechanism, and therefore is a crucial intermediate step in RNAi. There have been extensive studies to identify the sequence characteristics of potent siRNAs. One such study built a linear model using LASSO (Least Absolute Shrinkage and Selection Operator) to measure the contribution of each siRNA sequence feature. This model is simple and interpretable, but it requires a large number of nonzero weights. We have introduced a novel technique, sparse logistic regression, to build a linear model using single-position specific nucleotide compositions which has the same prediction accuracy of the linear model based on LASSO. The weights in our new model share the same general trend as those in the previous model, but have only 25 nonzero weights out of a total 84 weights, a 54% reduction compared to the previous model. Contrary to the linear model based on LASSO, our model suggests that only a few positions are influential on the efficacy of the siRNA, which are the 5’ and 3’ ends and the seed region of siRNA sequences. We also employed sparse logistic regression to build a linear model using dual-position specific nucleotide compositions, a task LASSO is not able to accomplish well due to its high dimensional nature. Our results demonstrate the superiority of sparse logistic regression as a technique for both feature selection and regression over LASSO in the context of siRNA design.

1. Introduction

Under normal conditions, cells are able to recognize and use single-stranded mRNA for translation. However, when cells are infected by certain viruses, double-stranded RNAs are introduced as part of viral replication. Thus removal of these double-stranded RNAs is beneficial to the cells and prevents the further spread of infection. As a result, cells have been engineered to recognize double-stranded regions of RNA as aberrant and degrade them. RNA interference (RNAi) takes advantage of this by introducing artificial double-stranded (sense/antisense strands) short interfering RNA (siRNA) molecules of 20–25 nucleotides long that trigger the degradation of the targeted mRNA through a protein complex referred to as the “RNA Induced Silencing Complex” (RISC), which in turn blocks translation of the targeted protein (Zamore et al., 2000). Only one strand is selectively incorporated as a guide strand into the RISC, and in instances where both sense and antisense strands remain in the RISC, silencing is compromised (Khvorova et al., 2003; Schwarz et al., 2003). Today, RNAi has become a powerful tool for the study of gene functions. However, only a small fraction of siRNAs are found to be effective in RNAi. Therefore, designing functional siRNAs still remains a challenge.

Researchers have discovered some siRNA sequence characteristics known to contribute to siRNA efficacy (Li and Cha, 2007; Amarzguioui and Prydz, 2004; Holen, 2006). Using Artificial Neural Networks, Huesken et al. (2005) revealed that highly potent siRNAs contain A/U at position 1, A at position 10, and U at position 2. Other motifs which are overrepresented in potent siRNAs and underrepresented in ineffective siRNAs included U at position 7, U at position 11, and C at position 19. Reynolds et al. (2004) formulated eight rules based on experimental evidence that also involve position specific requirements for functional siRNAs. Some machine learning techniques, such as Support Vector Machines and Artificial Neural Networks, can predict siRNA potency with high accuracy, but their results are difficult to interpret (Peek, 2007; Huesken et al., 2005). Biologists need simple guides or rules to help design siRNA of great potency and specificity for use in RNAi experiments. Tibshirani (1996) proposed the LASSO (Least Absolute Shrinkage and Selection Operator) method as a feature selection technique using L₁ penalty. This method is attractive since it can shrink coefficients for some of the predictors to exactly zero. Using LASSO linear regression, Vert et al. (2006) proposed a sparse linear model to provide a direct quantitative estimate of siRNA sequence feature's contribution to siRNA efficacy. The main advantages of this model were interpretability and accuracy.

In Frutiger et al. (2008), we used a simplified version of genetic algorithm to build a sparse linear model for siRNA sequence features. Our main idea was to use the Pearson correlation value between the predicted and real efficacy values as the fitness to directly search for better weights than the weights obtained from LASSO as in Vert et al. (2006). In this article, we also built a sparse linear model, but using sparse logistic regression method implemented by Cawley and Talbot (2006).

Shevade and Keerthi (2003) first proposed an algorithm for sparse logistic regression with L₁-norm regularization. Their implementation depended on the parameter of L₁-norm regularization. To find the best parameter for their regression model, a computationally expensive search based on cross-validation was needed. Cawley and Talbot (2006) made use of the fact that L₁-norm regularization problems can be reframed as Bayesian regularization problems so the regularization parameter can be integrated out analytically. Their approach completely eliminated the need for seeking the best parameters

The sparse logistic regression is a technique for regression as well as feature/variable selection. In this article, we decided to explore its dual functions to predict the efficacy of siRNA sequences and to select the siRNA sequence features that are influential on siRNA efficacy. We conducted a numerical analysis on each siRNA sequence feature based on position specific nucleotide compositions and dual-position specific nucleotide compositions. The efficient algorithm of Cawley and Talbot (2006) made it possible for applying the sparse logistic regression to the high (3360) dimensional space data in this study.

2. Methods

In the current study, we used the siRNA data set from Huesken et al. (2005). This data set consists of a total of 2431 randomly selected siRNA sequences targeting 34 mRNA species, assayed through a high-throughput fluorescent reporter gene system. The efficacies of these siRNA sequences had been experimentally determined. The data, referred to as dataset 2431, was split by Huesken et al. (2005) into a training set of 2182 siRNA sequences and a test set of 249 siRNA sequences.

In this article, we will always refer to the nucleotides in the antisense guide strand, with positions in the sequence ordered in the 5' → 3' direction on the guide strand. To encode the single position features of siRNA sequences of length 21, we used a binary feature vector of dimension 84 (4 × 21) that indicates the presence (1) or absence (0) of each nucleotide at each of the 21 positions. To encode dual-position features, we use a_ib_j to represent any combination of the two letters, say a and b, from the alphabet set, A, C, G, and U. Index i is the position for a and j is the position for b, 1 ≤ i ≤ 20, 2 ≤ j ≤ 21, and i < j. Each siRNA sequence of length 21 is then encoded as a binary feature vector of 3360 binary digits ((1 + 2 + 3 + … + 20) × 16 = 3360).

The least squared linear regression method applied to siRNA efficacy regression can be described as follows:

Given a dataset \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ x_i , y_i \} ^m_{i = 1}$$ \end{document} where x_i is a vector of siRNA sequence features and y_i is experimental validated efficacy value, find weights \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = ( \alpha_0 , \alpha_1 , \alpha_2 , \ldots , \alpha_n )$$ \end{document} such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min_{ \alpha} \sum_i ( y_i - f ( x_i ) ) ^2\end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f ( x_i ) = \sum \limits_{j = 1}^n \alpha_jx_{ij} + \alpha_0$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_i = ( x_{i1} , x_{i2} , x_{i3} , \ldots , x_{in} )$$ \end{document} . The LASSO regression method, proposed by Tibshirani (1996), puts a constraint on α

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\min \limits_{ \alpha} \sum \limits_i ( y_i - f ( x_i ) ) ^2$$ \end{document} with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sum \limits_{j = 1}^n \mid \alpha_j \mid \le t$$ \end{document} , where t controls the L₁ norm of α.

This constraint on α produces a sparse model; i.e., many components of α can be zero. The sparsity of LASSO makes it an ideal technique for feature selection.

Following the idea of LASSO, Shevade and Keerthi (2003) studied the following problem for sparse logistic regression, when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$y_i \in \{ - 1 , + 1 \} $$ \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min \limits_{ \alpha} \sum \limits_{i}g ( - y_i f ( x_i ) ) \quad { \rm with} \quad \sum \limits_{j = 1}^n \mid \alpha_j \mid \le t , \tag{1}\end{align*} \end{document}

where g(ξ) = log(1 + e^ξ) which is the negative log-likelihood function associated with the probability model: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { \rm Prob } \ ( y \mid x ) = \frac { 1 } { 1 + e^ { - y \cdot f ( x ) } } \tag { 2 } \end{align*} \end{document}

The problem in (1) is equivalent to the following unconstrained optimization problem: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min_{ \alpha} \left( \sum_i g ( - y_i f ( x_i ) ) + \lambda \sum_{j = 1}^n \mid \alpha_j \mid \right) \ \hbox{for some} \ \gamma > 0 \tag{3}\end{align*} \end{document}

Shevade and Keerthi (2003) used a time-consuming cross-validation procedure to find a good value for γ. Cawley and Talbot (2006) used a Bayesian marginalization technique from Buntine and Weigend (1991) and Williams (1995) to transform problem in (2.2) into the following problem, thus eliminating parameter γ \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min_{ \alpha} \left( \sum_i g ( - y_i f ( x_i ) ) + { \rm N} \sum \limits_{j = 1}^n \mid \alpha_j \mid \right) \tag{4}\end{align*} \end{document}

where N is the number of active (nonzero) model parameters \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$( \alpha_1 , \alpha_2 , \ldots , \alpha_n )$$ \end{document} . Note that the bias term α₀ is not regularized.

Logistic regression takes input with binary dependent values (y values), so we converted the continuous potency values in dataset 2148 into binary. To this end, we chose the value of 0.45 as our threshold value for high potency or low potency of siRNA sequences. Any siRNA sequence with potency values greater than 0.45 will be assigned + 1, and any siRNA sequence with potency values equal to or less than 0.45 will be assigned −1 in our logistic model.

3. Results

3.1. Biological significance of the weights produced by sparse logistic regression

Our weights, as shown in Figure 1, give a clear, concise, and quantitative picture of all the siRNA sequence single-position features that are influential on siRNA efficacy. They imply that the two ends and the seed region (2–7 positions) of the siRNA sequences are important for functional siRNAs with an apparent emphasis on strict low GC content in the seed region. In the study of the thermodynamic stability profiles for siRNAs, Shabalina (2006) discovered that the Gibbs free energy is always lower from positions 1–7 for effective siRNAs, which is in agreement with our weights from positions 1–7. The accessibility of the siRNA and the target mRNA for hybridization, as measured by folding free energy change, is shown to be significantly correlated with efficacy (Lu and Mathews, 2008). Birmingham (2006) noted that off-target gene silencing is associated with the presence of one or more perfect 3' untranslated region (UTR) matches with the seed region of the antisense strand of the siRNA. These studies proved the significance of the seed region for siRNA efficacy. In contrast to Birmingham (2006), our numerical analysis hints that the relevant seed region of siRNAs may consist of six (positions 2–7) rather than seven nucleotides (positions 2–8).

FIG. 1.

This plot displays only the nonzero weights on single-position specific features of siRNA sequences in the training data of dataset 2147 produced by sparse logistic regression. This plot shows that only a few positions, i.e., the 5’ and 3’ ends and the seed region (2–7 positions), are critical for functional siRNAs. The y-axis represents the weight values.

Furthermore, our weights confirm, as in Ui-Tei (2004), that potent siRNAs should have low GC content at the 5' end and high GC content at the 3' end. Our nonzero weights are all negative on G and C features except for C18 and are all positive on A and U features except for A18, A19, U19, indicating that potent siRNAs should have overall low GC content and, more importantly, the well-known asymmetric requirement of GC content at the two ends (Ui-Tei, 2004). Vert et al. (2006) noticed that the two negative weights on C20 and C21 explain why more accurate models are obtained from adding the two overhangs.

Vert et al. (2006) had the zero weights for the following single-position specific features, a total of 30:

C1, A2, U3, U4, G5, C6, G6, U7, A8, C8, A9, C9, G10, U10, G11, A12, C12, U13, C14, A15, G15, U15, C16, U16, A17, C17, G18, U19, G20, U21

Our weights in Table 1 are all zero for these features except for G5 (weight value −0.2598), U13 (weight value 0.2464), and U19 (weight value −0.8379). These three weights still reflect the need for overall low GC content and high GC content at 3’ end.

Table 1.

Weights Produced by Sparse Logistic Model

Position	Weight	Position	Weight	Position	Weight	Position	Weight
A1	1.2386	C1	0	G1	0	U1	2.4044
A2	0	C2	−0.6203	G2	−0.6696	U2	0
A3	0	C3	−0.3292	G3	−0.2303	U3	0
A4	0	C4	−0.3371	G4	−0.4593	U4	0
A5	0	C5	−0.3427	G5	−0.2598	U5	0
A6	0	C6	0	G6	0	U6	0.1831
A7	0	C7	−0.7577	G7	−0.4454	U7	0
A8	0	C8	0	G8	0	U8	0
A9	0	C9	0	G9	0	U9	0
A10	0.4467	C10	0	G10	0	U10	0
A11	0	C11	−0.3173	G11	0	U11	0
A12	0	C12	0	G12	0	U12	0
A13	0	C13	0	G13	0	U13	0.2464
A14	0	C14	0	G14	−0.3260	U14	0.2866
A15	0	C15	0	G15	0	U15	0
A16	0	C16	0	G16	−0.1854	U16	0
A17	0	C17	0	G17	0	U17	0
A18	−0.4392	C18	0.2436	G18	0	U18	0
A19	−1.3398	C19	0	G19	0	U19	−0.8379
A20	0	C20	−0.2800	G20	0	U20	0
A21	0	C21	−0.6295	G21	0	U21	0

Single-position weights measure the importance of each nucleotide in the siRNA sequence. The dual-position weights measure the importance of the presence of two nucleotides in the siRNA sequence. From Table 2, we can see that there is only one feature U3U6, which means U at positions 3 and 6, contributes positively to the siRNA functionality. The top negative features are G1A19, G1U19, C1A19, C1U19, and G1G19, which implies that the two ends of siRNA sequences are important for siRNA functionality. These pairs again show the asymmetric requirement of GC content at the two ends.

Table 2.

Dual-Position Weights Produced by Sparse Logistic Model

Position	U3U6	G1A19	G1U19	C1A19	C1U19	G1G19
A18A20	C7G16	C2C24	G13G17	C1G14	G2C7	C4C5
Weight	0.7070	−2.2994	−1.6163	−1.5891	−0.9861	−0.9814
−0.8607	−0.8691	−0.8684	−0.8385	−0.8178	−0.7914	−0.7852
Position	C1C20	C7G10	C11C13	C10C17	G7C21	G1G7
G4C16	C1C14	G5G14	C7C21	U10U19	G11G12	G1A6
Weight	−0.7822	−0.7816	−0.7799	−0.7634	−0.7607	−0.7562
−0.7536	−0.7265	−0.7150	−0.6739	−0.6733	−0.6585	−0.5917

The dual-position weights offer new insight into the siRNA functionality. G1 and G19 were weighted zero in single-position weights, but when combined together, they make quite negative contribution to siRNA functionality. A19 and U19 have very negative values in single-position weights, and also have negative values in dual-position weights when combined with C1 and G1. Of the 28 nonzero dual-position weights, there were three negative features, C1C20, G7C21, and C7C21, that involve the two overhangs, which suggests that these overhangs don't play primary roles, but rather secondary roles in siRNA functionality.

3.2. Evaluation of efficacy prediction accuracy of different weights

The linear model based on LASSO has 54 nonzero values out of total 84 weights, while the sparse logistic model has 25 nonzero values, a 54% reduction of nonzero weights. Each weight numerically measures the contribution of each sequence feature.

Peek (2007) tested individual siRNA sequence features for their significance of correlation to siRNA efficacy and listed the t-test values of significance for these features. In this study, we used these t-test values of significance as weights to compare with our weights. A small discrepancy is that these t-test values were calculated using all the sequences in dataset 2148 (see Figure 1 in Peek, 2007), while the weights produced by LASSO and our weights were all based on the 2182 sequences in the training data of dataset 2148 since our primary goal is compare our weights with those produced by LASSO.

To illustrate the differences of the three sets of weights, we plotted them in Figure 2. They share the same general trend, but ours are much sparser than the other two. The numerical values of our weights are also displayed in Table 1.

FIG. 2.

These three plots display the three sets of weights on single-position features of siRNA sequences in dataset 2147 using sparse logistic regression, LASSO, and t-test values of signific ance respectively. The three sets of weights share the same general trend, but our sparse logistic weights are much sparser than the other two.

In order to assess the prediction performance of these weights, we first need to explain how the predictions are made in both models: the linear model and the logistic model.

Under the linear model, the predicted efficacy value of a siRNA sequence x_i is obtained with the formula: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}f ( x_i ) = \sum_{j = 1}^n \alpha_jx_{ij} + \alpha_0\end{align*} \end{document}

and under the logistic model, predicted efficacy value is obtained with the formula: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { \rm Prob } \ ( y_i \mid x_i ) = \frac { 1 } { 1 + e^ { - y_i \cdot f ( x_i ) } } \end{align*} \end{document}

When tested on the 249 sequences in the test set of dataset 2147, the weights produced by LASSO have Pearson value of 0.6487, under the liner model. Ours have Pearson value of 0.652 under the logistic model and Perason value of 0.6407 under the linear model. The weights based on t-test (Peek, 2007), trained on the whole dataset 2147, have Pearson value of 0.6663 under the linear model.

Our main goal in this section is to fairly compare our weights with those based on LASSO under the same linear model.

To further measure the prediction performance of these three sets of weights, we used another independent set of 579 siRNA sequences of 19 nucleotides in length from Sætrom (2004) referred to as dataset 579.

Under the linear model, when tested on dataset 579, our weights have Pearson value of 0.5053, while weights based on LASSO have Pearson value of 0.5066 and weights based on t-test (Peek 2007) have 0.4725. There might be two reasons for the performance drop. The first reason is that the dataset 579 is independent. The second reason is that dataset 579 has sequences of length 19, i.e., without the two overhangs. As observed in Huesken et al. (2005), the inclusion of the 2 overhang nucleotides improves prediction accuracy.

The weights based on t-test (Peek, 2007) have a high Pearson value partially due to the fact that they were trained on the whole dataset 2147, and showed a lower Pearson value when tested on the independent dataset 579.

Dual-position weights are produced by sparse logistic regression from the training data in dataset 2147 and then tested on the test data of dataset 2147. These weights have 3360 positions and have 28 nonzero values. The weights have Pearson value of 0.5616 under linear model. We applied LASSO to the same training data, and obtained 1953 nonzero weights with Pearson value of 0.2248, which makes these weights infeasible for any practical use. This demonstrates the better performance of sparse logistic regression as a feature selection method than LASSO.

Both Vert et al. (2006) and Peek (2007) studied the predictive power of position independent sequence motifs (N-grams) and showed that the inclusion of these motifs can improve the prediction accuracy, but the primary factor is still position specific nucleotide compositions. In this article, we only explored the position specific features of siRNAs.

3.3. Evaluation of the robustness of different weights under the change of data

In this section, we assess the advantage of our sparse logistic model, which is much sparser than LASSO model, in terms of stability of its numerical calculation. We added dataset 279 to the training data of dataset 2147, restricted to the first 19 nucleotides, and examined how weights in different models respond to this change. Here we use Pearson values to measure the similarity of two sets of weights, not for prediction accuracy. As seen in Figure 3, sparse logistic weights have Pearson value of 0.9226 for the two sets of weights obtained before and after adding dataset 279 to the training data of dataset 2147. The weights based on LASSO have only Pearson value of 0.7024 before and after. They can be easily influenced by change of data, while sparse logistic weights are more resilient to such changes and therefore are more stable.

FIG. 3.

(Top plot) Two sets of weights on single-position features of siRNA sequences in dataset 2147 alone, and in the union of dataset 2147 and dataset 249 using sparse logistic regression, respectively. (Bottom plot) Two sets of weights on single-position features of siRNA sequences in the same two datasets as in the top plot, but using LASSO, respectively. These two plots show that our sparse logistic weights are more robust to the change of data.

In this experiment, we observed the LASSO weights changed their values at 66 positions out of 76 positions, while sparse logistic weights only changed at 31 positions out of 76 positions. Sparse logistic weights were less perturbed than the LASSO weights.

LASSO weights produced from the training data of dataset 2147 have Pearson value 0.6487 on the test data of dataset 2147. The predictions of LASSO weights trained on the same training set but restricted to the first 19 nucleotides have Pearson value of 0.6251 on the same test data, restricted to the first 19 nucleotides. After adding dataset 249 to the same training set, the new weights based on LASSO have Pearson value 0.6280 on the same test data, increasing the Pearson value by 0.0029. Sparse logistic weights changed their Pearson value from 0.652 to 0.6216 after switching from 21 to 19 nucleotides. After adding dataset 249, the new weights have Pearson value of 0.6404, increased by 0.0153. The sparse logistic weights generalize better than LASSO weights (0.0153 compared to 0.0029).

4. Conclusion

In this article, we tackled the siRNA design problem by utilizing sparse logistic regression technique developed by Cawley and Talbot (2006). Our results suggest that only a few positions, i.e., the 5’ and 3’ ends and the seed region (2–7 positions) of siRNA sequences, are critical for siRNA functionality. Our sparse linear model has a much smaller set of features, compared to LASSO model, for biologists to use when designing potent siRNA sequences. These features better represent the characteristics of effective siRNAs, validated by their efficacy prediction accuracy, robustness to change of data, and consistency with previous findings. Occam's Razor tells us that if two models both can explain the same data equally well, the simpler one is preferred. Our results imply that sparse logistic regression is an effective tool in the study of functional siRNA sequence properties.

Footnotes

Acknowledgments

W.H. thanks Houghton College for providing financial support.

Disclosure Statement

No competing financial interests exist.

References

Amarzguioui

, and Prydz

2004. An algorithm for selection of functional siRNA sequences. Biochem. Biophys. Res. Commun., 316, 1050–1058.

Birmingham

, Anderson

E.M.

, Reynolds

, et al. 2006. 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat. Methods, 3, 199–204.

Buntine

W.L.

, and Weigend

A.S.

1991. Bayesian back-propagation. Complex Syst. 5, 603–643.

Cawley

G.C.

, and Talbot

L.C.

2006. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22, 2348–2355.

Elbashir

S.M.

, Martinez

, Patkaniowska

, et al. 2001. Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. EMBO J. 20, 6877–6888.

Frutiger

, Collins

, and Hu

2008. Using random pertubation method to improve efficacy prediction of siRNA sequences. Proc. Int. Conf. Biomed. Eng. Inform. 194–199.

Huesken

, Lange

, Mickanin

, et al. 2005. Design of a genome-wide siRNA library using an artificial neural network. Nat. Biotechnol. 23, 995–1001.

Holen

2006. Efficient prediction of siRNAs with siRNArules 1.0: an open-source JAVA approach to siRNA algorithms. RNA, 12, 1620–1625.

Khvorova

, Reynolds

, and Jayasena

S.D.

2003. Functional siRNAs and miRNAs exhibit strand bias. Cell, 115, 209–216.

10.

, and Cha

2007. Predicting siRNA efficiency. Cell Mol. Life Sci., 64, 1785–92.

11.

Z.J.

, and Mathews

D.H.

2008. Efficient siRNA selection using hybridization thermodynamics. Nucleic Acids Res. 36, 640–647.

12.

Peek

A.S.

2007. Improving model predictions for RNA interference activities that use support vector machine regression by combining and filtering features. BMC Bioinform., 6, 182.

13.

Reynolds

, Leake

, Boese

, et al. 2004. Rational siRNA design for RNA interference. Nat. Biotechnol. 22, 326–330.

14.

Schwarz

, Hutvagner

, Du

, et al. 2003. Asymmetry in the assembly of the RNAi enzyme complex. Cell, 115, 199–208.

15.

Shabalina

S.A.

, Spiridonov

A.N.

, and Ogurtsov

A.Y.

2006. Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinform. 12, 7–65

16.

Shevade

S.K.

, and Keerthi

S.S.

2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19, 2246–2253.

17.

Sætrom

2004. Predicting the efficacy of short oligonucleotides in antisense and RNAi experiments with boosted genetic programming. Bioinformatics, 20, 3055–3063.

18.

Tibshirani

1996. Regression shrinkage and selection via the lasso. J. R Statist. Soc. B, 58, 267–288.

19.

Ui-Tei

, Naito

, and Takahashi

2004. Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res. 32, 936–948.

20.

Vert

J.P.

, Foveau

, Lajaunie

, et al. 2006. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinform., 7, 520.

21.

Williams

P.M.

1995. Bayesian regularization and pruning using a Laplace prior. Neural Comput. 7, 117–143.

22.

Zamore

P.D.

, Tuschl

, Sharp

P.A.

, et al. 2000. RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell, 101, 25–33.