Strong/Weak Feature Recognition of Promoters Based on Position Weight Matrix and Ensemble Set-Valued Models

Abstract

In this article, we propose a method to recognize the strong/weak property of promoters from their nucleotide sequences. To the best of our knowledge, this is the first attempt to predict the strong/weak property of promoters. First, a position weight matrix (PWM) is used to evaluate the contributions of the nucleotides to the promoter strength. Then, a set-valued model is used to describe the relation between the nucleotide sequence and the strength. Considering the small sample size and class imbalance of promoter data, we propose an ensemble approach to predict the strong/weak property. The proposed method is applied to 60 $\bm{\delta}^E$ promoters of Escherichia coli. The results show the effectiveness of the proposed method. This article provides a simple way for biologists to evaluate the strong/weak feature of promoters from the nucleotide sequence.

1. Introduction

One of the major tasks of metabolic engineering is to coordinate gene expression in organisms (Boyle and Silver, 2012; Keasling, 2012). The rapid development of synthetic biology enables the tuning of gene expression, and promoters are a widely used and straightforward tool for tuning gene expression by controlling the production of mRNA (Hammer et al., 2006). When a promoter is changed to regulate gene expression, the first task is to recognize the strong/weak property of the promoters in a promoter library. The strength of a promoter is the rate of transcription of the gene controlled by this promoter. A strong (active) promoter has a high rate of transcription, and a weak (inactive) promoter has a relatively low rate of transcription.

The relations between promoter sequences and their strengths have been studied for decades (Mulligan and McClure, 1986; Straney et al., 1994). Most of these methods were based on experiments and tried to find consensus sequences that contribute to promoter strength. Their drawbacks are that the experiments are time consuming and too costly to apply to a wide range of promoters. A better way to recognize the strength feature is to establish a statistical model to predict the strength. This approach has gained much attention recently, and related works can be found in Kiryu et al. (2005), Rhodius and Gross (2010), and Meng et al. (2013). These works tried to establish a mathematical model to predict the exact value of the promoter strength. However, because the promoter strength changes dynamically and the relation between promoter sequence and strength is highly nonlinear, it is difficult to develop a general method for establishing a promoter strength model. In fact, in many cases of metabolic engineering, promoters that are relatively strong or weak are used to tune the gene expression of a metabolic pathway. In these cases, we only need to know whether the promoter is strong or weak, not the exact strength value.

However, to the best of our knowledge, few works have reported how to recognize the strong/weak property of a promoter. Motivated by these facts, we study the promoter feature from another perspective. Namely, we develop a method to recognize whether a promoter is strong or weak from its DNA sequence; the exact promoter strength is not required, and only rough strength information is needed to predict the promoter feature. We regard this work as a supplement to existing promoter strength studies.

The first step in studying the relation between promoter strength and sequence is to analyze the characteristics of the promoter sequences (DNA sequences). Many previous works have discussed the analysis of DNA sequences (Roth et al., 1998; Yang and Ramsey, 2015; Yu et al., 2015; Li et al., 2017). As in other DNA sequence analyses, feature extraction is essential to sequence modeling. In terms of machine learning, if the features reflect the data characteristics well, it is relatively easy to establish a model. The position weight matrix (PWM) is a classical and widely used feature for showing the contribution of the promoter sequence to the strength (Stormo, 1990; Kiryu et al., 2005; Rhodius and Gross, 2010; Yang and Ramsey, 2015). The PWM is also used in this work as the nucleotide feature matrix.

The recognition of the strong/weak property can be regarded as a classification problem. However, promoter data are often imbalanced (He and Garcia, 2009), and the number of samples is relatively small. These two data characteristics make promoter recognition difficult. Hence, many classification methods cannot be used directly to establish the promoter strength model, and special strategies are needed to improve the recognition accuracy. Two strategies are employed in this article. First, a boosting approach is proposed to establish an optimal training set that is more balanced. Second, an ensemble approach based on the similarity measurement between the test set and the training set is proposed to produce the prediction result. To the best of our knowledge, this is the first time such a method has been used for promoter strength studies. The proposed method is applied to 60 $\delta^E$ promoters of Escherichia coli, and the results show the effectiveness of the proposed method.

2. Methods

2.1. Materials and objectives

A promoter library of E. coli is considered in this study. It includes 52 $\delta^E$ natural promoters of E. coli, and the promoter sequences span positions −35 to +20 relative to the transcription start site (Rhodius and Gross, 2010). The strength was measured in vivo and in vitro using a green fluorescent protein (GFP) reporter.

Our goal is to extract the features of the DNA sequence and to establish a model that predicts the strong/weak property of the promoters. The promoters are divided into two classes: strong and weak. Namely, promoters whose strength is at least the defined threshold C are regarded as strong promoters, and the others are regarded as weak promoters. Formally, a promoter is classified as follows:
\begin{align*} \begin{cases} \text{the promoter is } S, & \text{if activity} \ge C; \\ \text{the promoter is } W, & \text{if activity} < C \end{cases} \tag{1} \end{align*}

where S denotes the strong promoters, W denotes the weak promoters, and C is the predefined strength threshold. That is, a promoter library is divided into two groups by C.
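As a minimal sketch of Equation (1), the thresholding can be written in Python as follows. The function name `label_promoters` and the activity values are hypothetical and serve only as an illustration; they are not data from the promoter library.

```python
def label_promoters(activities, threshold):
    """Return 'S' when activity >= threshold and 'W' otherwise (Equation 1)."""
    return ["S" if a >= threshold else "W" for a in activities]


# Hypothetical activities in arbitrary units (illustration only).
activities = [0.2, 1.5, 0.9, 3.1]
labels = label_promoters(activities, threshold=1.0)
print(labels)  # ['W', 'S', 'W', 'S']
```

Counting the two labels afterwards gives a quick view of how imbalanced the library is for a given choice of C.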

Based on these assumptions, a promoter library is described as in Table 1, where N denotes the number of samples in the promoter library and M denotes the number of nucleotides in a promoter.

Table 1.

Description of Promoters

	Nucleotide 1	Nucleotide 2	$\cdots$	Nucleotide M	Strong/weak feature
Sample 1	A	T	$\cdots$	T	S
Sample 2	G	G	$\cdots$	G	W
Sample 3	C	T	$\cdots$	A	W
$\cdots$	$\cdots$	$\cdots$	$\cdots$	$\cdots$	$\cdots$
Sample N	T	A	$\cdots$	A	S

2.2. Position weight matrix

The classical PWM method evaluates the contribution of each nucleotide based on its frequency of occurrence. The effectiveness of this method has been demonstrated in many studies (Stormo, 1990, 2000; Kiryu et al., 2005). Hence, we also use the PWM to describe the effect of the nucleotide type on the promoter strength. The PWM is defined as in Stormo (2000) and Rhodius and Gross (2010):
\begin{align*} \omega_{i,k} = \ln \left[ \frac{(n_{b,i} + 0.005N) / (N + 0.08N)}{p_b} \right] \quad (i = 1, 2, \cdots, M;\; k = 1, 2, \cdots, N) \tag{2} \end{align*}
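The following sketch shows one way Equation (2) could be evaluated for a small set of aligned sequences. It rests on assumptions that the text does not state explicitly: $n_{b,i}$ is read as the number of sequences carrying base $b$ at position $i$, $p_b$ as the background frequency of base $b$ (taken as a uniform 0.25 here), and $\omega_{i,k}$ as the weight of the base that sequence $k$ carries at position $i$. The function name `pwm_weights` and the toy sequences are hypothetical.

```python
import math


def pwm_weights(sequences, background=None):
    """Return weights[k][i], the weight of the base at position i of sequence k."""
    n_seq = len(sequences)
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    if background is None:
        # Assumption: uniform background frequencies p_b.
        background = {b: 0.25 for b in "ACGT"}
    # counts[i][b] plays the role of n_{b,i}.
    counts = [{b: 0 for b in "ACGT"} for _ in range(length)]
    for seq in sequences:
        for i, base in enumerate(seq):
            counts[i][base] += 1
    weights = []
    for seq in sequences:
        row = []
        for i, base in enumerate(seq):
            # Smoothed frequency with the constants of Equation (2).
            freq = (counts[i][base] + 0.005 * n_seq) / (n_seq + 0.08 * n_seq)
            row.append(math.log(freq / background[base]))
        weights.append(row)
    return weights


# Toy aligned sequences (illustration only, not promoter data).
sequences = ["ATT", "GGG", "CTA", "TAA"]
weights = pwm_weights(sequences)
print(len(weights), len(weights[0]))  # 4 3
```

Bases that are frequent at a position receive positive weights and rare ones negative weights, so each promoter is mapped to a numeric vector of length M.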

2.3. Set-valued model of the promoter strength

The strong/weak feature is a set-valued feature, which cannot be described by a traditional continuous value. In this article, we use a linear set-valued model (Wang et al., 2010; Chen et al., 2012) to describe the relationship between the promoter sequence and the strong/weak feature. This study is the first to use a set-valued model to describe the relationship between the promoter sequence and the strength feature. A simple linear model describes the relation between the nucleotide weights and the strength, namely
\begin{align*} y_k = b_k + \sum_{i=1}^{M} \omega_{i,k} \theta_i + d_k, \quad k = 1, 2, \cdots, N \tag{3} \end{align*}

where $y_k$ is the strength of the $k$th promoter in a library, $\omega_{i,k}$ is the weight of the nucleotide at the $i$th position, $b_k$ and $\theta_i$ are the parameters of the linear model, and $d_k$ is the measurement noise.
We assume that $d_k \sim \mathcal{N}(0, \delta^2)$, that is, the noise is normally distributed with zero mean and variance $\delta^2$. Note that $y_k$ is unknown, and we only know the strong/weak property of the promoter, denoted by
\begin{align*} I_k = \begin{cases} 1, & \text{if } y_k \ge C \\ 0, & \text{if } y_k < C \end{cases} \tag{4} \end{align*}

where C is the predefined threshold in Equation (1). $I_k = 1$ means the promoter is a strong promoter and $I_k = 0$ means the promoter is a weak promoter. Let $\bm{\theta} = [b_k, \theta_1, \theta_2, \cdots, \theta_M]^T$ and $\phi_k = [1, \omega_{1,k}, \omega_{2,k}, \cdots, \omega_{M,k}]^T$, so that Equation (3) reads $y_k = \phi_k^T \bm{\theta} + d_k$.
The classical promoter strength prediction estimates $\bm{\theta}$ given $\{y_k, \phi_k, k = 1, 2, \cdots, N\}$. Here, $y_k$ is unknown, and we estimate $\bm{\theta}$ given $\{I_k, \phi_k\}$.
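To make the set-valued observation concrete, the sketch below builds a regressor $\phi_k$ and returns only the indicator $I_k$ of Equation (4), while the strength $y_k$ of Equation (3) stays hidden inside the function. The helper names, parameter values, and weights are hypothetical, and Gaussian noise is assumed.

```python
import random


def regressor(weights_k):
    """phi_k = [1, omega_{1,k}, ..., omega_{M,k}]."""
    return [1.0] + list(weights_k)


def observe_indicator(phi, theta, threshold, noise_sd, rng):
    """Draw y_k from the linear model and report only I_k (Equation 4)."""
    y = sum(p * t for p, t in zip(phi, theta)) + rng.gauss(0.0, noise_sd)
    return 1 if y >= threshold else 0


rng = random.Random(0)
theta = [0.5, 1.0, -2.0]     # hypothetical [b, theta_1, theta_2]
phi = regressor([0.6, 0.1])  # hypothetical PWM weights of one promoter
# With zero noise the hidden strength is 0.5 + 0.6 - 0.2 = 0.9.
strong = observe_indicator(phi, theta, threshold=0.8, noise_sd=0.0, rng=rng)
weak = observe_indicator(phi, theta, threshold=1.0, noise_sd=0.0, rng=rng)
print(strong, weak)  # 1 0
```

The estimation problem is then to recover the parameter vector from pairs of indicators and regressors alone.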

We use maximum likelihood estimation to estimate the parameter $\bm{\theta}$, namely,
\begin{align*} \hat{\bm{\theta}} = \arg \max_{\bm{\theta}} L(\bm{\theta}) \tag{5} \end{align*}

The likelihood function of $\bm{\theta}$ is given by
\begin{align*} L(\bm{\theta}) = p(I_1, I_2, \cdots, I_N \vert \bm{\theta}) = \prod_{k=1}^{N} p(I_k \vert \bm{\theta}) \tag{6} \end{align*}

where p denotes the probability. If $I_k = 1$, we have
\begin{align*} \begin{split} p(I_k = 1 \vert \bm{\theta}) & = p(d_k \ge C - \phi_k^T \bm{\theta} \vert \bm{\theta}) \\ & = \int_{C - \phi_k^T \bm{\theta}}^{\infty} \frac{1}{\sqrt{2\pi}\,\delta} e^{-\frac{x^2}{2\delta^2}} \, dx \\ & = 1 - F(C - \phi_k^T \bm{\theta}) \end{split} \tag{7} \end{align*}
where $F(\cdot)$ is the cumulative distribution function of $d_k$.

Similarly, if $I_k = 0$, we have $p(I_k = 0 \vert \bm{\theta}) = F(C - \phi_k^T \bm{\theta})$. Taking the logarithm of the product in Equation (6), the log-likelihood is given by
\begin{align*} \ln L(\bm{\theta}) = \sum_{k: I_k = 1} \ln \left( 1 - F(C - \phi_k^T \bm{\theta}) \right) + \sum_{k: I_k = 0} \ln F(C - \phi_k^T \bm{\theta}) \tag{8} \end{align*}
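The log-likelihood can be evaluated numerically by summing $\ln(1 - F)$ over the strong promoters and $\ln F$ over the weak ones, which follows from Equation (7) and its counterpart for $I_k = 0$. The sketch below assumes zero-mean Gaussian noise with a known standard deviation and uses `math.erf` for the normal distribution function; the function names, the clipping constant `eps`, and the toy data are hypothetical.

```python
import math


def normal_cdf(x, sd):
    """F(x): distribution function of a zero-mean normal with std. dev. sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def log_likelihood(theta, phis, indicators, threshold, sd=1.0, eps=1e-12):
    """Sum ln(1 - F) over strong samples and ln F over weak samples."""
    total = 0.0
    for phi, ind in zip(phis, indicators):
        z = threshold - sum(p * t for p, t in zip(phi, theta))
        # Clip F away from 0 and 1 so that log() stays finite (numerical guard).
        cdf = min(max(normal_cdf(z, sd), eps), 1.0 - eps)
        total += math.log(1.0 - cdf) if ind == 1 else math.log(cdf)
    return total


# Toy regressors [1, omega] and labels (illustration only).
phis = [[1.0, 0.5], [1.0, -0.5]]
indicators = [1, 0]
ll_zero = log_likelihood([0.0, 0.0], phis, indicators, threshold=0.0)
ll_fit = log_likelihood([0.0, 1.0], phis, indicators, threshold=0.0)
print(ll_zero < ll_fit)  # True
```

A parameter vector that places the strong sample above the threshold and the weak sample below it yields a larger log-likelihood than the all-zero vector, as expected.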

The estimate $\hat{\bm{\theta}}$ is obtained by solving $\partial \ln L(\bm{\theta}) / \partial \bm{\theta} = 0$. Denoting by $f(\cdot)$ the probability density function of $d_k$, the gradient $\partial \ln L(\bm{\theta}) / \partial \bm{\theta}$ is given by
\begin{align*} \frac{\partial \ln L(\bm{\theta})}{\partial \bm{\theta}} = - \sum_{k: I_k = 0} \frac{f(C - \phi_k^T \bm{\theta})}{F(C - \phi_k^T \bm{\theta})} \phi_k + \sum_{k: I_k = 1} \frac{f(C - \phi_k^T \bm{\theta})}{1 - F(C - \phi_k^T \bm{\theta})} \phi_k \tag{9} \end{align*}
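Because the stationarity condition has no closed-form solution in general, one simple option is fixed-step gradient ascent on the log-likelihood. The sketch below is an assumption of ours, not a solver named in the text; it presumes Gaussian noise with a known standard deviation, and the function names, step size, iteration count, and toy data are hypothetical. No safeguard is included for the case where $F$ reaches 0 or 1.

```python
import math


def normal_pdf(x, sd):
    return math.exp(-x * x / (2.0 * sd * sd)) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, sd):
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def log_likelihood(theta, phis, indicators, threshold, sd=1.0):
    total = 0.0
    for phi, ind in zip(phis, indicators):
        z = threshold - sum(p * t for p, t in zip(phi, theta))
        cdf = normal_cdf(z, sd)
        total += math.log(1.0 - cdf) if ind == 1 else math.log(cdf)
    return total


def gradient(theta, phis, indicators, threshold, sd=1.0):
    """Gradient of the log-likelihood with respect to theta."""
    grad = [0.0] * len(theta)
    for phi, ind in zip(phis, indicators):
        z = threshold - sum(p * t for p, t in zip(phi, theta))
        f, cdf = normal_pdf(z, sd), normal_cdf(z, sd)
        # Strong samples contribute f/(1-F), weak samples contribute -f/F.
        coef = f / (1.0 - cdf) if ind == 1 else -f / cdf
        for j, p in enumerate(phi):
            grad[j] += coef * p
    return grad


def fit(phis, indicators, threshold, sd=1.0, step=0.1, n_iter=200):
    """Fixed-step gradient ascent starting from the zero vector."""
    theta = [0.0] * len(phis[0])
    for _ in range(n_iter):
        g = gradient(theta, phis, indicators, threshold, sd)
        theta = [t + step * gj for t, gj in zip(theta, g)]
    return theta


# Toy, non-separable data (illustration only).
phis = [[1.0, 0.5], [1.0, -0.5], [1.0, 1.0]]
indicators = [1, 0, 0]
theta_hat = fit(phis, indicators, threshold=0.0)
ll_start = log_likelihood([0.0, 0.0], phis, indicators, 0.0)
ll_end = log_likelihood(theta_hat, phis, indicators, 0.0)
print(ll_end > ll_start)  # True
```

When the two classes are perfectly separable in the regressors, the maximizer does not exist as a finite vector, so in practice a stopping rule or a regularization term would be needed.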

2.4. Ensemble approach

In this study, we assume that the strength of a promoter is the sum of the nucleotide weights. To establish such models, a good training set is needed. As stated above, the training set is divided into two classes by the threshold C. The class with more training samples is called the major class, and the class with fewer training samples is called the minor class. For promoter data, the number of training samples in the major class can differ significantly from that in the minor class; that is, the training samples are imbalanced across classes. In this case, an effective remedy is to construct a new training set on which to establish the prediction model. Boosting the minor class training set can make the samples from the two classes more balanced. Motivated by this, we first propose a method to construct a boosted training set.

The original training set T is divided into two subsets. The set of samples from the major class is denoted by $\overline{T} = \{\bar{t}_1, \bar{t}_2, \cdots, \bar{t}_{\bar{n}}\}$, and the set of samples from the minor class is denoted by $\underline{T} = \{\underline{t}_1, \underline{t}_2, \cdots, \underline{t}_{\underline{n}}\}$ $(\bar{n} > \underline{n})$. The testing set is denoted by $S = \{s_1, s_2, \cdots, s_{n_s}\}$.
First, we construct an auxiliary set $T_{Au} = \{\underline{t}_1, \underline{t}_2, \cdots, \underline{t}_{\underline{n}}, \underline{t}_1, \underline{t}_2, \cdots\}$, which is composed of the elements of $\underline{T}$. The number of elements of $T_{Au}$ is $\bar{n} - \underline{n}$.
In this study, we use an oversampling method to boost the minor class training set, and the boosted minor class training set is given by $\underline{T'} = \underline{T} \cup T'_{Au} = \{\underline{t}'_1, \underline{t}'_2, \cdots, \underline{t}'_{\underline{n'}}\}$, where $T'_{Au}$ is a subset of $T_{Au}$, namely $T'_{Au} \subseteq T_{Au}$.
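The construction of the auxiliary set and of a boosted minor class set can be sketched as follows. The auxiliary set is filled by cycling through the minor class samples until it holds as many elements as the difference between the two class sizes. The function names and the 0/1 selection list are hypothetical, and the union is treated as a multiset so that repeated samples are kept.

```python
from itertools import cycle, islice


def auxiliary_set(minor, n_major):
    """Repeat minor class samples in order until n_major - len(minor) are drawn."""
    n_extra = max(n_major - len(minor), 0)
    return list(islice(cycle(minor), n_extra))


def boosted_minor_set(minor, aux, indicator):
    """Minor class samples plus the auxiliary samples flagged by the indicator."""
    return list(minor) + [t for t, keep in zip(aux, indicator) if keep]


# Toy minor class with two sequences and a major class of size five.
minor = ["GGG", "CTA"]
aux = auxiliary_set(minor, n_major=5)
print(aux)  # ['GGG', 'CTA', 'GGG']
print(boosted_minor_set(minor, aux, [1, 0, 1]))  # ['GGG', 'CTA', 'GGG', 'GGG']
```

Selecting every auxiliary element equalizes the class sizes exactly, while selecting fewer leaves the minor class somewhat smaller.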
To avoid the effect of data imbalance, our goal is to select training samples from $T_{Au}$ such that the major class training set $\overline{T}$ and the boosted minor class training set $\underline{T'}$ are more balanced than before. Balance means not only that the numbers of training samples from the two classes are close, but also that the sequence information extracted from the major class training set and from the minor class training set is close. To achieve this goal, we require the similarity between the major class training samples and the testing samples to be close to the similarity between the boosted minor class training samples and the testing samples. The Hamming distance is a commonly used index for comparing two nucleotide sequences. Here, the pairwise index $D_{a,b}$ of two sequences $a$ and $b$ is taken as the number of positions at which they carry the same nucleotide, that is, the sequence length $M$ minus the number of differing positions counted by the Hamming distance.
Based on the Hamming distance, we define the similarity between the major class training set and the testing set as follows: \begin{align*} D_{\overline{T}, S} = \frac{1}{\bar{n}} \sum_{i=1}^{\bar{n}} \sum_{j=1}^{n_s} D_{\bar{t}_i, s_j} \tag{10} \end{align*}
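
As an illustration of Eq. (10), the following Python sketch computes the position-wise match count used here as the similarity index and averages the summed pairwise values over a training set. The function names `match_count` and `set_similarity` and the toy sequences are assumptions made for this example; they do not come from the original study.

```python
from typing import Sequence


def match_count(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences carry the same nucleotide."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


def set_similarity(train: Sequence[str], test: Sequence[str]) -> float:
    """Eq. (10): pairwise indices summed over the testing set, averaged over the training set."""
    if not train:
        raise ValueError("training set must not be empty")
    total = sum(match_count(t, s) for t in train for s in test)
    return total / len(train)


# Toy stand-ins (hypothetical) for the major class training set and the testing set S.
major = ["ACGT", "ACGA"]
testing = ["ACGT", "TCGT"]
print(set_similarity(major, testing))
```

For these toy sequences the four pairwise match counts are 4, 3, 3, and 2, so the call returns 12 / 2 = 6.0.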

Then, the construction of the optimal training set is converted to minimizing $|D_{\overline{T}, S} - D_{\underline{T'}, S}|$, where $D_{\underline{T'}, S}$ is defined for the boosted minor class training set in the same way as in Eq. (10). Population-based optimization approaches, such as particle swarm optimization (PSO) (Eberhart and Kennedy, 1995), can be used to solve this optimization problem. If PSO is used, a particle can be coded as a set of indicator variables $X = \{x_1, x_2, \ldots, x_{\bar{n} - \underline{n}}\}$. That is, the dimension of $X$ is the number of elements of $T_{Au}$.
If $x_j$ takes 1, the $j$th sample in $T_{Au}$ is included in the boosted minor class training set; if $x_j$ takes 0, the $j$th sample in $T_{Au}$ is excluded from the boosted minor class training set.
The fitness function is $|D_{\overline{T}, S} - D_{\underline{T'}, S}|$, and the particles in each iteration can be updated by the discrete PSO algorithm (Clerc, 2004). Algorithm 1 summarizes the procedure for obtaining the optimal training set, where the function MinimizingSimilarity_PSO selects samples from $T_{Au}$ such that $|D_{\overline{T}, S} - D_{\underline{T'}, S}|$ is minimized.
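
The selection problem described above can be sketched in Python as follows. The block encodes a particle as a 0/1 list over $T_{Au}$, evaluates the fitness $|D_{\overline{T}, S} - D_{\underline{T'}, S}|$, and searches with a sigmoid-based binary PSO. This binary variant is swapped in for brevity and is not the discrete PSO of Clerc (2004); the function names, swarm size, inertia weight, acceleration coefficients, and velocity clamp are all illustrative assumptions. The minor class set is assumed to be non-empty.

```python
import math
import random
from typing import List, Sequence, Tuple


def match_count(a: str, b: str) -> int:
    """Positions at which two equal-length sequences carry the same nucleotide."""
    return sum(x == y for x, y in zip(a, b))


def set_similarity(train: Sequence[str], test: Sequence[str]) -> float:
    """Summed pairwise match counts, averaged over the training set (as in Eq. (10))."""
    return sum(match_count(t, s) for t in train for s in test) / len(train)


def fitness(x: Sequence[int], t_au, major, minor, test) -> float:
    """Absolute gap between major-class and boosted-minor-class similarity for selection x."""
    boosted = list(minor) + [t for t, keep in zip(t_au, x) if keep]
    return abs(set_similarity(major, test) - set_similarity(boosted, test))


def select_by_binary_pso(t_au, major, minor, test,
                         n_particles: int = 10, n_iter: int = 30,
                         seed: int = 0) -> Tuple[List[int], float]:
    """Return the best indicator vector found and its fitness (hypothetical settings)."""
    rng = random.Random(seed)
    dim = len(t_au)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, t_au, major, minor, test) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                # Velocity update pulled toward the personal and global best bits.
                vel[k][d] = (0.7 * vel[k][d]
                             + 1.5 * rng.random() * (pbest[k][d] - pos[k][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[k][d]))
                vel[k][d] = max(-4.0, min(4.0, vel[k][d]))
                # The sigmoid turns the velocity into the probability of setting the bit to 1.
                prob = 1.0 / (1.0 + math.exp(-vel[k][d]))
                pos[k][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[k], t_au, major, minor, test)
            if f < pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[k][:], f
    return gbest, gbest_fit
```

A call such as `select_by_binary_pso(t_au, major, minor, testing)` returns the indicator list together with its fitness value; the samples of `t_au` whose indicator is 1 would then form $T'_{Au}$.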

Algorithm 1 Construction of a new training set
Input: The original training set T and testing set S
Output: The optimal training set $T'$
1: Divide the training set into two subsets $\overline{T}$ and $\underline{T}$
2: Construct an auxiliary minor class training set $T_{Au}$
3: $T'_{Au} =$ MinimizingSimilarity_PSO($T_{Au}$, $\overline{T}$, $\underline{T}$, $S$)
4: Compute the boosted minor class training set as $\underline{T'} = \underline{T} \cup T'_{Au}$
5: Compute the optimal training set as $T' = \overline{T} \cup \underline{T'}$
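
The steps of Algorithm 1 can be sketched in Python as below. The auxiliary set $T_{Au}$ is taken as an input because its construction is described elsewhere in the article. The selector is pluggable; the default `exhaustive_select` is a brute-force stand-in for MinimizingSimilarity_PSO that is only practical for very small auxiliary sets. All function names and the rule "the larger class is the major class" are assumptions made for this example.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def match_count(a: str, b: str) -> int:
    """Positions at which two equal-length sequences carry the same nucleotide."""
    return sum(x == y for x, y in zip(a, b))


def set_similarity(train: Sequence[str], test: Sequence[str]) -> float:
    """Summed pairwise match counts, averaged over the training set (as in Eq. (10))."""
    return sum(match_count(t, s) for t in train for s in test) / len(train)


def exhaustive_select(t_au, major, minor, test) -> List[str]:
    """Stand-in selector: try every indicator vector and keep the smallest gap."""
    target = set_similarity(major, test)

    def gap(x) -> float:
        boosted = list(minor) + [t for t, keep in zip(t_au, x) if keep]
        return abs(target - set_similarity(boosted, test))

    best = min(product((0, 1), repeat=len(t_au)), key=gap)
    return [t for t, keep in zip(t_au, best) if keep]


def build_training_set(samples: Sequence[str], labels: Sequence[int],
                       t_au: Sequence[str], test: Sequence[str],
                       select: Callable = exhaustive_select
                       ) -> Tuple[List[str], List[str]]:
    """Return (optimal training set, boosted minor class training set)."""
    # Step 1: split by class label; the larger class is treated as the major class.
    groups = {}
    for s, y in zip(samples, labels):
        groups.setdefault(y, []).append(s)
    if len(groups) != 2:
        raise ValueError("exactly two classes are expected")
    major, minor = sorted(groups.values(), key=len, reverse=True)
    # Step 2 (building t_au) is assumed to have been done by the caller.
    # Step 3: choose the auxiliary samples that balance the two similarities.
    selected = select(t_au, major, minor, test)
    # Step 4: boosted minor class training set.
    boosted_minor = minor + selected
    # Step 5: optimal training set.
    return major + boosted_minor, boosted_minor
```

Passing a PSO-based function with the same signature as `select` would replace the brute-force search without changing the remaining steps.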

Although the minor class training set is boosted, the number of training samples is still relatively small, which makes the prediction of the strong/weak feature of the promoters difficult. Motivated by this, we propose an ensemble learning approach to improve the prediction accuracy. The basic idea of ensemble learning is to integrate multiple models into a unified model to improve accuracy and robustness. That is, we do not directly use the optimal training set to train a single prediction model; instead, the training set is first divided into multiple subsets, and each subset produces a submodel. The prediction result is the integration of the results provided by these submodels. The key issue is how to allocate the training set into subsets. To the best of our knowledge, no allocation method has been proposed for promoter data. We propose an ensemble learning method based on the similarities between the optimal training samples and the testing set. The similarity between a training sample and the testing set is defined as \begin{align*} D_{t'_i, S} = \sum_{j=1}^{n_s} D_{t'_i, s_j}, \quad i = 1, 2, \cdots, n'_T \tag{12} \end{align*}

where $t'_i$ is an element of $T'$ and $n'_T$ is the number of elements of $T'$.
The training set can be sorted in descending order of this similarity, and the sorted samples are denoted by $\{t''_1, t''_2, \cdots, t''_{n'_T}\}$. The sorted training set is then divided into the following $n'_T - K + 1$ overlapping subsets: \begin{align*} & T^1: \{t''_1, t''_2, \cdots, t''_K\} \\ & T^2: \{t''_2, t''_3, \cdots, t''_{K+1}\} \\ & T^3: \{t''_3, t''_4, \cdots, t''_{K+2}\} \\ & \qquad \vdots \\ & T^{n'_T - K + 1}: \{t''_{n'_T - K + 1}, t''_{n'_T - K + 2}, \cdots, t''_{n'_T}\} \end{align*}

where $K$ denotes the number of samples in each subset.
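
The allocation above amounts to a sliding window of width $K$ over the sorted training set. A minimal Python sketch follows, assuming the similarity value of each training sample has already been computed and is passed in as `scores`; the function name `sliding_subsets` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_subsets(train: Sequence[T], scores: Sequence[float], k: int) -> List[List[T]]:
    """Sort samples by descending score and return the n - k + 1 windows of width k."""
    if len(train) != len(scores):
        raise ValueError("train and scores must have the same length")
    if not 1 <= k <= len(train):
        raise ValueError("k must lie between 1 and the number of training samples")
    # Stable sort: samples with equal scores keep their original relative order.
    order = sorted(range(len(train)), key=lambda i: scores[i], reverse=True)
    ordered = [train[i] for i in order]
    return [ordered[i:i + k] for i in range(len(ordered) - k + 1)]


# Four hypothetical samples with similarity values 2, 5, 1, and 4, and K = 2.
print(sliding_subsets(["a", "b", "c", "d"], [2, 5, 1, 4], 2))
```

With four samples and $K = 2$, the sketch yields $4 - 2 + 1 = 3$ subsets: `[['b', 'd'], ['d', 'a'], ['a', 'c']]`. Adjacent subsets share $K - 1$ samples.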

Based on the ensemble approach, we can establish $n = n'_T - K + 1$ submodels, denoted by $Z^1, Z^2, \cdots, Z^{n'_T - K + 1}$. For a testing sample $s_j$, each submodel $Z^i$ generates a prediction result, denoted by $y_j^{z_i}$, and the ensemble result is given by \begin{align*} I_j = \begin{cases} 1, & \text{if } \dfrac{1}{2} \left( \dfrac{\sum_{i=1}^{n'_T - K + 1} y_j^{z_i} I(y_j^{z_i})}{\sum_{i=1}^{n'_T - K + 1} I(y_j^{z_i})} + \dfrac{\sum_{i=1}^{n'_T - K + 1} y_j^{z_i} \left( 1 - I(y_j^{z_i}) \right)}{\sum_{i=1}^{n'_T - K + 1} \left( 1 - I(y_j^{z_i}) \right)} \right) \ge C \\ 0, & \text{otherwise} \end{cases} \tag{13} \end{align*}
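
The decision rule of Eq. (13) averages the mean prediction of the submodels on each side of the indicator and compares the result with $C$. The Python sketch below rests on two assumptions that this section does not state: the indicator is taken as $I(y) = 1$ when $y \ge C$ and 0 otherwise, and when all submodels fall on one side (so one denominator in Eq. (13) is zero) the plain mean of the predictions is used instead. The function name `ensemble_decision` is hypothetical.

```python
from typing import Sequence


def ensemble_decision(preds: Sequence[float], c: float) -> int:
    """Return 1 (strong) or 0 (weak) for one testing sample from its submodel predictions."""
    if not preds:
        raise ValueError("at least one submodel prediction is required")
    # Assumed indicator: a submodel votes "strong" when its prediction reaches the threshold.
    strong = [y for y in preds if y >= c]
    weak = [y for y in preds if y < c]
    if strong and weak:
        # Average of the two group means, as in Eq. (13).
        score = 0.5 * (sum(strong) / len(strong) + sum(weak) / len(weak))
    else:
        # Assumed fallback when one group is empty and Eq. (13) is undefined.
        score = sum(preds) / len(preds)
    return 1 if score >= c else 0


print(ensemble_decision([0.9, 0.8, 0.2], 0.55))
print(ensemble_decision([0.9, 0.8, 0.4], 0.55))
```

In the first call the group means are 0.85 and 0.2, their average 0.525 lies below 0.55, and the result is 0. In the second call the average is 0.625 and the result is 1. Because each group contributes its mean rather than its count, a single dissenting submodel carries as much weight as the whole opposing group.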

FIG. 1.

The procedure of the proposed algorithm.

Algorithm 2 Ensemble approach
Input: A testing sample $s_j$, the optimal training set $T'$, and $K$
Output: The predicted strong/weak feature of $s_j$
1: for $i = 1$ to $n'_T$ do
2: Calculate the Hamming distance $D_{t'_i, s_j}$
3: end for
4: Sort $D_{t'_i, s_j}$ in descending order, and obtain the ordered training set $t''_1, t''_2, \cdots, t''_{n'_T}$;
5: Divide the training set into $n = n'_T - K + 1$ subsets;
6: Establish $n$ submodels $(Z^1, Z^2, \cdots, Z^{n'_T - K + 1})$, one for each training subset, according to the method in Section 2.3;
7: Calculate the predicted results provided by each submodel;
8: Use Eq. (13) to determine the final result.

It has been demonstrated that ensemble approaches can improve prediction accuracy. According to the characteristics of the promoter data, we propose an ensemble strategy based on the similarity index. To the best of our knowledge, this is the first use of an ensemble approach for promoter strength recognition. In the proposed method, K is the only parameter of the algorithm, and its value is related to the size of the training set.

3. Results and Discussion

We consider the promoters in Rhodius and Gross (2010). The promoter strength is measured in vitro and in vivo [the measurement method can be found in Rhodius and Gross (2010)]. According to the results in Rhodius and Gross (2010), the strength ranges from −1.49 to 52.96 in vitro and from −0.01 to 0.97 in vivo. Without loss of generality, the strength ranges are mapped to [0, 1]. To test the method, we consider the performance of the proposed method for different values of the threshold C. The prediction performance is measured by sensitivity (Sn), specificity (Sp), and accuracy (Ac): \begin{align*} Sn = \frac{TP}{TP + FN} \tag{14} \end{align*} \begin{align*} Sp = \frac{TN}{TN + FP} \tag{15} \end{align*} \begin{align*} Ac = \frac{TP + TN}{TP + FN + TN + FP} \tag{16} \end{align*}

where TP and FN denote the numbers of strong promoters that are predicted as strong and weak, respectively, and TN and FP denote the numbers of weak promoters that are predicted as weak and strong, respectively. There are 60 promoters in the library; 40 promoters are randomly selected as the training set, and the remaining 20 promoters are used as the test set. Because the sample size is small, the experiment is repeated 50 times, and the mean values of Ac, Sn, and Sp are used as the performance indices.
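
For concreteness, Eqs. (14) to (16) and the random 40/20 split can be sketched in Python as follows. Strong promoters are coded as 1 and weak promoters as 0; this coding, the function names, and the seed handling are assumptions of the example, not details taken from the study.

```python
import random
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP) with strong promoters coded as 1 and weak as 0."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp, fn, tn, fp


def sn_sp_ac(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity, and accuracy as in Eqs. (14) to (16)."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("no samples to evaluate")
    # A class that is absent from the test set leaves its rate undefined (NaN).
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return sn, sp, (tp + tn) / total


def random_split(n_train: int = 40, n_test: int = 20, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the promoter indices and return (training indices, test indices)."""
    idx = list(range(n_train + n_test))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


# Five hypothetical test promoters: two strong, three weak.
print(sn_sp_ac(*confusion_counts([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])))
```

Repeating `random_split` with 50 different seeds and averaging the three indices would mirror the repeated experiment described above.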

The values of Sn and Sp are listed in Table 2, and the values of Ac are listed in Table 3. The main contribution of this article is an ensemble approach for promoter data. To show its effectiveness, the performance indices obtained by the same set-valued method without the ensemble approach are also listed in Tables 2 and 3. Compared with the method without the ensemble approach, the mean accuracy is improved by about 10% (Table 4). The proposed method is designed for imbalanced data. If the imbalance between the strong and weak promoters in the training set is not significant, the proposed method may not significantly improve the prediction performance.

Table 2.

The Values of Sn and Sp

	C = 0.85		C = 0.7		C = 0.55		C = 0.4		C = 0.25
	Sn, %	Sp, %	Sn, %	Sp, %	Sn, %	Sp, %	Sn, %	Sp, %	Sn, %	Sp, %
In vivo
Without ensemble approach	75	90	80	91	70	94	60	72	61	67
Ensemble approach	85	95	90	96	85	95	60	89	29	81
In vitro
Without ensemble approach	80	92	75	83	38	79	20	67	26	62
Ensemble approach	70	100	80	95	55	89	30	87	45	73

Table 3.

The Values of Ac

	C = 0.85	C = 0.7	C = 0.55	C = 0.4	C = 0.25
	Ac, %	Ac, %	Ac, %	Ac, %	Ac, %
In vivo
Without ensemble approach	88	89	91	68	62
Ensemble approach	94	95	93	84	66
In vitro
Without ensemble approach	90	79	73	58	55
Ensemble approach	97	92	85	79	66

Table 4.

The Improved Values of Ac by Ensemble Approach

	C = 0.85, %	C = 0.7, %	C = 0.55, %	C = 0.4, %	C = 0.25, %
In vivo	5.94	6.16	1.97	15.74	4.52
In vitro	7.34	13.15	12.33	20.78	11.15
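
The "about 10%" figure quoted for the mean accuracy improvement can be checked directly from the entries of Table 4. The short Python sketch below only averages the tabulated values; the variable names are chosen for this example.

```python
# Accuracy improvements from Table 4, in the order C = 0.85, 0.7, 0.55, 0.4, 0.25.
improvement = {
    "in vivo": [5.94, 6.16, 1.97, 15.74, 4.52],
    "in vitro": [7.34, 13.15, 12.33, 20.78, 11.15],
}

# Mean improvement for each measurement condition.
means = {name: sum(vals) / len(vals) for name, vals in improvement.items()}

# Mean over all ten entries of the table.
all_values = [v for vals in improvement.values() for v in vals]
overall = sum(all_values) / len(all_values)

for name, value in means.items():
    print(f"{name}: {value:.2f}")
print(f"overall: {overall:.2f}")
```

The in vivo entries average roughly 6.9 and the in vitro entries roughly 13.0, which gives an overall mean of about 9.9.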

For most classifiers, the receiver-operating characteristic (ROC) curve is widely used to evaluate the prediction performance. However, we use the set-valued model to describe the relation between the promoter sequence and the strength, and the threshold value C is predefined based on the rough measurement of the promoter strength. Hence, no ROC curve can be plotted in our case. Although other works have predicted the strength values of promoters, to the best of our knowledge, this is the first work to predict the strong/weak feature of promoters. Our work thus complements existing studies on promoter strength prediction.

4. Conclusions

In this article, we considered the recognition of the strong/weak feature of promoters using only rough promoter strength information, which complements existing studies of promoter strength. The nucleotide information is extracted using the PWM method. Considering the characteristics of the promoter data, an optimal training set is established using the proposed boosting approach, and an ensemble approach is proposed to establish the prediction model. Some of the approaches employed in this study are applied to promoter data for the first time. The proposed method provides a simple way for biologists to evaluate the strong/weak feature of promoters from the nucleotide sequence without the exact promoter strength values.

Acknowledgment

This work was supported by China Postdoctoral Science Foundation (2017M610894).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Boyle, P.M., and Silver, P.A. 2012. Parts plus pipes: Synthetic biology approaches to metabolic engineering. Metab. Eng. 14, 223.

Chen, Zhao, and Ljung. 2012. Impulse response estimation with binary measurements: A regularized FIR model approach. IFAC Proceedings Volumes. 45, 113–118. (16th IFAC Symposium on System Identification, Brussels, Belgium.)

Clerc. 2004. Discrete Particle Swarm Optimization, Illustrated by the Traveling Salesman Problem. Springer, Berlin, Heidelberg.

Eberhart, and Kennedy. 1995. A new optimizer using particle swarm theory. In: 6th International Symposium on Micro Machine and Human Science. IEEE, 39–43.

Hammer, Mijakovic, and Jensen, P.R. 2006. Synthetic promoter libraries–tuning of gene expression. Trends Biotechnol. 24, 53–55.

[…], and Garcia, E.A. 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284.

Keasling, J.D. 2012. Synthetic biology and the development of tools for metabolic engineering. Synthetic Biology: New Methodologies and Applications for Metabolic Engineering. Metab. Eng. 14, 189–195.

Kiryu, Oshima, and Asai. 2005. Extracting relations between promoter sequences and their strengths from microarray data. Bioinformatics. 21, 1062–1068.

[…], Lv, Li, et al. 2017. Sequence comparison and essential gene identification with new inter-nucleotide distance sequences. J. Theor. Biol. 418, 84–93.

Meng, Wang, Xiong, et al. 2013. Quantitative design of regulatory elements based on high-precision strength prediction using artificial neural network. PLoS One. 8, e60288.

Mulligan, M.E., and McClure, W.R. 1986. Analysis of the occurrence of promoter-sites in DNA. Nucleic Acids Res. 14, 109–126.

Rhodius, V.A., and Gross, C.A. 2010. Predicting strength and function for promoters of the Escherichia coli alternative sigma factor, σE. Proc. Natl. Acad. Sci. U S A. 107, 2854–2859.

Roth, F.P., Hughes, J.D., Estep, P.W., et al. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Stormo, G.D. 1990. Consensus patterns in DNA. Methods Enzymol. 183, 211–221.

Stormo, G.D. 2000. DNA binding sites: Representation and discovery. Bioinformatics. 16, 16–23.

Straney, Krah, and Menzel. 1994. Mutations in the −10 TATAAT sequence of the gyrA promoter affect both promoter strength and sensitivity to DNA supercoiling. J. Bacteriol. 176, 5999–6006.

Wang, L.Y., Yin, G.G., Zhang, J.F., et al. 2010. System Identification with Quantized Observations. Springer Basel AG, xviii + 317, Boston, MA, USA.

Yang, and Ramsey, S.A. 2015. A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites. Bioinformatics. 31, 3445–3450.

[…], Huo, Chen, et al. 2015. An efficient algorithm for discovering motifs in large DNA data sets. IEEE Trans. Nanobiosci. 14, 535–544.