Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic $D_2$, which counts the number of matching $k$-tuples between two sequences, as well as $D^*_2$, which uses centralized counts, and $D^S_2$, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically $D^S_2$ has the largest power, followed by $D^*_2$, whereas the power of $D_2$ can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations $D^*_2$ generally has the largest power. Under the first alternative model of a shared motif, the power of $D^*_2$ approaches 100% when sufficiently many motifs are shared, and we recommend the use of $D^*_2$ for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that $D^*_2$ is generally more powerful than $D_2$. The program to calculate the power of $D_2$, $D^*_2$, and $D^S_2$ can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
1. Introduction
Alignment-free sequence comparisons have received extensive attention recently (Burden et al., 2006; Forêt et al., 2006, 2009a,b; Ivan et al., 2008; Kantorovitz et al., 2007a,b). One widely used statistic for alignment-free sequence comparison is the $D_2$ statistic, which counts the number of matching $k$-tuples (also referred to as $k$-words or $k$-grams) between the two sequences. Throughout this paper, we use tuples and words interchangeably. It was pointed out in Lippert et al. (2002) that $D_2$ is not appropriate for the comparison of two sequences because it is dominated by the deviation of the word counts from the corresponding expectations in each sequence. In Reinert et al. (2009), two new variants of the $D_2$ word count statistic, referred to as $D^*_2$ and $D^S_2$, were proposed. The statistic $D^*_2$ is based on centered counts, divided by the square root of their means, whereas $D^S_2$ is a self-standardized statistic. More specifically, let $X_{\bf w}$ and $Y_{\bf w}$ be the numbers of occurrences of word ${\bf w}$ in the first and the second sequences, respectively. The $D_2$ statistic is defined as
\begin{align*}
D_2 \equiv \sum_{{\bf w} \in {\cal A}^k} X_{\bf w} Y_{\bf w}.
\end{align*}
To define $D^*_2$ and $D^S_2$ as in [9], we first introduce the centralized count variables by
\begin{align*}
{\tilde X}_{\bf w} = X_{\bf w} - n p_{\bf w} \ {\rm and} \ {\tilde Y}_{\bf w} = Y_{\bf w} - n p_{\bf w},
\end{align*}
where $p_{\bf w}$ is the probability of word ${\bf w}$ under the null model. Then we put
\begin{align*}
D_2^S = \sum_{{\bf w} \in {\cal A}^k} \frac{{\tilde X}_{\bf w} {\tilde Y}_{\bf w}}{\sqrt{{\tilde X}_{\bf w}^2 + {\tilde Y}_{\bf w}^2}}, \ {\rm and} \ D_2^{*} = \sum_{{\bf w} \in {\cal A}^k} \frac{{\tilde X}_{\bf w} {\tilde Y}_{\bf w}}{n p_{\bf w}}.
\end{align*}
Here we set $\frac{0}{0} = 0$.
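The three definitions above can be made concrete with a short computation. The following Python sketch is an illustration only, not the authors' program: the function names `count_kmers` and `d2_statistics` are hypothetical, the null model is taken to be IID letters so that $p_{\bf w}$ is the product of the letter probabilities, every letter probability is assumed to be positive, both sequences are assumed to have the same length, and $n$ is taken to be the number of $k$-tuples in a sequence (sequence length minus $k$ plus 1).

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def count_kmers(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(seq_x, seq_y, k, letter_probs):
    """Return (D2, D2*, D2S) for two equal-length sequences.

    letter_probs maps each letter to its (positive) probability under the
    IID null model; this interface is an assumption made for illustration.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("this sketch assumes sequences of equal length")
    n = len(seq_x) - k + 1  # number of k-tuples per sequence (assumption)
    x_counts = count_kmers(seq_x, k)
    y_counts = count_kmers(seq_y, k)

    d2 = d2_star = d2_s = 0.0
    # Sum over every word w in A^k, including words that never occur.
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[a] for a in letters)
        x_w, y_w = x_counts[word], y_counts[word]
        x_tilde = x_w - n * p_w  # centralized counts
        y_tilde = y_w - n * p_w
        d2 += x_w * y_w
        d2_star += x_tilde * y_tilde / (n * p_w)
        denom = sqrt(x_tilde ** 2 + y_tilde ** 2)
        if denom > 0:  # convention 0/0 = 0
            d2_s += x_tilde * y_tilde / denom
    return d2, d2_star, d2_s


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2_statistics("ACGTACGTAA", "TTACGTACGA", 2, uniform))
```

The loop runs over all $|{\cal A}|^k$ words, which is only practical for small $k$; words absent from both sequences still contribute to $D_2^*$ and $D_2^S$ because their centralized counts are $-n p_{\bf w}$ rather than zero.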
The power of these statistics under two alternative models was explored via simulation approaches. The first alternative model is that the two sequences contain random instances of a common motif, whereas the second alternative model is a pattern transfer model, where randomly chosen DNA segments in the first sequence are used to replace corresponding segments in the second sequence.
It has been shown that, under the first alternative model, the power of both $D_2^*$ and $D^S_2$ is an increasing function of the sequence length for any tuple size $k \geq 2$, while the power of $D_2$ does not necessarily increase with sequence length and can sometimes even be smaller than the pre-specified type I error. In almost all the simulations considered, the power of $D_2^*$ is higher than that of $D^S_2$. Under the second alternative model, the power of both $D_2^*$ and $D^S_2$ quickly reaches a plateau and does not seem to change with sequence length. The power of $D_2$ can decrease with sequence length in some examples.
Simulation studies can only explore very limited ranges of parameter values to compare the power of detecting the relationship between two sequences or genomes. To compare the performance of the different statistics under a broad range of evolutionary scenarios, theoretical studies of the power of these statistics are needed. In addition, it would be very useful to have an easy-to-use program for calculating the power of sequence comparisons using the various statistics without resorting to time-consuming simulations. In this article, we achieve the following objectives: (1) to study the limiting distributions of $D_2$, $D^*_2$, and $D^S_2$ under the two alternative models; (2) to compare the theoretical approximate mean, variance, and power of $D_2$, $D_2^*$, and $D_2^S$ with the corresponding simulated values (we show that the approximations are reliable for $D_2$ and $D_2^*$; however, for the approximations of $D_2^S$ to be reasonable, very long sequences are usually needed); and (3) to develop a program to calculate the power of detecting the relationship between two sequences using $D_2$, $D_2^*$, and $D_2^S$. As our calculations are based on approximations, we note that the power in this article is approximate. For easier exposition we omit the word “approximate”; any power is understood to be approximate.
The organization of the article is as follows. In Section 2, we give details of alternative model I and show that the distributions of $D_2/n^2$, $D^*_2/n$, and $D^S_2/n$ converge to normal distributions as the sequence length tends to infinity. Formulas for the approximate mean and variance of $D_2/n^2$, $D^*_2/n$, and $D^S_2/n$ are presented and used to calculate the power of $D_2$, $D^*_2$, and $D^S_2$. In Section 3, we give details of alternative model II and develop a new hidden Markov model (HMM) for generating pairs of sequences related through alternative model II. The approximate distributions of $D_2$, $D^*_2$, and $D^S_2$ under alternative model II are then derived. These approximate distributions are not normal and are complicated. We show that the power of $D_2$, $D^*_2$, and $D^S_2$ converges rapidly and does not change much as the sequence length $n$ increases, a phenomenon observed in the simulation studies of Reinert et al. (2009). Under the second model, we do not have an efficient method for calculating the mean and variance of $D_2^S$, but we are able to present methods for calculating the approximate mean of $D_2$ and $D_2^*$. In Section 4, we first describe a web-based program and an R package for calculating the power of $D_2$, $D_2^*$, and $D_2^S$ to detect the relationship between two sequences under alternative model I. We then evaluate the program by comparing the theoretical mean, variance, and power derived in this study with the corresponding simulated quantities presented in Reinert et al. (2009) and show that the approximate mean and variance are generally close to their corresponding true values when the sequence length is very large. We find that convergence for $D_2^S$ is considerably slower than for $D_2^*$ and for $D_2$. This also affects the power of the statistic: the power approximation for $D_2^S$ is poor in the parameter regimes we considered. Hence, we concentrate on $D_2$ and $D_2^*$ for the remainder of the article. Moreover, $D_2$ has zero power under some models, and hence cannot be used to infer the relationship between sequences under such models. For $D_2$ and $D_2^*$, the program developed in this study can be readily used to study the power of comparing sequences using $k$-tuples. We then extend our study to 323 transcription factor (TF) binding motifs and show the superiority of $D_2^*$ over $D_2$ for sequence comparison for general motif patterns, although there are a few exceptions where $D_2$ is more powerful than $D_2^*$. For alternative model II, we study how the means of $D_2$ and $D_2^*$ change with the word length $k$ in order to explain the observation that the power of $D_2^*$ using $k = 10$ is much higher than the power using $k = 5$ in the simulation studies reported in Reinert et al. (2009). The article concludes with some discussion and potential extensions to more general background sequence models.
The results regarding the approximate distributions of $D_2$, $D_2^*$, and $D_2^S$ and the power of detecting the relationships between the sequences using these statistics can be easily extended to sequence pairs with different background letter frequencies, sequence lengths, and motif densities. However, the notation and presentation will be more complicated. For notational simplicity and clarity of presentation, we present the results for two sequences having the same background probability distribution, sequence length, and motif density. The results for the general situations are given in the Appendix. As the proofs are very similar to the ones presented in the article, they are omitted.
2. Alternative Model I
2.1. The model and the count statistics
The alternative model I renders the two sequences dependent through a common motif which is randomly distributed across the two sequences. As in Reinert et al. (2009), we model the background sequence as independent identically distributed (IID) random variables taking different letters from finite alphabet \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A}$$
\end{document} with probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p_a (a\in {\cal A})$$
\end{document}. For notational convenience, we also denote \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p_a^{(0)} = p_a$$
\end{document}. For nucleotide sequences, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A} = \{A , C , G , T \}$$
\end{document} and for amino acid sequences, the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A}$$
\end{document} is the set of 20 amino acids. In general, we assume that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A}$$
\end{document} contains L letters and write \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A} = \{0 , 1 , 2 , \cdots , L - 1 \}$$
\end{document}. For the motif instances, we use the model in Zhai et al. (2010), which is more general than the model used in Reinert et al. (2009), where fixed motifs were used. In this article and in Zhai et al. (2010), a position weight matrix (PWM) is used to describe the distribution of the nucleotides at the different positions of a motif (Stormo, 2000). For a given motif of length M, and at the m-th position of the motif, the probability that the base takes value a from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A}$$
\end{document} is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p_a^{(m )} , m = 1 , 2 , \cdots , M$$
\end{document}. The motif instances are randomly distributed across the sequence with density 1 − λ (0 < λ < 1). That is, at each position in the sequence that is not already covered by a motif instance, a base is generated from the background distribution with probability λ, and an instance of the motif of length M is generated from the PWM with probability 1 − λ. Once a motif instance has been generated, we move to the position following the end of the instance and repeat the process.
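The generation process described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `simulate_motif_sequence`, its arguments, and the truncation of a motif instance that overruns the end of the sequence are assumptions made here, not part of Zhai et al. (2010). The sketch also starts at a fresh (uncovered) position rather than in the stationary distribution.

```python
import random


def simulate_motif_sequence(length, lam, background, pwm, rng=None):
    """Simulate one sequence over the alphabet {0, ..., L-1}.

    lam        -- probability of emitting a single background letter
    background -- list of L background letter probabilities
    pwm        -- list of M rows; row m holds the L letter probabilities
                  at motif position m + 1
    """
    rng = rng or random.Random()
    letters = list(range(len(background)))
    seq = []
    while len(seq) < length:
        if rng.random() < lam:
            # Background state: emit one letter from the background distribution.
            seq.append(rng.choices(letters, weights=background)[0])
        else:
            # Motif instance: emit M letters, one per PWM row, then continue
            # from the position after the instance.
            for row in pwm:
                seq.append(rng.choices(letters, weights=row)[0])
    # Assumption: an instance running past the end is simply cut off.
    return seq[:length]


if __name__ == "__main__":
    bg = [0.3, 0.2, 0.2, 0.3]
    motif = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
    print(simulate_motif_sequence(30, 0.9, bg, motif, random.Random(1)))
```

Passing an explicit `random.Random` instance keeps simulations reproducible when comparing statistics across runs.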
For more details on the model, see Zhai et al. (2010). The sequences with random motif instances are modeled by an HMM (Rabiner, 1989). The underlying Markov chain (MC) of each sequence is denoted by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Q_1Q_2 \cdots Q_i \cdots Q_{n + k - 1}$$
\end{document} (i is the position index in the sequence of length n + k − 1), which takes values in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{0 , 1 , 2 , \cdots , M \}$$
\end{document}. State 0 indicates that the position belongs to the background, while state m (1 ≤ m ≤ M) indicates the m-th position of a motif instance. Under each state, the emission probability of each letter from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal A}$$
\end{document} is denoted as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p^{(m)}_a \ (a \in{\cal A} \ {\rm and} \ m = 0 , 1 , 2 , \cdots , M)$$
\end{document}. The transition matrix for the underlying MC \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Q_1Q_2 \cdots Q_i \cdots Q_{n + k - 1}$$
\end{document} is given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$T = (t_{mm^{\prime}}) _{(M + 1) \times (M + 1)}$$
\end{document}, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$t_{00} = t_{M0} = \lambda , \ t_{01} = t_{M1} = 1 - \lambda , \ t_{m , m + 1} = 1 , \ m = 1 , 2 , \cdots , M - 1$$
\end{document}, and all other entries of T are 0. The MC has the stationary distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\pi = \frac {1} {\lambda + M (1 - \lambda)} (\lambda , 1 -
\lambda , 1 - \lambda , \cdots , 1 - \lambda)$$
\end{document} (Zhai et al., 2010). Therefore, in stationarity, the expected fraction of the sequence that is covered by the motif instances is M(1 − λ)/(λ + M(1 − λ)). Unless λ is close to 1, the expected fraction of the sequence covered by inserted motif instances can be unrealistically large (Table S1; for Supplementary Material, see www.liebertonline.com/cmb). Hence we only study values of λ which are no smaller than 0.9.
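The stationary distribution and the expected motif coverage can be checked numerically. The sketch below is an illustration under the transition probabilities stated above; the function names are hypothetical and chosen here only for readability.

```python
def transition_matrix(lam, M):
    """Build the (M + 1) x (M + 1) transition matrix T of the hidden chain."""
    T = [[0.0] * (M + 1) for _ in range(M + 1)]
    T[0][0] = lam            # background -> background
    T[0][1] = 1.0 - lam      # background -> first motif position
    for m in range(1, M):
        T[m][m + 1] = 1.0    # deterministic walk through the motif
    T[M][0] = lam            # last motif position -> background
    T[M][1] = 1.0 - lam      # last motif position -> new motif instance
    return T


def stationary_distribution(lam, M):
    """Return pi = (lam, 1 - lam, ..., 1 - lam) / (lam + M (1 - lam))."""
    z = lam + M * (1.0 - lam)
    return [lam / z] + [(1.0 - lam) / z] * M


def motif_coverage(lam, M):
    """Expected fraction of positions covered by motif instances."""
    return M * (1.0 - lam) / (lam + M * (1.0 - lam))


if __name__ == "__main__":
    for lam in (0.5, 0.9, 0.99):
        print(lam, motif_coverage(lam, 5))
```

Evaluating `motif_coverage` for a few values of λ shows how quickly the covered fraction grows as λ moves away from 1.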
Now we consider two sequences of length n + k − 1 generated by the above HMM, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\bf A} = A_1A_2 \cdots A_{n + k - 1}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\bf B} = B_1B_2 \cdots B_{n + k - 1}$$
\end{document}. We let the sequence length be n + k − 1 for notational simplicity in the remainder of the paper. Given a k-tuple \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\bf w} = (w_1 , w_2 , \ldots , w_k) \in {\cal A}^k$$
\end{document}, let Xw and Yw be the numbers of occurrences of w within A and B, respectively; within each sequence, the occurrences could overlap. Assume that the Markov process starts in the stationary distribution. Based on Proposition 2.2 in Zhai et al. (2010), the means of Xw(n) and Yw(n) can be calculated as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{\mathbb E}_{\lambda}X_{\bf w} = {\mathbb E}_{\lambda}Y_{\bf w} =
n P_{\lambda} ({\bf w}),
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$P_{\lambda}{({\bf w})} = \sum\nolimits_{m = 0}^{M}
\alpha_{k}^{({\bf w})} (m)$$
\end{document} is the probability of the word w under the alternative model I. The \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\alpha_i^{({\bf w})} (m) = P (A_j = w_j , j = 1 , 2 , \cdots , i;
Q_i = m) , i = 1 , 2 , \cdots , k$$
\end{document}, are calculated recursively using the standard forward procedure for computing the probability of an observation sequence under an HMM (Zhai et al., 2010; Rabiner, 1989), for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$i = 1 , 2 , \cdots , k - 1$$
\end{document}:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\alpha_{i + 1}^{({\bf w})} (0) & = (\alpha_i^{({\bf w})} (0) +
\alpha_i^{({\bf w})} (M)) \lambda p^{(0)}_{w_{i + 1}} ,\\
\alpha_{i + 1}^{({\bf w})} (1) & = (\alpha_i^{({\bf w})} (0) +
\alpha_i^{({\bf w})} (M)) (1 - \lambda) p^{(1)}_{w_{i + 1}} , \\
\alpha_{i + 1}^{({\bf w})} (m) & = \alpha_i^{({\bf w})} (m - 1)
p^{(m)}_{w_{i + 1}} , \qquad (m = 2 , 3 , \ldots , M),
\end{align*}
\end{document}
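The forward recursion above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `word_probability` is hypothetical, letters are assumed to be coded as integers 0, ..., L − 1, and the initial values are assumed to be the stationary probability of each state times its emission probability for the first letter, in line with the assumption that the chain starts in stationarity.

```python
def word_probability(word, lam, background, pwm):
    """Probability of `word` under alternative model I via the forward procedure.

    word       -- sequence of integer letters (w_1, ..., w_k)
    lam        -- probability of a background step
    background -- emission probabilities of state 0
    pwm        -- list of M rows; row m - 1 holds the emissions of state m
    """
    M = len(pwm)  # assumes M >= 1
    z = lam + M * (1.0 - lam)
    pi = [lam / z] + [(1.0 - lam) / z] * M
    emit = [background] + list(pwm)  # emit[m][a] = p_a^{(m)}

    # Assumed initialisation: alpha_1(m) = pi_m * p^{(m)}_{w_1}.
    alpha = [pi[m] * emit[m][word[0]] for m in range(M + 1)]

    for letter in word[1:]:
        restart = alpha[0] + alpha[M]  # mass that may restart (states 0 and M)
        new = [0.0] * (M + 1)
        new[0] = restart * lam * emit[0][letter]
        new[1] = restart * (1.0 - lam) * emit[1][letter]
        for m in range(2, M + 1):
            new[m] = alpha[m - 1] * emit[m][letter]
        alpha = new

    # P_lambda(w) is the sum of alpha_k(m) over all states m.
    return sum(alpha)


if __name__ == "__main__":
    bg = [0.3, 0.2, 0.2, 0.3]
    motif = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
    print(word_probability((0, 1, 2), 0.9, bg, motif))
```

Two sanity checks are useful: the probabilities of all words of a fixed length should sum to 1, and for λ = 1 the result should reduce to the product of background probabilities.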
2.2. The expectations of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^*_2$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^S_2$$
\end{document} under alternative model I
It is easy to see that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\mathbb E}_{\lambda} ({\tilde X}_{\bf w}) = n (P_ \lambda ({\bf w}) -
p_{\bf w})$$
\end{document}, where Pλ(w) is the probability of word w under the alternative model I. However, for the mean of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {{\tilde X} _ {\bf w} {\tilde Y} _ {\bf w}} {\sqrt {{\tilde
X} _ {\bf w} ^2 + {\tilde Y} _ {\bf w} ^2}}$$
\end{document}, it is in general only known that it is non-negative, and when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\tilde X}_{\bf w}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\tilde Y}_{\bf w}$$
\end{document} are IID, the mean is zero if and only if the distribution of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\tilde X}_{\bf w}$$
\end{document} is symmetric (Novak, 2007). Note that the two sequences A and B are independent under the alternative model I. Then, we have the following theorem.
Theorem 2.1
Assume alternative model I for the two sequences A and B, and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$P_{\lambda} ({\bf w}) =P (A_1 A_2 \cdots A_k = w_1 w_2 \cdots w_k)$$
\end{document} be as calculated in Subsection 2.1. Then for the expectations of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document}, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{\mathbb E} (D_2) & = n^2 \sum_ {{\bf w} \in {\cal A} ^k} (P_
{\lambda} ({\bf w})) ^2 , \\
{\mathbb E} (D_2^ {*}) & = n \sum_ {{\bf w} \in {\cal A} ^k}
\frac {(P_ {\lambda} ({\bf w}) - p_ {\bf w}) ^2} {p_ {\bf w}} ,\\
\hbox {and} \lim_ {n \rightarrow \infty} \frac {{\mathbb E} (D_2^
{S})} {n} & = \frac {1} {\sqrt 2} \sum_ {{\bf w} \in {\cal A}
^k} \mid P_ {\lambda} ({\bf w}) - p_ {\bf w} \mid .
\end{align*}
\end{document}
The first two equations follow directly from the independence of the two sequences. The limit expression in the third can be proven by Taylor expansion (the delta method); see the proof of Theorem 2.4 for details.
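Given the word probabilities under the alternative model and under the null model, the three quantities in Theorem 2.1 are simple sums. The sketch below is illustrative only: the function name and the representation of the word probabilities as two aligned lists (one entry per word w) are assumptions made here.

```python
import math


def theorem_2_1_means(n, p_alt, p_null):
    """Evaluate the expressions of Theorem 2.1.

    n      -- number of k-tuple positions in each sequence
    p_alt  -- list of P_lambda(w), one entry per word w
    p_null -- list of p_w in the same word order (all entries > 0)

    Returns (E(D2), E(D2*), limit of E(D2S) / n).
    """
    e_d2 = n ** 2 * sum(P * P for P in p_alt)
    e_d2_star = n * sum((P - p) ** 2 / p for P, p in zip(p_alt, p_null))
    d2s_rate = sum(abs(P - p) for P, p in zip(p_alt, p_null)) / math.sqrt(2)
    return e_d2, e_d2_star, d2s_rate


if __name__ == "__main__":
    print(theorem_2_1_means(1000, [0.5, 0.5], [0.25, 0.75]))
```

When the two probability lists coincide (no motif signal), the second and third values are zero, which matches the centering used in the definitions of the two statistics.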
2.3. The approximate distributions of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^S_2$$
\end{document} under alternative model I
The variances of D2 and its variants are complicated. Under the null model of IID sequences, upper and lower bounds for the variance of D2 were first explored in Lippert et al. (2002). In Kantorovitz et al. (2007b), an explicit formula for the variance of D2 is given in the IID case. To study the power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} in detecting the relationship between two sequences, we explore the approximate distributions of these statistics as the sequence length goes to infinity. Note that the distributions of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} under the null model when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$(p^{(0)}_A , p^{(0)}_C , p^{(0)}_G , p^{(0)}_T) = (1 / 4 , 1 / 4 ,
1 / 4 , 1 / 4)$$
\end{document} have been carefully studied in Reinert et al. (2009). Therefore, in the rest of the article, we assume that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$(p^{(0)}_A , p^{(0)}_C , p^{(0)}_G , p^{(0 )}_T)\neq (1 / 4 , 1 /
4 , 1 / 4 , 1 / 4)$$
\end{document}.
The asymptotic variances and covariances of the word counts, σλ²(w) and σλ(w, w′) in (1), can be calculated using the method in Zhai et al. (2010), Proposition 2.3; for λ = 1, the corresponding values can be found, for example, in Reinert et al. (2009), Corollary 6.1. We denote the asymptotic variance of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\sum\nolimits_{{\bf w} \in {\cal A}^k} P_{\lambda} ({\bf w})X_{\bf
w} / {\sqrt n}$$
\end{document} in one sequence by
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
(\Sigma_ \lambda)^2 = \sum_{{\bf w} \in {\cal A}^k} P^2_{\lambda}
({\bf w})\sigma_{\lambda}^2 ({\bf w})+ \sum_{{\bf w} \neq {\bf
w}^{\prime}}P_{\lambda} ({\bf w})P_{\lambda} ({\bf
w}^{\prime})\sigma_{\lambda} ({\bf w} , {\bf w}^{\prime}). \tag {2}
\end{align*}
\end{document}
The following theorem gives the approximate distributions of D2 under the null and the alternative model I.
Theorem 2.2
Assume that in the background model not all letters are equally likely.
a. [Lippert et al. (2002), Theorem 4.2.] Suppose λ = 1 (the null model that the sequences are IID). Then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} {\sqrt n} \left(\frac {D_2} {n^2} -
\sum_ {{\bf w} \in {\cal A} ^k} p_ {\bf w} ^2 \right) = Z_1 ,
\end{align*}
\end{document}
where Z1 has normal distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal N} (0 , 2 (\Sigma_1)^2)$$
\end{document}. Here the asymptotic is valid when the sequence length tends to infinity with alphabet size, motif length, and word length kept fixed.
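b. Suppose 0 < λ < 1 (the alternative model I). Then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} {\sqrt n} \left(\frac {D_2} {n^2} -
\sum_ {{\bf w} \in {\cal A} ^k} P_ {\lambda} ^2 ({\bf w}) \right) = Z_ \lambda ,
\end{align*}
\end{document}
where Zλ has normal distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}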
\begin{document}
$${\cal N} (0 , 2 (\Sigma_ \lambda)^2)$$
\end{document}. Here the asymptotic is valid when the sequence length tends to infinity with alphabet size, motif length, and word length kept fixed.
On the other hand, under the null model that no motif instances are inserted, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} is approximately the sum of products of dependent mean 0 normal random variables (and thus not normal). However, it is approximately normally distributed when the sequence length is large under the alternative model I, as long as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {(P_ \lambda ({\bf w})- p_ {\bf w} )} {p_ {\bf w}}$$
\end{document} is not constant in w, as the following theorem shows. We put
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
(\Sigma^*_ \lambda)^2 = \sum_ {{\bf w} \in {\cal A} ^k} \frac {(P_
{\lambda} ({\bf w})- p_ {\bf w})^2} {p^2_ {\bf w}} \sigma_
{\lambda} ^2 ({\bf w})+ \sum_ {{\bf w} \neq {\bf w} ^ {\prime}}
\frac {(P_ {\lambda} ({\bf w})- p_ {\bf w})(P_ {\lambda} ({\bf w}
^ {\prime})- p_ {{\bf w} ^ {\prime}} )} {p_ {\bf w} p_ {{\bf w} ^
{\prime}}} \sigma_ {\lambda} ({\bf w} , {\bf w} ^ {\prime}), \tag {3}
\end{align*}
\end{document}
with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\sigma_{\lambda}^2 ({\bf w})$$
\end{document} and σλ(w, w′) given in (1).
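Both (2) and (3) are quadratic forms in the asymptotic covariance matrix of the word counts, with different coefficient vectors. The following sketch makes this explicit. It is illustrative only: the function names are hypothetical, the covariance matrix `cov` (diagonal entries σλ²(w), off-diagonal entries σλ(w, w′)) is assumed to be supplied by the caller, and the sums over w ≠ w′ are assumed to run over ordered pairs.

```python
def quadratic_form(coeffs, cov):
    """Return sum_i sum_j coeffs[i] * cov[i][j] * coeffs[j]."""
    size = len(coeffs)
    return sum(
        coeffs[i] * cov[i][j] * coeffs[j]
        for i in range(size)
        for j in range(size)
    )


def sigma_sq(p_alt, cov):
    """Equation (2): coefficients are P_lambda(w)."""
    return quadratic_form(p_alt, cov)


def sigma_star_sq(p_alt, p_null, cov):
    """Equation (3): coefficients are (P_lambda(w) - p_w) / p_w."""
    coeffs = [(P - p) / p for P, p in zip(p_alt, p_null)]
    return quadratic_form(coeffs, cov)


if __name__ == "__main__":
    cov = [[2.0, 1.0], [1.0, 3.0]]  # toy covariance matrix, not from the paper
    print(sigma_sq([0.5, 0.5], cov), sigma_star_sq([0.5, 0.5], [0.25, 0.75], cov))
```

Writing both variances through one helper makes it clear that the statistics differ only in how each word is weighted.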
Theorem 2.3
a. Suppose λ = 1 (the null model that the sequences are IID). Then, in distribution, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} D^*_2 = Z^*_1 = \sum_ {{\bf w} \in
{\cal A} ^k} \frac {Z^ {(1 )} _ {\bf w} Z^ {(2 )} _ {\bf w}} {p_
{\bf w}} ,
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(1 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(2 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} are independent and have the same mean 0 normal distribution (with non-trivial covariance matrix).
where sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and sign(0) = 0; again \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\sigma_{\lambda}^2 ({\bf w})$$
\end{document} and σλ(w, w′) are given in (1). The following theorem gives the approximate distribution of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} under the null and the alternative models.
Theorem 2.4
a. Suppose λ = 1 (the null model that the sequences are IID). Then, in distribution, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} \frac {D^S_2} {\sqrt n} = Z^S_1 =
\sum_ {{\bf w} \in {\cal A} ^k} \frac {Z^ {(1 )} _ {\bf w} Z^ {(2
)} _ {\bf w}} {\sqrt {(Z^ {(1 )} _ {\bf w})^2 + (Z^ {(2 )} _ {\bf
w})^2}} \tag {5}
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(1 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(2 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} are independent and have the same mean 0 normal distribution.
b. Suppose 0 < λ < 1 (the alternative model I), and assume that Pλ(w) − pw have different signs in w. Then, in distribution, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} {\sqrt n} \left(\frac {D^S_2} {n} -
\sum_ {{\bf w} \in {\cal A} ^k} \frac {\mid P_ \lambda ({\bf w})-
p_ {\bf w} \mid} {{\sqrt 2}} \right)= Z^S_ \lambda ,
\end{align*}
\end{document}
Since each term on the right hand side of (5) has a normal distribution under the null model by Reinert et al. (2009), and the terms are jointly normal, the limit of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {D^S_2} {{\sqrt n}}$$
\end{document} is normally distributed with mean zero. The variance can be estimated from the empirical distribution, as illustrated in Reinert et al. (2009).
Replacing \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\sum\nolimits_ {{\bf w} \in {\cal A} ^k} \frac {\mid P_ \lambda
({\bf w})- p_ {\bf w} \mid} {{\sqrt 2}}$$
\end{document} by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\sum\nolimits_ {{\bf w} \in {\cal A} ^k} \frac {\mid P_ \lambda
({\bf w})- p_ {\bf w} \mid} {{\sqrt 2}} - \frac {3 {\sqrt 2}}
{8n} \sum\nolimits_ {{\bf w} \in {\cal A} ^k} \frac {\sigma_
{\lambda} ^2 ({\bf w} )} {\mid P_ {\lambda} ({\bf w})- p_ {\bf w}
\mid}$$
\end{document} can be significant when we study the power of detecting the relationships between two sequences using \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document}, as we shall see in Section 4.2.
The proofs of these theorems are presented in the Appendix.
2.4. The power of detecting the relationship between two sequences under alternative model I using D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^{\rm S}_2$$
\end{document}
Knowing the asymptotic distributions of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} under the null and the alternative models, we are able to approximate the power of detecting the relationships between two sequences using any of the three statistics. For notational simplicity, let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
A (\lambda)= \sum_ {{\bf w} \in {\cal A} ^k} P_ {\lambda} ^2 ({\bf
w}), \quad A^* (\lambda)= \sum_ {{\bf w} \in {\cal A} ^k} \frac
{(P_ {\lambda} ({\bf w})- p_ {\bf w})^2} {p_ {\bf w}} ,
\end{align*}
\end{document}\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
A^S (\lambda)= \frac {1} {{\sqrt 2}} \sum_ {{\bf w} \in {\cal A}
^k} \mid P_ {\lambda} ({\bf w})- p_ {\bf w} \mid
\end{align*}
\end{document}
denote the (asymptotic) means of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} under alternative model I. Let Φ(·) be the cumulative distribution function of the standard normal distribution. The following theorem follows from Theorems 2.2, 2.3, and 2.4.
Theorem 2.5
Assume that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {(P_ {\lambda} ({\bf w})- p_ {\bf w})^2} {p_ {\bf w}}$$
\end{document} and Pλ(w) − pw are not constant in w. Then, for any given type I error α, the power of detecting the relationship between two sequences against the null model that λ = 1 using D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} can be approximated by 1 − Φ(C(λ)), 1 − Φ(C*(λ)), and 1 − Φ(CS(λ)), respectively, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
C (\lambda)& = - {\sqrt n} B (\lambda)+ z_{\alpha} / (\sqrt{2} \Sigma_{\lambda}), \\ C^{*} (\lambda)& = - {\sqrt n} B^* (\lambda ) + z_{\alpha}^{*} / (\sqrt{2n} \Sigma^{*}_{\lambda}), \\ {\rm and} \quad C^S (\lambda)& = - {\sqrt n} B^S (\lambda)+ z_{\alpha}^{S} / (\sqrt{2} \Sigma^{S}_{\lambda}).
\end{align*}
\end{document}
Note that we can again replace AS(λ) by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$A_m^S (\lambda)= A^S (\lambda)- \frac {3 {\sqrt 2}} {8n}
\sum\nolimits_ {{\bf w} \in {\cal A} ^k} \frac {\sigma_ {\lambda}
^2 ({\bf w} )} {\mid P_ {\lambda} ({\bf w})- p_ {\bf w} \mid}$$
\end{document} when we calculate the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} for relatively small values of the sequence length n. Here the subscript m stands for modified.
Theorem 2.5 indicates that, when the sequence length n is large, the first term dominates in each of C(λ), C*(λ), and CS(λ), and the second term becomes negligible in comparison. Therefore, for sufficiently large n, the larger the value of B, the more powerful the corresponding statistic. In Section 4, we present some examples of values of the B's and the C's.
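The role of the two terms in Theorem 2.5 can be illustrated numerically. The following is a minimal sketch, assuming that B(λ), Σλ, and zα have already been computed; the function names and the numerical values are hypothetical and are not taken from this article. It uses the form of C(λ) given for D2 and $D_2^S$ (the expression for $D_2^*$ has an additional factor of √n in the second denominator).

```python
import math


def std_normal_cdf(x: float) -> float:
    """Phi(x), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power(n: int, b: float, z_alpha: float, sigma: float) -> float:
    """Approximate power 1 - Phi(C), with C = -sqrt(n) * B + z_alpha / (sqrt(2) * Sigma).

    B, z_alpha, and Sigma are treated as given inputs (assumption of this sketch).
    """
    c = -math.sqrt(n) * b + z_alpha / (math.sqrt(2.0) * sigma)
    return 1.0 - std_normal_cdf(c)


# Hypothetical values chosen only to show the effect of n.
powers = [approx_power(n, b=0.01, z_alpha=1.5, sigma=1.0)
          for n in (100, 10_000, 1_000_000)]
```

With B > 0 held fixed, the approximate power in this sketch increases toward 1 as n grows, because the term −√n B eventually outweighs the fixed threshold term.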
The tests under alternative model I make extensive use of the fact that the means of our statistics are different under the alternative model versus the null model. Under alternative model II, this will turn out not to be the case.
3. Alternative Model II
In this section, we consider the second alternative model which is inspired by horizontal gene transfer. We randomly choose a certain number of segments in the first sequence and then replace the corresponding segments (position-wise) in the second sequence by the letters in the first sequence.
3.1. A second HMM model for the sequence pair A and B
Alternative model II is again an HMM for the sequence pair \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\bf A} = A_1A_2 \cdots A_{n + k - 1}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\bf B} = B_1B_2 \cdots B_{n + k - 1}$$
\end{document}. First, two IID sequences A and B′ are generated. From these two sequences we construct B as follows. We assume that at each position which is not already covered by a chosen segment, with probability λ, the original bases of the two sequences at the position are kept. With probability 1 − λ, a segment of length M from the first sequence is chosen, and the same segment in the second sequence is replaced by it. Then we move to the end of the segment to start this process again. Consider an underlying Markov chain \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Q_1 Q_2 \cdots Q_i \cdots$$
\end{document} defined as follows. Each Qi takes values in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{0 , 1 , 2 , \cdots , M \}$$
\end{document}, where Qi = 0 indicates that, at position i, Ai and Bi are the originally generated bases, whereas \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Q_i = m \ (m = 1 , 2 , \cdots , M)$$
\end{document} indicates that position i is at the m-th position of a segment which was copied from the first sequence to the second sequence. The transition matrix of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Q_1 Q_2 \cdots Q_i \cdots$$
\end{document} is given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$T = (t_{mm^{\prime}})_{(M + 1)\times (M + 1 )}$$
\end{document}, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$t_{00} = t_{M0} = \lambda , \ t_{01} = t_{M1} = 1 - \lambda , \ t_{m , m + 1} = 1 , \ m = 1 , 2 , \cdots , M - 1$$
\end{document}, and all the other t's are 0. It is easy to see that the stationary distribution of this Markov chain is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\pi = \frac {1} {\lambda + M (1 - \lambda )} (\lambda , 1 -
\lambda , 1 - \lambda , \cdots , 1 - \lambda)$$
\end{document} (see Proposition 2.1 in Zhai et al. [2010]).
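The stationary distribution can be checked numerically against the transition matrix T defined above. The following is a minimal sketch in plain Python; the function names and the values λ = 0.99 and M = 5 are illustrative assumptions, not taken from the article.

```python
def transition_matrix(lam: float, M: int) -> list[list[float]]:
    """(M+1) x (M+1) matrix T over the states 0, 1, ..., M as defined in the text."""
    T = [[0.0] * (M + 1) for _ in range(M + 1)]
    for row in (0, M):            # from state 0, or from the last position of a segment
        T[row][0] = lam           # keep the originally generated bases
        T[row][1] = 1.0 - lam     # start a new copied segment
    for m in range(1, M):         # inside a segment the chain moves deterministically
        T[m][m + 1] = 1.0
    return T


def stationary(lam: float, M: int) -> list[float]:
    """Closed-form stationary distribution pi stated in the text."""
    denom = lam + M * (1.0 - lam)
    return [lam / denom] + [(1.0 - lam) / denom] * M


def left_multiply(pi: list[float], T: list[list[float]]) -> list[float]:
    """Row vector times matrix: (pi T)_j = sum_i pi_i * T[i][j]."""
    return [sum(pi[i] * T[i][j] for i in range(len(pi))) for j in range(len(pi))]


lam, M = 0.99, 5                  # illustrative values only
pi = stationary(lam, M)
pi_T = left_multiply(pi, transition_matrix(lam, M))
max_gap = max(abs(a - b) for a, b in zip(pi, pi_T))
```

In this sketch `max_gap` is zero up to floating-point rounding, which is the numerical counterpart of πT = π.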
Let Ci = (Ai, Bi)^t, where the superscript t denotes the transpose. With pa denoting the probability of letter a in the IID model, the emission probabilities are given by
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
P ((A_i , B_i)= (a , b)\mid Q_i = 0)= p_a p_b , \quad P ((A_i , B_i)= (a , b)\mid Q_i = m)= p_a I (a = b) , \quad m = 1 , 2 , \cdots , M .
\end{align*}
\end{document}
Then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$C_1 C_2 \cdots C_i \cdots$$
\end{document} form an HMM.
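The construction of the pair (A, B) can be sketched as a short simulation. This is a minimal sketch that follows the verbal description above rather than any published code: the function name, the choice to start the process at position 1 outside a segment (instead of in the stationary distribution), and the truncation of a segment that runs past the end of the sequence are all assumptions made here.

```python
import random


def simulate_model_two(n, k, lam, M, freqs, seed=None):
    """Simulate (A, B, B') of length n + k - 1 under the pattern transfer model.

    freqs maps each letter to its IID probability. B' is the sequence
    before any segments of A have been copied into it.
    """
    rng = random.Random(seed)
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    length = n + k - 1
    a_seq = rng.choices(letters, weights=weights, k=length)
    b_prime = rng.choices(letters, weights=weights, k=length)
    b_seq = list(b_prime)
    i = 0
    while i < length:
        if rng.random() < lam:
            i += 1                          # keep the original bases at position i
        else:
            end = min(i + M, length)        # copy a segment of length M (truncated at the end)
            b_seq[i:end] = a_seq[i:end]
            i = end                         # restart the process after the segment
    return "".join(a_seq), "".join(b_seq), "".join(b_prime)


freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # illustrative frequencies
a, b, b_prime = simulate_model_two(n=50, k=5, lam=0.9, M=4, freqs=freqs, seed=1)
```

Two limiting cases give a quick sanity check of the sketch: with λ = 1 no segment is ever copied and B equals B′, whereas with λ = 0 every position is covered by a copied segment and B equals A.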
3.2. The asymptotic distributions and power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^S$$
\end{document} for detecting relationships between sequences under alternative model II
Under alternative model II, the marginal distributions of the individual sequences are IID sequences and hence the means of Xw and Yw are unchanged compared to the IID model. However, the two sequences depend on each other because they share some common segments. The following theorem shows an efficient way to calculate the covariance of the number of occurrences of word w in sequence A and the number of occurrences of word w′ in sequence B. These covariances are used to derive the limiting distributions of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} when the sequence length tends to infinity.
Theorem 3.1
Let Xw and Yw be the number of occurrences of word w in sequences A and B, respectively. Assume that the Markov chain starts in the stationary distribution. For any pair of words (w, w′), we have under alternative model II,
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{{\widetilde Z}^{(1 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{{\widetilde Z}^{(2 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} are the same as in part (b).
Based on the above theorem, we can obtain the approximate power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} for detecting the relationships between two sequences under the alternative model II.
Theorem 3.3
Suppose that 0 < λ < 1 and that alternative model II holds. For a given type I error α, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde z}_{\alpha} , {\widetilde z}^*_{\alpha}$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde z}^S_{\alpha}$$
\end{document} be the upper α quantiles of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}_1, \ {\widetilde Z}^{*}_1, \ {\rm and} \ {\widetilde Z}^{S}_1$$
\end{document}, respectively. Then the corresponding powers of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}_ \lambda , \ {\widetilde Z}^*_ \lambda , \ {\rm and} \ {\widetilde Z}^S_ \lambda$$
\end{document} under alternative model II are asymptotically \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$P \{{\widetilde Z}_ \lambda \geq {\widetilde z}_{\alpha} \} , \ P \{{\widetilde Z}^*_ \lambda \geq {\widetilde z}^*_{\alpha} \} , \ {\rm and} \ P \{{\widetilde Z}^S_ \lambda \geq {\widetilde z}^S_{\alpha} \}$$
\end{document}, respectively.
Since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}_1$$
\end{document} is normally distributed with mean 0, the threshold value \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde z}_{\alpha} > 0$$
\end{document} if α < 0.5. From this theorem, it is clear that the power of the three statistics for detecting the relationships between the two sequences does not increase with sequence length n when n is sufficiently large, which is consistent with the simulation results in Reinert et al. (2009). The theoretical results presented here show that none of the three statistics is ideal for detecting the relationships between sequences under alternative model II. Which statistic should be used under alternative model II remains an open problem.
4. Results
In this section, we describe an online implementation and a stand-alone R program for calculating the power of detecting the relationships between two sequences under the alternative model I using any of the statistics studied in this article. Then we compare the mean, variance, and power of the statistics D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} derived using our formula with the simulated quantities for the situations in Reinert et al. (2009). As an illustration of the difficulties involved, we present the results for the relatively simple two letter sequences under alternative model I in the supplementary material. In particular, this simple case shows that in some cases D2 will have zero power for detecting the relationship between two sequences when they share a common motif. In some scenarios, however, we see that D2 can be more powerful than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document}. It also shows that the convergence of the mean and variance of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} to their theoretical limit is very slow, which affects the approximate power calculation; in the parameter region which we considered, the theoretical approximate power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} differs so considerably from the power under simulation that we do not recommend using \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} for moderate sequence lengths. Finally, we give the power of detecting the relationships between two sequences when any of the 323 motifs of length at most 10 in JASPAR (Sandelin et al., 2004) (October 12, 2009 version) is present in both sequences. For alternative model II, we explain why the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} using k = 10 is much higher than using k = 2, 3, 4, 5 for the parameters in simulation studies (Reinert et al., 2009).
4.1. A program for calculating the power of detecting the relationships between two sequences under alternative model I
To facilitate the use of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} for sequence or genome comparison and for evaluation of statistical power for detecting the relationship between the sequences, a web-based online program (http://meta.cmb.usc.edu/d2) and a stand-alone R program were developed to calculate the power of sequence comparison using these statistics. We describe the program for the above model. However, the program can be easily extended to the general scenario of different background letter frequencies, sequence lengths, and motif densities as in the supplementary materials. The inputs of the program are:
the background nucleotide or amino acid frequencies \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p^{(0 )}_l , \ l = 0 , 1 , \cdots , L - 1$$
\end{document} of the two sequences A and B under study;
the nucleotide or amino acid frequencies \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$p^{(m )}_l , \ l = 0 , 1 , \cdots , L - 1 , \ m = 1 , 2 , \cdots , M$$
\end{document} at each position of the motif (the position weight matrix, PWM);
the lengths n of the sequences A and B;
the motif density, 1 − λ, for the sequences A and B;
the type I error, α.
For each set of parameters, the program first calculates the mean \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$P_{\lambda} ({\bf w})= {\mathbb E}_{\lambda} (X_{\bf w} )$$
\end{document} for any word w and the covariance σλ(w, w′) = Cov(Xw, Xw′) for any two words w and w′, related to sequence A. The corresponding quantities related to sequence B are also calculated. Secondly, the program calculates the approximate variances, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$2 (\Sigma_{\lambda})^2 , \ 2 (\Sigma^{*}_{\lambda})^2 , \ {\rm and} \ 2 (\Sigma^{S}_{\lambda})^2$$
\end{document} of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} using formulas derived in Theorems 2.2, 2.3, and 2.4, respectively. Thirdly, for the given type I error α, the threshold values zα, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$z^*_{\alpha}$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$z^S_{\alpha}$$
\end{document} for the corresponding statistics D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} in Theorem 2.5 are calculated. Since the cumulative distribution functions of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Z^*_1$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Z^S_1$$
\end{document} are not readily available, a simulation-based method is used to obtain the threshold values. A large number of independent sequence pairs are simulated according to the specified letter frequencies and sequence lengths, and the empirical distributions of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} are estimated. The threshold values are estimated by the upper α quantiles of the simulated values of each statistic. Finally, the values of C(λ), C*(λ), and CS(λ), and thus the power of the corresponding statistics in Theorem 2.5, are calculated.
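The simulation-based threshold step can be sketched for the simplest of the three statistics. The following is a minimal sketch, not the program described above: it uses the raw D2 count (the number of matching k-tuples between the two sequences) without any normalization, and the function names, the number of simulations, and the order-statistic convention for the quantile are assumptions made here.

```python
import random
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Counts of all overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Raw D2: sum over words w of X_w * Y_w."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * counts_b[word] for word, count in counts_a.items())


def null_threshold(n, k, freqs, alpha, n_sim=1000, seed=0):
    """Empirical upper-alpha quantile of D2 for independent IID sequence pairs."""
    rng = random.Random(seed)
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    length = n + k - 1
    values = []
    for _ in range(n_sim):
        a = "".join(rng.choices(letters, weights=weights, k=length))
        b = "".join(rng.choices(letters, weights=weights, k=length))
        values.append(d2(a, b, k))
    values.sort()
    # One simple convention: the order statistic at rank floor((1 - alpha) * n_sim).
    index = min(n_sim - 1, int((1.0 - alpha) * n_sim))
    return values[index]


freqs = {"A": 1 / 6, "C": 1 / 3, "G": 1 / 3, "T": 1 / 6}
threshold_05 = null_threshold(n=200, k=3, freqs=freqs, alpha=0.05, n_sim=200)
```

The same pattern would apply to $D_2^*$ and $D_2^S$ by swapping in the corresponding statistic for `d2`; those definitions are not reproduced in this sketch.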
We use the program to study the power of detecting the relationship between related sequences under alternative model I using the different statistics. In Subsection 4.2, we present the results for the parameter sets used in Reinert et al. (2009) and compare the results derived using our program with the simulated quantities in previous studies. In Subsection 4.3, we present the power of the various statistics for comparing the relationships between sequences when any of the motifs with motif length at most 10 in JASPAR (Sandelin et al., 2004) are present in both sequences.
4.2. Comparison of theoretical mean, standard deviation, and power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^{\rm S}$$
\end{document} with their corresponding simulated values from Reinert et al. (2009)
In this subsection, we present some numerical results on the mean, standard deviation, and power of detecting the relationships between two sequences for the three statistics D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} under the alternative model I using the same set of parameters as in Reinert et al. (2009). The objective is to see how closely the corresponding quantities calculated using our formulas approximate the true values. We let the background letter frequencies for the two sequences be pA = pT = \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {1} {6}$$
\end{document}, pC = pG = \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {1} {3}$$
\end{document}. The inserted motif is “AGCCA” and the motif density is 1 − λ = 0.01. The size of the k-tuple is k = 5. We used 10,000 simulations to find the threshold values z0.05, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$z^*_{0.05}$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$z^S_{0.05}$$
\end{document}. The type I error α was set at 0.05 and 0.01.
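The simulation setting described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the program used for the reported results. The insertion scheme in `insert_motif` (scan the sequence and, with probability 1 − λ at each visited position, overwrite the next letters with the motif) is an assumed reading of alternative model I. The formulas in `d2_statistics` are the definitions commonly given for D2, D2* and D2S (raw count products, centralized counts scaled by the expected count, and self-standardized centralized counts); they are restated here as an assumption because this subsection does not repeat them. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"

def background(n, probs, rng):
    """Draw n i.i.d. letters with the given letter probabilities."""
    return rng.choices(ALPHABET, weights=[probs[a] for a in ALPHABET], k=n)

def insert_motif(seq, motif, lam, rng):
    """Assumed insertion scheme: at each visited position, overwrite the next
    len(motif) letters with the motif with probability 1 - lam."""
    seq = list(seq)
    i = 0
    while i + len(motif) <= len(seq):
        if rng.random() >= lam:          # happens with probability 1 - lam
            seq[i:i + len(motif)] = motif
            i += len(motif)              # continue after the inserted copy
        else:
            i += 1
    return seq

def kmer_counts(seq, k):
    """Count all overlapping k-tuples of the sequence."""
    s = "".join(seq)
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def word_prob(word, probs):
    """Probability of a word under the i.i.d. background model."""
    p = 1.0
    for a in word:
        p *= probs[a]
    return p

def d2_statistics(x, y, k, probs):
    """Return (D2, D2*, D2S) under the assumed standard definitions."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nbar, mbar = len(x) - k + 1, len(y) - k + 1
    d2 = d2_star = d2_s = 0.0
    for w in map("".join, product(ALPHABET, repeat=k)):
        pw = word_prob(w, probs)
        xt = cx[w] - nbar * pw           # centralized count in sequence 1
        yt = cy[w] - mbar * pw           # centralized count in sequence 2
        d2 += cx[w] * cy[w]
        d2_star += xt * yt / (sqrt(nbar * mbar) * pw)
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2_s += xt * yt / denom
    return d2, d2_star, d2_s

# Parameters taken from the text: k = 5, motif AGCCA, lambda = 0.99.
rng = random.Random(0)
probs = {"A": 1 / 6, "C": 1 / 3, "G": 1 / 3, "T": 1 / 6}
x = insert_motif(background(1000, probs, rng), "AGCCA", 0.99, rng)
y = insert_motif(background(1000, probs, rng), "AGCCA", 0.99, rng)
print(d2_statistics(x, y, 5, probs))
```

Repeating the last four lines many times with λ = 1 and with λ = 0.99 would give null and alternative samples of each statistic for sequences of length 1Kbp.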
From Theorems 2.2, 2.3, and 2.4, it can be seen that the (approximate) means are \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\sqrt n} (A (\lambda)- A (1)), \ {\sqrt n} A^* (\lambda) \ {\rm and} \ {\sqrt n} A^S (\lambda)$$
\end{document}, respectively. Similarly, the approximate variance of ND2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2^S$$
\end{document} are \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$2 (\Sigma_{\lambda})^2, \ 2 (\Sigma^*_{\lambda})^2, \ {\rm and} \ 2 (\Sigma^S_ \lambda)^2$$
\end{document}, respectively.
Table 1 shows the simulated mean and standard deviation of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2 , \ ND_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2^S$$
\end{document}, respectively, and their corresponding limits. Surprisingly, the approximate mean and standard deviation of ND2 are within 15% of their limit even when the sequence length is just 1Kbp. For \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, the simulated mean is roughly the same as the theoretical limit and the simulated standard deviation is within 21% of its theoretical limit when the sequence length is at least 1Kbp. However, the simulated mean of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} is much smaller than its limit. The corrected mean for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} is also very different from the simulated mean, probably because the differences Pλ(w) − pw are very small for most 5-tuples; neither approximation works well in this parameter regime. Therefore, while we expect that the power formulas we derived should approximate the true power of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} well even for sequences over 1Kbp long, the power formula for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} can significantly over-estimate the true power.
Comparison of Simulated Mean and Variance of ND2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$ND_2^S$$
\end{document} for Different Sequence Lengths n with the Corresponding Theoretical Limits (the last row), with (pA, pC, pG, pT) = (1/6, 1/3, 1/3, 1/6), λ = 0.99, Motif = “AGCCA”, and Word Length k = 5
Table 2 shows the theoretical approximate power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} calculated using our formulas and the simulated power with the same setting as in Table 1. The results show that the approximations are very close for D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}. However, the theoretical approximate power based on the first approximation significantly over-estimates, while the approximate power based on the second approximation significantly under-estimates the simulated power for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document}, in the parameter regime we consider.
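The simulated power referred to above can be obtained from two sets of simulated statistic values, one under the null model and one under the alternative. The sketch below shows one plausible way to do this: the threshold is taken from the sorted null values so that at most a fraction α of them exceed it, and the power is the fraction of alternative values above that threshold. The function names and the tie-handling convention are assumptions made for illustration; the normal draws in the usage lines are toy data, not output of the models in this paper.

```python
import random

def empirical_threshold(null_stats, alpha):
    """Return a threshold exceeded by at most a fraction alpha of the null values."""
    s = sorted(null_stats)
    # Number of null values allowed strictly above the threshold
    # (the small constant guards against floating-point round-off).
    n_reject = int(alpha * len(s) + 1e-9)
    return s[len(s) - n_reject - 1]

def simulated_power(null_stats, alt_stats, alpha):
    """Fraction of alternative-model values above the empirical null threshold."""
    t = empirical_threshold(null_stats, alpha)
    return sum(v > t for v in alt_stats) / len(alt_stats)

# Toy usage: standard normal null values versus shifted normal values.
rng = random.Random(0)
null_values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
alt_values = [rng.gauss(1.0, 1.0) for _ in range(10_000)]
print(simulated_power(null_values, alt_values, 0.05))
```

In the setting of this subsection, `null_values` would hold the statistic computed on 10,000 sequence pairs simulated with λ = 1 and `alt_values` the same statistic computed with λ = 0.99.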
Comparison of the Theoretical and the Simulated Power Under Alternative Model I for Different Values of Sequence Length, with (pA, pC, pG, pT) = (1/6, 1/3, 1/3, 1/6), λ = 0.99, Motif = “AGCCA”, and Word Length k = 5
As the approximate power for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} is not accurate in the parameter regimes we have considered, we show in the following only the results related to D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} using the theoretical approximate power. Figure 1 shows the values of C(λ) and C*(λ) (upper panel) and the power of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} for detecting the relationships between pairs of sequences (lower panel) as a function of the sequence length and the word length k when λ = 0.99. Note that the power is a decreasing function of C: the smaller the value of C, the higher the power of the corresponding statistic. From the left panels, which relate to D2, it can be seen that, when k = 2 or 3, the value of C actually increases with the sequence length, so that the power 1 − Φ(C(λ)) decreases. For a given sequence length and word size k, the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} is generally higher than the power of D2. All these conclusions are consistent with the simulation studies in Reinert et al. (2009). Comparing the two figures in the lower panel of Figure 1 here with Figures 1 and 2 in Reinert et al. (2009), respectively, we can see that the theoretical power is slightly higher than the simulated power, but the difference is generally small, less than 10% in all the situations considered.
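The relation between C and the approximate power 1 − Φ(C) used in this discussion can be checked numerically. The short Python sketch below evaluates the standard normal distribution function through the error function; the values of C in the loop are arbitrary illustrative inputs, not values of C(λ) or C*(λ) computed from the theorems.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(c):
    """Approximate power 1 - Phi(C) for a given value of C."""
    return 1.0 - normal_cdf(c)

# Arbitrary illustrative values of C: larger C gives smaller power.
for c in (-2.0, -1.0, 0.0, 1.0, 1.645):
    print(f"C = {c:6.3f}   power ~ {approx_power(c):.3f}")
```

A value of C near 1.645 gives a power close to 0.05, so a value of C that grows with the sequence length beyond that point corresponds to a power below the type I error α = 0.05.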
The values of C(λ) and C*(λ) (upper panels) and the power of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} (lower panels) for detecting the relationships between sequence pairs related through alternative model I for different values of word size k = 2, 3, 4, 5 and sequence length n. The parameters were set at pA = pT = 1/6, pC = pG = 1/3, λ = 0.99, and type I error α = 0.05.
The values of B(λ) and B*(λ) for λ = 0.93, 0.99 and k = 2, 3, 4, 5. Dashed lines refer to B and solid lines to B*; triangles refer to λ = 0.93 and circles to λ = 0.99. B(0.99), dashed line with circle points; B(0.93), dashed line with triangle points; B*(0.99), solid line with circle points; B*(0.93), solid line with triangle points.
Simulation studies can only explore the influence of a relatively small range of parameter sets on the power of the different tests. With the theoretical results presented in this paper, we are able to explore a much larger parameter space. Theorem 2.5 shows that the power of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} is mainly determined by B(λ) and B*(λ), respectively. The higher the value of B, the more powerful the test. Therefore, we also plot the values of B(λ) and B*(λ) for k = 2, 3, 4, 5 and λ = 0.93 or 0.99 (Fig. 2). Again, B*(λ) is generally larger than B(λ), indicating that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} is generally more powerful than D2. We note that both B and B* decrease when λ increases. The smaller λ is, the larger the probability of inserting a motif, and the easier it is to detect a difference from the null model.
4.3. The power of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^*$$
\end{document} for comparing two sequences when motifs in JASPAR are present
Since the approximate distribution of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} in Theorem 2.4 requires very long sequences and the resulting formula for calculating the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} significantly over-estimates the true power, we will not consider \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} in the following. We next investigate whether the relative performance of D2 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} in detecting the relationship between two sequences holds for a large class of motifs. To this end, we downloaded the transcription factor (TF) binding sites in the JASPAR CORE database (Sandelin et al., 2004) as motifs and studied the power of detecting the relationship between two sequences when such motifs are present in the sequences. The same background letter frequencies as in Reinert et al. (2009) are used. The theoretical formulas obtained in this paper make such large-scale comparisons possible. Because of the long computational time required when the motif length is large, we only consider motifs of length at most 10.
A total of 323 transcription factor binding profiles with length at most 10 from JASPAR CORE (Sandelin et al., 2004) (October 12, 2009 version) are currently available. These motifs represent the most abundant publicly available knowledge regarding nucleotide sequence motifs. The corresponding PWMs are used to insert motifs as in alternative model I. Based on these assumptions, we can calculate the values of B(λ), B*(λ), C(λ), C*(λ), and the corresponding power for different values of word length k and motif density 1 − λ. The resulting figures and the corresponding letter frequencies in each position for all the motifs are presented in the supplementary material. From this large-scale study, we can conclude that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} is more powerful than D2 in more than 90% of the motifs. An example motif profile “MA0003” for which D2 is more powerful than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} is given in Figure 3. Note that the overall frequencies of (A, C, G, T) in this motif are (0.11, 0.40, 0.40, 0.09).
The sequence LOGO of motif “MA0003”.
We then calculate the mean overall letter frequencies of (A, C, G, T) in those motifs for which D2 is more powerful than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} for at least three of the word lengths k = 2, 3, 4, 5 (λ = 0.93); the corresponding frequencies are (0.08, 0.22, 0.57, 0.13). On the other hand, the mean overall letter frequencies of (A, C, G, T) in the other motifs are (0.30, 0.22, 0.25, 0.23). Under the background sequence model with (A, C, G, T) frequencies equal to (1/6, 1/3, 1/3, 1/6), in general, the GC content of the motifs for which D2 is more powerful than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document} is higher than that of the other motifs. If the background sequence model is changed, the PWMs of the motifs for which D2 outperforms \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} should also change. As a general rule, D2 outperforms \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} if the letter frequencies in a motif are close to the background letter frequencies.
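The comparison between the overall letter frequencies of a motif and the background frequencies can be made concrete with a small Python sketch. It assumes that the overall frequency of a letter is the average of its probabilities over the motif positions, and it uses the L1 distance to the background as a closeness measure; both choices are assumptions made for illustration and are not stated in the text. The matrix `toy_pwm` is an invented three-position example, not a JASPAR profile.

```python
def overall_frequencies(pwm):
    """Average each letter's probability over the motif positions.

    `pwm` is a list of dicts, one per position, mapping A, C, G, T to
    probabilities that sum to 1.
    """
    length = len(pwm)
    return {a: sum(col[a] for col in pwm) / length for a in "ACGT"}

def l1_distance(freqs, background):
    """Sum of absolute differences between two letter distributions."""
    return sum(abs(freqs[a] - background[a]) for a in "ACGT")

# Background from the text; the PWM below is a made-up toy example.
background = {"A": 1 / 6, "C": 1 / 3, "G": 1 / 3, "T": 1 / 6}
toy_pwm = [
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    {"A": 0.1, "C": 0.5, "G": 0.3, "T": 0.1},
]
freqs = overall_frequencies(toy_pwm)
print(freqs, l1_distance(freqs, background))
```

Under the general rule stated above, a small distance would flag a motif as a candidate for the minority of cases in which D2 can outperform the centralized statistic; the actual comparison still requires the power formulas.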
Since we found that, in most situations, the power of D2 can be even smaller than the type I error, whereas the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} always approaches 1 for sequence length tending to infinity, we do not suggest using D2 in general situations even if it can potentially perform well in some special cases.
4.4. The power of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\rm D}_2^{\rm S}$$
\end{document} for detecting the relationships between two sequences under alternative model II
Previous simulation studies have shown that, under alternative model II, the power of D2 is less than 0.4 and decreases with sequence length n when the word size k is 2 to 6. In fact, we can show that when n is large, the power of D2 is always less than 0.5 for any parameter set. Note that Theorem 3.3 shows that the power of D2 is approximately \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$P ({\widetilde Z}_ \lambda \geq {\widetilde z}_{\alpha} )$$
\end{document}. Since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde z}_{\alpha}$$
\end{document} is positive and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}_{\lambda}$$
\end{document} is approximately normally distributed, the power is less than 0.5 when the sequence length is large for any set of parameters. However, similar arguments will not work for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} since the distributions of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}^*_{\lambda}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\widetilde Z}^S_{\lambda}$$
\end{document} are not normal when λ < 1. The bound of 0.5 on the power of D2 shows that D2 does not have enough power to detect the relationship between sequences under alternative model II, so we will not study D2 further under this model. Previous simulation studies also showed that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} is less powerful than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}. So we now concentrate on further understanding \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document}.
Theorems 3.2 and 3.3 show that the power of detecting the relationships between two sequences related through the alternative model II using any of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} reaches its plateau quickly as the sequence length increases, and the limit is generally much smaller than 1. Theorems 3.2 and 3.3 thus explain the simulation result that the simulated power of each of the statistics tends to a limit that is typically less than 1 as the sequence length goes to infinity (Reinert et al., 2009), a result that was quite intriguing at the time of the simulation studies. Let T be any one of the statistics D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^*_2$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document}. It is theoretically shown here that the primary reason for the power of T to be stable with respect to sequence length n is that there exist constants an and bn such that Uλ,n = an(T − bn) approximates a non-degenerate random variable Uλ under both the null model (λ = 1) and the alternative model (λ < 1). Although Uλ is stochastically decreasing with respect to λ, the power of the test approaches the constant P(Uλ ≥ uα), where P(U1 ≥ uα) = α. In order for the power of T to increase with respect to sequence length n and to finally reach 1, we need that (1) U1,n approximates a non-degenerate random variable U1 under the null model (λ = 1), and (2) under the alternative model (λ < 1), Uλ,n tends to infinity as n tends to infinity.
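The two regimes in this argument can be illustrated with a toy Python calculation. It assumes, purely for illustration, that the normalized statistic is normal with unit variance under both models; this is not the case for the centralized and self-standardized statistics under alternative model II, whose limits are not normal. In the first regime the mean under the alternative is a fixed shift, so the power is a constant below 1 for every n. In the second regime the mean grows like the square root of n, so the power tends to 1. The shift values are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_with_shift(shift, alpha=0.05):
    """Power of the test 'reject if U >= u_alpha' when U ~ N(shift, 1)."""
    u_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)   # P(U_1 >= u_alpha) = alpha
    return 1.0 - STD_NORMAL.cdf(u_alpha - shift)

fixed_shift = 0.5      # regime 1: limit law does not depend on n
rate = 0.02            # regime 2: mean grows like rate * sqrt(n)
for n in (100, 1_000, 10_000, 100_000):
    plateau = power_with_shift(fixed_shift)
    growing = power_with_shift(rate * sqrt(n))
    print(f"n = {n:7d}   plateau power = {plateau:.3f}   growing power = {growing:.3f}")
```

The first column of powers is identical for all n, which mirrors the plateau described above; the second column increases towards 1, which corresponds to conditions (1) and (2).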
Another interesting observation from previous simulation studies is that the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} seems to increase with the length, k, of the word pattern used (see Figure 8 in Reinert et al. (2009)). In order to explain this phenomenon, we study the mean of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} as a function of word length k. We are aware that in general the power of a test depends on the distributions of the test statistics under the null and the alternative hypothesis, not just the mean and/or the variance. However, as an explanation of this intriguing observation, we examine whether \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\mathbb E} (D^*_2)$$
\end{document} increases with k when other parameters are fixed. Theorem 3.1 (d) shows that
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_{n \rightarrow \infty} {\mathbb E} (D_2^*) = \sum_{{\bf w} \in {\cal A}^k} \left[ \frac{\gamma_0 ({\bf w}) + 2 \sum_{j = 1}^{k - 1} \gamma_j ({\bf w}, {\bf w})}{p_{\bf w}} \right] - (2k - 1) = S (\lambda, k).
\end{align*}
\end{document}
Figure 4 shows the relationship between S(λ, k) and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\lambda \in (0.9 , 1)$$
\end{document} for k = 2, 4, 6, 8, 10. It can be seen that S(λ, k) increases with k for any \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\lambda \in (0.9 , 1.0)$$
\end{document}, as does the discrepancy between S(λ, k) and S(1, k) for λ < 1. As our statistic is based on comparing the means of the counts under the two models, this partially explains why the power of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} increases with word length k.
The values of $S (\lambda, k) = \lim\nolimits_{n \rightarrow \infty} {\mathbb E} (D_2^*)$ as a function of motif density $\lambda$ and word length $k$, for $\lambda = 0.9$ to $1.0$ in steps of $0.01$ and $k = 2, \cdots, 10$.
5. Discussion
Alignment-free sequence comparison has become increasingly important as new sequencing technologies can generate enormous amounts of sequence data in a relatively short time and at low cost. However, the statistics used for alignment-free sequence comparison are usually ad hoc, and it is not clear whether such ad hoc statistics can actually find the relationships between sequences. It is also important to know under which evolutionary models the statistics are meaningful. One of the widely discussed and studied statistics for alignment-free sequence comparison is the $D_2$ statistic. Previous simulation studies have shown the limitations of $D_2$ in detecting the relationships between sequences under a common motif model (alternative model I) and a pattern transfer model (alternative model II). It was shown that the power of $D_2$ can even be smaller than the pre-specified type I error in some situations. Two new statistics, $D_2^*$ and $D_2^S$, were developed to overcome the inherent problems of $D_2$, and simulation studies showed their superior performance compared to $D_2$ (Reinert et al., 2009).
However, the approximate distributions of these statistics were not known at the time of the study (Reinert et al., 2009), and thus, it was not possible to give a theoretical formula to calculate the power of the different tests. Having the limiting distribution of the test statistics can help us design algorithms to calculate the power. With the power calculator, we are able to explore a large range of the parameter space and study how the parameters individually and collectively contribute to the power of the tests. The theoretical studies also give insights into when and how the test statistics can be applied to compare sequences. In this paper, we carried out a systematic theoretical study of the power of $D_2$, $D_2^*$, and $D_2^S$ for detecting the relationships between sequences under alternative models I and II. Under alternative model I, we provided an easy-to-use program to calculate the power of the test statistics $D_2$ and $D_2^*$ for different combinations of parameters. Using the program, we then obtained the theoretical power, compared it with the simulated power using the same parameters as in Reinert et al. (2009), and showed that they are generally close, thus validating the usefulness of our program. However, the convergence of $D_2^S$ to our theoretical limit is very slow and the approximation is only reasonable for very long sequences. We then carried out a large-scale comparison of the $D_2$ and $D_2^*$ statistics for sequence comparison under alternative model I when the motif is any one of the 323 motifs with length at most 10 in JASPAR CORE. Our program made such a large-scale comparison possible. We verified the relative performance of $D_2$ and $D_2^*$ observed in previous studies, i.e., $D_2^*$ is generally more powerful than $D_2$. Under alternative model II, we theoretically showed that the power of the three statistics tends to a constant, usually less than 1. We also gave some reasons why the power of $D_2^*$ increases with the word size $k$.
This study has several limitations regarding the models of the background sequences and the foreground motif models. The IID model was used to model the background sequence. It is known that the genomes of organisms are hierarchically organized (Mantegna et al., 1994) and simple IID models cannot fully describe the background sequences; instead, high-order Markovian models could be more appropriate. Similarly, the positions of the motifs are assumed to be independent, and again this assumption can be violated for many motifs. To incorporate such complexity into our model, high-order HMMs can potentially be used; the calculations would then become much more involved. Although the extension to higher-order HMMs is conceptually simple, substantial computational issues need to be solved.
We made several simple assumptions regarding the distribution of the motifs along the sequences as in Reinert et al. (2009). First, it was assumed that the motifs are uniformly distributed along the sequences. Motifs can cluster together in some regions and may be sparse in other regions of the sequences. If such inhomogeneity is known to be present, an inhomogeneous HMM can be used to model the distribution of motifs by assuming a high motif density $\lambda$ in motif-clustered regions and a low motif density $\lambda$ in motif-sparse regions. If such motif-clustered and motif-sparse regions are unknown, but suspected, we can assume that $\lambda$ is a random variable following a certain distribution. Second, we considered the presence of just one motif along the sequences. In many situations, several motif patterns work together to form modules. How to model such sequences is a problem for future studies. Third, we emphasize that the three statistics we consider here are most likely not optimal and other more powerful statistics may possibly be constructed. Fourth, applying these statistics to practical examples is another topic for future research.
In this article, we theoretically showed that, under alternative model II, the power of $D_2$, $D_2^*$, and $D_2^S$ converges to a value that is generally much less than 1 when the sequence length tends to infinity. Therefore, they are not appropriate to test for relationships between sequences under this model. The obvious important question is which statistics based on word counts should be used for testing against this model instead.
6. Appendix A: Proofs of the Theorems
In this Appendix, we prove the theorems in the main text.
A.1. Proofs of Theorems 2.2–2.5 under alternative model I
It has been shown in Zhai et al. (2010), Proposition 2.4, for 0 < λ < 1, and in Reinert et al. (2009), Proposition 6.1, for λ = 1, that, in distribution,
\begin{align*}
\lim_{n \rightarrow \infty} {\sqrt n} \left(\frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})\right) = \lim_{n \rightarrow \infty} {\sqrt n} \left(\frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})\right) = N (0, \sigma_{\lambda}^2 ({\bf w})), \tag{10}
\end{align*}
where $\sigma_{\lambda}^2 ({\bf w}) = \lim\nolimits_{n \rightarrow \infty} \frac{{\rm Var} (X_{\bf w})}{n}$. Therefore, the first term in equation (9) tends to 0 when $n \rightarrow \infty$, with alphabet size fixed, and
\begin{align*}
{\sqrt n} \Bigg(\frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) + {\sqrt n} \Bigg(\frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) \rightarrow {\cal N} (0, 2 \sigma_{\lambda}^2 ({\bf w})).
\end{align*}
Let $\sigma_{\lambda} ({\bf w}, {\bf w}^{\prime}) = \lim\nolimits_{n \rightarrow \infty} \frac{{\rm Cov} (X_{\bf w}, X_{{\bf w}^{\prime}})}{n}$, which can be calculated as in Zhai et al. (2010) for $0 < \lambda < 1$, and as in Reinert et al. (2009) for $\lambda = 1$. Since $\{X_{\bf w}, {\bf w} \in {\cal A}^k \}$ and $\{Y_{\bf w}, {\bf w} \in {\cal A}^k \}$ are independent, the second term in (9) is asymptotically normal with mean 0 and variance $2 (\Sigma_{\lambda})^2$. Theorem 2.2 is proved.
We note that the proof of Theorem 2.2 breaks down when all letters are equally likely, as then with $p = p_{\bf w}$,
\begin{align*}
\sum_{\bf w} p {\sqrt n} \left(\frac{X_{\bf w}}{n} - p \right) = 0
\end{align*}
and thus the second term in (9) vanishes.
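The cancellation can be checked numerically. The following is a minimal sketch, not the authors' program: it assumes overlapping $k$-tuple counts and centers them by the number of $k$-tuple positions $n - k + 1$ (the text writes $n$; the two agree asymptotically). All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def weighted_centered_sum(seq, k, alphabet="ACGT"):
    """Return sum_w p * (X_w - m * p) under equally likely letters.

    m is the number of k-tuple positions (assumption: m = len(seq) - k + 1),
    and p is the common probability of every word of length k.
    """
    m = len(seq) - k + 1
    p = (1.0 / len(alphabet)) ** k
    counts = kmer_counts(seq, k)
    return sum(
        p * (counts.get("".join(letters), 0) - m * p)
        for letters in product(alphabet, repeat=k)
    )


random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(500))
# Expected to be zero up to floating-point rounding, for any sequence.
print(abs(weighted_centered_sum(sequence, 3)) < 1e-9)
```

Because the counts sum to the number of positions and the word probabilities sum to 1, the result does not depend on the sequence, which is why this term carries no information in the uniform case.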
Proof of Theorem 2.3
The proof of Theorem 2.3 is similar to the proof of Theorem 2.2. The first part can be easily proved using the normal approximation Corollary 6.1 in Reinert et al. (2009) for the individual centered word counts, which also holds when all letters are equally likely. To prove the second part, note that
\begin{align*}
\frac{D_2^*}{n} = & \sum \frac{1}{p_{\bf w}} \Bigg(\frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) \Bigg(\frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) + \sum \frac{(P_{\lambda} ({\bf w}) - p_{\bf w})}{p_{\bf w}} \Bigg(\frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) \\
& + \sum \frac{(P_{\lambda} ({\bf w}) - p_{\bf w})}{p_{\bf w}} \Bigg(\frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) + \sum \frac{(P_{\lambda} ({\bf w}) - p_{\bf w})^2}{p_{\bf w}}.
\end{align*}
It follows from the normal approximation for individual word counts that, in distribution,
\begin{align*}
{\sqrt n} \sum \frac{1}{p_{\bf w}} \Bigg(\frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) \Bigg(\frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})\Bigg) \rightarrow 0 \ {\rm as} \ n \rightarrow \infty.
\end{align*}
For $0 < \lambda < 1$, under the assumption that $\frac{(P_{\lambda} ({\bf w}) - p_{\bf w})}{p_{\bf w}}$ is not constant in ${\bf w}$, this expression has a normal distribution with mean 0 and variance $2 (\Sigma^*_{\lambda})^2$, where $(\Sigma^*_{\lambda})$ is given in (3). Theorem 2.3 is proved.
Proof of Theorem 2.4
The first part of Theorem 2.4 has been proved in Theorem 2.1 in Reinert et al. (2009). We only present the outline for the proof of the second part. Using Taylor expansion, it is straightforward to show that for any a ≠ 0 and (x, y) in the neighborhood of (0,0),
\begin{align*}
\frac{(x + a)(y + a)}{\sqrt{(x + a)^2 + (y + a)^2}} = \frac{\mid a \mid}{\sqrt 2} + \frac{{\rm sign} (a)}{2 \sqrt 2} (x + y) + O (x^2 + y^2),
\end{align*}
where $O (x^2 + y^2)$ indicates a term such that there exists a constant $C$ with
\begin{align*}
\mid O (x^2 + y^2)\mid \leq C (x^2 + y^2).
\end{align*}
As in the proof of Theorem 2.2, under the assumption that $P_{\lambda} ({\bf w}) - p_{\bf w}$ is not constant in ${\bf w}$, we see that ${\sqrt n} \left(\frac{D_2^S}{n} - \sum\nolimits_{{\bf w} \in {\cal A}^k} \frac{\mid P_{\lambda} ({\bf w}) - p_{\bf w} \mid}{\sqrt 2} \right)$ is asymptotically normal with mean 0 and variance $2 (\Sigma^S_{\lambda})^2$.
For the last assertion, we refine the Taylor expansion to
\begin{align*}
\frac{(x + a)(y + a)}{\sqrt{(x + a)^2 + (y + a)^2}} \approx \frac{\mid a \mid}{\sqrt 2} + \frac{{\rm sign} (a)}{2 \sqrt 2} (x + y) - \frac{3}{8 \sqrt 2 \mid a \mid} \left\{x^2 - 2 x y + y^2 \right\},
\end{align*}
and using $a = P_{\lambda} ({\bf w}) - p_{\bf w}$, if $P_{\lambda} ({\bf w}) - p_{\bf w} \neq 0$, $x = \frac{X_{\bf w}}{n} - P_{\lambda} ({\bf w})$, and $y = \frac{Y_{\bf w}}{n} - P_{\lambda} ({\bf w})$, taking expectations completes the proof of Theorem 2.4.
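The second-order expansion used in this proof can be sanity-checked numerically. The sketch below compares the exact function with the quadratic approximation at a few small displacements and checks that the error shrinks faster than quadratically when the step is halved. It is an illustrative check only; the function names and test points are assumptions made here.

```python
import math


def exact(x, y, a):
    """Exact value of (x + a)(y + a) / sqrt((x + a)^2 + (y + a)^2)."""
    u, v = x + a, y + a
    return u * v / math.sqrt(u * u + v * v)


def taylor2(x, y, a):
    """Second-order Taylor approximation around (x, y) = (0, 0), for a != 0."""
    sign = math.copysign(1.0, a)
    root2 = math.sqrt(2.0)
    return (
        abs(a) / root2
        + sign / (2.0 * root2) * (x + y)
        - 3.0 / (8.0 * root2 * abs(a)) * (x * x - 2.0 * x * y + y * y)
    )


def max_error(a, h):
    """Largest absolute approximation error over a few points at distance ~h."""
    points = [(h, 0.0), (0.0, h), (h, -h), (-h, h), (h, h)]
    return max(abs(exact(x, y, a) - taylor2(x, y, a)) for x, y in points)


for step in (0.02, 0.01, 0.005):
    print(step, max_error(0.3, step))
```

Halving the step should reduce the error by roughly a factor of eight, which is what a correct second-order expansion with a third-order remainder would give.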
Proof of Theorem 2.5
The proofs of the three equations are roughly the same, and thus we only give the proof for the first equation.
Note that under alternative model I, we expect the $k$-tuple counts for the two sequences to be more correlated than those for two random sequences. Therefore, we use a one-sided test. For fixed type I error $\alpha$, based on Theorem 2.2 (a), we find $z_{\alpha}$ such that $P \{Z_1 \geq z_{\alpha}\} = \alpha$. Under the null hypothesis that $\lambda = 1$, $n^{-2} D_2 - \sum p_{\bf w}^2$ has approximate mean zero, whereas under the alternative $\lambda < 1$, the approximate mean of $n^{-2} D_2 - \sum p_{\bf w}^2$ will not be zero. We reject the null hypothesis if $Z_1 > z_{\alpha}$, which is approximately equivalent to $D_2 > n^2 A (1) + z_{\alpha} \sqrt{n^3}$. The power for $D_2$ is
\begin{align*}
1 - \beta &= P_{\lambda} \left(D_2 > n^2 A (1) + z_{\alpha} \sqrt{n^3}\right) \\
&= P_{\lambda} \left(\frac{D_2 - n^2 A (\lambda)}{\sqrt{2 n^3 (\Sigma_{\lambda})^2}} > \frac{n^2 A (1) + z_{\alpha} \sqrt{n^3} - n^2 A (\lambda)}{\sqrt{2 n^3 (\Sigma_{\lambda})^2}} \right) \\
&\approx 1 - \Phi \left(C (\lambda)\right).
\end{align*}
The last approximation holds because of Theorem 2.2 (b).
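As an illustration of how the last display translates into a computation, the sketch below evaluates $1 - \Phi(\cdot)$ at the standardized threshold on the right-hand side of the second line. It is not the authors' power program: the quantities $A(1)$, $A(\lambda)$, $\Sigma_{\lambda}$, and $z_{\alpha}$ are taken as inputs that would have to be computed from the formulas in the main text, and the function and parameter names, as well as the numbers in the usage lines, are placeholders.

```python
import math
from statistics import NormalDist


def power_d2(n, a_null, a_alt, sigma_alt, z_alpha):
    """Approximate power of the one-sided D2 test.

    n         : sequence length
    a_null    : A(1), asymptotic mean constant under the null (input)
    a_alt     : A(lambda), asymptotic mean constant under the alternative (input)
    sigma_alt : Sigma_lambda, so the limiting variance is 2 * sigma_alt**2 (input)
    z_alpha   : critical value with P(Z_1 >= z_alpha) = alpha under the null (input)
    """
    threshold = n ** 2 * a_null + z_alpha * math.sqrt(n ** 3)
    standardized = (threshold - n ** 2 * a_alt) / math.sqrt(2 * n ** 3 * sigma_alt ** 2)
    return 1.0 - NormalDist().cdf(standardized)


# Placeholder inputs, chosen only to show the call pattern.
for length in (1000, 4000, 16000):
    print(length, power_d2(length, a_null=0.10, a_alt=0.11, sigma_alt=1.0, z_alpha=1.645))
```

Dividing numerator and denominator by $n^{3/2}$ shows that the standardized threshold equals $(\sqrt{n}\,(A(1) - A(\lambda)) + z_{\alpha}) / (\sqrt{2}\, \Sigma_{\lambda})$, so under these assumptions the approximate power grows with $n$ whenever $A(\lambda) > A(1)$.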
A.2. Proofs of Theorems 3.1, 3.2, and 3.3
Proof of Theorem 3.1
We calculate ${\mathbb E} (X_{\bf w} Y_{{\bf w}^{\prime}})$ for any two words ${\bf w}$ and ${\bf w}^{\prime}$ of length $k$. Let
\begin{align*}
\begin{cases}
I^A_{\bf w} (i) = I (A_i A_{i + 1} \cdots A_{i + k - 1} = w_1 w_2 \cdots w_k)\\ \\
I^B_{{\bf w}^{\prime}} (i) = I (B_i B_{i + 1} \cdots B_{i + k - 1} = w^{\prime}_1 w^{\prime}_2 \cdots w^{\prime}_k)
\end{cases} ,
\end{align*}
Then part (b) can be easily deduced from part (a).
Parts (c) and (d) can be proved by the definition of $D_2$ and $D_2^*$, respectively, and by part (b) above by letting ${\bf w} = {\bf w}^{\prime}$. The recursion follows as in Reinert et al. (2009).
Proof of Theorem 3.2
The proofs of parts (a), (b), and (c) of the theorem are similar to those of Theorems 2.2–2.4, respectively.
Therefore, the first term in Equation 12 tends to 0 as $n$ tends to infinity. The second term tends to a normal distribution with mean 0 and variance $2 (\Lambda_{\lambda})^2$. Part (a) is proved. Parts (b) and (c) follow directly from the normal approximation (13).
Proof of Theorem 3.3
The proof of this theorem is similar to the proof of Theorem 2.5. For illustration only, we prove the claim for the power of $D_2^*$. From Theorem 3.2 (b) with $\lambda = 1$, we can choose ${\widetilde z}^*_{\alpha}$ such that
\begin{align*}
P ({\widetilde Z}^{*}_1 \geq {\widetilde z}^*_{\alpha}) = \alpha.
\end{align*}
We reject the null hypothesis that the two sequences are not related if $D_2^* \geq {\widetilde z}^*_{\alpha}$. We use a one-sided test since the mean of $D_2^*$ is expected to be greater than 0 under the alternative model. From Theorem 3.2 (b), the test has an approximate type I error $\alpha$ under the null hypothesis $\lambda = 1$.
The power is the probability that the null model is rejected under alternative model II with $\lambda < 1$. Thus, the power is
\begin{align*}
1 - \beta = P_{\lambda} (D_2^* \geq {\widetilde z}^*_{\alpha}) \approx P \{{\widetilde Z}^*_{\lambda} \geq {\widetilde z}^*_{\alpha} \}.
\end{align*}
Appendix B: Limit Distributions of $D_2$, $D_2^*$, and $D_2^S$ When the Two Sequences Have Different Letter Frequencies, Motif Densities, and Sequence Lengths
For simplicity of presentation, we have so far assumed that the two sequences have the same letter frequency, motif density, and sequence length. The theorems in the main text can be easily extended to the general situation. Let $n_X$ be the length and $1 - \lambda_X$ be the motif density for sequence A. Let $p^X_{\bf w}$ be the probability of pattern ${\bf w}$ under the null model and $P^X_{\lambda_X} ({\bf w})$ be the probability of word pattern ${\bf w}$ as calculated in subsection 2.1 for sequence A. Let $(\Sigma^X_{\lambda_X})^2$ and $(\Sigma_{\lambda_X}^{X*})^2$ be defined as in equations 2 and 3, respectively, with $\lambda$ replaced by $\lambda_X$. Similar notation can be defined for sequence B; here we use the superscript or subscript $Y$. We define $D_2$ and $D_2^S$ as above, replacing $p_{\bf w}$ by $p^X_{\bf w}$ or $p^Y_{\bf w}$ as appropriate. Let $C_{XY} = n_X / n_Y$. For simplicity of presentation, we also define $C_{YX} = n_Y / n_X = 1 / C_{XY}$. Under the general model, we redefine $D_2^*$ as
\begin{align*}
D_2^* = \sum_{{\bf w} \in {\cal A}^k} \frac{{\widetilde X}_{\bf w} {\widetilde Y}_{\bf w}}{\sqrt{n_X p^X_{\bf w}} \sqrt{n_Y p^Y_{\bf w}}}.
\end{align*}
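To make the general definition concrete, the following sketch computes this statistic for two sequences with possibly different lengths and letter frequencies. It is a minimal illustration under stated assumptions, not the authors' implementation: the centralized counts are taken to be ${\widetilde X}_{\bf w} = X_{\bf w} - n_X p^X_{\bf w}$ (and likewise for $Y$), $n_X$ and $n_Y$ are taken as the numbers of overlapping $k$-tuple positions, word probabilities are products of letter frequencies (the IID null model), and all letter frequencies are assumed positive. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def word_probability(word, freqs):
    """Probability of a word under an IID letter model with the given frequencies."""
    prob = 1.0
    for letter in word:
        prob *= freqs[letter]
    return prob


def d2_star_general(seq_a, seq_b, k, freqs_a, freqs_b):
    """General D2* with sequence-specific lengths and letter frequencies."""
    n_x = len(seq_a) - k + 1  # assumed: number of k-tuple positions in A
    n_y = len(seq_b) - k + 1  # assumed: number of k-tuple positions in B
    counts_a = Counter(seq_a[i:i + k] for i in range(n_x))
    counts_b = Counter(seq_b[i:i + k] for i in range(n_y))
    total = 0.0
    for letters in product(sorted(freqs_a), repeat=k):
        word = "".join(letters)
        p_x = word_probability(word, freqs_a)
        p_y = word_probability(word, freqs_b)
        x_tilde = counts_a.get(word, 0) - n_x * p_x  # centralized count in A
        y_tilde = counts_b.get(word, 0) - n_y * p_y  # centralized count in B
        total += x_tilde * y_tilde / (math.sqrt(n_x * p_x) * math.sqrt(n_y * p_y))
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(d2_star_general("ACGTACGTAC", "ACGTTTACGT", 2, uniform, uniform))
```

Enumerating all words of length $k$ keeps the sketch close to the formula; for larger $k$ one would iterate only over observed words and handle the remaining terms in closed form.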
From the law of large numbers we deduce that, in distribution and almost surely, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {{\widetilde X}_{\bf w}} {n_X} \rightarrow P^X_{\lambda_X} ({\bf w})- p^X_{\bf w}$$
\end{document}, and a similar statement holds for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {{\widetilde Y}_{\bf w}} {n_Y}$$
\end{document}. Hence, in connection with the asymptotic means (see Theorem 5.1), we abbreviate \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
A^g (\lambda_X , \lambda_Y)& = \sum_ {{\bf w} \in {\cal A} ^k} P^X_ {\lambda_X} ({\bf w})P^Y_ {\lambda_Y} ({\bf w}), \\ A^ {g*} (\lambda_X , \lambda_Y)& = \sum_ {{\bf w} \in {\cal A} ^k} \frac {(P^X_ {\lambda_X} ({\bf w})- p^X_ {\bf w})(P^Y_ {\lambda_Y} ({\bf w})- p^Y_ {\bf w} )} {\sqrt {p^X_ {\bf w} p^Y_ {\bf w}}} , \\ A^{gS} (\lambda_X , \lambda_Y)& = \sum_ {{\bf w} \in {\cal A} ^k} \frac {(P^X_ {\lambda_X} ({\bf w})- p^X_ {\bf w})(P^Y_ {\lambda_Y} ({\bf w})- p^Y_ {\bf w} )} {\sqrt {C_ {XY} (P^X_ {\lambda_X} ({\bf w})- p^X_ {\bf w})^2 + C_ {YX} (P^Y_ {\lambda_Y} ({\bf w})- p^Y_ {\bf w})^2}} ,
\end{align*}
\end{document}
where, here and in the following, the superscript “g” indicates the general model. In analogy to Theorems 2.1, 2.2, 2.3, 2.4, and 2.5, we have the following theorems. Their proofs are very similar to the ones presented in the article and are therefore omitted.
Theorem 5.1
Under alternative model I for the two sequences as described above, the expectations of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D^S_2$$
\end{document} can be calculated as follows. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{\mathbb E} (D_2)& = n_X n_Y A^g (\lambda_X , \lambda_Y), \\ {\mathbb E} (D_2^ {*})& = \sqrt {n_X n_Y} A^ {g*} (\lambda_X , \lambda_Y), \\ \text{and} \ \lim_ {n \rightarrow \infty} \frac {{\mathbb E} (D_2^ {S} )} {\sqrt {n_X n_Y}} & = A^ {gS} (\lambda_X , \lambda_Y).
\end{align*}
\end{document}
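As a numerical illustration of the quantities appearing in Theorem 5.1, the following Python sketch evaluates the three abbreviations A^g, A^{g*}, and A^{gS} from user-supplied word probabilities. The function name and the dictionary arguments are hypothetical. Here `P_x` and `P_y` stand for the word probabilities under the alternative model, and `p_x` and `p_y` for the background word probabilities. Treating a term whose numerator and denominator both vanish as zero is an assumption of this sketch.

```python
from math import sqrt


def asymptotic_means(P_x, P_y, p_x, p_y, n_x, n_y):
    """Return (A^g, A^{g*}, A^{gS}) for dicts indexed by the same words."""
    c_xy, c_yx = n_x / n_y, n_y / n_x
    a_g = a_star = a_s = 0.0
    for w in p_x:
        dx = P_x[w] - p_x[w]  # shift of the word probability in X
        dy = P_y[w] - p_y[w]  # shift of the word probability in Y
        a_g += P_x[w] * P_y[w]
        a_star += dx * dy / sqrt(p_x[w] * p_y[w])
        denom = sqrt(c_xy * dx ** 2 + c_yx * dy ** 2)
        if denom > 0:  # assumption: a 0/0 term contributes 0
            a_s += dx * dy / denom
    return a_g, a_star, a_s
```

Multiplying the first returned value by nX nY, and the second by the square root of nX nY, gives the expectations of D2 and D2* stated in the theorem. The third value is the limit of the rescaled expectation of D2S.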
The limiting distributions of D2, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} under the general model are given as follows.
Theorem 5.2
Assume that in the background model not all letters are equally likely.
a. Suppose λX = λY = 1 (the null model that the sequences are independent). Then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} \left({n_X n_Y} \right) ^ \frac {1} {4} \bigg(\frac {D_2} {n_X n_Y} - \sum_ {{\bf w} \in {\cal A} ^k} p^X_ {\bf w} p^Y_ {\bf w} \bigg) = Z^g_1 ,
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Z^g_1$$
\end{document} has normal distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal N} (0 , {\sqrt {C_{YX}}} (\Sigma^X_1)^2 + {\sqrt {C_{XY}}} (\Sigma^Y_1)^2)$$
\end{document}. Here the asymptotics is valid when the sequence length tends to infinity with alphabet size, motif length, and word length kept fixed.
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$Z^g_{\lambda_X , \lambda_Y}$$
\end{document} has normal distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$${\cal N} (0 , {\sqrt {C_{YX}}} (\Sigma^X_{\lambda_X})^2 + {\sqrt {C_{XY}}} (\Sigma^Y_{\lambda_Y})^2 )$$
\end{document}. Here the asymptotics is valid when the sequence length tends to infinity with alphabet size, motif length, and word length kept fixed.
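The centring constant in part a can be checked by simulation. The following Python sketch draws independent i.i.d. sequence pairs with different letter frequencies and averages D2/(nX nY) over the pairs. For i.i.d. letters, the sum of the products of word probabilities over all k-letter words equals the k-th power of the corresponding sum over single letters, which gives a simple reference value. All function names are hypothetical, and nX and nY are again taken to be the numbers of k-tuple positions, which is an assumption.

```python
import random
from collections import Counter


def d2(x, y, k):
    """D2: number of matching k-tuple pairs between x and y."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[w] * cy[w] for w in cx)


def simulate_null_mean(probs_x, probs_y, len_x, len_y, k, reps, seed=0):
    """Average D2 / (n_X n_Y) over independent i.i.d. sequence pairs."""
    rng = random.Random(seed)
    letters_x, weights_x = zip(*probs_x.items())
    letters_y, weights_y = zip(*probs_y.items())
    n_x, n_y = len_x - k + 1, len_y - k + 1
    total = 0.0
    for _ in range(reps):
        x = "".join(rng.choices(letters_x, weights=weights_x, k=len_x))
        y = "".join(rng.choices(letters_y, weights=weights_y, k=len_y))
        total += d2(x, y, k) / (n_x * n_y)
    return total / reps


def null_centre(probs_x, probs_y, k):
    """Sum over k-letter words of p^X_w p^Y_w under i.i.d. letters."""
    return sum(probs_x[a] * probs_y[a] for a in probs_x) ** k
```

Comparing `simulate_null_mean(...)` with `null_centre(...)` for increasing sequence lengths gives a rough impression of how quickly the average concentrates. The sketch does not estimate the limiting variance.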
For \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^*$$
\end{document}, we have:
Theorem 5.3
a. Suppose λX = λY = 1 (the null model that the sequences are independent). Then, in distribution,\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} D^*_2 = Z^ {g*} _1 = \sum_ {{\bf w} \in {\cal A} ^k} \frac {Z^ {(g1 )} _ {\bf w} Z^ {(g2 )} _ {\bf w}} {\sqrt {p^X_ {\bf w}} \sqrt {p^Y_ {\bf w}}} ,
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(g1 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\{Z^{(g2 )}_{\bf w} , {\bf w} \in {\cal A}^k \}$$
\end{document} are independent and have mean 0 normal distributions (with non-trivial covariance matrix).
b. Suppose 0 < λ < 1 (the alternative model I), and that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$\frac {(P_{\lambda} ({\bf w})- p_{\bf w})} {p_{\bf w}}$$
\end{document} is not constant in w. Then, in distribution, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\lim_ {n \rightarrow \infty} \left({n_X n_Y} \right) ^ \frac {1} {4} \left(\frac {D^*_2} {\sqrt {n_X n_Y}} - A^ {g*} (\lambda_X , \lambda_Y)\right) = Z_ {\lambda_X , \lambda_Y} ^ {g*} ,
\end{align*}
\end{document}
The following theorem gives the approximate distribution of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}
\begin{document}
$$D_2^S$$
\end{document} under the null and the alternative models for the general situation.
The alternative model II can equally be extended to the situation of different letter frequencies in the two sequences; we omit the details here.
Acknowledgments
L.W. was supported by NIH grant no. P50 HG 002790 and by NIH grant no. R21AG032743. G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. F.S. was supported by NIH grants no. P50 HG 002790 and R21AG032743 and NSFC grants 60928007 and 60805010. M.S.W. was supported by NIH grant no. P50 HG 002790 and by NIH grant no. R21AG032743.
Disclosure Statement
No competing financial interests exist.