Fractional Poisson process: Long-range dependence in DNA sequences

Abstract

Modeling DNA sequences with stochastic models and developing statistical methods to analyze the data produced by the many DNA sequencing projects are challenging problems for statisticians and biologists. A prominent example is the study of long-range dependence in DNA sequences, which usually requires transforming the DNA sequence into a numerical time series. Whether the type of transformation can alter the conclusion about long-range dependence in the DNA sequence is still debated in the literature. Here we model the DNA sequence with the Fractional Poisson Process, propose a method based on moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence, and analyze the long-range dependence in various DNA sequences by the detrended fluctuation analysis method.

Keywords

Long-range dependence, fractional Poisson process, DNA sequence, DFA

1. Introduction

There is a commonly accepted view that our world is complex and correlated. Many physical and biological systems exhibit complex behavior characterized by long-range power-law correlations. For this reason, a large number of studies in different fields of science have investigated systems with long-range interactions and natural sequences with non-trivial information content. Among the most prominent examples are DNA sequences (see Peng et al., 1992; Karmeshu & Krishnamachari, 2004; Crato et al., 2010; Crato et al., 2011; Melnik & Usatenko, 2014; Sutthibutpong et al., 2016; Ghorbani et al., 2018).

Long-range statistical dependence in DNA sequences means that base occurrences tend to co-vary at regions separated by a long distance. Karlin and Brendel (1993) show that the mosaic character of DNA, consisting of patches of different compositions, can fully account for the apparent long-range dependence in DNA sequences. Peng et al. (1994) address the question of whether long-range dependence in DNA sequences may be a trivial consequence of the known mosaic structure (“patchiness”) of DNA. Oliver et al. (2003) explore the phylogenetic distribution of large-scale genome patchiness by considering the deviations from power-law behavior in long-range dependence. Karmeshu and Krishnamachari (2004) highlight the properties of parametric and non-parametric entropy measures and focus on a few applications that use entropic measures; using the maximum entropy principle to capture the well-known long-range dependence in DNA sequences, they draw a link between Tsallis entropy and power laws. Cocho et al. (2014) show that the long-range correlation in bacterial genomic sequences is mainly due to a mixing of heterogeneous statistics at different codon positions. Understanding the dynamics of gene expression could prove crucial in unraveling the physical complexities involved in this process. Ghorbani et al. (2018) report on the scaling properties of gene expression time series in Escherichia coli and Saccharomyces cerevisiae and investigate the individual gene expression dynamics and the cross-dependency between them in the context of the gene regulatory network. They found that gene expression displays fractal and long-range dependence characteristics.

One of the most appropriate methods for studying long-range correlations in DNA sequences is detrended fluctuation analysis (DFA) (see Peng et al., 1994; Podobnik et al., 2007). DFA, proposed by Peng et al. (1992), is a method for analyzing time series that appear to be long-memory processes (see Linhares, 2016). It has been successfully applied in different fields, such as DNA sequences (see Peng et al., 1992), economic time series (see Liu et al., 1997), heart rate variability analysis (see Yeh et al., 2006), and long-time weather records (see Koscielny-Bunde et al., 1998). The objective of this technique is to evaluate the statistical fluctuations of the series at different block lengths in order to obtain a scaling exponent $\alpha$ . The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time).

Modeling DNA sequences with stochastic models and developing statistical methods to analyze the enormous amount of data that results from the many DNA sequencing projects are challenging problems for statisticians and biologists. To analyze long-range dependence, it is necessary to transform the DNA sequence into a numerical time series. Different transformations have been proposed in the literature to study the long-range dependence in DNA sequences (see Peng et al., 1992; Stanley et al., 1999; Guharay et al., 2000; Lopes & Nunes, 2006; Podobnik et al., 2007; Crato et al., 2010; Crato et al., 2011), the principal one being the DNA walk. However, if we take the naive approach of arbitrarily assigning numerical values (scales) to the bases and then proceed with the long-range dependence analysis, the result may depend on the particular assignment of the numerical values. Whether the type of transformation can alter the conclusion about long-range dependence in the DNA sequence is still debated in the literature. Here we propose the Fractional Poisson Process (see Laskin, 2003) to model DNA sequences, propose a method to estimate the parameters of the Fractional Poisson Process, and analyze the long-range dependence by the DFA method.

The paper is organized as follows. In Section 2, we describe the DFA method. Section 3 presents the Fractional Poisson Process. In Section 4, we propose a method based on the method of moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence, and we present simulations and the analysis of long-range dependence in several DNA sequences. Section 5 concludes the paper.

2. DFA method

Figure 1.

The summary of the proposed work.

To investigate the long-range dependence in DNA sequences, we propose the Fractional Poisson Process to model the DNA sequence and we analyze the long-range dependence by the DFA method. In this section, we therefore present the DFA method. Figure 1 summarizes the proposal of this work.

In short-memory time series, the autocovariance function decays rapidly, perhaps exponentially, and the spectral density function is at least bounded and possibly very smooth. In stationary time series with long-range dependence, the autocovariances are not summable and the spectral density function is unbounded (see Beran, 1994). One of the most appropriate methods proposed for the study of long-range correlations in DNA sequences is detrended fluctuation analysis (DFA) (see Peng et al., 1994).

To apply the Detrended Fluctuation Analysis method (see Peng et al., 1992; Linhares, 2016) to a given time series $\mathbf{N}=\{N(j)\}_{j=1}^{l}$ , the following steps are necessary.

In the first step, the running sum of the observed variable $\mathbf{N}=\{N(j)\}_{j=1}^{l}$ is calculated,

$\displaystyle Y_{t}=\sum_{j=1}^{t}(N(j)-\overline{\mathbf{N}}),$

for each $t\in\{1,2,\cdots,l\}$ , where $\overline{\mathbf{N}}$ is the average value of $\{N(j)\}_{j=1}^{l}$ . In the second step, we divide the time series $\{Y_{j}\}_{j=1}^{l}$ into $\big{\lfloor}\frac{l}{k}\big{\rfloor}$ non-overlapping blocks, each containing $k$ observations. For each block, we fit a least-squares line to the data. We denote by $\widehat{Y}_{j}^{k}$ , for $j=1,\cdots,l$ , the fitted value at $j$ obtained from the block of length $k$ that contains it. After that, we detrend the time series $\{Y_{j}\}_{j=1}^{l}$ , that is, in each block we calculate

$\displaystyle Z^{k}_{j}=Y_{j}-\widehat{Y}_{j}^{k},\quad\mbox{for all }j\in\{1,\cdots,l\}.$ (1)

In the third step, for each $k\in\{4,5,\cdots,g(l)\}$ , we calculate the root mean square fluctuation of the new sequence,

$\displaystyle\mathrm{F}(k)=\sqrt{\frac{1}{\tilde{l}}\sum_{j=1}^{\tilde{l}}(Z^{k}_{j})^{2}},$ (2)

where $\tilde{l}=k\cdot\big{\lfloor}\frac{l}{k}\big{\rfloor}$ .

Under such conditions, the smoothed fluctuations can be characterized by a scaling exponent $\mathrm{\alpha}$ , which is the slope of the line obtained by regressing $\ln(\hat{F}(k))$ on $\ln(k)$ ,

$\displaystyle\hat{F}(k)\sim\varphi k^{\mathrm{\alpha}},$ (3)

where

• $0<{\alpha}<0.5$ indicates short-range dependence;

• ${\alpha}=0.5$ indicates absence of long-range dependence;

• $0.5<{\alpha}<1$ indicates long-range dependence.

By taking the logarithm of the relationship in Eq. (3), we obtain $\ln(\mathrm{\hat{F}}(k))\sim\ln(\varphi)+\mathrm{\alpha}\ln(k)$ . Then by the least squares method, the estimator for the exponent $\mathrm{\alpha}$ is given by

$\displaystyle\hat{\mathrm{\alpha}}_{\mathrm{DFA}}=\frac{\displaystyle\sum_{k=4}^{g(l)}(\ln(k)-\overline{x})\ln(\mathrm{\hat{F}}(k))}{\displaystyle\sum_{k=4}^{g(l)}(\ln(k)-\overline{x})^{2}},$ (4)

where $\displaystyle\overline{x}=\frac{1}{g(l)-3}\sum_{k=4}^{g(l)}\ln(k)$ is the mean of the $g(l)-3$ regressors $\ln(k)$ , $k\in\{4,5,\cdots,g(l)\}$ . The usual choice of $g(l)$ is $\big{\lfloor}\frac{l}{10}\big{\rfloor}$ , where $\lfloor\cdot\rfloor$ indicates the integer part function.

The 95% confidence interval for the parameter $\alpha$ is given by

$\displaystyle\left[\hat{\mathrm{\alpha}}_{\mathrm{DFA}}-t_{\left(0.975,g(l)-5\right)}\sqrt{\frac{\textit{MSE}}{\displaystyle\sum\limits_{k=4}^{g(l)}(\ln(k)-\overline{x})^{2}}}\,;\;\hat{\mathrm{\alpha}}_{\mathrm{DFA}}+t_{\left(0.975,g(l)-5\right)}\sqrt{\frac{\textit{MSE}}{\displaystyle\sum\limits_{k=4}^{g(l)}(\ln(k)-\overline{x})^{2}}}\right]$

where $t_{\left(0.975,g(l)-5\right)}$ is the 0.975 quantile of Student’s t distribution with $g(l)-5$ degrees of freedom, $\overline{x}$ is defined as above, and MSE is the mean squared error of the regression.
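The steps of this section can be sketched in code. The following is a minimal, unoptimized illustration in plain Python, assuming the block lengths $k\in\{4,\cdots,g(l)\}$ with $g(l)=\lfloor l/10\rfloor$ ; the function names (`detrended_residuals`, `fluctuation`, `dfa_exponent`) are hypothetical and not taken from any library. It returns the slope estimate together with its standard error, so the confidence interval is obtained by multiplying the standard error by the appropriate Student’s t quantile, which is not computed here.

```python
import math
import random


def detrended_residuals(block):
    """Residuals of one block after removing its least-squares line (Eq. 1)."""
    k = len(block)
    x_mean = (k - 1) / 2.0
    y_mean = sum(block) / k
    sxx = sum((i - x_mean) ** 2 for i in range(k))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(block))
    slope = sxy / sxx
    return [y - (y_mean + slope * (i - x_mean)) for i, y in enumerate(block)]


def fluctuation(profile, k):
    """Root mean square fluctuation F(k) over floor(l/k) blocks (Eq. 2)."""
    n_blocks = len(profile) // k
    total = 0.0
    for b in range(n_blocks):
        block = profile[b * k:(b + 1) * k]
        total += sum(r * r for r in detrended_residuals(block))
    return math.sqrt(total / (n_blocks * k))


def dfa_exponent(series, k_min=4, k_max=None):
    """Return (alpha_hat, standard_error) from regressing ln F(k) on ln k."""
    l = len(series)
    if k_max is None:
        k_max = l // 10  # assumed choice g(l) = floor(l / 10)
    mean = sum(series) / l
    profile, acc = [], 0.0
    for value in series:  # running sum of the centered series
        acc += value - mean
        profile.append(acc)
    xs = [math.log(k) for k in range(k_min, k_max + 1)]
    ys = [math.log(fluctuation(profile, k)) for k in range(k_min, k_max + 1)]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    alpha = sum((x - x_bar) * y for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - alpha * x_bar
    sse = sum((y - intercept - alpha * x) ** 2 for x, y in zip(xs, ys))
    mse = sse / (len(xs) - 2)  # residual degrees of freedom: points minus 2
    return alpha, math.sqrt(mse / sxx)


# Usage on simulated white noise (illustrative data, not from the paper).
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
alpha_hat, std_err = dfa_exponent(white_noise)
```

For uncorrelated noise the estimate is expected to lie near 0.5, while the running sum of the same noise should give a clearly larger exponent.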

3. Fractional Poisson process

In this section, we present the Fractional Poisson Process of parameter $H$ , which is a renewal process with Mittag-Leffler waiting times. This process has been invented, developed, and encouraged for applications by Laskin (2003).

Definition 3.1 (Fractional Poisson Process (FP)).

The Fractional Poisson process $N_{H}^{\lambda}(t)$ is a renewal process with probability $P_{(H,\lambda)}(n,t)$ that $n$ items $(n=0,1,2,\ldots)$ arrive by time $t$ . The probability function is given by

$\displaystyle P_{(H,\lambda)}(n,t)=\frac{(\lambda t^{H})^{n}}{n!}\sum_{k=0}^{\infty}\frac{(k+n)!}{k!}\frac{(-\lambda t^{H})^{k}}{\Gamma(H(k+n)+1)},\quad 0<H\leqslant 1.$ (5)

The function $P_{(H,\lambda)}(n,t)$ gives the probability that we observe $n$ events in the time interval $[0,t]$ . When $H=1$ , $P_{(H,\lambda)}(n,t)$ reduces to the standard Poisson distribution. Thus, Eq. (5) can be considered a fractional generalization of the standard Poisson distribution. The renewal function of the Fractional Poisson Process is given by

$\displaystyle M_{(H,\lambda)}(t)=E(N_{H}^{\lambda}(t))=\frac{\lambda t^{H}}{\Gamma(H+1)}.$ (6)
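As a numerical illustration of Eqs. (5) and (6), the sketch below evaluates the probability function by truncating the series. The function name `fpp_pmf` and the truncation length `terms` are assumptions made for this example. Because the series alternates in sign, this direct summation is only reliable for moderate values of $\lambda t^{H}$ and for $H$ not too close to zero; it is not meant for the large rates that appear later in the data analysis.

```python
import math


def fpp_pmf(n, t, H, lam, terms=200):
    """Truncated series for P_(H,lambda)(n, t) in Eq. (5); requires lam * t**H > 0."""
    x = lam * t ** H
    log_x = math.log(x)
    total = 0.0
    for k in range(terms):
        # log of (k+n)!/k! * x^k / Gamma(H(k+n)+1), computed with lgamma
        log_mag = (math.lgamma(k + n + 1) - math.lgamma(k + 1)
                   + k * log_x - math.lgamma(H * (k + n) + 1))
        total += (-1) ** k * math.exp(log_mag)
    # prefactor x^n / n!
    return math.exp(n * log_x - math.lgamma(n + 1)) * total


# Usage: probabilities and mean for H = 0.7, lambda = 1, t = 1.
probabilities = [fpp_pmf(n, 1.0, 0.7, 1.0) for n in range(60)]
mean_count = sum(n * p for n, p in enumerate(probabilities))
renewal_function = 1.0 * 1.0 ** 0.7 / math.gamma(0.7 + 1)  # Eq. (6)
```

With $H=1$ the same function reproduces the ordinary Poisson probabilities, and the mean computed from the truncated probabilities can be compared with the renewal function in Eq. (6).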

The time between two successive arrivals is a random variable called the waiting time. The waiting time probability distribution function is an important attribute of any arrival or counting random process. The waiting time probability density $\psi(\tau)$ is the probability density of the event that an arrival occurs at time $t_{k}=t_{k-1}+\tau$ , given that the previous one occurred at time $t_{k-1}$ .

Definition 3.2 (Waiting Time Distribution).

The waiting time probability function $\psi_{H}(\tau)$ of the Fractional Poisson process is given by

$\displaystyle\psi_{H}(\tau)=\lambda\tau^{H-1}E_{H,H}(-\lambda\tau^{H}),\quad\tau\geqslant 0,\;0<H\leqslant 1,$ (7)

where the generalized two-parameter Mittag-Leffler function is

$\displaystyle E_{\alpha,\beta}(z)=\sum_{m=0}^{\infty}\frac{z^{m}}{\Gamma(\alpha m+\beta)}.$ (8)

The density $\psi_{H}(\tau)$ defined by Eq. (7) is a fractional generalization of the well-known exponential probability distribution: when $H=1$ , it reduces to $\psi_{1}(\tau)=\lambda e^{-\lambda\tau}$ , because $E_{1,1}(z)=e^{z}$ . For more information about the Fractional Poisson Process, see Laskin (2003).
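The waiting time density of Eq. (7) can be evaluated by truncating the Mittag-Leffler series of Eq. (8). The sketch below is a minimal illustration; the names `mittag_leffler` and `waiting_time_density` and the truncation length are assumptions for this example, and the plain series is only adequate for moderate $|z|$ .

```python
import math


def mittag_leffler(z, alpha, beta, terms=200):
    """Truncated series for the two-parameter Mittag-Leffler function, Eq. (8)."""
    if z == 0.0:
        return 1.0 / math.gamma(beta)
    log_abs_z = math.log(abs(z))
    total = 0.0
    for m in range(terms):
        sign = -1.0 if (z < 0 and m % 2 == 1) else 1.0
        total += sign * math.exp(m * log_abs_z - math.lgamma(alpha * m + beta))
    return total


def waiting_time_density(tau, H, lam):
    """Waiting time density psi_H(tau) of Eq. (7), for tau > 0."""
    return lam * tau ** (H - 1) * mittag_leffler(-lam * tau ** H, H, H)


# Usage: for H = 1 the density coincides with the exponential density.
fractional_value = waiting_time_density(0.5, 0.8, 2.0)
exponential_value = waiting_time_density(0.5, 1.0, 2.0)
```

The case $H=1$ gives a direct check against $\lambda e^{-\lambda\tau}$ .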

4. DNA sequence analysis

DNA consists of two long polymers of simple units (nucleotides). Each nucleotide in DNA contains one of four possible nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T). A base pair (bp) is a unit consisting of two nucleobases linked together by hydrogen bonds, and a kilobase (kb) is a unit of measurement in molecular biology equal to 1000 base pairs of DNA. A DNA sequence is a series of nucleotide letters representing the primary structure of a DNA chain (real or hypothetical) with the ability to carry information. Here we consider DNA sequences corresponding to parts of Homo sapiens chromosome 21 and parts of chromosome X. These sequences are available from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/nuccore/).

Here we are interested in analyzing the long-range dependence in some DNA sequences under the Fractional Poisson Process presented in Section 3. Thus, for the analysis of each DNA sequence of Table 2, we consider the time series given in Definition 4.1.

Definition 4.1.

Given a DNA sequence $\{n_{i}\}_{i=1}^{n}$ , $n_{i}\in\{\mathrm{A,G,C,T}\}$ , we consider that there is a renewal when there is a change of the nitrogenous base in the DNA sequence. To define the time series for the DNA sequence $\{n_{i}\}_{i=1}^{n}$ , the following steps are necessary. In the first step, we divide the DNA sequence $\{n_{i}\}_{i=1}^{n}$ into $l=\big{\lfloor}\frac{n}{t}\big{\rfloor}$ non-overlapping blocks $\{B_{1},\cdots,B_{l}\}$ , where each block has $t$ base pairs. For each block $B_{j}$ , $j\in\{1,2,\cdots,l\}$ , we obtain the frequency of renewals, denoted by ${N}_{t}(j)$ . Then we define $\underline{\mathbf{N_{H}^{\lambda}(t)}}=\{{N}_{t}(j)\}_{j=1}^{l}$ as the time series of the frequencies of renewals in the DNA sequence for block length $t$ . Let $\mathbf{T_{\mu}}=\{T_{j}\}_{j=1}^{m}$ be the time series of the inter-arrival intervals between occurrences in the DNA sequence, such that $T_{j}$ is the time between the occurrence of the $(j-1)$ -th event and the occurrence of the $j$ -th event.

Example 4.1 presents the mapping of a small part of a DNA sequence to numerical values.

Example 4.1.

Given a small part of a DNA sequence $\{n_{i}\}_{i=1}^{20}=\{\mathrm{C,C,C,A,A,A,G,G,T,T,T,A,A,A,A,A,G,G,G,A}\}$ . We consider that there is a renewal when there is a change of the nitrogenous base in the DNA sequence. In this example we divide the DNA sequence $\{n_{i}\}_{i=1}^{20}$ into $l=\big{\lfloor}\frac{20}{5}\big{\rfloor}=4$ non-overlapping blocks $\{B_{1},\cdots,B_{4}\}$ , where each block has $t=5$ base pairs.

$\displaystyle\{n_{i}\}_{i=1}^{20}=\big\{\overbrace{\mathrm{C}\,\mathrm{C}\,\mathrm{C}\,\underbrace{\mathrm{A}}_{\mathrm{renewal}}\,\mathrm{A}}^{B_{1}}\,\overbrace{\mathrm{A}\,\underbrace{\mathrm{G}}_{\mathrm{renewal}}\,\mathrm{G}\,\underbrace{\mathrm{T}}_{\mathrm{renewal}}\,\mathrm{T}}^{B_{2}}\,\overbrace{\mathrm{T}\,\underbrace{\mathrm{A}}_{\mathrm{renewal}}\,\mathrm{A}\,\mathrm{A}\,\mathrm{A}}^{B_{3}}\,\overbrace{\mathrm{A}\,\underbrace{\mathrm{G}}_{\mathrm{renewal}}\,\mathrm{G}\,\mathrm{G}\,\underbrace{\mathrm{A}}_{\mathrm{renewal}}}^{B_{4}}\big\}$

For each block $B_{j}$ , $j\in\{1,2,3,4\}$ , we obtain the frequency of renewals denoted by ${N}_{5}(j)$ . Then $\underline{\mathbf{N_{H}^{\lambda}(5)}}=\{{N}_{5}(j)\}_{j=1}^{4}=\{1,2,1,2\}$ is the time series obtained by the frequencies of renewals for the small DNA sequence considering $t=5$ . The time series $\mathbf{T_{\mu}}=\{T_{j}\}_{j=1}^{7}=\{3,3,2,3,5,3,1\}$ is the time series of the inter-arrival intervals between renewals in the DNA sequence.
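The mapping in this example can be reproduced with a few lines of code. The sketch below is an illustration under two assumptions read off from the example: a renewal is attributed to the block that contains the new base, and the inter-arrival intervals are the lengths of the runs of identical bases. The function names are hypothetical.

```python
from itertools import groupby


def renewal_counts(sequence, t):
    """Number of base changes in each of the floor(n/t) blocks of length t."""
    n_blocks = len(sequence) // t
    counts = [0] * n_blocks
    for i in range(1, n_blocks * t):
        if sequence[i] != sequence[i - 1]:
            counts[i // t] += 1  # renewal counted in the block of the new base
    return counts


def inter_arrival_times(sequence):
    """Lengths of consecutive runs of identical bases."""
    return [len(list(run)) for _, run in groupby(sequence)]


# The 20-base sequence of the example, with blocks of t = 5 bases.
dna = "CCCAAAGGTTTAAAAAGGGA"
counts = renewal_counts(dna, 5)      # [1, 2, 1, 2]
intervals = inter_arrival_times(dna)  # [3, 3, 2, 3, 5, 3, 1]
```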

Remark 4.1.

For the analysis in Table 2, we take $\mathbf{t}$ to be $\mathbf{1}$ kilobase (kb) for each DNA sequence, so the time series used is $\underline{\mathbf{N_{H}^{\lambda}(1)}}=\{{N}_{1}(j)\}_{j=1}^{l}$ .

Figure 3 presents the relative frequencies of the number of occurrences for some DNA sequences, which we compare with the theoretical distribution given by Eq. (5). From Fig. 3, the fractional Poisson process seems to be a good model. In Section 4.3, we check the adherence to the theoretical distribution given by Eq. (5) with the Kolmogorov-Smirnov test, where the null hypothesis is that the time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}$ has the distribution given by Eq. (5).

To specify the model, it is necessary to estimate the parameters and identify the probability distributions of the time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}$ and $\mathbf{T}_{\mu}$ (see Definition 4.1).

4.1 Estimation of the parameters

Cahoy et al. (2010) proposed the usual formal estimation procedure for the parameters of the fractional Poisson process, denoted here by $\hat{H}_{\textit{usual}}$ and $\hat{\lambda}_{\textit{usual}}$ . It estimates the parameters from the time series of the inter-arrival intervals between occurrences and not from the original time series of the FP. Here we propose a new estimation method, whose novelty is that it estimates the parameters directly from the time series of the frequencies of renewals (the original time series from the FP).

This paper proposes a method based on the method of moments for estimating the parameters of DNA sequences modeled by the Fractional Poisson Process. It is an analytical estimation procedure that yields explicit estimators and involves minimal computation.

Given $\mathbf{t_{1},t_{2}}>0$ , $\mathbf{t_{1}}\neq\mathbf{t_{2}}$ , by Eq. (6) we have

$\displaystyle E(\mathbf{N_{H}^{\lambda}(t_{1})})=\frac{\lambda\mathbf{t_{1}}^{H}}{\Gamma(H+1)},\text{ and }E(\mathbf{N_{H}^{\lambda}(t_{2})})=\frac{\lambda\mathbf{t_{2}}^{H}}{\Gamma(H+1)}.$ (9)

By taking the ratio between the two equations in Eq. (9), we obtain

$\displaystyle\frac{E(\mathbf{N_{H}^{\lambda}(t_{1})})}{E(\mathbf{N_{H}^{\lambda}(t_{2})})}=\left(\frac{\mathbf{t_{1}}}{\mathbf{t_{2}}}\right)^{H},$ (10)

and, applying the logarithm to both sides,

$\displaystyle H=\frac{\ln\left(\frac{E(\mathbf{N_{H}^{\lambda}(t_{1})})}{E(\mathbf{N_{H}^{\lambda}(t_{2})})}\right)}{\ln\left(\frac{\mathbf{t_{1}}}{\mathbf{t_{2}}}\right)}.$

Definition 4.2 below then states an estimator for the parameter $H$ .

Definition 4.2 ( $\hat{H}$ Estimator).

Given $\mathbf{t_{1},t_{2}}>0$ , $\mathbf{t_{1}}\neq\mathbf{t_{2}}$ , let $\underline{\mathbf{N_{H}^{\lambda}(t_{1})}}=\{{N}_{t_{1}}(j)\}_{j=1}^{l}$ and $\underline{\mathbf{N_{H}^{\lambda}(t_{2})}}=\{{N}_{t_{2}}(j)\}_{j=1}^{l}$ be the time series obtained by the frequencies of the number of occurrences in a DNA sequence. The estimator proposed for $H$ , based on the method of moments, is defined by

$\displaystyle\hat{H}=\frac{\ln\left(\frac{\hat{M}(\mathbf{t_{1}})}{\hat{M}(\mathbf{t_{2}})}\right)}{\ln\left(\frac{\mathbf{t_{1}}}{\mathbf{t_{2}}}\right)},$ (11)

where $\hat{M}(\cdot)$ is the arithmetic mean of the time series $\underline{\mathbf{N_{H}^{\lambda}(\cdot)}}$ . Here we consider the values $\mathbf{t_{1}}=\frac{1}{10}$ kb and $\mathbf{t_{2}}=1$ kb.

Consequently, we propose to estimate the parameter $\lambda$ as

$\displaystyle\hat{\lambda}=\Gamma(\hat{H}+1)\big{(}\hat{M}(1)\big{)},$ (12)

where $\hat{M}(1)=\sum_{j=1}^{l}\frac{{N}_{1}(j)}{l}$ is the arithmetic mean of the time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}$ .

Let $\mu$ be the expected time between successive occurrences of a change of base in a DNA sequence. To estimate the parameter $\mu$ , we consider

$\displaystyle\hat{\mu}=\frac{\displaystyle\sum_{j=1}^{m}T_{j}}{m},$ (13)

where $m$ is the length of the time series $\mathbf{T}_{\mu}=\{T_{j}\}_{j=1}^{m}$ (see Definition 4.1).
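The estimators in Eqs. (11) to (13) are explicit, so they translate directly into code. The sketch below is a minimal illustration; the function names and the small input series are hypothetical and chosen only so that the result can be checked by hand (a mean of 10 at $t_{1}=0.1$ and a mean of 100 at $t_{2}=1$ give $\hat{H}=1$ and $\hat{\lambda}=100$ ).

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def estimate_H(counts_t1, counts_t2, t1, t2):
    """Moment estimator of H, Eq. (11)."""
    ratio = arithmetic_mean(counts_t1) / arithmetic_mean(counts_t2)
    return math.log(ratio) / math.log(t1 / t2)


def estimate_lambda(counts_unit_time, H_hat):
    """Moment estimator of lambda from the series with t = 1, Eq. (12)."""
    return math.gamma(H_hat + 1) * arithmetic_mean(counts_unit_time)


def estimate_mu(inter_arrival_times):
    """Sample mean of the inter-arrival intervals, Eq. (13)."""
    return arithmetic_mean(inter_arrival_times)


# Hypothetical renewal counts for block lengths t1 = 0.1 and t2 = 1.
counts_t1 = [9, 11, 10]
counts_t2 = [100, 98, 102]
H_hat = estimate_H(counts_t1, counts_t2, 0.1, 1.0)
lambda_hat = estimate_lambda(counts_t2, H_hat)
mu_hat = estimate_mu([3, 3, 2, 3, 5, 3, 1])
```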

Finally, to estimate the parameter $\alpha$ , we consider $\hat{\alpha}_{\textit{DFA}}$ , obtained by the DFA method (see Section 2).

4.2 Simulation

Using simulated data, we compare the proposed estimation method with the usual method in the literature (see Cahoy et al., 2010). To this end, we generate $n=$ 100 samples of the original FP time series and their respective time series of inter-arrival intervals, each of sample size $N=$ 1000. We then calculate the estimates of ${H}$ and ${\lambda}$ for each of the $n$ samples and average them to obtain the means. These values are reported in Table 1 together with their absolute bias, mean squared error (MSE), and variance (Var). To simulate the FP time series, we considered five different pairs of values of $H$ and $\lambda$ . Table 1 shows that the proposed estimators attain a lower bias for all considered values of $H$ and $\lambda$ , especially for large values of $\lambda$ .

Table 1
Mean estimates and dispersion measures of the parameter estimators for simulated FP data with different pairs of values of $H$ and $\lambda$

Estimator	Mean	|Bias|	MSE	Var
( $H=0.1,\lambda=1$ )
$\hat{H}_{\mathrm{proposed}}$	0.1004	0.0004	0.0001	0.0001
$\hat{H}_{\mathrm{usual}}$	0.1005	0.0005	0.0002	0.0002
$\hat{\lambda}_{\mathrm{proposed}}$	0.9982	0.0018	0.0030	0.0030
$\hat{\lambda}_{\mathrm{usual}}$	1.0039	0.0039	0.0032	0.0033
( $H=0.3,\lambda=5$ )
$\hat{H}_{\mathrm{proposed}}$	0.2999	0.0001	0.0001	0.0001
$\hat{H}_{\mathrm{usual}}$	0.3014	0.0014	0.0001	0.0001
$\hat{\lambda}_{\mathrm{proposed}}$	5.0090	0.0090	0.0326	0.0329
$\hat{\lambda}_{\mathrm{usual}}$	5.0393	0.0393	0.1351	0.1349
( $H=0.5,\lambda=10$ )
$\hat{H}_{\mathrm{proposed}}$	0.4967	0.0033	0.0001	0.0001
$\hat{H}_{\mathrm{usual}}$	0.5072	0.0072	0.0002	0.0002
$\hat{\lambda}_{\mathrm{proposed}}$	10.0165	0.0165	0.0744	0.0926
$\hat{\lambda}_{\mathrm{usual}}$	10.3372	0.3372	0.4747	0.4512
( $H=0.7,\lambda=100$ )
$\hat{H}_{\mathrm{proposed}}$	0.6963	0.0037	0.0002	0.0002
$\hat{H}_{\mathrm{usual}}$	0.7098	0.0098	0.0003	0.0003
$\hat{\lambda}_{\mathrm{proposed}}$	100.501	0.5010	2.3750	2.2358
$\hat{\lambda}_{\mathrm{usual}}$	108.5699	8.5699	222.3495	156.7434
( $H=0.9,\lambda=600$ )
$\hat{H}_{\mathrm{proposed}}$	0.9006	0.0006	0.0001	0.0001
$\hat{H}_{\mathrm{usual}}$	0.8986	0.0014	0.0002	0.0002
$\hat{\lambda}_{\mathrm{proposed}}$	599.2937	0.7063	13.6661	13.8602
$\hat{\lambda}_{\mathrm{usual}}$	592.4814	7.5186	3717.0682	3853.1982
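The text does not state how the FP samples were generated. One possible way, shown here only as an assumption and not as the procedure used for Table 1, is to draw Mittag-Leffler waiting times with the two-uniform representation of Kozubowski and Rachev and to count the arrivals in $[0,t]$ . All function names are hypothetical.

```python
import math
import random


def mittag_leffler_waiting_time(H, lam, rng):
    """One waiting time with survival function E_H(-lam * tau**H).

    Uses tau = -scale * ln(U) * (sin(H*pi*(1-V)) / sin(H*pi*V))**(1/H),
    an algebraically equivalent form of the Kozubowski-Rachev formula.
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    v = rng.random()
    while v == 0.0:         # keep v strictly inside (0, 1)
        v = rng.random()
    scale = lam ** (-1.0 / H)
    ratio = math.sin(H * math.pi * (1.0 - v)) / math.sin(H * math.pi * v)
    return -scale * math.log(u) * ratio ** (1.0 / H)


def simulate_count(H, lam, t, rng):
    """Number of renewals in [0, t] for one freshly started path."""
    count = 0
    clock = mittag_leffler_waiting_time(H, lam, rng)
    while clock <= t:
        count += 1
        clock += mittag_leffler_waiting_time(H, lam, rng)
    return count


# Usage: independent replicates of N(1) for H = 0.7 and lambda = 5.
rng = random.Random(1)
replicates = [simulate_count(0.7, 5.0, 1.0, rng) for _ in range(4000)]
sample_mean = sum(replicates) / len(replicates)
theoretical_mean = 5.0 / math.gamma(0.7 + 1)  # Eq. (6) with t = 1
```

The sample mean of such replicates can be compared with the renewal function of Eq. (6), and for $H=1$ the waiting times reduce to exponential ones.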

4.3 Results

When the problem is to determine the distribution, a graphical representation of the data is recommended, since it provides support for fitting a probability distribution.

Figure 2.

The number of occurrences of changes in bases per Kb for some DNA sequences.

Figure 3.

Distribution of the number of changes in bases per Kb for some DNA sequences.

Figure 4.

Distribution of the inter-arrival intervals between occurrences in the DNA sequence.

To that end, we first plotted the time series of each DNA sequence in Table 2 (see Fig. 2). We note that the variability in the arrival rate is not significant, so we consider $\hat{\lambda}$ to be constant.

Remark 4.2.

The relative frequencies of the number of occurrences were plotted (see Fig. 3) to check the adherence to the theoretical distribution given by Eq. (5) with parameters $\hat{\lambda}$ and $\hat{H}$ . To verify this adherence, the Kolmogorov-Smirnov test was performed, where the null hypothesis is that the time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}$ has the distribution given by Eq. (5); the $p$ -value, denoted by $p_{\mathrm{\mathbf{N}}}$ , is given in Table 2. For all DNA sequences considered in Table 2, there is no evidence to reject the null hypothesis at the 5% significance level. The graphs in Fig. 4 suggest that the distribution of the time series $\mathbf{T}_{\mu}=\{T_{j}\}_{j=1}^{m}$ is the waiting time distribution given by Eq. (7). To verify this adherence, the Kolmogorov-Smirnov test was performed, where the null hypothesis is that the time series $\mathbf{T}_{\mu}$ has the distribution given by Eq. (7); the $p$ -value, denoted by $p_{\mathrm{\mathbf{T}}}$ , is given in Table 2. Again, for all DNA sequences considered in Table 2, there is no evidence to reject the null hypothesis at the $5\%$ significance level.

All DNA sequences in Table 2 whose 95% confidence interval for $\alpha$ lies entirely above 0.5 present the long-range dependence property, and for each of them the conclusion is statistically significant at the 5% significance level. There is no evidence of long-range dependence in the DNA sequences Z98255, AL929410, and AL163207 at the 5% significance level, because the value 0.5 belongs to their 95% confidence intervals.

Table 2

DNA sequences modeled by the fractional Poisson process

DNA sequence $\{n_{i}\}_{i=1}^{n}$		Time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}$						Time series $\mathbf{T}$
Locus	Length (bp)	Mean	$\hat{H}$	$\hat{\lambda}$	$p_{\mathbf{N}}$	$\hat{\alpha}_{\mathrm{DFA}}$	$[\alpha_{\mathrm{1}},\alpha_{2}]$	$\hat{\mu}=\overline{T}$	$\frac{1\,\mathrm{kb}}{\hat{\mu}}$	$p_{\mathrm{\mathbf{T}}}$
Z98255	169998	695.95	1.00	695.95	0.87	0.52	$[0.41,0.63]$	1.44	696.05	0.63
AL732374	224187	697.82	1.00	697.82	0.22	0.66	$[0.55,0.77]$	1.43	697.87	0.43
AL163209	340000	697.30	1.00	697.30	0.07	0.63	$[0.56,0.70]$	1.43	697.30	0.06
AL591435	138038	692.04	1.00	692.04	0.95	0.61	$[0.51,0.72]$	1.45	689.66	0.42
AL929410	186649	697.75	1.00	697.75	0.64	0.57	$[0.49,0.65]$	1.43	699.30	0.98
AL163207	340000	698.45	1.00	698.45	0.17	0.47	$[0.38,0.56]$	1.43	699.30	0.17
AC073493	211422	703.95	1.00	703.95	0.12	0.63	$[0.55,0.71]$	1.42	704.23	0.11
AC004673	236281	698.08	1.00	698.08	0.34	0.63	$[0.54,0.72]$	1.43	699.30	0.10
AL163210	340000	693.01	1.00	693.01	0.09	0.61	$[0.56,0.66]$	1.44	694.44	0.06
AL163208	340000	698.43	1.00	698.43	0.29	0.65	$[0.59,0.71]$	1.43	699.33	0.28
AL445312	170984	720.72	1.00	720.72	0.31	0.99	$[0.90,1.08]$	1.39	719.42	0.78
AL163204	340000	700.78	1.00	700.78	0.53	0.63	$[0.58,0.68]$	1.43	699.30	0.10

Notations: $p_{\mathbf{N}}$ and $p_{\mathrm{\mathbf{T}}}$ (see Remark 4.2), $[\alpha_{\mathrm{1}},\alpha_{2}]$ is the 95% confidence interval for $\alpha_{\textit{DFA}}$ .
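The decision rule applied to Table 2 can be written as a short check: a sequence is classified as long-range dependent when its 95% confidence interval for $\alpha$ lies entirely above 0.5. The sketch below uses the intervals reported in Table 2; the variable and function names are hypothetical.

```python
# 95% confidence intervals for alpha, copied from Table 2.
confidence_intervals = {
    "Z98255": (0.41, 0.63),
    "AL732374": (0.55, 0.77),
    "AL163209": (0.56, 0.70),
    "AL591435": (0.51, 0.72),
    "AL929410": (0.49, 0.65),
    "AL163207": (0.38, 0.56),
    "AC073493": (0.55, 0.71),
    "AC004673": (0.54, 0.72),
    "AL163210": (0.56, 0.66),
    "AL163208": (0.59, 0.71),
    "AL445312": (0.90, 1.08),
    "AL163204": (0.58, 0.68),
}


def shows_long_range_dependence(interval, threshold=0.5):
    """True when the whole interval lies above the threshold."""
    lower, _upper = interval
    return lower > threshold


# Loci for which 0.5 is not excluded by the interval.
no_evidence = sorted(
    locus for locus, ci in confidence_intervals.items()
    if not shows_long_range_dependence(ci)
)
```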

Remark 4.3.

Both the $\hat{H}$ estimator and the $\hat{\alpha}_{\textit{DFA}}$ estimator are obtained from the same time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}=\{{N}_{1}(j)\}_{j=1}^{l}$ . DFA estimates the Hurst exponent of the time series $\underline{\mathbf{N_{H}^{\lambda}(1)}}=\{{N}_{1}(j)\}_{j=1}^{l}$ , whereas $\hat{H}$ is an estimator of the $H$ parameter of the Fractional Poisson process (FP), which can also be regarded as an estimator of the Hurst exponent of the process formed by the increments of the FP, $X_{t}^{H}=N_{H}^{\lambda}(t)-N_{H}^{\lambda}(t-1)$ , $t\geqslant 1$ , known as fractional Poisson noise (FPN). A difference between the estimated values of $H$ and the DFA exponent is therefore expected.

5. Conclusions

Here we model the DNA sequence with the Fractional Poisson Process. We propose a method based on the method of moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence. We also analyze the long-range dependence in various DNA sequences by the detrended fluctuation analysis method. Almost all DNA sequences studied here present the long-range dependence property at the 5% significance level.

The properties of the estimators proposed in this article are supported empirically by simulations. In the future, we intend to obtain mathematical proofs of these properties and to apply this methodology in other areas of research.

References

1. Beran (1994). Statistics for Long Memory Processes. New York: Chapman & Hall.

2. Cahoy D.O., Uchaikin V.V., & Woyczynski W.A. (2010). Parameter estimation for fractional Poisson processes. Journal of Statistical Planning and Inference, 140, 3106-3120.

3. Crato, Linhares R.R., & Lopes S.R.C. (2010). Statistical properties of detrended fluctuation analysis. Journal of Statistical Computation and Simulation, 80(6), 625-641.

4. Crato, Linhares R.R., & Lopes S.R.C. (2011). α-stable laws for noncoding regions in DNA sequences. Journal of Applied Statistics, 38(2), 261-271.

5. Cocho, Miramontes, Mansilla, & Li (2014). Bacterial genomes lacking long-range correlations may not be modeled by low-order Markov chains: The role of mixing statistics and frame shift of neighboring genes. Computational Biology and Chemistry, 53, 15-25.

6. Ghorbani, Jonckheere E.A., & Bogdan (2018). Gene expression is not random: Scaling, long-range cross-dependence, and fractal characteristics of gene regulatory networks. Frontiers in Physiology, 9, article 1446.

7. Guharay, Hunt B.R., Yorke J.A., & White O.R. (2000). Correlations in DNA sequences across the three domains of life. Physica D: Nonlinear Phenomena, 146(1-4), 388-396.

8. Karlin & Brendel (1993). Patchiness and correlations in DNA sequences. Science, 259(5095), 677-680.

9. Karmeshu & Krishnamachari (2004). Sequence variability and long-range dependence in DNA: An information theoretic perspective. Lecture Notes in Computer Science, 3316, 1354-1361.

10. Laskin (2003). Fractional Poisson process. Commun. Nonlinear Sci. Numer. Simul., 8(3-4), 201-213.

11. Linhares R.R. (2016). Smoothed detrended fluctuation analysis. Journal of Statistical Computation and Simulation, 86(17), 3388-3397.

12. Lopes S.R.C., & Mendes B.V.M. (2006). Bandwidth selection in classical and robust estimation of long memory. International Journal of Statistics and Systems, 1(2), 167-190.

13. Lopes S.R.C., & Nunes M.A. (2006). Long Memory Analysis in DNA Sequences. Physica A: Statistical Mechanics and its Applications, 361(2), 569-588.

14. Melnik S.S., & Usatenko O.V. (2014). Entropy and long-range correlations in DNA sequences. Computational Biology and Chemistry, 53, 26-31.

15. Oliver J.L., Carpena, Hackenberg, & Bernaola-Galvan (2004). IsoFinder: Computational prediction of isochores in genome sequences. Nucleic Acids Res., 32, 287-292.

16. Peng, Buldyrev S.V., Goldberger A.L., Havlin, Sciortino, Simons, & Stanley H.E. (1992). Long-range correlations in nucleotide sequences. Nature, 356, 168-170.

17. Peng, Buldyrev S.V., Havlin, Simons, Stanley H.E., & Goldberger A.L. (1994). Mosaic organization of DNA nucleotides. Physical Review E, 49(5), 1685-1689.

18. Podobnik, Shao, Dokholyan N.V., Zlatic, Stanley H.E., & Grosse (2007). Similarity and dissimilarity in correlations of genomic DNA. Physica A: Statistical Mechanics and its Applications, 373, 497-502.

19. Stanley H.E., Buldyrev S.V., Goldberger A.L., Havlin, Peng C.K., & Simons (1999). Scaling features of noncoding DNA. Physica A: Statistical Mechanics and its Applications, 273(1), 1-18.

20. Sutthibutpong, Matek, Benham, Slade G.G., & Noy (2016). Long-range correlations in the mechanics of small DNA circles under topological stress revealed by multi-scale simulation. Nucleic Acids Research, 44(19), 9121-9130.
