Abstract
Modeling DNA sequences with stochastic models and developing statistical methods to analyze the data produced by the many DNA sequencing projects are challenging questions for statisticians and biologists. One prominent line of work studies long-range dependence in DNA sequences by first transforming the sequence into a numerical time series. It is still debated in the literature whether the type of transformation can alter the conclusion about long-range dependence in the DNA sequence. Here we model the DNA sequence with the Fractional Poisson Process, propose a method based on moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence, and analyze the long-range dependence in various DNA sequences by the detrended fluctuation analysis method.
Introduction
There is a commonly accepted view that our world is complex and correlated. Many physical and biological systems exhibit complex behavior characterized by long-range power-law correlations. For this reason, a large number of studies in different fields of science have examined systems with long-range interactions and natural sequences with non-trivial information content. DNA sequences are among the most prominent manifestations of this concept (see Peng et al., 1992; Karmeshu & Krishnamachari, 2004; Crato et al., 2010; Crato et al., 2011; Melnik & Usatenko, 2014; Sutthibutpong et al., 2016; Ghorbani et al., 2018).
Long-range statistical dependence in DNA sequences means that the occurrence of bases tends to co-vary across regions separated by long distances. Karlin and Brendel (1993) show that the mosaic character of DNA, consisting of patches of different compositions, can fully account for the apparent long-range dependence in DNA sequences. Peng et al. (1994) address the question of whether long-range dependence in DNA sequences may be a trivial consequence of the known mosaic structure (“patchiness”) of DNA. Oliver et al. (2003) explore the phylogenetic distribution of large-scale genome patchiness by considering deviations from power-law behavior in long-range dependence. Karmeshu (2004) highlights the properties of parametric and non-parametric entropy measures and focuses on a few applications that use entropic measures; in particular, the maximum entropy principle is used to capture the well-known long-range dependence in DNA sequences, which draws a link between Tsallis entropy and power laws. Cochoa et al. (2014) show that the long-range correlation in bacterial genomic sequences is mainly due to a mixing of heterogeneous statistics at different codon positions. Understanding the dynamics of gene expression could prove crucial in unraveling the physical complexities involved in this process. Ghorbani et al. (2018) report on the scaling properties of gene expression time series in Escherichia coli and Saccharomyces cerevisiae and investigate the individual gene expression dynamics and the cross-dependency between them in the context of the gene regulatory network. They observed that gene expression displays fractal and long-range dependence characteristics.
One of the most appropriate methods for studying long-range correlations in DNA sequences is detrended fluctuation analysis (DFA) (see Peng et al., 1994; Podobnik et al., 2007). DFA, proposed by Peng et al. (1992), is a method for analyzing time series that appear to be long-memory processes (see Linhares, 2016). It has been successfully applied in different fields, such as DNA sequences (see Peng et al., 1992), economic time series (see Liu et al., 1997), heart rate variability analysis (see Yeh et al., 2006), and long-term weather records (see Koscielny-Bunde et al., 1998). The objective of this technique is to evaluate the statistical fluctuations of the series and summarize them by a scaling exponent.
Modeling DNA sequences with stochastic models and developing statistical methods to analyze the enormous amount of data that results from the many DNA sequencing projects are challenging questions for statisticians and biologists. To analyze long-range dependence, it is necessary to transform the DNA sequence into a numerical time series. Different transformations have been proposed in the literature for this purpose (see Peng et al., 1992; Stanley et al., 1999; Guharay et al., 2000; Lopes & Nunes, 2006; Podobnik et al., 2007; Crato et al., 2010; Crato et al., 2011), the principal one being the DNA walk. However, under the naive approach of arbitrarily assigning numerical values (scales) to the bases and then proceeding with the long-range dependence analysis, the result may depend on the particular assignment of numerical values. It is still debated in the literature whether the type of transformation can alter the conclusion about long-range dependence in the DNA sequence. Here we propose the Fractional Poisson Process (see Laskin, 2003) to model DNA sequences, propose a method to estimate the parameters of the Fractional Poisson Process, and analyze the long-range dependence by the DFA method.
The paper is organized as follows. In Section 2, we describe the DFA method. Section 3 presents the Fractional Poisson Process. In Section 4, we propose a method based on the method of moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence, and we present simulations and the analysis of long-range dependence in several DNA sequences. Section 5 concludes the paper.
DFA method
The summary of the proposed work. 
To investigate long-range dependence in DNA sequences, we propose the Fractional Poisson Process to model the DNA sequence, and we analyze the long-range dependence by the DFA method. This section therefore presents the DFA method. Figure 1 summarizes the proposed work.
In short-memory time series, the autocovariance function decays rapidly, perhaps exponentially, and the spectral density function is at least bounded and possibly very smooth. In stationary time series with long-range dependence, the autocovariances are not summable and the spectral density function is unbounded (see Beran, 1994). One of the most appropriate methods proposed for the study of long-range correlations in DNA sequences is the detrended fluctuation analysis (DFA) (see Peng et al., 1994).
To apply the Detrended Fluctuation Analysis method (see Peng et al., 1992; Linhares, 2016) to a given time series
In a first step, a running sum of the observed variable
for each
In the third step, for each
where
Under such condition, the smoothed fluctuations can be characterized by a scaling exponent
where
By taking the logarithm of the relationship in Eq. (3), we obtain
where
The 95% confidence level for the parameter
where
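The DFA steps outlined in this section can be illustrated in code. The following is a minimal Python sketch of a standard DFA with linear detrending in non-overlapping boxes; it is not the authors' implementation. The function names `linear_fit` and `dfa_exponent`, the choice of box sizes, and the white-noise input are assumptions made purely for illustration.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def dfa_exponent(series, box_sizes):
    """Estimate the DFA scaling exponent (box sizes should be >= 4)."""
    mean = sum(series) / len(series)

    # Step 1: running sum of the mean-centred series (the "profile").
    profile, total = [], 0.0
    for value in series:
        total += value - mean
        profile.append(total)

    log_sizes, log_flucts = [], []
    for size in box_sizes:
        n_boxes = len(profile) // size
        if n_boxes < 2:
            continue  # too few boxes for a stable average
        xs = list(range(size))
        squared_residuals = 0.0
        for b in range(n_boxes):
            # Step 2: fit and remove a local linear trend in each box.
            window = profile[b * size:(b + 1) * size]
            intercept, slope = linear_fit(xs, window)
            squared_residuals += sum(
                (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, window)
            )
        # Step 3: root-mean-square fluctuation for this box size.
        fluctuation = math.sqrt(squared_residuals / (n_boxes * size))
        log_sizes.append(math.log(size))
        log_flucts.append(math.log(fluctuation))

    # Step 4: the exponent is the slope of log F(size) against log(size).
    _, alpha = linear_fit(log_sizes, log_flucts)
    return alpha


# Illustrative use on uncorrelated Gaussian noise (assumed test input).
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
print(dfa_exponent(noise, [8, 16, 32, 64, 128]))
```

For uncorrelated noise the estimated exponent is expected to lie close to 0.5, which is the reference value against which long-range dependence is judged.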
In this section, we present the Fractional Poisson Process of parameter
(Fractional Poisson Process (FP)).
The Fractional Poisson process
The
The time between two successive arrivals is a random variable called waiting time. The waiting time probability distribution function is an important attribute of any arrival or counting random process. The waiting time probability distribution function
(Waiting Time Distribution).
The waiting time probability function
where the generalized two-parameter Mittag-Leffler function is
When
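To make the waiting-time distribution more concrete, the sketch below draws waiting times of a Fractional Poisson Process in Python. It assumes the three-uniform representation reported by Cahoy et al. (2010), as we understand it, rather than evaluating the Mittag-Leffler function directly. The parameter names `nu` (fractional order, 0 < nu <= 1) and `mu` (rate), and the function names, are assumptions introduced for illustration.

```python
import math
import random


def _open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def fpp_waiting_time(nu, mu, rng=random):
    """Draw one waiting time of a Fractional Poisson Process.

    Assumed parametrization: 0 < nu <= 1 and mu > 0.
    For nu == 1 the waiting time is exponential with rate mu.
    """
    u1 = _open_uniform(rng)
    if nu == 1.0:
        return -math.log(u1) / mu

    u2 = _open_uniform(rng)
    u3 = _open_uniform(rng)
    # Exponential factor raised to the power 1/nu.
    scale = (abs(math.log(u1)) / mu) ** (1.0 / nu)
    # Positive stable factor (Kanter-type representation).
    numerator = math.sin(nu * math.pi * u2) * (
        math.sin((1.0 - nu) * math.pi * u2) ** (1.0 / nu - 1.0)
    )
    denominator = (math.sin(math.pi * u2) ** (1.0 / nu)) * (
        abs(math.log(u3)) ** (1.0 / nu - 1.0)
    )
    return scale * numerator / denominator


# Illustrative use: a few waiting times with assumed parameter values.
rng = random.Random(0)
print([fpp_waiting_time(0.7, 2.0, rng) for _ in range(5)])
```

Arrival times of the process can then be obtained as cumulative sums of independent waiting times, and counts per interval follow by binning those arrival times.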
DNA consists of two long polymers of simple units (nucleotides). Each nucleotide in DNA contains one of four possible nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T). A base pair (bp) is a unit consisting of two nucleobases linked together by hydrogen bonds, and a kilobase (kb) is a unit of measurement in molecular biology equal to 1000 base pairs of DNA. A DNA sequence is a series of nucleotide letters representing the primary structure of a DNA chain (real or hypothetical) with the ability to carry information. Here we consider DNA sequences corresponding to parts of Homo sapiens chromosome 21 and parts of chromosome X. These sequences are available in the National Center for Biotechnology Information (NCBI,
Here we are interested in analyzing the long-range dependence in some DNA sequences under the Fractional Poisson Process presented in Section 3. Thus, for the analysis of each DNA sequence of Table 2, we consider the time series given in Definition 4.1.
.
Given a DNA sequence
Example 4.1 presents the mapping of a small part of a DNA sequence to numerical values.
.
Given a small part of a DNA sequence
For each block
.
For the analysis in Table 2, for each DNA sequence we consider
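As an illustration of how a DNA sequence can be turned into a count series, the Python sketch below splits a sequence into non-overlapping blocks and counts, within each block, the positions whose base differs from the preceding base. This counting rule is an assumption about what a "change in bases" per block means; it is not the authors' exact definition. The function name `base_change_counts` and the default block size of 1000 bases (one kb) are likewise hypothetical.

```python
def base_change_counts(sequence, block_size=1000):
    """Count base changes inside each non-overlapping block.

    Assumed rule: a change occurs at a position whose base differs
    from the base immediately before it in the same block. Changes
    across block boundaries and any incomplete final block are ignored.
    """
    seq = sequence.upper()
    counts = []
    for start in range(0, len(seq) - block_size + 1, block_size):
        block = seq[start:start + block_size]
        changes = sum(1 for prev, cur in zip(block, block[1:]) if prev != cur)
        counts.append(changes)
    return counts


# Illustrative use on a toy sequence with blocks of 4 bases:
# "AAGG" has one change (A to G), "CTTA" has two (C to T, T to A).
print(base_change_counts("AAGGCTTA", block_size=4))  # prints [1, 2]
```

The resulting list of counts plays the role of the numerical time series to which a count model and the DFA method can be applied.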
Figure 3 presents the relative frequencies of the number of occurrences for some DNA sequences, which we compare with the theoretical distribution given by Eq. (5). From Fig. 3, the Fractional Poisson Process appears to be a good model. In Section 4.3, we check the adherence to the theoretical distribution given by Eq. (5) with the Kolmogorov-Smirnov test, where the null hypothesis considers that the time series
To specify the model, it is necessary to estimate the parameters and identify the probability distributions of the time series
Estimation of the parameters
Cahoy et al. (2010) proposed the usual formal estimation procedure for the parameters of the Fractional Poisson Process, denoted here by
This paper proposes a method based on the method of moments for estimating the parameters of DNA sequences modeled by the Fractional Poisson Process. It is an analytical estimation procedure that yields explicit estimators and involves minimal computation.
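For reference, the usual procedure can be sketched in a few lines of Python. The sketch assumes, as we understand the estimator of Cahoy et al. (2010), that the parameters are recovered from the sample mean and variance of the log waiting times. The names `nu` (fractional order) and `mu` (rate), the function name, and the simulated input are assumptions for illustration; this is not the moment-based method proposed in this paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def log_moment_estimates(waiting_times):
    """Estimate (nu, mu) from the first two moments of log waiting times.

    Assumed identities for a Fractional Poisson Process:
        E[ln T]   = -ln(mu) / nu - gamma
        Var[ln T] = pi^2 / (3 nu^2) - pi^2 / 6
    """
    logs = [math.log(t) for t in waiting_times]
    n = len(logs)
    mean_log = sum(logs) / n
    var_log = sum((x - mean_log) ** 2 for x in logs) / n
    nu_hat = math.pi / math.sqrt(3.0 * (var_log + math.pi ** 2 / 6.0))
    mu_hat = math.exp(-nu_hat * (mean_log + EULER_GAMMA))
    return nu_hat, mu_hat


# Illustrative use: exponential waiting times correspond to nu = 1,
# so the estimates should be near nu = 1 and mu = 2 for this input.
rng = random.Random(0)
sample = [rng.expovariate(2.0) for _ in range(20000)]
print(log_moment_estimates(sample))
```

Because both estimators are explicit functions of two sample moments, this baseline is cheap to compute, which makes it a natural point of comparison in simulation studies.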
Given
By taking the ratio between the two equations in Eq. (9), we obtain
applying the logarithm, hence
Then, the Definition 4.2 below states an estimator for the parameter
(
Estimator).
Given
where
Consequently, we propose to estimate the parameter
where
Let
where
Finally, to estimate the parameter
Using simulated data, we compare the proposed estimation method with the usual method in the literature (see Cahoy et al., 2010). For this comparison, we generate
Mean estimates and dispersions of the parameter for simulated FP data with different pairs of parameter values.
Mean estimates and dispersions of the parameter for simulated FP data with different pairs of parameter values.
When the problem is to determine the distribution, a graphical representation of the data is recommended. Such a representation provides support for fitting a probability distribution.
The number of occurrences of changes in bases per Kb for some DNA sequences. 
Distribution of the number of changes in bases per Kb for some DNA sequences. 
Distribution of the inter-arrival intervals between occurrences in the DNA sequence. 
To that end, we first plotted the time series of each DNA sequence in Table 2 (see Fig. 2), where we note that there is no significant variability in the arrival rate; we then consider
.
The relative frequencies of the number of occurrences were plotted (see Fig. 3) to check the adherence to the theoretical distribution given by Eq. (5) of parameter
All DNA sequences shown in bold in Table 2 exhibit the long-range dependence property. For each of these DNA sequences, the conclusion is statistically significant at the 5% significance level. There is no evidence of long-range dependence in the DNA sequences Z98255, AL929410, and AL163207 at the 5% significance level, because the value 0.5 lies within their 95% confidence intervals.
DNA sequence modeling by Fractional Poisson Process
Notations:
.
Both the
Here we model the DNA sequence with the Fractional Poisson Process. We propose a method based on the method of moments for estimating the parameters of the Fractional Poisson Process in the DNA sequence. We also analyze the long-range dependence in various DNA sequences by the detrended fluctuation analysis method. Almost all DNA sequences studied here present the long-range dependence property at the 5% significance level.
The properties of the estimators proposed in this article are supported by simulations. In the future, we intend to obtain mathematical proofs of these properties and to apply this methodology to other areas of research.
