Abstract
Abstract
Characterizing the empirical distribution of the frequency of n-mers is a vital step in understanding the entire genome. This will allow for researchers to examine how complex the genome really is, and move beyond simple, traditional modeling frameworks that are often biased in the presence of abundant and/or extremely rare words. We hypothesize that models based on the negative binomial distribution and its zero-inflated counterpart will characterize the n-mer distributions of genomes better than the Poisson. Our study examined the empirical distribution of the frequency of n-mers (6 ≤ n ≤ 11) in 2,199 genomes. We considered four distributions: Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial (ZINB). The number of genomes that have nullomers in 6-, 7-, and 8-mers was 150, 602 and 2,012, respectively, whereas all of the genomes for the 9-, 10-, and 11-mers had nullomers. In each n-mer considered, the negative binomial model performed the best for at least 93% of the 2,199 genomes; however, a small percentage (i.e., <7%) of the genomes did prefer the ZINB. The negative binomial and zero-inflation distributions extend the traditional Poisson setting and are more flexible in handling overdispersion that can be caused by an increase in nullomers. In an effort to characterize the distribution of the frequency of n-mers, researchers should also consider other discrete distributions that are more flexible and adjust for possible overdispersion.
1. Introduction
D
Nullomers are short DNA sequences that are absent from genomes of humans and other species (Acquisti et al., 2007), and the discovery of these nullomers is somewhat surprising when one considers that the genomes of higher organisms are extremely large (Kidwell, 2002). Hampikian and Andersen (2006) carried out a systematic study of these nullomers, and their work was motivated by the lack of research in examining these universally absent sequences in the hope of discovering insights into the most conserved mechanisms of molecular biology. This work focused on reporting the list of missing n-mers in 12 species, including humans. Nullomers play an important role when classifying prokaryotic sequences, and the naïve Bayes classifier can use a heuristic “linguistic smoothing” to make a better estimate of a nullomer (Chen and Goodman, 1996; Rosen et al., 2008). However, as the amount of genomes available grows, a more accurate estimate can be obtained from a better model of the genome n-mer distribution. Although some work has been done in characterizing the empirical distribution of these nullomers, the noted literature generally focuses on either Gaussian approximations or Poisson processes, and we aim to improve upon this body of work.
Several discrete distributions can be considered when describing the frequency of any DNA word in a given genome, but when overdispersion is present, traditional Poisson processes may not be appropriate. Overdispersion could be present for a variety of reasons, such as excessive zeros and outliers; therefore, distributions like the negative binomial allow for the extra variability present in the count data. To accommodate zero inflation in count data, zero-inflated models also have been developed (Lambert, 1992).
The purpose of this study is to characterize the distribution of the unique n-mer (6 ≤ n ≤ 11) frequency profiles of 2,199 microbial genomes, publicly available as of December 2012. The Materials and Methods section outlines the characteristics of each discrete distribution, as well as the statistical analysis used in examining the 2,199 genomes. In the Results and Discussion section of the article, we describe the empirical distribution of the n-mers and then compare the various discrete distributions (i.e., Poisson, negative binomial, zero-inflated Poisson [ZIP], and zero-inflated negative binomial [ZINB]) presented via Akaike information criterion (AIC) in order to determine the best probabilistic model fit. We also discuss the advantages and disadvantages of each distribution. Finally, the Conclusion section summarizes our main findings and details their importance and relevance.
2. Materials and Methods
We obtained DNA sequences for 2,199 prokaryotic genomes from GenBank (Benson et al., 2012) as of December 2012. The 2,199 genomes were marked as “completed” in the microbial genomes database. For each genomic sequence, the total number of possible n-mers is 4 n . Because we are interested only in fitting low-order Markov models, such as zero-order Markov models, only the frequencies of each nucleotide for a given genome are needed (Chor et al., 2009). We systematically examined four discrete distributions to characterize the behavior of each n-mer (6 ≤ n ≤ 11) frequency. The distributions are outlined below, and the description of our statistical analyses follows.
2.1. Poisson
The Poisson distribution is commonly used to characterize count data. For example, if one is interested in modeling a setting where we are waiting for an occurrence (i.e., waiting for a bus), the number of occurrences in a given time interval can be modeled by this type of distribution. A random variable X, taking values in the nonnegative integers, has a Poisson distribution if
where λ is termed the intensity parameter. A distinct characteristic of this distribution is the fact that the mean and the variance are both equal to λ; therefore, E(X) = V(X) = λ (Casella and Berger, 1990). Unfortunately, when the variance significantly exceeds the mean, overdispersion is present, and this distribution will no longer be appropriate to use. Lastly, one can obtain the moment generating function for the Poisson distribution via a Taylor series, such that
2.2. Negative binomial
The negative binomial distribution is also commonly used to characterize count data, except that this setting focuses on counting the number of Bernoulli trials required to get a fixed number of successes (Casella and Berger, 1990). Assuming a sequence of independent Bernoulli (p) trials, where p is the rate of success, the random variable X denotes the number of trials before the rth success occurs. X has a negative binomial distribution, such that
The name of this distribution comes from the relationship and defining equation for binomial coefficients with negative integers (Feller, 2008), where
Substituting this relationship into the negative binomial distribution results in
The mean of this distribution is
2.3. Zero-inflated Poisson
Lambert (1992) first introduced the ZIP, although Mullahy (1986) initially explored the specification and testing of some modified count data models like the Poisson, while others have discussed this issue in detail (Cohen, 1963; Johnson and Kotz, 1969). Excessive zeros can occur in various settings, such as in manufacturing equipment that is properly aligned and defects are nearly impossible. This distribution provides an environment for modeling excessive zero counts in addition to positive counts. More specifically, this mixture model allows each count to have two possible generating processes, often called states, where a Bernoulli trial determines which processes the observed count follows. A ZIP assumes two states: (1) a perfect zero state and (2) an imperfect state. The perfect state allows only for the count to equal 0 with a certain probability, whereas the imperfect state allows each count to follow the Poisson distribution with another probability, such that X∼0 with probability π for the perfect state and X∼Poisson(λ) with probability (1−π). Therefore, the ZIP distribution is as follows:
The advantage of this distribution and its corresponding regression is that it inflates the number of zeros by mixing a point mass at 0 with a Poisson distribution, and it is an extremely practical way to model count data with both excessive zeros and large counts that may appear as outliers.
2.4. Zero-inflated negative binomial
The ZINB distribution has a mixture form, similar to the ZIP distribution. This mixture distribution allows for a more flexible environment to handle overdispersion, where the nonzero counts can assume a zero-truncated negative binomial distribution. The ZINB distribution is as follows:
where
2.5. Statistical approach
In this study, we are interested in moving beyond the traditional Poisson setting that is often used to characterize the distribution of these DNA words, especially in the presence of overdispersion that may be caused by a significant number of absent words. Our analyses were carried out in two steps, where the first step involved utilizing graphical techniques. Similar to Chor et al. (2009), we looked at the histograms of various genomes in an effort to visually assess the empirical distributions. It allowed us to determine a few things: Is there a spike at zero, and is there a significant amount of skewness in the right tail? We were also interested in the percentage of absent words present, since we hypothesized that the number of nullomers would increase as the size of each n-mer increased. After these initial analyses, our second step involved fitting the data using regression-based models with only a fixed intercept, assuming the various distributions outlined above.
To select a good model in terms of estimating quantities of interest, we chose to use the AIC (Akaike, 1974). In comparing simple versus more complex models, a simple model may be more parsimonious and result in better estimates of the certain characteristics of the true model; however, it may actually be farther from the true model than a more complex model. The AIC statistic judges a model by how close its fitted values tend to be to the true values with respect to a certain expected value. It is defined as
where it penalizes a model for having many parameters (Agresti, 2002). For instance, while the Poisson model has only one parameter to estimate, that is, λ, its zero-inflated counterpart has an additional parameter to estimate, that is, π. Although the AIC penalizes the ZIP for this extra parameter, it may actually be closer to the true underlying model than the Poisson if a significant amount of zero inflation (i.e., absent words) is present. Researchers have proposed benchmarks for judging whether the size of a difference in AIC between models is practically significant. For example, an AIC difference between two models of less than 2 provides little evidence for one over the other; an AIC difference of 10 or more is strong evidence (Burnham and Anderson, 2004). All analyses were conducted in R (R Core Team, 2012) as well as SAS 9.3 (SAS Institute Inc., 2012).
3. Results and Discussion
To characterize the distribution of the 2,199 genomes, we first examined the histograms of the empirical frequencies for various n-mers. Because there were a large number of genomes, we present only the results for all chromosomes associated with Streptococcus_thermophilus_MN_ZLW_002_uid166827 {NC017927.fna} (which we will refer to as S. thermophilus) in Figure 1. Several other genomes are presented in Supplementary Data Files 1–4 (Supplementary Material is available online at www.liebertpub.com/cmb). Figure 1a–f shows the histogram of each n-mer, where the x-axis displays the number of n-mers and the y-axis displays the probability of appearance. Although we considered the ZIP and ZINB in addition to the Poisson and negative binomial distributions, the fit between the Poisson and ZIP as well as that between the negative binomial and ZINB were indistinguishable; therefore, we focus only on presenting the Poisson and negative binomial density curves for ease of interpretation. The histogram of the 6-mers (Fig. 1a) for S. thermophilus is skewed to the right with a small point mass at 0, and a noticeably better fit, assuming an underlying negative binomial distribution as opposed to the Poisson. Figure 1b,c shows the distribution of the 7- and 8-mers, respectively, and the patterns are very consistent, again, with the negative binomial model displaying a better fit to the empirical data as opposed to the Poisson. Figure 1d displays the distribution of 9-mers, where the negative binomial, again, has a noticeably better fit; however, the increasing point mass at zero is now very evident, more so than in the 6-, 7-, and 8-mers settings. In Figure 1e, the negative binomial density curve seems to underestimate a portion of the point mass at 0, whereas the Poisson's density is shifted to the right as to not capture a significant portion of that same point mass at zero. Finally, Figure 1f shows that the probability of appearance for nullomers in S. thermophilus is as large as 0.80 in the 11-mers, where the negative binomial distribution has only a slightly better fit than the Poisson. The overall trend in S. thermophilus for each n-mer shows the efficiency in using the negative binomial distribution over the Poisson.

Empirical frequencies. Empirical frequency of S. thermophilus for
Table 1 shows the distribution of nullomers for each n-mer in the 2,199 genomes considered. The numbers of genomes with nullomers present are 150, 602, and 2,012 in the 6-, 7-, and 8-mers, respectively. All of the 2,199 genomes have nullomers present in the 9-, 10-, and 11-mers, indicating a strong presence of these absent words in these higher order n-mers. The percentage of nullomers in each n-mer is also presented with its corresponding range, where the percentage increases from 0.30 (6-mers) to 49.96 (11-mers). Table 2 shows the AIC results when comparing the various distributions for the n-mers in the 2,199 genomes. More specifically, the table displays the number of times each genome distribution's AIC value was lowest when comparing across the four different models. Overall, the negative binomial distribution has the best fit in all but 8 of the 6-mers, over 98% of the 7-mers, roughly 93% of the 8- and 9-mers, and over 99% of the 10- and 11-mers. There is a small percentage of 9-mers (<7%) and an even smaller percentage of 10-mers (0.73%) and 11-mers (0.09%) where the ZINB has a better fit. The Poisson distribution and its zero-inflated counterpart do not fit this data well across the various n-mers in any of the 2,199 genomes.
Count (N), mean (%), and range (%) of nullomers for each n-mer in 2,199 genomes.
Count and percentage of lowest AIC statistic when comparing the Poisson, negative binomial, ZIP, and ZINB distributions for each n-mer in 2,199 genomes. AIC, Akaike information criterion; NB, negative binomial; ZINB, zero-inflated negative binomial; ZIP, zero-inflated Poisson.
4. Conclusions
The purpose of this study was to characterize the distribution of several n-mers in an effort to model prokaryotic genomes with a more flexible and accurate distribution. Initially, we used graphical approaches in determining how to characterize each distribution. Our histograms show that each n-mer has a unimodal distribution with an extreme right tail. The histograms for 9-, 10-, and 11-mers for the randomly chosen genomes all showed a spike at zero, indicating the possibility of an increase in the frequency of nullomers. These absent words create a layer of complexity when describing the distribution of any DNA sequence; therefore, probabilistic models assuming the Poisson distribution may not be the appropriate choice to fully describe the behavior of a given genome.
Moving beyond the graphical approaches, which are more subjective, we chose to use a more objective method, such as the AIC statistic. Using this statistic showed that, for all 2,199 n-mers considered, at least 93% of the genomes preferred the negative binomial distribution, in comparison to the other three distributions. There were a few genomes that preferred the ZINB, but this percentage was small. The Poisson and ZIP distributions were not the best fit in any of the 2,199 genomes considered.
Interestingly, the preference for the ZINB did not increase with n-mer size and, in fact, 7% of the genomes were best fit with the ZINB in the 8- and 9-mer range. Upon further investigation, as shown in Supplementary Data File 5, there were interesting results. For the 9-mers, every Escherichia coli and many of the Salmonella strains have a better fit with the ZINB. For 8-mers, a significant amount of E. coli and Salmonella are better fit with ZINB over the negative binomial, but the correlation is less clear. However, Shigella, which is sometimes treated as a subgenus of E. coli, is best fit with negative binomial at both of these n-mer sizes (i.e., 8 and 9). Such analysis is revealing particular structures that are phylogenetically independent. Another interesting result is that at the 11-mer level, the only genomes that are better fit by the ZINB are the Candidatus Tremblaya genomes, which have the smallest genomes and have a higher noncoding portion (27%) than most prokaryotes (McCutcheon and Moran, 2012). However, this is not the case for the 10-mers. We hypothesized that the modeling is highlighting particular genomes' n-mer usage at different sizes of n and may be related to “coding” structure, in the case of E. coli and Salmonella, and noncoding structure, in the case of Trembalaya. While it is difficult to make concrete biological conclusions from the n-mer distribution modeling, it provides novel opportunities for further analysis into genome structure.
Our study was able to move beyond examining the characteristics of the common Poisson process, which is often used to illustrate the distribution of prokaryotic genomes. Although the Poisson distribution was not preferred in any of the genomes considered, the advantage of having only one parameter to estimate for this distribution remains. If there are excessive zeros present, the ZIP is often used, but its corresponding parameter estimates can be severely biased if the nonzero counts are overdispersed in relation to the Poisson distribution. The negative binomial and ZINB extend the traditional Poisson setting and are more flexible in handling overdispersion that can be caused by an increase in nullomers as well as outliers. Although these models are more complex and contain extra parameters in need of estimation, they tend to have a better model fit than a Poisson distribution. In an effort to characterize the distribution of the frequency of n-mers, researchers should also consider other discrete distributions that are more flexible and adjust for possible overdispersion. Such models can improve prokaryotic sequence classification and provide insights into genome organization.
Footnotes
Acknowledgments
This work was supported by the National Science Foundation Award No. DMS-1120622. All work was initiated when W.Z. and J.H. were affiliated with Drexel University.
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
