Fast Approximation of Frequent k-Mers and Applications to Metagenomics

Abstract

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how the characterization of the Vapnik–Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA rigorously approximates frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances, providing significant speedups in the analysis of large sequencing data sets.

1. Introduction

The analysis of substrings of length k, called k-mers, is ubiquitous in biological sequence analysis and is among the first steps of processing pipelines for a wide spectrum of applications, including de novo assembly (Pevzner et al., 2001; Zerbino and Birney, 2008), error correction (Kelley et al., 2010; Salmela et al., 2016), repeat detection (Li and Waterman, 2003), genome comparison (Sims et al., 2009), digital normalization (Brown et al., 2012), RNA-seq quantification (Patro et al., 2014; Zhang and Wang, 2014), metagenomic reads classification (Wood and Salzberg, 2014), and binning (Girotto et al., 2016), and fast search-by-sequence over large high-throughput sequencing repositories (Solomon and Kingsford, 2016). A fundamental task in k-mer analysis is to compute the frequency of all k-mers, with the goal to distinguish frequent k-mers from infrequent k-mers (Marçais and Kingsford, 2011; Melsted and Pritchard, 2011). For example, this task is relevant in the analysis of high-throughput sequencing data, since infrequent k-mers are often assumed to result from sequencing errors. For several applications, the computation of k-mer frequencies is among the most computationally demanding steps of the analysis.

Many algorithms have been proposed for computing the exact frequency of all k-mers, such as Tallymer (Kurtz et al., 2008), Jellyfish (Marçais and Kingsford, 2011), BFCounter (Melsted and Pritchard, 2011), DSK (Rizk et al., 2013), KAnalyze (Audano and Vannberg, 2014), Turtle (Roy et al., 2014), KMC 3 (Kokot et al., 2017), and Squeakr-exact (Pandey et al., 2017). These methods typically perform a linear scan of the sequences to analyze and use a combination of parallelism and efficient data structures (such as Bloom filters and Hash tables) to maintain membership and counting information associated with all k-mers.

Since the computation of exact k-mer frequencies is computationally demanding, in particular for large sequence analysis or for high-throughput sequence data sets, recent methods have focused on providing approximate solutions to the problem, improving the time and memory requirements. KmerStream (Melsted and Halldórsson, 2014), Kmerlight (Sivadasan et al., 2016), and ntCard (Mohamadi et al., 2017) proposed streaming approaches for the approximation of the k-mer frequencies histogram. KmerGenie (Chikhi and Medvedev, 2013) performs a linear scan of the input, counting a random subset (chosen before processing the data set) of all possible k-mers to approximate the abundance histogram, providing an exploratory tool to choose the value of k. khmer (Zhang et al., 2014) and the recently proposed Squeakr (Pandey et al., 2017) rely on probabilistic data structures to approximate the counts of individual k-mers. With the only exception of KmerGenie, all these methods process all the k-mer occurrences in the input data set; in addition, none of the aforementioned approximate methods that report the counts of individual k-mers provides simultaneous estimates with rigorous guarantees for all the k-mer counts reported in output.

All the methods cited previously try to estimate the frequency of all k-mers or of all k-mers that appear at least a few times (e.g., twice) in the data set. Although this is crucial in some applications (e.g., in genome assembly k-mers that occur exactly once often represent sequencing errors and it is, therefore, important to estimate the count of all observed k-mers), in other applications this is less justified. For example, in the comparison of high-throughput sequencing metagenomic data sets, abundance-based distances or dissimilarities [e.g., the Bray–Curtis (BC) dissimilarity] between k-mer counts of two data sets are often used (Benoit et al., 2016; Danovaro et al., 2017; Dickson et al., 2017) to assess the distance between the corresponding data sets. In contrast to presence-based distances (e.g., the Jaccard distance; Ondov et al., 2016), abundance-based distances take into account the frequency of each k-mer, with frequent k-mers contributing more to the distance than k-mers that appear with low frequency, but still more than a handful of times, in the data set.

Thus, two natural questions are (1) whether the results obtained considering all k-mers can be estimated by considering the abundances of frequent k-mers only and (2) whether the abundances of frequent k-mers can be computed more efficiently than the counts of all k-mers. Recently, preliminary work (Hrytsenko et al., 2018) has shown that, for the cosine distance and k = 12, the answer to the first question is positive, and in Section 4 we show that this is indeed the case for larger values of k and other abundance-based distances as well as presence-based distances (e.g., the Jaccard distance). To the best of our knowledge, the second question is hitherto unexplored. In addition, considering only frequent k-mers makes it possible to focus on the most reliable information in a metagenomic data set, since a high stochastic variability in low-frequency k-mers is to be expected due to the sampling process inherent in sequencing.

A natural approach to reduce time and memory requirements for frequency estimation problems is to process only a portion of the data, for example, by sampling some parts of a data set. Sampling approaches are appealing because infrequent k-mers naturally tend to appear with lower probability in a sample, making it possible to focus directly on frequent k-mers in subsequent steps. However, major challenges in sampling approaches are (1) to provide rigorous guarantees relating the results obtained by processing the sample and the results that would be obtained from the whole data set and (2) to provide effective bounds on the size of the sample required to achieve such guarantees. The application of sampling to k-mers is even more challenging than in other scenarios since, for values of k in the typical range of interest to applications (e.g., 20–60), even the most frequent k-mers have relatively low frequency in the data. To the best of our knowledge, no approach based on sampling a portion of the input data set has been proposed to approximate frequent k-mers and their frequencies while providing rigorous guarantees.

1.1. Our contribution

We study the problem of approximating frequent k-mers, that is, k-mers that appear with frequency at least a user-defined threshold θ in a high-throughput sequencing data set. In this regard, our contributions are fourfold. First, we provide a rigorous definition of approximation, governed by an accuracy parameter ɛ. Second, we propose a new method, Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to obtain an approximation to the set of frequent k-mers using sampling. SAKEIMA (see Fig. 1) is based on a sampling scheme that goes beyond naive sampling of k-mers and makes it possible to estimate the frequencies of k-mers of relatively low frequency by considering only a fraction of all k-mer occurrences in the data set. Third, we provide analytical bounds on the sample size needed to obtain rigorous guarantees on the accuracy of the estimated k-mer frequencies, with respect to the frequencies measured on the entire data set.

FIG. 1.

SAKEIMA computes a fast and rigorous approximation of the frequent k-mers in a high-throughput sequencing data set by sampling a fraction of all k-mer occurrences in a data set, providing a significant speedup for the computation of k-mers' abundance-based distances between data sets of reads (e.g., in metagenomics). SAKEIMA, Sampling Algorithm for K-mErs approxIMAtion.

Our bounds are based on the notion of Vapnik–Chervonenkis (VC) dimension, a fundamental concept from statistical learning theory (Shalev-Shwartz and Ben-David, 2014), which has been used to design efficient algorithms to identify frequent patterns in other scenarios (Riondato and Upfal, 2014; Riondato and Kornaropoulos, 2016; Servan-Schreiber et al., 2018). To our knowledge, ours is the first method that applies concepts from statistical learning to provide a rigorous approximation of the k-mer frequencies. Fourth, we use SAKEIMA to extract frequent k-mers from metagenomic data sets from the Human Microbiome Project (HMP) and to approximate abundance-based and presence-based distances among such data sets, showing that SAKEIMA accurately estimates such distances by analyzing only a fraction of the entire data set, resulting in a significant speedup.

Our approach is orthogonal to previous studies; any exact or approximate algorithm can be applied to the sample extracted by SAKEIMA, which can, therefore, be used before applying previously proposed methods, thus reducing their computational requirements while providing rigorous guarantees on the results w.r.t. the entire data set. Although we present our methodology in the case of finding frequent k-mers from a set of sequences representing a high-throughput sequencing data set of short reads, our results can be applied to data sets of long reads and to whole-genome sequences as well.

2. Preliminaries

Let a data set $D$ be a bag of n reads $D = \{r_{0}, \dots, r_{n - 1}\}$ , where each read r_i, 0 ≤ i ≤ n – 1, is a string of length n_i from an alphabet Σ of cardinality |Σ| = σ. For j∈{0, …, n_i – 1}, let r_i[j] be the j-th character of r_i. For a given integer $k \leq \min_{i} \{n_{i} : r_{i} \in D\}$ , we define a k-mer A as a string of length k from Σ, that is, A∈Σ^k. We say that a k-mer A appears in r_i at position j∈{0, …, n_i – k} if $r_{i} [j + h] = A [h], \forall h \in \{0, \dots, k - 1\}$ . For every i, 0 ≤ i ≤ n – 1, and every j∈{0, …, n_i – k}, we define the indicator function $ϕ_{r_{i}, A} (j)$ that is 1 if the k-mer A appears in r_i at position j, and 0 otherwise. The total number of k-mers in $D$ is $t_{D, k} = \sum_{i = 0}^{n - 1} (n_{i} - k + 1)$ . We define the support $o_{D} (A)$ of a k-mer A as the number of distinct positions in $D$ where A appears: $o_{D} (A) = \sum_{i = 0}^{n - 1} \sum_{j = 0}^{n_{i} - k} ϕ_{r_{i}, A} (j)$ . We define the frequency $f_{D} (A)$ of A in $D$ as the ratio between the number of distinct positions where A appears in $D$ and the total number of k-mers in $D$ : $f_{D} (A) = o_{D} (A) ∕ t_{D, k}$ .
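As a minimal illustration of these definitions, the following Python sketch computes the supports, the total number of k-mers, and the frequencies for a toy data set. The function names `kmer_supports` and `kmer_frequencies` are hypothetical, and reads shorter than k are simply skipped, which is an assumption made here for convenience (the text requires k not to exceed the minimum read length).

```python
from collections import Counter


def kmer_supports(reads, k):
    """Return (supports, total): supports[A] = o_D(A), total = t_{D,k}."""
    supports = Counter()
    total = 0
    for read in reads:
        n_positions = len(read) - k + 1
        if n_positions <= 0:
            continue  # assumption: ignore reads shorter than k
        total += n_positions
        for j in range(n_positions):
            supports[read[j:j + k]] += 1  # k-mer starting at position j
    return supports, total


def kmer_frequencies(reads, k):
    """Return a dict mapping each observed k-mer A to f_D(A) = o_D(A) / t_{D,k}."""
    supports, total = kmer_supports(reads, k)
    return {kmer: count / total for kmer, count in supports.items()}


reads = ["ACGTAC", "GTACGT"]
supports, total = kmer_supports(reads, 3)
freqs = kmer_frequencies(reads, 3)
```

On this toy input each read contributes four 3-mers, so the total is 8 and each of the four distinct 3-mers has support 2.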

2.1. Frequent k-mers and approximations

We are interested in obtaining the set $F K (D, k, θ)$ of frequent k-mers in a data set $D$ with respect to a minimum frequency threshold θ, defined as follows.

Definition 1. Given a data set $D$ , an integer k > 0, and a frequency threshold θ∈(0, 1), the set $F K (D, k, θ)$ of frequent k-mers in $D$ w.r.t. θ is the collection of all k-mers with frequency at least θ in $D$ and of their corresponding frequencies in $D$ : $F K (D, k, θ) = \{(A, f_{D} (A)) : f_{D} (A) \geq θ\} .$ (1)

$F K (D, k, θ)$ can be computed with a single scan of all the k-mers occurrences in $D$ maintaining the k-mers supports in an appropriate data structure; however, when $D$ is extremely large and k is not small, the exact computation of $F K (D, k, θ)$ is extremely demanding in terms of time and memory, since the number of k-mers grows exponentially with k. In this case, a fast to compute approximation of the set $F K (D, k, θ)$ may be preferable, provided it ensures rigorous guarantees on its quality. In this study, we focus on the following approximation.

Definition 2. Given a data set $D$ , an integer k > 0, a frequency threshold θ∈(0, 1), and a constant ɛ ∈(0, θ), an ɛ-approximation of $F K (D, k, θ)$ is a collection C = {(A, f_A): f_A∈(0, 1)} such that

for any $(A, f_{D} (A)) \in F K (D, k, θ)$ there is a pair (A, f_A) ∈C;

for any (A, f_A) ∈C it holds that $f_{D} (A) \geq θ - ε$ ; and

for any (A, f_A) ∈C it holds that $| f_{D} (A) - f_{A} | \leq ε ∕ 2$ .

Definition 2 guarantees that every frequent k-mer of $D$ is in the approximation and that no k-mer with frequency <θ – ɛ is in the approximation. The third condition guarantees that the estimated frequency f_A of A in the approximation is close (i.e., within ɛ/2) to the frequency $f_{D} (A)$ of A in $D$ . It is easy to show that obtaining an ɛ-approximation of $F K (D, k, θ)$ with absolute certainty requires processing all k-mers in $D$ .
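The three conditions of Definition 2 can be checked mechanically when the true frequencies are known. The sketch below assumes that both the candidate collection C and the true frequencies are given as Python dictionaries (k-mers missing from the latter are taken to have frequency 0); the function name `is_eps_approximation` is hypothetical.

```python
def is_eps_approximation(C, true_freqs, theta, eps):
    """Check the three conditions of Definition 2.

    C: dict mapping k-mer -> estimated frequency f_A.
    true_freqs: dict mapping k-mer -> f_D(A); absent k-mers have frequency 0.
    """
    # Condition 1: every frequent k-mer must appear in C.
    for kmer, f_true in true_freqs.items():
        if f_true >= theta and kmer not in C:
            return False
    for kmer, f_est in C.items():
        f_true = true_freqs.get(kmer, 0.0)
        # Condition 2: no reported k-mer has true frequency below theta - eps.
        if f_true < theta - eps:
            return False
        # Condition 3: each estimate is within eps/2 of the true frequency.
        if abs(f_true - f_est) > eps / 2:
            return False
    return True


true_freqs = {"AAA": 0.625, "AAC": 0.25, "ACG": 0.125}
ok = is_eps_approximation({"AAA": 0.6, "AAC": 0.3}, true_freqs, 0.5, 0.3)
```

Here "AAC" is not frequent (0.25 < θ = 0.5), but it may still be reported because its frequency is at least θ – ɛ = 0.2.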

2.2. Simple sampling-based algorithms and bounds

We aim to provide an approximation to $F K (D, k, θ)$ with sampling, by processing only randomly selected portions of $D$ . The simplest sampling scheme is that in which a random sample is a bag P of m positions taken uniformly at random, with replacement, from the set $P_{D, k} = \{(i, j) : i \in [0, n - 1], j \in [0, n_{i} - k]\}$ (note that $| P_{D, k} | = t_{D, k}$ ) of all positions where k-mers occur in the data set $D$ , corresponding to m occurrences of k-mers (with repetitions) taken uniformly at random. Given such a sample P, an integer k > 0, and a minimum frequency threshold θ∈(0, 1), one can define the set of frequent k-mers (and their frequencies) in the sample P as FK(P, k, θ) = {(A, f_P(A)): f_P(A) ≥ θ}, where f_P(A) is the frequency of k-mer A in the sample.
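The simple sampling scheme can be sketched in Python as follows. The function name `sample_frequent_kmers` is hypothetical, and materializing the full list of positions is done only for clarity; it is an assumption of this illustration and would not be practical on a real sequencing data set.

```python
import random
from collections import Counter


def sample_frequent_kmers(reads, k, m, threshold, rng=None):
    """Draw m positions uniformly with replacement and return FK(P, k, threshold)."""
    rng = rng or random.Random()
    # All positions (i, j) where a k-mer starts; this is the set P_{D,k}.
    positions = [(i, j) for i, read in enumerate(reads)
                 for j in range(len(read) - k + 1)]
    counts = Counter()
    for _ in range(m):
        i, j = rng.choice(positions)  # uniform, with replacement
        counts[reads[i][j:j + k]] += 1
    # f_P(A) is the fraction of sampled positions at which A occurs.
    return {kmer: c / m for kmer, c in counts.items() if c / m >= threshold}


sample_fk = sample_frequent_kmers(["AAAAAA"], 3, 50, 0.5, random.Random(0))
```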

Obtaining an ɛ-approximation from a random sample with absolute certainty is impossible, thus we focus on obtaining an ɛ-approximation with probability 1 – δ > 0, where δ∈(0, 1) is a confidence parameter, whose value is provided by the user. Intuitively, the set $F K (D, k, θ)$ of frequent k-mers is well approximated by the set of frequent k-mers in a random sample P when P is sufficiently large. One natural question regards how many samples are needed to obtain the desired ɛ-approximation. By using Hoeffding's inequality (Mitzenmacher and Upfal, 2017) to bound the deviation of the frequency of a k-mer A in the sample from $f_{D} (A)$ and a union bound on the maximum number σ^k of k-mers, where σ = |Σ|, we have the following result that provides a first such bound, and a corresponding first algorithm to obtain an ɛ-approximation to $F K (D, k, θ)$ .

Proposition 1. Consider a sample P of size m of $D$ . If $m \geq \frac{2}{ε^{2}} (ln (2 σ^{k}) + ln (\frac{1}{δ}))$ for fixed ɛ ∈(0, θ), δ ∈(0, 1), then, with probability ≥1 – δ, FK(P, k, θ – ɛ/2) is an ɛ-approximation of $F K (D, k, θ)$ .

Proof. We first prove that when $m \geq \frac{2}{ε^{2}} (ln (2 σ^{k}) + ln (\frac{1}{δ}))$ , then, with probability ≥1 – δ, for every k-mer A simultaneously we have $| f_{P} (A) - f_{D} (A) | \leq ε ∕ 2$ .

For an arbitrary k-mer A, given the definition of f_P(A) we have that $f_{P} (A) = \sum_{(i, j) \in P} ϕ_{r_{i}, A} (j) ∕ m$ where $\sum_{(i, j) \in P} ϕ_{r_{i}, A} (j)$ is the sum of m 0–1 independent random variables. Since $ℰ [ϕ_{r_{i}, A} (j)] = f_{D} (A)$ , we have that $ℰ [f_{P} (A)] = f_{D} (A)$ , and by Hoeffding inequality (Mitzenmacher and Upfal, 2017), we have $Pr (| f_{P} (A) - f_{D} (A) | \geq ε) = Pr (|\sum_{(i, j) \in P} ϕ_{r_{i}, A} (j) - m f_{D} (A)| \geq m ε) \leq 2 e^{\frac{- 2 m^{2} ε^{2}}{m}} = 2 e^{- 2 m ε^{2}} .$ (2)

Now define the event $E_{A} =$ “ $| f_{P} (A) - f_{D} (A) | \leq ε ∕ 2$ ” and let $Ē_{A}$ be the complementary event. Applying Equation 2 with ɛ/2 in place of ɛ, and using the choice of m, we obtain $Pr (Ē_{A}) \leq 2 e^{- m ε^{2} ∕ 2} \leq δ ∕ σ^{k}$ . By union bound, the probability that at least one $Ē_{A}$ holds is bounded by $\sum_{A \in Σ^{k}} Pr (Ē_{A}) \leq δ$ . Therefore, with probability at least 1 – δ all events E_A hold.

We now prove that when $| f_{P} (A) - f_{D} (A) | \leq ε ∕ 2$ for every k-mer A, then FK(P, k, θ – ɛ/2) is an ɛ-approximation of $F K (D, k, θ)$ . Consider an arbitrary pair $(A, f_{D} (A)) \in F K (D, k, θ)$ . By the definition of $F K (D, k, θ)$ , we have that $f_{D} (A) \geq θ$ , and, since $| f_{P} (A) - f_{D} (A) | \leq ε ∕ 2$ , we have that f_P(A) ≥ θ – ɛ/2, that is, there is a pair (A, f_A)∈FK(P, k, θ – ɛ/2). Now consider a k-mer A with $f_{D} (A) < θ - ε$ : since $| f_{P} (A) - f_{D} (A) | \leq ε ∕ 2$ we have that $f_{P} (A) \leq f_{D} (A) + ε ∕ 2 < θ - ε ∕ 2$ , that is, there is no pair (A, f_A)∈FK(P, k, θ – ɛ/2).

In addition, by using known results in statistical learning theory (Vapnik and Chervonenkis, 1971; Mitzenmacher and Upfal, 2017) relating the VC dimension (see Section 3 for its definition) of a family of functions to a newly derived bound on the family of functions ${f_{D} (A)}$ , we obtain the following improved bound and algorithm. (The derivation is given in Supplementary Appendix).

Proposition 2. Let P be a sample of size m of $D$ . For fixed ɛ∈(0, θ), δ∈(0, 1), if $m \geq \frac{2}{ε^{2}} (1 + ln (\frac{1}{δ}))$ , then FK(P, k, θ – ɛ/2) is an ɛ-approximation for $F K (D, k, θ)$ with probability ≥1 – δ.
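To give a feel for the difference between the two bounds, the following sketch evaluates the sample sizes of Propositions 1 and 2. The function names are hypothetical, and the default alphabet size σ = 4 (DNA) is an assumption of this example.

```python
import math


def hoeffding_sample_size(eps, delta, k, sigma=4):
    """Proposition 1: m >= (2 / eps^2) * (ln(2 * sigma^k) + ln(1 / delta))."""
    log_term = math.log(2) + k * math.log(sigma)  # ln(2 * sigma^k)
    return math.ceil(2 / eps ** 2 * (log_term + math.log(1 / delta)))


def vc_sample_size(eps, delta):
    """Proposition 2: m >= (2 / eps^2) * (1 + ln(1 / delta))."""
    return math.ceil(2 / eps ** 2 * (1 + math.log(1 / delta)))


m_hoeffding = hoeffding_sample_size(0.1, 0.05, 31)
m_vc = vc_sample_size(0.1, 0.05)
```

For ɛ = 0.1, δ = 0.05, and k = 31, the first bound is more than ten times larger than the second, and the gap grows linearly with k.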

3. Advanced and Practical Bounds and Algorithms for k-Mer Approximations

Although the bound of Proposition 2 significantly improves on the simple bound of Proposition 1, since the factor ln(2σ^k) has been reduced to 1, it still has an inverse quadratic dependency on the accuracy parameter ɛ, which is problematic when the quantities to estimate are small. In these cases, one needs a small ɛ to produce a meaningful approximation (since ɛ < θ), and the inverse quadratic dependence of the sample size on ɛ often results in a sample size larger than the entire input, defeating the purpose of sampling. The case of k-mers is particularly challenging, since the sum $\sum_{A \in Σ^{k}} f_{D} (A)$ of all k-mer frequencies is exactly 1. Therefore, the higher the number of distinct k-mers appearing in the input, the lower their frequencies will be, with the consequence that θ (and, therefore, ɛ) typically needs to be set to a very low value. For example, a typical data set from the HMP has $n \approx 1 0^{8}$ reads of (average) length $\approx 100$ : therefore, if we are interested in k-mers for k = 31, by setting δ = 0.05, the bound of Section 2.2 gives $ε \approx 1 0^{- 5}$ , that is, only k-mers with frequency ≥10⁻⁵ could be reliably reported by sampling. However, in the data sets we considered, no or a very small number (≤30) of k-mers have frequency ≥10⁻⁵; therefore, according to the result from Section 2.2, we cannot obtain a meaningful approximation of k-mers and their frequencies. In the remainder of this section we develop more refined sampling schemes and estimation techniques, leading to a practical sampling-based algorithm.
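The order of magnitude in this example can be reproduced with a short calculation. The sketch below assumes that the sample may be at most as large as the data set itself (m ≤ t_{D,k}) and solves the bound of Proposition 2 for the smallest admissible ɛ; the variable names are illustrative.

```python
import math

n_reads = 10 ** 8   # approximate number of reads in a typical HMP data set
read_len = 100      # approximate (average) read length
k = 31
delta = 0.05

# Total number of k-mer positions: t_{D,k} = n * (read_len - k + 1).
t_dk = n_reads * (read_len - k + 1)

# Proposition 2 requires m >= (2 / eps^2) * (1 + ln(1 / delta)).
# Setting m = t_{D,k} and solving for eps gives the smallest usable accuracy.
min_eps = math.sqrt(2 * (1 + math.log(1 / delta)) / t_dk)
```

Under these assumptions the smallest admissible ɛ is a few times 10⁻⁵, in line with the order of magnitude stated in the text.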

3.1. Sampling bags of positions and VC dimension bound

We propose a method to provide an efficiently computable approximation to $F K (D, k, θ)$ when the minimum frequency θ is low, by properly defining samples so that any k-mer A will appear in a sample with probability higher than $f_{D} (A)$ , thus lessening the dependence of the sample size on 1/ɛ². For this to be achievable, we need to relax the notion of approximation defined in Section 2. In particular, the guarantees provided by our method in such a relaxed approximation are that all k-mers with frequency at least θ′, with θ′ slightly higher than θ, are reported in output, and that no k-mer having frequency <θ – ɛ is reported in output. (See Proposition 5 for the definition of θ′.) Our experiments show that the fraction of k-mers having frequency∈[θ, θ′) that are not reported is very small. Our method works by sampling bags of positions instead of single positions. In particular, an element of the sample is now a bag of ℓ positions chosen independently at random from the set $P_{D, k}$ of all positions.

Let I_ℓ = {(i₁, j₁), (i₂, j₂), …, (i_ℓ, j_ℓ)} be a bag of ℓ positions for k-mers in $D$ , chosen uniformly at random from the set $P_{D, k}$ . We define the indicator function ${\hat{ϕ}}_{A} (I_{ℓ})$ that, for a given bag I_ℓ of ℓ positions, is equal to 1 if k-mer A appears in at least one of the ℓ positions in I_ℓ and is equal to 0 otherwise. That is, ${\hat{ϕ}}_{A} (I_{ℓ}) = \min \{1, \sum_{(i, j) \in I_{ℓ}} ϕ_{r_{i}, A} (j)\} .$ We define the ℓ-positions sample P_ℓ as a bag of m bags $\{I_{ℓ, 0}, I_{ℓ, 1}, \dots, I_{ℓ, m - 1}\}$ , where each $I_{ℓ, j}$ , 0 ≤ j ≤ m – 1, is a bag of ℓ positions, sampled independently, and ${\hat{f}}_{P_{ℓ}} (A) = \frac{1}{m} \sum_{I_{ℓ, i} \in P_{ℓ}} \frac{{\hat{ϕ}}_{A} (I_{ℓ, i})}{ℓ} .$ (3)

Intuitively, ${\hat{f}}_{P_{ℓ}} (A)$ is the biased version of the unbiased estimator $f_{P_{ℓ}} (A) = \frac{1}{m} \sum_{I_{ℓ, i} \in P_{ℓ}} \frac{\sum_{(i, j) \in I_{ℓ, i}} ϕ_{r_{i}, A} (j)}{ℓ}$ (4)

of $f_{D} (A)$ , where the bias arises from considering a value of 1 every time $\sum_{(i, j) \in I_{ℓ, i}} ϕ_{r_{i}, A} (j) > 1$ .
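The difference between the two estimators can be seen on a small example. In the sketch below, each bag is represented directly by the list of k-mers found at its ℓ sampled positions, which is a simplification made for illustration; the function name `biased_and_unbiased` is hypothetical.

```python
def biased_and_unbiased(bags, kmer):
    """Return (f_hat, f) for a k-mer over m bags of ell sampled k-mers each.

    f_hat counts each bag at most once (Equation 3);
    f counts every occurrence (Equation 4).
    """
    m = len(bags)
    ell = len(bags[0])
    hits = [sum(1 for x in bag if x == kmer) for bag in bags]
    f_hat = sum(min(1, h) for h in hits) / (m * ell)  # capped at 1 per bag
    f_unbiased = sum(hits) / (m * ell)
    return f_hat, f_unbiased


bags = [["AAA", "AAA", "ACG"],
        ["ACG", "CGT", "GTA"],
        ["AAA", "CGT", "GTA"]]
f_hat_aaa, f_aaa = biased_and_unbiased(bags, "AAA")
f_hat_acg, f_acg = biased_and_unbiased(bags, "ACG")
```

"AAA" occurs twice in the first bag, so its biased estimate (2/9) is smaller than its unbiased estimate (3/9), whereas the two estimates coincide for "ACG", which never occurs more than once in a bag.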

In our analysis, we use the VC dimension (Vapnik, 1998; Vapnik and Chervonenkis, 1971), a statistical learning concept that measures the expressivity of a family of binary functions. We define a range space Q as a pair Q = (X, R_X), where X is a finite or infinite set and R_X is a finite or infinite family of subsets of X. The members of R_X are called ranges. Given D ⊂ X, the projection of R_X on D is defined as proj_{R_X}(D) = {r ∩ D: r∈R_X}. We say that D is shattered by R_X if $| proj_{R_{X}} (D) | = 2^{| D |}$ , that is, if every subset of D is obtained as the intersection of D with some range. The VC dimension of Q, denoted as VC(Q), is the maximum cardinality of a subset of X shattered by R_X. If there are arbitrarily large subsets of X shattered by R_X, then VC(Q) = ∞.
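For small finite range spaces, the definition can be checked by brute force. The sketch below is only meant to make the notions of projection and shattering concrete (its running time is exponential); the function names and the toy range space of intervals over four points are assumptions of this example.

```python
from itertools import combinations


def is_shattered(subset, ranges):
    """True if the projection of the ranges on subset has 2^|subset| elements."""
    subset = frozenset(subset)
    projection = {frozenset(r) & subset for r in ranges}
    return len(projection) == 2 ** len(subset)


def vc_dimension(X, ranges):
    """Largest size of a shattered subset of the finite ground set X."""
    best = 0
    for size in range(1, len(X) + 1):
        if any(is_shattered(c, ranges) for c in combinations(X, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


points = [1, 2, 3, 4]
# Ranges: all contiguous intervals of points, including the empty one.
intervals = [set(range(a, b)) for a in range(1, 6) for b in range(a, 6)]
vc_intervals = vc_dimension(points, intervals)
```

Intervals can shatter any two points but no three points, since no interval contains the two outer points while excluding the middle one.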

A finite bound on the VC dimension of a range space Q implies a bound on the number of random samples required to obtain a good approximation of its ranges, defined as follows.

Definition 3. Let Q = (X, R_X) be a range space and let D be a finite subset of X. For ɛ ∈(0, 1], a subset B of D is an ɛ-approximation of D if for all r∈R_X we have $|\frac{| D \cap r |}{| D |} - \frac{| B \cap r |}{| B |}| \leq ε ∕ 2 .$

The following result (Mitzenmacher and Upfal, 2017) relates ɛ and the probability that a random sample of size m is an ɛ-approximation for a range space of VC dimension at most v.

Proposition 3 (Mitzenmacher and Upfal, 2017). There is an absolute positive constant c such that if (X, R_X) is a range space of VC dimension at most v, D is a finite subset of X, and 0 < ɛ, δ < 1, then a random subset B ⊂ D of cardinality m with $m \geq \frac{4 c}{ε^{2}} (v + ln (\frac{1}{δ}))$ is an ɛ-approximation of D with probability at least 1 – δ.

The universal constant c has been experimentally estimated to be at most 0.5 (Löffler and Phillips, 2009).

We now prove an upper bound to the VC dimension VC(Q) of the range space Q associated with the class of functions ${\hat{ϕ}}_{A}$ that grows sublinearly with respect to ℓ. To this aim, we first define the range space associated with bags of ℓ positions of k-mers.

Definition 4. Let $D$ be a data set of n reads and let k and ℓ be two integers ≥1. We define $Q = (X_{D, k, ℓ}, R_{D, k, ℓ})$ to be the following range space:

$X_{D, k, ℓ}$ is the set of all bags of ℓ positions of k-mers in $D$ , that is, the set of all possible subsets, with repetitions, of size ℓ from $P_{D, k}$ ;

$R_{D, k, ℓ} = {P_{D, ℓ} (A) | A \in Σ^{k}}$ is the family of sets of starting positions of k-mers, such that for each k-mer A, the set $P_{D, ℓ} (A)$ is the set of all bags of ℓ starting positions in $D$ where A appears at least once.

We prove the following results on the VC dimension of the mentioned range space.

Proposition 4. Let Q be the range space from Definition 4. Then $V C (Q) \leq ⌊{log}_{2} (ℓ)⌋ + 1$ .

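Proof. If VC(Q) ≥ v, then there must exist a set $Z \subseteq X_{D, k, ℓ}$ with |Z| ≥ v that is shattered. This means that $2^{v}$ subsets of Z must be in the projection of $R_{D, k, ℓ}$ on Z. If this is true, then every element of Z needs to belong to exactly $2^{v - 1}$ such sets. Since distinct sets in the projection arise from distinct ranges, that is, from distinct k-mers, every element of Z needs to contain at least $2^{v - 1}$ distinct k-mers. An element of Z is a bag of ℓ positions and, therefore, contains at most ℓ distinct k-mers, so that $2^{v - 1} \leq ℓ$ . This implies that $v \leq {log}_{2} (ℓ) + 1$ , and the thesis follows.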

Using the mentioned result, we prove the following.

Proposition 5. Let ℓ ≥ 1 be an integer and P_ℓ be a bag of m bags of ℓ positions of $D$ with $m \geq \frac{2}{{(ℓ ε)}^{2}} (⌊{log}_{2} min (2 ℓ, σ^{k})⌋ + ln (\frac{1}{δ})) .$ (5)

Then, with probability at least 1 – δ:

for any k-mer $A \in F K (D, k, θ)$ such that $f_{D} (A) \geq θ' = 1 - {(1 - ℓ θ)}^{1 ∕ ℓ}$ it holds ${\hat{f}}_{P_{ℓ}} (A) \geq θ - ε ∕ 2$ ;

for any k-mer A with ${\hat{f}}_{P_{ℓ}} (A) \geq θ - ε ∕ 2$ it holds $f_{D} (A) \geq θ - ε$ ;

for any k-mer $A \in F K (D, k, θ)$ it holds $f_{D} (A) \geq {\hat{f}}_{P_{ℓ}} (A) - ε ∕ 2$ ;

for any k-mer A with ${\hat{f}}_{P_{ℓ}} (A) - ε ∕ 2 \geq 0$ , it holds $f_{D} (A) \geq 1 - {(1 - ℓ ({\hat{f}}_{P_{ℓ}} (A) - ε ∕ 2))}^{1 ∕ ℓ}$ ;

for any k-mer A with $ℓ ({\hat{f}}_{P_{ℓ}} (A) + ε ∕ 2) \leq 1$ it holds $f_{D} (A) \leq 1 - {(1 - ℓ ({\hat{f}}_{P_{ℓ}} (A) + ε ∕ 2))}^{1 ∕ ℓ}$ .
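The two quantities that drive Proposition 5, the sample size of Equation 5 and the relaxed threshold θ′, are easy to evaluate numerically. The sketch below uses hypothetical function names, assumes σ = 4 by default, and assumes ℓθ ≤ 1 so that θ′ is well defined.

```python
import math


def bag_sample_size(eps, delta, ell, k, sigma=4):
    """Equation 5: number m of bags of ell positions."""
    # floor(log2(min(2 * ell, sigma^k))), computed exactly on integers.
    vc_bound = min(2 * ell, sigma ** k).bit_length() - 1
    return math.ceil(2 / (ell * eps) ** 2 * (vc_bound + math.log(1 / delta)))


def theta_prime(theta, ell):
    """Relaxed threshold: theta' = 1 - (1 - ell * theta)^(1 / ell)."""
    if ell * theta > 1:
        raise ValueError("theta' is defined only when ell * theta <= 1")
    return 1 - (1 - ell * theta) ** (1 / ell)


m_single = bag_sample_size(0.1, 0.05, 1, 31)   # ell = 1 recovers Proposition 2
m_bags = bag_sample_size(0.1, 0.05, 8, 31)
tp = theta_prime(1e-3, 500)
```

With ℓ = 1 the bound coincides with that of Proposition 2, while larger ℓ reduces the number of bags; in this example θ′ exceeds θ = 10⁻³ by less than 40%.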

Proof. For a given k-mer A, consider the event $E_{A} =$ “ $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2$ ”. Note that it is equivalent to “ $| ℰ [ℓ {\hat{f}}_{P_{ℓ}} (A)] - ℓ {\hat{f}}_{P_{ℓ}} (A) | \leq ℓ ε ∕ 2$ ” and that $ℓ {\hat{f}}_{P_{ℓ}} (A) = \frac{1}{m} \sum_{i = 0}^{m - 1} {\hat{ϕ}}_{A} (I_{ℓ, i})$ , therefore, $ℰ [ℓ {\hat{f}}_{P_{ℓ}} (A)] = ℰ [{\hat{ϕ}}_{A} (I_{ℓ, i})]$ . Now note that if for the range space $Q = (X_{D, k, ℓ}, R_{D, k, ℓ})$ we consider $r_{A} = P_{D, ℓ} (A)$ , we have that $\frac{| X_{D, k, ℓ} \cap r_{A} |}{| X_{D, k, ℓ} |} = ℰ [{\hat{ϕ}}_{A} (I_{ℓ, i})]$ , since $I_{ℓ, i}$ is a bag of ℓ positions taken uniformly at random among all possible such bags and, therefore, $ℰ [{\hat{ϕ}}_{A} (I_{ℓ, i})]$ is the fraction of bags of ℓ positions that contain at least a position where A occurs (i.e., $ℰ [{\hat{ϕ}}_{A} (I_{ℓ, i})]$ is w.r.t. the uniform distribution over bags of ℓ positions). Therefore, combining Proposition 4 and Proposition 3, for the given choice of m, we have that with probability at least 1 – δ it holds that $| ℰ [ℓ {\hat{f}}_{P_{ℓ}} (A)] - ℓ {\hat{f}}_{P_{ℓ}} (A) | \leq ℓ ε ∕ 2, \forall A$ , or, equivalently, $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2, \forall A$ : we assume that this holds in the rest of the proof.

Consider a k-mer A with frequency $f_{D} (A)$ in $D$ . From the definition of ${\hat{f}}_{P_{ℓ}} (A)$ , we have $ℰ [{\hat{f}}_{P_{ℓ}} (A)] \leq ℰ [f_{P_{ℓ}} (A)] = f_{D} (A)$ . Let $X_{i} = {\hat{ϕ}}_{A} (I_{ℓ, i}) ∕ ℓ$ be the random variable taking value 1/ℓ if the k-mer A appears at least once in the ℓ positions of $I_{ℓ, i}$ , and value 0 otherwise. We have that $ℰ [{\hat{f}}_{P_{ℓ}} (A)] = \frac{1}{m} \sum_{I_{ℓ, i} \in P_{ℓ}} ℰ [X_{i}] = \frac{1}{m} \sum_{I_{ℓ, i} \in P_{ℓ}} \frac{1}{ℓ} Pr ({\hat{ϕ}}_{A} (I_{ℓ, i}) = 1) = (1 - {(1 - f_{D} (A))}^{ℓ}) ∕ ℓ$ . Now consider a k-mer A with $f_{D} (A) \geq 1 - {(1 - ℓ θ)}^{1 ∕ ℓ}$ . By the mentioned derivation, we have that $ℰ [{\hat{f}}_{P_{ℓ}} (A)] \geq θ$ and, therefore, since $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2$ , its frequency ${\hat{f}}_{P_{ℓ}} (A)$ in the sample P_ℓ satisfies ${\hat{f}}_{P_{ℓ}} (A) \geq θ - ε ∕ 2$ , which completes the proof of the first part.
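The step from the condition on $f_{D} (A)$ to the lower bound on the expectation of the biased estimator can be spelled out as follows, assuming ℓθ ≤ 1 so that the ℓ-th root is well defined. Both sides of the first inequality are nonnegative after rearranging, so raising them to the power ℓ preserves the direction of the inequality.

```latex
\begin{align*}
f_{D}(A) \geq 1 - (1 - \ell\theta)^{1/\ell}
  &\iff 1 - f_{D}(A) \leq (1 - \ell\theta)^{1/\ell} \\
  &\iff (1 - f_{D}(A))^{\ell} \leq 1 - \ell\theta \\
  &\iff \frac{1 - (1 - f_{D}(A))^{\ell}}{\ell} \geq \theta .
\end{align*}
```

The left-hand side of the last inequality is exactly the expectation of the biased estimator derived in the proof.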

For the second part, consider a k-mer A with $f_{D} (A) < θ - ε$ . By the mentioned derivation, we have that $ℰ [{\hat{f}}_{P_{ℓ}} (A)] \leq ℰ [f_{P_{ℓ}} (A)] = f_{D} (A) < θ - ε$ . Since $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2, \forall A$ , we have that ${\hat{f}}_{P_{ℓ}} (A) < θ - ε ∕ 2$ , which proves the second part of the result.

The third result follows from $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2$ and $ℰ [{\hat{f}}_{P_{ℓ}} (A)] \leq f_{D} (A)$ .

The last two results follow from $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | \leq ε ∕ 2$ and $ℰ [{\hat{f}}_{P_{ℓ}} (A)] = (1 - {(1 - f_{D} (A))}^{ℓ}) ∕ ℓ$ .

Note that from Proposition 5, the set $\{(A, f_{P_{ℓ}} (A)) : {\hat{f}}_{P_{ℓ}} (A) \geq θ - ε ∕ 2\}$ is almost an ɛ-approximation to $F K (D, k, θ)$ : in particular, there may be k-mers A for which $ℰ [{\hat{f}}_{P_{ℓ}} (A)] = (1 - {(1 - f_{D} (A))}^{ℓ}) ∕ ℓ < θ$ while $f_{D} (A) = ℰ [f_{P_{ℓ}} (A)] \geq θ$ and such that for the given sample P_ℓ we have ${\hat{f}}_{P_{ℓ}} (A) \approx ℰ [{\hat{f}}_{P_{ℓ}} (A)] - ε ∕ 2$ . Although this can happen, we can limit the probability of this happening by appropriately choosing ℓ, and still enjoy the reduction in sample size of the order of $\frac{{log}_{2} ℓ}{ℓ^{2}}$ w.r.t. Proposition 2 obtained by considering a bag of bags of ℓ positions. In particular, this result allows the user to set θ, ɛ, δ, and ℓ to effectively find, with probability at least 1 – δ, all frequent k-mers A for which $f_{D} (A) \geq θ'$ and not report any k-mer with frequency <θ – ɛ, while still being able to report in output almost all k-mers with frequency∈[θ, θ′). Our experimental analysis (Section 4) shows that in practice choosing ℓ slightly below 1/θ is very effective to obtain such a result.

Then, the third, fourth, and fifth guarantees from Proposition 5 state that we can use the biased estimates ${\hat{f}}_{P_{ℓ}} (A)$ to derive guaranteed upper and lower bounds to $f_{D} (A)$ that will be much tighter than those obtained using the bounds of Section 2.2. We show how to obtain further improved upper and lower bounds to $f_{D} (A)$ in Section 3.3. Such lower bounds ℓb_A can be used, for example, to prove that the set {(A, f_{P_ℓ} (A)): ℓb_A ≥ θ – ɛ} enjoys the same last four guarantees from Proposition 5, while the first guarantee holds for a θ′ < 1 – (1 – ℓθ)^{1/ℓ}; therefore, when false negatives are problematic, the set {(A, f_{P_ℓ} (A)): ℓb_A ≥ θ – ɛ} can be used to obtain a different approximation of $F K (D, k, θ)$ with fewer false negatives.
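The fourth and fifth guarantees of Proposition 5 translate directly into an interval for $f_{D} (A)$ given the biased estimate. The sketch below uses a hypothetical function name and falls back to the trivial bounds 0 and 1 when the respective preconditions of Proposition 5 are not met, which is an assumption of this illustration. It does not implement the improved bounds of Section 3.3.

```python
def frequency_bounds(f_hat, eps, ell):
    """Lower and upper bounds on f_D(A) from the biased estimate f_hat."""
    low_arg = f_hat - eps / 2
    if low_arg >= 0:
        # Fourth guarantee: f_D(A) >= 1 - (1 - ell * (f_hat - eps/2))^(1/ell).
        lower = 1 - (1 - ell * low_arg) ** (1 / ell)
    else:
        lower = 0.0  # assumption: trivial bound when the precondition fails
    high_arg = f_hat + eps / 2
    if ell * high_arg <= 1:
        # Fifth guarantee: f_D(A) <= 1 - (1 - ell * (f_hat + eps/2))^(1/ell).
        upper = 1 - (1 - ell * high_arg) ** (1 / ell)
    else:
        upper = 1.0  # assumption: trivial bound when the precondition fails
    return lower, upper


lower_1, upper_1 = frequency_bounds(0.3, 0.1, 1)
lower_100, upper_100 = frequency_bounds(0.004, 0.002, 100)
```

For ℓ = 1 the interval reduces to the estimate plus or minus ɛ/2, as in Section 2.2.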

3.2. SAKEIMA: an efficient algorithm to approximate frequent k-mers

We now present SAKEIMA, which builds on Proposition 5 and efficiently samples a bag P_ℓ of bags of ℓ positions from $D$ to obtain an approximation of the set $F K (D, k, θ)$ with probability at least 1 – δ, where δ is a parameter provided by the user.

SAKEIMA is described in Algorithm 1. Although SAKEIMA performs a linear scan of the input data set, it practically reduces the number of k-mers that need to be processed with the following strategy.

SAKEIMA performs a pass on the stream of k-mers appearing in $D$ , and for each position in the stream it draws the number a of times that the position appears in the sample P_ℓ independently at random from the Poisson distribution Poisson(λ) of parameter $λ = m ℓ ∕ t_{D, k}$ . SAKEIMA stores such values, if strictly positive, in a counting structure T (lines 3–7) that keeps, for each k-mer A, the total number of occurrences of A in the sample P_ℓ. Note that $t_{D, k}$ can be computed with a very quick linear scan of the data set, where n_i is computed for every $r_{i} \in D$ without extracting and processing (e.g., inserting or updating information for) k-mers; alternatively a lower bound to $t_{D, k}$ can be used, simply resulting in a number of samples higher than needed. For each k-mer A appearing at least once in the sample, the unbiased estimate f_A is computed in line 11 as the number T[A] of occurrences of A in the sample P_ℓ divided by the total number of positions in the sample t.

The biased estimate ${\hat{f}}_{A}$ can be computed by partitioning the T[A] occurrences of A into m bags I_{ℓ,0}, …, I_{ℓ,m–1}; ${\hat{f}}_{A}$ is then simply the ratio between the number of bags where A appears at least once and mℓ. We describe a more efficient way of computing this biased estimate at the end of this section. SAKEIMA then flags A as frequent if ${\hat{f}}_{A} \geq θ - ε ∕ 2$ (line 14) and, in this case, the pair (A, f_A) is added to the output set $O$ (which is returned at line 15), since f_A is the best (and unbiased) estimate of $f_{D} (A)$ .

Algorithm 1. SAKEIMA

Input: data set $D$ , total number $t_{D, k}$ of k-mers in $D$ , frequency threshold θ, accuracy parameter ɛ∈(0, θ), confidence parameter δ∈(0, 1), integer ℓ ≥ 1.

Output: approximation {(A, f_A)} of $F K (D, k, θ)$ with probability ≥1 – δ.

1 $m \leftarrow ⌈\frac{2}{{(ℓ ε)}^{2}} (⌊{\log}_{2} \min (2 ℓ, σ^{k})⌋ + \ln (\frac{2}{δ}))⌉$ ; $λ \leftarrow \frac{m ℓ}{t_{D, k}}$ ;
2 T ← empty hash table;
3 forall the reads $r_{i} \in D$ do
4     forall the j∈[0, n_i – k] do
5         A ← k-mer in position j of read r_i;
6         a ← Poisson(λ);
7         if a > 0 then T[A] ← T[A] + a;
8 $O \leftarrow \emptyset$ ; $t \leftarrow \sum_{A \in T} T [A]$ ;
9 P_ℓ ← random partition of the occurrences in T into m bags;
10 forall the k-mers A∈T do
11     f_A ← T[A]/t;
12     P_A ← bags of P_ℓ where A appears at least once;
13     ${\hat{f}}_{A} \leftarrow | P_{A} | ∕ (m ℓ)$ ;
14     if ${\hat{f}}_{A} \geq θ - ε ∕ 2$ then $O \leftarrow O \cup \{(A, f_{A})\}$ ;
15 return $O$ ;
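The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the C++ implementation evaluated in Section 4: the function names `sample_size`, `poisson`, and `sakeima`, the default alphabet size `sigma=4`, and the use of Knuth's method to draw Poisson variates are assumptions made here for illustration. The random partition of line 9 is realized by assigning each sampled occurrence to a uniformly random bag, and the sketch assumes that the input contains at least one k-mer position.

```python
import math
import random
from collections import Counter


def sample_size(eps, delta, ell, k, sigma=4):
    """Number m of bags, as in line 1 of Algorithm 1."""
    d = math.floor(math.log2(min(2 * ell, sigma ** k)))
    return math.ceil(2.0 / (ell * eps) ** 2 * (d + math.log(2.0 / delta)))


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (adequate for small lam)."""
    threshold, count, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def sakeima(reads, k, theta, eps, delta, ell, sigma=4, seed=0):
    """Return {k-mer: unbiased frequency estimate} for k-mers flagged frequent."""
    rng = random.Random(seed)
    # Total number of k-mer positions t_{D,k}; assumed to be positive.
    t_dk = sum(max(len(r) - k + 1, 0) for r in reads)
    m = sample_size(eps, delta, ell, k, sigma)
    lam = m * ell / t_dk
    table = Counter()
    for r in reads:                                   # lines 3-7
        for j in range(len(r) - k + 1):
            a = poisson(lam, rng)
            if a > 0:
                table[r[j:j + k]] += a
    t = sum(table.values())                           # line 8
    out = {}
    for kmer, c in table.items():                     # lines 10-14
        f_a = c / t                                   # unbiased estimate
        bags = {rng.randrange(m) for _ in range(c)}   # bags that contain the k-mer
        f_hat = len(bags) / (m * ell)                 # biased estimate
        if f_hat >= theta - eps / 2:
            out[kmer] = f_a
    return out
```

For instance, `sakeima(["A" * 1000], 3, 0.2, 0.1, 0.1, 2)` reports the single k-mer `AAA` with estimated frequency 1.0, since every sampled position carries the same k-mer.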

Note that SAKEIMA does not sample m bags of exactly ℓ positions each, since the number of occurrences of each position of $D$ in the sample P_ℓ is drawn independently from a Poisson distribution, even though the expected total number of occurrences sampled by the algorithm is mℓ. However, the independent Poisson distributions used by SAKEIMA provide an accurate approximation of the random sampling of exactly mℓ positions used in the analysis of Section 3.1. In particular, this holds when one focuses on the events of interest for our approximation of Section 3.1 (e.g., the event “there exists a k-mer A such that $| ℰ [{\hat{f}}_{P_{ℓ}} (A)] - {\hat{f}}_{P_{ℓ}} (A) | > ε ∕ 2$ ”). In fact, a simple adaptation of a known result (corollary 5.11 of Mitzenmacher and Upfal, 2017) on the relation between sampling with replacement and the use of independent Poisson distributions gives the following.

Proposition 6. Let E be an event whose probability is either monotonically increasing or monotonically decreasing in the number of sampled positions. If E has probability p when the independent Poisson distributions are used, then E has probability at most 2p when the sampling with replacement is used.

As a simple corollary, the output $O$ features the guarantees of Proposition 5 with probability ≥1 – δ^′, with δ^′ = 2δ.

The technique we just described can also be used to avoid the exact computation of ${\hat{f}}_{A}$ , which requires maintaining and updating the counters for the m buckets; in fact, we can approximate the number of occurrences of a k-mer A, appearing T[A] times in the random sample of SAKEIMA, in a given bucket as a sample from Poisson(T[A]/m). This means that the number of buckets where A is inserted at least once is well approximated by a sample from Binomial(m, 1 – e^{−T[A]/m}), which models the number of successes in m independent trials with probability of success 1 – e^{−T[A]/m}. Owing to this second Poisson approximation, the output $O$ provides the guarantees of Proposition 5 with probability ≥1 – δ^″, with δ^″ = 4δ. In terms of Algorithm 1, this modification simply requires substituting $\frac{2}{δ}$ with $\frac{4}{δ}$ in line 1, removing line 9, and substituting lines 12–13 with “ ${\hat{f}}_{A} \leftarrow \mathrm{Binomial} (m, 1 - e^{- T [A] ∕ m}) ∕ (m ℓ)$ .” This also allows one to efficiently compute multiple values of ${\hat{f}}_{A}$ , corresponding to different values of ℓ, by simply taking samples from binomial distributions with the appropriate parameters. (In particular, if one samples a total t of k-mers, then the value m to be used for both parameters of the binomial distribution is t/ℓ.) The next section shows why this is useful.
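The binomial shortcut described above can be illustrated with a short Python sketch. The names `biased_estimate` and `biased_estimates_multi` are hypothetical, the binomial variate is drawn naively as a sum of m Bernoulli trials, and rounding t/ℓ down to an integer number of bags is an assumption of this sketch.

```python
import math
import random


def biased_estimate(count, m, ell, rng):
    """Approximate the biased estimate without materializing the m bags.

    count plays the role of T[A]; the number of bags containing A is drawn
    from Binomial(m, 1 - exp(-count / m)) and divided by m * ell.
    """
    p = 1.0 - math.exp(-count / m)
    hits = sum(1 for _ in range(m) if rng.random() < p)  # Binomial(m, p) draw
    return hits / (m * ell)


def biased_estimates_multi(count, t, ells, rng):
    """Estimates for several bag sizes ell, using m = t // ell bags for each."""
    return {ell: biased_estimate(count, t // ell, ell, rng) for ell in ells}
```

A call such as `biased_estimates_multi(50, 1000, [10, 5], random.Random(1))` returns one estimate per bag size from the same total count, which is what makes it cheap to evaluate several values of ℓ on a single sample.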

3.3. Improved lower and upper bounds to k-mer frequencies

Note that Proposition 5 guarantees that we can obtain upper and lower bounds to $f_{D} (A)$ for every $A \in F K (D, k, θ)$ from the sample of bags of ℓ positions. These bounds are meaningful only in specific ranges of the frequencies; for example, the lower bound from the third guarantee in Proposition 5 is meaningful when the frequency of A is fairly low, that is, $f_{D} (A) \approx 1 ∕ ℓ$ , whereas for very frequent k-mers the bounds could be a multiplicative factor 1/ℓ away from the correct value. For example, if a k-mer is very frequent and appears in all bags of ℓ k-mers in a sample P_ℓ, its corresponding lower bound is still only 1/ℓ – ɛ/2.

However, Proposition 5 can be generalized to obtain tighter upper and lower bounds to the frequency of all k-mers. For given ℓ, ɛ, and δ, let m be as given in Proposition 5. Note that the total number of k-mers' positions in the sample P_ℓ is mℓ. Let $ℒ$ be a set of integer values $ℒ = \{ℓ_{i}\}$ with $ℓ_{i} \in [1, m ℓ], \forall i = 0, \dots, | ℒ | - 1$ . Now, for every $ℓ_{i} \in ℒ$ , we can partition the same mℓ k-mers that are in P_ℓ into m_i = mℓ/ℓ_i bags of size ℓ_i. Let P_{ℓ_i} be such a random partition of these positions into m_i bags of ℓ_i positions each. Note that each P_{ℓ_i} is a “valid” sample (i.e., a sample of independent bags of positions, each obtained by uniform sampling with replacement) for Proposition 5, even if the P_{ℓ_i}'s are not independent. From each P_{ℓ_i} , we define a maximum deviation ɛ_i from Proposition 5 as $ε_{i} = \frac{1}{ℓ_{i}} \sqrt{\frac{2}{m_{i}} (⌊{\log}_{2} (\min (2 ℓ_{i}, σ^{k}))⌋ + \ln (| ℒ | ∕ δ))}$ . We have the following result.
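As an illustration of the definition of ɛ_i, the following Python sketch computes m_i and ɛ_i for each bag size in a given set. The function name `deviations`, the default alphabet size `sigma=4`, and the use of integer division for m_i (i.e., assuming that each ℓ_i divides mℓ) are assumptions of this sketch.

```python
import math


def deviations(m, ell, ells, delta, k, sigma=4):
    """Return {ell_i: (m_i, eps_i)} for every bag size ell_i in ells."""
    total = m * ell                      # number of positions in the sample
    out = {}
    for ell_i in ells:
        m_i = total // ell_i             # assumption: ell_i divides m * ell
        d_i = math.floor(math.log2(min(2 * ell_i, sigma ** k)))
        # The term log(|L| / delta) accounts for the union bound over ells.
        eps_i = math.sqrt(2.0 / m_i * (d_i + math.log(len(ells) / delta))) / ell_i
        out[ell_i] = (m_i, eps_i)
    return out
```

For example, `deviations(100, 8, [8, 4, 2], 0.1, 31)` yields 100, 200, and 400 bags for the three bag sizes, with a deviation of about 0.048 for ℓ_i = 8.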

Proposition 7. With probability at least 1 – δ, for all k-mers A simultaneously and for all the random partitions induced by $ℒ$ it holds that

$f_{D} (A) \geq \max \{ {\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2 : i = 0, \dots, | ℒ | - 1 \}$ ;

$f_{D} (A) \geq \max \{ 1 - {(1 - ℓ_{i} ({\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2))}^{1 ∕ ℓ_{i}} : i = 0, \dots, | ℒ | - 1 \text{ and } {\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2 \geq 0 \}$ ;

$f_{D} (A) \leq \min \{ 1 - {(1 - ℓ_{i} ({\hat{f}}_{P_{ℓ_{i}}} (A) + ε_{i} ∕ 2))}^{1 ∕ ℓ_{i}} : i = 0, \dots, | ℒ | - 1 \text{ and } {\hat{f}}_{P_{ℓ_{i}}} (A) + ε_{i} ∕ 2 \leq 1 ∕ ℓ_{i} \}$ .

Proof. Combining Proposition 4 and Proposition 3 and by a union bound on the $| ℒ |$ values of i, we have that with probability at least 1 – δ it holds that $| ℰ [{\hat{f}}_{P_{ℓ_{i}}} (A)] - {\hat{f}}_{P_{ℓ_{i}}} (A) | \leq ε_{i} ∕ 2, \forall A$ and $\forall i = 0, \dots, | ℒ | - 1$ : we assume that this holds in the rest of the proof. To prove the lower bound, note that since the bags of P_{ℓ_i} contain ℓ_i positions each, $ℰ [{\hat{f}}_{P_{ℓ_{i}}} (A)] = (1 - {(1 - f_{D} (A))}^{ℓ_{i}}) ∕ ℓ_{i}$ ; from the above we have that $(1 - {(1 - f_{D} (A))}^{ℓ_{i}}) ∕ ℓ_{i} \geq {\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2$ , which is equivalent to $f_{D} (A) \geq 1 - {(1 - ℓ_{i} ({\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2))}^{1 ∕ ℓ_{i}}$ when ${\hat{f}}_{P_{ℓ_{i}}} (A) - ε_{i} ∕ 2 \geq 0$ . The proof of the upper bound is analogous.
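The bounds of Proposition 7 can be evaluated for a single k-mer with a few lines of Python. In this sketch the function name `frequency_bounds` and the input layout (one triple per partition) are hypothetical. The inversion uses the bag size ℓ_i of each partition, since the bags of P_{ℓ_i} contain ℓ_i positions, and the base of the power is clamped at 0 to guard against floating-point rounding.

```python
def frequency_bounds(estimates):
    """Lower and upper bounds on f_D(A) in the spirit of Proposition 7.

    estimates: iterable of (ell_i, f_hat_i, eps_i) triples for one k-mer A,
    where f_hat_i is the biased estimate from the partition with bag size ell_i.
    """
    lower, upper = 0.0, 1.0
    for ell_i, f_hat, eps in estimates:
        lo = f_hat - eps / 2
        hi = f_hat + eps / 2
        if lo >= 0:
            # Invert E[f_hat] = (1 - (1 - f)^ell_i) / ell_i at the lower value.
            inv = 1 - max(0.0, 1 - ell_i * lo) ** (1 / ell_i)
            lower = max(lower, lo, inv)
        if hi <= 1 / ell_i:
            # Same inversion at the upper value; valid only if hi <= 1 / ell_i.
            inv = 1 - max(0.0, 1 - ell_i * hi) ** (1 / ell_i)
            upper = min(upper, inv)
    return lower, upper
```

With the two partitions `[(1, 0.3, 0.1), (2, 0.25, 0.1)]`, the first partition determines both bounds and the resulting interval is approximately [0.25, 0.35].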

In our experiments, we use $ℒ = \{ℓ_{i}\}$ with $ℓ_{i} = ℓ ∕ 2^{i}, \forall i \in [0, ⌊{\log}_{2} ℓ⌋ - 1]$ ; in this case, note that P_{ℓ_0} = P_ℓ. Using this scheme, we can compute upper and lower bounds for k-mers having frequencies of many different orders of magnitude, but any (application dependent) distribution can be specified by the user. Then, these upper and lower bounds can be used to obtain different approximations of $F K (D, k, θ)$ with different guarantees. For example, by reporting all k-mers (and their frequencies) that have an upper bound ≥θ, we have an approximation that guarantees that all k-mers A with $f_{D} (A) \geq θ$ are in the approximation.
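The halving scheme for the bag sizes and the upper-bound-based reporting rule can be sketched as follows. Both function names are hypothetical, and rounding ℓ/2^i down to an integer when ℓ is not a power of two is an assumption of this sketch.

```python
def halving_bag_sizes(ell):
    """Bag sizes ell // 2**i for i = 0, ..., floor(log2(ell)) - 1."""
    # For a positive integer, ell.bit_length() - 1 equals floor(log2(ell)).
    return [ell // 2 ** i for i in range(ell.bit_length() - 1)]


def report_by_upper_bound(estimates, upper_bounds, theta):
    """Keep every k-mer whose upper bound on f_D(A) is at least theta."""
    return {kmer: f for kmer, f in estimates.items() if upper_bounds[kmer] >= theta}
```

For example, `halving_bag_sizes(8)` returns `[8, 4, 2]`, and `report_by_upper_bound` keeps a k-mer with a low point estimate as long as its upper bound still reaches θ.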

4. Experimental Results

In this section, we present the results of our experimental evaluation for SAKEIMA. Section 4.1 describes the data sets, our implementation of SAKEIMA, and the baseline for comparisons. In Section 4.2, we report the results for computing the approximation of the frequent k-mers using SAKEIMA. Section 4.3 reports the results of using our approximation to compute abundance-based and presence-based distances between metagenomic data sets.

4.1. Data sets and implementation

We considered six data sets from the HMP, one of the largest publicly available collections of metagenomic data sets from high-throughput sequencing. In particular, we selected the three largest data sets of stool and the three largest of tongue dorsum (Table 1). These data sets constitute the most challenging instances, due to their size, and provide a test case with different degrees of similarity among data sets. We implemented SAKEIMA in C++ as a modification of Jellyfish (Marçais and Kingsford, 2011; the version we used is 2.2.10), a very popular and efficient algorithm for exact k-mer counting. In this way, our algorithm benefits from the succinct counting data structure provided by Jellyfish's publicly available implementation. We remark that our sampling-based approach can be used in combination with any other highly tuned method available for exact, approximate, and parallel k-mer counting. For this reason, we only compare SAKEIMA with the exact counting performed by Jellyfish, since they share the underlying characteristics, allowing us to evaluate the impact of SAKEIMA's sampling strategy.

Table 1.
Data Sets for Our Experimental Evaluation

Data set	$t_{D, k}$	$| D |$	${\max}_{n_{i}}$	${\mathrm{avg}}_{n_{i}}$
SRS024388(s)	7.92 · 10⁹	1.20 · 10⁸	102	97.21
SRS011239(s)	8.13 · 10⁹	1.24 · 10⁸	102	96.69
SRS024075(s)	8.82 · 10⁹	1.38 · 10⁸	96	94.88
SRS075404(t)	7.75 · 10⁹	1.22 · 10⁸	102	94.51
SRS062761(t)	8.26 · 10⁹	1.18 · 10⁸	101	101.00
SRS043663(t)	9.15 · 10⁹	1.31 · 10⁸	101	101.00

For each data set $D$ the table shows the data set name and site [(s) for stool, (t) for tongue dorsum]; the total number $t_{D, k}$ of k-mers (k = 31) in $D$ ; the number $| D |$ of reads it contains; the maximum read length ${\max}_{n_{i}} = {\max}_{i} \{n_{i} : r_{i} \in D\}$ ; and the average read length ${\mathrm{avg}}_{n_{i}} = \sum_{i = 0}^{n - 1} n_{i} ∕ n$ .

For running time and memory, we computed the average from 10 runs. When comparing Jellyfish and SAKEIMA using one worker, we show the CPU time, whereas when using multiple threads we show the overall running time. We did not include the time to compute $t_{D, k}$ in our experiments since we assume it is provided in input (e.g., computed while the data set of reads is created). In cases when it is not known in advance, $t_{D, k}$ can be computed by simply scanning all the k-mers without counting them. We computed the time required for this task for the data sets we consider and it was always small (i.e., always <175 seconds with 1 worker, and <70 seconds with 32 workers) compared with the time for counting k-mers.
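Since a read of length n_i contributes the positions j∈[0, n_i – k], the quantity $t_{D, k}$ depends only on the read lengths. A minimal Python sketch of this scan, with the hypothetical name `total_kmers`, is the following.

```python
def total_kmers(read_lengths, k):
    """Total number of k-mer positions, given only the read lengths n_i."""
    # A read of length n contributes n - k + 1 positions; shorter reads contribute none.
    return sum(n - k + 1 for n in read_lengths if n >= k)
```

For example, for k = 31, three reads of lengths 102, 96, and 20 contain 72 + 66 + 0 = 138 k-mer positions.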

For the computation of the abundance-based distances from the k-mer counts of two data sets, we implemented in C++ a simple algorithm that loads the counts of one data set in main memory and then performs one pass on the counts of the other data set, producing the distances in output. We executed all our experiments on the same machine with 512 GB of RAM and 2.30 GHz Intel Xeon CPUs (with 64 cores in total), compiling both implementations with GCC 8. SAKEIMA can be used in combination with more efficient algorithms and implementations for the computation of these (and other) distances (Benoit et al., 2016), resulting in speedups analogous to those we present below. For all the experiments of SAKEIMA, given θ and a data set $D$ , we fixed the parameters δ = 0.1, $ε = θ - 2 ∕ t_{D, k}$ , and $ℓ = ⌊0.9 ∕ θ⌋$ .
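To make these parameter choices concrete, the following Python sketch derives ɛ, ℓ, the number of bags m from line 1 of Algorithm 1, and the expected fraction mℓ/t_{D,k} of sampled positions. The function name `experiment_parameters` and the alphabet size `sigma=4` are assumptions, and the actual implementation may differ (e.g., in the constant inside the logarithm).

```python
import math


def experiment_parameters(theta, t_dk, k=31, sigma=4, delta=0.1):
    """Parameters used in the experiments for a given theta and data set size."""
    eps = theta - 2.0 / t_dk
    ell = math.floor(0.9 / theta)
    d = math.floor(math.log2(min(2 * ell, sigma ** k)))
    m = math.ceil(2.0 / (ell * eps) ** 2 * (d + math.log(2.0 / delta)))
    return {"eps": eps, "ell": ell, "m": m, "sample_fraction": m * ell / t_dk}
```

Under these assumptions, θ = 2 · 10⁻⁸ and $t_{D, k}$ = 9.15 · 10⁹ (data set SRS043663) give m = 74 bags and an expected sample of roughly 36% of the positions.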

4.2. Approximation of the frequent k-mers

We fixed k = 31, and we compared SAKEIMA with the exact counting of all k-mers (from Jellyfish) in terms of (1) running time, including, for both algorithms, the time required to write the output on disk and (2) memory requirement. We also assessed the accuracy of the output of SAKEIMA.

Figure 2 shows the average running times and peak memory as a function of θ, using one worker. Note that for the exact counting algorithm, these metrics do not depend on θ, since it always counts all k-mers. SAKEIMA is always faster than the exact counting, with a difference that increases when θ increases and a speedup of ∼2 even for θ = 2 · 10⁻⁸. The memory requirement of SAKEIMA decreases when θ increases, and for θ = 2 · 10⁻⁸ it is half of the memory required by the exact counting. This is due to SAKEIMA's sample size being much smaller than the data set size (Fig. 2d); therefore, a large portion of extremely low frequency k-mers are naturally left out from the random sample and do not need to be accounted for in the counting data structure, as confirmed by counting the number of distinct k-mers that are inserted in the counting data structure by the two algorithms (Fig. 2c). (The difference between the memory requirement and the number of distinct k-mers is given by Jellyfish's strategy to double the size of the counting data structure when it is full.)

FIG. 2.

Running time, memory requirements, and number of distinct k-mers counted, for SAKEIMA and exact counting as a function of θ. (a) Running time (average ±2 standard deviations from 10 runs). (b) Memory requirement (the standard deviation is not shown when all the 10 runs have the same peak memory). (c) Number of distinct k-mers counted. (d) Sample sizes of SAKEIMA, total size $t_{D, k}$ of the data sets, and number (c.p.) of the data set's distinct covered positions (i.e., included in SAKEIMA's sample), as a function of θ.

Figure 3 shows the average running times of SAKEIMA and Jellyfish as a function of θ and the number of workers w for counting k-mers from data set SRS043663. The memory used by both approaches does not depend on w and is, therefore, the same as shown in Figure 2. We can see that increasing w reduces the running time of both approaches, and that the relative improvements provided by the sampling strategy of SAKEIMA are maintained. This shows that SAKEIMA is well suited to be combined with parallel approaches.

FIG. 3.

Running time for SAKEIMA and exact counting for data set SRS043663, as a function of θ and the number of workers w.

In terms of quality of the approximation, the output of SAKEIMA satisfied the guarantees given by Proposition 5 in all runs of our experiments, that is, with empirical probability >1 – δ. Although SAKEIMA may incur false negatives, its false negative ratio (i.e., the fraction of k-mers in $F K (D, k, θ)$ not reported by SAKEIMA) is always ≤3 · 10⁻⁴ (Fig. 4a), even if the sampling technique given in Section 3.1 does not provide rigorous guarantees on such quantity. Therefore, SAKEIMA is very effective in reporting almost all frequent k-mers. As mentioned in Section 3.3, SAKEIMA can be easily modified to report all frequent k-mers in output, at the cost of also reporting more k-mers with frequency between θ – ɛ and θ. In addition, the estimated frequencies f_A reported by SAKEIMA are always close to the true values $f_{D} (A)$ , with a small maximum deviation $| f_{A} - f_{D} (A) |$ (Fig. 4b), and an even smaller average deviation (Fig. 4c). Moreover, the upper and lower bounds computed as in Section 3.3 provide small confidence intervals always containing the value $f_{D} (A)$ (e.g., Figure 4d for data set SRS062761), and could be used to obtain sets of k-mers with various guarantees from the sample used by SAKEIMA.

FIG. 4.

Quality of the approximation of $F K (D, k, θ)$ produced by SAKEIMA. (a) False negative rate, that is, the fraction r of k-mers in $F K (D, k, θ)$ not reported by SAKEIMA. (b) Maximum deviation $| f_{A} - f_{D} (A) |$ of the estimates reported by SAKEIMA for various θ. (c) Average value of $| f_{A} - f_{D} (A) |$ for the k-mers A reported by SAKEIMA for various θ. (d) Frequencies and bounds for data set SRS062761 and θ = 10⁻⁸ shown for k-mers sorted in increasing order of exact frequencies. Red: exact frequencies $f_{D} (A)$ . Green: estimate f_A of $f_{D} (A)$ from SAKEIMA. Blue: lower bound lb_A to $f_{D} (A)$ from SAKEIMA. Brown: upper bound ub_A to $f_{D} (A)$ from SAKEIMA.

4.3. Application to metagenomics: computation of ecological distances

We evaluate the use of SAKEIMA to speed up the computation of commonly used k-mer-based ecological distances (Benoit et al., 2016) between data sets of next-generation sequencing reads. We present results for the BC distance; analogous results hold for other distances (Supplementary Appendix).

We first investigated how the distances change when they are computed considering only the frequent k-mers (w.r.t. a frequency threshold θ) instead of the full spectrum of k-mers appearing in the data. Therefore, given a pair of data sets $D_{1}$ and $D_{2}$ and θ, we computed the sets $O_{1} = F K (D_{1}, k, θ)$ and $O_{2} = F K (D_{2}, k, θ)$ using Jellyfish and then computed a generalized version of the distances for all pairs of data sets we used for our experiments. For the BC distance, this generalization is defined as $B C (D_{1}, D_{2}, O_{1}, O_{2}) = 1 - 2 \frac{\sum_{A \in O_{1} \cap O_{2}} \min \{o_{D_{1}} (A), o_{D_{2}} (A)\}}{\sum_{A \in O_{1}} o_{D_{1}} (A) + \sum_{A \in O_{2}} o_{D_{2}} (A)}$ .
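The generalized BC distance defined above can be computed from two dictionaries of abundances, one per data set, restricted to the sets O_1 and O_2. The following Python sketch is only an illustration (it is not the C++ implementation used in the experiments); the function name `bray_curtis` is hypothetical, and at least one of the two dictionaries is assumed to be nonempty.

```python
def bray_curtis(counts1, counts2):
    """Generalized BC distance from two {k-mer: abundance} dictionaries."""
    # Numerator: minimum abundance over the k-mers present in both sets.
    shared = sum(min(c, counts2[kmer]) for kmer, c in counts1.items() if kmer in counts2)
    # Denominator: total abundance of each set, summed over its own k-mers.
    total = sum(counts1.values()) + sum(counts2.values())
    return 1.0 - 2.0 * shared / total
```

For instance, with `{"AAA": 3, "AAC": 1}` and `{"AAA": 1, "ACG": 4}` the shared abundance is 1 and the total is 9, so the distance is 1 – 2/9 ≈ 0.78; identical inputs give 0 and inputs without common k-mers give 1.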

Note that when θ ≤ 10⁻¹⁰ then $F K (D, k, θ)$ coincides with the set of all k-mers, for any of the data sets we tested. The results (Fig. 5a) show that for θ up to 5 × 10⁻⁸, the values of the distances are fairly stable and, therefore, one can use only frequent k-mers for such values of θ to compute the distances, and that for θ up to 10⁻⁷ the relation between distances of different pairs of data sets is almost always conserved. We underline that the exact counting approach needs to count all the k-mers, and only afterward can the infrequent k-mers be filtered out, before writing the output to disk, to compute $F K (D, k, θ)$ . We then used SAKEIMA to extract approximations (of k-mers and their frequencies) of $F K (D_{1}, k, θ)$ and $F K (D_{2}, k, θ)$ and used such approximations to compute the distances among data sets (Fig. 5b). Strikingly, the distances computed from the output of SAKEIMA are very close to their exact variant (Fig. 5c). Interestingly, this also holds for the Jaccard distance, a presence-based distance that depends neither on k-mer abundances nor on k-mer ranking by frequencies.

FIG. 5.

Results for BC distances of metagenomic data sets. (a) BC distance computed using k-mers with frequency ≥θ. (b) BC distances computed using the approximation of k-mers with frequency ≥θ from SAKEIMA. (c) Comparison of the BC distance using all k-mers with exact counts and the approximation of frequent k-mers by SAKEIMA. (d) Total time required by SAKEIMA and the exact approach to find frequent k-mers and compute all distances between data sets as a function of θ. BC, Bray–Curtis.

We then compared, for different values of θ, the total running time required to compute the approximations of the frequent k-mers using SAKEIMA for all data sets in Table 1 and all distances among such data sets using SAKEIMA approximations with the running time required when the exact counting algorithm is used for the same tasks. SAKEIMA reduces the computing time by >75% (Fig. 5d). This result comes from both the efficiency of SAKEIMA and the fact that by focusing on the most frequent k-mers, we greatly reduce the number of distinct k-mers that need to be processed for computing the distances. Therefore, SAKEIMA can be used for a very fast comparison of metagenomic data sets while preserving the ability of distinguishing similar data sets from different data sets.

5. Conclusion

We presented SAKEIMA, a sampling-based algorithm to approximate frequent k-mers and their frequencies with rigorous guarantees on the quality of the approximation. We showed that SAKEIMA can be used to speed up the analysis of large high-throughput sequencing metagenomic data sets, in particular to compute abundance-based distances among such data sets. Interestingly, SAKEIMA also allows one to compute accurate approximations for presence-based distances (e.g., the Jaccard distance), even if for such distances other, potentially faster, tools (Ondov et al., 2016) are available. SAKEIMA can be combined with any highly optimized method that counts all k-mers in a set of strings, including recent parallel methods designed for comparative metagenomics (Benoit et al., 2016). Although we presented results for k-mers from data sets of short reads, SAKEIMA can also be used for the analysis of spaced seeds (Břinda et al., 2015), large data sets of long reads, and whole genome sequences.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This study was supported, in part, by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining.

Supplementary Material

References

Audano, and Vannberg. 2014. Kanalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics, 30, 2070–2072.

Benoit, Peterlongo, Mariadassou, et al. 2016. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Comput. Sci. 2, e94.

Břinda, Sykulski, and Kucherov. 2015. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics, 31, 3584–3592.

Brown, C.T., Howe, Zhang, et al. 2012. A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802.

Chikhi, and Medvedev. 2013. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30, 31–37.

Danovaro, Canals, Tangherlini, et al. 2017. A submarine volcanic eruption leads to a novel microbial habitat. Nat. Ecol. Evol. 1, 144.

Dickson, L.B., Jiolle, Minard, et al. 2017. Carryover effects of larval exposure to different environmental bacteria drive adult trait variation in a mosquito vector. Sci. Adv. 3, e1700585.

Girotto, Pizzi, and Comin. 2016. Metaprob: Accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics, 32, i567–i575.

Hrytsenko, Daniels, N.M., and Schwartz, R.S. 2018. Efficient distance calculations between genomes using mathematical approximation, 546–546. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC. ACM.

Kelley, D.R., Schatz, M.C., and Salzberg, S.L. 2010. Quake: Quality-aware detection and correction of sequencing errors. Genome Biol. 11, R116.

Kokot, Długosz, and Deorowicz. 2017. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics, 33, 2759–2761.

Kurtz, Narechania, Stein, J.C., et al. 2008. A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9, 517.

Li, and Waterman, M.S. 2003. Estimating the repeat structure and length of DNA sequences using ℓ-tuples. Genome Res. 13, 1916–1922.

Löffler, and Phillips, J.M. 2009. Shape fitting on point sets with probability distributions, 313–324. In Fiat, and Sanders, eds. European Symposium on Algorithms. Springer-Verlag, Berlin, Heidelberg.

Marçais, and Kingsford. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770.

Melsted, and Halldórsson, B.V. 2014. Kmerstream: Streaming algorithms for k-mer abundance estimation. Bioinformatics, 30, 3541–3547.

Melsted, and Pritchard, J.K. 2011. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12, 333.

Mitzenmacher, and Upfal. 2017. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, New York.

Mohamadi, Khan, and Birol. 2017. ntcard: A streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33, 1324–1330.

Ondov, B.D., Treangen, T.J., Melsted, et al. 2016. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132.

Pandey, Bender, M.A., Johnson, et al. 2017. Squeakr: An exact and approximate k-mer counting system. Bioinformatics, 34, 568–575.

Patro, Mount, S.M., and Kingsford. 2014. Sailfish enables alignment-free isoform quantification from rna-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462.

Pevzner, P.A., Tang, and Waterman, M.S. 2001. An eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA. 98, 9748–9753.

Riondato, and Kornaropoulos, E.M. 2016. Fast approximation of betweenness centrality through sampling. Data Min. Knowl. Discov. 30, 438–475.

Riondato, and Upfal. 2014. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans. Knowl. Discov. Data, 8, 20.

Rizk, Lavenier, and Chikhi. 2013. DSK: k-mer counting with very low memory usage. Bioinformatics, 29, 652–653.

Roy, R.S., Bhattacharya, and Schliep. 2014. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30, 1950–1957.

Salmela, Walve, Rivals, et al. 2016. Accurate self-correction of errors in long reads using de bruijn graphs. Bioinformatics, 33, 799–806.

Servan-Schreiber, Riondato, and Zgraggen. 2018. Prosecco: Progressive sequence mining with convergence guarantees, 417–426. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, Singapore.

Shalev-Shwartz, and Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY.

Sims, G.E., Jun, S.-R., Wu, G.A., et al. 2009. Alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions. Proc. Natl. Acad. Sci. USA. 106, 2677–2682.

Sivadasan, Srinivasan, and Goyal. 2016. Kmerlight: Fast and accurate k-mer abundance estimation. arXiv preprint arXiv:1609.05626.

Solomon, and Kingsford. 2016. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300.

Vapnik. 1998. Statistical Learning Theory. Wiley, New York.

Vapnik, V.N., and Chervonenkis, A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theor. Probab. Appl. 16, 264–280.

Wood, D.E., and Salzberg, S.L. 2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46.

Zerbino, D.R., and Birney. 2008. Velvet: Algorithms for de novo short read assembly using de bruijn graphs. Genome Res. 18, 821–829.

Zhang, Pell, Canino-Koning, et al. 2014. These are not the k-mers you are looking for: Efficient online k-mer counting using a probabilistic data structure. PLoS One, 9, e101271.

Zhang, and Wang. 2014. Rna-skim: A rapid method for rna-seq quantification at transcript level. Bioinformatics, 30, i283–i292.
