Abstract
Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the Vapnik–Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances providing significant speedups in the analysis of large sequencing data sets.
Introduction
The analysis of substrings of length k, called k-mers, is ubiquitous in biological sequence analysis and is among the first steps of processing pipelines for a wide spectrum of applications, including de novo assembly (Pevzner et al., 2001; Zerbino and Birney, 2008), error correction (Kelley et al., 2010; Salmela et al., 2016), repeat detection (Li and Waterman, 2003), genome comparison (Sims et al., 2009), digital normalization (Brown et al., 2012), RNA-seq quantification (Patro et al., 2014; Zhang and Wang, 2014), metagenomic reads classification (Wood and Salzberg, 2014), and binning (Girotto et al., 2016), and fast search-by-sequence over large high-throughput sequencing repositories (Solomon and Kingsford, 2016). A fundamental task in k-mer analysis is to compute the frequency of all k-mers, with the goal to distinguish frequent k-mers from infrequent k-mers (Marçais and Kingsford, 2011; Melsted and Pritchard, 2011). For example, this task is relevant in the analysis of high-throughput sequencing data, since infrequent k-mers are often assumed to result from sequencing errors. For several applications, the computation of k-mer frequencies is among the most computationally demanding steps of the analysis.
Many algorithms have been proposed for computing the exact frequency of all k-mers, such as Tallymer (Kurtz et al., 2008), Jellyfish (Marçais and Kingsford, 2011), BFCounter (Melsted and Pritchard, 2011), DSK (Rizk et al., 2013), KAnalyze (Audano and Vannberg, 2014), Turtle (Roy et al., 2014), KMC 3 (Kokot et al., 2017), and Squeakr-exact (Pandey et al., 2017). These methods typically perform a linear scan of the sequences to analyze and use a combination of parallelism and efficient data structures (such as Bloom filters and Hash tables) to maintain membership and counting information associated with all k-mers.
Since the computation of exact k-mer frequencies is computationally demanding, in particular for large sequence analysis or for high-throughput sequence data sets, recent methods have focused on providing approximate solutions to the problem, improving the time and memory requirements. KmerStream (Melsted and Halldórsson, 2014), Kmerlight (Sivadasan et al., 2016), and ntCard (Mohamadi et al., 2017) proposed streaming approaches for the approximation of the k-mer frequencies histogram. KmerGenie (Chikhi and Medvedev, 2013) performs a linear scan of the input, counting a random subset (chosen before processing the data set) of all possible k-mers to approximate the abundance histogram, providing an exploratory tool to choose the value of k. khmer (Zhang et al., 2014) and the recently proposed Squeakr (Pandey et al., 2017) rely on probabilistic data structures to approximate the counts of individual k-mers. With the only exception of KmerGenie, all these methods process all the k-mer occurrences in the input data set; in addition, all the aforementioned approximate methods that report the counts of individual k-mers do not provide simultaneous estimates with rigorous guarantees for all the counts k-mers that are provided in output.
All the methods cited previously try to estimate the frequency of all k-mers or of all k-mers that appear at least few times (e.g., twice) in the data set. Although this is crucial in some applications (e.g., in genome assembly k-mers that occur exactly once often represent sequencing errors and it is, therefore, important to estimate the count of all observed k-mers), in other applications this is less justified. For example, in the comparison of high-throughput sequencing metagenomic data sets, abundance-based distances or dissimilarities [e.g., the Bray–Curtis (BC) dissimilarity] between k-mer counts of two data sets are often used (Benoit et al., 2016; Danovaro et al., 2017; Dickson et al., 2017) to assess the distance between the corresponding data sets. In contrast to presence-based distances (Ondov et al., 2016; e.g., Jaccard distance), abundance-based distances take into account the frequency of each k-mer, with frequent k-mers contributing more to the distance than k-mers that appear with low frequency, but still more than a handful of times, in the data set.
Thus, two natural questions are (1) whether the results obtained considering all k-mers can be estimated by considering the abundances of frequent k-mers only and (2) whether the abundances of frequent k-mers can be computed more efficiently than the counts of all k-mers. Recently, preliminary work (Hrytsenko et al., 2018) has shown that, for the cosine distance and k = 12, the answer to the first question is positive, and in Section 4 we show that this is indeed the case for larger values of k and other abundance-based distances as well as presence-based distances (e.g., the Jaccard distance). To the best of our knowledge, the second question is hitherto unexplored. In addition, considering only frequent k-mers allows to focus on the most reliable information in a metagenomic data set, since a high stochastic variability in low frequency k-mers is to be expected due to the sampling process inherent in sequencing.
A natural approach to reduce time and memory requirements for frequency estimation problems is to process only a portion of the data, for example, by sampling some parts of a data set. Sampling approaches are appealing because infrequent k-mers naturally tend to appear with lower probability in a sample, allowing to directly focus on frequent k-mers in subsequent steps. However, major challenges in sampling approaches are (1) to provide rigorous guarantees relating the results obtained by processing the sample and the results that would be obtained from the whole data set and (2) to provide effective bounds on the size of the sample required to achieve such guarantees. The application of sampling to k-mers is even more challenging than in other scenarios since, for values of k in the typical range of interest to applications (e.g., 20–60), even the most frequent k-mers have relatively low frequency in the data. To the best of our knowledge, no approach based on sampling a portion of the input data set has been proposed to approximate frequent k-mers and their frequencies while providing rigorous guarantees.
Our contribution
We study the problem of approximating frequent k-mers, that is, k-mers that appear with frequency more than a user-defined threshold θ in a high-throughput sequencing data set. In these regards, our contributions are fourfold. First, we define a rigorous definition of approximation, governed by an accuracy parameter ɛ. Second, we propose a new method, Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to obtain an approximation to the set of frequent k-mers using sampling. SAKEIMA (see Fig. 1) is based on a sampling scheme that goes beyond naive sampling of k-mers and allows to estimate k-mers of relatively low frequency considering only a fraction of all k-mers occurrences in the data set. Third, we provide analytical bounds to the sample size needed to obtain rigorous guarantees on the accuracy of the estimated k-mer frequencies, with respect to the frequencies measured on the entire data set.

SAKEIMA computes a fast and rigorous approximation of the frequent k-mers in a high-throughput sequencing data set by sampling a fraction of all k-mer occurrences in a data set, providing a significant speedup for the computation of k-mers' abundance-based distances between data sets of reads (e.g., in metagenomics). SAKEIMA, Sampling Algorithm for K-mErs approxIMAtion.
Our bounds are based on the notion of Vapnik–Chervonenkis (VC) dimension, a fundamental concept from statistical learning theory (Shalev-Shwartz and Ben-David, 2014), which has been used to design efficient algorithms to identify frequent patterns in other scenarios (Riondato and Upfal, 2014; Riondato and Kornaropoulos, 2016; Servan-Schreiber et al., 2018). To our knowledge, ours is the first method that applies concepts from statistical learning to provide a rigorous approximation of the k-mer frequencies. Fourth, we use SAKEIMA to extract frequent k-mers from metagenomic data sets from the Human Microbiome Project (HMP) and to approximate abundance-based and presence-based distances among such data sets, showing that SAKEIMA allows to accurately estimate such distances by analyzing only a fraction of the entire data set, resulting in a significant speedup.
Our approach is orthogonal to previous studies; any exact or approximate algorithm can be applied to the sample extracted by SAKEIMA, which can, therefore, be used before applying previously proposed methods, thus reducing their computational requirements while providing rigorous guarantees on the results w.r.t. to the entire data set. Although we present our methodology in the case of finding frequent k-mers from a set of sequences representing a high-throughput sequencing data set of short reads, our results can be applied to data sets of long reads and to whole-genome sequences as well.
Let a data set
Frequent k-mers and approximations
We are interested in obtaining the set
for any
for any (A, fA) ∈C it holds that
for any (A, fA) ∈C it holds that
Definition 2 guarantees that every frequent k-mer of
We aim to provide an approximation to
Obtaining an ɛ-approximation from a random sample with absolute certainty is impossible, thus we focus on obtaining an ɛ-approximation with probability 1 – δ > 0, where δ∈(0, 1) is a confidence parameter, whose value is provided by the user. Intuitively, the set
Proof. We first prove that when
For an arbitrary k-mer A, given the definition of fP(A) we have that
Now define the event
We now prove that when
In addition, by using known results in statistical learning theory (Vapnik and Chervonenkis, 1971; Mitzenmacher and Upfal, 2017) relating the VC dimension (see Section 3 for its definition) of a family of functions to a newly derived bound on the family of functions
Although the bound of Proposition 2 significantly improves the simple bounds of Section 1, since the factor ln(2σk) has been reduced to 1, it still has an inverse quadratic dependency with respect to the accuracy parameter ɛ, which is problematic when the quantities to estimate are small. In these cases, one needs a small ɛ to produce a meaningful approximation (since ɛ < θ), and the inverse quadratic dependence of the sample size from ɛ often results in a sample size larger than the entire input, defeating the purpose of sampling. The case of k-mers is particularly challenging, since the sum
Sampling bags of positions and VC dimension bound
We propose a method to provide an efficiently computable approximation to
Let Iℓ = {(i1, j1), (i2, j2), …, (iℓ, jℓ)} be a bag of ℓ positions for k-mers in
Intuitively,
of
In our analysis, we use the VC dimension (Vapnik, 1998; Vapnik and Chervonenkis, 1971), a statistical learning concept that measures the expressivity of a family of binary functions. We define a range space Q as a pair Q = (X, RX), where X is a finite or infinite set and RX is a finite or infinite family of subsets of X. The members of RX are called ranges. Given D ⊂ X, the projection of RX on D is defined as projRX(D) = {r ∩ D: r∈RX}. We say that D is shattered by RX if
A finite bound on the VC dimension of a range space Q implies a bound on the number of random samples required to obtain a good approximation of its ranges, defined as follows.
The following result (Mitzenmacher and Upfal, 2017) relates ɛ and the probability that a random sample of size m is an ɛ-approximation for a range space of VC dimension at most v.
The universal constant c has been experimentally estimated to be at most 0.5 (Löffler and Phillips, 2009).
We now prove an upper bound to the VC dimension VC(Q) of the range space Q associated with the class of functions
We prove the following results on the VC dimension of the mentioned range space.
Proof. If VC(Q) ≥ v, then there must exist a set
Using the mentioned result, we prove the following.
Then, with probability at least 1 – δ:
for any k-mer
for any k-mer A with
for any k-mer
for any k-mer A with
for any k-mer A with
Proof. For a given k-mer A, consider the event
Consider a k-mer A with frequency
For the second part, consider a k-mer A with
The third result follows from
The last two results follow from
Note that from Proposition 5, the set
Then, the third, fourth, and fifth guarantees from Proposition 5 state that we can use the biased estimates
We now present our SAKEIMA that builds on Proposition 5 and efficiently samples a bag Pℓ of bags of ℓ positions from
SAKEIMA is described in Algorithm 1. Although SAKEIMA performs a linear scan of the input data set, it practically reduces the number of k-mers that need to be processed with the following strategy.
SAKEIMA performs a pass on the stream of k-mers appearing in
The biased estimate
Note that SAKEIMA does not sample m bags of exactly ℓ positions each, since the number of occurrences of each position in
As a simple corollary, the output
The technique we just described can be used to avoid the exact computation of
Improved lower and upper bounds to k-mer frequencies
Note that Proposition 5 guarantees that we can obtain upper and lower bounds to
However, Proposition 5 can be generalized to obtain tighter upper and lower bounds to the frequency of all k-mers. For given ℓ, ɛ, and δ, let m be as given in Proposition 5. Note that the total number of k-mers' positions in the sample Pℓ is mℓ. Let
Proof. Combining proposition 4 and Proposition 3 and by union bound on the
that is equivalent to
when
In our experiments, we use
Experimental Results
In this section, we present the results of our experimental evaluation for SAKEIMA. Section 4.1 describes the data sets, our implementation for SAKEIMA * , and the baseline for comparisons. In Section 4.2, we report the results for computing the approximation of the frequent k-mers using SAKEIMA. Section 4.3 reports the results of using our approximation to compute abundance-based and presence-based distances between metagenomic data sets.
Data sets and implementation
We considered six data sets from the HMP † , one of the largest publicly available collection of metagenomic data sets from high-throughput sequencing. In particular, we selected the three largest data sets of stool and the three largest of tongue dorsum (Table 1). These data sets constitute the most challenging instances, due to their size, and provide a test case with different degrees of similarities among data sets. We implemented SAKEIMA in C++ as a modification of Jellyfish (Marçais and Kingsford, 2011; the version we used is 2.2.10 ‡ ), a very popular and efficient algorithm for exact k-mer counting. Doing so, our algorithm enjoys the succinct counting data structure provided by Jellyfish publicly available implementation. We remark that our sampling-based approach can be used in combination with any other highly tuned method available for exact, approximate, and parallel k-mer counting. For this reason, we only compare SAKEIMA with the exact counting performed by Jellyfish, since they share the underlying characteristics, allowing us to evaluate the impact of SAKEIMA's sampling strategy.
Data Sets for Our Experimental Evaluation
Data Sets for Our Experimental Evaluation
For each data set
For running time and memory, we computed the average from 10 runs. When comparing Jellyfish and SAKEIMA using one worker, we show the CPU time, while when using multiple threads we show the overall running time. We did not include the time to compute
For the computation of the abundance-based distances from the k-mer counts of two data sets, we implemented in C++ a simple algorithm that loads the counts of one data set in main memory and then performs one pass on the counts of the other data set, producing the distances in output. We executed all our experiments on the same machine with 512 GB of RAM and 2.30 GHz Intel Xeon CPUs (with 64 cores in total), compiling both implementations with GCC 8. SAKEIMA can be used in combination with more efficient algorithms and implementations for the computation of these (and other) distances (Benoit et al., 2016), resulting in speedups analogous to the those we present hereunder. For all the experiments of SAKEIMA, given θ and a data set
We fixed k = 31, and we compared SAKEIMA with the exact counting of all k-mers (from Jellyfish) in terms of (1) running time, including, for both algorithms, the time required to write the output on disk and (2) memory requirement. We also assessed the accuracy of the output of SAKEIMA.
Figure 2 shows the average running times and peak memory as function of θ, using one worker. Note that for the exact counting algorithm, these metrics do not depend on θ, since it always counts all k-mers. SAKEIMA is always faster than the exact counting, with a difference that increases when θ increases and a speedup ∼2 even for θ = 2 · 10−8. The memory requirement of SAKEIMA reduces when θ increases, and for θ = 2 · 10−8 it is half of the memory required by the exact counting. This is due to SAKEIMA's sample size being much smaller than the data set size (Fig. 2d); therefore, a large portion of extremely low frequency k-mers are naturally left out from the random sample and do not need to be accounted for in the counting data structure, as confirmed by counting the number of distinct k-mers that are inserted in the counting data structure by the two algorithms (Fig. 2c). (The difference between the memory requirement and the number of distinct k-mers is given by Jellyfish's strategy to double the size of the counting data structure when it is full.)

Running time, memory requirements, and number of distinct k-mers counted, for SAKEIMA and exact counting as function of θ.
Figure 3 shows the average running times of SAKEIMA and Jellyfish as function of θ and the number of workers w for counting k-mer from data set SRS043663. The memory used by both approaches does not depend on w, therefore, it is the same as shown in Figure 2. We can see that increasing w reduces the running time of both approaches, and that the relative improvements provided by the sampling strategy of SAKEIMA are maintained. This shows that SAKEIMA is well suited to be combined with parallel approaches.

Running time for SAKEIMA and exact counting for data set SRS043663, as function of θ and the number of workers w.
In terms of quality of the approximation, the output of SAKEIMA satisfied the guarantees given by Proposition 5 for all runs of our experiments, therefore, with probability >1 – δ. Although SAKEIMA may incur in false negatives, its false negative ratio (i.e., the fraction of k-mers in

Quality of the approximation of
We evaluate the use of SAKEIMA to speed up the computation of commonly used k-mer-based ecological distances (Benoit et al., 2016) between data sets of next-generation sequencing reads. We present results for the BC distance; analogous results hold for other distances (Supplementary Appendix).
We first investigated how the distances change when those are computed considering only the frequent k-mers (w.r.t. a frequency threshold θ) instead that the full spectrum of k-mers appearing in the data. Therefore, given a pair of data sets
Note that when θ ≤ 10−10 then

Results for BC distances of metagenomic data sets.
We then compared, for different values of θ, the total running time required to compute the approximations of the frequent k-mers using SAKEIMA for all data sets in Table 1 and all distances among such data sets using SAKEIMA approximations with the running time required when the exact counting algorithm is used for the same tasks. SAKEIMA reduces the computing time by >75% (Fig. 5d). This result comes from both the efficiency of SAKEIMA and the fact that by focusing on the most frequent k-mers, we greatly reduce the number of distinct k-mers that need to be processed for computing the distances. Therefore, SAKEIMA can be used for a very fast comparison of metagenomic data sets while preserving the ability of distinguishing similar data sets from different data sets.
We presented SAKEIMA, a sampling-based algorithm, to approximate frequent k-mers and their frequencies with rigorous guarantees on the quality of the approximation. We show that SAKEIMA can be used to speed up the analysis of large high-throughput sequencing metagenomic data sets, in particular to compute abundance-based distances among such data sets. Interestingly SAKEIMA allows to compute accurate approximations also for presence-based distances (e.g., the Jaccard distance), even if for such distances other, potentially faster, tools (Ondov et al., 2016) are available. SAKEIMA can be combined with any highly optimized method that counts all k-mers in a set of strings, including recent parallel methods designed for comparative metagenomics (Benoit et al., 2016). Although we presented results for k-mers from data sets of short reads, SAKEIMA can also be used for the analysis of spaced seeds (Břinda et al., 2015), large data sets of long reads, and whole genome sequences.
Footnotes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
