A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data

Abstract

When analyzing microbial communities, an active and computational challenge concerns the categorization of 16S rRNA gene sequences into operational taxonomic units (OTUs). Established clustering tools use a one pass algorithm to tackle high number of gene sequences and produce OTUs in reasonable time. However, all of the current tools are based on a crisp clustering approach, where a gene sequence is assigned to one cluster. The weak quality of the output compared with more complex clustering algorithms forces the user to postprocess the obtained OTUs. Providing a membership degree when assigning a gene sequence to an OTU will help the user during the postprocessing task. Moreover it is possible to use this membership degree to automatically evaluate the quality of the obtained OTUs. So the goal of this study is to propose a new clustering approach that takes into account uncertainty when producing OTUs, and improves both the quality and the presentation of the OTU results.

1. Introduction

Studying the structure of the communities in an ecosystem is central in environmental microbiology (Roux et al., 2011; Hugoni et al., 2013). The biosphere's diversity can be determined by amplifying and sequencing specific phylogenetic markers (e.g., 16S rRNA). From there, these amplicons need to be clustered in “species” named operational taxonomic units (OTUs) (Li et al., 2012; Chen et al., 2013; Mahé et al., 2014; Westcott and Schloss, 2015). As the volume of sequences has drastically increased in recent times, new clustering tools have emerged to treat the data in reasonable time. The currently used algorithms are, from the point of view of algorithmic complexity, the fastest available that do not produce random results. However, due to their simplicity, the reliability of the results is often discussed. These tools being essentially black boxes, their sensitivity to the sequence order, clustering threshold, and structure of the data make it that the users have no way of knowing whether better OTUs could have been obtained with different parameters or even whether they correctly represent the data. In these circumstances, there is no choice but to blindly trust them.

Distance-based greedy clustering algorithms such as those implemented in OTUclust (Albanese et al., 2015), VSEARCH (Rognes et al., 2016), CD-HIT (Li and Godzik, 2006), or USEARCH (Edgar, 2010) all share the same base algorithm as shown in Algorithm 1.

Algorithm 1. DBG (distance-based greedy) Clustering Principle

Although more sophisticated algorithms (Gath and Geva, 1989; Hariz et al., 2006; Antoine et al., 2012; Pérez-Suárez et al., 2013; Antoine et al., 2014) could produce better results quality wise, their runtime would render them unusable on millions of sequences. As the quality of the OTUs is important, we have to find a way to improve it without increasing the runtime. The different available implementations use a variety of heuristics to counterbalance the simplicity of the algorithm but, to the best of our knowledge, no approach has tried to add a measure of uncertainty to the process. This is why, to help increase the quality and trustworthiness of the clustering, we propose to add uncertainty to this simple algorithm through the use of fuzzy clustering.

2. Methods

2.1. Motivation

Distance-based greedy clustering algorithms, such as that in VSEARCH, produce a number of OTUs and assign each sequence to one of them. The OTU to which a sequence is said to belong to is usually the first OTU to be encountered that is sufficiently close, that is, within the specified threshold. This creates two problems:

A sequence can only belong to a single OTU

An OTU either includes or does not include a sequence

Having a sequence associated with a single OTU is expected as the ultimate output of the algorithm. For this reason, algorithms can stop after finding the first OTU that is close enough to a sequence, which speeds up the computation. However, not considering all the OTUs, a sequence could be assigned to increase the sensitivity to the order—a weakness of these algorithms—and reduce the quality of the clustering. Indeed, what if two different OTUs are close enough? Giving priority to the first generated OTU only creates a bias that no heuristic—such as sorting the sequences—could hope to overcome.

Moreover, by using strict thresholds, it is possible to have two nearly identical sequences such that one belongs to a particular OTU whereas the other does not. Due to this strictness, an OTU partitions the set of sequences into two sets inside of which sequences are considered the same regardless of their distance to the center of the OTU. This lack of distinction between sequences that are isolated and sequences on the border of OTUs hides information that could help understand the data.

Although these would not be problems were the clustering optimal, the need for fast algorithms gives rise to results that are not always trustworthy. The OTUs being presented as absolute, the end user has no choice, should consider them correct and cannot know whether the algorithm has encountered ambiguity. We believe that being less strict in the way the OTUs partition sequences would help produce better results from the end user's point of view.

2.2. Fuzzy clustering

To help increase the quality of the clustering and maximize the information that can be gathered from the data, we propose to add uncertainty to the clustering by means of fuzzy sets.

We define a membership function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_C} \left( S \right)$$ \end{document} that, for an OTU C, associates a membership value with a sequence S. Usually, this value is either 0 or 1. Here, we propose to have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_C} \left( S \right)$$ \end{document} take its value in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\left\{ { \frac { n } { { 10 } } { \rm { \; \; } } \vert { \rm { \; \; } } n = 0 \ldots 10 } \right\} $$ \end{document} . This value represents the degree of membership and, as such, 1 means that the sequence certainly belongs to the OTU, whereas 0 means that the sequence certainly does not belong to it. Other values represent uncertainty and are used to express that the sequence nearly belongs to the OTU. This membership value can easily be computed from the distance between the sequence and the center of the OTU using two thresholds t₁ and t₂ such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${t_1} \ge {t_2}$$ \end{document} . If the distance is less than the threshold t₁, the membership value is 1. If the distance is greater than t₂ the value is 0. If the distance is between t₁ and t₂, it increases gradually. Fuzzy OTUs are depicted in Figure 1 and the algorithm in Algorithm 2.

FIG. 1.

Representations of a Crisp (left) and a Fuzzy (right) OTUs. OTU, operational taxonomic unit.

Algorithm 2. Fuzzy DBG (distance-based greedy) Clustering

Using fuzzy OTUs allows us to discern the difference between sequences close to the OTU and sequences extremely far. Using the parameters t₁ and t₂, we can tune the “detection radius” around OTUs to gather information that would normally be discarded by the clustering algorithm.

2.3. Evaluating fuzzy operational taxonomic units

Having a nonbinary membership function produces OTUs that partition the sequences into multiple sets. If we consider only the sequences that belong (more or less) to an OTU, the repartition of their membership values provides information on the topology of the OTU. An ideal OTU would contain only sequences with a membership value of 1, meaning a group of sequences has been perfectly regrouped with a good threshold and no sequence lies ambiguously on the border. More realistically, a good OTU would contain many sequences with high membership values and little sequences with low values. A bad OTU with the majority of its sequences having low membership values could mean that the algorithm has chosen as a center a sequence on the border of a group or, even worse, between two distinct groups. Examples of such good and bad OTUs are given in Table 1.

Table 1.

Two Example Operational Taxonomic Units with the Number of Sequences That Belong to Them with Each Possible Membership Value

	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1
OTU1	6	4	1	1	0	3	8	13	29	88
OTU2	70	41	30	41	34	19	11	6	5	16

We can quickly evaluate the quality of an OTU with this repartition. If we suppose that each sequence lowers the quality of the OTU depending on its membership value, we can use the following formula: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { \rm { Quality } } \left( { { \rm { OTU } } } \right) = 1 - { \rm { } } \sum \nolimits_ { i = 1 } ^9 { { \omega _i } } \times { \frac { { \rm { Number } } { \rm { \ of \ sequences \ with \ membership \ value } } \ i \times 0.1 } { { \rm { Number } } { \rm { \ of \ sequences \ in \ the \ OTU } } } } , \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _i}$$ \end{document} is the “cost” of having a sequence with membership value \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i \times 0.1$$ \end{document} . In our previous examples, and with the values of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _i}$$ \end{document} , in Table 2, we obtain a quality of, respectively, 0.71 and 0.26 for OTU1 and OTU2, showing OTU1 is better.

Table 2.

Example of Weight Values

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _1}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _2}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _3}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _4}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _5}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _6}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _7}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _8}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \omega _9}$$ \end{document}
1	0.9	0.8	0.7	0.6	0.5	0.4	0.3	0.2

A problem arises with singletons that always have perfect quality but these can be safely treated separately.

2.4. Choosing an operational taxonomic unit

A sequence can belong to multiple OTUs due to fuzzy membership. However, in the end, we want each sequence to be assigned to a single OTU. Hence, we have to choose one of the possible OTUs. We have two types of values left from the clustering process: membership and quality. The first type is based on the distance between the OTU and the sequence and the second type is used to recognize bad OTUs. Choosing the OTU with the best membership value is akin to running VSEARCH. Choosing the OTU with the best quality tends to create bigger OTUs that absorb distant sequences. To better compromise, we can use a linear combination of both values: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \alpha \times { \rm{quality}} + \beta \times { \rm{membership}}. \end{align*} \end{document}

Increasing the importance of the quality reduces the number of OTUs containing sequences. When α is low, the “best” OTUs quality wise absorb very close sequences that would have been attributed to other OTUs. When α is too high, the best OTUs start absorbing all the sequences around them, effectively acting like an increase of the distance threshold.

2.5. Identifying ambiguous sequences

Distance-based greedy algorithms are good at clustering objects that are easy to cluster. Groups of very similar sequences that are different from the rest of the data set are supposed to birth a new OTU, whereas isolated singletons should be identified to be either removed or treated separately. A problem arises when groups of sequences are close to each other but not enough to be the same OTU. In this case and supposing the algorithm ideally chooses the centers of the OTUs, sequences can lie just between these OTUs. In the current implementations, these ambiguous sequences that must be assigned are usually put in OTUs of their own, increasing the number of OTUs and reducing the overall quality of the clustering.

Using fuzzy clustering allows us to identify these ambiguous sequences such as those depicted in Figure 2. Using the previously mentioned choice strategy, they can be assigned to a good OTU even though they lie slightly outside of the distance threshold. However, their ambiguousness may be significant for the user. It is thus important to highlight their existence and the various fuzzy OTUs they could have alternatively be assigned to.

FIG. 2.

A case of ambiguous sequences.

3. Experimental Results

3.1. Relevant metrics

Five criteria are important when evaluating the efficiency of a clustering method in the setting of environmental microbiology. First of all, the computation (Time) should be as fast as possible and the memory (Memory) usage should be low. Comparing runtimes with those of VSEARCH, which gives satisfactory results, is enough. The quality of the clustering itself is evaluated using the number of OTUs (#OTUs), singletons (#singletons), and pairs (#pairs). The way to use these three values to study the population of microorganisms is outside the scope of this work but it is important to note that they should be minimized and that the proportion of singletons should be as small as possible. In addition, the average taxonomic distance (Distance) between sequences in a same OTU is used to represent the difference between the computed result and reality. The distance between two sequences in the taxonomy is defined as the sum of the lengths of the path from their nearest commonality. For example, if a sequence is classified as “bacteria;proteobacteria;betaproteobacteria” and the other is classified as “bacteria;proteobacteria;alphaproteobacteria,” their distance is 2 as each of them is at a distance 1 from their commonality “bacteria;proteobacteria.”

The comparison with VSEARCH is done using identical parameters when applicable.

3.2. Data and results

We tested our algorithm on a data set containing 100,000 sequences that can be found in the SILVA database and are thus already labeled for the distance computation. We used a threshold of 0.97 (97% similarity) for determining new clusters and a threshold of 0.95 for fuzzy membership. For the choice of each sequence's final cluster, we used different values of α between 0 and 1 with 0.25 increments. The results, presented in Table 3, are compared with those of VSEARCH using identical parameters when applicable.

Table 3.

Experimental Results

	Time	Memory	No. of OTUs	No. of singletons	No. of pairs	Distance
Fuzzy ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = 0$$ \end{document} )	15:58	1,734,180	32,597	20,966	4618	0.52
Fuzzy ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = 0.25$$ \end{document} )	16:39	1,734,496	32,543	20,510	4806	0.56
Fuzzy ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = 0.5$$ \end{document} )	17:28	1,735,256	32,220	20,023	4830	0.62
Fuzzy ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = 0.75$$ \end{document} )	16:08	1,735,056	31,567	19,347	4754	0.69
Fuzzy ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha = 1$$ \end{document} )	17:32	1,734,180	31,309	18,276	4826	1.03
VSEARCH	6:13	1,332,400	34,184	22,129	4803	0.48

OTU, operational taxonomic unit.

The program, data set, and corresponding taxonomy are available on (http://projets.isima.fr/sclust/Expe.html).

3.3. Analysis

We observe that the runtime, although between two to three times longer than VSEARCH's, is still reasonable. The memory usage is slightly higher because more values have to be stored for each sequence/cluster pair. An increase in the importance of the quality (α) results in a decrease in the total number of OTUs and singletons. The proportion of singletons is also reduced. The number of pairs is slightly increased whenever the quality is taken into account, but no particular pattern can be discerned. We interpret this as the existence of isolated sequences that initially form singletons but are close enough to be regrouped in small clusters. Unsurprisingly, increasing the importance of the quality increases the average taxonomic distance inside clusters. This increase is not linear and too much emphasis on the quality ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha > 0.75$$ \end{document} ) drastically increases the distance.

4. Discussion

We observe that the experimental results confirm that adding uncertainty to the clustering helps improve the quality of the output by reducing the number of singletons. Using fuzzy clusters, we are able to extend the clustering threshold to gather additional information on the OTUs' surroundings and use it to quickly assess their quality. This quality can be used together with the distance to choose an OTU for each sequence. The resulting output contains less singletons and misclassifications. Being able to choose the weight of both distance and quality allows for additional tuning.

As previously mentioned, the fuzziness also makes it possible to detect ambiguous sequences and clusters. In our opinion, this is where further work is required. An ambiguous sequence could be arbitrarily assigned to a nearby OTU, become the center of its own OTU, or even be considered as an error and deleted, but these operations imply such knowledge of the domain that interactions with the human user become necessary. However, on data sets containing millions of sequences, the number of alerts would render manual treatment impractical or even impossible. Automatizing this treatment would require being able to adapt to the type of data, domain, and preferences of the user. We suggest that machine learning techniques be introduced in the process to automatically learn how to handle these ambiguities.

Footnotes

Acknowledgments

This work was supported by the European Union's “Fonds Européen de Développement Régional (FEDER)” program and the Auvergne-Rhone-Alpes region.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Albanese

, Fontana

, De Filippo

, et al. 2015. MICCA: A complete and accurate software for taxonomic profiling of metagenomic data. Sci. Rep. 5, 9743.

Antoine

, Quost

, Masson

, et al. 2012. CECM: Constrained evidential c-means algorithm. Comput. Stat. Data Anal. 56, 894–914.

Antoine

, Quost

, Masson

, et al. 2014. CEVCLUS: Evidential clustering with instance-level constraints for relational data. Soft Comput. 18, 1321–1335.

Chen

, Zhang

C.K.

, Cheng

, et al. 2013. A comparison of methods for clustering 16s rRNA sequences into OTUs. PLoS One, 8, 1–10.

Edgar

R.C.

2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

Gath

, and Geva

A.B.

1989. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 11, 773–780.

Hariz

S.B.

, Elouedi

, and Mellouli

2006. Clustering approach using belief function theory, 162–171. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications. Springer, Berlin-Heidelberg.

Hugoni

, Taib

, Debroas

, et al. 2013. Structure of the rare archaeal biosphere and seasonal dynamics of active ecotypes in surface coastal waters. Proc. Natl Acad. Sci. U. S. A. 110, 6004–6009.

, Fu

, Niu

, et al. 2012. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13, 656–668.

10.

, and Godzik

2006. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.

11.

Mahé

, Rognes

, Quince

, et al. 2014. Swarm: Robust and fast clustering method for amplicon-based studies. PeerJ, 2, e593.

12.

Pérez-Suárez

, Martínez-Trinidad

J.F.

, Carrasco-Ochoa

J.A.

, et al. 2013. OClustR: A new graph-based algorithm for overlapping clustering. Neurocomputing, 121, 234–247.

13.

Rognes

, Flouri

, Nichols

, et al. 2016. VSEARCH: A versatile open source tool for metagenomics. PeerJ, 4, e2584.

14.

Roux

, Faubladier

, Mahul

, et al. 2011. Metavir: A web server dedicated to virome analysis. Bioinformatics, 27, 3074–3075.

15.

Westcott

S.L.

, and Schloss

P.D.

2015. De novo clustering methods outperform reference-based methods for assigning 16s rRNA gene sequences to operational taxonomic units. PeerJ, 3, e1487.