Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_6

6. Clustering Sequences into OTUs

Yinglin Xia¹ and Jun Sun¹

(1) Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

This chapter describes and illustrates taxonomic classification of the representative sequences and clustering of OTUs. First, it introduces some preliminary procedures for clustering sequences into OTUs. Then it describes VSEARCH and q2-vsearch. The next three sections introduce and illustrate three approaches to clustering sequences into OTUs using q2-vsearch: closed-reference clustering, de novo clustering, and open-reference clustering.

Keywords

Clustering OTUs, VSEARCH, q2-vsearch, Closed-reference clustering, De novo clustering, Open-reference clustering, q2-cutadapt, Quality-filter, Q-score, Q-score-joined, USEARCH, UCLUST

Chapter 4 described and illustrated how to generate a feature table and feature data (i.e., representative sequences). Chapter 5 described and illustrated how to assign taxonomy and build a phylogenetic tree. In this chapter, we describe and illustrate taxonomic classification of the representative sequences and clustering of OTUs. We first introduce some preliminary procedures for clustering sequences into OTUs (Sect. 6.1). Then we introduce VSEARCH and q2-vsearch (Sect. 6.2). In the next three sections, we introduce and illustrate how to cluster sequences into OTUs using q2-vsearch: closed-reference clustering (Sect. 6.3), de novo clustering (Sect. 6.4), and open-reference clustering (Sect. 6.5). Finally, we provide a brief summary in Sect. 6.6.

6.1 Introduction to Clustering Sequences into OTUs

OTUs are used pragmatically as proxies for potential microbial species represented in a sample. The developers of QIIME 2 recommend working with Amplicon Sequence Variants (ASVs); thus the QIIME 2 workflow by default does not include a typical OTU picking step. However, OTU picking is the traditional approach to generating a feature table or OTU table for downstream data analysis, and some bioinformatic centers still use this approach to generate data. OTU picking remains an option in QIIME 2, and the q2-vsearch plugin is available for this analysis. Thus, we still introduce the OTU picking technique here.

To cluster sequences, several preliminary steps are needed: (1) merging paired-end reads, (2) removing non-biological sequences, (3) trimming all reads to the same length, (4) discarding low-quality reads, and (5) dereplicating the reads.

6.1.1 Merge Reads

Whether reads need to be merged depends on how the sequences will be denoised or clustered into ASVs or OTUs. The sequences need to be joined before using Deblur or OTU clustering methods, which can be achieved with the join-pairs method of the QIIME 2 q2-vsearch plugin. When the sequences are denoised using DADA2, merging reads beforehand is not necessary because DADA2 merges the reads automatically after denoising them.
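As a hedged sketch of the merging step, the command below calls the join-pairs method of q2-vsearch as it is named in the QIIME 2 2022.2 release used in this chapter; other releases may expose it under a different name. The artifact names demux.qza and demux-joined.qza are placeholders rather than files produced earlier, and the guard simply skips the call when QIIME 2 or the input file is not available.

```shell
# Placeholder artifact names (assumptions, not files from this chapter).
DEMUX=demux.qza            # SampleData[PairedEndSequencesWithQuality]
JOINED=demux-joined.qza    # SampleData[JoinedSequencesWithQuality]

if command -v qiime >/dev/null 2>&1 && [ -f "$DEMUX" ]; then
  # Merge forward and reverse reads with VSEARCH.
  qiime vsearch join-pairs \
    --i-demultiplexed-seqs "$DEMUX" \
    --o-joined-sequences "$JOINED"
  JOIN_STATUS=ran
else
  echo "qiime or $DEMUX not found; skipping join-pairs"
  JOIN_STATUS=skipped
fi
```

The joined artifact can then be passed to quality filtering and to Deblur or OTU clustering.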

6.1.2 Remove Non-biological Sequences

Any non-biological sequences such as primers, sequencing adapters, and PCR spacers should be removed before clustering. The q2-cutadapt plugin provides comprehensive methods for removing non-biological sequences from paired-end or single-end data. We refer interested readers to the QIIME 2 documentation for details.

As recalled from Chap. 4, DADA2 can remove non-biological sequences when its denoising function is called: the --p-trim-left parameters (--p-trim-left for single-end reads; --p-trim-left-f and --p-trim-left-r for paired-end reads) remove bases from the 5′ end of the reads. In this case, the ASVs were obtained by denoising the sequences with DADA2, during which the non-biological sequences were removed, so we do not need to perform this step again. If sub-OTUs/ASVs were obtained through Deblur, then we need to remove non-biological sequences beforehand because Deblur does not have this functionality yet.
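A minimal sketch of primer removal with q2-cutadapt is given below, assuming paired-end data. The artifact names are placeholders, and the two primer sequences are the commonly used 515F/806R V4 primers, included purely as an illustration; they should be replaced by the primers actually used in the study.

```shell
# Placeholder artifact names and example primers (assumptions for illustration).
DEMUX=demux.qza
TRIMMED=demux-trimmed.qza
FWD_PRIMER=GTGCCAGCMGCCGCGGTAA     # example only: widely used 515F primer
REV_PRIMER=GGACTACHVGGGTWTCTAAT    # example only: widely used 806R primer

if command -v qiime >/dev/null 2>&1 && [ -f "$DEMUX" ]; then
  # Remove the primers from the 5' end of forward and reverse reads.
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$DEMUX" \
    --p-front-f "$FWD_PRIMER" \
    --p-front-r "$REV_PRIMER" \
    --o-trimmed-sequences "$TRIMMED"
  TRIM_STATUS=ran
else
  echo "qiime or $DEMUX not found; skipping cutadapt"
  TRIM_STATUS=skipped
fi
```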

6.1.3 Trim Reads Length

The raw reads need to be trimmed to the same length before OTU clustering. QIIME 2 recommends first denoising the reads, which involves a length-trimming step, and then optionally passing the resulting ASVs through a clustering algorithm. Thus, QIIME 2 currently does not have a function to trim reads to the same length directly.

6.1.4 Discard Low-Quality Reads

Low-quality reads are discarded through quality filtering with the quality-filter plugin. Different types of quality filtering are available in QIIME 2, including the q-score method for single- or paired-end sequences (i.e., SampleData[PairedEndSequencesWithQuality | SequencesWithQuality]) and the q-score-joined method for joined reads after merging (i.e., SampleData[JoinedSequencesWithQuality]). We refer readers to Chap. 4 (Sect. 4.4.3 Preliminary Works for Denoising with Deblur) for details.

6.1.5 Dereplicate Sequences

All types of clustering require that the sequences first be dereplicated. In QIIME 2, dereplication can be performed with the dereplicate-sequences method of the q2-vsearch plugin.
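To make the idea of dereplication concrete, the toy sketch below collapses identical reads into unique sequences with abundance counts using standard shell tools. The reads are made-up data, and this is only a conceptual illustration, not what q2-vsearch runs internally.

```shell
# Toy reads, one sequence per line (made-up data for illustration).
reads="ACGTACGT
ACGTACGT
TTGACCAA
ACGTACGT
TTGACCAA
GGGTTTCC"

# Count identical sequences and list them by decreasing abundance.
derep=$(printf '%s\n' "$reads" | sort | uniq -c | sort -k1,1nr -k2,2 |
  awk '{ print $2 "\t" $1 }')
printf '%s\n' "$derep"
```

Six reads collapse into three unique sequences with counts 3, 2, and 1, which corresponds to a tiny feature table plus its representative sequences.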

6.2 Introduction to VSEARCH and q2-vsearch

Example 6.1: Miseq_SOP, Examples 5.1 and 5.2, Cont.

In this section, we introduce the three OTU picking methods available via QIIME 2, using the example dataset from Chaps. 4 and 5. In this case, because the ASVs were obtained by denoising the sequences with DADA2, the reads were already merged, so the merging step can be omitted.

After quality filtering and denoising of the DNA sequences, the sequences are identified by assigning them to taxonomic groups or clustering them into OTUs to obtain datasets suitable for downstream statistical analyses. Typically, there are three ways to assign sequences to OTUs (Lawley and Tannock 2017; Whelan and Surette 2017; De Filippis et al. 2018): closed-reference clustering, de novo clustering, and open-reference clustering. QIIME 2 currently supports de novo, closed-reference, and open-reference clustering (Rideout et al. 2014).

VSEARCH (Rognes et al. 2016) is a versatile open-source tool for processing and preparing metagenomics, genomics, and population genomics nucleotide sequence data. It was designed as an alternative to USEARCH (Edgar 2010) and is based on a fast heuristic algorithm for searching nucleotide sequences; it performs optimal global sequence alignment of the query using full dynamic programming. Its functions include searching, clustering, chimera detection, subsampling, paired-end read merging, and dereplication.

Currently two options are available in QIIME 2 for clustering sequences or features into OTUs using vsearch. (1) Using demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact). This option is currently performed in two steps; a single command is expected in a future release of QIIME 2. (2) Using dereplicated, quality-controlled data in the form of a feature table and feature representative sequences (i.e., the FeatureTable[Frequency] and FeatureData[Sequence] artifacts). These artifacts can be generated by a variety of analysis pipelines, such as the qiime vsearch dereplicate-sequences, qiime dada2 denoise-*, or qiime deblur denoise-* commands. The second option is performed in one step. The FeatureTable[Frequency] (in this case, FeatureTableMiSeq_SOP.qza) and FeatureData[Sequence] (RepSeqsMiSeq_SOP.qza) artifacts were already generated in Chap. 4, so we can use them directly to cluster sequences into OTUs.
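The two-step route for option (1) can be sketched as follows: dereplicate the demultiplexed sequences first, then cluster the resulting table and sequences. All artifact names here (seqs.qza and the outputs) are hypothetical placeholders, and de novo clustering at 99% identity is chosen only as an example of the second step.

```shell
# Hypothetical input: a SampleData[Sequences] artifact.
SEQS=seqs.qza

if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ]; then
  # Step 1: dereplicate into a feature table and representative sequences.
  qiime vsearch dereplicate-sequences \
    --i-sequences "$SEQS" \
    --o-dereplicated-table table-derep.qza \
    --o-dereplicated-sequences rep-seqs-derep.qza

  # Step 2: cluster the dereplicated features (de novo, 99% identity).
  qiime vsearch cluster-features-de-novo \
    --i-table table-derep.qza \
    --i-sequences rep-seqs-derep.qza \
    --p-perc-identity 0.99 \
    --o-clustered-table table-dn-99.qza \
    --o-clustered-sequences rep-seqs-dn-99.qza
  TWO_STEP_STATUS=ran
else
  echo "qiime or $SEQS not found; skipping two-step clustering"
  TWO_STEP_STATUS=skipped
fi
```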

Traditionally, a 97% identity threshold was used to approximate species (Stackebrandt and Goebel 1994; Schloss and Handelsman 2005; Seguritan and Rohwer 2001; Westcott and Schloss 2017). More recently, more stringent cut-offs have been suggested to avoid over-classification of the representative sequences, which could result in spurious OTUs. Given the much larger datasets now available, identity thresholds of around 99% for full-length sequences and around 100% for the V4 hypervariable region are considered optimal (Edgar 2018).

6.3 Closed-Reference Clustering

Closed-reference clustering (Caporaso et al. 2012; Navas-Molina et al. 2013) is a phylotype-based method, also called phylotyping (Schloss and Westcott 2011) or a taxonomy-dependent method (Sun et al. 2012).

6.3.1 Introduction

Closed-reference clustering groups together those sequences that match the same reference sequence in a database at a given similarity. That is, this method bins sequences into groups defined by a well-curated database of known sequences: it first compares each query sequence to an annotated reference taxonomy database via sequence classification or searching methods (Liu et al. 2017a, b; Rodrigues et al. 2017), and then groups the sequences that match the same reference sequence into the same OTU. In QIIME 2, the algorithm first clusters the sequences in the FeatureData[Sequence] artifact against a reference database and then collapses the features in the FeatureTable into new features that are clusters of the input features.
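The collapsing step can be illustrated with a toy sketch. Assuming a made-up list of features, each with its best-matching reference id (or "-" when nothing matched at the chosen identity) and a read count, the counts are summed per reference to give the new OTU-level features. The ids and counts are hypothetical, and the search itself is not simulated here.

```shell
# Made-up hits: feature id, best-matching reference id, read count.
hits="featA refX 10
featB refX 5
featC refY 7
featD - 2"

# Sum counts per reference; features without a hit ("-") stay unmatched.
collapsed=$(printf '%s\n' "$hits" | awk '
  $2 != "-" { otu[$2] += $3 }
  $2 == "-" { unmatched += $3 }
  END {
    for (r in otu) print r, otu[r]
    print "unmatched", unmatched + 0
  }' | sort)
printf '%s\n' "$collapsed"
```

Here featA and featB are merged into one OTU labeled by refX with 15 reads, featC becomes the refY OTU with 7 reads, and featD is set aside as unmatched.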

6.3.2 Implement Cluster-Features-Closed-Reference

Below, we perform closed-reference clustering with the cluster-features-closed-reference method of the qiime vsearch plugin. The cluster-features-closed-reference method wraps the VSEARCH --usearch_global function. To perform closed-reference clustering of a feature table (in this case, FeatureTableMiSeq_SOP.qza), we download the Greengenes 16S rRNA reference database, version 13_8 (the most recent version), from the QIIME 2 website (https://docs.qiime2.org/2022.2/data-resources/). The Greengenes archive (gg_13_8_otus.tar.gz) includes the 99_otus.fasta file. We first import this fasta file as a FeatureData[Sequence] artifact representing the Greengenes 13_8 99% OTUs.

source activate qiime2-2022.2
mkdir QIIME2R-Bioinformatics
cd QIIME2R-Bioinformatics

qiime tools import \
  --input-path gg_13_8_otus/rep_set/99_otus.fasta \
  --output-path 99_otus.qza \
  --type 'FeatureData[Sequence]'

In general, closed-reference OTU clustering is performed at a higher percent identity. Here, we perform clustering at 99% identity against the Greengenes 13_8 99% OTUs.

qiime vsearch cluster-features-closed-reference \
  --i-table FeatureTableMiSeq_SOP.qza \
  --i-sequences RepSeqsMiSeq_SOP.qza \
  --i-reference-sequences 99_otus.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table TableCR99.qza \
  --o-clustered-sequences RepSeqsCR99.qza \
  --o-unmatched-sequences UnmatchedCR99.qza

In the above command, the --i-reference-sequences flag specifies the reference database to cluster against. This input should be a .qza file of QIIME 2 data type FeatureData[Sequence], containing a fasta file with the sequences to use as references. SILVA or Greengenes are most often used as the reference for 16S rRNA gene sequences, while other standard references, such as UNITE for ITS data, are also used. Still others prefer to curate their own databases.

After implementing closed-reference clustering, we obtain a FeatureTable[Frequency] artifact (TableCR99.qza) and two FeatureData[Sequence] artifacts (RepSeqsCR99.qza and UnmatchedCR99.qza). RepSeqsCR99.qza holds the sequences representing the clustered features. Note that UnmatchedCR99.qza does not contain the sequences defining the features in the FeatureTable, but rather the collection of feature ids and their sequences that did not match the reference database at 99% identity.

6.3.3 Remarks on Closed-Reference Clustering

Closed-reference clustering, as a phylotype-based method, directly assigns sequences based on their distance (similarity) to phylotypes, i.e., reference sequences, whereas distance-based methods group sequences into OTUs based on the distance (similarity) between the sequences themselves.

The phylotype-based methods have several appealing features, including:

Easy linking of a sequence to previously identified microbes, computational efficiency, and stable classification.
Speed and the potential for trivial parallelization (Westcott and Schloss 2015).
Closed-reference clustering methods cluster sequence reads against a reference dataset; thus the OTUs obtained from this method can be used for alpha- and beta-diversity estimation and compared directly across studies (Westcott and Schloss 2015; He et al. 2015).
Sequence reads from different marker gene regions can be clustered together if the reference dataset consists of full-length marker genes (He et al. 2015).
OTU clustering can be parallelized for large datasets (He et al. 2015), which makes it suitable for meta-analysis.

However, the phylotype-based methods also face critical challenges:

The success of assignment is highly contingent on the sequencing platform and the reference database (Tyler et al. 2014). When reference databases are incomplete because a large portion of the taxa in a sample is unknown or not yet well defined, and hence not recorded in the databases, the corresponding sequences cannot be assigned to an OTU. Thus, novel sequences from previously unidentified taxonomic lineages detected in an experiment cannot be analyzed (Tyler et al. 2014; Schloss and Westcott 2011).
Because they depend largely on the completeness of the reference database, these clustering methods do not perform well if many novel organisms exist in the sequencing data (Schloss and Westcott 2011; Chen et al. 2016).
In particular, the fundamental problem of the closed-reference approach is that two query sequences matched to the same reference sequence at the same high similarity (e.g., 97%) may have only a lower similarity (e.g., 94%) to each other (Westcott and Schloss 2015). This is the issue of adverse triplets, which is common in practice (Edgar 2018).
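The adverse-triplet issue can be reproduced with a toy calculation. In the sketch below, the reference and the two queries are made-up 100-base sequences that are already aligned position by position, so percent identity is simply the share of matching positions; real tools compute identity from an alignment instead.

```shell
# Build a string of n copies of character c.
rep() { awk -v c="$1" -v n="$2" 'BEGIN { for (i = 0; i < n; i++) printf "%s", c; print "" }'; }

# Percent identity of two equal-length, already aligned sequences.
identity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a); m = 0
    for (i = 1; i <= n; i++) if (substr(a, i, 1) == substr(b, i, 1)) m++
    printf "%d\n", 100 * m / n
  }'
}

ref=$(rep A 100)           # toy reference, 100 bases
q1="CCC$(rep A 97)"        # differs from ref at the first 3 positions
q2="$(rep A 97)GGG"        # differs from ref at the last 3 positions

id_q1_ref=$(identity "$q1" "$ref")
id_q2_ref=$(identity "$q2" "$ref")
id_q1_q2=$(identity "$q1" "$q2")
echo "q1 vs ref: ${id_q1_ref}%  q2 vs ref: ${id_q2_ref}%  q1 vs q2: ${id_q1_q2}%"
```

Both queries are 97% identical to the reference and would fall into the same closed-reference OTU at a 97% threshold, yet they are only 94% identical to each other because their differences lie at different positions.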

In summary, because closed-reference clustering methods are largely dependent on the completeness of the reference database, they are often employed to annotate sequences (Sun et al. 2012) rather than to detect novel sequences.

6.4 De Novo Clustering

De novo clustering is a distance-based method (Schloss and Westcott 2011), also called taxonomy-independent (Sun et al. 2012), OTU-based (Liu et al. 2008; Chen et al. 2013), taxonomy-unsupervised (Sul et al. 2011), or de novo (Navas-Molina et al. 2013; Edgar 2010) clustering.

6.4.1 Introduction

De novo clustering groups sequences based on sequence identity or genetic distance alone. It first computes pairwise sequence distances by comparing the sequences against one another rather than against a reference database (Forster et al. 2016), and then groups the sequences into OTUs by applying a clustering algorithm with a specified threshold.

The algorithm behind de novo clustering is first to cluster all sequences in the FeatureData[Sequence] artifact against one another (rather than against a reference database) based on the pairwise sequence distances, and then to collapse features in the FeatureTable into new features that are clusters of the input features, i.e., classify reads that have a similarity greater than a threshold (typically 97% or 99% identity) as the same OTU.
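One simple way to picture this is greedy, abundance-sorted centroid clustering, sketched below on made-up data. This is a simplified illustration and not the VSEARCH implementation: it uses position-wise identity on equal-length toy sequences instead of alignment, checks centroids in the order they were created, and uses a 90% threshold only because the toy sequences are 10 bases long.

```shell
# Made-up dereplicated sequences (equal length) with their abundances.
seqs="AAAAAAAAAA 50
AAAAAAAAAC 5
GGGGGGGGGG 20
GGGGGGGGGT 3
CCCCCCCCCC 1"

THRESHOLD=90   # percent identity; chosen for 10-base toy sequences

clusters=$(printf '%s\n' "$seqs" | sort -k2,2nr | awk -v t="$THRESHOLD" '
  # Position-wise percent identity of two equal-length sequences.
  function ident(a, b,    i, m, n) {
    n = length(a); m = 0
    for (i = 1; i <= n; i++) if (substr(a, i, 1) == substr(b, i, 1)) m++
    return 100 * m / n
  }
  BEGIN { nc = 0 }
  {
    # Join the first centroid that is similar enough, else start a new OTU.
    hit = 0
    for (k = 1; k <= nc; k++)
      if (ident($1, cent[k]) >= t) { size[k] += $2; hit = 1; break }
    if (!hit) { nc++; cent[nc] = $1; size[nc] = $2 }
  }
  END { for (k = 1; k <= nc; k++) print cent[k], size[k] }')
printf '%s\n' "$clusters"
```

The five input features collapse into three OTUs, each represented by its most abundant member (the centroid), with the member counts summed. Because sequences are processed one at a time, the result of such a greedy scheme depends on the processing order.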

6.4.2 Implement Cluster-Features-De-Novo

De novo clustering of a feature table can be performed as follows. Here, we create 99% OTUs by setting the --p-perc-identity parameter to 0.99 in the cluster-features-de-novo method, which wraps the VSEARCH --cluster_size function. First, store the artifacts FeatureTableMiSeq_SOP.qza and RepSeqsMiSeq_SOP.qza in the directory QIIME2R-Bioinformatics. Then, after activating QIIME 2, type cd QIIME2R-Bioinformatics in the terminal to make this folder the working directory. Finally, call the cluster-features-de-novo method of the qiime vsearch plugin to perform the de novo clustering.

qiime vsearch cluster-features-de-novo \
  --i-table FeatureTableMiSeq_SOP.qza \
  --i-sequences RepSeqsMiSeq_SOP.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table TableDn99MiSeq_SOP.qza \
  --o-clustered-sequences RepSeqsDn99MiSeq_SOP.qza

The above command generates two artifacts: a FeatureTable[Frequency] (TableDn99MiSeq_SOP.qza) in the BIOMV210DirFmt format, and a FeatureData[Sequence] (RepSeqsDn99MiSeq_SOP.qza) in the DNASequencesDirectoryFormat format. We can review them through the QIIME 2 viewer. The FeatureData[Sequence] artifact contains the centroid sequence defining each OTU cluster.
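One possible way to prepare these artifacts for viewing is to turn them into visualizations with the q2-feature-table plugin, as sketched below. The .qzv output names are our own placeholders, and the guard skips the calls when QIIME 2 or the clustered artifacts are not present.

```shell
TABLE=TableDn99MiSeq_SOP.qza
SEQS=RepSeqsDn99MiSeq_SOP.qza

if command -v qiime >/dev/null 2>&1 && [ -f "$TABLE" ] && [ -f "$SEQS" ]; then
  # Per-sample and per-feature summary of the clustered table.
  qiime feature-table summarize \
    --i-table "$TABLE" \
    --o-visualization TableDn99MiSeq_SOP.qzv

  # Table of OTU ids and their centroid sequences.
  qiime feature-table tabulate-seqs \
    --i-data "$SEQS" \
    --o-visualization RepSeqsDn99MiSeq_SOP.qzv
  VIEW_STATUS=ran
else
  echo "qiime or clustered artifacts not found; skipping summaries"
  VIEW_STATUS=skipped
fi
```

The resulting .qzv files can be opened with qiime tools view or by uploading them to https://view.qiime2.org.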

6.4.3 Remarks on De Novo Clustering

De novo clustering methods overcome most limitations of phylotype-based methods and have several advantages. The de novo clustering approach:

Carries out the clustering step independently of any reference phylotype database. Thus, these methods outperform phylotype-based reference methods for assigning 16S rRNA gene sequences to OTUs and have been preferred across the field (Westcott and Schloss 2015).
Is optimal for samples that contain many bacteria that have no reference sequences in the public databases.
Has been demonstrated to significantly outperform closed-reference and open-reference clustering for picking OTUs (Schloss 2016; Jackson et al. 2016).

However, de novo clustering methods also have several weaknesses:

They are computationally intensive (owing to the cost of hierarchical clustering), relatively slow, and memory-demanding, a burden that grows with expanding sequencing throughput and higher sequencing error rates; the choice of linkage method for clustering is also difficult (Schloss and Westcott 2011; Westcott and Schloss 2015).
They tend to produce a very large number of OTUs.
The OTUs obtained from both de novo and open-reference clustering methods can affect alpha-diversity analyses (e.g., rarefaction curves), beta-diversity analyses such as principal component analysis and distance-based ordination (e.g., principal coordinate analysis), and the identification of differentially represented OTUs by hypothesis testing (e.g., the R value from ADONIS) (He et al. 2015).
In particular, de novo clustering methods have one fundamental problem: the clustering results (i.e., OTU assignments) are strongly sensitive to the input order of the sequences (Mahé et al. 2014; He et al. 2015).

In summary, de novo clustering methods have attracted more attention and have become the preferred option for researchers (Schloss 2010; Cai et al. 2017).

6.5 Open-Reference Clustering

Open-reference clustering is a hybrid of the closed-reference clustering and de novo clustering approaches (Navas-Molina et al. 2013; Rideout et al. 2014).

6.5.1 Introduction

Open-reference clustering combines the closed-reference and de novo methods and sequentially performs closed-reference clustering and de novo clustering, in which a closed-reference clustering method is first used to assign OTUs, and the unassigned sequences outputted by the closed-reference method are then grouped by a de novo clustering method (Westcott and Schloss 2017).

For example, in QIIME-uclust, the “pick_open_reference_otus.py” script implements the latest QIIME open-reference OTU clustering (Rideout et al. 2014). The algorithm proceeds in four steps. First, it performs closed-reference clustering against a reference database (e.g., the Greengenes v.13.8 97% OTU database) using the clustering method UCLUST (Edgar 2010), which exploits USEARCH to search large sequence databases and assign sequences to clusters. Second, it subsamples the reads that did not map in the first step (default subsampling proportion = 0.001) and clusters them de novo. Third, the remaining unmapped reads are closed-reference clustered against these de novo OTUs. Finally, it performs another round of de novo clustering on the reads that are still unmapped.

6.5.2 Implement Cluster-Features-Open-Reference

Similar to the closed-reference clustering, open-reference clustering (Rognes et al. 2016) can be performed using the cluster-features-open-reference method via the qiime vsearch plugin. Also similar to the closed-reference clustering, open-reference OTU clustering is generally performed at a higher percent identity.

qiime vsearch cluster-features-open-reference \
  --i-table FeatureTableMiSeq_SOP.qza \
  --i-sequences RepSeqsMiSeq_SOP.qza \
  --i-reference-sequences 99_otus.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table TableOR99.qza \
  --o-clustered-sequences RepSeqsOR99.qza \
  --o-new-reference-sequences NewRefSeqsOR99.qza

Open-reference clustering generated a FeatureTable[Frequency] artifact TableOR99.qza, and two FeatureData[Sequence] artifacts: RepSeqsOR99.qza and NewRefSeqsOR99.qza. The first FeatureData[Sequence] artifact represents the clustered sequences, while the second artifact represents the new reference sequences, composed of the reference sequences used for input, and the sequences clustered as part of the internal de novo clustering step.

6.5.3 Remarks on Open-Reference Clustering

Open-reference clustering combines the closed-reference and de novo methods by performing them sequentially; thus, in theory, it should retain the strengths of both. For example, open-reference clustering was evaluated as much more effective than using de novo methods alone and was recommended for assigning OTUs, as implemented with uclust in QIIME (He et al. 2015).

However, open-reference clustering methods also have weaknesses:

It blends the strengths and weaknesses of the other two methods, and combining them has been reported to be potentially problematic because commonly used closed-reference and de novo clustering implementations employ different OTU definitions, and because of issues associated with database quality and classification error (Westcott and Schloss 2015, 2017).
OTU clustering tends to exaggerate the number of unique organisms found within a sample (Edgar 2017). In particular, open-reference OTU clustering (QIIME) consistently yields more OTUs than the number of ASVs obtained with DADA2 (Sierra et al. 2020).
A recent study (Prodan et al. 2020) showed that QIIME-uclust (used in QIIME 1) produced a large number of spurious OTUs and inflated alpha-diversity measures, and suggested that QIIME-uclust should be avoided in future studies.

In summary, the performance of open-reference clustering has not been consistently confirmed.

6.6 Summary

This chapter described clustering sequences into OTUs via QIIME 2. First, several preliminary steps of OTU clustering were described, including merging reads, removing non-biological sequences, trimming reads to the same length, discarding low-quality reads, and dereplicating sequences. Then, VSEARCH and q2-vsearch were introduced, which perform versatile bioinformatic functions including searching, clustering, chimera detection, subsampling, paired-end read merging, and dereplication. Next, three approaches to OTU clustering (closed-reference clustering, de novo clustering, and open-reference clustering) were described and illustrated, and their advantages and disadvantages were discussed. In Chap. 7, we will introduce the original OTU methods in numerical taxonomy.

References

Cai, Yunpeng, Wei Zheng, Jin Yao, Yujie Yang, Volker Mai, Qi Mao, and Yijun Sun. 2017. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time. PLoS Computational Biology 13 (4): e1005518.
Caporaso, J. Gregory, Christian L. Lauber, William A. Walters, Donna Berg-Lyons, James Huntley, Noah Fierer, Sarah M. Owens, Jason Betley, Louise Fraser, Markus Bauer, Niall Gormley, Jack A. Gilbert, Geoff Smith, and Rob Knight. 2012. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. The ISME Journal 6 (8): 1621–1624. https://doi.org/10.1038/ismej.2012.8.
Chen, Wei, Clarence K. Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. 2013. A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS One 8 (8): e70837. https://doi.org/10.1371/journal.pone.0070837.
Chen, Shi-Yi, Feilong Deng, Ying Huang, Xianbo Jia, Yi-Ping Liu, and Song-Jia Lai. 2016. bioOTU: An improved method for simultaneous taxonomic assignments and operational taxonomic units clustering of 16s rRNA gene sequences. Journal of Computational Biology 23 (4): 229–238.
De Filippis, F., E. Parente, T. Zotta, and D. Ercolini. 2018. A comparison of bioinformatic approaches for 16S rRNA gene profiling of food bacterial microbiota. International Journal of Food Microbiology 265: 9–17. https://doi.org/10.1016/j.ijfoodmicro.2017.10.028.
Edgar, Robert C. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26 (19): 2460–2461. https://doi.org/10.1093/bioinformatics/btq461.
———. 2017. Accuracy of microbial community diversity estimated by closed- and open-reference OTUs. PeerJ 5: e3889. https://doi.org/10.7717/peerj.3889.
———. 2018. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics 34 (14): 2371–2375. https://doi.org/10.1093/bioinformatics/bty113.
Forster, Dominik, Micah Dunthorn, Thorsten Stoeck, and Frédéric Mahé. 2016. Comparison of three clustering approaches for detecting novel environmental microbial diversity. PeerJ 4: e1692.
He, Yan, J. Gregory Caporaso, Xiao-Tao Jiang, Hua-Fang Sheng, Susan M. Huse, Jai Ram Rideout, Robert C. Edgar, Evguenia Kopylova, William A. Walters, Rob Knight, and Hong-Wei Zhou. 2015. Stability of operational taxonomic units: An important but neglected property for analyzing microbial diversity. Microbiome 3: 20. https://doi.org/10.1186/s40168-015-0081-x.
Jackson, Matthew A., Jordana T. Bell, Tim D. Spector, and Claire J. Steves. 2016. A heritability-based comparison of methods used to cluster 16S rRNA gene sequences into operational taxonomic units. PeerJ 4: e2341.
Lawley, Blair, and Gerald W. Tannock. 2017. Analysis of 16S rRNA gene amplicon sequences using the QIIME software package. In Oral Biology, 153–163. Springer.
Liu, Zongzhi, Todd Z. DeSantis, Gary L. Andersen, and Rob Knight. 2008. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36 (18): e120. https://doi.org/10.1093/nar/gkn491.
Liu, Zhunga, Quan Pan, Jean Dezert, Jun-Wei Han, and You He. 2017a. Classifier fusion with contextual reliability evaluation. IEEE Transactions on Cybernetics 48 (5): 1605–1618.
Liu, Zhun-Ga, Quan Pan, Jean Dezert, and Arnaud Martin. 2017b. Combination of classifiers with optimal weight based on evidential reasoning. IEEE Transactions on Fuzzy Systems 26 (3): 1217–1230.
Mahé, Frédéric, Torbjørn Rognes, Christopher Quince, Colomban de Vargas, and Micah Dunthorn. 2014. Swarm: Robust and fast clustering method for amplicon-based studies. PeerJ 2: e593. https://doi.org/10.7717/peerj.593.
Navas-Molina, José A., Juan M. Peralta-Sánchez, Antonio González, Paul J. McMurdie, Yoshiki Vázquez-Baeza, Zhenjiang Xu, Luke K. Ursell, Christian Lauber, Hongwei Zhou, Se Jin Song, James Huntley, Gail L. Ackermann, Donna Berg-Lyons, Susan Holmes, J. Gregory Caporaso, and Rob Knight. 2013. Advancing our understanding of the human microbiome using QIIME. Methods in Enzymology 531: 371–444. https://doi.org/10.1016/b978-0-12-407863-5.00019-8.
Prodan, Andrei, Valentina Tremaroli, Harald Brolin, Aeilko H. Zwinderman, Max Nieuwdorp, and Evgeni Levin. 2020. Comparing bioinformatic pipelines for microbial 16S rRNA amplicon sequencing. PLoS One 15 (1): e0227434. https://doi.org/10.1371/journal.pone.0227434.
Rideout, Jai Ram, Yan He, Jose A. Navas-Molina, William A. Walters, Luke K. Ursell, Sean M. Gibbons, John Chase, Daniel McDonald, Antonio Gonzalez, Adam Robbins-Pianka, Jose C. Clemente, Jack A. Gilbert, Susan M. Huse, Hong-Wei Zhou, Rob Knight, and J. Gregory Caporaso. 2014. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ 2: e545. https://doi.org/10.7717/peerj.545.
Matias Rodrigues, João F., Thomas S.B. Schmidt, Janko Tackmann, and Christian von Mering. 2017. MAPseq: Highly efficient k-mer search with confidence estimates, for rRNA sequence analysis. Bioinformatics 33 (23): 3808–3810. https://doi.org/10.1093/bioinformatics/btx517.
Rognes, Torbjørn, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. VSEARCH: A versatile open source tool for metagenomics. PeerJ 4: e2584. https://doi.org/10.7717/peerj.2584.
Schloss, Patrick D. 2010. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Computational Biology 6 (7): e1000844. https://doi.org/10.1371/journal.pcbi.1000844.
———. 2016. Application of a database-independent approach to assess the quality of operational taxonomic unit picking methods. mSystems 1 (2): e00027-16.
Schloss, Patrick D., and Jo Handelsman. 2005. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Applied and Environmental Microbiology 71 (3): 1501–1506. https://doi.org/10.1128/aem.71.3.1501-1506.2005.
Schloss, P.D., and S.L. Westcott. 2011. Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Applied and Environmental Microbiology 77 (10): 3219–3226.
Seguritan, V., and F. Rohwer. 2001. FastGroup: A program to dereplicate libraries of 16S rDNA sequences. BMC Bioinformatics 2: 9. https://doi.org/10.1186/1471-2105-2-9.
Sierra, Maria A., Qianhao Li, Smruti Pushalkar, Bidisha Paul, Tito A. Sandoval, Angela R. Kamer, Patricia Corby, Yuqi Guo, Ryan Richard Ruff, and Alexander V. Alekseyenko. 2020. The influences of bioinformatics tools and reference databases in analyzing the human oral microbial community. Genes 11 (8): 878.
Stackebrandt, E., and B.M. Goebel. 1994. Taxonomic note: A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. International Journal of Systematic and Evolutionary Microbiology 44 (4): 846–849. https://doi.org/10.1099/00207713-44-4-846.
Sul, Woo Jun, James R. Cole, Ederson da C. Jesus, Qiong Wang, Ryan J. Farris, Jordan A. Fish, and James M. Tiedje. 2011. Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering. Proceedings of the National Academy of Sciences of the United States of America 108 (35): 14637–14642. https://doi.org/10.1073/pnas.1111435108.
Sun, Yijun, Yunpeng Cai, Susan M. Huse, Rob Knight, William G. Farmerie, Xiaoyu Wang, and Volker Mai. 2012. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Briefings in Bioinformatics 13 (1): 107–121. https://doi.org/10.1093/bib/bbr009.
Tyler, Andrea D., Michelle I. Smith, and Mark S. Silverberg. 2014. Analyzing the human microbiome: A “how to” guide for physicians. The American Journal of Gastroenterology 109: 983. https://doi.org/10.1038/ajg.2014.73.
Westcott, Sarah L., and Patrick D. Schloss. 2015. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ 3: e1487. https://doi.org/10.7717/peerj.1487.
———. 2017. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2 (2): e00073-17. https://doi.org/10.1128/mSphereDirect.00073-17.
Whelan, Fiona J., and Michael G. Surette. 2017. A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis. Microbiome 5 (1): 100. https://doi.org/10.1186/s40168-017-0314-2.