Chapter 4 described and illustrated how to generate a feature table and feature data (i.e., representative sequences). Chapter 5 described and illustrated how to assign taxonomy and build a phylogenetic tree. In this chapter, we describe and illustrate how to cluster the representative sequences into OTUs. We first introduce the preliminary steps for clustering sequences into OTUs (Sect. 6.1). Then we introduce VSEARCH and q2-vsearch (Sect. 6.2). In the next three sections, we introduce and illustrate how to cluster sequences into OTUs using q2-vsearch: closed-reference clustering (Sect. 6.3), de novo clustering (Sect. 6.4), and open-reference clustering (Sect. 6.5). Finally, we provide a brief summary in Sect. 6.6.
6.1 Introduction to Clustering Sequences into OTUs
OTUs are used pragmatically as proxies for potential microbial species represented in a sample. The developers of QIIME 2 recommend working with Amplicon Sequence Variants (ASVs); thus, the QIIME 2 workflow does not include a typical OTU picking step by default. However, OTU picking is the traditional approach to generating a feature table (OTU table) for downstream data analysis, and some bioinformatics centers still use it to generate data. OTU picking remains available as an option in QIIME 2 through the q2-vsearch plugin. We therefore introduce the OTU picking technique here.
Before sequences are clustered, several preliminary steps need to be completed: (1) merging paired-end reads, (2) removing non-biological sequences, (3) trimming all reads to the same length, (4) discarding low-quality reads, and (5) dereplicating the reads.
6.1.1 Merge Reads
Whether reads need to be merged depends on how the sequences will be denoised or clustered into ASVs or OTUs. The reads need to be joined before using Deblur or OTU clustering methods, which can be done with the join-pairs method of the QIIME 2 q2-vsearch plugin. When the sequences are denoised using DADA2, merging reads beforehand is not necessary because DADA2 merges the reads automatically after denoising them.
6.1.2 Remove Non-biological Sequences
Any non-biological sequences such as primers, sequencing adapters, and PCR spacers should be removed before clustering. There are comprehensive methods for removing non-biological sequences from paired-end or single-end data in the q2-cutadapt plugin. We refer the interested readers to QIIME 2 documentation files for details.
As recalled from Chap. 4, DADA2 can remove non-biological sequences when the denoising function is called. When calling the denoise functions, we can specify values for the --p-trim-left parameters (--p-trim-left for single-end reads; --p-trim-left-f and --p-trim-left-r for paired-end reads) to remove base pairs from the 5′ end of the reads. In this case, the ASVs were obtained by denoising with DADA2, during which the non-biological sequences were removed, so we do not need to perform this step again. If sub-OTUs/ASVs are obtained through Deblur, the non-biological sequences need to be removed beforehand because Deblur does not yet have this functionality.
6.1.3 Trim Reads Length
The raw reads need to be trimmed to the same length before OTU clustering. QIIME 2 recommends denoising the reads first. Denoising involves a length-trimming step, and the resulting ASVs can then optionally be passed through a clustering algorithm. For this reason, QIIME 2 currently does not have a function that directly trims reads to the same length.
6.1.4 Discard Low-Quality Reads
Low-quality reads are discarded through quality filtering with the quality-filter plugin. Different types of quality filtering are available in QIIME 2, including the q-score method for single- or paired-end sequences (i.e., SampleData[PairedEndSequencesWithQuality | SequencesWithQuality]) and the q-score-joined method for joined reads (i.e., SampleData[JoinedSequencesWithQuality]) after merging. We refer the readers to Chap. 4 (Sect. 4.4.3 Preliminary Works for Denoising with Deblur).
6.1.5 Dereplicate Sequences
All types of clustering first require dereplicating the sequences. In QIIME 2, this can be done with the dereplicate-sequences method of the q2-vsearch plugin.
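As a rough illustration of what dereplication means, the sketch below collapses identical reads into unique sequences with abundance counts. It is a conceptual example only, not the q2-vsearch implementation; the function name `dereplicate` and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs.

    Hypothetical helper for illustration; real tools also track
    per-sample counts and sequence identifiers.
    """
    counts = Counter(reads)
    # Sort by decreasing abundance, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy reads (made up for this example).
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTGA"]
unique = dereplicate(reads)
print(unique)
```

Each unique sequence then stands in for all of its identical copies, so later clustering steps compare far fewer sequences.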
6.2 Introduction to VSEARCH and q2-vsearch
In this section and those that follow, we cover three OTU picking methods via QIIME 2 using the example dataset we used in Chaps. 4 and 5. In this case, because the ASVs were obtained by denoising with DADA2, the reads were already merged, so the merging step can be omitted.
After quality filtering and denoising, sequences are identified by assigning them to taxonomic groups or clustering them into OTUs to obtain datasets suitable for downstream statistical analyses. Typically, there are three ways to assign sequences to OTUs (Lawley and Tannock 2017; Whelan and Surette 2017; De Filippis et al. 2018): closed-reference clustering, de novo clustering, and open-reference clustering. QIIME 2 currently supports all three (Rideout et al. 2014).
VSEARCH (Rognes et al. 2016) is a versatile open-source tool for processing and preparing metagenomics, genomics, and population genomics nucleotide sequence data. It was designed as an alternative to USEARCH (Edgar 2010) and is based on a fast heuristic algorithm for searching nucleotide sequences. It performs optimal global sequence alignment of the query using full dynamic programming. Its functionalities include searching, clustering, chimera detection, subsampling, paired-end read merging, and dereplication.
Currently, two options are available in QIIME 2 for clustering sequences or features into OTUs using vsearch. (1) Using demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact). This option is currently performed in two steps; a single command is expected in a future release of QIIME 2. (2) Using dereplicated, quality-controlled data in the form of a feature table and feature representative sequences (i.e., the FeatureTable[Frequency] and FeatureData[Sequence] artifacts). These artifacts can be generated by a variety of analysis pipelines, such as the qiime vsearch dereplicate-sequences, qiime dada2 denoise, or qiime deblur denoise commands. The second option is performed in one step. The FeatureTable[Frequency] (in this case, FeatureTableMiSeq_SOP.qza) and FeatureData[Sequence] (RepSeqsMiSeq_SOP.qza) artifacts were already generated in Chap. 4, so we can use them directly to cluster sequences into OTUs.
Traditionally, a 97% identity threshold was used to approximate species (Stackebrandt and Goebel 1994; Schloss and Handelsman 2005; Seguritan and Rohwer 2001; Westcott and Schloss 2017). More stringent cut-offs have recently been suggested to avoid over-classification of the representative sequences, which could result in spurious OTUs. Given the much larger datasets now available, identity thresholds of around 99% for full-length sequences and around 100% for the V4 hypervariable region are considered optimal (Edgar 2018).
6.3 Closed-Reference Clustering
Closed-reference clustering (Caporaso et al. 2012; Navas-Molina et al. 2013) is a phylotype-based method, also called phylotyping (Schloss and Westcott 2011) or a taxonomy-dependent method (Sun et al. 2012).
6.3.1 Introduction
Closed-reference clustering groups together sequences that match the same reference sequence in a database at a certain similarity. That is, this method bins sequences into groups defined by a well-curated database of known sequences: each query sequence is first compared to an annotated reference taxonomy database via sequence classification or searching methods (Liu et al. 2017a, b; Rodrigues et al. 2017), and the sequences that match the same reference sequence are then grouped into the same OTU. The algorithm behind closed-reference clustering first clusters the sequences in the FeatureData[Sequence] artifact against a reference database and then collapses the features in the FeatureTable into new features that are clusters of the input features.
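The two stages described above (match each sequence to a reference, then collapse the feature counts) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the q2-vsearch implementation: the hypothetical `identity` helper compares equal-length sequences position by position, whereas VSEARCH uses global alignment, and all names and toy data are invented for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Simplifying assumption: no alignment is performed.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(features, references, threshold):
    """Assign each feature to its best reference at or above the threshold.

    features:   {feature_id: (sequence, count)}
    references: {reference_id: sequence}
    Returns (otu_table, unmatched): collapsed counts per reference id,
    and the features that matched no reference.
    """
    otu_table, unmatched = {}, {}
    for fid, (seq, count) in features.items():
        best_id, best_score = None, 0.0
        for rid, rseq in references.items():
            score = identity(seq, rseq)
            if score > best_score:
                best_id, best_score = rid, score
        if best_id is not None and best_score >= threshold:
            # Collapse: counts of all features hitting the same reference add up.
            otu_table[best_id] = otu_table.get(best_id, 0) + count
        else:
            unmatched[fid] = seq
    return otu_table, unmatched


# Toy data (made up for this example).
references = {"ref1": "ACGTACGTAC", "ref2": "TTTTGGGGCC"}
features = {
    "f1": ("ACGTACGTAC", 10),
    "f2": ("ACGTACGTAA", 4),
    "f3": ("TTTTGGGGCC", 7),
    "f4": ("GGGGGGGGGG", 2),
}
otu_table, unmatched = closed_reference(features, references, threshold=0.9)
print(otu_table)
print(unmatched)
```

In this toy run, f1 and f2 collapse into the OTU named after ref1, f3 maps to ref2, and f4 matches nothing, so it lands in the unmatched set.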
6.3.2 Implement Cluster-Features-Closed-Reference

[Command output: Imported gg_13_8_otus/rep_set/99_otus.fasta as DNASequencesDirectoryFormat to 99_otus.qza]
In general, closed-reference OTU clustering is performed at a higher percent identity. Here, we perform clustering at 99% identity against the Greengenes 13_8 99% OTUs.

[Command output: saved a FeatureTable[Frequency] artifact and FeatureData[Sequence] artifacts to 3 .qza files.]
In the above commands, the --i-reference-sequences flag specifies the reference database to cluster against. This reference input should be a .qza file containing a FASTA file of the sequences to use as references, with QIIME 2 data type FeatureData[Sequence]. SILVA or Greengenes are most often used as the reference for 16S rRNA gene sequences, while other standard references such as UNITE are used for ITS data. Some researchers prefer to curate their own databases.
After closed-reference clustering, we obtain a FeatureTable[Frequency] artifact (TableCR99.qza) and FeatureData[Sequence] artifacts (RepSeqsCR99.qza and UnmatchedCR99.qza). Note that the unmatched-sequences artifact (in this case UnmatchedCR99.qza) does not contain the sequences defining the features in the FeatureTable; rather, it is the collection of feature IDs and their sequences that did not match the reference database at 99% identity.
6.3.3 Remarks on Closed-Reference Clustering
Closed-reference clustering, as a phylotype-based method, assigns sequences directly based on their distance (similarity) to phylotypes, i.e., reference sequences, whereas distance-based methods group sequences into OTUs based on the distance (similarity) between the sequences themselves.
Its advantages include the easy linking of a sequence to previously identified microbes, computational efficiency, and stable classification.
It has the strengths of speed and the potential for trivial parallelization (Westcott and Schloss 2015).
Closed-reference clustering methods cluster sequence reads against a reference dataset; thus, the OTUs obtained from this method can be used for alpha- and beta-diversity estimation and can be compared directly across studies (Westcott and Schloss 2015; He et al. 2015).
Sequence reads from different marker gene regions can be clustered together if the reference dataset consists of full-length marker genes (He et al. 2015).
OTU clustering can be parallelized for large datasets (He et al. 2015), which is suitable for meta-analysis.
The success of assignment is highly contingent on the sequencing platform and the reference database (Tyler et al. 2014). When reference databases are incomplete because a large portion of the taxa in a sample is unknown or not yet well defined, and hence not recorded in the databases, the corresponding sequences cannot be assigned to an OTU. Thus, novel sequences detected in an experiment, which come from previously unidentified taxonomic lineages, cannot be analyzed (Tyler et al. 2014; Schloss and Westcott 2011).
Because they depend largely on the completeness of the reference database, these clustering methods do not perform well if many novel organisms exist in the sequencing data (Schloss and Westcott 2011; Chen et al. 2016).
In particular, the fundamental problem of the closed-reference approach is that two query sequences that match the same reference sequence at the same high similarity (e.g., 97%) may have a lower similarity (e.g., 94%) to each other (Westcott and Schloss 2015). This is the issue of adverse triplets, which is common in practice (Edgar 2018).
In summary, because closed-reference clustering methods are largely dependent on the completeness of the reference database, they are often employed to annotate sequences (Sun et al. 2012) rather than to detect novel sequences.
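The adverse-triplet issue mentioned in the remarks above can be made concrete with a small numeric example. The sketch assumes equal-length sequences and a position-by-position `identity` helper (a simplification; real tools use alignment), and the sequences are invented so that both queries match the reference at 95% yet share only 90% identity with each other.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy sequences of length 20 (made up for this example).
reference = "ACGTACGTACGTACGTACGT"
query1 = "TCGTACGTACGTACGTACGT"  # differs from the reference at the first base
query2 = "ACGTACGTACGTACGTACGA"  # differs from the reference at the last base

threshold = 0.95
to_ref_1 = identity(query1, reference)
to_ref_2 = identity(query2, reference)
between = identity(query1, query2)

print(to_ref_1, to_ref_2, between)
# Both queries pass the threshold against the reference and so share an OTU,
# although they fall below the same threshold when compared to each other.
is_adverse_triplet = (
    to_ref_1 >= threshold and to_ref_2 >= threshold and between < threshold
)
print(is_adverse_triplet)
```

Because the two queries differ from the reference at different positions, their mismatches add up when they are compared directly.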
6.4 De Novo Clustering
De novo clustering is a distance-based method (Schloss and Westcott 2011), also called a taxonomy-independent (Sun et al. 2012), OTU-based (Zongzhi Liu et al. 2008; Chen et al. 2013), taxonomy-unsupervised (Sul et al. 2011), or de novo (Navas-Molina et al. 2013; Edgar 2010) clustering method.
6.4.1 Introduction
De novo clustering groups sequences based on sequence identity or genetic distance alone. It computes pairwise sequence distances by comparing each sequence against the others rather than against a reference database (Forster et al. 2016), and then groups the sequences into OTUs by applying a clustering algorithm with a specified threshold.
The algorithm behind de novo clustering first clusters all sequences in the FeatureData[Sequence] artifact against one another (rather than against a reference database) based on the pairwise sequence distances, and then collapses the features in the FeatureTable into new features that are clusters of the input features. That is, reads whose similarity exceeds a threshold (typically 97% or 99% identity) are classified as the same OTU.
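One common way to realize this idea is greedy centroid clustering, in which sequences are visited in order of decreasing abundance and each one either joins the first existing centroid it matches or founds a new OTU. The sketch below illustrates that general approach under simplifying assumptions: the hypothetical `identity` helper compares equal-length sequences without alignment, and the function names and toy data are invented. It is not the VSEARCH implementation.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_de_novo(features, threshold):
    """Greedy centroid clustering of features against one another.

    features: {feature_id: (sequence, count)}
    Returns {centroid_id: {"members": [feature ids], "count": total count}}.
    """
    # Visit features by decreasing abundance (ties broken by id).
    order = sorted(features, key=lambda fid: (-features[fid][1], fid))
    clusters = {}
    for fid in order:
        seq, count = features[fid]
        for cid, cluster in clusters.items():
            if identity(seq, features[cid][0]) >= threshold:
                # Join the first centroid that is similar enough.
                cluster["members"].append(fid)
                cluster["count"] += count
                break
        else:
            # No centroid matched: this feature founds a new OTU.
            clusters[fid] = {"members": [fid], "count": count}
    return clusters


# Toy data (made up for this example).
features = {
    "a": ("ACGTACGTAC", 10),
    "b": ("ACGTACGTAA", 3),
    "c": ("TTTTGGGGCC", 5),
    "d": ("TTTTGGGGCA", 1),
}
clusters = cluster_de_novo(features, threshold=0.9)
print(clusters)
```

Here the most abundant sequences, a and c, become centroids, and the rarer variants b and d are absorbed into them, so four input features collapse into two OTUs.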
6.4.2 Implement Cluster-Features-De-Novo
De novo clustering of a feature table can be performed as follows. Here, we perform clustering at 99% identity by specifying it in the --p-perc-identity parameter to create 99% OTUs; the method wraps the VSEARCH --cluster_size function. First, store the artifacts FeatureTableMiSeq_SOP.qza and RepSeqsMiSeq_SOP.qza in the directory QIIME2R-Bioinformatics. Then, after activating QIIME 2, type cd QIIME2R-Bioinformatics in the terminal to change to this folder. Finally, call the cluster-features-de-novo method of the qiime vsearch plugin to perform the de novo clustering.

[Command output: saved a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact to 2 .qza files.]
The above commands generate two artifacts: a FeatureTable[Frequency] (TableDn99MiSeq_SOP.qza) with the BIOMV210DirFmt format, and a FeatureData[Sequence] (RepSeqsDn99MiSeq_SOP.qza) with the DNASequencesDirectoryFormat format. We can review them through the QIIME 2 viewer. The FeatureData[Sequence] artifact contains the centroid sequence defining each OTU cluster.
6.4.3 Remarks on De Novo Clustering
De novo clustering carries out the clustering step independently, without reference to a phylotype database. Thus, these methods outperform phylotype-based reference methods for assigning 16S rRNA gene sequences to OTUs and have been preferred across the field (Westcott and Schloss 2015).
It is optimal for samples that contain many bacteria that have no reference sequences in the public databases.
In particular, de novo clustering methods were shown to significantly outperform closed-reference and open-reference clustering for picking OTUs (Schloss 2016; Jackson et al. 2016).
It is computationally intensive (the cost of hierarchical clustering), relatively slow, and requires more memory because of the higher sequencing error rates that accompany expanding sequencing throughput; the choice of linkage method for clustering is also difficult (Schloss and Westcott 2011; Westcott and Schloss 2015).
It tends to produce a very large number of OTUs.
The OTUs obtained from both de novo clustering and open-reference OTU clustering methods affect alpha-diversity analyses (e.g., rarefaction curves), beta-diversity analyses such as principal component analysis and distance-based ordination (e.g., principal coordinate analysis), and the identification of differentially represented OTUs by hypothesis testing (e.g., the ADONIS R value) (He et al. 2015).
In particular, de novo clustering methods have one fundamental problem: the clustering results (i.e., OTU assignments) are strongly sensitive to the input order of the sequences (Mahé et al. 2014; He et al. 2015).
In summary, de novo clustering methods have attracted more attention and have become the preferred option for researchers (Schloss 2010; Cai et al. 2017).
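The sensitivity to input order noted in the remarks above is easy to reproduce with a greedy clustering sketch. The example below is a toy illustration under stated assumptions: sequences are equal-length and compared position by position by a hypothetical `identity` helper, and `greedy_cluster` processes sequences strictly in the order given. The three invented sequences form a chain in which the middle one is within the threshold of both ends, but the ends are not within the threshold of each other.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Cluster sequences in the order given; the first member is the centroid."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Toy sequences of length 20 (made up for this example).
middle = "ACGTACGTACGTACGTACGT"
left = "TCGTACGTACGTACGTACGT"   # 95% identical to middle
right = "ACGTACGTACGTACGTACGA"  # 95% identical to middle, 90% to left

middle_first = greedy_cluster([middle, left, right], threshold=0.95)
left_first = greedy_cluster([left, middle, right], threshold=0.95)
print(len(middle_first), len(left_first))
```

The same three sequences yield one OTU when the middle sequence is seen first and two OTUs when an end sequence is seen first, which is why practical tools fix the processing order, for example by sorting on abundance.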
6.5 Open-Reference Clustering
Open-reference clustering is a hybrid of the closed-reference clustering and de novo clustering approaches (Navas-Molina et al. 2013; Rideout et al. 2014).
6.5.1 Introduction
Open-reference clustering performs closed-reference clustering and de novo clustering sequentially: a closed-reference clustering method is first used to assign OTUs, and the unassigned sequences output by the closed-reference method are then grouped by a de novo clustering method (Westcott and Schloss 2017).
For example, in QIIME-uclust, the “pick_open_reference_otus.py” script implements the latest QIIME open-reference OTU clustering (Rideout et al. 2014). The algorithm works as follows. First, it performs closed-reference clustering against a reference database (e.g., Greengenes v.13.8, 97% OTU database) using the UCLUST clustering method (Edgar 2010), which exploits USEARCH to search large sequence databases and to assign sequences to clusters. Then, it subsamples (default subsampling proportion = 0.001) the reads that do not map in this first step and performs a de novo OTU clustering step on them. Next, the remaining unmapped reads are closed-reference clustered against these de novo OTUs. Finally, it performs another step of de novo clustering on the reads that still remain unmapped.
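The basic two-stage idea (closed-reference first, then de novo on whatever is left) can be sketched as follows. This is a deliberately simplified illustration, not the multi-step QIIME 1 workflow with subsampling and not the q2-vsearch implementation; the `identity` helper assumes equal-length sequences without alignment, and all function names and toy data are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(seq, centroids, threshold):
    """Return the id of the most similar centroid at or above threshold, else None."""
    best_id, best_score = None, 0.0
    for cid, cseq in centroids.items():
        score = identity(seq, cseq)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id if best_score >= threshold else None


def open_reference(features, references, threshold):
    """Closed-reference assignment followed by de novo clustering of the rest.

    features:   {feature_id: (sequence, count)}
    references: {reference_id: sequence}
    Returns (otu_table, new_centroids).
    """
    otu_table, new_centroids, unmatched = {}, {}, []
    # Stage 1: closed-reference assignment.
    for fid, (seq, count) in features.items():
        rid = best_match(seq, references, threshold)
        if rid is None:
            unmatched.append(fid)
        else:
            otu_table[rid] = otu_table.get(rid, 0) + count
    # Stage 2: greedy de novo clustering of unmatched features by abundance.
    unmatched.sort(key=lambda fid: (-features[fid][1], fid))
    for fid in unmatched:
        seq, count = features[fid]
        cid = best_match(seq, new_centroids, threshold)
        if cid is None:
            cid = fid
            new_centroids[cid] = seq  # this feature founds a new OTU
        otu_table[cid] = otu_table.get(cid, 0) + count
    return otu_table, new_centroids


# Toy data (made up for this example).
references = {"ref1": "ACGTACGTAC"}
features = {
    "f1": ("ACGTACGTAA", 6),
    "f2": ("TTTTGGGGCC", 5),
    "f3": ("TTTTGGGGCA", 2),
    "f4": ("GGGGGGGGGG", 1),
}
otu_table, new_centroids = open_reference(features, references, threshold=0.9)
print(otu_table)
print(new_centroids)
```

In this toy run, f1 is assigned to the reference OTU, while f2 to f4 fail the reference search and are clustered de novo into two new OTUs whose centroids could extend the reference set.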
6.5.2 Implement Cluster-Features-Open-Reference
Similar to closed-reference clustering, open-reference clustering (Rognes et al. 2016) can be performed using the cluster-features-open-reference method of the qiime vsearch plugin. Also like closed-reference clustering, open-reference OTU clustering is generally performed at a higher percent identity.

[Command output: saved a FeatureTable[Frequency] artifact and FeatureData[Sequence] artifacts to 3 .qza files.]
Open-reference clustering generated a FeatureTable[Frequency] artifact TableOR99.qza, and two FeatureData[Sequence] artifacts: RepSeqsOR99.qza and NewRefSeqsOR99.qza. The first FeatureData[Sequence] artifact represents the clustered sequences, while the second artifact represents the new reference sequences, composed of the reference sequences used for input, and the sequences clustered as part of the internal de novo clustering step.
6.5.3 Remarks on Open-Reference Clustering
Because open-reference clustering performs closed-reference clustering and de novo clustering sequentially, this method should in theory retain the strengths of both. For example, open-reference clustering was evaluated to be much more effective than using de novo methods alone and was recommended for assigning OTUs when using uclust in QIIME (He et al. 2015).
It blends the strengths and weaknesses of the other methods, and reviews have identified potential problems with combining the two methods: commonly used closed-reference and de novo clustering implementations employ different OTU definitions, and the results are affected by database quality and classification error (Westcott and Schloss 2015, 2017).
OTU clustering tends to exaggerate the number of unique organisms found within a sample (Edgar 2017). In particular, open-reference OTU clustering (QIIME) consistently picks more OTUs than the number of ASVs produced by DADA2 (Sierra et al. 2020).
In particular, a recent study (Prodan et al. 2020) showed that QIIME-uclust (used in QIIME 1) produced a large number of spurious OTUs and inflated alpha-diversity measures, and suggested that QIIME-uclust should be avoided in future studies.
In summary, the performance of open-reference clustering has not been consistently confirmed.
6.6 Summary
This chapter described clustering sequences into OTUs via QIIME 2. First, several preliminary steps of OTU clustering were described, including merging reads, removing non-biological sequences, trimming reads to the same length, discarding low-quality reads, and dereplicating sequences. Then, VSEARCH and q2-vsearch were introduced, which perform versatile bioinformatic functions including searching, clustering, chimera detection, subsampling, paired-end read merging, and dereplication. Next, three OTU clustering approaches (closed-reference clustering, de novo clustering, and open-reference clustering) were described and illustrated. Their advantages and disadvantages were also discussed. In Chap. 7, we will introduce the original OTU methods in numerical taxonomy.