In Chap. 7 we introduced clustering-based OTU methods in numerical taxonomy. In this chapter we discuss the movement beyond OTU methods and investigate its necessity and feasibility. We divide the presentation into five sections: We first describe clustering-based OTU methods, the purposes of using OTUs, and the definitions of species and species-level analysis in microbiome studies (Sect. 8.1). We then introduce OTU-based methods that move toward single-nucleotide resolution (Sect. 8.2). Section 8.3 describes methods that move beyond OTUs and discusses the limitations of OTU methods. Section 8.4 discusses the necessity and possibility of moving beyond OTU methods. Finally, we summarize the contents of this chapter (Sect. 8.5).
8.1 Clustering-Based OTU Methods in Microbiome Study
In microbiome studies, there are two widely used sequence-similarity approaches: clustering-based OTU methods and phylotype-based (phylotyping) methods. Both approaches use a similarity measure to group sequences into small units, and both are rooted in the clustering methods of the original numerical taxonomy.
8.1.1 Common Clustering-Based OTU Methods
Clustering-based OTU methods typically first perform a hierarchical cluster analysis and then assign the sequence reads to OTUs based on a distance threshold, i.e., sequences are clustered according to their similarity to the other sequences in the community (Schloss and Westcott 2011; Huse et al. 2010; Schloss and Handelsman 2005; Schloss et al. 2009; Sun et al. 2009).
In contrast, phylotype-based methods classify sequences into taxonomic bins (i.e., phylotypes) based on their similarity to reference sequences (Schloss and Westcott 2011; Huse et al. 2008; Liu et al. 2008; Stackebrandt and Goebel 1994). For example, taxonomy-supervised analysis (Sul et al. 2011) belongs to the phylotyping methods: sequences are allocated into taxonomy-supervised “taxonomy bins” to provide 16S rRNA gene classification based on the existing bacterial taxonomy in ribosomal RNA databases, such as the Ribosomal Database Project (RDP) (Cole et al. 2007), Greengenes (DeSantis et al. 2006a), and SILVA (Pruesse et al. 2007). Because the processing of taxonomy bins is rooted in polyphasic taxonomy (Colwell 1970), taxonomy-supervised analysis is believed to reflect physiological, morphological, and genetic information (Sul et al. 2011). SortMeRNA (Kopylova et al. 2012), which is suited for closed-reference OTU clustering, also belongs to the phylotyping methods: query sequences are searched against a reference database, and alignments are then evaluated based on optimal regions of similarity between the two sequences.
For the benefits and challenges of using phylotype-based methods, the interested reader is referred to the paper by Schloss and Westcott (2011) and the references cited therein. For how to perform taxonomic binning, and for an overview of existing web tools for taxonomic assignment and phylotyping of metagenome sequence samples, the interested reader is referred to Dröge and McHardy (2012). For a comprehensive evaluation of the performance of open-source clustering software (namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA) against UCLUST and USEARCH in QIIME, the hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE, the interested reader is referred to Kopylova et al. (2016) and the references cited therein. One example of a comparison of various methods used to cluster 16S rRNA gene sequences into OTUs is Jackson et al. (2016).
Within the framework of clustering-based OTU methods, various nonhierarchical clustering algorithms have been developed to improve the resolution of assigning 16S rRNA gene sequences to OTUs, including heuristic clustering, network-based clustering, and model-based clustering.
Although hierarchical clustering and heuristic clustering methods have been criticized due to either the computational complexity (hierarchical clustering) (Barriuso et al. 2011) or considerable loss in clustering quality (greedy clustering) (Sun et al. 2012; Chen et al. 2013a; Bonder et al. 2012), these two clustering methods are still used to assign OTUs in microbiome literature. In the following, we will introduce hierarchical clustering OTU methods (Sect. 8.1.2) and heuristic clustering OTU methods (Sect. 8.1.3), respectively.
8.1.2 Hierarchical Clustering OTU Methods
Single-linkage (also called nearest-neighbor) clustering, complete-linkage (also called furthest-neighbor) clustering, and average-linkage (also known as average-neighbor or UPGMA: unweighted pair-group method by using arithmetic averages) clustering are the three commonly used hierarchical clustering algorithms (Sokal and Rohlf 1995; Legendre and Legendre 1998). As we reviewed in Chap. 7, hierarchical clustering-based OTU methods are rooted in numerical taxonomy.
The single-linkage algorithm defines the distance between two clusters as the minimum distance between two sequences, one in each cluster. The complete-linkage algorithm defines the distance between two clusters as the maximum distance between two sequences, one in each cluster. The average-linkage algorithm defines the distance between two clusters as the average distance from each sequence in one cluster to every sequence in the other cluster. For a more detailed description of these three clustering methods, the reader is referred to Chap. 7. These hierarchical clustering methods were widely adopted in the early development of microbiome software. For example, DOTUR (Distance-Based OTU and Richness) (Schloss and Handelsman 2005) is, as its name suggests, a clustering-based OTU method: it assigns sequences to OTUs by using either the furthest (i.e., complete-linkage), average (i.e., UPGMA), or nearest (i.e., single-linkage) neighbor algorithm at each distance level (Schloss and Handelsman 2005). All these hierarchical clustering algorithms are commonly used in various disciplines (Sokal and Rohlf 1995; Legendre and Legendre 2012).
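To make these linkage rules concrete, the following minimal Python sketch assigns five sequences to OTUs at a 3% distance cutoff under each of the three algorithms. It uses SciPy rather than DOTUR or mothur, and the toy distance matrix and the 0.03 cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise genetic distances among five sequences (symmetric, zero diagonal).
dist = np.array([
    [0.00, 0.01, 0.02, 0.10, 0.12],
    [0.01, 0.00, 0.02, 0.11, 0.12],
    [0.02, 0.02, 0.00, 0.10, 0.11],
    [0.10, 0.11, 0.10, 0.00, 0.01],
    [0.12, 0.12, 0.11, 0.01, 0.00],
])
condensed = squareform(dist)  # SciPy expects the condensed (upper-triangle) form

for method in ("single", "complete", "average"):  # nearest, furthest, UPGMA
    tree = linkage(condensed, method=method)
    # Cut the dendrogram at 0.03: sequences within 3% distance form one OTU.
    otus = fcluster(tree, t=0.03, criterion="distance")
    print(method, otus)  # every method yields two OTUs here, e.g., [1 1 1 2 2]
```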
RDP (the Ribosomal Database Project) (Cole et al. 2009) provides taxonomy assignment, alignment, and analysis tools for quality-controlled bacterial and archaeal small-subunit rRNA research. RDP offers taxonomy-independent alignment, in which the trimmed reads are first aligned and then clustered into OTUs at multiple pairwise distances using the complete-linkage clustering algorithm (Cole et al. 2009). In early releases, RDP included a pairwise distance matrix suitable for the DOTUR molecular macroecology package (Cole et al. 2007). The RDP pipeline was reviewed as performing better because it uses quality scores and performs a structural alignment, but as being accurate only at high sequence differences (Quince et al. 2009).
In 2009, in order to perform hypothesis tests and describe and compare communities using OTU-based methods, mothur (Schloss et al. 2009) integrated other software features into its implementation, including pyrosequencing pipeline (RDP) (Cole et al. 2009); NAST, SINA, and RDP aligners (Cole et al. 2009; DeSantis et al. 2006a, b; Pruesse et al. 2007); DNADIST (Felsenstein 1989); DOTUR (Schloss and Handelsman 2005); CD-HIT (Li and Godzik 2006); SONS (Schloss and Handelsman 2006b); LIBSHUFF (Schloss et al. 2004; Singleton et al. 2001); TreeClimber (Maddison and Slatkin 1991; Martin 2002; Schloss and Handelsman 2006a); and UniFrac (Lozupone and Knight 2005).
Although mothur includes many functions for performing hypothesis tests and for describing and comparing communities, at its core mothur is an improved version of DOTUR: it uses the OTU-based methods of DOTUR (Schloss and Handelsman 2005) and CD-HIT (Li and Godzik 2006) for assigning sequences to OTUs, implementing three de novo clustering algorithms, i.e., clustering sequences using the furthest, nearest, or UPGMA algorithm from DOTUR, or using a nearest-neighbor-based approach from CD-HIT. Thus, mothur is one of the representative hierarchical clustering methods for picking OTUs.
In 2009, ESPRIT (Sun et al. 2009) was proposed; it uses the k-mer (substrings of length k) distance to quickly compute pairwise distances between reads and classifies sequence reads into OTUs at different dissimilarity levels via the complete-linkage algorithm. ESPRIT targets massively parallel pyrosequencing technology, i.e., large collections of 16S rRNA pyrosequences, for estimating species richness.
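The k-mer distance idea can be sketched in a few lines. This is not the published ESPRIT implementation; the profile and distance formulas below are simplified, illustrative assumptions.

```python
from collections import Counter

def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count all overlapping substrings of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a: str, b: str, k: int = 4) -> float:
    """1 minus the fraction of shared k-mers (0 means identical profiles)."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # multiset intersection of k-mer counts
    total = min(sum(pa.values()), sum(pb.values()))
    return 1.0 - shared / total if total else 1.0

print(kmer_distance("ACGTACGTACGT", "ACGTACGAACGT"))  # ~0.44
```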
In 2010, SLP (a single-linkage pre-clustering) (Huse et al. 2010) was proposed to overcome the effect of sequencing errors and decrease the inflation of OTUs because the complete-linkage algorithm is sensitive to sequencing artifacts (Huse et al. 2010). SLP uses an average-linkage clustering to more accurately predict expected OTU richness in environmental samples.
In 2013, a modified version of mcClust (Cole et al. 2014; Fish et al. 2013) was proposed that performs complete-linkage clustering with algorithmic changes that lower the time complexity and speed up clustering. HPC-CLUST (Matias Rodrigues and von Mering 2013) was proposed for large sets of nucleotide sequences as a highly optimized, distributed implementation of the complete- and average-linkage hierarchical clustering algorithms.
In 2015, oclust (Franzén et al. 2015) was developed to improve OTU quality and decrease variance using long-read 16S rRNA gene amplicon sequencing by implementing the complete-linkage clustering.
8.1.3 Heuristic Clustering OTU Methods
In early microbiome research, the commonly used hierarchical clustering methods were adopted from numerical taxonomy (Sokal and Sneath 1963) and numerical ecology (Legendre and Legendre 1998). However, hierarchical clustering algorithms have high computational complexity in time and space. Thus, to trade off computational efficiency against accuracy, various heuristic methods have been proposed that reduce the computational complexity of sequence comparison by using a greedy clustering strategy (e.g., CD-HIT (Huang et al. 2010; Li and Godzik 2006; Fu et al. 2012) and UCLUST (Edgar 2010)) instead of hierarchical clustering.
Heuristic clustering methods process input sequences one by one rather than computing pairwise distances among all sequences. Typically, a greedy incremental clustering strategy is used, with a seed sequence to initiate each cluster (Chen et al. 2016). Because heuristic clustering methods compare each sequence only with the seed sequences, they are able to process massive sequence datasets. Thus, heuristic clustering methods are considered more attractive than hierarchical clustering methods for picking OTUs in 16S rRNA datasets (Cai and Sun 2011; Wei et al. 2021) and have been applied by the Human Microbiome Project (HMP) (Peterson et al. 2009).
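The greedy incremental strategy itself can be sketched compactly. The crude identity function, toy reads, and threshold below are simplifying assumptions (real tools such as CD-HIT and UCLUST use optimized alignments and sort the input first); the sketch only shows the seed-based control flow.

```python
def identity(a: str, b: str) -> float:
    """Crude identity for equal-length toy reads: fraction of matching positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    seeds, clusters = [], []
    for seq in seqs:                      # process reads one by one
        for i, seed in enumerate(seeds):  # compare only against the seeds
            if identity(seq, seed) >= threshold:
                clusters[i].append(seq)   # join the first similar-enough seed
                break
        else:                             # no seed is close enough: new cluster
            seeds.append(seq)
            clusters.append([seq])
    return clusters

reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "TTGTACGGAC"]
print([len(c) for c in greedy_cluster(reads, threshold=0.9)])  # [3, 1]
```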
Heuristic clustering methods for constructing OTUs

| Method | Description | References |
|---|---|---|
| CD-HIT | Aims to improve accuracy, scalability, flexibility, and speed for clustering and comparing biological sequences; sorts sequences by length before clustering | Li and Godzik (2006); Huang et al. (2010); Fu et al. (2012) |
| USEARCH and UCLUST | USEARCH is an algorithm for sequence database search that seeks high-scoring global alignments; it implements greedy clustering, typically achieves good sensitivity for identity, and sorts sequences by abundance before clustering. UCLUST is a heuristic clustering algorithm that employs USEARCH to assign sequences to clusters; it takes advantage of the UBLAST and USEARCH algorithms, enabling sensitive local and global search of large sequence databases at exceptionally high speed to achieve high throughput. Both QIIME and UPARSE are based on USEARCH | Edgar (2010) |
| GramClust | Uses the inherent grammar distance metric of each pairwise sequence to cluster a set of sequences; determines partitioning for a set of biological sequences with higher statistical accuracy than both CD-HIT-EST and UCLUST | Russell et al. (2010) |
| ESPRIT-Tree | A new version of ESPRIT focusing on quasilinear time and space complexity; avoids using seed sequences to represent clusters, unlike most existing clustering methods; initially constructs a pseudometric-based partition tree for a coarse representation of the entire set of sequences, then iteratively finds the closest pairs of sequences or clusters and merges them into a new cluster; able to analyze very large 16S rRNA pyrosequence collections in quasilinear computational time, with effectiveness and accuracy comparable to other greedy heuristic clustering algorithms | Cai and Sun (2011) |
| DNACLUST | Aims specifically to cluster highly similar DNA sequences for phylogenetic marker genes (e.g., 16S rRNA); uses a greedy clustering strategy, with a novel sequence alignment and k-mer-based filtering algorithms to accelerate the clustering | Ghodsi et al. (2011) |
| DySC (Dynamic Seed-based Clustering) | Uses a dynamic seeding strategy for greedy clustering of 16S rRNA reads; first uses the traditional greedy incremental clustering strategy to form pending clusters, then converts a pending cluster into a fixed cluster when it reaches a threshold size and reselects a new fixed seed; the new fixed seed is defined as the sequence that maximizes the sum of k-mers shared between the fixed read and the other reads in the cluster (see the sketch after this table) | Zheng et al. (2012) |
| LST-HIT | A DNA sequence clustering method based on CD-HIT; aims to speed up the computation, given the intrinsic difficulty of parallelization; uses a novel filtering technique based on the longest common subsequence to remove dissimilar sequence pairs before performing pairwise sequence alignment | Namiki et al. (2013) |
| DBC454 | A taxonomy-independent (i.e., unsupervised clustering) method for fungal ITS1 (internal transcribed spacer 1) sequences from 454 reads using a density-based hierarchical clustering procedure; focuses on reproducibility and robustness | Pagni et al. (2013) |
| MSClust | A multiseed-based heuristic clustering method; uses an adaptive strategy to generate multiple seeds for one cluster; assigns a query sequence to a cluster if the average distance between the sequence and the seeds is smaller than the user-defined threshold, and otherwise marks the sequence as unassigned | Chen et al. (2013a) |
| UPARSE | An improved version of USEARCH that adds chimera detection for seed sequences; aims to sequence microbial amplicon reads with high accuracy and to reduce amplification artifacts; uses a greedy algorithm to perform chimera filtering and OTU clustering simultaneously through quality filtering, trimming reads, optionally discarding singleton reads, and then clustering the remaining reads; uses algorithms similar to those in AmpliconNoise (Quince et al. 2011), i.e., inferring errors in a sequence using parsimony; the generated OTU sequences were reported with ≤1% incorrect bases, and such accurate OTUs are closer to the species in a community | Edgar (2013) |
| LSH | A greedy clustering algorithm that uses locality-sensitive hashing (an LSH-based similarity function) to cluster similar sequences into individual groups (OTUs); accelerates pairwise sequence comparisons via the LSH algorithm and improves the quality of sequence comparisons by incorporating a matching criterion; the assigned OTUs can be used to compute different species diversity/richness metrics (e.g., species richness estimation) | Rasheed et al. (2013) |
| SUMACLUST | A de novo clustering method; performs exact sequence alignment, comparison, and clustering rather than the semiglobal alignments implemented in CD-HIT and USEARCH; incrementally constructs clusters by comparing an abundance-ordered list of input sequences with the representative already-chosen sequences based on a greedy clustering strategy; has been integrated into QIIME 1.9.0 | Mercier et al. (2013) |
| Swarm | A de novo clustering method based on an unsupervised single-linkage clustering method; addresses two fundamental problems of greedy clustering methods: arbitrary global clustering thresholds and input-order dependency induced by centroid selection; first uses local threshold clustering to generate an initial set of OTUs (nearly identical amplicons) by iteratively agglomerating similar sequences, then uses the internal structure of clusters and amplicon abundances to refine the clustering results into sub-OTUs; has been integrated into QIIME 1.9.0 | Mahé et al. (2014) |
| OTUCLUST | A greedy de novo clustering algorithm (like SUMACLUST) specifically designed for the inference of OTUs; performs exact sequence alignment; compares an abundance-ordered list of input sequences with the representative already-chosen sequences; performs sequence de-duplication and chimera removal via UCHIME (Edgar et al. 2011); is one component of the software pipeline MICCA for processing amplicon metagenomic datasets, which combines quality filtering, OTU clustering, taxonomy assignment, and phylogenetic tree inference | Albanese et al. (2015) |
| VSEARCH | An alternative to the USEARCH tool based on a heuristic algorithm for searching nucleotide sequences; performs optimal global sequence alignment of the query using full dynamic programming; performs searching, clustering, chimera detection (reference-based or de novo), subsampling, paired-end read merging, and dereplication (full length or prefix) | Rognes et al. (2016) |
| OptiClust | An OTU assignment algorithm that iteratively reassigns sequences to new OTUs to maximize the Matthews correlation coefficient (MCC); aims to improve the quality of the OTU assignments; requires a distance matrix for constructing OTUs because it is a distance-based algorithm | Westcott and Schloss (2017) |
| ESPRIT-Forest | An improved version of ESPRIT and ESPRIT-Tree; clusters massive sequence data in sub-quadratic time and space while inheriting the ESPRIT/ESPRIT-Tree pipeline for preprocessing, hierarchical clustering, and statistical analysis; first organizes sequences into a pseudometric-based partitioning tree for sublinear-time nearest-neighbor search, then uses a new multiple-pair merging criterion to construct clusters in parallel using multiple threads | Cai et al. (2017) |
| DBH | A de Bruijn (DB) graph-based heuristic clustering method; reduces the sensitivity of seeds to sequencing errors caused by selecting just one sequence as the seed for each cluster; first forms temporary clusters using the traditional greedy clustering approach, then, when the size of a temporary cluster reaches the predefined minimum sequence number, builds a DB graph for the cluster and generates a new seed to represent it | Wei and Zhang (2017) |
| Fuzzy | Aims to improve clustering quality via clustering based on fuzzy sets; takes uncertainty into account when producing OTUs using the fuzzy OTU-picking algorithm | Bazin et al. (2019) |
| DMSC | A dynamic multiseed clustering method for generating OTUs; first uses a greedy incremental strategy to generate a series of clusters based on the distance threshold; when the number of sequences in a cluster reaches the predefined minimum size, the multicore sequence (MCS) set is selected as the seeds of the cluster; the average distance to the MCS and the standard deviation of distances within the MCS determine whether a new sequence is added to the cluster; the MCS is dynamically updated until no sequence is merged into the cluster | Wei and Zhang (2019) |
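As an illustration of the dynamic seeding idea described for DySC (and similarly DMSC) in the table above, the sketch below reselects a fixed seed as the member sharing the most k-mers with the rest of a pending cluster. The function names, the k value, and the size threshold are illustrative assumptions, not the published implementation.

```python
from collections import Counter

def kmers(seq: str, k: int = 4) -> Counter:
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a: str, b: str, k: int = 4) -> int:
    return sum((kmers(a, k) & kmers(b, k)).values())

def reselect_seed(cluster, k: int = 4) -> str:
    """Pick the member maximizing the total k-mers shared with the other members."""
    return max(cluster,
               key=lambda s: sum(shared_kmers(s, t, k) for t in cluster if t != s))

pending = ["ACGTACGTACGT", "ACGTACGAACGT", "ACGTACGTACGA", "ACGTTCGTACGT"]
MIN_SIZE = 4  # assumed size at which a pending cluster becomes a fixed cluster
if len(pending) >= MIN_SIZE:
    print("new fixed seed:", reselect_seed(pending))
```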
8.1.4 Limitations of Clustering-Based OTU Methods
The early methodological developments of numerical taxonomy provided a rich source for modern ecology and microbiome research. For example, the concepts of OTU and similarity, the use of similarity to define taxonomic rank, and the use of distance as a measure of taxonomic similarity have all been adopted in current ecology and microbiome research.
However, OTU methods in microbiome research and in numerical taxonomy have different assumptions, purposes, and approaches. Numerical taxonomy assumes that characters can be grouped based on their similarities, and it employs commonly used clustering methods to perform the classification of characters. Bioinformatic analysis of the microbiome assumes that sequence similarity can predict taxonomic similarity, and it has a dual purpose: grouping sequences and estimating diversity. Its final goal is to detect associations between the microbiome and health and the development of disease. Thus, microbiome research is not restricted to commonly used hierarchical clustering and heuristic clustering.
Theoretically, any method that can achieve these goals is appropriate. Newly developed approaches such as denoising, refined OTU clustering, model-based methods, and filtering-free methods all target this final goal of bioinformatics in microbiome research.
However, the main motivation of moving beyond the clustering-based OTU methods in microbiome research is that current application of OTU methods to microbiome data has several limitations. Here we describe and discuss the clustering-based OTU methods and especially their limitations that have motivated microbiome research to move beyond them.
8.1.4.1 Assumptions of OTU Methods in Numerical Taxonomy and Microbiome
OTU methods used in numerical taxonomy and in microbiome research have different aims. Adopting OTU methods into microbiome studies is therefore somewhat inappropriate, and their use is controversial.
Although the OTU concept was initially convenient for bioinformatic analysis of microbiome sequencing data, the use of OTU methods in the microbiome literature deviates from the considerations of the original authors. As reviewed in Chap. 7, the original authors of numerical taxonomy strictly separated taxonomic classification from phylogenetic classification, treating the latter as “phylogenetic speculation” and the former as “taxonomic procedure.” The original authors considered OTUs operational, not functional (functionality remained with taxa); cluster analysis of similarity applies merely to the dataset at hand and requires that characters be drawn from large and randomly chosen samples; and taxonomic similarity does not represent phylogenetic similarity. In contrast, assigning sequences to a particular OTU implicitly assumes that the similarities among sequences are closely related to phylogeny, and therefore that sequence similarities can be used to derive phylogenetic divergence or to ecologically rank the similarity of taxa. However, the assumption that sequence similarity of 16S hypervariable regions is a good proxy for phylogenetic, and therefore ecological, similarity is problematic (Prosser et al. 2007; Tikhonov et al. 2015; Preheim et al. 2013), because sequence similarity does not imply ecological similarity. It has been demonstrated that sequence similarity predicts ecological similarity very poorly (Tikhonov et al. 2015). The application of OTU methods in microbiome studies thus in effect forces OTUs to play the functional role of taxa.
8.1.4.2 Challenges of Defining Species Using OTU Methods
OTU methods have been used in microbiome studies to classify taxa of classical Linnaean taxonomy at different taxonomic levels by grouping similar DNA sequences.
Because 16S rRNA genes evolve at different rates in different organisms, determining taxonomic rank by sequence identity thresholds is at best approximate (Woese 1987). Hierarchical clustering methods generally apply a predefined clustering threshold to the hierarchical tree and then group sequences within the threshold into one OTU. For example, because a natural entity “species” cannot be identified as a group of strains that is genetically well separated from its phylogenetic neighbors, sequences are often clustered into OTUs as proxies for species, although sequence divergence is not evenly distributed across the 16S rRNA region. Thus, defining a species by a polyphasic approach is only a pragmatic approach.
Previously, the threshold value of 70% was used in the phylogenetic definition of a species (“approximately 70% or greater DNA-DNA relatedness and with 5°C or less ΔTm”) (Wayne et al. 1987). This approach of applying DNA similarity to study bacterial species was well received and well proven and has been acknowledged since the 1970s (Stackebrandt and Goebel 1994; Johnson 1985; Steigerwalt et al. 1976). In 1994, Stackebrandt and Goebel (1994) proposed the use of “similarities of 97% and higher” sequence identity as the canonical clustering threshold, at a time when few 16S rRNA sequences were available. In practice, 3% dissimilarity is often chosen as the cutoff value to define bacterial species (Sogin et al. 2006; Sun et al. 2009; Schloss 2010; Huse et al. 2007), although different similarity cutoffs have been studied, such as 99.6–95.6% similarity (equating to 1–13 bp differences) (Hathaway et al. 2017).
However, the 97% similarity cutoff was based on the observation that species having 70% or greater DNA similarity usually have more than 97% sequence identity, with the remaining 3% (about 45 nucleotides) of differences concentrated mainly in certain hypervariable regions (Stackebrandt and Goebel 1994), given that the primary structure of the 16S rRNA is highly conserved. Thus, the criterion of 97% similarity or 3% dissimilarity (Schloss and Handelsman 2005) is observational and empirical; it is not really able to define the natural entity “species.” In particular, it has been shown that the 97% threshold is not optimal as an approximation to species because it is too low: to approximate species, a threshold of at least 99% should be used (Edgar 2018). In summary, this practical approach of using OTUs to define a species is somewhat arbitrary and subjective, and therefore inaccurate.
8.1.4.3 Criteria of Defining the Levels of Taxonomy
Like the 3% sequence dissimilarity used to define a species (Borneman and Triplett 1997; Stackebrandt and Goebel 1994; Hugenholtz et al. 1998; Konstantinidis et al. 2006; Goris et al. 2007), in practice 95% sequence similarity (5% dissimilarity) is used to define a genus (Ludwig et al. 1998; Everett et al. 1999; Sait et al. 2002; Schloss and Handelsman 2005); 90% similarity (10% dissimilarity) is used to define a family (Schloss et al. 2009); 80% similarity (20% dissimilarity) is used to define a phylum (Schloss and Handelsman 2005); an interkingdom identity range of 70–85% is used to define a kingdom (Woese 1987; Borneman and Triplett 1997); and 55% similarity (45% dissimilarity) is used to define a domain (Stackebrandt and Goebel 1994).
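These conventional cutoffs can be collected into a simple lookup, sketched below. The function and its decision rule are illustrative assumptions rather than any standard tool (the overlapping kingdom range of 70–85% is omitted), and, as discussed next, the cutoffs themselves are contested.

```python
RANK_IDENTITY_CUTOFFS = [  # (rank, minimum 16S sequence identity), finest first
    ("species", 0.97),
    ("genus",   0.95),
    ("family",  0.90),
    ("phylum",  0.80),
    ("domain",  0.55),
]

def finest_rank(identity: float) -> str:
    """Return the finest rank whose conventional cutoff this identity meets."""
    for rank, cutoff in RANK_IDENTITY_CUTOFFS:
        if identity >= cutoff:
            return rank
    return "unclassified"

print(finest_rank(0.96))  # -> 'genus'
```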
This practice of using an arbitrary fixed global clustering threshold to group amplicons into molecular OTUs has been criticized (Mahé et al. 2015b) because global clustering thresholds have rarely been justified and are not applicable to all taxa and marker lengths (Caron et al. 2009; Nebel et al. 2011; Dunthorn et al. 2012; Brown et al. 2015). Actually it is impossible to use an accurate distance-based threshold to define taxonomic levels (Schloss and Westcott 2011). Therefore, the criteria of defining the levels of taxonomy are arbitrary, subjective, and inaccurate.
8.1.4.4 Clustering Algorithms and the Accuracy of OTU Clustering Results
As reviewed in Chap. 7, the original OTU methods are based on cluster analysis. However, when the clustering-based OTU methods were adopted in microbiome studies, several limitations have been detected that are associated with the clustering algorithms.
In practice, it is difficult to choose which clustering algorithm to use even if we could use a constant threshold to define OTUs, because different clustering algorithms have been assessed as having different clustering quality. For example, when comparing hierarchical and heuristic clustering algorithms in OTU construction, hierarchical clustering algorithms are assessed as more accurate (Sun et al. 2010). However, heuristic clustering methods are able to process massive sequence datasets; thus, they are more attractive than hierarchical clustering methods for picking OTUs in 16S rRNA datasets (Cai and Sun 2011; Wei et al. 2021) and have been applied by the Human Microbiome Project (HMP) (Peterson et al. 2009). However, most heuristic clustering methods simply select one sequence at random as the seed to represent a cluster and keep this seed fixed, which makes the assigned OTUs sensitive to (dependent on) the selected seeds. In fact, no real consensus has been reached on how the random seed should be chosen, suggesting that OTU assignments are not definitive (Schloss and Westcott 2011).
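The input-order dependency is easy to demonstrate. In the minimal sketch below (a compact variant of the greedy sketch in Sect. 8.1.3; the toy reads and the 0.8 threshold are illustrative assumptions), the same three reads yield different cluster memberships depending on which sequence happens to become a seed.

```python
def identity(a: str, b: str) -> float:
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    clusters = []
    for seq in seqs:
        for c in clusters:
            if identity(seq, c[0]) >= threshold:  # c[0] is the fixed seed
                c.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

reads = ["AAAAAAAAAA", "AAAAAAAATT", "AAAAAATTTT"]
# The middle read joins whichever seed it encounters first:
print(greedy_cluster(reads, 0.8))        # [['AAAAAAAAAA', 'AAAAAAAATT'], ['AAAAAATTTT']]
print(greedy_cluster(reads[::-1], 0.8))  # [['AAAAAATTTT', 'AAAAAAAATT'], ['AAAAAAAAAA']]
```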
Among the commonly used hierarchical clustering algorithms, a unanimous conclusion on which hierarchical clustering algorithm is optimal has not been reached.
Complete-linkage clustering was originally recommended in numerical ecology because it is often desirable when one wishes to delineate clusters with clear discontinuities (Legendre and Legendre 1998), while single-linkage clustering is implemented (Wei et al. 2012) in BlastClust (BLASTLab 2004) and GeneRage (Enright and Ouzounis 2000). Other researchers demonstrated that the average-linkage algorithm produces more robust OTUs than other hierarchical and heuristic clustering algorithms (i.e., CD-HIT, UCLUST, ESPRIT, and BlastClust) (Schloss and Westcott 2011; Sun et al. 2012; Quince et al. 2009). The average-linkage algorithm was further evaluated as having higher clustering quality than the other common clustering algorithms, and even than greedy heuristic clustering algorithms, including distance-based greedy clustering (DGC) and abundance-based greedy clustering (AGC) (Westcott and Schloss 2015; Ye 2011). In such analyses, an OTU is generally considered to represent a bacterial species (Kuczynski et al. 2012). Therefore, using the average-linkage algorithm for clustering 16S rRNA gene sequences into OTUs has been suggested (Huse et al. 2010; Schloss and Westcott 2011). This suggestion is in line with that of the original authors of numerical taxonomy (see Chap. 7 for details).
The accuracy of clustered OTUs is also determined by the removal of low-quality reads and chimera sequences. It has been shown that strict quality filtering may lead to a loss of information, particularly affecting the least abundant species, whereas clustering low-quality reads may produce spurious OTUs and thus overestimate the complexity of the analyzed samples (Albanese et al. 2015). It has also been shown that amplicon input order can produce inconsistent OTU assignments (Koeppel and Wu 2013; Mahé et al. 2014, 2015b).
Different clustering algorithms and their associated clustering thresholds significantly affect the performance of OTU clustering (Franzén et al. 2015). Since optimal thresholds are actually all higher than 97% (Edgar 2018), algorithms cannot be meaningfully ranked by OTU quality at 97% similarity. For example, when comparing the clustering algorithms single-linkage, complete-linkage, average-linkage, AGC (Ye 2011), and OptiClust (OC) (Westcott and Schloss 2017), it was shown that no algorithm is consistently better than any other (Edgar 2018). In summary, no real consensus has emerged on the choice of clustering algorithm.
The controversies and challenges of clustering approaches arise from the use of different criteria to assess clustering quality or test clustering accuracy, such as simulation, information-theoretic normalized distance measures (Chen et al. 2013b; Baldi et al. 2000; van Rijsbergen 1979), the stability of OTU assignments (He et al. 2015), and the Matthews correlation coefficient (MCC) (Schloss and Westcott 2011; Westcott and Schloss 2015), or a combination of the MCC and the F-score (Preheim et al. 2013; Baldi et al. 2000).
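As an illustration of the MCC criterion, the sketch below scores a toy set of OTU assignments: a sequence pair counts as a true positive if the two sequences are within the distance threshold and were placed in the same OTU. The data structures are illustrative assumptions; OptiClust's actual implementation differs.

```python
from itertools import combinations
from math import sqrt

def mcc_for_otus(otu_of, within_threshold):
    """otu_of: {seq: otu_id}; within_threshold: {frozenset({a, b}): bool}."""
    tp = tn = fp = fn = 0
    for a, b in combinations(otu_of, 2):
        same_otu = otu_of[a] == otu_of[b]
        similar = within_threshold[frozenset({a, b})]
        if same_otu and similar:
            tp += 1
        elif same_otu:
            fp += 1  # clustered together but not actually similar
        elif similar:
            fn += 1  # similar but split into different OTUs
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

otu_of = {"s1": 1, "s2": 1, "s3": 2}
within = {frozenset({"s1", "s2"}): True,
          frozenset({"s1", "s3"}): False,
          frozenset({"s2", "s3"}): True}  # s2-s3 similar but split: one FN
print(round(mcc_for_otus(otu_of, within), 2))  # 0.5
```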
Different criteria often lead to totally different conclusions. For example, many hierarchical and greedy clustering methods are unable to stably assign or produce OTUs; thus, some researchers, such as He et al. (2015), considered that clustering all sequences into one OTU is completely useless for downstream analysis. Using the MCC, Westcott and Schloss (2015) further demonstrated that stable OTUs can still suffer from a lack of quality, because the stability concept focuses on the precision of the assignments and does not reflect the quality of the OTU assignments. This point was actually also raised in numerical taxonomy.
In summary, the commonly used hierarchical and heuristic OTU methods suffer from several fundamental problems: (1) the assumption that sequence similarity implies ecological similarity, (2) arbitrary criteria for defining taxonomic levels (e.g., 3% dissimilarity OTUs used to define species), (3) seed sensitivity, (4) input-order dependency, and (5) dependence on read preprocessing.
8.1.5 Purposes of Using OTUs in Microbiome Study
In microbiome studies, the concept of OTUs was borrowed from numerical taxonomy. OTUs serve a triple purpose: first, to define the different levels of taxonomy; second, to reduce the impact of amplicon sequencing error on ecological diversity estimations and measures of community composition by grouping errors together with the error-free sequence (Eren et al. 2016); and third, repurposed in refined form, to play the functional role of species.
8.1.5.1 Defining Taxonomy
Initially, when the OTU concept was proposed for microbiome sequencing, the purpose was taxonomic grouping, i.e., identifying bacterial population boundaries (Callahan et al. 2017). However, as we reviewed above, and as others such as Eren et al. (2016) and Callahan et al. (2017) have noted, the original purpose of using OTUs has not been achieved by the clustering OTU methods. For example, it is problematic to define OTUs using a 97% sequence similarity threshold in the closed-reference approach, because two sequences that are each 97% similar to the same reference sequence may share less similarity with each other (e.g., only 94%) (de Queiroz and Good 1997; Westcott and Schloss 2015). Thus, the use of 97% OTUs to construct “species” has not been widely accepted. In practice, a 16S rRNA gene sequence similarity threshold in the range of 98.7–99% was recommended for testing the genomic uniqueness of novel isolates (Stackebrandt and Ebers 2006). Similarly, the use of 95% and 90% sequence similarity to define a genus and a family, respectively, and so on, has not been widely accepted.
8.1.5.2 Reducing the Impact of Sequencing Error
The second purpose of the OTU concept and OTU methods is to differentiate true diversity from sequencing errors (Callahan et al. 2017). This purpose is different from, and orthogonal to, the first one. When the OTU concept was initially proposed and used in microbiome studies, OTUs were not expected to reduce the impact of amplicon sequencing error; the task of correcting errors was later repurposed to OTUs (Rosen et al. 2012). In other words, OTUs acquired this error-reducing function because the 97% sequence similarity threshold had successfully reduced the impact of erroneous OTUs on diversity estimations (Eren et al. 2016). After this second role was repurposed to OTUs, clustering reads into OTUs by sequence similarity became a standard approach to filtering noise (Tikhonov et al. 2015).
However, the second purpose has not been achieved either. The use of OTUs for differentiating true diversity from sequencing errors is only appropriate at higher taxonomic levels (e.g., phylum). When OTU methods are used to probe finer-scale diversity at lower taxonomic levels, they have intrinsically high false-positive and false-negative rates: arbitrary cutoff values not only overestimate diversity when errors exceed the OTU-defining cutoff but also cannot resolve real diversity at a finer scale (Rosen et al. 2012). It has also been shown that clustering reads into OTUs greatly underestimates ecological richness (Tikhonov et al. 2015). In other words, the use of OTUs has not reduced the controversy over diversity estimations and measures of community composition.
Taken together, both purposes of using OTU-based methods remain problematic, and the two major challenges remain: one is to identify bacterial population boundaries, and the other is to differentiate true diversity from sequencing errors (Preheim et al. 2013).
The third purpose of using OTUs, playing the functional role of species in microbiome research, is more complicated. To better understand whether this purpose has been achieved, Sect. 8.1.6 provides a theoretical background for discussing definitions of species and species-level analysis.
8.1.6 Defining Species and Species-Level Analysis
Defining species is a core component of defining taxonomy, an important purpose that the 16S rRNA approach is assumed to achieve. In Chap. 7, we reviewed three major species concepts that have been suggested to distinguish species in taxonomic studies: (1) genospecies (a group of organisms exchanging genes), (2) taxospecies (a phenetic cluster), and (3) nomenspecies (organisms bearing the same binomen). We also discussed the difference between a phenetic species and a biological species.
Historically, prokaryote taxonomy was reviewed (Rosselló-Mora and Amann 2001) as having largely focused on the nomenclature of taxa (Heise and Starr 1968; Buchanan 1955; Cowan 1965; Sneath 2015; Trüper 1999) rather than on the practical circumscription of the species concept applied to prokaryotes. Today's prokaryotic species concept results from empirical improvements of what has been thought to be a unit (Rosselló-Mora and Amann 2001). However, the underlying idea of defining a species is not only to formulate a unique unit for analysis but also to determine its functional/genomic properties (i.e., a genotypic/genomic cluster definition) and phenotypic properties (i.e., not only biochemical or physiological properties but also chemotaxonomic markers, such as fatty acid profiles, that are related to the species and its host).
The topics of eukaryotic and prokaryotic species concepts are very complicated and still debated. Many books and review articles have contributed to these topics, and we have no intention of fully reviewing either the eukaryotic or the prokaryotic species concept here. The interested reader is referred to the books (Claridge et al. 1997b; Wheeler and Meier 2000) and articles (Rosselló-Mora and Amann 2001; Sokal and Crovello 1970; Mishler 2000; Hey 2006; Zink and McKitrick 1995; Balakrishnan 2005; Platnick 2000; Mishler 1999). Below we summarize various definitions of species and provide a historical background of species concepts in microbiome study.
8.1.6.1 Eukaryote Species Concepts
The species concept has been discussed among microbiologists and eukaryote taxonomists (Claridge et al. 1997a). The species concepts for prokaryotes and eukaryotes differ (May 1986, 1988); a brief review of eukaryote species concepts will help delineate the species concept for prokaryotes.
Selective eukaryote species concepts

| Species concept | Definition |
|---|---|
| Biological species concept (BSC) | Defines a species as “groups of actually or potentially interbreeding natural populations which are reproductively isolated from other such groups” (Mayr 1940, 1942; Mayr 1963, p. 19). In later publications (Mayr 1969, 2004, 2015), Mayr dropped the much-criticized phrase “potentially interbreeding” from the definition and defined a species as “groups of interbreeding natural populations that are reproductively (genetically) isolated from other such groups” (Mayr 2004) |
| Evolutionary species concept (ESC) | Defines a species as “a lineage (an ancestral-descendant sequence of populations) evolving separately from others and with its own unitary evolutionary role and tendencies” (Simpson 1961); “a single lineage of ancestor-descendant populations which maintains its identity from other such lineages and which has its own evolutionary tendencies and historical fate” (Wiley 1978); or “an entity composed of organisms which maintains its identity from other such entities through time and over space, and which has its own independent evolutionary fate and historical tendencies” (Mayden 1997; Wiley and Mayden 2000) |
| Phenetic or polythetic species concept (PhSC) | Defines the species level as “that at which distinct phenetic clusters can be observed” (Sneath 1976, p. 437) |
| Phylogenetic species concept (PSC) | Defines a species as “the smallest biological entities that are diagnosable and/or monophyletic” (Mayden 1997) or “the smallest diagnosable monophyletic unit with a parental pattern of ancestry and descent” (Rosselló-Mora and Amann 2001) |
8.1.6.2 Prokaryote or Bacterial Species Concepts
1. The 1870s to Early 1950s: Phenotypic Classification (Phenotypic Definition of Species)
The ability to isolate organisms in pure culture began in 1872, when Joseph Schroeter cultivated pure colonies of chromogenic bacteria (Logan 2009). Cultivating microorganisms facilitated bacterial classification based on the phenotypic description of the organisms. However, early bacterial species were often defined based on monothetic groups (a unique set of features). This is a phenotypic definition of species, which relies on a database with accurate morphologic and phenotypic (e.g., biochemical) descriptions of type strains or typical strains, and on comparing the isolate to be identified against that database using standard methods to determine these characteristics for the isolate (Clarridge 2004). The limitation is that these sets of phenotypic properties were subjectively selected (Goodfellow et al. 1997).
Among the more than 20 species concepts, Mayr's BSC is the most widely known and most controversial (Rosselló-Mora and Amann 2001; Sokal 1973). BSC attempts to unify genetics, systematics, and evolutionary biology (Claridge et al. 1997a). However, it lacks practicability, and hence it was generally agreed that Mayr's BSC should be abandoned (Rosselló-Mora and Amann 2001; Sokal 1973). Therefore, BSC cannot presently be applied to prokaryotes (Stackebrandt and Goebel 1994).
2. The Late 1950s to Early 1960s: Phenetic Classification
The development of numerical taxonomy in the late 1950s, in parallel with the application of computers to multivariate analyses, moved bacterial classification toward an objective approach (Sneath 1989). Numerical taxonomy began to be established when modern biochemical analytical techniques advanced to study the distributions of specific chemical constituents (i.e., amino acids, proteins, sugars, and lipids in bacteria) (Rosselló-Mora and Amann 2001; Logan 2009).
3. The Late 1960s: Genomic Classification
4. The Late 1970s to the Mid-1980s: 16S rRNA Classification
In the late 1970s, with the advent of cataloging ribosomal ribonucleic acids (rRNAs) (Stackebrandt et al. 1985), rRNA sequences, and especially 16S rRNA sequences, were shown to be very useful molecular markers for phylogenetic analyses (Ludwig and Schleifer 1994).
Clustering-based OTU methods have been combined with 16S rRNA sequencing. The 16S rRNA sequencing method has been widely used for prokaryotic classification (Olsen et al. 1994), and the datasets generated from 16S rRNA sequencing studies have increasingly been used to propose new bacterial species (Stackebrandt and Goebel 1994).
The determination of relationships between distantly related bacteria saw a remarkable breakthrough in the late 1970s through the cataloging of 16S ribosomal ribonucleic acids (rRNAs) (Stackebrandt et al. 1985) and DNA-RNA hybridization (De Ley and De Smedt 1975), and in the mid-1980s through the full sequence analysis of rRNA (Rosselló-Mora and Amann 2001). In the 1980s, determining the phylogenetic relationships of bacteria and all life forms by comparing a stable part of the genetic code (commonly the 16S rRNA gene) became a new standard for identifying bacteria (Woese et al. 1985; Woese 1987). The 16S rRNA gene owes this role to two important properties:
1. It is highly conserved (Dubnau et al. 1965; Woese 1987). Given its conservation, the 16S rRNA gene marks the evolutionary distance and relatedness of organisms; thus, evolutionary rates in the 16S rRNA gene sequence can be estimated through comparative studies of nucleotide sequences (Kimura 1980; Pace 1997; Thorne et al. 1998; Harmsen 2004). It is currently accepted that 16S rRNA sequence analysis can be used for prokaryotic classification to identify bacteria or to assign close relationships at the genus and species levels (Clarridge 2004).
2. It is universal in bacteria; thus, the relationships among all bacteria can be measured (Woese et al. 1985; Woese 1987). This is also an important property: it has been reviewed that comparison of 16S rRNA gene sequences can not only classify strains at multiple levels, including the species and subspecies levels, but also differentiate organisms at the genus level across all major phyla of bacteria (Clarridge 2004).
Due to these two important properties, bacterial phylogeny (the genealogical tree among the prokaryotes) can be established by comparing 16S rRNA gene sequences (Ludwig et al. 1998). PSC includes the monophyletic species concept (MSC) and the diagnostic species concept (DSC) (Hull 1997); both are defined as phylogenetic (or genealogical) concepts with a minimal time dimension (Rosselló-Mora and Amann 2001).
When DNA was discovered, prokaryote classification was still based solely on phenotypic characteristics. With the development of numerical taxonomy (Sneath and Sokal 1973), individuals were treated as operational taxonomic units that are polythetic (they can be defined only in terms of statistically covarying characteristics), resulting in a more objective circumscription of prokaryotic units. The discovery of genetic information gave a new dimension to the species concept for microorganisms, enabling at least a first rough insight into phylogenetic relationships. Thus, the species concept for prokaryotes evolved into a mostly phenetic or polythetic one, meaning that species are defined by a combination of independent, covarying characters, each of which may also occur outside the given class and thus is not exclusive to the class (Van Regenmortel 1997).
Compared to the DNA-RNA hybridization approach, the 16S rRNA gene sequence approach is much simpler and has thus replaced DNA-RNA hybridization, becoming the new gold standard for defining a species (Fournier et al. 2003; Harmsen 2004). However, as we will describe in Sect. 8.1.6.3, using the 16S rRNA method to define a species remains challenging. Before moving on to the definition of species in the 16S rRNA method, we quote some microbiological definitions of species:
“a microbial species is a concept represented by a group of strains, that contains freshly isolated strains, stock strains maintained in vitro for varying periods of time, and their variants (strains not identical with their parents in all characteristics), which have in common a set or pattern of correlating stable properties that separates the group from other groups of strains (Gordon 1978).”
“a group of strains that show a high degree of overall similarity and differ considerably from related strain groups with respect to many independent characteristics (Colwell et al. 1995)” or
“a collection of strains showing a high degree of overall similarity, compared to other, related groups of strains (Colwell et al. 1995).”
8.1.6.3 16S rRNA Method and Definition of Species
Using 16S rRNA similarities to define bacterial species remains challenging. As early as the 1870s, Ferdinand Cohn investigated whether similarity exists between bacteria and whether bacteria, like animals and plants, can be arranged in distinct taxa (Cohn 1972; Schlegel and Köhler 1999). Cohn considered the form genera natural entities but considered species largely artificial, emphasizing the artificial character of the classification system. The original authors of the OTU method, Sokal and Sneath (1963), likewise criticized the then-current taxonomy for contributing little to understanding the nature and evolution of the higher categories (Sokal and Sneath 1963, p. 5).
When clustering-based OTU methods emerged in 16S rRNA sequencing analysis, an important concept, the phylogenetic species concept (PSC), was developed. PSC compares 16S rRNA similarities with DNA similarity (DNA-DNA similarities), assuming that nucleic acid (DNA) reassociation can determine whether taxa are phylogenetically homogeneous. PSC defines a species as generally including strains with “approximately 70% or greater DNA-DNA relatedness and with 5°C or less ΔTm” (Wayne et al. 1987). Here, Tm is the melting temperature (the thermal denaturation midpoint), and ΔTm is the difference between the homoduplex DNA Tm and the heteroduplex DNA Tm; thus, ΔTm reflects the thermal stability of the DNA duplexes.
However, as reviewed in Sect. 8.1.4.2, the rationale for using DNA reassociation as the gold standard to delineate species originates from the observation that a high degree of correlation exists between DNA similarity and chemotaxonomic, genomic, serological, and numerical phenetic similarity (Stackebrandt and Goebel 1994). The 16S rRNA sequencing method does not have sufficient power to correctly define bacterial species, as reviewed in several articles, including Ash et al. (1991); Amann et al. (1992); Fox et al. (1992); Martinez-Murcia et al. (1992); Stackebrandt and Goebel (1994); and Johnson et al. (2019). The reasons include the following:
1. The 97% sequence similarity used by the 16S rRNA method is based on the assumption that 70% DNA similarity defines a species. Organisms that have 70% or greater DNA similarity will not necessarily have at least 96% DNA sequence identity. The threshold value of 70% does not consider the possibility that the tempo and mode of change differ among prokaryotic strains (Stackebrandt and Goebel 1994). DNA hybridization has significantly higher resolving power than sequence analysis (Amann et al. 1992). Thus, the 70% DNA similarity cannot by itself define a species. Moreover, species having 70% DNA similarity usually have more than 97% sequence identity (Stackebrandt and Goebel 1994); thus, even if 70% DNA similarity could define a species, 97% sequence identity cannot.
2. Because the 16S rRNA has a highly conserved primary structure, there is no linear correlation between the two phylogenetic parameters, percent DNA-DNA similarity and 16S rRNA similarity, for closely related organisms (Stackebrandt and Goebel 1994). Sequence similarities and DNA reassociation values obtained for the same strain pairs do not have a linear relationship (Stackebrandt and Ebers 2006; Stackebrandt and Goebel 1994), and 16S rRNA sequence identity is not sufficient to guarantee species identity (Fox et al. 1992; Rosselló-Mora and Amann 2001; Martinez-Murcia et al. 1992).
3. The 3% (about 45-nucleotide) differences are not evenly distributed over the primary structure of the molecule; instead, they are highly likely to concentrate in the hypervariable regions. Thus, it is impossible to exactly describe the genealogy of the molecule, because false identities are simulated and the actual number of evolutionary events is masked (Grimont 1988; Stackebrandt and Goebel 1994).
4. Technically, targeting 16S variable regions (ranging from single variable regions, such as V4 or V6, to three variable regions, such as V1–V3 or V3–V5) with short-read sequencing platforms (e.g., Illumina platforms producing ≤300-base reads) cannot achieve the taxonomic resolution afforded by sequencing the entire (~1500 bp) gene (Johnson et al. 2019).
5. In particular, the examples described below (Rosselló-Mora and Amann 2001) show that it is difficult to use 16S rRNA similarities to define bacterial species, because (1) different species can have identical (Probst et al. 1998) or nearly identical 16S rRNA sequences (Fox et al. 1992; Martinez-Murcia et al. 1992); (2) within a single species, there can be microheterogeneity of the 16S rRNA genes (Bennasar et al. 1996; Ibrahim et al. 1997); and (3) in exceptional cases, single organisms carrying two or more 16S rRNA genes can show relatively high sequence divergence between them (Mylvaganam and Dennis 1992; Nübel et al. 1996).
In summary, clustering by 16S rRNA sequence similarity can never be solely used to define bacterial species. On the one hand, it has been recognized that DNA hybridization (DNA reassociation) remains the optimal method for measuring the degree of relatedness between highly related organisms and hence cannot be replaced by the 16S rRNA method for defining species (Stackebrandt and Goebel 1994; Rosselló-Mora and Amann 2001). On the other hand, it has been accepted that a prokaryotic species classification should be defined by integrating both phenotypic and genomic parameters, analyzing and comparing as many phenotypic and genomic parameters as possible (Wayne et al. 1987). This approach is known as “polyphasic taxonomy” (Vandamme et al. 1996). Based on this polyphasic approach, in 2001 a more pragmatic prokaryote species concept (called a phylo-phenetic species concept) was described as “a monophyletic and genomically coherent cluster of individual organisms that show a high degree of overall similarity in many independent characteristics, and is diagnosable by a discriminative phenotypic property” (Rosselló-Mora and Amann 2001). Obviously, this is far beyond what the clustering-based OTU method can achieve.
8.1.6.4 16S rRNA Method and Physiological Characteristics
Whether the 16S rRNA method can be used to infer physiological characteristics is debatable. On the one hand, the 16S rRNA approach has the practical advantage for species identification and has been used frequently in bacterial phylogenetics, species delineation, and microbiome studies; for example, it is recommended to include the ribosomal sequence when describing new prokaryotic species (Ludwig 1999; Chakravorty et al. 2007). Some researchers consider comparative analysis of 16S rRNA a very good method for finding a first phylogenetic affiliation of both potentially novel and poorly classified organisms (Goodfellow et al. 1997). Thus, although 16S rRNA sequence analysis is not sufficient for numerically delineating the borders of prokaryotic species, it can indicate the ancestry pattern of the taxon studied and confirm the monophyletic nature of the grouping of its members (Rosselló-Mora and Amann 2001).
On the other hand, other researchers have questioned the validity of 16S rRNA as a marker for phylogenetic inference and argued that it is very difficult to infer the physiological characteristics of an organism by clustering rRNA sequences (Rosselló-Mora and Amann 2001; Cohan 1994; Gupta 1998; Gribaldo et al. 1999). Thus, using an arbitrary threshold of 97% similarity (or 3% dissimilarity) to define OTUs interpreted as proxies for bacterial species is computationally convenient but comes at the expense of accurate ecological inference (Eren et al. 2016), because 3% OTUs are often phylogenetically mixed and inconsistent (Koeppel and Wu 2013; Nguyen et al. 2016; Eren et al. 2014).
Recently, Hassler et al. (2022) found that phylogenies of the 16S rRNA gene and its hypervariable regions lack concordance with core genome phylogenies, concluding that the 16S rRNA gene has poor phylogenetic performance, with far-reaching consequences.
By comparing single-gene phylogenies to phylogenies constructed from core gene concatenations to estimate the strength of signal for the 16S rRNA gene, its hypervariable regions, and all core genes at the intra- and inter-genus levels, Hassler et al. (2022) showed that at both taxonomic levels the 16S rRNA gene is recombinant and subject to horizontal gene transfer, suffers from intragenomic heterogeneity, and accumulates recombination, yielding an unreliable phylogenetic signal. This poor phylogenetic performance not only results in incorrect species/strain delineation and phylogenetic inference but also has the potential to confound community diversity metrics that incorporate phylogenetic information, such as Faith's phylogenetic diversity and UniFrac (thus, their use to measure diversity is not recommended) (Hassler et al. 2022). Additionally, taxonomic abundances, and hence measures of microbiome diversity, may be inflated and confounded by the multiple 16S rRNA gene copies within a genome and the wide range of copy numbers among genomes (De la Cuesta-Zuluaga and Escobar 2016; Louca et al. 2018).
8.2 Moving Toward Single-Nucleotide Resolution-Based OTU Methods
In summary, using OTUs as the atomic unit of analysis has both benefits and controversies. This is often due to the unacknowledged fact that OTUs fail to serve their dual tasks well; in particular, the connection between OTUs and species is largely unfounded (Stackebrandt and Ebers 2006). To overcome the limitations of clustering-based OTU methods in bioinformatic analysis of microbiome sequencing data, reform movements to move beyond OTUs and clustering-based OTU methods have begun in microbiome research. Two strategies with overlapping goals represent these reform movements: moving toward single-nucleotide resolution and moving beyond the OTU-based methods with denoising-based methods. In this section, we describe the first strategy, moving toward single-nucleotide resolution-based OTU methods; in Sect. 8.3, we investigate the second strategy, moving beyond the OTU-based methods with denoising-based methods.
8.2.1 Concept Shifting in Bioinformatic Analysis
In bioinformatic analysis of microbiome data, a conceptual shift regarding analysis units began with dissatisfaction with using similarity/dissimilarity thresholds to define OTUs.
As we reviewed in Sects. 8.1.5 and 8.1.6, OTUs used in microbiome studies have three purposes: (1) to define taxonomic levels, (2) to improve the accuracy of ecological diversity estimations and measures of community composition, and (3) to play the functional role of species. All of these are achieved by defining particular taxonomic levels, borrowing ecological taxonomic concepts into the context of high-throughput marker-gene sequencing of microbial communities, such as classifying microbial species (Rosen et al. 2012); OTUs at 97% sequence identity (or 3% dissimilarity) are used to define the level of "like species." In doing so, we implicitly assume that more similar sequences are more likely to represent phylogenetically similar organisms. Defining OTUs in this way is expected to simplify the complexity of large datasets (Westcott and Schloss 2015), to facilitate taxonomy-independent analyses, such as alpha diversity, beta diversity, and taxonomic composition, and thus to effectively reduce the computational resources required (Human Microbiome Project 2012; He et al. 2015).
However, it has been assessed (Chen et al. 2013b) that using a constant threshold to define OTUs at a specific taxonomic level, e.g., using 3% dissimilarity to define species, is not ideal, because clustering reads into OTUs vastly underestimates ecological richness (Tikhonov et al. 2015) and hence may yield an incorrectly estimated number of OTUs/taxa.
Thus, some researchers began reconsidering the concept of OTUs and clustering-based OTU methods.
Some heuristic-based OTU methods have already tried to address the problem of how to choose an optimal distance threshold to define OTUs representing different taxonomic levels. AGC (abundance-sorted greedy clustering) (Ye 2011) was developed for identifying and quantifying abundant species from pyrosequences of 16S rRNA by consensus alignment, aiming to avoid the inflation of species diversity estimated from error-prone 16S rRNA pyrosequences. UPARSE (Edgar 2013) was reported to generate OTU sequences with ≤1% incorrect bases that are closer to the species in a community, and Swarm's (Mahé et al. 2014) sub-OTU solution is also an effort to refine OTU clustering.
Both model-based and network-based clustering OTU methods have also been developed to address the problem of selecting optimal distance threshold to define OTUs (Wei et al. 2021).
Such model-based clustering OTU methods include CROP (Clustering 16S rRNA for OTU Prediction) (Hao et al. 2011), BEBaC (Cheng et al. 2012), and BC (Jääskinen et al. 2014). Such network-based clustering OTU methods include M-pick (Wang et al. 2013), (Wei and Zhang 2015), and DMclust (Wei et al. 2017).
However, within the framework of OTU methods, all these clustering methods, including hierarchical, heuristic-based, model-based, and network-based OTU methods, retain the concept of OTUs; they differ only in how sequences are clustered into OTUs and in what cutoff values are used to refine OTUs to be closer to species.
As late as 2011, most researchers still believed that OTU-based methods could be improved to quickly and accurately assign sequences to OTUs and that the taxonomic information obtained from those OTUs could greatly improve OTU-based analyses. OTU-based methods were considered robust and able to overcome many of the challenges encountered with phylotype-based methods (Schloss and Westcott 2011).
The newly developed methods described below are motivated by several considerations. The first is to overcome the weakness of the classical OTU approach based on an arbitrary sequence identity threshold, because 97% similarity is not accurate and cannot approximate biological species.
The second is to overcome amplicon sequencing errors. The classical OTU approach assumed that clustering could reduce problems caused by erroneous sequences, but it cannot solve the problem of erroneously clustered OTUs. It also reduces phylogenetic resolution, because sequences below the identity threshold cannot be differentiated (Amir et al. 2017).
The third is to facilitate merging datasets generated by bioinformatic tools. De novo OTUs were assessed as being difficult to merge across datasets; although this problem can be reduced by closed-reference and open-reference OTU picking (Rideout et al. 2014), it remains a challenge to integrate large datasets into a single OTU space (Amir et al. 2017).
The last two motivations are associated with downstream statistical analysis. To avoid these weaknesses, the new approaches focus on sub-OTU methods or algorithms that detect ecological differences down to a single base pair and try to show that the resulting subunits are ecologically and functionally meaningful, while expecting the sub-OTUs to be feasible units for analysis. This new direction in bioinformatic analysis of microbial communities has been challenging the primacy of OTUs and OTU-based methods. It aims to define atomic analysis units of microbiome data and to search for algorithms that can provide single-nucleotide resolution without relying on arbitrary percent similarity thresholds of OTUs (Eren et al. 2016) or on refining the OTUs.
8.2.2 Single-Nucleotide Resolution Clustering-Based OTU Methods
Several computational methods or algorithms have been developed to improve the resolution of 16S data analysis beyond the threshold of 3% dissimilarity OTUs. We introduce the two most important algorithms below: distribution-based clustering (Sect. 8.2.2.1) and Swarm2 (Sect. 8.2.2.2).
8.2.2.1 Distribution-Based Clustering (DBC)
In 2013, Preheim et al. (2013) proposed DBC to refine OTUs toward single-nucleotide resolution without relying on sequence data alone to serve the dual purpose of OTUs: identifying taxonomic groups and eliminating sequencing errors. Instead, the DBC method compares the distribution of sequences across samples. In other words, it uses ecological information (the distribution of abundance across multiple biological samples) to supplement sequence information.
The DBC algorithm is implemented within the framework of clustering-based OTU methods. It first uses Jukes-Cantor-corrected genetic distances or Jensen-Shannon divergence (JSD), as appropriate, to measure the genetic distances among sequences, and then implements either a complete algorithm (all sequences are analyzed together), a parallel algorithm (sequences are pre-clustered with a heuristic approach), or other algorithms (e.g., nearest-neighbor single-linkage clustering). DBC uses the chi-squared test to determine whether two sequences have similar distributions across libraries. It then forms OTUs based on ecological distribution, so that 16S rRNA sequences that differ by only one base but are found in different samples are placed into different OTUs. Conversely, sequences drawn from the same underlying distribution across samples are grouped into the same OTU regardless of interoperon variation, sequence variation, or random sequencing errors. The underlying assumption is that bacteria in the same population are often highly correlated in their abundance across different samples (Preheim et al. 2013). Thus, 16S rRNA sequences derived from the same population, regardless of their variations or random sequencing errors, will have the same underlying distribution across sampled environments.
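To make the core test concrete, here is a minimal Python sketch of the kind of cross-sample comparison DBC performs: a chi-squared test of whether two sequences' per-sample count distributions are consistent with a common source. The function name, the significance level, and the toy counts are illustrative assumptions, not part of the published implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def similar_distribution(counts_a, counts_b, alpha=0.001):
    """Test whether two sequences share the same distribution across samples.

    counts_a, counts_b: per-sample read counts for the two sequences.
    Returns True when the chi-squared test cannot reject the null hypothesis
    that both were drawn from the same underlying distribution, i.e., the
    sequences are candidates for merging into one OTU.
    """
    table = np.array([counts_a, counts_b])
    # Drop samples where both sequences are absent to keep the test valid.
    table = table[:, table.sum(axis=0) > 0]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha

# Two sequences abundant in the same samples -> candidates for one OTU
print(similar_distribution([120, 5, 80, 0], [60, 2, 35, 1]))
# Two sequences found in different samples -> kept as separate OTUs
print(similar_distribution([120, 0, 80, 0], [0, 60, 1, 45]))
```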
DBC has the following advantages:
- 1.
The DBC algorithm uses both genetic distance and the distribution of relative abundances of sequences across samples to identify the appropriate grouping for each taxonomic lineage and to detect many methodological errors. Thus, it may avoid the limitation of methods that rely on genetic information alone, which necessarily either over-cluster or under-cluster sequences.
- 2.
The DBC method is more accurate and sensitive in identifying true input sequences and in clustering sequencing and methodological errors, where accuracy is measured by the F-score and the Matthews correlation coefficient (MCC) (Baldi et al. 2000). The DBC method is more accurate (more correct OTUs, fewer spurious/incorrect OTUs) at grouping reads into OTUs than other OTU methods, including closed-reference (i.e., phylotyping), de novo clustering, open-reference (i.e., a hybrid of phylotyping and de novo clustering), and UCLUST. It is also sensitive enough to differentiate between OTUs that differ by a single base pair and show evidence of differing ecological roles. In other words, DBC predicts the most accurate OTUs when sequences are distributed in an ecologically meaningful way across samples. It represents the input sequences more accurately based on the total number of OTUs and predicts fewer overall OTUs than the de novo and open-reference methods. Because detecting the differences between closely related organisms is crucial (Shapiro et al. 2012), DBC can be chosen to distinguish signal from the noise of sequencing errors and to form accurate OTUs for identifying evolutionary and ecological mechanisms (Preheim et al. 2013).
However, DBC also has limitations:
- 1.
DBC is unreliable for cross-sample comparisons of low-count sequences (Tikhonov et al. 2015).
- 2.
DBC currently requires a very long run time, prohibitively long even for moderately sized datasets. This is a severe limitation of DBC (Preheim et al. 2013) and has been criticized (Tikhonov et al. 2015).
In summary, DBC is a clustering OTU method, or a refined OTU method. What distinguishes it from other OTU methods is that it uses ecology, modeled across multiple samples, to refine the OTUs. As a refined OTU method, DBC is still within the traditional framework of OTU methods.
8.2.2.2 Swarm2
The de novo amplicon clustering methods share two fundamental problems (reviewed in Sect. 6.4). One is the use of arbitrary fixed global clustering thresholds; the other is the input-order dependency induced by centroid selection, i.e., the clustering results are strongly influenced by the input order of amplicons. To solve these two problems, in 2014 Mahé et al. (2014) proposed Swarm, an amplicon clustering algorithm that builds fine-scale OTUs without relying on arbitrary global clustering thresholds and without input-order dependency.
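The following minimal Python sketch illustrates the local-linking idea behind Swarm under simplifying assumptions (equal-length amplicons compared by Hamming distance, whereas Swarm itself handles variable-length amplicons): an OTU grows outward from a seed by repeatedly absorbing amplicons within a small local distance d of any current member, so no global similarity cutoff is applied and the result does not depend on input order. The function name and toy reads are illustrative.

```python
def swarm_clusters(amplicons, d=1):
    """Grow OTUs by iterative single-linkage with a small local threshold d
    (Swarm's default d = 1), instead of a global similarity cutoff."""
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    pool = set(amplicons)
    clusters = []
    while pool:
        seed = pool.pop()
        cluster, frontier = {seed}, {seed}
        while frontier:
            # Absorb every remaining amplicon within d of the current frontier.
            linked = {a for a in pool for f in frontier if hamming(a, f) <= d}
            pool -= linked
            cluster |= linked
            frontier = linked
        clusters.append(cluster)
    return clusters

# AAAA-AAAT-AATT chain into one OTU via local links; GGGG forms its own.
print(swarm_clusters(["AAAA", "AAAT", "AATT", "GGGG"]))
```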
Swarm has the following advantages:
- 1.
Swarm was expected to reduce the influence of clustering parameters and to produce robust OTUs (Mahé et al. 2014).
- 2.
Swarm2 (Mahé et al. 2015b) directly integrates the clustering and breaking phases and hence improves performance in terms of computation time, and it reduces under-grouping by grafting low-abundance OTUs (e.g., singletons and doubletons) onto larger ones. These improvements were achieved by raising Swarm's scalability to linear complexity and adding the fastidious option (Mahé et al. 2015b).
However, Swarm has a disadvantage: it cannot resolve sequences at the single-base level (Hathaway et al. 2017).
Like distribution-based clustering (Sect. 8.2.2.1), Swarm and Swarm2 aim to provide single-nucleotide resolution (Mahé et al. 2014, 2015b). Other bioinformatic software that aims at single-nucleotide resolution includes oligotyping (Sect. 8.3.1) and the denoising-based methods (Sect. 8.3.2) such as cluster-free filtering (Sect. 8.3.2.3), DADA2 (Sect. 8.3.2.4), UNOISE2 (Sect. 8.3.2.5), Deblur (Sect. 8.3.2.6), and SeekDeep (Sect. 8.3.2.7).
8.3 Moving Beyond the OTU Methods
We can use the following four characteristics to describe the beyond-OTU methods: (1) they avoid the concept of OTUs, especially the reliance on 97% sequence identity to approximate species; (2) they avoid clustering methods; (3) they focus on single-nucleotide (single base pair) resolution to provide more accurate estimates of diversity; and (4) they are expected to be more clinically or functionally relevant. We use these four criteria, especially the first three, to assess whether the newly developed methods are moving beyond the traditional OTU methods.
8.3.1 Entropy-Based Methods: Oligotyping
In molecular biology, the term oligotyping covers a dual function of DNA sequence analysis, taxonomy and sequencing; that is, the oligotyping approach uses primary DNA sequence data to identify the taxonomy of organisms (taxonomy) while improving the accuracy of DNA sequencing (sequencing).
In 2013, Eren et al. (2013) described a supervised computational method called oligotyping for the analysis of 16S rRNA gene data and demonstrated that oligotyping is able to differentiate closely related microbial taxa. Oligotyping decomposes marker-gene amplicons in a different way from OTU clustering. The basic theory underlying the oligotyping workflow is minimum entropy decomposition (MED) (Eren et al. 2014), the unsupervised form of oligotyping, which relies on Shannon entropy (Shannon 1948), a measure of information uncertainty (Jost 2006). MED is an algorithm for fine-scale resolution of Illumina amplicon data.
The oligotyping workflow is implemented via the following three procedures: First, MED identifies variable nucleotide positions across the entire sequencing dataset based on subtle nucleotide variation (Ramette and Buttigieg 2014). Second, MED uses only the significantly variable positions to partition reads into oligotypes ("MED nodes," representing homogeneous OTUs), i.e., it sensitively partitions the high-throughput marker-gene sequences (Eren et al. 2014). Third, MED produces a sample-by-OTU table for downstream analyses (Buttigieg and Ramette 2014).
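As a minimal sketch of the entropy criterion at the heart of MED, the Python fragment below computes the Shannon entropy of each alignment column and selects the most variable position, by which reads would be partitioned into child nodes. The function names and the four toy reads are illustrative assumptions rather than the published implementation.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (bits) of one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def most_variable_position(reads):
    """Return the alignment position with maximum entropy; MED splits
    reads into child nodes by the base observed at this position."""
    entropies = [shannon_entropy(col) for col in zip(*reads)]
    return max(range(len(entropies)), key=entropies.__getitem__)

reads = ["ACGT", "ACGA", "ACGT", "ACTA"]
pos = most_variable_position(reads)
print(pos, {r[pos] for r in reads})  # partition reads by the base at this position
```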
Oligotyping and MED have the following advantages:
- 1.
Could resolve closely related but distinct taxa that differ by as little as one nucleotide at the sequenced region.
- 2.
Could identify nucleotide positions that likely carry phylogenetically important signal and enable a finer representation of microbial diversity in a wide range of ecosystems without relying on extensive computational heuristics or user supervision, and could also improve the ecological signal for downstream analyses.
- 3.
Particularly, it was shown (Hathaway et al. 2017) that, like SeekDeep (Sect. 8.3.2.7), MED was able to achieve 100% haplotype recovery on 454 and Ion Torrent pyrosequencing reads as well as on Illumina MiSeq reads.
However, it was also shown (Hathaway et al. 2017) that the MED algorithm appears to run into trouble as a haplotype's abundance increases.
8.3.2 Denoising-Based Methods
Typically, many of the sequences generated by any sequencing platform contain at least one error (Amir et al. 2017). Once microbiome researchers recognized that sequences contain sequencing errors, the use of OTU methods changed (Gaspar 2018) to deal with these errors.
Sequencing errors are extremely important confounding factors in detecting low-frequency genetic variants (Ma et al. 2019) and thus obscure the underlying biology through inaccurate taxon identification and inflated diversity statistics (Amir et al. 2017). Although these errors seldom affect statistical hypothesis testing of differences between two communities, they can undermine the higher precision required clinically (Amir et al. 2017). Removing or reducing sequencing errors will reduce false-negative results and improve the accuracy of descriptions of the microbial community. Thus, one direction of methodological development in bioinformatic analysis of microbiome data is to handle sequencing errors (Ma et al. 2019).
One way to deal with sequencing errors is to denoise them. Denoising-based or error-correction methods exploit the predictable structure of certain error types to attempt to reassign or eliminate noisy reads (Huse et al. 2010; Quince et al. 2011; Rosen et al. 2012). Most denoising algorithms aim to assign erroneous reads to their most likely source, to make the abundance estimates of true sequences more accurate (Tikhonov et al. 2015).
Several types of error are introduced by PCR amplification followed by sequencing (Edgar 2016b), of which sequencing error, PCR single-base substitutions, and PCR chimeras are the three most important sources (Quince et al. 2011): (1) point errors, i.e., substitution errors caused by incorrect base pairing and gap errors caused by polymerase slippage (Turnbaugh et al. 2010); (2) PCR chimeras, caused by an incomplete amplicon priming extension on a different biological template (Haas et al. 2011); and (3) spurious species, due to contaminants from reagents and other sources (Edgar 2013) as well as reads assigned to incorrect samples due to cross talk (Carlsen et al. 2012).
Thus, to infer accurate biological template sequences from noisy reads, denoising-based or error-correction methods typically take two steps (Edgar 2016b): first, denoising or correcting point errors to obtain an accurate set of amplicon sequences, and then filtering the chimeric amplicons. Among these methods, DADA2, UNOISE3, and Deblur are the three most widely used denoising packages. We introduce each of them below.
8.3.2.1 Denoising-Based Methods Versus Clustering-Based Methods
One benchmarking study that compared MAPseq, mothur, QIIME, and QIIME 2 for 16S rRNA gene profiling reported the following findings:
- 1.
QIIME 2 outperformed MAPseq, mothur, and QIIME in terms of overall recall and F-scores at both genus and family levels, and had the lowest distance estimates between the observed and simulated (predicted) samples, but at the expense of CPU time and memory usage. In contrast, MAPseq showed the highest precision, with miscall rates consistently <2%.
- 2.
Compared to Greengenes, the SILVA database generally yielded higher recall.
- 3.
The performance of taxonomic assignment for each tool varied considerably, by up to 40%, depending on the 16S rRNA subregion targeted. Overall, the V1–V2 and V3–V4 subregions performed best across most of the software tools. However, the V1–V2 subregions were not recommended for classification of complex community samples, because the V1–V2 primers did not match almost 70% of the sequences across the four reference databases.
- 4.
This study concluded that use of either QIIME 2 or MAPseq is optimal for 16S rRNA gene profiling. In this study, the recall rate (sensitivity) was estimated as the percentage of sequences assigned to the expected taxa for each biome, while precision (specificity) was calculated as the fraction of sequences from these predicted taxa out of all those from the taxa observed. The distances were estimated with either the Bray-Curtis or Jaccard dissimilarity indices at the genus level. The F-score was calculated as $F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$, where $\text{precision} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$, with $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives.
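As a quick worked check of this metric, here is a minimal Python computation of precision, recall, and F-score from hypothetical assignment counts (the counts are invented for illustration):

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from assignment counts."""
    precision = tp / (tp + fp)   # fraction of predicted taxa that are correct
    recall = tp / (tp + fn)      # fraction of expected taxa that are recovered
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 90 correctly assigned, 5 spurious, 10 missed
print(round(f_score(tp=90, fp=5, fn=10), 3))  # 0.923
```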
A second study (Nearing et al. 2018), which compared the three denoising pipelines (DADA2, UNOISE3, and Deblur) with open-reference OTU clustering on mock communities and real datasets, reported the following findings:
- 1.
The mock community analyses showed that these three denoising pipelines generate very different numbers of ASVs that significantly impact alpha diversity metrics although they produced similar microbial compositions based on relative abundance.
- 2.
The real analyses showed that the three packages were consistent in their per-sample compositions, resulting in only minor differences of the intra-sample distances based on weighted UniFrac and Bray-Curtis dissimilarity.
- 3.
Overall, open-reference OTU clustering approach consistently identified considerably more OTUs than the number of ASVs generated by the denoising pipelines in all datasets tested.
- 4.
It was shown that DADA2 tended to find more ASVs than UNOISE3 and Deblur (Deblur called the fewest ASVs), suggesting that it could be better at finding rare organisms, but at the expense of possible false positives.
- 5.
It also showed that UNOISE3 is much faster than DADA2 and Deblur, with DADA2 being the slowest, roughly 1200 times slower than UNOISE3.
- 6.
This study concluded that all pipelines produce a similar general community structure; however, the number of ASVs/OTUs and the resulting alpha diversity metrics vary considerably. Thus, when attempting to identify rare organisms against possible background noise, we should keep in mind that determining species richness within low-diversity samples can be problematic for the denoising pipelines (Nearing et al. 2018).
A third study (Prodan et al. 2020), which compared ASV-level pipelines (DADA2, USEARCH-UNOISE3, and Qiime2-Deblur) with OTU-level pipelines (USEARCH-UPARSE, mothur, and QIIME-uclust), reported the following findings:
- 1.
DADA2 offered the best sensitivity but at the expense of decreased specificity compared to USEARCH-UNOISE3 and Qiime2-Deblur.
- 2.
USEARCH-UNOISE3 had the best balance between resolution and specificity (Prodan et al. 2020).
- 3.
Compared to the ASV-level software, the OTU-level software USEARCH-UPARSE and mothur performed well, though with lower specificity, whereas QIIME-uclust produced a large number of spurious OTUs as well as inflated alpha diversity measures.
8.3.2.2 Pyrosequencing Flowgrams
PyroNoise (Quince et al. 2009, 2011). The PyroNoise algorithm is the first amplicon sequencing error-correction method; it was designed to cluster the flowgrams of 454 pyrosequencing reads using a distance measure that models sequencing noise. PyroNoise is used to accurately construct OTUs from 16S rRNA sequence data and to assign pyrosequenced reads to known taxa (Quince et al. 2009). Because it works on flowgrams rather than base calls, it may be more robust to noise. However, PyroNoise is computationally expensive and difficult for most users to implement and thus has seen limited application (Schloss et al. 2011).
DeNoiser (Reeder and Knight 2010) is another pyrosequencing denoising algorithm, proposed to denoise pyrosequencing amplicon reads by exploiting rank-abundance distributions. Like PyroNoise, DeNoiser has been reviewed as computationally expensive and difficult for most users, and it has seen limited application (Schloss et al. 2011).
SLP (single-linkage pre-clustering) (Huse et al. 2010) was proposed to address the problem that complete-linkage clustering significantly increases the number of predicted OTUs and inflates richness estimates. SLP implements its algorithm in three steps: First, it orders the unique sequences by frequency. Second, from the most abundant to the least abundant sequence, it assigns each to a cluster by testing it against the growing list of clusters using the single-linkage algorithm. Third, it adds the sequence to a cluster, and stops testing against subsequent clusters, if the sequence has a pairwise distance of less than 0.02 (2%) to any of the sequences already in that cluster. It was shown that SLP can predict expected OTUs more accurately if it is followed by average-linkage clustering based on pairwise alignments (Huse et al. 2010), and it is not computationally expensive (Schloss et al. 2011).
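A minimal Python sketch of this greedy single-linkage pre-clustering follows; the function name, the toy distance, and the example reads are illustrative assumptions (real implementations compute distances from pairwise alignments).

```python
def slp_precluster(seqs_by_freq, dist, threshold=0.02):
    """Greedy single-linkage pre-clustering (a sketch of SLP's core loop).

    seqs_by_freq: unique sequences ordered from most to least abundant.
    dist: pairwise distance function returning a fraction of differences.
    A sequence joins the first cluster containing any member closer than
    the threshold; otherwise it seeds a new cluster.
    """
    clusters = []
    for seq in seqs_by_freq:
        for cluster in clusters:
            if any(dist(seq, member) < threshold for member in cluster):
                cluster.append(seq)
                break
        else:  # no cluster was close enough
            clusters.append([seq])
    return clusters

# Toy distance: fraction of mismatches between equal-length reads
dist = lambda a, b: sum(x != y for x, y in zip(a, b)) / len(a)
reads = ["A" * 100, "A" * 99 + "T", "G" * 100]  # second read is 1% from first
print([len(c) for c in slp_precluster(reads, dist)])  # [2, 1]
```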
AmpliconNoise (Quince et al. 2011), a development of the PyroNoise algorithm, was proposed to remove noise from pyrosequenced amplicons and thus avoid inflated OTU-based diversity estimates. It uses an error-model-based denoising algorithm to distinguish noise from true sequence diversity in the data and uses expectation-maximization (EM) to infer the true sequences. It was shown that AmpliconNoise can separately remove 454 sequencing errors and PCR single-base errors to obtain accurate estimates of OTU numbers (Quince et al. 2011).
APDP (Amplicon Pyrosequencing Denoising Program) (Morgan et al. 2013) was proposed to process raw sequence datasets into a set of validated sequences in compatible formats to facilitate downstream analyses. APDP uses both the between-sample abundance distribution of similar sequences and the within-sample frequency and diversity of sequences to distinguish real sequences from errors generated by PCR and sequencing. It was shown that APDP can effectively remove errors from deeply sequenced datasets comprising biological and technical replicates, and can efficiently denoise single-sample datasets, thus providing more conservative and accurate biological diversity estimates (Morgan et al. 2013).
Other 454 sequence programs for correcting pyrosequencing errors include Acacia (Bragg et al. 2012) and HECTOR (Wirawan et al. 2014).
Compared to denoisers of pyrosequencing flowgrams, Illumina denoisers have been developed more recently. Some Illumina methods, such as cluster-free filtering (CFF) and SeekDeep, can also be used with the 454 sequencing platform. Below we introduce CFF in Sect. 8.3.2.3, DADA2 in Sect. 8.3.2.4, UNOISE2 in Sect. 8.3.2.5, Deblur in Sect. 8.3.2.6, and SeekDeep in Sect. 8.3.2.7.
8.3.2.3 Cluster-Free Filtering (CFF)
In 2015, Tikhonov et al. (2015) proposed cluster-free filtering (CFF), a denoising approach that achieves sub-OTU resolution, resolving ecologically distinct sequences differing by as little as one nucleotide (nt) (99.2% similarity). The CFF denoiser was proposed in the following contexts: On the one hand, to improve estimates of ecological diversity by reducing the impact of amplicon sequencing error, researchers have tried various denoising algorithms to reassign or eliminate noisy reads, yet the challenge remains because no approach fully addresses the denoising issues. On the other hand, the alternative DBC approach is limited by its computational cost and is unreliable for low-count sequences in cross-sample comparisons. Against this background, the focus of the CFF denoiser is neither to further improve the existing OTU clustering approaches nor to attempt to identify rare species. Instead, the CFF denoiser combines error-model-based denoising with systematic cross-sample comparisons to resolve the fine (sub-OTU) structure of moderate-to-high-abundance taxa (Tikhonov et al. 2015).
CFF employs cross-sample comparisons to achieve sub-OTU resolution. CFF was developed from the observation that, in time-series data, sequence similarity need not imply dynamical similarity, and vice versa; rather, the maximum achievable correlation between the time traces (normalized counts versus observation day) of two sequences depends on their abundances under a Poisson model. Thus, the CFF denoiser (Tikhonov et al. 2015) first defines the "dynamical similarity" of two traces as the Pearson correlation of their abundances, normalized by their maximum possible correlation. It then models the abundance traces of the two sequences with additive Poisson counting noise, which sets an upper bound on the correlation coefficient, since Poisson sampling noise becomes non-negligible for low-abundance sequences. Finally, it uses the Hamming distance metric to measure pairwise sequence distances based on normalized counts.
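The Python sketch below illustrates the idea of normalizing a Pearson correlation by the maximum correlation attainable under Poisson counting noise; the attenuation formula used here is a simple signal-to-noise approximation assumed for illustration, not the exact estimator of Tikhonov et al. (2015), and the function names and toy traces are likewise hypothetical.

```python
import numpy as np

def dynamical_similarity(trace_a, trace_b):
    """Pearson correlation of two abundance traces, normalized by the
    maximum correlation achievable under additive Poisson counting noise.

    Poisson noise has variance equal to the mean count, so each trace's
    observable correlation is attenuated by roughly
    sqrt(var / (var + mean)); the product of the two attenuations bounds
    the correlation that two noiselessly identical dynamics could show.
    """
    trace_a, trace_b = np.asarray(trace_a, float), np.asarray(trace_b, float)
    r = np.corrcoef(trace_a, trace_b)[0, 1]
    atten = lambda x: np.sqrt(np.var(x) / (np.var(x) + np.mean(x)))
    return r / (atten(trace_a) * atten(trace_b))

rng = np.random.default_rng(1)
true_counts = np.array([500, 900, 300, 1200, 700])
noisy_low = rng.poisson(true_counts / 100)  # same dynamics, 100x fewer reads
# Raw correlation is dragged down by counting noise; normalization compensates.
print(np.corrcoef(true_counts, noisy_low)[0, 1])
print(dynamical_similarity(true_counts, noisy_low))
```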
Three important characteristics mark CFF as a method moving beyond the traditional OTU methods: (1) it aims to distinguish 16S sequences that differ by one nucleotide (a single base pair), similar to the oligotyping and distribution-based clustering approaches; (2) it does not rely on clustering similar sequences together, similar to oligotyping; and (3) it does not assume that sequence similarity implies ecological similarity. On the contrary, CFF considers sequence similarity a very poor predictor of ecological similarity (Tikhonov et al. 2015). CFF therefore uses highly conservative filtering criteria, designed so that the filtered output contains only true biological sequences, that is, no false positives. In this respect, CFF is similar to Sokal and Sneath's numerical taxonomy (Sokal and Sneath 1963).
CFF has the following advantages:
- 1.
CFF demonstrated that "sequence similarity need not imply ecological similarity, and vice versa," and that ecological relatedness can be assessed independently. This differs from tag-sequencing data analysis and clustering-based OTU methods in microbiome studies, which typically assume that the sequence similarity of 16S hypervariable regions can be used as a proxy for phylogenetic, and therefore ecological, relatedness (Tikhonov et al. 2015). This idea of CFF echoes the important idea of separating taxonomic procedure from phylogenetic speculation in the original OTU methods (Sokal and Sneath 1963).
- 2.
CFF can achieve a single-nucleotide (nt) sub-OTU resolution (Tikhonov et al. 2015).
- 3.
Particularly, CFF can achieve the same accuracy as DADA, which was considered the best denoiser in 2015 (Tikhonov et al. 2015), with less execution time than DADA. However, we can discount this speed advantage: the longer time used by DADA is largely due to its exact treatment of probabilities when processing low-abundance sequences, whereas the CFF denoiser discards low-abundance sequences prior to denoising. In other words, the speed of CFF comes at the cost of its limitation.
However, CFF also has limitations: (1) it is specifically designed to be run on large multi-sample datasets, and (2) it is not a replacement for OTU clustering; it focuses on moderate-to-high-abundance sequences and hence discards low-abundance sequences. Thus, CFF is unsuitable for studying population-level alpha or beta diversity (Tikhonov et al. 2015), which is probably an important limitation of this method.
8.3.2.4 DADA2
DADA2 and the q2-dada2 plugin were introduced in Sect. 4.2.1. In line with providing single-nucleotide resolution without using arbitrary taxonomic similarity thresholds to define molecular OTUs, Callahan et al. (2017) proposed replacing OTUs with exact "sequence variants" in marker-gene data analysis, and the software package DADA2 (DADA: Divisive Amplicon Denoising Algorithm) was developed for modeling and correcting Illumina-sequenced amplicon errors. The proposal to use ASVs instead of traditional OTUs was driven by the limitations of OTU methods, such as the facts that clustering sequence reads into OTUs often eliminates biological information present in the data and that OTUs are not species and their construction is not necessitated by amplicon errors (Callahan et al. 2016a); it was especially motivated by a series of new computational methods (Eren et al. 2013, 2016; Tikhonov et al. 2015; Edgar 2016b; Amir et al. 2017).
DADA2 is a parametric model that relies on input read abundances and distances. It assumes that true reads are likely to be more abundant, and that less abundant reads are likely error-derived and may lie only a few base differences away from a more abundant sequence (Prodan et al. 2020; Callahan et al. 2016a). DADA2 was developed from DADA (Rosen et al. 2012), which was built on AmpliconNoise's error-modeling approach (Quince et al. 2011). The distinctive feature of DADA is its divisive hierarchical clustering algorithm, whereas previous methods, including AmpliconNoise and simple OTU clustering, used an agglomerative approach.
The core denoising algorithm (the Divisive Amplicon Denoising Algorithm) in DADA2 is based on a Poisson model for the errors in Illumina-sequenced amplicon reads. The error model gives the probability that a given amplicon read was produced from a given sample sequence, as a function of sequence composition and quality. Under the Poisson model, the null hypothesis that the number of amplicon reads (the abundance) of a sequence is consistent with the error model is tested, and its p-value is calculated. The calculated p-values, together with base quality scores (Q-scores), serve as the division criteria for an iterative partitioning algorithm: sequencing reads continue to be divided until all partitions are consistent with having been produced from their central sequence.
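The sketch below illustrates an abundance p-value of this kind in Python. The conditioning on at least one observed read follows the published description of the divisive algorithm, while the function name and the toy expected-error rate (lambda) are illustrative assumptions; DADA2's actual model derives lambda from per-base quality scores and nucleotide transition probabilities.

```python
from scipy.stats import poisson

def abundance_p_value(n_reads, expected_errors):
    """Abundance p-value in the spirit of DADA2's partitioning test.

    expected_errors: Poisson rate lambda, the number of reads of this exact
    sequence expected to arise as errors from the partition's center.
    Returns P(X >= n_reads | X >= 1) under X ~ Poisson(lambda): a small
    value means the sequence is too abundant to be explained by errors
    alone, so it would seed a new partition.
    """
    p_ge_n = poisson.sf(n_reads - 1, expected_errors)  # P(X >= n_reads)
    p_ge_1 = poisson.sf(0, expected_errors)            # P(X >= 1)
    return p_ge_n / p_ge_1

print(abundance_p_value(n_reads=2, expected_errors=0.5))   # plausibly an error
print(abundance_p_value(n_reads=50, expected_errors=0.5))  # seeds a new partition
```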
DADA2 has the following advantages:
- 1.
DADA2 builds unique error models for each sequencing run (Nearing et al. 2018; Callahan et al. 2016b).
- 2.
DADA2 exactly infers sample sequences and resolves differences of as little as one nucleotide (Callahan et al. 2016b).
- 3.
The DADA method was shown to outperform AmpliconNoise and other previous methods in both speed and accuracy (Rosen et al. 2012).
- 4.
The DADA2 R package can implement the full amplicon workflow, from filtering, dereplication, sample inference, and chimera identification to merging paired-end reads (Callahan et al. 2016a). The inference of ASVs (also called RSVs: ribosomal sequence variants) was demonstrated in an amplicon bioinformatic workflow (Callahan et al. 2016a, b, 2017).
- 5.
Particularly, it was demonstrated that the DADA2 method was more accurate than four OTU methods: UPARSE (Edgar 2013), MED (minimum entropy decomposition) (Eren et al. 2015), mothur (average linkage) (Schloss et al. 2009), and QIIME (uclust) (Caporaso et al. 2010).
However, DADA2 also has limitations:
- 1.
As reviewed above, it was shown (Nearing et al. 2018; Prodan et al. 2020) that DADA2 had the best sensitivity and resolution, but at the expense of producing a larger number of spurious ASVs compared to USEARCH-UNOISE3 and Qiime2-Deblur.
- 2.
It was also shown (Hathaway et al. 2017) that, like MED and UNOISE, DADA2 can create more false haplotypes, especially for the 454 technology. In DADA2, when the number of false haplotypes is minimized, sensitivity is also lost, especially for lower-read-depth input, where low-abundance one-off haplotypes are missed (Hathaway et al. 2017). For Ion Torrent and 454 pyrosequencing data, neither UNOISE nor DADA2 achieves 100% haplotype recovery, in contrast to SeekDeep and MED (Hathaway et al. 2017).
- 3.
It was also shown (Tikhonov et al. 2015) that DADA has a long execution time, largely due to its exact treatment of probabilities; however, this exact treatment is critical for processing sequences with an abundance of just a few counts.
Overall, DADA2 has so far been recognized as the best denoising-based method when the focus is on the highest possible biological resolution (e.g., differentiating closely related strains) (Prodan et al. 2020).
8.3.2.5 UNOISE2 and UNOISE3
To improve error correction in Illumina 16S and ITS amplicon sequencing by distinguishing sequence errors caused by PCR and sequencing from true biological variation, Edgar (2016b) proposed UNOISE2, an updated version of the UNOISE algorithm (Edgar and Flyvbjerg 2015) for denoising (error-correcting) Illumina amplicon reads.
UNOISE2 uses a one-pass clustering strategy that does not use quality (Q) scores. This clustering strategy differs from DADA2's model-based approach, which uses quality scores in an iterative divisive partitioning strategy. In UNOISE2, the algorithms of both UCHIME2 (Edgar 2016a) and DADA2 are incorporated to reduce the number of incorrect sequences. However, both UNOISE2 and DADA2 were reviewed as still generating some false positives, with DADA2 likely to increase the number of false-positive chimera predictions (Edgar 2016b). In UNOISE2, the predicted biological sequences obtained are called ZOTUs (zero-radius OTUs) (Edgar 2016b, 2018). The ZOTU is similar to Callahan et al.'s ASV. Just as in DADA2 the 97% OTUs are replaced by ASVs as the standard unit of marker-gene analysis and reporting, in UNOISE2 the 97% OTUs are similarly replaced by ZOTUs (Edgar 2018).
The UNOISE3 denoising method first uses the "unoise3" command to rank sequences in decreasing order of abundance and then discards sequences with counts below a specified minimum abundance threshold (the default value is 8). Using a 100% identity threshold, each distinct sequence then defines a separate OTU. Both OTU-level clustering with UPARSE and ASV-level denoising with UNOISE3 are implemented in USEARCH, and the author of USEARCH (Edgar 2010) actually recommends performing UPARSE and UNOISE3 together.
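For intuition about how UNOISE decides whether a read is an error of a more abundant centroid, here is a minimal Python sketch of the abundance-skew rule; the threshold function beta(d) = 1/2^(alpha*d + 1) with default alpha = 2 follows the UNOISE2 paper, while the function names and the toy abundances are illustrative assumptions.

```python
def beta(d, alpha=2.0):
    """UNOISE abundance-skew threshold for a read at Hamming distance d
    from a candidate centroid: beta(d) = 1 / 2**(alpha*d + 1)."""
    return 1.0 / 2 ** (alpha * d + 1)

def assign_to_centroid(read_abundance, centroid_abundance, d):
    """A read is merged into a centroid (interpreted as an error of that
    centroid) when its abundance skew falls below beta(d)."""
    skew = read_abundance / centroid_abundance
    return skew <= beta(d)

# One base away and 1000x less abundant: treated as a sequencing error.
print(assign_to_centroid(read_abundance=5, centroid_abundance=5000, d=1))
# One base away but comparably abundant: kept as a distinct ZOTU.
print(assign_to_centroid(read_abundance=2600, centroid_abundance=5000, d=1))
```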
UNOISE2 and UNOISE3 have the following advantages:
- 1.
UNOISE2 has comparable or better accuracy than DADA2, as demonstrated by the author of UNOISE2 (Edgar 2016b).
- 2.
USEARCH-UNOISE3 (UNOISE2) had arguably the best overall performance compared to DADA2 and Qiime2-Deblur when its high sensitivity and excellent specificity are considered together (Prodan et al. 2020).
- 3.
In particular, USEARCH-UNOISE3 and Qiime2-Deblur showed perfect specificity, producing no spurious OTUs/ASVs in the mock sample sequencing data analysis (Prodan et al. 2020).
However, UNOISE also has disadvantages:
- 1.
Like DADA2, UNOISE was designed and tested on Illumina reads. It does not work well on Ion Torrent and 454 data and hence cannot achieve 100% haplotype recovery there (Edgar 2021; Hathaway et al. 2017).
- 2.
UNOISE collapses one-off errors when the abundance ratio of two sequences reaches a certain threshold, and it recommends against using singleton sequences, which decreases haplotype recovery at lower read depths. This in part contributes to UNOISE's speed and makes it the fastest algorithm compared to MED, DADA2, and SeekDeep (Hathaway et al. 2017). However, it also leaves UNOISE unable to detect new haplotypes that differ by only one nucleotide (Hathaway et al. 2017).
8.3.2.6 Deblur
We illustrated Qiime2-Deblur in Chap. 4 (Sect. 4.4). Amir et al. (2017) developed a novel sub-OTU (sOTU) method called Deblur to quickly and accurately identify exact amplicon sequences and to integrate large datasets. Like DADA2 and UNOISE2, Deblur aims to identify real ecological differences between taxa whose amplicons differ by a single base pair. Deblur is conceptually similar to DADA2 and UNOISE2, but it uses a different algorithm to obtain single-nucleotide resolution: it processes each sample independently and compares sequence-to-sequence Hamming distances within a sample against an upper-bound error profile, combined with a greedy algorithm (Amir et al. 2017).
The Deblur algorithm is implemented in three steps (Amir et al. 2017): First, Deblur sorts sequences by abundance. Second, proceeding from the most to the least abundant sequence, Deblur subtracts the number of predicted error-derived reads from the counts of neighboring reads, based on their read-to-read Hamming distance and an upper bound on the error probability. Finally, Deblur removes any sequence whose abundance drops to zero after a subtraction. Thus, after applying Deblur, the invalid (i.e., noise) sequences are removed and only reads likely to have been presented to the sequencer are retained.
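The following simplified Python sketch illustrates this greedy subtract-and-remove pass; the per-distance error fractions, function names, and toy reads are illustrative assumptions (the actual method uses an upper-bound error profile over Hamming distances and additional filtering steps).

```python
import numpy as np

def deblur_pass(seqs, counts, err_by_dist, hamming):
    """One greedy Deblur-style pass (a simplified sketch).

    seqs/counts: sequences with their read counts.
    err_by_dist: assumed upper-bound fraction of a sequence's reads that
    appear as errors at each Hamming distance.
    From each sequence, most abundant first, predicted error reads are
    subtracted from its neighbors; sequences whose counts drop to zero
    or below are removed.
    """
    counts = np.array(counts, dtype=float)
    for i in np.argsort(-counts):
        if counts[i] <= 0:          # already explained away as errors
            continue
        for j in range(len(seqs)):
            if j == i:
                continue
            d = hamming(seqs[i], seqs[j])
            if d in err_by_dist:
                counts[j] -= err_by_dist[d] * counts[i]
    return [(s, c) for s, c in zip(seqs, counts) if c > 0]

hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
# Hypothetical profile: 1% of reads one base off, 0.25% two bases off
profile = {1: 0.01, 2: 0.0025}
print(deblur_pass(["AAAA", "AAAT", "GGGG"], [10000, 60, 500], profile, hamming))
# -> [('AAAA', 10000.0), ('GGGG', 500.0)]; 'AAAT' is removed as error-derived
```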
Deblur has the following advantages:
- 1.
Deblur is comparable to or better than DADA2 and UNOISE2 in terms of performance characteristics and stability (i.e., obtaining the same sOTUs across different samples) (Amir et al. 2017). For example, although all three methods can identify sOTUs with single-nucleotide differences close to the ground truth, Deblur has greater stability than DADA2 and UNOISE2.
- 2.
Unlike DADA2 and UNOISE2, Deblur processes each sample independently, without requiring operation on the full study, and can therefore be easily parallelized for very large projects.
- 3.
Particularly, in one study that compared six bioinformatic tools for the analysis of amplicon sequence data, QIIME-uclust, mothur, and USEARCH-UPARSE for generating OTUs and DADA2, Qiime2-Deblur, and USEARCH-UNOISE3 for generating ASVs, Qiime2-Deblur was one of only two pipelines (the other being USEARCH-UNOISE3) that showed perfect specificity on the mock sample sequencing data, producing no spurious OTUs/ASVs (Prodan et al. 2020).
Deblur also has disadvantages; for example, it performs poorly compared to DADA2 and open-reference OTU clustering when used to detect low-abundance taxa in the extreme dataset at 97% identity (Nearing et al. 2018). For a comparison of Deblur and DADA2, the reader is also referred to Sect. 4.4.6.
8.3.2.7 SeekDeep
SeekDeep (Hathaway et al. 2017) is an error-correction method for de novo (i.e., reference-free) analysis of amplicons. SeekDeep was developed and benchmarked against three pipelines that aim for single-base resolution in determining local PCR amplicon haplotypes (haplotypes for brevity), MED (Eren et al. 2014), DADA2 (Callahan et al. 2016a), and UNOISE in USEARCH (Edgar 2016b), as well as two OTU-based clustering pipelines that cannot resolve at the single-base level: USEARCH (aka UCLUST/UPARSE) (Edgar 2013) and Swarm (Mahé et al. 2014). SeekDeep aims at single-base resolution to provide improved resolution toward species and strains. The central algorithm of SeekDeep is qluster (for quality clustering), which improves the correction of PCR and sequencing errors through base quality values, k-mer frequencies, and multiple other key techniques. SeekDeep can be used with Ion Torrent, 454, and Illumina sequencing technologies. SeekDeep consists of four components (Hathaway et al. 2017): (1) extractor for de-multiplexing and read filtering, (2) qluster for rapid and accurate clustering based on quality, (3) processClusters for replicate and population comparisons, and (4) popClusteringViewer for viewing and manipulating final results. The first three components are central to generating clustering results, and the fourth is an additional component that aids in viewing and sharing the results.
SeekDeep has the following advantages (Hathaway et al. 2017):
- 1.
SeekDeep is able to resolve sequences differing by only a single base.
- 2.
SeekDeep has greater consistency even at low frequencies.
- 3.
Particularly, SeekDeep matched or outperformed the five pipelines, MED, DADA2, UNOISE in USEARCH, USEARCH, and Swarm, in terms of haplotype recovery (especially of one-off haplotypes), accuracy of abundance estimates, and the number and abundances of false haplotypes.
However, Hathaway et al. (2017) also acknowledged that SeekDeep creates more false haplotypes than DADA2 and UNOISE although the false haplotypes have much lower abundance.
8.3.2.8 Remarks on Denoising-Based Methods
Denoising-based methods are widely used for identifying low-abundance or rare species against a noisy background, often with the aim of improving estimates of ecological diversity (Hathaway et al. 2017), such as Chao1 richness.
Using a 97% identity threshold may merge phenotypically different strains with distinct sequences into a single cluster (Tikhonov et al. 2015; Callahan et al. 2016a, b, c). In contrast, denoising usually yields a set of predicted biological sequences, or valid OTUs, which often have single-base resolution. Thus, in most cases denoising algorithms provide the maximum possible biological resolution and are hence superior to conventional OTU methods.
However, the denoising-based methods cannot fully address all the issues in bioinformatic analysis of microbiome data, even though they were proposed for identifying rare species and improving the accuracy of ecological diversity estimates, because (1) all error models are necessarily approximate, and no denoising algorithm can deal with errors beyond its model's capabilities, and (2) denoising-based methods are particularly problematic when used to call low-abundance species. As reviewed above, DADA2 had the best sensitivity and resolution but also produced a larger number of spurious ASVs, while UNOISE2 was fast but is not recommended for use with singleton sequences because of decreased haplotype recovery at lower read depths.
8.4 Discussion on Moving Beyond OTU Methods
The concept of the OTU and OTU methods originated from numerical taxonomy. An OTU is an operational taxonomic unit, or the organizational level of a unit character. It was created in numerical taxonomy to facilitate taxonomic classification in contrast to earlier subjective approaches to taxonomy. The OTU methods in numerical taxonomy can be characterized by three concepts and one assumption. The three concepts are OTU, similarity, and clustering. The assumption is that taxonomic classification can be performed by clustering the similarities of characters with large random sampling. This separates taxonomic classification from phylogenetic speculation. In numerical taxonomy, OTUs are used for the classification of characters via commonly used clustering methods, under the assumption that characters can be grouped based on their similarities.
In early microbiome studies, the concept of the OTU and OTU methods were adopted for convenience; using 97% sequence identity to approximate species is arbitrary and was based on researchers' experience rather than substantial experiments. In microbiome studies, the use of OTUs serves a dual purpose: grouping sequences (the third purpose, playing the functional role of species, is an extension of the first) and estimating diversity. The ultimate purpose of using OTUs is to detect associations between the microbiome and health and the development of disease. When the concept of the OTU and OTU methods were adopted into microbiome studies, not only the commonly used clustering methods but also various other methods, including heuristic clustering and model-based methods, were developed to improve and refine OTUs, all under the assumption that sequence similarity can predict taxonomic similarity.
Under the framework of OTU methods, the dual purpose of using OTUs has been demonstrated to fail. Meanwhile, various new methods have been proposed in recent years with the goal of achieving single-nucleotide resolution of sequences without necessarily relying on arbitrary sequence identity thresholds and clustering methods. Among the newly proposed methods, the denoising-based methods have demonstrated their promise. Both the failure of the dual purpose of OTU methods and the promise of the new bioinformatic sequence analysis raise a number of interesting questions that deserve further discussion.
8.4.1 Necessity of Targeting Single-Base Resolution
First, is it necessary to target single-base resolution?
On the one hand, clinically and for downstream analysis of microbiome data, a consistent single-base resolution of analysis units is necessary. Clinically, consistently differentiating single-base differences between analysis units is particularly necessary for studying eukaryotic intraspecies populations and for mutation detection. It is crucial for detecting and quantifying minority haplotypes that may be represented by a single-nucleotide polymorphism (SNP), a difference in a single DNA building block. SNPs are the most common type of genetic variation among people. SNPs have been used to detect strains (also referred to as subspecies) of clinical relevance and to predict phenotypic characteristics when they are stably associated with other parts of the bacterial haplotype (Fitz-Gibbon et al. 2013). In the oncology and infectious disease fields, such studies are becoming increasingly common. In these studies, whether using marker regions to type bacterial strains or monitoring pathogens for drug resistance mutations, the sequences of interest often differ from the wild type by only a single base. Thus, it is crucial to accurately quantify these low-abundance and genetically similar strains (Miotto et al. 2015; Hathaway et al. 2017). Another example, provided by Tikhonov et al. (2015), is malaria research, where the strains within an infected individual are often defined by the sequence of a single amplicon, and these sequences usually differ by only a single base, representative of a SNP within the larger parasite population.
Recently, Johnson et al. (2019) suggested that intragenomic variation between 16S gene copies must be accounted for in modern bioinformatic analysis. They also demonstrated that full-length 16S intragenomic copy variants, when appropriately treated, can potentially provide taxonomic resolution of bacterial communities at the species and strain level; here the full 16S gene refers to the ~1500 bp 16S rRNA gene comprising nine variable regions (V1–V9) interspersed throughout the highly conserved 16S sequence. Hassler et al. (2022) found that the alignment SNP count strongly predicted concordance for any given gene; the strongest concordance was displayed by SNPs from non-ribosomal protein-coding genes and the weakest by SNPs from rRNA genes. Statistically, single-base resolution of 16S amplicon clustering improves the accuracy and sensitivity of traditional OTUs and thus extracts maximal information for downstream analysis (Tikhonov et al. 2015).
On the other hand, the techniques for obtaining a consistent single-base resolution of atomic analysis units remain challenging. For example, as we reviewed, DADA2 may have the best sensitivity and resolution but is prone to producing spurious ASVs; UNOISE2 is fast, but singleton sequences are not recommended because of decreased haplotype recovery at lower read depths; and SeekDeep may create more false haplotypes than DADA2 and UNOISE2.
8.4.2 Possibility of Moving Beyond Traditional OTU Methods
Second, is it possible to move beyond the traditional OTU-based methods?
Regardless of whether they are called "oligotypes," "ASVs," "ZOTUs," "sOTUs," or something else, the authors of all the newly developed methods explicitly or implicitly want to replace the traditional OTUs with their proposed sub-OTUs. The case for this replacement rests on the following points:
- 1.
The newly available methods provide better resolution than OTU methods. The sub-OTU methods can attain single-nucleotide resolution, which makes it possible to identify real ecological differences between taxa or organisms in communities whose amplicons differ by a very small number of nucleotides (i.e., a single base pair) (Hathaway et al. 2017; Eren et al. 2013; Callahan et al. 2016a; Amir et al. 2017). For example, the oligotypes identified by oligotyping offer a comparable or higher level of resolution than 97% OTUs, a resolution that traditional OTU clustering at the 97% identity level cannot reach.
- 2.
The newly available methods are more accurate than OTU methods, demonstrating sensitivity and specificity as good as or better than OTU methods. For example, oligotyping and ASV methods can separate different environments more efficiently and have increased power to discriminate ecological distribution patterns that taxonomic classification and OTU clustering at the 97% identity level cannot detect (Eren et al. 2013, 2014, 2015; Callahan et al. 2016a). ASVs can even capture all the biological variation present in the data (Callahan et al. 2017). Swarm2 can distinguish higher-resolution clusters or different taxa that traditional UCLUST cannot detect (Mahé et al. 2015b). Deblur likewise has similar or better sensitivity and specificity while substantially reducing computational demands relative to similar sOTU methods (Amir et al. 2017).
- 3.
Like PICRUSt (Langille et al. 2013), sub-OTUs are functionally, biologically, or clinically meaningful. For example, the subpopulations of a single species identified by a sequence or sub-OTU can provide functional information: insight into the functional relatedness of community members (Langille et al. 2013). Sequences differing by as little as one nucleotide (99.2% similarity) within the 16S rRNA gene can be ecologically distinct and can reveal clinical or epidemiological associations that would be missed by genus-level or species-level categorization of 16S rRNA data (Tikhonov et al. 2015).
- 4.
The sub-OTUs facilitate downstream analysis owing to their precision, their reusability across studies, and their reproducibility in future datasets. The sub-OTU methods, such as oligotyping, cluster-free filtering, and DADA2, were shown to be not only comparable to or better than traditional clustering analysis at explaining the structure of bacterial datasets; they also avoid vast underestimates of ecological richness, improve the accuracy of diversity and dissimilarity measurements, provide new insight into the factors shaping community assembly, the prevalence of strain exchange between communities, and the invasion/extinction dynamics of OTU subpopulations, and resolve distinct subpopulations with high dynamical similarity. Thus, owing to their precision, the sub-OTU methods could potentially allow amplicon methods to probe strain-level variation (Eren et al. 2011, 2013; Callahan et al. 2016a). In traditional OTU methods, after filtering the sequences and removing the chimeras, a sample-by-OTU feature table is generated to serve as the basis for further analysis; most sub-OTU methods can likewise produce a sample-by-sub-OTU feature table for downstream analysis. Perhaps more importantly, sub-OTUs are reusable across studies and comprehensive and reproducible in future datasets, because they are more precise and combine the benefits of closed-reference and de novo OTUs for subsequent analysis (Tikhonov et al. 2015). We can continue discovery by exploring present and future large datasets and by reusing existing rich datasets (Callahan et al. 2017). Some sub-OTU methods, such as DADA2, can easily process Illumina samples on a laptop, and workflows from raw data to downstream analysis have been shown in recent publications (Callahan et al. 2016a, 2017; Amir et al. 2017). In summary, all these advantages of sub-OTUs, especially their more accurate abundance estimates, facilitate downstream statistical analyses using sophisticated parametric or nonparametric models (Callahan et al. 2016b).
Based on the above review, the sub-OTU methods are better than OTU methods. Most of them are stable, easy to integrate, well performing, and open source. Using sub-OTU methods in marker-gene data analysis will be a trend in bioinformatic analysis of microbiome data, or at least an alternative to the OTU approach. The sub-OTUs are a good alternative to OTUs as the analysis units of microbiome data.
The adoption of these new sub-OTU methods is facilitated by their availability in common datasets and bioinformatics pipelines. For example, since its introduction, Swarm v1 has been used on a variety of datasets (Callahan et al. 2016a; de Vargas et al. 2015; Filker et al. 2015; Lima-Mendez et al. 2015; Mahé et al. 2015a). QIIME already offers Swarm v1.2 to facilitate its usage, and Swarm v2 can be included in QIIME 2 and in Galaxy (Oikonomou et al. 2015; Mahé et al. 2015b). QIIME 2 has already incorporated DADA2 through the use of exact "ASVs" rather than "OTUs." Because of its stability and open-source license, Deblur is easy to adopt commercially, to scrutinize by peers, and to integrate (Goecks et al. 2010). The R package DADA2 was developed to support its use in the analysis of microbiome data (Amir et al. 2017).
8.4.3 Issues of Sub-OTU Methods
Third, do issues still exist in sub-OTU methods after moving beyond the use of OTUs and OTU-based methods?
Sub-OTU methods open potential new directions for a more accurate depiction of microbiome communities through marker-gene amplicons. However, the new methods have their own issues and come with new questions.
First, although the new methods can distinguish highly similar sequence variants, there is still no guarantee that the resulting sub-OTUs are phylogenetically and ecologically meaningful, owing to the limitations of the selected 16S rRNA gene (Callahan et al. 2016a; Berry et al. 2017). Thus, inference based on sub-OTUs (i.e., ASVs) does not solve all problems (Callahan et al. 2017). Second, not all new methods are suitable for studies of microbiome communities; for example, cluster-free filtering discards low-abundance sequences, so we cannot use it to study population-level alpha or beta diversity (Callahan et al. 2017). Third, to become more accessible, the new methods still have computational and ecological issues to address. To move beyond the use of OTUs, microbial ecologists and the developers of widely used software platforms still have further work to do (Tikhonov et al. 2015).
8.4.4 Prediction of Sequence Similarity to Ecological Similarity
Fourth, do we need to avoid the assumption that sequence similarity predicts ecological similarity or phylogeny?
In the literature, on the one hand, 16S-based methods have been shown to have limitations; on the other hand, it has been demonstrated that even a purely 16S-based study can exhibit ecologically significant distinctions and provide insight into the functional relatedness of community members (Eren et al. 2016; Langille et al. 2013; Tikhonov et al. 2015). Thus, whether the assumption that sequence similarity predicts ecological similarity or phylogeny holds remains an open question.
8.4.5 Functional Analysis and Multi-omics Integration
Fifth, are functional analysis and multi-omics integration necessary and important in microbiome studies?
The discussion of the advantages and disadvantages of the OTU methods and of the movement beyond OTU methods highlights the importance of two topics in bioinformatic and statistical analysis of microbiome data: (1) What are the optimal analysis units of microbiome data? (2) What kind of science should microbiome research be?
Regarding the first topic: compared to OTU methods, the methods that provide ASVs or single-nucleotide resolution are in general closer to the microbiome sequence data and better serve the purposes of microbiome studies. We have shown that targeting single-base resolution is necessary and that moving beyond traditional OTU methods is possible in the bioinformatic analysis of microbiome data. However, as reviewed above, ASV and single-nucleotide resolution methods also have limitations, and not all OTU methods are inferior to all ASV methods. In fact, all statistical models approximate the real data; none is expected to capture all the complexities of reality. That is, no single model will ever represent the data exactly or achieve all the purposes an analysis or study is supposed to serve. The goal of model comparison is to find a model that is sufficiently close to reality to be useful, even if it cannot describe reality exactly. The challenge is how to choose parsimonious bioinformatic models that provide remarkably useful approximations to the real sequencing data and extract maximal information for statistical analysis, so as to obtain insight into clinical relevance. Here, Ockham’s razor [William of Ockham (c. 1287–1347)], also known as the principle or law of parsimony, is applicable. The principle of parsimony states that we should choose the simplest model that approximates reality, because “entities should not be multiplied beyond necessity” (Schaffer 2015).
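One common formalization of the parsimony principle in model comparison is an information criterion such as AIC = 2k − 2 ln L̂, which penalizes the number of parameters k while rewarding fit. The short R sketch below, on simulated data of our own construction (not from the microbiome literature), shows how AIC tends to favor the simpler of two models when the extra complexity adds nothing.

```r
# Toy illustration of parsimony via AIC: lower AIC indicates a better
# trade-off between goodness of fit and model complexity.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)          # the true relationship is linear

m_simple  <- lm(y ~ x)           # parsimonious model
m_complex <- lm(y ~ poly(x, 5))  # needlessly flexible model

AIC(m_simple, m_complex)         # the simpler model typically has lower AIC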
Regarding the second topic: what kind of science is microbiome research? The questions are: How should the microbiome be defined? What are the research themes of microbiome studies? What are their unique data structures and characteristics? The interested reader is referred to Xia and Sun (2022).
Briefly, we define microbiome research as an experimental science. As an experimental science, it does not rely solely on direct observation and measurement of objects/samples and phenomena to identify associations or correlations; more importantly, it conducts experiments to uncover the functions and mechanisms/causality at play in human health and the development of disease. Like other experimental sciences, “inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of uncertainty may itself be capable of rigorous expression” [see Fisher (1935)’s Design of Experiments (p. 4)].
In fact, for a microbiome study, accurate coding of sequence data by a bioinformatic tool is just the first step. Multiple approaches, especially experiments and clinical trials, are needed to confirm and validate the findings of bioinformatic analyses.
Microbiome study as a research field is not merely descriptive and observational. Describing diversity and visualizing the differences between experimental groups are necessary, but the final goal of microbiome study is to uncover the causal links between the microbiome and human health and the development of disease, and to use the findings from microbiome research to improve human health and prevent disease. The diversity concept was adopted from macroecology. Alpha diversity, including richness and evenness, is important in macroecology. But is it equally important in microbial ecology, given the differences and complexities of microbial ecology? The answer is no.
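Since alpha diversity is taken up in detail in Chap. 9, we only note here how the two components mentioned above are conventionally computed: richness as the number of observed taxa S, and evenness via Pielou’s J′ = H′/ln S, where H′ is the Shannon index. The counts in the R sketch below are a hypothetical illustration.

```r
# Hypothetical example: richness and Pielou's evenness from taxon counts
counts <- c(40, 30, 20, 10)

S <- sum(counts > 0)          # richness: number of observed taxa
p <- counts / sum(counts)
H <- -sum(p * log(p))         # Shannon index
J <- H / log(S)               # Pielou's evenness; 1 = perfectly even

c(richness = S, shannon = H, evenness = J)
```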
Currently, two approaches in microbiome research tend to be more important, although they are more complicated. One is functional study, which aims to determine the functionality of the microbiota, identify the top microbes, and use them in experimental studies to test their functions. The other is to integrate the microbiome with other omics, such as metabolomics, metaproteomics, transcriptomics, genomics, and epigenomics, to find biomarkers associated with favorable and adverse outcomes. Ultimately, the findings from microbiome research are expected to be applied to improving human health and treating human disease.
Overall, we want to highlight the importance of functional, experimental studies with multi-omics biological designs. Like other experimental sciences, the final purpose of microbiome study is to uncover the mechanisms of the microbiome and its links to biological and environmental factors and to establish causality among them. Thus, an experimental and functional study is more important than an operational method. However, any experimental and functional study needs statistical methods to confirm its effects.
8.5 Summary
In this chapter we first described how the common clustering-based OTU methods were adopted in microbiome studies and how heuristic clustering OTU methods have been developed to improve the assignment of OTUs. We also described the limitations of clustering-based OTU methods and the purposes for using OTUs in microbiome studies, and discussed the definition of species and species-level analysis. Next, we introduced the concept shift in bioinformatic analysis and two single-nucleotide resolution OTU-based methods, distribution-based clustering (DBC) and Swarm2, which provided the context for moving beyond OTU methods. Then, we focused on two kinds of approaches that move beyond OTU methods: oligotyping, the entropy-based single-nucleotide resolution non-clustering method, and denoising-based methods, including denoisers for pyrosequencing flowgrams and for Illumina data. In the bioinformatic analysis of microbiome data, pyrosequencing techniques were developed early, whereas Illumina techniques represent the current state of development; thus, we focused on five newly developed Illumina denoisers: (1) cluster-free filtering (CFF), (2) DADA2, (3) UNOISE2 and UNOISE3, (4) Deblur, and (5) SeekDeep. We also presented overall comparisons of denoising-based and clustering-based methods and commented on the denoising-based methods. In the final part of this chapter, we discussed the movement beyond OTU methods, highlighting the necessity of targeting single-base resolution, the possibility of moving beyond traditional OTU methods, the issues of sub-OTU methods, the assumption that sequence similarity predicts ecological similarity, and the roles of functional analysis and multi-omics integration. From Chap. 9 through Chap. 18, we will focus on the statistical analysis of microbiome data. In Chap. 9, we will introduce alpha diversity.