Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_5

5. Assigning Taxonomy, Building Phylogenetic Tree

Yinglin Xia¹ and Jun Sun¹

(1) Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

Keywords

Reference databases; q2-feature-classifier; Taxonomic classification; RDP; Greengenes; SILVA; UNITE; NCBI; EzBioCloud; GTDB (Genome Taxonomy Database); HITdb; nifHdada2; pr2database; MAFFT; FastTree

Taxonomic classification of the representative sequences and clustering of OTUs (Operational Taxonomic Units) are two core steps in bioinformatic analysis of microbiome data. In Chap. 4, we described and illustrated how to generate the feature table and feature data (i.e., representative sequences), which are crucial for a microbiome study because they provide the data that downstream analyses depend on. In Chap. 5, we describe and illustrate two other core bioinformatic analyses: assigning taxonomy (Sect. 5.1) and building a phylogenetic tree (Sect. 5.2). We then briefly summarize the material covered in this chapter (Sect. 5.3).

5.1 Assigning Taxonomy

Taxonomic assignment is a crucial step in bioinformatic analysis of microbiome data, and reference databases are an essential component of that analysis because they are used to translate sequences into readable taxonomic (e.g., bacterial) names.

5.1.1 Bioinformatics Tools and Reference Databases

Various bioinformatics tools are available for analysis of 16S rRNA gene amplicon sequencing data (Plummer et al. 2015; Nilakanta et al. 2014). Among these tools, the most widely used are QIIME (Caporaso et al. 2010), its successor QIIME 2 (Bolyen et al. 2019), and DADA2 (Callahan 2021). In Chap. 4, we used DADA2 via QIIME 2 to generate the feature frequency table and the corresponding representative sequence data, which contain the denoised sequences.

Taxonomic classification is performed after the sequences pass the filtering process: the sequences are typically searched against a known reference taxonomy at a pre-determined threshold. Most reference databases, including the Ribosomal Database Project (RDP) (Cole et al. 2005), Greengenes (DeSantis et al. 2006), the SILVA 16S rRNA gene database (Yilmaz et al. 2014), the UNITE database (Kõljalg et al. 2013), the National Center for Biotechnology Information (NCBI) (Federhen 2011), and EzBioCloud (Yoon et al. 2017), are publicly available. When performing taxonomic classification, we should use frequently updated databases to avoid mapping the sequences to obsolete taxonomic names.

Currently, the most commonly used reference databases for 16S taxonomy assignment are SILVA (Quast et al. 2013), RDP (Cole et al. 2005; Maidak et al. 2000), NCBI (Federhen 2011; Geer et al. 2010), and Greengenes (McDonald et al. 2012).

SILVA (from Latin silva, forest) (Pruesse et al. 2007) is a comprehensive online resource that provides quality-controlled databases of aligned rRNA sequence data for all three domains of life (Bacteria, Archaea, and Eukarya) (Pruesse et al. 2012). The SILVA database is based on phylogenies for small subunit rRNAs (16S and 18S), and its taxonomic rank assignments are manually curated (Yilmaz et al. 2014). Pipelines such as QIIME 2, DADA2, and mothur (Schloss et al. 2009) use the SILVA 16S rRNA gene reference database (Plummer et al. 2015).

Like SILVA, the RDP database (Cole et al. 2005) provides aligned and annotated rRNA gene sequences for all three domains of life (Bacteria, Archaea, and Eukarya) and a phylogenetically consistent taxonomic framework for these data. The RDP database obtains bacterial rRNA sequences from the International Nucleotide Sequence Databases (INSD: GenBank/EMBL/DDBJ) on a monthly basis (Nakamura et al. 2012). The RDP Classifier, a naïve Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order bacterial taxonomy, with the majority of classifications (98%) having high estimated confidence (≥95%) and high accuracy (98%) (Wang et al. 2007). RDP provides taxonomic assignments from domain to genus (Wang et al. 2007); thus, not all of the 16S sequences collected in the RDP database have been assigned species-level taxonomic names.

The NCBI Taxonomy project began in 1991. NCBI taxonomy database (Federhen 2011; Geer et al. 2010) is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank (Benson et al. 2013), the European Nucleotide Archive (ENA) (EMBL) (Leinonen et al. 2011), and the DNA DataBank of Japan (DDBJ) (Mashima et al. 2017) databases (Federhen 2011). It contains all organism names and taxonomic lineages for each of the sequences associated with submissions to the NCBI global database and is manually curated to maintain a phylogenetic taxonomy for the source organisms represented in the sequence databases (Federhen 2011).

Greengenes (McDonald et al. 2012; DeSantis et al. 2006) is a chimera-checked 16S rRNA gene database that contains Bacteria and Archaea sequences. It provides chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. Most of the sequences in Greengenes are retrieved from the NCBI database (DeSantis et al. 2006). Greengenes taxonomic classification has been improved with explicit ranks for ecological and evolutionary analyses of bacteria and archaea (McDonald et al. 2012): (1) A “taxonomy to tree” approach transfers group names from an existing taxonomy to a tree topology; it was used to apply the Greengenes, NCBI, and cyanoDB (Cyanobacteria only) taxonomies to a de novo tree of sequences (McDonald et al. 2012). Reference phylogenies provide crucial information for a taxonomic framework in interpreting marker gene and metagenomic surveys, and help to reveal novel species. (2) Explicit rank information provided by the NCBI taxonomy has been incorporated into group names for better user orientation and classification consistency, which significantly improved the classification of the sequences in the merged taxonomy (McDonald et al. 2012). In summary, Greengenes is a dedicated full-length 16S rRNA gene database that provides a curated taxonomy based on de novo tree inference. The database is used for closed-reference OTU clustering, in which reads are clustered against the reference database. It has been wrapped in QIIME 1 and QIIME 2.

However, as reviewed in Chap. 4, the OTU method, which clusters sequences at a fixed 97% similarity threshold, might miss fine-scale variation among sequences (Rosen et al. 2012). OTUs are not species, so the OTU method often discards biological information in the data, and the construction of OTUs is not necessitated by amplicon errors (Callahan et al. 2016). Thus, DADA2 uses an alternative error-modeling approach for denoising and clustering amplicons. This may be why DADA2 will no longer maintain the source GreenGenes database, which is deprecated (see Table 5.2). For clustering sequences into OTUs, the reader is referred to Chap. 6.

A variety of databases are available; QIIME 2 and DADA2 format and maintain the most commonly used taxonomic reference databases.

5.1.2 QIIME 2-Formatted and Maintained Taxonomic Reference Databases

We summarize the QIIME 2-formatted and maintained taxonomic reference databases in Table 5.1 (Bolyen et al. 2019). When using the databases in Table 5.1, please check QIIME 2 for updates and cite both QIIME 2 and the original databases.

Table 5.1

QIIME 2-formatted and maintained taxonomic reference databases

Category	Database name	Description
Taxonomy classifiers for use with q2-feature-classifier	Naive Bayes classifiers (Bokulich et al. 2018, 2021) trained on: (1) Silva 138 99% OTUs full-length sequences (MD5: b8609f23e9b17bd4a1321a8971303310) (Quast et al. 2012; Yilmaz et al. 2013); (2) Silva 138 99% OTUs from 515F/806R region of sequences (MD5: e05afad0fe87542704be96ff483824d4) (Quast et al. 2012; Yilmaz et al. 2013); (3) Greengenes 13_8 99% OTUs full-length sequences (MD5: 6bbc9b3f2f9b51d663063a7979dd95f1) (McDonald et al. 2012); (4) Greengenes 13_8 99% OTUs from 515F/806R region of sequences (MD5: 9e82e8969303b3a86ac941ceafeeac86) (McDonald et al. 2012)	Pre-trained classifiers can be used with q2-feature-classifier. However, QIIME 2 warns that using pre-trained classifiers currently presents a security risk, which will be addressed in a future version of q2-feature-classifier. Taxonomic classifiers perform best when they are trained on the specific sample preparation and sequencing parameters of the study (e.g., the primers used for amplification and the length of the sequence reads). QIIME 2 therefore recommends that, in general, users train their own taxonomic classifiers by following the instructions in “Training feature classifiers with q2-feature-classifier.” QIIME 2 notes that these classifiers were trained using scikit-learn 0.24.1 and therefore can only be used with scikit-learn 0.24.1. If errors related to scikit-learn version mismatches are observed, use the pre-trained classifiers that were published with the installed release of QIIME 2.
Weighted Taxonomic Classifiers	Weighted pre-trained classifiers (Kaehler et al. 2019): (1) Weighted Silva 138 99% OTUs full-length sequences (MD5: 48965bb0a9e63c411452a460d92cfc04); (2) Weighted Greengenes 13_8 99% OTUs full-length sequences (MD5: 2baf87fce174c5f6c22a4c4086b1f1fe); (3) Weighted Greengenes 13_8 99% OTUs from 515F/806R region of sequences (MD5: 8fb808c4af1c7526a2bdfaafa764e21f)	Trained with weights that take into account the fact that not all species are equally likely to be observed. They provide superior classification precision if the V4 sample comes from any of the 14 habitat types tested by QIIME 2, and might still help even if the sample does not come from one of those habitats. Training with weights specific to the habitat should help even more. Weights for a range of habitats are available from https://github.com/BenKaehler/readytowear
Marker gene reference databases	Greengenes (16S rRNA) (DeSantis et al. 2006; McDonald et al. 2012): 13_8 (most recent); 13_5; 12_10; February 4th, 2011. Silva (16S/18S rRNA) (Bokulich et al. 2021): (1) Silva 138 SSURef NR99 full-length sequences (MD5: de8886bb2c059b1e8752255d271f3010) (Quast et al. 2012; Yilmaz et al. 2013); (2) Silva 138 SSURef NR99 full-length taxonomy (MD5: f12d5b78bf4b1519721fe52803581c3d) (Quast et al. 2012; Yilmaz et al. 2013); (3) Silva 138 SSURef NR99 515F/806R region sequences (MD5: a914837bc3f8964b156a9653e2420d22); (4) Silva 138 SSURef NR99 515F/806R region taxonomy (MD5: e2c40ae4c60cbf75e24312bb24652f2c) (Quast et al. 2012; Yilmaz et al. 2013). UNITE (fungal ITS) (Põlme et al. 2020): all releases are available for download at https://unite.ut.ee/repository.php	Formatted for use with QIIME 1 and QIIME 2. These databases need to be imported into artifacts before they are used with QIIME 2. Silva (16S/18S rRNA): QIIME-compatible SILVA releases are available up to release 132. The pre-formatted SILVA 138 release reference sequence and taxonomy files provided here by QIIME were processed using RESCRIPt (https://github.com/bokulich-lab/RESCRIPt) and q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier/). UNITE (fungal ITS): more information about UNITE is available at https://unite.ut.ee/
SEPP reference databases	SEPP references (SEPP-Refs project): (1) Silva 128 SEPP reference database (MD5: 7879792a6f42c5325531de9866f5c4de); (2) Greengenes 13_8 SEPP reference database (MD5: 9ed215415b52c362e25cb0a8a46e1076)	These databases are intended for use with q2-fragment-insertion and are constructed directly from the SEPP-Refs project
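The MD5 values listed in Table 5.1 can be used to confirm that a downloaded file is intact before it is passed to QIIME 2. The following is a minimal Python sketch of such a check; the function names md5_of_file and matches_published_md5 are hypothetical helpers introduced here for illustration, not part of QIIME 2.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare a file's MD5 with a published value (case-insensitive)."""
    return md5_of_file(path) == published_md5.strip().lower()


# Self-check on a throwaway file: the MD5 of the bytes b"hello" is well known.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches_published_md5(demo_path, "5D41402ABC4B2A76B9719D911017C592")
os.remove(demo_path)
print(demo_ok)
```

For a real download, the call would look like matches_published_md5("gg-13-8-99-515-806-nb-classifier.qza", "9e82e8969303b3a86ac941ceafeeac86"), assuming the file comes from the same QIIME 2 release that Table 5.1 describes; the checksums of other releases may differ.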

5.1.3 DADA2-Formatted and Maintained Taxonomic Reference Databases

DADA2 provides formatted 16S rRNA gene sequences for both bacteria and archaea (Alishum 2019). Two combined bacterial and archaeal 16S rRNA gene sequence databases (RefSeq+RDP and the Genome Taxonomy Database (GTDB)) have been collated and formatted for DADA2, which draws on various sources for assigning taxonomy. DADA2 categorizes the 16S databases into Maintained and Contributed. DADA2 maintains reference fastas for the three most common 16S databases: Silva, RDP, and GreenGenes. It also maintains the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases are made available as “contributed.”

DADA2 created the dada2-compatible training fastas from the Silva NR99 and taxonomy files, the RDP trainset 16 and release 11.5 database, and the GreenGenes 13.8 OTUs clustered at 97%.

We summarize the DADA2-formatted and maintained taxonomic reference databases in Table 5.2 (Callahan 2021). When using the databases in Table 5.2, please check DADA2 for updates and cite both DADA2 and the original databases.

Table 5.2

DADA2-formatted and maintained taxonomic reference databases

Category	Database name	Description
Maintained databases	Silva (16S/18S rRNA) (Bokulich et al. 2021): (1) Silva version 138.1, updated March 10, 2021; (2) Silva version 132; (3) Silva version 128; (4) Silva version 123	As of Silva version 138, the DADA2-formatted reference fastas are optimized for classification of Bacteria and Archaea and are not suitable for classifying Eukaryotes
	RDP (Cole et al. 2005): (1) RDP trainset 18; (2) RDP trainset 16; (3) RDP trainset 14
	UNITE (fungal ITS) (Põlme et al. 2020): UNITE (use the General Fasta releases)
	Greengenes (16S rRNA) (DeSantis et al. 2006; McDonald et al. 2012; Callahan 2016): Greengenes version 13.8	DADA2 will no longer maintain the source GreenGenes database because it is deprecated
Contributed databases	RefSeq + RDP (NCBI RefSeq 16S rRNA database supplemented by RDP) (Alishum 2019): (1) reference files formatted for assignTaxonomy; (2) reference files formatted for assignSpecies	This database was compiled on May 14, 2018, predominantly from the NCBI RefSeq 16S rRNA database (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/) and was supplemented with extra sequences from the RDP database (https://rdp.cme.msu.edu/misc/resources.jsp). It contains 14676 bacterial and 660 archaeal full 16S rRNA gene sequences
	GTDB (Genome Taxonomy Database) (Alishum 2019): GTDB Version 202: Genome Taxonomy Database; Version 86 for assignTaxonomy and assignSpecies	This database was downloaded from http://gtdb.ecogenomic.org/downloads on November 20, 2018. The DADA2-formatted GTDB reference sequence set contains 20486 bacterial and 1073 archaeal full 16S rRNA gene sequences
	Human InTestinal 16S rRNA (Diener 2016; Ritari et al. 2015): HITdb version 1	HITdb is a reference taxonomy for human intestinal 16S rRNA genes, as described in Ritari et al. (2015). HITdb v1.00 for DADA2 is a converted version for use with dada2. HITdb is specific to intestinal samples; thus it might lead to arbitrarily wrong results for non-intestinal samples
	RDP fungi LSU (Czaplicki 2017): RDP fungi LSU trainset 11	RDP LSU taxonomic training data formatted for DADA2 (trainingset 11)
	SILVA v128 and v132 dada2 formatted 18s “train sets” (Morien and Parfrey 2018): Silva Eukaryotic 18S, v132 and v128	These are species-level taxonomy classification training sets for the assignTaxonomy() function in DADA2. The v132 and v128 training sets include every eukaryotic organism from SILVA’s v132 and v128 databases, respectively, clustered at 99% similarity. The v128 training set also includes corrected species labels for the Blastocystis clade and 37 Entamoeba sequences sourced from GenBank that are not present in the original v128 database; it is modified specifically to allow better species-level assignments for those two clades in mammalian gut microbiome studies
	nifHdada2: v1.1.0 (Moynihan 2020): nifH ARB, version 1	This release adds new reference sequences to the database
	pr2database (https://github.com/pr2database/pr2database/releases): PR2 version 4.7.2+	The latest provided PR2 version, 4.14.0, is a single SSU database that contains sequences for 18S rRNA from nuclear and nucleomorph; 16S rRNA from plastid, apicoplast, chromatophore, and mitochondrion; and 16S rRNA from a small selection of bacteria. DADA2 note: PR2 has different taxLevels than the DADA2 default. When assigning taxonomy against PR2, use the following: assignTaxonomy(..., taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species")). There are many contributors and references for this database (https://github.com/pr2database/pr2database/releases)

5.1.4 Introduction to q2-Feature-Classifier

We may choose to train our own classifiers using a suitable method, such as the q2-feature-classifier protocol available in QIIME 2 (Bokulich et al. 2018). The q2-feature-classifier is a QIIME 2 plugin for taxonomy classification of marker-gene sequences. It contains several machine-learning and alignment-based methods, including a scikit-learn naïve Bayes machine-learning classifier and alignment-based taxonomy consensus methods based on VSEARCH and BLAST+, for classification of bacterial 16S rRNA and fungal ITS (internal transcribed spacer) marker-gene amplicon sequence data. These methods were evaluated as matching or outperforming the species-level accuracy of other commonly used methods designed for classification of marker-gene sequences (Bokulich et al. 2018).

The q2-feature-classifier plugin employs scikit-learn (Pedregosa et al. 2011) for supervised learning (SL) to classify sequences. Its classify-sklearn method applies a pre-fitted scikit-learn-based taxonomy classifier. scikit-learn implements machine learning algorithms while maintaining an easy-to-use interface tightly integrated with the Python language, and it has several distinctive features: (1) it is distributed under the BSD (Berkeley Source Distribution) license, which places few restrictions on its use and redistribution in free and open source software; (2) it incorporates compiled code for efficiency; (3) it depends only on numpy and scipy to facilitate easy distribution; and (4) it focuses on imperative programming (Pedregosa et al. 2011).
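To make the idea of a naïve Bayes taxonomy classifier concrete, the following is a toy, pure-Python sketch that classifies a sequence from its k-mer counts. It is only an illustration under simplifying assumptions (3-mers, add-one smoothing, equal priors, two made-up reference sequences with hypothetical genus labels); it is not the code used by q2-feature-classifier, which relies on scikit-learn.

```python
import math
from collections import Counter, defaultdict


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def train(references, k=3):
    """Count k-mers per taxon; `references` is a list of (sequence, taxon)."""
    counts = defaultdict(Counter)
    vocabulary = set()
    for sequence, taxon in references:
        words = kmers(sequence, k)
        counts[taxon].update(words)
        vocabulary.update(words)
    return counts, vocabulary


def classify(sequence, counts, vocabulary, k=3):
    """Return (best taxon, posterior probability) under a multinomial model
    with add-one smoothing and an equal prior for every taxon."""
    log_scores = {}
    for taxon, taxon_counts in counts.items():
        total = sum(taxon_counts.values()) + len(vocabulary)
        log_scores[taxon] = sum(
            math.log((taxon_counts[word] + 1) / total)
            for word in kmers(sequence, k) if word in vocabulary
        )
    # Convert log scores to posterior probabilities (log-sum-exp trick).
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Made-up reference sequences and labels, for illustration only.
REFERENCES = [("ACGTACGTACGTACGT", "g__Alpha"), ("TTGGCCAATTGGCCAA", "g__Beta")]
counts, vocabulary = train(REFERENCES)
print(classify("ACGTACGTAC", counts, vocabulary))
```

The posterior probability returned here plays a role loosely similar to the confidence value reported with each taxonomic assignment; real classifiers use longer k-mers and far larger reference sets.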

Current sequencing and denoising methods can differentiate sequences that differ by a single nucleotide, yielding Amplicon Sequence Variants (ASVs), or sub-OTUs, which are equivalent to 100% OTUs. Many researchers in the microbiome field, including the developers of QIIME 2, now recommend working with ASVs or sub-OTUs and assigning taxonomy to the sequence variants, especially in 16S/18S/ITS amplicon sequencing. Thus, the QIIME 2 workflow by default does not include a typical OTU picking step. Here, we follow this default and go directly to taxonomy assignment after using DADA2 to quality filter the dataset. We take the denoised sequences (RepSeqsMiSeq_SOP.qza) obtained from the denoising step in Chap. 4 and assign taxonomy to each sequence (phylum ➔ class ➔ … ➔ genus ➔ species). A trained classifier is required for this step. We can either download a pretrained classifier or use a reference set to train a naïve Bayes classifier and save it as a QIIME 2 artifact for later reuse, which avoids re-training the classifier between runs and saves overall running time.

We can use the qiime feature-classifier fit-classifier-naive-bayes command to train a naïve Bayes classifier. If we want to use a pretrained classifier, a pre-trained naïve Bayes classifier artifact is available from QIIME 2. This classifier was trained against Greengenes (13_8 revision) trimmed to contain only the V4 hypervariable region and pre-clustered at 99% sequence identity (McDonald et al. 2012). We can check the QIIME 2 website (https://docs.qiime2.org/) for other available pre-trained artifacts.

5.1.5 Assign Taxonomy Using the q2-Feature-Classifier

Example 5.1: RepSeqsMiSeq_SOP.qza

We assign taxonomy based on the denoised sequence “RepSeqsMiSeq_SOP.qza” artifact using the q2-feature-classifier.

Once we obtain an appropriate classifier artifact, we can use the qiime feature-classifier command to generate the taxonomic classification results. Here, we use a classifier pretrained on the GreenGenes database with 99% OTUs. We download this classifier, gg-13-8-99-515-806-nb-classifier.qza, from the QIIME 2 website (https://docs.qiime2.org/2022.2/data-resources/) and place it in the working directory, so that the denoised sequences (RepSeqsMiSeq_SOP.qza) can be compared against the GreenGenes reference to assign taxonomy.

We need to take a few steps to assign taxonomy to the sequences, as shown below.

Step 1: Import reference data files as QIIME 2 zipped artifacts (.qza).

In this case, the downloaded gg-13-8-99-515-806-nb-classifier.qza is already a QIIME 2 zipped artifact (.qza), so we can skip importing it. Otherwise, if the downloaded data are zipped (.gz) files or other text files, we need to use the qiime tools import command to import them as artifacts (.qza).

Step 2: Assign taxonomy using QIIME 2 feature-classifier plugin.

Please note that the scikit-learn version used to generate the reference artifact must match the version of scikit-learn currently installed. Otherwise, the qiime feature-classifier command will not work and the feature-classifier plugin will report an error. We specify the classify-sklearn method of the QIIME 2 feature-classifier plugin to assign taxonomy to the representative sequences RepSeqsMiSeq_SOP.qza and save the classified taxonomy as an artifact named TaxonomyMiSeq_SOP.qza.

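# Activate the QIIME 2 environment and move to the working directory
source activate qiime2-2022.2
mkdir -p QIIME2R-Bioinformatics
cd QIIME2R-Bioinformatics
# Classify the representative sequences with the pretrained classifier
qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads RepSeqsMiSeq_SOP.qza \
  --o-classification TaxonomyMiSeq_SOP.qza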
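Because a version mismatch is a common reason for the classification step to fail, it can help to check the installed scikit-learn version inside the active environment first. The following is a minimal Python sketch; the helper names are hypothetical, and 0.24.1 is simply the training version that Table 5.1 reports for the pre-trained classifiers.

```python
from importlib import metadata


def installed_version(package="scikit-learn"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def classifier_is_compatible(installed, trained_with):
    """A pre-trained classifier is only expected to load when the installed
    scikit-learn version equals the version it was trained with."""
    return installed is not None and installed == trained_with


current = installed_version()
print("installed scikit-learn:", current)
print("compatible with a 0.24.1 classifier:",
      classifier_is_compatible(current, "0.24.1"))
```

If the versions differ, the options described in Table 5.1 apply: download the classifier published with the installed QIIME 2 release, or train a classifier in the current environment.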

Step 3: Generate a visualization of the taxonomy artifact.

qiime metadata tabulate \
  --m-input-file TaxonomyMiSeq_SOP.qza \
  --o-visualization TaxonomyMiSeq_SOP.qzv

Now we can review the visualization of the classified sequences in the QIIME 2 viewer.
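Each row of the tabulated taxonomy pairs a feature with a lineage string. As a minimal sketch of how such a string can be broken into named ranks, the Python function below parses a Greengenes-style lineage; it assumes the rank prefixes k__, p__, c__, o__, f__, g__, and s__ separated by semicolons, and the function name and example lineage are illustrative only.

```python
# Greengenes-style rank prefixes (an assumption about the classifier output).
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_greengenes_taxon(taxon):
    """Map a lineage such as 'k__Bacteria; p__Firmicutes' to {rank: name}.

    Ranks with an empty name (e.g. 'g__') are treated as unassigned and
    left out of the result.
    """
    lineage = {}
    for field in taxon.split(";"):
        prefix, separator, name = field.strip().partition("__")
        if separator and prefix in RANKS and name:
            lineage[RANKS[prefix]] = name
    return lineage


example = ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; "
           "o__Bacteroidales; f__S24-7; g__; s__")
print(parse_greengenes_taxon(example))
```

In this example the genus and species fields are empty, so the deepest assigned rank is the family; counting how many features stop at each rank is a quick way to gauge the resolution of a classification.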

Step 4: Visualize taxonomic classifications.

Since we now have three datasets (feature table, taxonomy, and sample metadata), we can use the qiime taxa barplot command to create a bar plot that explores the taxonomic distribution of each sample. Figure 5.1 can be reproduced using the following QIIME 2 commands.

# Figure 5.1:
qiime taxa barplot \
  --i-table FeatureTableMiSeq_SOP.qza \
  --i-taxonomy TaxonomyMiSeq_SOP.qza \
  --m-metadata-file SampleMetadataMiSeq_SOP.tsv \
  --o-visualization TaxaBarPlotsMiSeq_SOP.qzv

Fig. 5.1
Taxonomic profiles for the mouse gut samples at the phylum level, sorted by the abundance of Bacteroidetes. The bar plot was generated by first choosing Taxonomic Level (L2: phylum) and then sorting the samples by taxonomic abundance (k__Bacteria;p__Bacteroidetes)

We can review the taxa bar plot in the QIIME 2 viewer, or use the qiime tools view command to open TaxaBarPlotsMiSeq_SOP.qzv. The bars can be aggregated at the desired taxonomic level, and the samples can be sorted by the abundance of a specific taxonomic group. Because we provided a sample metadata file, we can also sort the samples by metadata groupings. We can also interactively change color schemes and save plots and legends in a vector graphic format.
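Aggregating bars at a taxonomic level amounts to summing the counts of all features that share the same lineage down to that level. The following Python sketch illustrates this with a tiny made-up feature table; the function names, feature IDs, sample IDs, and counts are all hypothetical, and the code is not the implementation used by qiime taxa barplot.

```python
from collections import defaultdict


def collapse(feature_counts, feature_taxa, level=2):
    """Sum per-sample counts of features sharing their first `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature_id, per_sample in feature_counts.items():
        lineage = feature_taxa.get(feature_id, "Unassigned")
        ranks = [rank.strip() for rank in lineage.split(";")]
        key = ";".join(ranks[:level])
        for sample, count in per_sample.items():
            collapsed[key][sample] += count
    return {taxon: dict(samples) for taxon, samples in collapsed.items()}


def relative_abundance(collapsed, sample):
    """Convert the collapsed counts of one sample into proportions."""
    total = sum(samples.get(sample, 0) for samples in collapsed.values())
    if total == 0:
        return {}
    return {taxon: samples.get(sample, 0) / total
            for taxon, samples in collapsed.items()}


# Made-up counts for three features in two samples.
feature_counts = {"f1": {"S1": 30, "S2": 10},
                  "f2": {"S1": 50, "S2": 0},
                  "f3": {"S1": 20, "S2": 30}}
feature_taxa = {"f1": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
                "f2": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
                "f3": "k__Bacteria; p__Firmicutes; c__Clostridia"}

phylum_table = collapse(feature_counts, feature_taxa, level=2)
print(phylum_table)
print(relative_abundance(phylum_table, "S1"))
```

Here level=2 corresponds to the phylum level chosen for Fig. 5.1; the relative abundances of a sample are the segment heights of its bar.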

Step 5: Create a BIOM table with taxonomy annotations (optional).

The .biom format is often used in microbiome studies. A .biom file with taxonomy annotations combines two kinds of information: the feature table [frequency] and the taxonomy. Thus, we first export the feature table as a .biom file using the qiime tools export command as below.

qiime tools export \
  --input-path FeatureTableMiSeq_SOP.qza \
  --output-path ExportedFeatureTableMiSeq_SOP

Then we export the taxonomy information into the same directory, as below.

qiime tools export \
  --input-path TaxonomyMiSeq_SOP.qza \
  --output-path ExportedFeatureTableMiSeq_SOP
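The two exports still have to be combined before the BIOM table carries taxonomy annotations. A commonly used recipe is to rename the header of the exported taxonomy table and then attach it with the biom command-line tool. The Python sketch below covers only the header step; it assumes that the export wrote a tab-separated taxonomy.tsv whose header is "Feature ID", "Taxon", "Confidence", and the function name is hypothetical.

```python
def to_biom_observation_metadata(taxonomy_tsv_text):
    """Rename the header row of an exported QIIME 2 taxonomy table so that
    it can serve as observation metadata for a BIOM table."""
    lines = taxonomy_tsv_text.splitlines()
    if not lines:
        return ""
    n_columns = len(lines[0].split("\t"))
    header = ["#OTUID", "taxonomy", "confidence"][:n_columns]
    return "\n".join(["\t".join(header)] + lines[1:]) + "\n"


exported = ("Feature ID\tTaxon\tConfidence\n"
            "abc\tk__Bacteria; p__Firmicutes\t0.99\n")
print(to_biom_observation_metadata(exported), end="")

# Assumed follow-up outside Python (check `biom add-metadata --help` first;
# the file names below are illustrative):
#   biom add-metadata -i feature-table.biom -o table-with-taxonomy.biom \
#     --observation-metadata-fp biom-taxonomy.tsv --sc-separated taxonomy
```

Only the header row changes; the feature IDs must stay identical to those in the exported feature table so that the two files can be matched.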

5.1.6 Remarks on Taxonomic Classification

A number of bioinformatics tools and reference databases are available for analysis of 16S rRNA amplicon microbiome data. It has been shown that taxonomy assignment often yields different results when different reference databases are used (Balvočiūtė and Huson 2017), especially at the genus level (Sierra et al. 2020). However, there is currently no general criterion to guide the choice of appropriate bioinformatics tools and reference databases for analysis of microbiome data; in particular, there are no defined criteria for data curation and validation of annotations (Sierra et al. 2020). Thus, the annotated results may be inaccurate and irreproducible, making it difficult to compare data across studies.

5.2 Building Phylogenetic Tree

5.2.1 Introduction to Phylogenetic Tree

Microbial taxa are related to one another by a phylogenetic tree, which contains the evolutionary information of the species. Thus, a phylogenetic tree is useful for incorporating biological structure into the analysis of microbiome data (Xia 2020). Accordingly, one central task in computational biology is to infer evolutionary relationships, or phylogenies, from families of related DNA or protein sequences (Price et al. 2010). The FastTree method developed by Price et al. (2009) computes large minimum-evolution trees with profiles instead of a distance matrix.

Tree construction is optional; however, a phylogenetic tree has two primary applications:

1. Phylogenetic tree measures are used in computing phylogenetically based diversity metrics such as unweighted UniFrac (Lozupone and Knight 2005), weighted UniFrac (Lozupone et al. 2007), Faith’s Phylogenetic Diversity (PD) (Faith 1992), or generalized UniFrac distance (Chen et al. 2012). For example, QIIME supports several phylogenetic diversity metrics. In QIIME 2 we can calculate alpha and beta diversities and output core metrics, which include Faith’s PD and the unweighted and weighted UniFrac distance measures. To generate these metrics, in addition to the FeatureTable[Frequency] artifact, a rooted phylogenetic tree that relates the features to one another is needed.
2. Phylogenetic tree-based association analyses also require the phylogenetic tree. For example, phylogenetic tree data are utilized in the general framework for association analysis of taxa (Tang et al. 2017), a predictive method based on a generalized mixed-models framework (Xiao et al. 2018), and a phylogenetic tree-based microbiome association test (Kim et al. 2019). For building a phylogenetic tree in R, the reader is referred to Chap. 2 (Sect. 2.4).
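To illustrate why a rooted tree is needed for the first application, the following Python sketch computes Faith's PD for a sample as the total branch length connecting the observed features to the root. The tree, tip names, and branch lengths are made up for illustration, the function name is hypothetical, and the convention of including the path to the root is an assumption of this sketch.

```python
def faith_pd(parent_of, observed_tips):
    """Sum the branch lengths on the paths from the observed tips to the root.

    `parent_of` maps each non-root node to (parent, branch_length). A branch
    shared by several tips is counted only once.
    """
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        while node in parent_of and node not in visited:
            visited.add(node)
            parent, length = parent_of[node]
            total += length
            node = parent
    return total


# Toy rooted tree ((A:1,B:2)X:0.5,C:3)root with made-up branch lengths.
TOY_TREE = {"A": ("X", 1.0), "B": ("X", 2.0),
            "X": ("root", 0.5), "C": ("root", 3.0)}

print(faith_pd(TOY_TREE, ["A", "B"]))  # branches A, B, and X
print(faith_pd(TOY_TREE, ["A", "C"]))  # branches A, X, and C
```

Two samples with the same number of features can therefore differ in PD: a sample containing A and C spans more of the tree than one containing the close relatives A and B.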

5.2.2 Build a Phylogenetic Tree Using the Alignment and Phylogeny Commands

A phylogenetic tree is built in QIIME 2 via four steps: (1) multiple sequence alignment, (2) masking, (3) tree building, and (4) rooting.

Example 5.2: RepSeqsMiSeq_SOP.qza, Example 5.1 cont.

We can build a phylogenetic tree based on the denoised sequences artifact “RepSeqsMiSeq_SOP.qza”. The align-to-tree-mafft-fasttree pipeline from the q2-phylogeny plugin wraps this four-step process; here we run the four steps one by one, as below.

Step 1: Conduct a multiple sequence alignment using MAFFT.

MAFFT is a multiple sequence alignment program based on the fast Fourier transform, used in evolutionary analyses of biological sequences (Katoh and Standley 2013; Katoh et al. 2002). MAFFT offers various alignment strategies as options: progressive methods, iterative refinement methods, and structural alignment methods for RNAs. MAFFT is a similarity-based multiple sequence alignment (MSA) method that also takes evolutionary information into account, because evolutionary information is useful even for similarity-based methods (Katoh and Standley 2013). QIIME 2 wraps MAFFT’s multiple sequence alignment in the qiime alignment mafft command.

source activate qiime2-2022.2
cd QIIME2R-Bioinformatics
qiime alignment mafft \
  --i-sequences RepSeqsMiSeq_SOP.qza \
  --o-alignment AlignedRepSeqsMiSeq_SOP.qza

The above MAFFT command aligned the denoised sequences in the FeatureData[Sequence] artifact (in this case, RepSeqsMiSeq_SOP.qza) and created a FeatureData[AlignedSequence] QIIME 2 artifact (which we named AlignedRepSeqsMiSeq_SOP.qza).

Step 2: Mask the alignment.

Highly variable positions can add noise to the resulting phylogenetic tree. The purpose of masking (i.e., filtering) the alignment is to remove these highly variable positions (highly gapped columns) so that the remaining positions are conserved enough to provide meaningful information. Below, we mask the uninformative positions via the qiime alignment mask command.

QIIME 2 uses a minimum conservation of 40% (the default), set via the parameter --p-min-conservation, to reproduce the mask presented in Lane (1991). With a value of 0.4, a column is retained only if it contains at least one character that is present in at least 40% of the sequences. Another default parameter used here is --p-max-gap-frequency, whose default value of 1 retains all columns regardless of gap character frequency. If a value of 0 is chosen, only columns without gap characters are retained.
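The effect of the two masking parameters can be seen on a toy alignment. The Python sketch below applies the rule as described above: a column is kept when its most common non-gap character reaches the minimum conservation and its gap frequency does not exceed the maximum. It is a simplified illustration with a made-up alignment and hypothetical function names, not the q2-alignment implementation.

```python
from collections import Counter

GAP_CHARS = "-."


def keep_column(column, min_conservation=0.4, max_gap_frequency=1.0):
    """Decide whether one alignment column passes both masking criteria."""
    n_sequences = len(column)
    gap_frequency = sum(column.count(gap) for gap in GAP_CHARS) / n_sequences
    residues = Counter(char for char in column if char not in GAP_CHARS)
    # Relative frequency of the most common non-gap character.
    conservation = max(residues.values(), default=0) / n_sequences
    return (conservation >= min_conservation
            and gap_frequency <= max_gap_frequency)


def mask_alignment(alignment, min_conservation=0.4, max_gap_frequency=1.0):
    """Return the alignment restricted to the columns that are kept."""
    columns = list(zip(*alignment.values()))
    kept = [index for index, column in enumerate(columns)
            if keep_column(column, min_conservation, max_gap_frequency)]
    return {name: "".join(sequence[index] for index in kept)
            for name, sequence in alignment.items()}


# Made-up alignment of five sequences and four columns.
TOY_ALIGNMENT = {"s1": "ACG-", "s2": "ATG-", "s3": "AGGT",
                 "s4": "A-C-", "s5": "ACT-"}

print(mask_alignment(TOY_ALIGNMENT))                         # defaults
print(mask_alignment(TOY_ALIGNMENT, max_gap_frequency=0.0))  # no gaps allowed
```

With the defaults only the last column is removed, because its single T is present in just 20% of the sequences; setting the maximum gap frequency to 0 additionally removes the second column, which contains one gap.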

qiime alignment mask \
  --i-alignment AlignedRepSeqsMiSeq_SOP.qza \
  --o-masked-alignment MaskedAlignedRepSeqsMiSeq_SOP.qza

Step 3: Create the tree using the FastTree program.

FastTree (Price et al. 2009, 2010) is a bioinformatic tool for inferring phylogenies for alignments. So far two versions of FastTree have been released.

Utilizing the “minimum-evolution” principle, FastTree 1 (Price et al. 2009) tries to find a topology that minimizes the amount of evolution, i.e., the sum of the branch lengths. Using a heuristic variant of the neighbor-joining method (Saitou and Nei 1987; Studier and Keppler 1988), FastTree 1 quickly finds a starting tree and then refines the topology using nearest-neighbor interchanges (NNIs) (Price et al. 2010). FastTree 2 improves on the topological accuracy (the proportion of the splits in the true trees that are recovered) of FastTree 1 and is reported to be 100–1,000 times faster than maximum-likelihood programs such as PhyML 3 (Price et al. 2010). It outperforms other methods, including PhyML 3 with default settings (NNI search) (Guindon et al. 2009, 2010), the standard implementation of maximum-likelihood NNIs, and minimum-evolution and parsimony methods, although it is not as accurate as the maximum-likelihood (ML) methods that use subtree-pruning-regrafting (SPR) moves (Price et al. 2010). FastTree 2 achieves this accuracy mostly because it (1) adds minimum-evolution SPRs, (2) adds maximum-likelihood NNIs, (3) uses heuristics to restrict the search for better trees, and (4) estimates a rate of evolution for each site (Price et al. 2010).

QIIME 2 builds a phylogenetic tree based on FastTree (Price et al. 2010) via the qiime phylogeny fasttree command.

qiime phylogeny fasttree \
  --i-alignment MaskedAlignedRepSeqsMiSeq_SOP.qza \
  --o-tree UnrootedTreeMiSeq_SOP.qza

Step 4: Root the tree at the midpoint of its longest tip-to-tip path.

FastTree generates an unrooted phylogenetic tree from the masked alignment. However, some downstream analyses require a rooted tree; thus we place the root at the midpoint of the longest tip-to-tip distance, that is, halfway between the two leaves that are furthest from one another (called “midpoint rooting”). This produces the rooted tree artifact “RootedTreeMiSeq_SOP.qza”, which can be used as input to generate phylogenetic diversity measures.

qiime phylogeny midpoint-root \
  --i-tree UnrootedTreeMiSeq_SOP.qza \
  --o-rooted-tree RootedTreeMiSeq_SOP.qza

5.2.3 Remarks on the Taxonomic and Phylogenetic Trees

Taxonomy and phylogeny are two concepts involved in the classification of organisms. Taxonomy stems from ancient Greek taxis, meaning “arrangement,” and nomia, meaning “method.” Taxonomy is the field of classifying, identifying, and naming biological organisms based on their shared characteristics, that is, their similarities and dissimilarities (Xia et al. 2018).

Classification of organisms was first introduced by the Swedish botanist Carl Linnaeus (1707–1778), known as the father of taxonomy. He developed a system for categorizing organisms, known as Linnaean taxonomy, together with binomial nomenclature for naming them.

Linnaeus and others ranked all living organisms into seven biological groups or levels of classification in the taxonomic hierarchy: kingdom, phylum, class, order, family, genus, and species. There are no domains in their classification. The domain level traces back to the primary lines of descent identified by Woese and Fox (1977) and was formally proposed as a level above kingdom by Woese et al. (1990). The three domains of life are Archaea, Bacteria, and Eukarya, and the five major kingdoms are Monera, Protista, Fungi, Plantae, and Animalia. Thus, we can classify all living organisms into eight major hierarchical levels, from domain (the most general) to species (the most specific): domain, kingdom, phylum, class, order, family, genus, and species.
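The eight levels can be represented directly as an ordered list of ranks. The short Python sketch below is only an illustration: the function name label_lineage is hypothetical, and the example uses the familiar human lineage rather than a microbial one.

```python
# The eight hierarchical levels, from the most general to the most specific.
RANKS = ["domain", "kingdom", "phylum", "class",
         "order", "family", "genus", "species"]


def label_lineage(names):
    """Pair each name with its rank; a lineage may stop before species."""
    if len(names) > len(RANKS):
        raise ValueError("a lineage has at most eight levels")
    return dict(zip(RANKS, names))


human = label_lineage(["Eukarya", "Animalia", "Chordata", "Mammalia",
                       "Primates", "Hominidae", "Homo", "Homo sapiens"])
print(human["domain"], human["family"], human["species"])
```

A lineage that is resolved only to a higher rank, such as a sequence classified no further than its phylum, simply yields a shorter dictionary.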

A phylogenetic tree (also called phylogeny or evolutionary tree) (Felsenstein 2004) is a branching diagram showing the evolutionary relationships among species or groups of species that share a common ancestor, based upon similarities and differences in their physical or genetic characteristics.

Various ways have been developed to graphically represent phylogenetic trees (Letunic and Bork 2006). Both taxonomy and phylogeny are important for the classification of organisms, and phylogeny is important in building taxonomy. Researchers have attempted to synthesize phylogeny and taxonomy into a comprehensive tree of life (Hinchliff et al. 2015). However, taxonomy and phylogeny, as well as the taxonomic tree and the phylogenetic tree, are different. The key difference between these two pairs of concepts is that taxonomy and the taxonomic tree involve naming and classifying organisms, while phylogeny and the phylogenetic tree involve the evolution of species or groups of species. Taxonomy focuses on naming and classifying organisms and hence does not by itself reveal the shared evolutionary history of organisms. In contrast, phylogeny focuses on the evolutionary relationships of organisms and hence reveals their shared evolutionary history.

Additionally, as the illustrative examples above show, the two trees are also reconstructed differently. For amplicon sequencing, the taxonomic tree is typically reconstructed when the phylogeny is not available but taxonomic annotations are. It is reconstructed from lineages extracted from regularly updated databases such as NCBI (Federhen 2011; Geer et al. 2010) and represents the arrangement of taxa from the domain rank down to the species rank; as the discovery of new species continues, the assignment of new taxa in the taxonomic hierarchy will never end. Thus, taxonomic trees are highly polytomic, that is, many nodes have more than two children. In contrast, the phylogenetic tree is reconstructed based on the sequence divergence of the taxa (of the marker gene) (Price et al. 2010), and it encodes the common evolutionary history of the taxa. In other words, phylogenetic trees are usually reconstructed from morphological or genetic homology: they reveal the evolutionary relationships of species by comparing anatomical traits, and they identify descent from an ancestral gene by analyzing genetic differences among species. Thus, unlike taxonomic trees, phylogenetic trees are hypothetical (Dubois et al. 2021; Felsenstein 2004).
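The reconstruction of a taxonomic tree from lineages, and the polytomies that result, can be sketched in a few lines of Python. This is an illustration under simple assumptions: each lineage is a list of names ordered from the most general to the most specific rank, the tree is stored as nested dictionaries, the function names are hypothetical, and the three truncated lineages are used only as an example.

```python
def build_taxonomic_tree(lineages):
    """Merge lineages (general to specific) into a nested-dictionary tree."""
    root = {}
    for lineage in lineages:
        node = root
        for name in lineage:
            # Reuse the child if this taxon was already seen, else create it.
            node = node.setdefault(name, {})
    return root


def polytomies(tree, name="root"):
    """Return (name, number_of_children) for every node with more than two children."""
    found = []
    if len(tree) > 2:
        found.append((name, len(tree)))
    for child_name, child in tree.items():
        found.extend(polytomies(child, child_name))
    return found


# Illustrative lineages, truncated to family, genus, and species.
lineages = [
    ["Bacteroidaceae", "Bacteroides", "Bacteroides fragilis"],
    ["Bacteroidaceae", "Bacteroides", "Bacteroides thetaiotaomicron"],
    ["Bacteroidaceae", "Bacteroides", "Bacteroides ovatus"],
]
tree = build_taxonomic_tree(lineages)
print(polytomies(tree))  # [('Bacteroides', 3)]
```

Because every species of a genus hangs directly under the same genus node, adding newly described species only widens such nodes, whereas a phylogenetic tree would resolve the branching order among them from sequence divergence.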

Phylogenetic classification has two main advantages over the Linnaean classification. First, phylogenetic classification reveals the evolutionary history of the organism, that is, the important underlying biological processes that are responsible for the diversity of organisms. Second, phylogenetic classification does not attempt to “rank” organisms and hence avoids the misleading impression that different groupings with the same rank are equivalent and comparable. Indeed, compared with the phylogenetic tree, the taxonomic tree ignores the granular differences among taxa belonging to the same rank. However, a disadvantage of phylogenies is that they do not capture similarities between taxa in terms of abundance profiles (Bichat et al. 2020).

The human microbiome is very complex, with genetic and evolutionary relationships among its species. To understand this complexity, it is important to recognize these relationships. In microbiome data they are encoded as the taxonomic and phylogenetic trees, both of which play important roles in microbiome studies. As two unique features of microbiome data, these two trees are usually used for different measures and are required by different strategies of statistical analysis.

Integrating the information in the taxonomic and phylogenetic trees into statistical analysis is expected to increase the power of statistical hypothesis tests. For example, it has been argued that reference phylogenies are crucial for providing a taxonomic framework for interpreting marker-gene and metagenomic surveys, which continue to reveal novel species (McDonald et al. 2012), and that leveraging the phylogenetic tree of the taxa can increase statistical power while controlling the False Discovery Rate (FDR) (Sankaran and Holmes 2014; Xiao et al. 2017).

However, the premise that the phylogenetic (or taxonomic) tree is the relevant hierarchical structure to incorporate in differential abundance studies has been questioned in a recent study, which showed that incorporating phylogenetic information in microbiome differential abundance analyses has no effect on detection power and FDR control (Bichat et al. 2020). Instead, that study proposed and advocated the use of a correlation tree. A correlation tree is a clustering tree built from the abundance profiles of taxa across samples, in which taxa with highly correlated abundances are very close to each other. Building the correlation tree involves three logical steps: (1) computing the pairwise correlation matrix using the Spearman correlation, excluding samples in which both taxa are absent; (2) transforming the correlation matrix into a dissimilarity matrix; and (3) creating the correlation tree by hierarchical clustering with Ward linkage on this dissimilarity matrix. Branch lengths correspond to the dissimilarity cost of merging two subtrees.
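The first two steps of this construction can be sketched in plain Python. This is a hedged illustration rather than the code of Bichat et al. (2020): the text here does not state which transformation is used in step (2), so the sketch assumes the common choice of dissimilarity = 1 - correlation; the function names and the toy abundance profiles are hypothetical, and each pair of taxa is assumed to retain at least two non-constant observations after filtering.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_no_double_zero(x, y):
    """Step 1: Spearman correlation after dropping samples where both taxa are absent."""
    kept = [(a, b) for a, b in zip(x, y) if not (a == 0 and b == 0)]
    xs, ys = zip(*kept)
    return pearson(ranks(xs), ranks(ys))


def dissimilarity_matrix(abundances):
    """Step 2 (assumed transformation): dissimilarity = 1 - Spearman correlation."""
    taxa = list(abundances)
    matrix = {}
    for s in taxa:
        for t in taxa:
            if s == t:
                matrix[(s, t)] = 0.0
            else:
                rho = spearman_no_double_zero(abundances[s], abundances[t])
                matrix[(s, t)] = 1.0 - rho
    return matrix


# Hypothetical abundance profiles of three taxa across five samples.
abundances = {
    "taxon_a": [0, 5, 10, 20, 0],
    "taxon_b": [0, 4, 9, 30, 1],
    "taxon_c": [0, 30, 9, 4, 2],
}
d = dissimilarity_matrix(abundances)
print(d[("taxon_a", "taxon_b")], d[("taxon_a", "taxon_c")])
```

In this toy example taxon_a and taxon_b rise and fall together and end up at dissimilarity zero, while taxon_c is far from both. For step (3), the resulting matrix would be passed to any hierarchical clustering routine that supports Ward linkage.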

The correlation tree was reported to be a better proxy for biological function than the phylogenetic tree and to increase detection power with better FDR control (Bichat et al. 2020). However, these claims need to be evaluated by other studies, and it deserves further discussion and assessment whether the above arguments hold and whether the correlation tree is more important than the phylogenetic tree in differential abundance analysis.

Reviewing the history of numerical taxonomy, we learn that as early as the 1950s “taxonomic importance” was criticized by Cain (Cain 1958). Cain was not opposed to phylogeny per se; rather, he thought that phylogenetic information ought to be incorporated into classification when it is available and reliable. However, he recognized that this was very difficult to do and that in some cases classification should be purely phenotypic (Vernon 1988).

5.3 Summary

This chapter covered two important topics in the bioinformatic analysis of microbiome data: assigning taxonomy and building a phylogenetic tree. For assigning taxonomy, we first reviewed various bioinformatics tools and reference databases and then summarized in tables the taxonomic reference databases formatted and maintained for QIIME 2 and DADA2. Next, we introduced the q2-feature-classifier and illustrated how to use it to assign taxonomy, followed by a brief remark on taxonomic classification. For building a phylogenetic tree, we first introduced the phylogenetic tree and then illustrated how to build one using the alignment and phylogeny commands. Finally, we provided comprehensive remarks on the taxonomic and phylogenetic trees. Chapter 6 will introduce how to cluster sequences into OTUs.

References

Alishum, Ali. 2019. DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea (Version 1) [Data set]. Zenodo. Accessed August 12. https://doi.org/10.5281/zenodo.2541239.
Balvočiūtė, Monika, and Daniel H. Huson. 2017. SILVA, RDP, greengenes, NCBI and OTT – How do these taxonomies compare? BMC Genomics 18 (2): 114. https://doi.org/10.1186/s12864-017-3501-4.
Benson, D.A., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers. 2013. GenBank. Nucleic Acids Research 41 (Database issue): D36–D42. https://doi.org/10.1093/nar/gks1195.
Bichat, Antoine, Jonathan Plassais, Christophe Ambroise, and Mahendra Mariadassou. 2020. Incorporating phylogenetic information in microbiome differential abundance studies has no effect on detection power and FDR control. Frontiers in Microbiology 11 (649). https://doi.org/10.3389/fmicb.2020.00649. https://www.frontiersin.org/article/10.3389/fmicb.2020.00649.
Bokulich, Nicholas A., Benjamin D. Kaehler, Jai Ram Rideout, Matthew Dillon, Evan Bolyen, Rob Knight, Gavin A. Huttley, and J. Gregory Caporaso. 2018. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome 6 (1): 90. https://doi.org/10.1186/s40168-018-0470-z.
Bokulich, Nicholas, Mike Robeson, Matthew Dillon, Michal Ziemski, Ben Kaehler, and Devon O’Rourke. 2021. bokulich-lab/RESCRIPt: 2021.8.0.dev0 (2021.8.0.dev0). Zenodo. Accessed 12 Aug. 2021
Bokulich, Nicholas, Matthew Dillon, Evan Bolyen, Benjamin Kaehler, Gavin Huttley, and J. Caporaso. 2018. q2-sample-classifier: Machine-learning tools for microbiome classification and regression. Journal of Open Source Software 3: 934. https://doi.org/10.21105/joss.00934.
Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian C. Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, Eric J. Alm, Manimozhiyan Arumugam, Francesco Asnicar, Yang Bai, Jordan E. Bisanz, Kyle Bittinger, Asker Brejnrod, Colin J. Brislawn, C. Titus Brown, Benjamin J. Callahan, Andrés Mauricio Caraballo-Rodríguez, John Chase, Emily K. Cope, Ricardo Da Silva, Christian Diener, Pieter C. Dorrestein, Gavin M. Douglas, Daniel M. Durall, Claire Duvallet, Christian F. Edwardson, Madeleine Ernst, Mehrbod Estaki, Jennifer Fouquier, Julia M. Gauglitz, Sean M. Gibbons, Deanna L. Gibson, Antonio Gonzalez, Kestrel Gorlick, Jiarong Guo, Benjamin Hillmann, Susan Holmes, Hannes Holste, Curtis Huttenhower, Gavin A. Huttley, Stefan Janssen, Alan K. Jarmusch, Lingjing Jiang, Benjamin D. Kaehler, Kyo Bin Kang, Christopher R. Keefe, Paul Keim, Scott T. Kelley, Dan Knights, Irina Koester, Tomasz Kosciolek, Jorden Kreps, Morgan G.I. Langille, Joslynn Lee, Ruth Ley, Yong-Xin Liu, Erikka Loftfield, Catherine Lozupone, Massoud Maher, Clarisse Marotz, Bryan D. Martin, Daniel McDonald, Lauren J. McIver, Alexey V. Melnik, Jessica L. Metcalf, Sydney C. Morgan, Jamie T. Morton, Ahmad Turan Naimey, Jose A. Navas-Molina, Louis Felix Nothias, Stephanie B. Orchanian, Talima Pearson, Samuel L. Peoples, Daniel Petras, Mary Lai Preuss, Elmar Pruesse, Lasse Buur Rasmussen, Adam Rivers, Michael S. Robeson 2nd, Patrick Rosenthal, Nicola Segata, Michael Shaffer, Arron Shiffer, Rashmi Sinha, Se Jin Song, John R. Spear, Austin D. Swafford, Luke R. Thompson, Pedro J. Torres, Pauline Trinh, Anupriya Tripathi, Peter J. Turnbaugh, Sabah Ul-Hasan, Justin J.J. van der Hooft, Fernando Vargas, Yoshiki Vázquez-Baeza, Emily Vogtmann, Max von Hippel, William Walters, Yunhu Wan, Mingxun Wang, Jonathan Warren, Kyle C. Weber, Charles H.D. Williamson, Amy D. Willis, Zhenjiang Zech Xu, Jesse R. Zaneveld, Yilong Zhang, Qiyun Zhu, Rob Knight, and J. Gregory Caporaso. 2019. 
Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37 (8): 852–857. https://doi.org/10.1038/s41587-019-0209-9. https://pubmed.ncbi.nlm.nih.gov/31341288. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015180/.
Cain, A.J. 1958. Chromosomes and their taxonomic importance. Proceedings of the Linnean Society of London 169: 125–128.
Callahan, Benjamin. 2016. The RDP and GreenGenes taxonomic training sets formatted for DADA2 [Data set]. Zenodo. Accessed 13 Aug.
———. 2021. DADA2 pipeline tutorial (1.16). https://benjjneb.github.io/dada2/tutorial.html. Accessed 25 Jan 2021.
Callahan, B.J., P.J. McMurdie, M.J. Rosen, A.W. Han, A.J. Johnson, and S.P. Holmes. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13 (7): 581–583.
Caporaso, J. Gregory, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D. Bushman, Elizabeth K. Costello, Noah Fierer, Antonio Gonzalez Peña, Julia K. Goodrich, Jeffrey I. Gordon, Gavin A. Huttley, Scott T. Kelley, Dan Knights, Jeremy E. Koenig, Ruth E. Ley, Catherine A. Lozupone, Daniel McDonald, Brian D. Muegge, Meg Pirrung, Jens Reeder, Joel R. Sevinsky, Peter J. Turnbaugh, William A. Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld, and Rob Knight. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7: 335. https://doi.org/10.1038/nmeth.f.303. https://www.nature.com/articles/nmeth.f.303#supplementary-information.
Chen, Jun, Kyle Bittinger, Emily S. Charlson, Christian Hoffmann, James Lewis, Gary D. Wu, Ronald G. Collman, Frederic D. Bushman, and Hongzhe Li. 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics (Oxford, England) 28 (16): 2106–2113. https://doi.org/10.1093/bioinformatics/bts342. https://pubmed.ncbi.nlm.nih.gov/22711789. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3413390/.
Cole, J.R., B. Chai, R.J. Farris, Q. Wang, S.A. Kulam, D.M. McGarrell, G.M. Garrity, and J.M. Tiedje. 2005. The Ribosomal Database Project (RDP-II): Sequences and tools for high-throughput rRNA analysis. Nucleic Acids Research 33 (Database issue): D294–D296. https://doi.org/10.1093/nar/gki038. https://www.ncbi.nlm.nih.gov/pubmed/15608200. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC539992/.
Czaplicki, Lauren. 2017. RDP LSU taxonomic training data formatted for DADA2 (trainingset 11) [Data set]. Zenodo. Accessed 13 Aug.
DeSantis, T.Z., P. Hugenholtz, N. Larsen, M. Rojas, E.L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G.L. Andersen. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology 72 (7): 5069–5072. https://doi.org/10.1128/AEM.03006-05. https://journals.asm.org/doi/abs/10.1128/AEM.03006-05.
Diener, Christian 2016. HITdb v1.00 for Dada2 [Data set]. Zenodo. Accessed 13 Aug.
Dubois, Alain, Annemarie Ohler, and R. Alexander Pyron. 2021. New concepts and methods for phylogenetic taxonomy and nomenclature in zoology, exemplified by a new ranked cladonomy of recent amphibians (Lissamphibia). Megataxa 5 (1): 1–738.
Faith, Daniel P. 1992. Conservation evaluation and phylogenetic diversity. Biological Conservation 61 (1): 1–10. https://doi.org/10.1016/0006-3207(92)91201-3. http://www.sciencedirect.com/science/article/pii/0006320792912013.
Federhen, Scott. 2011. The NCBI taxonomy database. Nucleic Acids Research 40 (D1): D136–D143. https://doi.org/10.1093/nar/gkr1178.
Felsenstein, Joseph. 2004. Inferring phylogenies. Sunderland: Sinauer Associates, Inc.
Geer, L.Y., A. Marchler-Bauer, R.C. Geer, L. Han, J. He, S. He, C. Liu, W. Shi, and S.H. Bryant. 2010. The NCBI BioSystems database. Nucleic Acids Research 38 (Database issue): D492–D496. https://doi.org/10.1093/nar/gkp858.
Guindon, Stéphane, Frédéric Delsuc, Jean-François Dufayard, and Olivier Gascuel. 2009. Estimating maximum likelihood phylogenies with PhyML. In Bioinformatics for DNA sequence analysis, 113–137. Springer.
Guindon, Stéphane, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology 59 (3): 307–321. https://doi.org/10.1093/sysbio/syq010.
Hinchliff, Cody E., Stephen A. Smith, James F. Allman, J. Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Laura A. Katz, H. Dail Laughinghouse, Emily Jane McTavish, Peter E. Midford, Christopher L. Owen, Richard H. Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams, and Karen A. Cranston. 2015. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences 112 (41): 12764–12769. https://doi.org/10.1073/pnas.1423041112. https://www.pnas.org/content/pnas/112/41/12764.full.pdf.
Kaehler, Benjamin D., Nicholas A. Bokulich, Daniel McDonald, Rob Knight, J. Gregory Caporaso, and Gavin A. Huttley. 2019. Species abundance information improves sequence taxonomy classification accuracy. Nature Communications 10 (1): 4643. https://doi.org/10.1038/s41467-019-12669-6.
Katoh, Kazutaka, and Daron M. Standley. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30 (4): 772–780. https://doi.org/10.1093/molbev/mst010.
Katoh, Kazutaka, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30 (14): 3059–3066. https://doi.org/10.1093/nar/gkf436.
Kim, Kang Jin, Jaehyun Park, Sang-Chul Park, and Sungho Won. 2019. Phylogenetic tree-based microbiome association test. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz686.
Kõljalg, Urmas, R. Henrik Nilsson, Kessy Abarenkov, Leho Tedersoo, Andy F.S. Taylor, Mohammad Bahram, Scott T. Bates, Thomas D. Bruns, Johan Bengtsson-Palme, Tony M. Callaghan, Brian Douglas, Tiia Drenkhan, Ursula Eberhardt, Margarita Dueñas, Tine Grebenc, Gareth W. Griffith, Martin Hartmann, Paul M. Kirk, Petr Kohout, Ellen Larsson, Björn D. Lindahl, Robert Lücking, María P. Martín, P. Brandon Matheny, Nhu H. Nguyen, Tuula Niskanen, Jane Oja, Kabir G. Peay, Ursula Peintner, Marko Peterson, Kadri Põldmaa, Lauri Saag, Irja Saar, Arthur Schüßler, James A. Scott, Carolina Senés, Matthew E. Smith, Ave Suija, D. Lee Taylor, M. Teresa Telleria, Michael Weiss, and Karl-Henrik Larsson. 2013. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22 (21): 5271–5277. https://doi.org/10.1111/mec.12481. https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.12481.
Lane, D.J. 1991. 16S/23S rRNA sequencing. In Nucleic acid techniques in bacterial systematics, 115–175. New York: Wiley.
Leinonen, Rasko, Ruth Akhtar, Ewan Birney, Lawrence Bower, Ana Cerdeno-Tárraga, Ying Cheng, Iain Cleland, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Mikyung Jang, Nima Pakseresht, Sheila Plaister, Rajesh Radhakrishnan, Kethi Reddy, Siamak Sobhany, Petra Ten Hoopen, Robert Vaughan, Vadim Zalunin, and Guy Cochrane. 2011. The European Nucleotide Archive. Nucleic Acids Research 39 (Database issue): D28–D31. https://doi.org/10.1093/nar/gkq967. https://pubmed.ncbi.nlm.nih.gov/20972220. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013801/.
Letunic, Ivica, and Peer Bork. 2006. Interactive Tree Of Life (iTOL): An online tool for phylogenetic tree display and annotation. Bioinformatics 23 (1): 127–128. https://doi.org/10.1093/bioinformatics/btl529.
Lozupone, Catherine, and Rob Knight. 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71 (12): 8228–8235. https://doi.org/10.1128/AEM.71.12.8228-8235.2005. https://www.ncbi.nlm.nih.gov/pubmed/16332807. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1317376/.
Lozupone, Catherine A., Micah Hamady, Scott T. Kelley, and Rob Knight. 2007. Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73 (5): 1576–1585. https://doi.org/10.1128/AEM.01996-06. https://www.ncbi.nlm.nih.gov/pubmed/17220268. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1828774/.
Maidak, Bonnie L., James R. Cole, Timothy G. Lilburn, Charles T. Parker Jr, Paul R. Saxman, Jason M. Stredwick, George M. Garrity, Bing Li, Gary J. Olsen, Sakti Pramanik, Thomas M. Schmidt, and James M. Tiedje. 2000. The RDP (Ribosomal Database Project) continues. Nucleic Acids Research 28 (1): 173–174. https://doi.org/10.1093/nar/28.1.173.
Mashima, Jun, Yuichi Kodama, Takatomo Fujisawa, Toshiaki Katayama, Yoshihiro Okuda, Eli Kaminuma, Osamu Ogasawara, Kousaku Okubo, Yasukazu Nakamura, and Toshihisa Takagi. 2017. DNA Data Bank of Japan. Nucleic Acids Research 45 (D1): D25–D31. https://doi.org/10.1093/nar/gkw1001. https://pubmed.ncbi.nlm.nih.gov/27924010. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210514/.
McDonald, Daniel, Morgan N. Price, Julia Goodrich, Eric P. Nawrocki, Todd Z. DeSantis, Alexander Probst, Gary L. Andersen, Rob Knight, and Philip Hugenholtz. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME Journal 6 (3): 610–618. https://doi.org/10.1038/ismej.2011.139.
Morien, Evan, and Laura W. Parfrey. 2018. SILVA v128 and v132 dada2 formatted 18s ‘train sets’ (1.0) [Data set]. Zenodo. Accessed 13 Aug.
Moynihan, M.A. 2020. nifHdada2: v1.1.0 (v1.1.0). Zenodo. Accessed 13 Aug.
Nakamura, Yasukazu, Guy Cochrane, Ilene Karsch-Mizrachi, and on behalf of the International Nucleotide Sequence Database Collaboration. 2012. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Research 41 (D1): D21–D24. https://doi.org/10.1093/nar/gks1084.
Nilakanta, Haema, Kimberly L. Drews, Suzanne Firrell, Mary A. Foulkes, and Kathleen A. Jablonski. 2014. A review of software for analyzing molecular sequences. BMC Research Notes 7: 830–830. https://doi.org/10.1186/1756-0500-7-830. https://pubmed.ncbi.nlm.nih.gov/25421430. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4258797/.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Plummer, E., J. Twin, D.M. Bulach, S.M. Garland, and S.N. Tabrizi. 2015. A comparison of three bioinformatics pipelines for the analysis of preterm gut microbiota using 16S rRNA gene sequencing data. Journal of Proteomics and Bioinformatics 8: 283–291. https://doi.org/10.4172/jpb.1000381.
Põlme, Sergei, Kessy Abarenkov, Rolf Henrik Nilsson, Björn Lindahl, Karina Clemmensen, Håvard Kauserud, Nhu Nguyen, Rasmus Kjøller, Scott Bates, Petr Baldrian, Tobias Frøslev, Kristjan Adojaan, Alfredo Vizzini, Ave Suija, Donald Pfister, Hans-Otto Baral, Helle Järv, Hugo Madrid, and Jenni Nordén. 2020. FungalTraits: A user-friendly traits database of fungi and fungus-like stramenopiles. Fungal Diversity 105: 1–16. https://doi.org/10.1007/s13225-020-00466-2.
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2009. FastTree: Computing large minimum evolution trees with profiles instead of a distance matrix. Molecular Biology and Evolution 26 (7): 1641–1650. https://doi.org/10.1093/molbev/msp077. https://pubmed.ncbi.nlm.nih.gov/19377059; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2693737/.
———. 2010. FastTree 2 – Approximately maximum-likelihood trees for large alignments. PLoS ONE 5 (3): e9490. https://doi.org/10.1371/journal.pone.0009490. https://www.ncbi.nlm.nih.gov/pubmed/20224823. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/.
Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research 35 (21): 7188–7196. https://doi.org/10.1093/nar/gkm864.
Pruesse, Elmar, Jörg Peplies, and Frank Oliver Glöckner. 2012. SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28 (14): 1823–1829. https://doi.org/10.1093/bioinformatics/bts252.
Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research 41 (D1): D590–D596. https://doi.org/10.1093/nar/gks1219.
———. 2013. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research 41 (Database issue): D590–D596. https://doi.org/10.1093/nar/gks1219. https://pubmed.ncbi.nlm.nih.gov/23193283. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531112/.
Ritari, Jarmo, Jarkko Salojärvi, Leo Lahti, and Willem M. de Vos. 2015. Improved taxonomic assignment of human intestinal 16S rRNA sequences by a dedicated reference database. BMC Genomics 16 (1): 1056. https://doi.org/10.1186/s12864-015-2265-y.
Rosen, Michael J., Benjamin J. Callahan, Daniel S. Fisher, and Susan P. Holmes. 2012. Denoising PCR-amplified metagenome data. BMC Bioinformatics 13: 283–283. https://doi.org/10.1186/1471-2105-13-283. https://www.ncbi.nlm.nih.gov/pubmed/23113967. https://www.ncbi.nlm.nih.gov/pmc/PMC3563472/.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4 (4): 406–425. https://doi.org/10.1093/oxfordjournals.molbev.a040454.
Sankaran, Kris, and Susan Holmes. 2014. structSSI: Simultaneous and selective inference for grouped or hierarchically structured data. Journal of Statistical Software 59 (13). https://doi.org/10.18637/jss.v059.i13. https://www.jstatsoft.org/v059/i13.
Schloss, Patrick D., Sarah L. Westcott, Thomas Ryabin, Justine R. Hall, Martin Hartmann, Emily B. Hollister, Ryan A. Lesniewski, Brian B. Oakley, Donovan H. Parks, Courtney J. Robinson, Jason W. Sahl, Blaz Stres, Gerhard G. Thallinger, David J. Van Horn, and Carolyn F. Weber. 2009. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 75 (23): 7537–7541. https://doi.org/10.1128/AEM.01541-09. https://journals.asm.org/doi/abs/10.1128/AEM.01541-09.
Sierra, Maria A., Qianhao Li, Smruti Pushalkar, Bidisha Paul, Tito A. Sandoval, Angela R. Kamer, Patricia Corby, Yuqi Guo, Ryan Richard Ruff, and Alexander V. Alekseyenko. 2020. The influences of bioinformatics tools and reference databases in analyzing the human oral microbial community. Genes 11 (8): 878.
Studier, J.A., and K.J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution 5 (6): 729–731. https://doi.org/10.1093/oxfordjournals.molbev.a040527.
Tang, Zheng-Zheng, Guanhua Chen, Alexander V. Alekseyenko, and Hongzhe Li. 2017. A general framework for association analysis of microbial communities on a taxonomic tree. Bioinformatics (Oxford, England) 33 (9): 1278–1285. https://doi.org/10.1093/bioinformatics/btw804. https://www.ncbi.nlm.nih.gov/pubmed/28003264. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408811/.
Vernon, Keith. 1988. The founding of numerical taxonomy. The British Journal for the History of Science 21 (2): 143–159.
Wang, Qiong, George M. Garrity, James M. Tiedje, and James R. Cole. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology 73 (16): 5261–5267. https://doi.org/10.1128/AEM.00062-07. https://pubmed.ncbi.nlm.nih.gov/17586664. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950982/.
Woese, Carl R., and George E. Fox. 1977. Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences 74 (11): 5088–5090. https://doi.org/10.1073/pnas.74.11.5088. https://www.pnas.org/content/pnas/74/11/5088.full.pdf.
Woese, C.R., O. Kandler, and M.L. Wheelis. 1990. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National Academy of Sciences of the United States of America 87 (12): 4576–4579. https://doi.org/10.1073/pnas.87.12.4576. https://pubmed.ncbi.nlm.nih.gov/2112744. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC54159/.
Xia, Y. 2020. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 171: 309–491. https://doi.org/10.1016/bs.pmbts.2020.04.003.
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018. Bioinformatic analysis of microbiome data. In Statistical Analysis of Microbiome Data with R, 1–27. Singapore: Springer.
Xiao, Jian, Hongyuan Cao, and Jun Chen. 2017. False discovery rate control incorporating phylogenetic tree increases detection power in microbiome-wide multiple testing. Bioinformatics 33 (18): 2873–2881. https://doi.org/10.1093/bioinformatics/btx311.
Xiao, Jian, Li Chen, Stephen Johnson, Yue Yu, Xianyang Zhang, and Jun Chen. 2018. Predictive modeling of microbiome data using a phylogeny-regularized generalized linear mixed model. Frontiers in Microbiology 9 (1391). https://doi.org/10.3389/fmicb.2018.01391. https://www.frontiersin.org/article/10.3389/fmicb.2018.01391.
Yilmaz, Pelin, Laura Wegener Parfrey, Pablo Yarza, Jan Gerken, Elmar Pruesse, Christian Quast, Timmy Schweer, Jörg Peplies, Wolfgang Ludwig, and Frank Oliver Glöckner. 2013. The SILVA and “all-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42 (D1): D643–D648. https://doi.org/10.1093/nar/gkt1209.
———. 2014. The SILVA and “all-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research 42 (Database issue): D643–D648. https://doi.org/10.1093/nar/gkt1209. https://www.ncbi.nlm.nih.gov/pubmed/24293649. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965112/.
Yoon, Seok-Hwan, Sung-Min Ha, Soonjae Kwon, Jeongmin Lim, Yeseul Kim, Hyungseok Seo, and Jongsik Chun. 2017. Introducing EzBioCloud: A taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. International Journal of Systematic and Evolutionary Microbiology 67 (5): 1613–1617. https://doi.org/10.1099/ijsem.0.001755. https://www.ncbi.nlm.nih.gov/pubmed/28005526. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5563544/.