© Springer Nature Switzerland AG 2023
Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_9

9. Alpha Diversity

Yinglin Xia1   and Jun Sun 1
(1)
Department of Medicine, University of Illinois Chicago, Chicago, IL, USA
 

Abstract

Population diversity is one important characteristic of a microbiome community, which is highly related to its environment. The change of diversity may indicate that an environment of human body has undergone a change (e.g., from a healthy to a disease condition), or the environment is disrupted by some factors (e.g., antibiotic use or a change of immune response). Thus, one task of microbiome community analysis is to analyze diversity of the gut microbiota and find host and/or environment factors that can modulate the taxonomic composition. Generally alpha, beta, and gamma are three kinds of diversity in microbiome community. Among them, alpha diversity and beta diversity are two most commonly used diversity measures in microbiome research. This chapter focuses on alpha diversity analysis. First, it introduces abundance-based alpha diversity metrics and phylogenetic metrics. Then, it explores alpha diversity and abundance by some common plots. Next, it describes statistical hypothesis testing of alpha diversity using R and QIIME 2.

Keywords
Alpha diversity Chao 1 Abundance-based coverage estimator (ACE) Shannon diversity Simpson diversity Inverse Simpson diversity Simpson evenness Pielou’s evenness Phylogenetic diversity Phylogenetic entropy Phylogenetic quadratic entropy Kruskal-Wallis test Alpha-phylogenetic method

One important characteristic of a microbiome community is its population diversity. The community diversity is highly related to its environment, so if a change of diversity has been detected, then it may either indicate that an environment in the human body has undergone a change (e.g., from a healthy to a disease condition) or indicate that the environment is disrupted by some factors (e.g., antibiotic use or a change of immune response). Thus, one task of microbiome community analysis is to find host and/or environment factors that can modulate the taxonomic composition (classified groups of closely related microbiota) and diversity (distribution of microbiota within or between communities) of the gut microbiota.

In microbiome community, generally there are three kinds of diversity including alpha, beta, and gamma diversities (Whittaker 1972). The alpha diversity and beta diversity are two common concepts that are used to measure the diversity of microbiome. In Chap. 9, we focus on alpha diversity analysis. Chapters 10 and 11 will introduce beta diversity analysis. This chapter is organized as follows. Section 9.1 introduces abundance-based alpha diversity metrics. Section 9.2 introduces phylogenetic metrics. Section 9.3 explores alpha diversity and abundance by some common plots. Section 9.4 describes statistical hypothesis testing of alpha diversity. Section 9.5 introduces alpha diversity analysis in QIIME 2. Finally, we briefly summarize this chapter in Sect. 9.6.

9.1 Abundance-Based Alpha Diversity Metrics

Although previously in the literature species diversity and species richness are sometimes used interchangeably, now it tends to use “species diversity” in the form of a species diversity index, that is, as an expression or index of some relation between number of species and number of individuals, and use “species richness” to refer to the number of species (in a given area or in a given sample) (Spellerberg and Fedor 2003).

Alpha diversity estimates the diversity within a single community (within-sample), which reflects both the number of species present (species richness) and the distribution of the number of organisms per species (species evenness), i.e., their relative abundance (evenness or equitability) and their taxonomic distribution. Several non-phylogenetic and phylogenetic metrics are available to measure alpha diversity. The non-phylogenetic metrics are based solely on abundance information, weight relatively rare, and mid-abundant and abundant species, while the phylogenetic metrics further utilize phylogenetic tree information.

9.1.1 Chao 1 Richness and Abundance-Based Coverage Estimator (ACE)

Species richness estimators estimate the total number of species present in a sample or community. Chao1 (Chao 1984) and ACE (abundance-based coverage estimator) (Chao and Lee 1992) were developed specifically for estimating richness of species from samples and for the general class-estimation problem, respectively (Colwell and Coddington 1994). In this section, we focus on Chao1 richness, and in Sect. 9.1.2, we introduce ACE.

9.1.1.1 The Measures of Chao 1 Richness

Two nonparametric methods of species richness for presence/absence data were developed by Anne Chao that were called “Chao 1” and “Chao 2” in the literature (Colwell and Coddington 1994). Chao1 index is a simple estimator to estimate the true number of species in the sample (or classes in a population) (Chao 1984); thus it is an abundance-based estimator of species richness.

Chao 1 index is based upon the number of rare species (or rare classes, i.e., taxa or OTUs) found in a sample (Chao 1984). The formula is given as below:
$$ {S}_{Chao1}={S}_{obs}+\frac{n_1^2}{2{n}_2}, $$
(9.1)
where SChao1 is the estimated number of species, Sobs is the observed number of species in a sample in total, and n1 and n2 are the number of singletons and doubletons, respectively. In other words, n1 is the number of observed species that are represented by only a single individual in that sample, and n2 is the number of observed species represented by exactly two individuals in that sample.
Chao 1 estimator is based on the distribution of individuals among species and the data on singletons and doubletons. Colwell and Coddington (1994) thought that the same approach can be applied to the distribution of species among samples based on only presence-absence data. They call this estimator “Chao 2”:
$$ {S}_{Chao2}={S}_{obs}+\frac{L^2}{2M}, $$
(9.2)
where L is the number of species that occur in only one sample (“unique” species) and M is the number of species that occur in exactly two samples. In concept, Chao 1 and Chao 2 are very similar. However, Colwell and Coddington’s approach, i.e., interpretation of Chao estimator from the distribution of individuals among species to the distribution of species among samples and from the data on singletons and doubletons to the data on only presence-absence, provides an alternative interpretation of Chao estimator. For example, n1 could be the number of species (taxa, OTUs) represented only once in the samples (unique species, taxa, OTUs) or a single read in that community, while n2 could be the number of species (taxa, OTUs) represented only twice in the samples. The confidence intervals are constructed by slightly modifying the percentile method based on bootstrap distributions (Efron 1981, 1982).

Note that Chao index only uses the singletons and doubletons to estimate the number of missing species (Chao 1984) (see formula 9.1), so this index gives more weight to the low-abundance species. Therefore, Chao index is particularly useful for datasets skewed toward the low-abundance species (Hughes et al. 2001), as is likely to be the case with microbiome data.

However, from above formula, notice that if the number of singleton taxa n1 is getting larger, that is, a sample contains many singletons; in such case, it is likely that the sample exists more undetected species; then the Chao 1 index will estimate greater species richness than it would for a sample without rare species.

In 1987 (Chao 1987) Chao derived a closed-form solution for the variance of SChao1 as below:
$$ Va\mathrm{r}\left({S}_{Chao1}\right)={n}_2\left(\frac{m^4}{4}+{m}^2+\frac{m^2}{2}\right), $$
(9.3)
where $$ m=\frac{n_1}{n_2}. $$ This formula estimates the precision of Chao1 from multiple samples, replacing the more complex variance estimation technique presented in Chao (1984) (Colwell and Coddington 1994). The chao1 confidence interval for richness was developed in 2004 (Colwell et al. 2004).

9.1.1.2 The Measures of ACE

ACE (Chao and Lee 1992; Chao and Yang 1993) is a nonparametric method for estimating the number of species (classes) using sample coverage, which is defined as the sum of the probabilities of the observed species (classes):
$$ {S}_{\mathrm{ACE}}={S}_{\mathrm{abund}}+\frac{S_{\mathrm{rare}}}{C_{\mathrm{ACE}}}+\frac{n_1}{C_{\mathrm{ACE}}}{\gamma}_{\mathrm{ACE}}^2, $$
(9.4)
where Sabund and Srare are the number of abundant and rare species (taxa, OTUs), respectively. CACE is the sample abundance coverage estimator, n1 is the number (frequency) of singletons, and $$ {\gamma}_{ACE}^2 $$ is the estimated coefficient of variation for rare species (taxa, OTUs), which is defined as
$$ {\gamma}_{\mathrm{ACE}}^2=\max \left[\frac{S_{\mathrm{rare}}}{C_{\mathrm{ACE}}}\frac{\sum_{i=1}^{10}i\left(i-1\right){n}_1}{N_{\mathrm{rare}}\left({N}_{\mathrm{rare}}-1\right)}-1,0\right]. $$
(9.5)

Chao and Yang (1993) discussed a theoretical optimal stopping rule based on the minimization of the testing cost and expected penalty due to the unremoved bugs. The optimal detection value is chosen as 10. The number of bugs detected more than 10 times is then added to the resulting estimate.

ACE method (see Eq. 9.4) divides the number of observed species into the frequencies abundant species (those with more than ten individuals in the sample) and the frequencies rare species (those with fewer than ten individuals) (Kim et al. 2017). The ACE method has some properties (Chao and Lee 1992; Chao and Yang 1993; Hughes et al. 2001; Sornplang and Piyadeatsoontorn 2016; Kim et al. 2017): (1) it only considers the presence or absence information of abundant species because abundant species would be discovered anyway; (2) it does not require the exact frequencies for the abundant species; (3) it does require the exact frequencies for the rare species because the estimation of the number of missing species is based entirely on these rare species.

9.1.1.3 Calculating Chao 1 Richness and ACE Using Ampvis2 Package

In this section, we use both ampvis2 and microbiome packages to illustrate the alpha diversity calculations based on two datasets. The first example is breast cancer QTRT1 mouse gut data (Zhang et al. 2020). The second example is mouse gut microbiome we generated from raw sequencing using QIIME 2 in Chap. 4 and has been used for bioinformatic analysis in Chaps. 5 and 6.

Example 9.1: Mouse Gut Microbiome Regulated by QTRT1 Gene in the Development of Breast Cancer

Sun and her team in 2020 conducted a murine microbiome experiment to explore whether and how the enzyme queuosine (Q)-tRNA ribosyltransferase catalytic subunit 1 (QTRT1) affects tumorigenesis and the microbiome. Queuosine is a micronutrient from diet and microbiome. Transfer RNA (tRNA) Q-modifications occur specifically in four cellular tRNAs at the wobble anticodon position. Because the microbiome product queuine is required for its installation, so tRNA Q-modification in human cells depends on the gut microbiome. This study aims to (1) examine microbiome-dependent Q-tRNA modification and its impact on breast tumor growth and gene expression and (2) identify the mechanisms of tRNA Q-modification-dependent cellular phenotypes in vitro and in vivo. Single clones of complete QTRT1-knockout (KO) in human breast cancer MCF7 cells were generated. The Q-modification levels of the MCF7 wild-type (WT) and QTRT1-KO cells were measured using the standard APB gel method. In this study, 20 KO mice (n = 10 for each of before and post treatments) and 20 WT mice (n = 10 for each of before and post treatments) were investigated. DNA was extracted from fecal samples, and 16S rRNA gene V3–V4 variable regions were barcoded and sequenced using MiSeq according to the Illumina protocol. After DNA sequencing, all possible raw paired-end reads were evaluated and merged using the PEAR (Zhang et al. 2014). Quality-filtered sequences were performed using QIIME pipeline (Caporaso et al. 2010) at a 97% similarity threshold utilizing the USEARCH algorithm (Edgar 2010) and the reference database silva_132_16S.97 (Glöckner et al. 2017). The microbiome data for downstream analysis include a total of 40 samples across 20 mice, resulting in 635 unique OTUs/taxa. The OTUs ranged from 39,813 to 65,762 for the individual sample, with a mean of 55,502.7 and median of 56, 093 and total reads of 2,220,108.

Throughout this book, we will call this dataset as “Breast cancer QTRT1 mouse gut microbiome” for simplification.

First, we install the package ampvis2 (current version 2.7.17, March 2022) and load the data into R. To install the ampvis2 package, enter the following commands in R or RStudio.
install.packages("remotes")
remotes::install_github("MadsAlbertsen/ampvis2")
To fast install the ampvis2 package, we can use multicore processors by setting the Ncpus argument as below:
install.packages("remotes")
remotes::install_github("MadsAlbertsen/ampvis2", Ncpus = 6)

Because today most computers’ CPU can run eight processes simultaneously, thus unless you know your computer has a CPU with more (logical) cores than 8, setting it to 6 (Ncpus = 6) is a good starting point.

Windows users should also install Rtools4 on Windows. Starting with R 4.0.0 (released April 2020), R for Windows uses a toolchain bundle called rtools4 (https://​mirrors.​dotsrc.​org/​cran/​bin/​windows/​Rtools/​rtools40.​html). Rtools4 is based on msys2, which makes easier to build and maintain R itself as well as the system libraries needed by R packages on Windows.

Here, we use breast cancer QTRT1 mouse gut data. The ampvis2 package can import data directly from the commonly used amplicon pipelines: QIIME 2, mothur, and UPARSE. The amp_import_biom() function is used to import the data in BIOM format generated from QIIME 2 and mothur, while the amp_import_uparse() function is used to import the data from the UPARSE pipeline. Then the data is loaded into a single ampvis2 object by the amp_load() function, which results in a list of data frames. The data that we use here are already in the format of data frames. Below we load otu_table and metadata into R:

> setwd("~/Documents/QIIME2R/Ch9_AlphaDiversity")
> otu_tab <- read.csv("otu_table_L7_MCF7_amp.csv", check.names = FALSE)
> meta_tab <- read.csv("metadata.csv", check.names = FALSE)
The function amp_load() is very useful; it checks and combines the data into a single ampvis2 object, making it easier to manipulate, filter, and subset all elements of the data at once for analysis. Please note that ampvis2 expects OTU ID’s to be the first column named OTU/ASV in “otutable.” Now we combine the otu_table and meta table into a single ampvis2 object “ds”:
> library(ampvis2)
> ds <- amp_load(otutable = otu_tab,
+ metadata = meta_tab)
> ds
ampvis2 object with 3 elements.
Summary of OTU table:
Samples OTUs Total#Reads Min#Reads Max#Reads Median#Reads
40 635 2220108 39813 65762 56093
Avg#Reads
55502.7
Assigned taxonomy:
Kingdom Phylum Class Order Family Genus Species
635(100%) 635(100%) 635(100%) 635(100%) 635(100%) 635(100%) 635(100%)

Metadata variables: 4.

SampleID, Group, Time, Group4.

We can explore the individual data frames in the list with View() function.
> View(ds$metadata)
> View(otu_tab)
> View(meta_tab)
The ampvis2 package uses the amp_alphadiv() function to perform alpha diversity analysis, which calculates alpha diversity indices for each sample using the vegan function diversity(), where the read abundances are first rarefied using rrarefy() function by the size of the rarefy argument and combined with the metadata. The syntax is as below:
amp_alphadiv(data, measure = NULL, richness = FALSE, rarefy = NULL)
The function has four arguments: data, measure, richness, and rarefy and return a dataframe, where the argument:
  • data is the required argument, listing as loaded with the amp_load() function.

  • measure is the listing vector that listed one or more alpha diversities to be calculated including “observed” for Observed taxa (OTUs), “shannon” for Shannon diversity, “simpson” for Simpson diversity, and “invsimpson” for inverse Simpson diversity.

  • richness is used to calculate sample richness estimates (Chao 1 and ACE) which is calculated by the estimate() function. The default is FALSE, without calculating richness.

  • rarefy rarefies species richness to the specified value before calculating alpha diversity and/or richness. By default(NULL), samples will be not rarefied before calculation.

It has been shown that using rarefying of counts is an inappropriate approach for detection of differentially abundant species (McMurdie and Holmes 2014). Here we do not rarefy the sample counts. By default, if no measure(s) are chosen, all diversity indices will be returned. We can also provide a list of vector to calculate more than one diversity, such as measure = c(“shannon”, “simpson”) to calculate Shannon and Simpson alpha diversity indices simultaneously. However, here, we illustrate the function amp_alphadiv() to calculate each diversity one by one. The following R commands will calculate all the alpha diversity indices including Chao 1 and ACE:
> alpha_amp <- amp_alphadiv(ds,
+ measure =,
+ richness = TRUE,
+ rarefy = NULL)
> alpha_amp
SampleID Group Time Group4 Reads ObservedOTUs Shannon Simpson invSimpson Chao1 ACE
Sun040.BH5 Sun040.BH5 KO Before KO_BEFORE 39813 110 2.385 0.8270 5.779 141.0 156.1
Sun039.BH4 Sun039.BH4 KO Before KO_BEFORE 42056 112 2.378 0.8336 6.009 160.0 170.5
Sun027.BF2 Sun027.BF2 WT Before WT_BEFORE 42729 101 2.276 0.7784 4.512 153.8 162.9
……
We can explore the results in the data frame using View() and head() functions.
> View(alpha_amp)
> head(alpha_amp)

9.1.1.4 Calculating Chao 1 Richness and ACE Using Microbiome Package

In Chap. 2 (Sect. 2.​3.​2), we introduced the microbiome package (current version 1.14.0, March 2022). If this package has not been installed in your computer, open R and enter the following R commands to install:
library(BiocManager)
#source("https://bioconductor.org/install")
#useDevel()
BiocManager::install(version='devel')
BiocManager::install("microbiome")
Alternatively, we can install the development version (potentially unstable) by entering the following R commands in R:
# Alternative install the development version
# Load the devtools package
library(devtools)
# Install the package
install_github("microbiome/microbiome")
Example 9.2: Mouse Gut Microbiome Data

In order to show the workflow from bioinformatic data generation to downstream statistical analysis, here we use the same mouse gut microbiome data generated by DADA2 within QIIME 2. Thus we use qiime2R package to export the .qza and .tsv files into R. Entering the following R commands to install this package, then call the function qza_to_phyloseq() to create a phyloseq object. To do so, we also need the phyloseq package available:

> install.packages("remotes")
> remotes::install_github("jbisanz/qiime2R")
> library(qiime2R)
> library(phyloseq)
> physeq<-qza_to_phyloseq(features="FeatureTableMiSeq_SOP.qza", taxonomy = "TaxonomyMiSeq_SOP.qza", tree = "RootedTreeMiSeq_SOP.qza", metadata="SampleMetadataMiSeq_SOP.tsv")
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 392 taxa and 360 samples ]
sample_data() Sample Data: [ 360 samples by 11 sample variables ]
tax_table() Taxonomy Table: [ 392 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 392 tips and 389 internal nodes ]
Two ways can be used to calculate Chao 1 richness using the microbiome package: one is to call the function alpha() and specify the index = “chao1”, and another way is to directly call the function richness(), which also calculates the observed taxa:
> library(microbiome)
> library(knitr)
The following R commands call the function alpha() and specify the index = “chao1” to calculate Chao 1 richness:
> chao1 <- alpha(physeq, index = "chao1")
> head(chao1,3)
chao1
F3D0 111
F3D1 102
F3D11 89
The following R commands directly call the function richness() to calculate Chao 1 richness and the observed taxa:
> rich <- richness(physeq)
> head(rich,3)
observed chao1
F3D0 111 111
F3D1 102 102
F3D11 89 89
However, current version (version 1.14.0) of microbiome package does not provide the capability to estimate the measures of ACE, which is available via estimate_richness() function in the phyloseq package:
> library(phyloseq)
> richness_physeq <- estimate_richness(physeq, split = TRUE, measures = c("Observed", "Chao1", "ACE"))
> head(richness_physeq,3)
Observed Chao1 se.chao1 ACE se.ACE
F3D0 111 111 0 111 4.727
F3D1 102 102 0 102 4.284
F3D11 89 89 0 89 4.290

9.1.2 Shannon Diversity

Shannon and Simpson diversity indices are most often used to measure the diversity of communities. Both indices take both abundance and relative abundances (evenness) of the species present into account. Thus they provide more information about the community composition than simple species richness (i.e., the number of species present).

9.1.2.1 The Measures of Shannon Diversity

The measure of Shannon diversity index (Shannon 1948), an index of both species richness and evenness, is the most commonly used and specifically the most popular in the ecological and microbiome literatures. Shannon’s index (labeled as H) was originally developed by Claude Shannon in 1948 to quantify the entropy (uncertainty or information content) in strings of text. The uncertainty is measured by the Shannon function H, which is based on communication theory and measures how to predict the next letter in a message or communication (a common question in communication) (Shannon 1948). Shannon’s index (H) is given as below:
$$ H=-\sum \limits_{i=1}^S{p}_i\ln {p}_i, $$
(9.6)
where pi is the proportion (or relative abundance) of individuals of species i in the community and S is the total number of species in the community (richness), so $$ \sum \limits_{i=1}^S{p}_i=1. $$ Thus, Shannon’s diversity index is to first calculate the proportion of species i relative to the total number of species (pi) and then multiplied by the natural logarithm of this proportion (lnpi). The resulting product is summed across species and multiplied by −1. In summary, Shannon’s index measures the uncertainty: How difficult would it be to predict correctly the species of the next individual collected? Put another way, it measures the amount of uncertainty, so that the larger the value of H, the greater the uncertainty.
It is worth mentioning that Shannon’s index has some properties:
  • Strictly speaking, the Shannon measure of information content should be used only on random samples drawn from a large community in which the total number of species is known (Pielou 1966).

  • Theoretically the value H could be very large because it increases as the number of species in the community increases and as the distribution of individuals among the species becomes even (Lemos et al. 2011; Magurran 2013). But in practice for biological communities, H does not seem to be larger than 5.0 (Washington 1984).

  • Shannon’s index “gives more importance” to less common categories (e.g., rare species in the case of microbiome) and places a greater weight on species richness than species evenness (Schloss and Handelsman 2006; Schloss et al. 2009).

  • Because Shannon’s index values increase as the number of sample size increases, thus normalization is crucial and suggested to avoid biases when it is compared between samples (Lemos et al. 2011).

Note that Shannon index is also known as the Shannon-Wiener index, the Shannon-Weaver index, and the Shannon entropy (Spellerberg and Fedor 2003). Based on Spellerberg and Fedor (2003), the uses of the “Shannon-Wiener” index and the “Shannon-Weaver” index are due to confusion. Actually, Shannon’s mathematical theory of communication including the first form of the present Shannon expression (index) was published in the 1948s Bell System Technical Journal (Shannon 1948). In 1949, Shannon published this information jointly with Warren Weaver, a mathematician, in the book The Mathematical Theory of Communication (Shannon et al. 1949), which is very similar to the original published papers, and the entropy “H” in 1948 was taken account of by Shannon in the first part of this 1949 book. In his paper Shannon appreciated the impact of the mathematician Norbert Wiener’s basic philosophy and theory on his communication theory and cited several of Wiener’s publications. Thus, Spellerberg and Fedor (2003) suggested that the “mislabeling” and the confusion of the Shannon index “H” (as referred to by Krebs 1999) as the Shannon and Weaver index are partly due to the joint authorship of Shannon and Weaver’s book, while given the fact that in the late 1940s, Shannon had built on the work of Wiener, it seems preferable to refer to “H” (the species diversity index) as the “Shannon index” or the “Shannon and Wiener index” (Spellerberg and Fedor 2003).

9.1.2.2 Calculating Shannon Diversity Using ampvis2 Package

Example 9.3: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1 Cont.

The following R commands calculate Shannon index for QTRT1 dataset using the amp_alphadiv() in ampvis2 package:

> shannon_amp <- amp_alphadiv(ds,
+ measure = "shannon",
+ richness = FALSE,
+ rarefy = NULL)
> View(shannon_amp)
> head(shannon_amp,3)
SampleID Group Time Group4 Reads Shannon
1 Sun040.BH5 KO Before KO_BEFORE 39813 2.385
2 Sun039.BH4 KO Before KO_BEFORE 42056 2.378
3 Sun027.BF2 WT Before WT_BEFORE 42729 2.276

9.1.2.3 Calculating Shannon Diversity Using Microbiome Package

Example 9.4: Mouse Gut Microbiome Data, Example 9.2 Cont.

To calculate Shannon diversity, we call the function alpha() in microbiome package and specify the index = “shannon”:

> shannon <- alpha(physeq, index = "shannon")
> head(shannon,3)
diversity_shannon
F3D0 3.885
F3D1 3.992
F3D11 2.893

9.1.3 Simpson Diversity

Simpson’s diversity indices refer to four closely related indices: (1) Simpson’s index (D), (2) Simpson’s index of diversity 1 – D, (3) Simpson’s reciprocal index 1 / D, and (4) Simpson’s index of evenness.

9.1.3.1 The Measures of Simpson Diversity

Simpson in 1949 (Simpson 1949) proposed a new nonparametric measure which combines species richness and evenness. The new concept of diversity states that diversity is inversely related to “the probability that two individuals chosen at random and independently from the population will be found to belong to the same group” (Simpson 1949), i.e., will belong to the same species. Although in ecological literature the new concept of diversity is synonymous with diversity (Hurlbert 1971), it actually is about species heterogeneity (Good 1953).

Simpson’s Index (D0)

Simpson’s index (D0) measures the probability that two individuals randomly selected from a sample will belong to the same species. This is Simpson’s original index, which is given as

$$ {D}_0=\sum {\left(n/N\right)}^2, $$
(9.7)
where n is the total number of individuals (organisms) of a particular species and N is the total number of individuals (organisms) of all species. Another version of the formula for calculating Simpson’s original index (D0) is given as
$$ {D}_0=\frac{\sum n\left(n-1\right)}{N\left(N-1\right)}. $$
(9.8)

This version of the formula will give consistent and acceptable results. In literature, Simpson’s original index in (9.7) is also written as $$ {D}_O=\sum \limits_{i=1}^S{p}_i^2, $$ where S is the total number of species in the community and pi is the proportion of individuals (or relative abundance) of species i in the community. Simpson’s index gives more weight to the more abundant species in a sample. Additional rare species to a sample does not change the value of D0 too much.

The value of D0 ranges between 0 and 1, with 0 representing infinite diversity (i.e., the lower value of D0 the more diversity) and 1 suggesting no diversity (i.e., the bigger value of D0, the lower diversity). Obviously Simpson’s original index cannot provide an intuitive and logical interpretation of diversity.

Simpson’s Diversity Index (1-D0)

To provide a straightforward interpretation, D0 is often subtracted from 1 which defines Simpson’s diversity index as below:

$$ D=1-\frac{\sum n\left(n-1\right)}{N\left(N-1\right)}. $$
(9.9)

In the literature, Simpson’s diversity index is also written as $$ D=1-\sum \limits_{i=1}^S{p}_i^2. $$ The values of Simpson’s diversity index also range from 0 to 1, but now with lower values (close to 0) indicating lower sample diversity, while greater values (close to 1) indicating greater sample diversity. Thus, Simpson’s diversity index measures the probability that two individuals randomly selected from a sample will belong to different species. In contrary to Shannon’s index, Simpson’s diversity index “gives more importance” to more common species.

Simpson’s Reciprocal Index (1/D0)

Another way to avoid nonintuitive nature of Simpson’s index is to take the reciprocal of the index, which defines inverse Simpson’s index (1/D0). The inverse Simpson index is given as below:

$$ {D}_I=\frac{1}{\sum \limits_{i=1}^S{p}_i^2}, $$
(9.10)
which is the reciprocal of Simpson’s original index (1/DO).
Simpson’s Index of Evenness

As we stated above, heterogeneity contains both species richness and evenness; thus, researchers naturally try to separately measure the evenness component from richness. The null hypothesis of evenness is all species in a hypothetical community are equally common. However, most communities contain a few dominant species and many species that are relatively uncommon. Evenness measures attempt to quantify this unequal representation against the null hypothesis. As the independent measures of species richness, many different measures of evenness (or equitability) have been proposed in the ecology and microbiome literatures. Simpson’s index of evenness is one of them. Simpson’s index of evenness is obtained from reciprocal of Simpson’s original index $$ {D}_O=\sum \limits_{i=1}^S{p}_i^2, $$ which is defined as

$$ E=\frac{1}{S\sum \limits_{i=1}^S{p}_i^2}, $$
(9.11)
where S is the number of species in the sample. This index also ranges from 0 to 1 and is relatively unaffected by the rare species in the sample.
Simpson index also has some properties:
  • Simpson’s diversity index measures the species dominance and reflects the probability of two individuals that belong to the same species being randomly chosen. It varies from 0 to 1 and the index increases as the diversity decreases (Simpson 1949).

  • Simpson diversity puts a higher weight on species evenness more than species richness in its measurement (Schloss and Handelsman 2006; Schloss et al. 2009).

  • Same to Shannon diversity, normalization is considered as crucial and important to avoid biases when comparing Simpson diversity between different samples (Lemos et al. 2011).

9.1.3.2 Calculating Simpson Diversity and Inverse Simpson Diversity Using Ampvis2 Package

Example 9.5: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1 Cont.

The following R commands calculate Simpson diversity for QTRT1 dataset using the amp_alphadiv() in ampvis2 package:

> simpson_amp <- amp_alphadiv(ds,
+ measure = "simpson",
+ richness = FALSE,
+ rarefy = NULL)
> View(simpson_amp)
> head(simpson_amp,3)
SampleID Group Time Group4 Reads Simpson
1 Sun040.BH5 KO Before KO_BEFORE 39813 0.8270
2 Sun039.BH4 KO Before KO_BEFORE 42056 0.8336
3 Sun027.BF2 WT Before WT_BEFORE 42729 0.7784
The following R commands calculate inverse Simpson index for QTRT1 dataset using the amp_alphadiv() in ampvis2 package:
> invsimpson_amp <- amp_alphadiv(ds,
+ measure = "invsimpson",
+ richness = FALSE,
+ rarefy = NULL)
> View(invsimpson_amp)
> head(invsimpson_amp,3)
SampleID Group Time Group4 Reads invSimpson
1 Sun040.BH5 KO Before KO_BEFORE 39813 5.779
2 Sun039.BH4 KO Before KO_BEFORE 42056 6.009
3 Sun027.BF2 WT Before WT_BEFORE 42729 4.512

9.1.3.3 Calculating Simpson Diversity, Inverse Simpson Diversity, and Simpson Evenness Using Microbiome Package

Example 9.6: Mouse Gut Microbiome Data, Example 9.2 Cont.

To calculate Simpson diversity, we call the function alpha() and specify the index = “simpson” in the microbiome package:

> simpson <- alpha(physeq, index = "simpson")
> head(simpson,3)
evenness_simpson dominance_simpson
F3D0 0.2548 0.03535
F3D1 0.3343 0.02933
F3D11 0.1113 0.10098
To calculate inverse Simpson diversity, we call the function alpha() and specify the index = “inverse_simpson” in the microbiome package:
> inv_simpson <- alpha(physeq, index = "inverse_simpson")
> head(inv_simpson,3)
diversity_inverse_simpson
F3D0 28.285
F3D1 34.098
F3D11 9.903
To calculate Simpson evenness, we call the function evenness() and specify the index = “simpson” in the microbiome package:
> even_simpson <- evenness(physeq, "simpson")
> head(even_simpson,3)
simpson
F3D0 0.2548
F3D1 0.3343
F3D11 0.1113

9.1.4 Pielou’s Evenness

Pielou’s evenness (Pielou 1966) is the most common measure of evenness. Here, we describe and illustrate it below.

9.1.4.1 The Measures of Pielou’s Evenness

Pielou’s evenness measures diversity along with species richness. The formula of Pielou’s evenness index is given as below:
$$ J=\frac{H}{\log (S)}, $$
(9.12)
where H is Shannon’s diversity index and S is the total number of species observed in a sample. The value of Pielou’s evenness ranges from 0 (no evenness) to 1 (complete evenness). Many diversity indices, such as Simpson’s diversity and Shannon’s diversity, incorporate evenness. However, it has been shown that the diversity indices which concentrate totally on evenness are fraught with problems, including dependence on species counts (McCune and Grace 2002). Pielou’s index has a particular problem: it is a ratio of a relatively stable index, H, and one that is strongly dependent on sample size, S.

9.1.4.2 Calculating Pielou Evenness Using Microbiome Package

Example 9.7: Mouse Gut Microbiome Data, Example 9.2, Cont.

To calculate Pielou evenness, we call the function alpha() and specify the index = “pielou” in microbiome package:

> pielou <- alpha(physeq, index = "pielou")
> head(pielou,3)
evenness_pielou
F3D0 0.8249
F3D1 0.8631
F3D11 0.6444
The following R commands calculate all available evenness measures in the microbiome package:
> even <- evenness(physeq, "all")
> head(even,3)
camargo pielou simpson evar bulla
F3D0 0.9118 0.8249 0.2548 0.3774 0.5021
F3D1 0.8991 0.8631 0.3343 0.3939 0.5597
F3D11 0.8549 0.6444 0.1113 0.2462 0.2859
The following R commands calculate Pielou evenness using evenness() in the microbiome package:
> even_pielou <- evenness(physeq, "pielou")
> head(even_pielou,3)
pielou
F3D0 0.8249
F3D1 0.8631
F3D11 0.6444

9.2 Phylogenetic Alpha Diversity Metrics

Phylogenetic metrics not only use abundance information but also further utilize phylogenetic tree information. Several phylogenetic metrics are available to measure alpha diversity. In this subsection, we introduce three commonly used phylogenetic metrics: (1) phylogenetic diversity, (2) phylogenetic entropy, and (3) phylogenetic quadratic entropy.

9.2.1 Phylogenetic Diversity

Faith’s phylogenetic diversity (PD) (Faith 1992) is a measure of biodiversity, based on phylogeny (the tree of life), that is, incorporating phylogenetic difference between species. It can be considered as “a phylogenetic generalization of species richness” (Chao et al. 2016). Faith thought that the branch lengths on the tree are informative because they count the relative number of new features arising along that part of the tree. Thus, he defined and calculated the PD of a set of species as “equal to the sum of the lengths of all those branches that are members of the corresponding minimum spanning path” (Faith 1992), where “branch” is a segment of a cladogram, and the minimum spanning path is the minimum distance between the two nodes. The formula of PD is given as below:

$$ \mathrm{PD}={\sum}_{branchesb}\mathrm{\ell}(b), $$
(9.13)
where ℓ(b) is the length of a branch b. As discussed in Faith (1992), PD is a “feature diversity” and so PD counts features. Faith’s PD has been reviewed as a good measure of phylogenetic diversity. However, PD is only based on phylogeny. It is important to incorporate abundance information into phylogenetic diversity measures for conservation (Chao et al. 2016). We will illustrate how to calculate Faith’s PD using QIIME 2 in Sect. 9.5.3.

9.2.2 Phylogenetic Entropy

Phylogenetic entropy (PE) (Allen et al. 2009) generalizes or extends the Shannon index (Shannon entropy) to incorporate phylogenetic distances among species. Thus, it is a diversity index (measure) that accounts for both phylogeny and species abundances, including species richness, evenness, and distinctness. The formula of PE is given as below:

$$ {H}_p=-{\sum \limits_{branches}}_{b ofT}\mathrm{\ell}(b)p(b)\ln p(b), $$
(9.14)
where T is a rooted phylogenetic tree for the community, ℓ(b) is the length of a branch b of T, and p(b) is the proportion of individuals in the (present-day) community who are represented by leaves descending from b.

Given defining PE this way, the Shannon index $$ H=-\sum \limits_{i=1}^S{p}_i\ln {p}_i $$ becomes a special case of the PD if all species are equally distinct or, equivalently, if T has uniform branch lengths (Allen et al. 2009).

Shannon entropy does not obey the replication principle (or “doubling property”), which states that the total diversity of pooling two equally diverse and equally large groups with no shared species should be two times the diversity of a single group (Chao et al. 2010, 2016). PE (as a phylogenetic generalization of Shannon entropy) was reviewed having the same interpretational problem as Shannon index (Chao et al. 2010, 2016).

9.2.3 Phylogenetic Quadratic Entropy (PQE)

Phylogenetic quadratic entropy (PQE) (Rao 1982; Warwick and Clarke 1995), the most established quadratic diversity, measures the average taxonomic or phylogenetic distance between individual organisms. PQE that was introduced by Rao (1982) is a generalization of the Gini-Simpson index (Chao et al. 2016). It was independently rediscovered by Warwick and Clarke (1995) under the name “taxonomic diversity” (Allen et al. 2009; Warwick and Clarke 1995). The formula is given as below:

$$ Q={\sum}_{i&lt;j}{d}_{ij}{p}_i{p}_j, $$
(9.15)
where dij is the taxonomic or phylogenetic distance between species si and sj; dij can be equated with the branch length of the shortest path between the corresponding leaves of a phylogenetic tree. pi and pj are the proportion of individuals in species si and sj, respectively.

Although Rao’s quadratic entropy Q was the first diversity measure that accounts for both phylogeny and species abundances, however, like the Gini-Simpson index, its parent measure, it does not obey the replication principle. Thus it has the same interpretational problem as Gini-Simpson measure (Chao et al. 2010, 2016).

9.3 Exploring Alpha Diversity and Abundance

Alpha diversities and microbial abundances can be plotted using ampvis2 and microbiome packages. Comparing and visualizing groups based on the differences or similarities is also important. Various plots can be used to achieve different goals. We choose some useful plots to illustrate the capabilities of these two packages.

9.3.1 Heatmap

In this subsection, we illustrate how to generate heatmap using ampvis2 package.

Example 9.8: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1 Cont.

The ampvis2 package generates plots using the ggplot2 package. The heatmap is generated by the amp_heatmap() function. This function by default aggregates data to phylum level and shows the top 10 phyla, ordered by mean read abundance across all samples. Before generating heatmap, let’s review a short summary of the breast cancer QTRT1 mouse gut data:

> ds
ampvis2 object with 3 elements.
Summary of OTU table:
Samples OTUs Total#Reads Min#Reads Max#Reads
40 635 2220108 39813 65762
Median#Reads Avg#Reads
56093 55502.7
Assigned taxonomy:
Kingdom Phylum Class Order Family Genus Species
635(100%) 635(100%) 635(100%) 635(100%) 635(100%) 635(100%) 635(100%)
Metadata variables: 4
SampleID, Group, Time, Group4
First we use the amp_heatmap() function to generate the heatmap for all four groups which by default aggregates to phylum level and shows the top 10 phyla, ordered by mean read abundance across all samples (Fig. 9.1):
> # Figure 9.1
> amp_heatmap(ds, group_by = "Group4")

A heat map with different color gradients lists down the mean abundance value of Q T R T 1 gene data in the top 10 phyla at the phylum level. Firmicute ranks top in the heat map with values 93.5, 35.4, 80.9, and 43.1.

Fig. 9.1

Heatmap of QTRT1 data aggregating to phylum level with the top 10 phyla, ordered by mean read abundance across all samples

In below arguments, we select genus as the aggregated level, showing top 10 genera, adding family level to get additional taxonomic information, with the abundance values in plots. The samples are grouped by lockout before and post, as wild type before and post 4 groups (Fig. 9.2):

A heat map with different color gradients lists down the mean abundance value of Q T R T 1 gene data in the top 10 genera at the genus level. Muribaculaceae ranks top in the heat map with values 0, 49, 10, and 49.1.

Fig. 9.2

Heatmap of QTRT1 data aggregating to genus level with the top 10 genera, ordered by mean read abundance across all samples

> # Figure 9.2
> amp_heatmap(ds,
+ group_by = "Group4",
+ tax_aggregate = "Genus",
+ tax_add = "Family",
+ tax_show = 10,
+ color_vector = c("white", "red"),
+ plot_colorscale = "sqrt",
+ plot_values = TRUE) +
+ theme(axis.text.x = element_text(angle = 45, size=10, vjust = 1),
+ axis.text.y = element_text(size=8),
+ legend.position="right")
Before generating the heatmap, we can filter and subset the data and then perform the function to obtain a specific heatmap to meet our study design. The following R commands subset the post treatment data for the analysis using the amp_subset_samples() function:
> ds_sub <- amp_subset_samples(ds, Time%in% c("Post"), minreads = 1000)
20 samples and 249 OTUs have been filtered
Before: 40 samples and 635 OTUs
After: 20 samples and 386 OTUs
Figure 9.3 is generated using the post samples only and minimum reads equal to 1000:
> # Figure 9.3
> amp_heatmap(ds_sub,
+ group_by = "Group4",
+ tax_aggregate = "Genus",
+ tax_add = "Family",
+ tax_show = 10,
+ color_vector = c("white", "red"),
+ plot_colorscale = "sqrt",
+ plot_values = TRUE) +
+ theme(axis.text.x = element_text(angle = 45, size=10, vjust = 1),
+ axis.text.y = element_text(size=8),
+ legend.position="right")

A heat map with different color gradients lists the mean abundance value of Q T R T 1 gene data in the top 10 genera at the genus level post-treatment. Muribaculaceae ranks top in the heat map with values 49 and 49.1.

Fig. 9.3

Heatmap of QTRT1 data for post treatments aggregating to genus level with the top 10 genus, ordered by mean read abundance across all samples

We can subset the data based on two or more variables using “&” to separate the conditions or simply use the function more than once. The “!” (logical NOT operator) indicates “except” and is useful to remove outliers. The minreads = cut-off values argument removes any sample(s) with total amount of reads below the chosen threshold.

We can also subset the data based on the taxonomy using the amp_subset_taxa() function. For example, the following R commands subset the data to the taxa of D_3__Aminicenantales and D_3__Blastocatellales with a vector specifying their names separated by a comma:

> ds_Aminicenantales_Blastocatellales <- amp_subset_taxa(ds, tax_vector=c("D_3__Aminicenantales", " D_3__Blastocatellales"))
634 OTUs have been filtered
Before: 635 OTUs
After: 1 OTUs

9.3.2 Boxplot

In this subsection, we illustrate how to generate boxplot using ampvis2 and microbiome packages, respectively.

9.3.2.1 Generating Boxplot Using Ampvis2 Package

We use the amp_boxplot() function to generate boxplots for the four groups, showing top 5 genera and adding family information. The plot again is ordered by mean read abundance across all samples (Fig. 9.4):
> # Figure 9.4
> amp_boxplot(ds,
+ group_by = "Group4",
+ tax_show = 5,
+ tax_add = "Family")

A boxplot of 5 different taxa versus real abundance in percentage. It displays data for the W T post and before and the K O post and before. W T post and K O post values are higher in Muribaculaeceae.

Fig. 9.4

Boxplot of QTRT1 data for showing top 5 taxa with family information added

9.3.2.2 Generating Boxplot Using Microbiome Package

Example 9.9: Mouse Gut Microbiome Data, Example 9.2 Cont.

Next, we show steps to visualize the differences and/or similarities between groups using the microbiome package. By doing the plots, we need the useful R package ggpubr (current version 0.4.0, March 2022). The following R commands call the microbiome, ggpubr, and other two R packages:

> library(microbiome)
> library(ggpubr)
> library(knitr)
> library(dplyr)
Let’s first prune the taxa and then calculate the Shannon diversity:
> physeq1 <- prune_taxa(taxa_sums(physeq) > 0, physeq)
The Shannon diversity is re-calculated as below.
> shannon1 <- alpha(physeq1, index = "shannon")
> head(shannon1,3)
diversity_shannon
F3D0 3.885
F3D1 3.992
F3D11 2.893
We pool out the metadata from the physeq1 object:
> physeq1_meta <- meta(physeq1)
> head(physeq1_meta)
And then add the diversity table to metadata:
> physeq1_meta$Shannon <- shannon1$diversity_shannon
> head(physeq1_meta,3)
BarcodeSequence ForwardPrimerSequence ReversePrimerSequence
F3D0 <NA> <NA> <NA>
F3D1 <NA> <NA> <NA>
F3D11 <NA> <NA> <NA>
ForwardRead ReverseRead
F3D0 F3D0_S188_L001_R1_001.fastq.gz F3D0_S188_L001_R2_001.fastq.gz
F3D1 F3D1_S189_L001_R1_001.fastq.gz F3D1_S189_L001_R2_001.fastq.gz
F3D11 F3D11_S198_L001_R1_001.fastq.gz F3D11_S198_L001_R2_001.fastq.gz
Group Sex Time DayID DPW Description Shannon
F3D0 F3D0 Female Early D000 0 QIIME2RAnalysisSet 3.885
F3D1 F3D1 Female Early D001 1 QIIME2RAnalysisSet 3.992
F3D11 F3D11 Female Early D011 11 QIIME2RAnalysisSet 2.893
Now, we can generate the boxplots using the ggboxplot() function from the ggpubr package. The following call function generates boxplots with jittered points. The outline colors are specified for changing by groups (in this case, Time). We customize color palette for two groups as blue and red and specify adding jitter points and changing the shape by groups (Fig. 9.5):
> # Figure 9.5
> p <- ggboxplot(physeq1_meta, x = "Time", y = "Shannon",
+ color = "Time", palette =c("#00AFBB", "#FC4E07"),
+ add = "jitter", shape = "Time")
> p

A boxplot of Shannon diversity of the gut microbiome measures the values at early and late time points as depicted by blue and red color boxplots. The values at the late time point are higher than at the early time point.

Fig. 9.5

Boxplot of gut microbiome data for Shannon diversity by early and late time points in mice

The following R commands specify the comparisons of groups and add P-values for comparing groups (Fig. 9.6). Please note that here comparisons of Shannon diversity that are conducted between early and late time points may not be appropriate and interest of the reader’s study. We just use this dataset for illustrating the commands:
> # Figure 9.6
> my_comparisons <- list(c("Early", "Late"))
> p + stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
+ stat_compare_means(label.y = 5.5) # Add global p-value

A boxplot represents Shannon diversity of the gut microbiome by p-values obtained at early and late time points of the Wilcoxon rank sum test. The p-value difference between the time points is 0.013 units.

Fig. 9.6

Boxplot of mouse gut microbiome data for Shannon diversity by early and late time points with p-value generated by Wilcoxon rank sum test

Here the pairwise comparison P-value is the same as the global P-value because there are only two groups (early and late) in this comparison.

9.3.3 Violin Plot

Example 9.10: Mouse Gut Microbiome Data, Example 9.2 Cont.

For violin plot, we continue to use the same phyloseq object “physeq1” as in boxplot and compare differences in Shannon diversity between early and late time points.

Now we can use the ggviolin() function in the ggpubr package to create a violin plot (Fig. 9.7):
> # Figure 9.7
> p1 <- ggviolin(physeq1_meta, x = "Time", y = "Shannon",
+ add = "boxplot", fill = "Time", palette = c("#a6cee3", "#b2df8a"))
> print(p1)

A violin plot of Shannon diversity of the gut microbiome measures the values at early and late time points as depicted by blue and green violin structures. The values at the late time point are broader than at the early time point.

Fig. 9.7

Violin plot of mouse gut microbiome data for Shannon diversity by early and late time points

Here we only have two levels of groups; in the case of more than two groups, a pairwise comparison can be conducted. For the purpose of illustration, we show how to get the variable and create a list of pairwise comparisons:
> time <- levels(physeq1_meta$Time)
> comp_time<- combn(seq_along(time), 2, simplify = FALSE, FUN = function(i)time[i])
> print(comp_time)
[[1]]
[1] "Early" "Late"
The following R commands add the statistics (P-values) to the violin plot generated by pairwise comparisons using a nonparametric test (in this case, Wilcoxon rank sum test) (Fig. 9.8):
> # Figure 9.8
> p1 <- p1 + stat_compare_means(comparisons = comp_time) + # Add pairwise comparisons p-value
+ stat_compare_means(label.y = 5.5) # Add global p-value
> print(p1)

A violin plot of Shannon diversity of the gut microbiome measures the values at early and late time points as depicted by blue and green violin structures. The p-value difference between the time points is 0.013 units as measured by Wilcoxon sum test.

Fig. 9.8

Violin plot of mouse gut microbiome data for Shannon diversity by early and late time points with p-value generated by Wilcoxon rank sum test

9.4 Statistical Hypothesis Testing of Alpha Diversity

Example 9.11: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1 Cont.

Depending on the number of groups and the distribution of the alpha diversity measures, we can conduct a statistical hypothesis testing of alpha diversity using a two-sample Welch’s t-test, a Wilcoxon rank sum test, a one-way ANOVA, or a Kruskal-Wallis test (Xia et al. 2018). In above boxplots and violin plots, the P-values generated are from comparisons of Shannon diversity between early and late time points using Wilcoxon rank sum test. Here, we use the nonparametric Kruskal-Wallis test to compare groups (in the case of two groups, it equals to Wilcoxon rank sum test).

9.4.1 Summarize the Diversity Measures

The following commands calculate Shannon diversity using the amp_alphadiv() function in ampvis2 package based on the “ds” object created by this package previously:
> # Shannon index
> shannon_amp <- amp_alphadiv(ds,
+ measure = "shannon",
+ richness = FALSE,
+ rarefy = NULL
+ )
We use the Summarize() in the FSA package (current version 0.9.3, March 2022) to summarize the Shannon diversity measures to obtain the basic statistics per group including the number of samples, mean, median, SD, minimum and maximum values, and Q1 and Q3:
> library(FSA)
> Summarize(Shannon ~ Group4, data = shannon_amp)
Group4 n mean sd min Q1 median Q3 max
1 KO_BEFORE 10 2.480 0.0982 2.369 2.390 2.462 2.563 2.620
2 KO_POST 10 2.076 0.2680 1.570 1.995 2.090 2.155 2.462
3 WT_BEFORE 10 2.373 0.1747 1.917 2.401 2.412 2.446 2.561
4 WT_POST 10 2.080 0.1922 1.809 1.891 2.142 2.235 2.327

9.4.2 Plot Histogram of the Diversity Distributions

We use the histogram() in the lattice package (current version 0.20.45, March 2022) to obtain a distribution plot for each group. We specify the layout of individual plots in panel of two columns and two rows (Fig. 9.9):
> # Figure 9.9
> library(lattice)
> histogram(~ Shannon|Group4, data=shannon_amp,layout=c(2,2))

A histogram of the percent of total versus Shannon has 4 parts, W T before, W T post, K O before, and K O post. It displays that the highest values occur before the W T and K O parts.

Fig. 9.9

Histograms of Shannon diversity in mouse gut microbiome data

9.4.3 Kruskal-Wallis Test

We perform a Kruskal-Wallis test of Shannon diversity by calling the kruskal.test() function:
> kruskal.test(Shannon ~ Group4, data = shannon_amp)
Kruskal-Wallis rank sum test
data: Shannon by Group4
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 0.0002
As the P-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups. The Kruskal-Wallis chi-squared value of 20 is greater than the chi-square statistic of 7.815 that is generated by the following qchisq() function with 3 degrees of freedom. So, the significance between groups were further confirmed:
> qchisq(0.950, 3)
[1] 7.815

9.4.4 Perform Multiple Comparisons

From the output of the Kruskal-Wallis test, we know that there is a significant difference between groups. To find out which pairs of groups are different, we use the following R commands to conduct multiple comparisons using the Tukey method for adjusting P-values:
> shannon_amp$Group4 <- as.factor(shannon_amp$Group4)
> library(DescTools)
> adj_tukey = NemenyiTest(x = shannon_amp$Shannon,
+ g = shannon_amp$Group4,
+ dist="tukey")
> adj_tukey

Nemenyi’s test of multiple comparisons for independent samples (tukey).

mean.rank.diff pval
KO_POST-KO_BEFORE -17.4 0.0048 **
WT_BEFORE-KO_BEFORE -4.6 0.8153
WT_POST-KO_BEFORE -19.2 0.0014 **
WT_BEFORE-KO_POST 12.8 0.0683 .
WT_POST-KO_POST -1.8 0.9860
WT_POST-WT_BEFORE -14.6 0.0269 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following R commands conduct multiple comparisons using the Benjamini and Hochberg method for adjusting P-values:
> library(FSA)
> adj_bh = dunnTest(shannon_amp$Shannon ~ shannon_amp$Group4,data=shannon_amp, method="bh")
> adj_bh
Dunn (1964) Kruskal-Wallis multiple comparison
P-values adjusted with the Benjamini-Hochberg method.
Comparison Z P.unadj P.adj
1 KO_BEFORE - KO_POST 3.3282 0.0008742 0.002623
2 KO_BEFORE - WT_BEFORE 0.8799 0.3789374 0.454725
3 KO_POST - WT_BEFORE -2.4483 0.0143534 0.021530
4 KO_BEFORE - WT_POST 3.6724 0.0002402 0.001441
5 KO_POST - WT_POST 0.3443 0.7306271 0.730627
6 WT_BEFORE - WT_POST 2.7926 0.0052289 0.010458
It’s also possible to use the function pairwise.wilcox.test() to calculate pairwise comparisons between group levels and correct for multiple testing as below:
> pairwise.wilcox.test(shannon_amp$Shannon, shannon_amp$Group4,
+ p.adjust.method = "BH")
Pairwise comparisons using Wilcoxon rank sum exact test
data: shannon_amp$Shannon and shannon_amp$Group4
KO_BEFORE KO_POST WT_BEFORE
KO_POST 0.003 - -
WT_BEFORE 0.423 0.053 -
WT_POST 0.00006 0.853 0.002
P value adjustment method: BH

9.5 Alpha Diversity Analysis in QIIME 2

QIIME 2’s diversity analyses are available through the q2-diversity plugin, which supports calculating alpha and beta diversity metrics, performing related statistical tests, and generating interactive visualizations. Both alpha and beta diversity measures, as well as phylogenetic and non-phylogenetic diversity measures, can be generated with a single command qiime diversity core-metrics-phylogenetic command. However, to show QIIME 2’s capabilities, in this section we will first use the core-metrics-phylogenetic command to calculate alpha and beta diversity metrics together, and then we will illustrate how to calculate each alpha diversity using individual alpha command. In Chap. 10, we will illustrate how to calculate each beta diversity using beta command. For diversity analysis, a sequence/feature/OTU table and a phylogenetic tree are required. And a sampling depth for random subsampling needs to be specified.

9.5.1 Calculate Alpha Diversity Using Core-Metrics-Phylogenetic Method

Here, we use a single command to generate all the phylogenetic and non-phylogenetic diversity measures available in QIIME 2. The alpha diversity measures calculated by the qiime diversity core-metrics-phylogenetic method include Shannon’s diversity, Pielou’s evenness, Faith’s phylogenetic diversity, and observed OTUs. The beta diversity measures calculated by this command include weighted/unweighted UniFrac (Lozupone and Knight 2005; Lozupone et al. 2007), Bray-Curtis, and Jaccard indices.

Because most diversity metrics are sensitive to different sampling depths across different samples, in order to compute samples with uneven sequencing depth, the core-metrics-phylogenetic method will randomly subsample or rarefy the counts from each sample to a user-specified value (i.e., sampling depth) when QIIME 2 computes these alpha and beta diversity metrics. Thus, it is crucial to provide the even sampling (i.e., rarefaction) depth to the parameter --p-sampling-depth. The algorithm works this way: The value provided to --p-sampling-depth will set up a total count for each sample in the resulting table when subsampling the counts in each sample without replacement. For example, suppose we specify 1000 for this parameter; then if the total count for any sample(s) are smaller than 1000, those samples will be dropped from the diversity analysis. The recommendation of choosing this value is to review the information presented in the Feature table.qzv QIIME 2 artifact and in particular the Interactive Sample Detail tab in that visualization. Typically, choose a value that is as high as possible (so that we can retain more sequences per sample) while excluding as few samples as possible.

Example 9.12: Mouse Gut Microbiome Data, Example 9.2 Cont.

Qiime tools view FeatureTableMiSeq_SOP.qzv.

Above command activates qiime 2 view. In the Interactive Sample Detail tab, we can see that sample M1D141 has 1153 sequences; then there are 10 samples from M3D8 to M3D149; the sequences decrease from 815, 694, … to the fewest sequence 0. Here we will choose 1153 based on the number of sequences in the sample M1D141. The value of 1153 is considerably or relatively higher than the number of sequences in the samples that have fewer sequences (from 694 to 0). Thus, choosing this value we can retain more sequences per sample while excluding few samples:

source activate qiime2-2022.2
cd QIIME2R-Bioinformatics
qiime diversity core-metrics-phylogenetic \
--i-phylogeny RootedTreeMiSeq_SOP.qza \
--i-table FeatureTableMiSeq_SOP.qza \
--p-sampling-depth 1153 \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--output-dir CoreMetricsResults

A set of 17 lines of data written in the green text represents the q z a format files saving locations.

Above command generates alpha and beta diversity measures at 1153 sequences per sample. It also generates PCoA plots automatically. QIIME artifact files generated by this command can be opened by changing .qza to .gz and then double clicking to unzip. The measures are .tsv files, for example, the Shannon diversity for the first few samples like below:
shannon
F3D0 5.597038025432271
F3D1 5.7812812011984835
F3D11 4.175566318851661

9.5.2 Calculate Alpha Diversity Using Alpha Method

In QIIME 2, a user-specified alpha diversity metric for all samples in a feature table can be calculated using the alpha and alpha-phylogenetic methods for abundance-based (i.e., non-phylogenetic) alpha diversity metrics and phylogenetic alpha diversity metrics, respectively. Here, we continue to illustrate most frequently used abundance-based alpha diversity metrics using the mouse gut microbiome data in Example 9.12.

9.5.2.1 Shannon Index

The following commands calculate Shannon’s index which calculates richness and diversity using a natural logarithm and accounts for both abundance and evenness of the taxa present in samples:
source activate qiime2-2022.2
QIIME2R-Bioinformatics
qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric shannon\
--o-alpha-diversity ShannonVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, S h a n n o n V e c t o r dot q z a.

The input parameter --i-table is used to specify the feature table containing the samples for which the alpha diversity metric will be calculated. The parameter --p-metric is used to specify the alpha diversity metric to be calculated. The output parameter --o-alpha-diversity is used to specify the output file, which you can give a meaningful name.

9.5.2.2 Chao1 Index and Chao1 Confidence Interval

The following commands calculate Chao 1 index which estimates diversity from abundant data and estimates number of rare taxa missed from under-sampling:

qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric chao1\
--o-alpha-diversity Chao1Vector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, C h a o 1 V e c t o r dot q z a.

The following commands calculate chao1 confidence interval for richness estimator:
qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric chao1_ci \
--o-alpha-diversity Chao1CIVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, C h a o 1 C I V e c t o r dot q z a.

9.5.2.3 Observed Features

The following commands calculate observed features (here, OTUs):

qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric observed_features\
--o-alpha-diversity ObservedFeaturesVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, O b s e r v e d f e a t u r e s v e c t o r dot q z a.

9.5.2.4 Simpson Index and Simpson’s Evenness

The following commands calculate Simpson’s index to measure the relative abundance of the different species making up the sample richness:
qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric simpson \
--o-alpha-diversity SimpsonVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, S i m p s o n v e c t o r dot q z a.

The following commands calculate Simpson’s evenness, the diversity that account for the number of organisms and number of species:
qiime diversity alpha \.
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric simpson_e \
--o-alpha-diversity SimpsonEVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, S i m p s o n E v e c t o r dot q z a.

9.5.2.5 Pielou’s Evenness

The following commands calculate Pielou’s evenness to measure the relative evenness of species richness:
qiime diversity alpha \
--i-table FeatureTableMiSeq_SOP.qza \
--p-metric pielou_e \
--o-alpha-diversity PielouEVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, P i e l o u E v e c t o r dot q z a.

9.5.3 Calculate Alpha Diversity Using Alpha-Phylogenetic Method

QIIME 2 implements the phylogenetic alpha diversity metrics using the alpha-phylogenetic method.

The following commands calculate Faith’s phylogenetic diversity:
qiime diversity alpha-phylogenetic \
--i-table FeatureTableMiSeq_SOP.qza \
--i-phylogeny RootedTreeMiSeq_SOP.qza \
--p-metric faith_pd \
--o-alpha-diversity FaithPDVector.qza

A text reads, saved sample data, open bracket, Alpha diversity, closed bracket, to, F a i t h P D V e c t o r dot q z a.

The input parameter --i-phylogeny is used to specify the phylogenetic tree containing the tip identifiers that correspond to the feature identifiers in the table and is only used for the alpha-phylogenetic command for calculating phylogenetic diversity metrics.

9.5.4 Test for Differences of Alpha Diversity Between Groups

The following commands conduct significant testing of Shannon’s diversity for all categorical variables in sample metadata:
qiime diversity alpha-group-significance \
--i-alpha-diversity CoreMetricsResults/shannon_vector.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--o-visualization ShannonGroupSignificance.qzv

A text reads, saved visualization to, S h a n n o n G r o u p S i g n i f i c a n c e dot q z v.

The alpha-group-significance command generates boxplots of the alpha diversity values and runs all-group and pairwise Kruskal-Wallis tests, a nonparametric ANOVA to compare significant differences among all groups. We can activate qiime2.view to check the significant testing results. Three things are automatically done for each categorical data in metadata columns including alpha diversity boxplots, Kruskal-Wallis (all groups), and Kruskal-Wallis (pairwise). For example, for time variable, the results show that there are total 349 samples (208 for early and 141 for late times). Both all-group and pairwise Kruskal-Wallis show that later time has more Shannon diversity than early time with P-value of 0.003227 and Q-value of 0.003227. However, there is no difference of Shannon diversity between male and female mice with P-value of 0.826126 and Q-value of 0.826126. In this case, the results from all-group and pairwise Kruskal-Wallis are the same, and P-value is also equal to Q-value because the time variable only has two points. We can download the boxplots as SVG format or download raw data as TSV format. The Kruskal-Wallis testing results can be downloaded as CSV files.

The following commands conduct significant testing of Pielou’s evenness for all categorical variables in sample metadata:

qiime diversity alpha-group-significance \
--i-alpha-diversity CoreMetricsResults/evenness_vector.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--o-visualization EvennessGroupSignificance.qzv

A text reads, saved visualization to, E v e n n e s s G r o u p S i g n i f i c a n c e dot q z v.

The following commands conduct significant testing of Faith’s phylogenetic diversity for all categorical variables in sample metadata:

qiime diversity alpha-group-significance \
--i-alpha-diversity CoreMetricsResults/faith_pd_vector.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--o-visualization FaithPDGroupSignificance.qzv

A text reads, saved visualization to, F a i t h P D G r o u p S i g n i f i c a n c e dot q z v.

The following commands conduct significant testing of observed OTUs for all categorical variables in sample metadata:

qiime diversity alpha-group-significance \
--i-alpha-diversity
CoreMetricsResults/observed_features_vector.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--o-visualization ObservedFeaturesGroupSignificance.qzv

A text reads, saved visualization to, O b s e r v e d F e a t u r e s G r o u p S i g n i f i c a n c e dot q z v.

9.5.5 Alpha Rarefaction in QIIME 2

An alpha rarefaction analysis explores alpha diversity as a function of sampling depth, which is used to determine if an environment has been sequenced to a sufficient depth.

Alpha rarefaction is conducted by randomly subsampling the data at a series of sequence depths and plotting the alpha diversity metrics calculated from the random subsamples as a function of the sequencing depth. If a given sample has a plateau on the rarefaction curve, then it provides evidence that the sample has been sequenced to a sufficient depth to capture the majority of taxa.

In QIIME 2, we can use the qiime diversity alpha-rarefaction visualizer to conduct alpha rarefaction analysis. This visualizer will generate rarefaction curves using one or more alpha diversity metrics (by default based on Shannon diversity and observed features, i.e., OTUs measures) at multiple sampling depths, in steps between 1 (optionally controlled with --p-min-depth) and the value provided as --p-max-depth. In the case, if the phylogenetic tree is provided using the --i-phylogeny parameter, then this visualizer will also generate phylogenetic diversity-based curves.

Below we run the qiime diversity alpha-rarefaction visualizer to generate alpha rarefaction curves showing taxon accumulation as a function of sequence depth. Figure 9.10 can be reproduced using the following QIIME 2 commands:

2 line graphs plot Shannon and the number of samples versus sequencing depth. Graph 1 displays 2 lines that increase gradually and become a plateau. Graph 2 exhibits 2 stable lines, one at the top and the other at the bottom.

Fig. 9.10

The alpha rarefaction curves based on Shannon diversity with the results averaged by times

# Figure 9.10
qiime diversity alpha-rarefaction \
--i-table FeatureTableMiSeq_SOP.qza \
--i-phylogeny RootedTreeMiSeq_SOP.qza \
--p-max-depth 1153 \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--o-visualization AlphaRarefaction.qzv

A text reads, saved visualization to, A l p h a R a r e f a c t i o n dot q z v.

The above command generates two plots. The alpha rarefaction plot at top is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the slope of lines in the plot approaches zero or “level out” at some sampling depth along the x-axis, then it suggests collecting additional sequences (taxa) beyond that sampling depth likely to be either rare microorganism or the result of the sequencing error. In other words, the slope of lines does not approach zero; then it indicates either too few sequences were collected and thus the richness of the samples hasn’t been fully observed yet or the data remains having a lot of sequencing error; thus it is mistaken to use this data for diversity analysis.

The bottom plot is generated when sample metadata has grouping groups. It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth. The lines show the number of samples remains in each group when the feature table is rarefied to each sampling depth.

The essential function of the bottom plot is used to check whether the data presented in the top plot is reliable. Please note that we cannot specify a sampling depth that is larger than the total frequency that was obtained for a sample because it is impossible to calculate the diversity measure for this sample at the sampling depth.

Two patterns were noticed from this alpha rarefaction curve: One is that both early and late time points appear to plateau. The Shannon diversity measure does generally continue to increase as a function of the sequencing depth; however, the accumulation increases slowly, suggesting that we have sufficient sample sequence depth to have captured the majority of taxa present in the sample. Another pattern we can see from this plotting is that the late samples have a significantly higher Shannon diversity than the early time. This result is consistent with the Kruskal-Wallis test in Sect. 9.5.4.

9.6 Summary

In this chapter, we first introduced some most common used abundance-based alpha diversity measures and their calculations, including Chao 1 richness and abundance-based coverage estimator (ACE), Shannon diversity, Simpson diversity indices, Pielou’s evenness, and also three phylogenetic-based diversity measures: phylogenetic diversity (PD), phylogenetic entropy (PE), and phylogenetic quadratic entropy (PQE). We then illustrated three exploration tools for alpha diversity analysis: heatmap, boxplot, and violin plot. Next, we illustrated how to summarize and visualize alpha diversity measures and conduct statistical hypothesis testing of alpha diversity using Kruskal-Wallis test as well as perform multiple comparisons to adjust for P-values. Finally, we focused on alpha diversity analysis in QIIME 2 including how to (1) calculate alpha diversity using core-metrics-phylogenetic method, (2) calculate abundance-based alpha diversity using alpha method, (3) calculate Faith’s PD using alpha-phylogenetic method, (4) conduct statistical hypothesis testing of alpha diversity using Kruskal-Wallis test, and (5) perform multiple comparisons to adjust for P-values. We also illustrate how to generate alpha rarefaction curve in QIIME 2. In Chaps. 10 and 11, we will focus on investigating beta diversity analysis. In Chap. 10, we will introduce beta diversity metrics and their calculation using R and QIIME 2 as well as explore beta diversity using ordination methods. In Chap. 11, we will introduce statistical hypothesis testing of beta diversity.