© Springer Nature Switzerland AG 2023
Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_15

15. Linear Mixed-Effects Models for Longitudinal Microbiome Data

Yinglin Xia1   and Jun Sun 1
(1)
Department of Medicine, University of Illinois Chicago, Chicago, IL, USA
 

Abstract

Longitudinal microbiome data analysis can be categorized into two approaches: univariate longitudinal analysis and multivariate longitudinal analysis. Univariate analysis analyzes the change of one taxon or alpha diversity over time. Multivariate analysis directly analyzes the change of multiple taxa simultaneously or distance/dissimilarity (beta diversities) over time. This chapter introduces using the classical univariate linear mixed-effects models (LMMs) to analyze microbiome data. First, it describes linear mixed-effects models (LMMs), and then it introduces implementation of LMMs to identify the significant taxa using the nlme package, and model the diversity indices using the lme4 and LmerTest packages, respectively. Next, it introduces how to fit LMMs using QIIME 2 and provides some remarks on longitudinal microbiome studies based on LMMs.

Keywords
Linear mixed-effects models (LMMs) Fixed effect Random effect Hypothesis tests nlme package lme4 package LmerTest package Maximum likelihood (ML) AIC BIC Restricted maximum likelihood (REML) Generalized linear model (GLM) Autoregressive of order 1 [AR(1)] KRmodcomp() pbkrtest qiime longitudinal linear-mixed-effects Volatility analysis

We can categorize longitudinal microbiome data analysis into univariate and multivariate approaches. Univariate analysis analyzes one taxon per time or first summarizes microbiome abundances into alpha diversities and then analyzes these diversities one by one. Multivariate analysis either analyzes multiple taxa simultaneously or directly tests distance/dissimilarity (beta diversities). This chapter focuses on introducing how to use the classical univariate linear mixed-effects models (LMMs) to analyze microbiome data. First, we describe linear mixed-effects models (LMMs) (Sect. 15.1), then we introduce how to implement LMMs to identify the significant taxa using the nlme package (Sect. 15.2) and model the diversity indices using the lme4 and LmerTest packages (Sect. 15.3), respectively. In Sect. 15.4, we introduce how to fit LMMs using QIIME 2. In Sect. 15.5, we provide some remarks on longitudinal microbiome studies based on LMMs. Finally, we summarize the topics of this chapter (Sect. 15.6).

15.1 Introduction to Linear Mixed-Effects Models (LMMs)

In this section, we introduce the advantages of LMMs, the development of fixed and random effects model and definition of LMMs, as well as how to conduct statistical hypothesis testing and fit LMMs.

15.1.1 Advantages and Disadvantages of LMMs

The distinctive characteristic of longitudinal study is that the subjects are measured repeatedly during the study, permitting directly assess the changes in response variable over time (Fitzmaurice et al. 2004; Diggle et al. 2002). Thus, a longitudinal study can capture between-individual differences (heterogeneity among individuals) and within-subject dynamics. Linear mixed-effects models (LMMs, aka multi-level modeling) are an important class of statistical models incorporating fixed and random effects. LMMs as a statistical method can be used to analyze correlated or non-independent data. Such data are often collected in the settings of longitudinal/repeated measurements in which the same statistical units were repeatedly measured or clustered observations where the clusters of related statistical units were measured. Such data also arise from a multilevel/hierarchical structure.

Because the circumstances of the measurements often cannot be fully controlled, considerable variation among individuals in the number and timing of observations may exist, hence resulting in unbalanced datasets. Typically, such unbalanced datasets are challenge to be analyzed using a general multivariate model with unrestricted covariance structure (Laird and Ware 1982). Moreover, missing values often arise in longitudinal/repeated measurements. Traditional statistical methods such as repeated measure analysis of variance are difficult to deal with missing values.

Mixed effects models often have advantages in dealing with unbalanced datasets and missing values.

LMMs integrate two (hierarchical) levels of observations of longitudinal data, i.e., within-group (subject) and between-group (subject) in a single model (Harville 1977; Arnau et al. 2010) and allows a variety of variance/covariance structures or correlation patterns to be explicitly modeled, which provides the opportunity to accurately capture the individual profile information over time. Typically the fixed effects are used to model the systematic mean patterns (i.e., treatment conditions), while random effects are used to model two types of random components: the correlation patterns between repeated measures within subjects or heterogeneities between subjects or both (Lee et al. 2006). That is, random effects estimate the variance of the response variable within and among these groups to reduce the probability of false positives (Type I error rates) and false negatives (Type II error rates) (Crawley 2012; Harrison et al. 2018).

In summary, the desirable features of LMMs are that they are not required for balance in the data and allow explicitly modeling and analysis of between- and within-individual variation (Laird and Ware 1982). Thus, LMMs have the capability of modeling variance and covariance (random effects) which makes this method more preferred to the classical linear model (Xia and Sun 2021). However, LMMs have the major limitation compared to the general multivariate model: a special form of the covariance structure should be assumed (Laird and Ware 1982).

15.1.2 Fixed and Random Effects

The core feature of LMMs is that they incorporate fixed and random effects and hence can analyze the fixed and random effects simultaneously. A fixed effect refers to a parameter that does not vary, whereas a random effect represents a parameter that is itself a random variable. In other words, the distribution of these parameters (or “random effects”) vary over individuals (Laird and Ware 1982).

LMMs have a long history of development. In 1918 Ronald Fisher introduced the random effects models to study the correlations of trait values between relatives such as the correlation of parent and child (Fisher 1918; Laird and Ware 1982). Since the 1930s, researchers have emphasized taking account of random effects as accurately as possible when estimating experimental treatment effects (fixed effects) (Thompson 2008). In the 1950s, Charles Roy Henderson provided best linear unbiased estimates (BLUE) of fixed effects and best linear unbiased predictions (BLUP) of random effects (Robinson 1991; Henderson et al. 1959; McLean et al. 1991). Subsequently, Goldberger (1962), Henderson (Henderson 1963, 1973, 1984; Henderson et al. 1959), and Harville (1976a, b) developed mixed model procedures. The mixed model equations (Henderson 1950; Henderson et al. 1959) introduced by Charles Roy Henderson offers the base for a methodology that provides flexibility of fitting models with various fixed and random elements with the possible assumption of correlation among random effects (McLean et al. 1991; Thompson 2008).

15.1.3 Definition of LMMs

In matrix form a linear mixed model can be specified as:

$$ y= X\beta + Zu+\varepsilon, \kern0.36em u\sim {N}_q\left(0,G\right),\kern0.48em \varepsilon \sim {N}_n\left(0,R\right), $$
(15.1)
where y is a n × 1 column vector, the response (outcome) variable; X is a n × p design matrix of fixed-effects parameters (i.e., for the p predictor variables); β is a p × 1 column vector of the fixed-effects parameters representing regression coefficients (the βs); Z is the n × q design matrix for the q random effects; u is a q × 1 vector of q random effects (the random complement to the fixed β); and ε is a n × 1 column vector of the residuals, which is part of y that is not explained by the model,  + Zu. u and ε are independent and R = σ2I.

G is the variance-covariance matrix of the random effects. Given the fixed effects are directly estimated, the random effects are just deviations around the value in β, which is the mean. Thus it the variance that is left to estimate. Various structures of the variance-covariance matrix of the random effects can be assumed and specified. If only a random intercept is specified, then G is just a 1 × 1 matrix, which models the variance of the random intercept. If a random intercept and a random slope are assumed, then we can specify:

$$ G=\left[\begin{array}{cc}{\sigma}_{\mathrm{int}}^2& {\sigma}_{\operatorname{int},\mathrm{slope}}^2\\ {}{\sigma}_{\operatorname{int},\mathrm{slope}}^2& {\sigma}_{\mathrm{slope}}^2\end{array}\right] $$. If we assume that the random effects are independent, then the true structure is $$ G=\left[\begin{array}{cc}{\sigma}_{\mathrm{int}}^2& 0\\ {}0& {\sigma}_{\mathrm{slope}}^2\end{array}\right] $$. In Chap. 16 (Sect. 16.​2), we will introduce generalized linear mixed models (GLMMs).

15.1.4 Statistical Hypothesis Tests

In the statistical model building for experimental or observational data, LMMs can be used to serve two purposes (Bates 2005): (1) to characterize the dependence of a response (outcome variable) on covariate(s), conditioning on the treatment status and the time under treatment, and (2) to characterize the “unexplained” variation in the response.

A mixed-effects model (or more simply a mixed model) incorporates both fixed-effects terms and random-effects terms, which use a different way to model repeatable covariates and non-repeatable covariates (Bates 2005; Bates et al. 2015). The fixed-effects terms are used for a repeatable covariate to characterize the change in the response between different levels such as typically the change of the response over time under treatment or the difference of a response between treatment and the control groups. In contrast, the random-effects terms are used to characterize the variation induced in the response by the different levels of the non-repeatable covariate (a grouping factor).

Hypothesis Testing of the Fixed-Effects
To test a hypothesis on the fixed effects β, we can use likelihood-ratio test (LRT). The LRT can be constructed by specifying a nested (smaller) model with the same error structure as model (15.1).
$$ y={X}_0{\beta}_0+ Zu+\varepsilon . $$
(15.2)

The LRT statistic for the test of the hypothesis is given:

$$ {\displaystyle \begin{array}{l}{H}_0=\beta \in {\Theta}_{\beta_0},\\ {}{H}_a=\beta \in {\Theta}_{\beta },\end{array}} $$
where $$ {\Theta}_{\beta_0} $$ is a subspace of the parameter space Θβ of the fixed effects β. The log-likelihood ratio test statistic for the null hypothesisis is given by:
$$ {\lambda}_{\mathrm{LRT}}=2\left( ll-{ll}_0\right), $$
where ll and ll0 are the log-likelihoods of the models in Eqs. (15.1) and (15.2), respectively. Under the hull hypothesis, λLRT follows asymptotically a χ2 distribution. LRT is often used, but it tends to produce “anticonservative” P-values and sometimes quite badly so (Pinheiro and Bates 2000) (p. 88). Thus, an F test (although overall it does not exactly follow an F distribution) of the null hypothesis about the fixed effects β: H0 :  = 0, is considered in the literature with L is a contrast matrix of q =  rank (L) > 1. The test statistic for this null hypothesis is given:
$$ F=\frac{{\left(L\hat{\beta}\right)}^T{\left(L\hat{C}{L}^T\right)}^{-1}\left(L\hat{\beta}\right)}{q}, $$
(15.3)

We will describe more details about the LRT and other tests in Chap. 16 (Sect. 16.5.4 for the LRT).

15.1.5 How to Fit LMMs

To fit a LMM, we typically should start with a full model, i.e., including all independent variables of interest and some interaction terms of main factors. Then we can evaluate parameters and compare sub-models. In literature, some protocols and procedures have been proposed and discussed to avoid common and particularly statistical problems related to mixed-effects modeling and multi-model inference in ecology (Zuur et al. 2010; Zuur and Ieno 2016; Harrison et al. 2018). Here, we describe some main points based on our experience using LMMs and the discussion in the literature.
  • First of all, exactly determining whether a variable is a random or fixed is controversial.

Specifying a particular variable as fixed or random effect or even both simultaneously largely depends on experimental design and context. Generally fixed effects refer to all factors whose levels are experimentally determined or whose interest is in the specific effects of each level, such as treatments, time, other covariates, as well as treatment and time (or covariates) interactions. In contrast, random effects refer to all factors that qualify as sampling from a population or whose interest is in the variation among the population rather than the specific effects of each level. Random effects typically represent some grouping variables in both subject and unit levels (Breslow and Clayton 1993), such as individuals in repeated measurements, field trials, plots, blocks, batches.

Random effect models have several desirable properties, but it is more challenging to use them. Harrison et al. (2018) very well described four considerations when fitting random effects. We summarized them here:
  1. 1.

    Fitting random effects requires at least five “levels” (groups) for a random intercept term to achieve robust estimates of variance (Gelman and Hill 2007; Harrison 2015). The mixed model may not be able to estimate the among-group variance accurately for less than 5 levels (Harrison et al. (2018) because if the random effect models have less than 5 levels, then the variance estimate will either collapse to zero, making the model become an ordinary GLM (Gelman and Hill 2007) or be non-zero which is incorrect if the small number of groups that were sampled are not representative of true distribution of means (Harrison 2015). Thus, in practice, as a rule of thumb, the factors with fewer than 5 levels are considered as fixed effects, while the factors with numerous levels are considered as random effects in order to increase accurately estimating variance.

     
  2. 2.

    Models and especially the random slope models will be unstable if sample sizes across groups are highly unbalanced (Grueber et al. 2011).

     
  3. 3.

    It is difficult to determine the significance or importance of variance among groups.

     
  4. 4.

    Mis-specification of random effects or incorrectly parameterizing the random effects in the model could results in the model estimates being unreliable as ignoring the need for random effects altogether (Harrison et al. 2018).

     
  • Second, start with a full model to include all potential independent variables in the fixed model using maximum likelihood (ML).

Depending on experimental design and hypothesis, include as many as possible predictors and their interactions in the fixed-effect model. A linear regression (ordinary least squares) model can be fitted to compare with the fixed effect only model. The fixed-effect model is fitted under the statistical framework of ML to find the parameters of a model that maximizes the probability of the observed data (the likelihood). It is suggested to use information criteria (e.g., AIC and/or BIC, see Chap. 16 for details) to compare the model fit and perform variable selection. Because LRT is unreliable for small to moderate sample sizes, it is not recommended for testing fixed effects in GLMMs (Pinheiro and Bates 2000). To test fixed effects or compare any models with different fixed effects using LRT, the model is required to be fitted with ML too. Based on the significant fixed effects using ML estimation, an optimal fixed model structure will be determined, including the important main effects and their interactions. For more details on model selection in GLMMs, please see Chaps. 16 and 17.
  • Third, fit the random model using restricted maximum likelihood (REML) to optimize the random structure.

The goal of this step is to find the optimal variance structure in terms of heterogeneity. Thus, it is required to fit the model with REML to estimate the random-effect parameters (i.e., standard deviations) over the averaged values of the fixed-effect parameters. Because REML assumes that the fixed effects are unknown, thus generally REML estimation is less biased compared to corresponding ML estimation (Veroniki et al. 2016; Zuur et al. 2009). An identical generalized linear model (GLM) with REML can be fitted to serve as a reference for the fitted LMMs. Based on the significant random effects using REML estimation, an optimal random (variance) model structure will be determined. That is, the random intercept, random slopes and crossed or nested effect can be specified as the random effects. An error structure can also be chosen.
  • Finally, refit the optimal fixed and random models using REML, and validate and interpret the modeling results.

After updating (refitting) optimal model with REML, we can check the differences in the main effects and effects in interaction terms caused by the introduction of random effects. A diagnosing model fit procedure can be conducted to validate the fitted models. The reader is referred to Chap. 17 for details. We then can interpret the results based on the prior hypotheses.

15.2 Identifying the Significant Taxa Using the nlme Package

In this section, we introduce LMMs in microbiome research and illustrate how to fit LMMs to identify significant microbial taxa.

15.2.1 Introduction to LMMs in Microbiome Research

The microbiome is inherently dynamic, driven by interactions with the host and the environment, and varies over time. Longitudinal microbiome data analysis can enhance our understanding of short- and long-term trends of microbiome by intervention, such as diet, and the development and persistence of chronic diseases caused by microbiome. Thus, longitudinal microbiome data analysis provides rich information on the profile of microbiome with host and environment interactions (Xia et al. 2018a).

In the literatures of longitudinal statistical analysis of physical, biological, psychological, social, and behavioral sciences, the linear mixed-effects models (LMMs) (Laird and Ware 1982) are often used. The classical LMM method was also one of the first models that were used to analyze longitudinal microbiome data (e.g., Rosa et al. 2014; Kostic et al. 2015; DiGiulio et al. 2015; Wang et al. 2015) and currently is still used in microbiome and multi-omics literature (Lloyd-Price et al. 2019). The reasons that LMMs have been chosen to analyze longitudinal microbiome data mainly because (1) LMMs equip with fixed- and random-effects components, which provides a standardized and flexible approach to model both fixed- and random-effects. (2) LMMs can jointly model measures over all the time points to account for time-dependent correlations in longitudinal microbiome study designs. Specifically because (3) LMM method was an established methodology to remove the effects of fixed- and random-effect confounding variables (Ernest et al. 2012; Xia 2020) in the areas of microarray, genome-wide association studies (GWAS), and metabolomics (Fabregat-Traver et al. 2014; Zhao et al. 2019), which motivates its application to microbiome data.

15.2.2 Longitudinal Microbiome Data Structure

Microbiome data (i.e., OTUs/ASVs table) generated by each sample through either the 16S rRNA gene sequencing or the shotgun metagenomics sequencing consist of read counts of taxa at certain taxonomic levels, such as kingdom, phylum, class, order, family, genus, and species. In longitudinal microbiome study, each subject are measured at multiple time points (i.e., samples). Assume that there are n subjects, and subject i is measured at ni time points tij, j = 1, …, ni and i = 1, …, n. Denote Cijm as read counts for subject i measured at time j for taxon m, m = 1, …, m, and denote the total sequence read counts for subject i at time j as Tij, which is referred to as library size or depths of coverage; then a longitudinal microbiome data structure can be summarized in Table 15.1.

In Table 15.1, Xi denotes clinical or environmental variables for each subject. Under this data structure and notations, we can formulate a longitudinal microbiome model to identify the associations between the microbiome counts and the host trait of interest (i.e., covariate variables) and characterize the time trends of microbiome abundance both within subjects and between subjects.
Table 15.1

Longitudinal microbiome data structure

Subject-ID

Taxon 1

Taxon 2

Taxon m

Total reads

Covariate

Time

Subject 1

C111

C112

C11m

T11

X1

t11

Subject 1

C121

C122

C12m

T12

X1

t12

Subject 1

C131

C132

C13m

T13

X1

t13

Subject 2

C211

C212

C21m

T21

X2

t21

Subject 2

C221

C222

C22m

T22

X2

t22

Subject 2

C231

C232

C23m

T23

X2

t23

Subject n

Cn11

Cn12

Cn1m

Tn1

Xn

Tn1

To analyze the associations between the microbiome and the host trait of interest (i.e., covariate variables), two approaches can be applied via LMMs: (1) fit LMMs using the read counts as the outcome and (2) fit LMMs using the diversity index as the outcome. The LMMs can be fitted using the nlme (current version 3.1-157, March 2022) and lme4 (current version 1.1-29, April 2022) packages. Below we illustrate their uses through our dataset of breast cancer QTRT1 mouse gut microbiome.

15.2.3 Fit LMMs Using the Read Counts as the Outcome

When applying LMM to fit microbiome and multi-omics abundance data, raw microbiome abundances should be transformed or normalized by an appropriate method (Xia 2020): (1) arcsine square root transformation (Kostic et al. 2015; Rosa et al. 2014), (2) log-transformation (with pseudocount 1 for zero values) (Lloyd-Price et al. 2019), and (3) log-transformation with no pseudocount for expression ratios (non-finite values removed) (Lloyd-Price et al. 2019).

It is arguable whether or not these transformation/normalization methods are appropriate to each outcomes of interest. Especially adding a pseudocount 1 for zero values may even not make sense because it forces the microbiomes from “nothing” (absence) to “being”(presence).

Example 15.1: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1, Cont.

In this dataset, the mouse fecal samples were collected at two time points: pretreatment and posttreatment. Below we use LMMs to identify the significant change species over time.

> setwd("~/Documents/QIIME2R/Ch15_LMMs")
  • Step 1: Load taxa abundance data and meta data.

> otu_tab=read.csv("otu_table_L7_MCF7.csv",row.names=1,check.names=FALSE)
> meta_tab <- read.csv("metadata_QtRNA.csv",row.names=1,check.names = FALSE)
  • Step 2: Check taxa and meta data.

> # Check the numbers of taxa and samples
> nrow(otu_tab)#taxa
[1] 59
> ncol(otu_tab)#samples
[1] 40
> # Check the numbers of samples and variables
> nrow(meta_tab)#samples
[1] 40
> ncol(meta_tab)#variables
[1] 6
> head(otu_tab,3)
> head(meta_tab,3)
For the modeling of LMMs, the taxa need to be in the columns. We use the t() function to transform the otu_tab.
> otu_tab_t<-t(otu_tab)
> head(otu_tab_t,3)
> dim(otu_tab_t)
[1] 40 59
  • Step 3: Create analysis dataset-outcomes and covariates.

# Create outcome matrix or data frame
> colnames(otu_tab_t)
> colnames(meta_tab)
> class(otu_tab_t)
[1] "matrix" "array"
> class(meta_tab)
[1] "data.frame"
> outcome =otu_tab_t
> # Match sample IDs in outcome matrix or data frame and meta_tab data frame
> outcome = outcome[rownames(meta_tab), ]
> class(outcome)
[1] "matrix" "array"
> class(meta_tab)
[1] "data.frame"
> head(outcome,3)
> yy = as.matrix(outcome)
> yy = ifelse(is.na(yy), 0, yy) #Exclude missing outcome variables if any
> dim(yy)
[1] 40 59

There are 59 taxa in otu_tab. To effectively model the outcomes, we first reduce the number of taxa in the analysis.

In the literature, there are no unique procedures to filter abundance taxa data before modeling. In general, the unclassified taxa were removed, then either read counts or relative abundance data were used. Some researchers renormalized the data after the unclassified taxa were removed, while others directly used the data but adjusted the total read counts during the modeling. Some researchers consider low read counts contribute less to outcome results (although this argument underestimates the functional analysis or the role of rare taxa) and hence removed the low read counts based on a cutoff value. Because too few zeros and too many zeros in the abundance data may cause the models be inaccurate, most models used the proportion of zeros to filter abundance taxa data. Some even used double functions (sum() and quantile() functions) (Xia et al. 2018b) to filter the data. Actually, such arbitrary data preprocessing is an unapproved practice in microbiome research. We should recognize that it is a limitation of current microbiome research.

Here, we use sum() and quantile() functions to double filter the taxa that results in 17 taxa to be included in the analysis.

> # Filter abundance taxa data
> # Here we use double filters
> filter1 <- apply(yy, 2, function(x){sum(x>0)>0.4*length(x)})
> filter2 <- apply(yy, 2, function(x){quantile(x,0.9)>1})
> filter12 <-yy[,filter1&filter2]
> cat('after filter:','samples','taxa',dim(filter12),'\n')
after filter: samples taxa 40 17
> #cat(colnames(filter12),'\n')
> colnames(filter12)
[1] "D_6__Anaerotruncus sp. G3(2012)"
[2] "D_6__Clostridiales bacterium CIEAF 012"
[3] "D_6__Clostridiales bacterium CIEAF 015"
[4] "D_6__Clostridiales bacterium CIEAF 020"
[5] "D_6__Clostridiales bacterium VE202-09"
[6] "D_6__Clostridium sp. ASF502"
[7] "D_6__Enterorhabdus muris"
[8] "D_6__Eubacterium sp. 14-2"
[9] "D_6__gut metagenome"
[10] "D_6__Lachnospiraceae bacterium A4"
[11] "D_6__Lachnospiraceae bacterium COE1"
[12] "D_6__mouse gut metagenome"
[13] "D_6__Mucispirillum schaedleri ASF457"
[14] "D_6__Parabacteroides goldsteinii CL02T12C30"
[15] "D_6__Ruminiclostridium sp. KB18"
[16] "D_6__Staphylococcus sp. UAsDu23"
[17] "D_6__Streptococcus danieliae"

If the meta data are incomplete, we can use the complete.cases() function to subset the data and apply complete cases.

Now we can define outcome matrix or data frame and determine the final taxa in the analysis.
> # Match the number of taxa to the filter12 dataset in the analysis
> yy = filter12[rownames(meta_tab), ]
> dim(yy)
[1] 40 17
> # Create a matrix to hold the taxa
> taxa_all <- colnames(filter12)
> taxa_all
The following commands are used to specify covariates for the model.
> N = meta_tab[, "Total.Read"] # total reads
> mean(N); sd(N)
[1] 55619
[1] 7524
> mean(log(N)); sd(log(N))
[1] 10.92
[1] 0.1417
> subject = meta_tab[, "MouseID"]
> group = meta_tab[, "Group"]
> time = meta_tab[, "Time"]
  • Step 4: Run the lme() function in the nlme package to implement LMMs.

The lme () function is available in the R package nlme, which was developed to fit linear mixed models of the form described in Pinheiro and Bates (2000) (Pinheiro and Bates 2000, 2006).

We can assume independence of correlation matrix by setting R matrix as an identity matrix, which is the most simplified structure. However, various correlation matrices can be specified (Pinheiro and Bates 2000, 2006) such as autoregressive of order 1 [AR(1)] or continuous-time AR(1). The lme () function allows for multiple and correlated group-specific (random) effects (the argument random) and various types of within-group correlation structures (the argument correlation) described by corStruct in the nlme package.

Below we fit LMMs with main host factors: group and time as well as their interaction as fixed effects and random intercept only (Group*Time, random = ~ 1|Subject). Other LMMs also can be fitted such as with main host factors: group, time, and their interaction as fixed effects and random intercept only and specifying correlation matrix (Group*Time, random = ~ 1|Subject, correlation = corAR1()).

First we need to create list() and matrix to hold the objects of modeling results, which could be used as further processing the results and model comparisons. Then we need to call library(nlme) and run the lme() function iteratively to fit multiple taxa (outcome variables).

> yy <- yy[, colSums(is.na(yy)) == 0]
> # Create list() and the objects to hold the modeling results
> mod= list()
> rest1 = rest2 = rest3 = matrix(NA, ncol(yy), 1)
> library(nlme)
> for (j in 1:ncol(yy)){
+ tryCatch({
+ #filter12= na.omit(filter12)
+ y = yy[, j]
+ yt = asin(sqrt(y/N))
+ df = data.frame(y=y, Group=group, Time=time, Subject=subject)
+ mod= lme(yt ~ Group*Time, random = ~ 1|Subject,data=df)
+ rest_tab = summary(mod)$tTable
+ rest1[j, ] = summary(mod)$tTable[2, 5]
+ rest2[j, ] = summary(mod)$tTable[3, 5]
+ rest3[j, ] = summary(mod)$tTable[4, 5]
+ }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
+ }
We can check the last outcome results using the object rest_tab and summary(mod).
> rest_tab
Value Std.Error DF t-value p-value
(Intercept) 0.0140953 0.002828 36 4.98473 1.574e-05
Group -0.0003452 0.003999 36 -0.08631 9.317e-01
TimePost 0.0043009 0.003999 36 1.07551 2.893e-01
Group:TimePost -0.0012754 0.005655 36 -0.22551 8.229e-01
> summary(mod)
  • Step 5: Adjust P-values.

> all_lmm <- as.data.frame(cbind(rest1, rest2,rest3))
> # Adjust p-values
> Group_lmm_adj <-round(p.adjust(rest1,'fdr'),4)
> TimePost_lmm_adj <-round(p.adjust(rest2,'fdr'),4)
> Group.TimePost_lmm_adj <-round(p.adjust(rest3,'fdr'),4)
> all_lmm_adj <- as.data.frame(cbind(taxa_all,Group_lmm_adj,TimePost_lmm_adj,Group.TimePost_lmm_adj))
  • Step 6: Write table to contain the results (Table 15.2).

Table 15.2

Modeling results from linear mixed effects model based on QtRNA data

 

taxa_all

Group_Imm_adj

TimePost_Imm_adj

Group.TimePost_Imm_adj

1

D_6__Anaerotruncus sp. G3(2012)

0.5954

2E-04

0.8432

2

D_6__Clostridiales bacterium CIEAF 012

0.9589

0.6811

0.4371

3

D_6__Clostridiales bacterium CIEAF 015

0.4575

2E-04

0.3968

4

D_6__Clostridiales bacterium CIEAF 020

0.3038

0.1708

0.1857

5

D_6__Clostridiales bacterium VE202-09

0.9589

0.2281

0.4862

6

D_6__Clostridium sp. ASF502

0.8211

0.3766

0.5562

7

D_6__Enterorhabdus muris

0.1482

0.8121

0.3488

8

D_6__Eubacterium sp. 14-2

0.0654

6E-04

0.1857

9

D_6__gut metagenome

0.8211

0.8808

0.5562

10

D_6__Lachnospiraceae bacterium A4

0.8309

0.6208

0.4494

11

D_6__Lachnospiraceae bacterium COE1

0.1329

0.4969

0.0277

12

D_6__mouse gut metagenome

0.4186

3E-04

0.8432

13

D_6__Mucispirillum schaedleri ASF457

0.0644

0.7033

0.1857

14

D_6__Parabacteroides goldsteinii CL02T12C30

0.353

0.0026

0.9993

15

D_6__Ruminiclostridium sp. KB18

0.8211

0.0026

0.1857

16

D_6__Staphylococcus sp. UAsDu23

0.1482

6E-04

0.3249

17

D_6__Streptococcus danieliae

0.9589

0.4471

0.8743

> # Table 15.2
> # Write results table
> # Make the table
> library(xtable)
> table <- xtable(all_lmm_adj,caption = "Table of significant taxa",lable="sig_taxa_table")
> print.xtable(table,type="html",file = "LMMs_Table_Model_QtRNA.html")
> write.csv(all_lmm_adj ,file = paste("LMMs_Table_Model_QtRNA.csv",sep = ""))

15.3 Modeling the Diversity Indices Using the lme4 and LmerTest Packages

In this section, we introduce the lme4 and lmerTest packages and illustrate how to fit LMMs to analyze the diversity indices.

15.3.1 Introduction to the lme4 and lmerTest Packages

The lme4 package (Bates 2005; Bates et al. 2015) and its lmer() function were developed by Bates and his colleagues to provide more flexible fitting of linear mixed models and provide extensions to generalized linear mixed models as well. Compared to the lme () function, the lmer() function fits a larger range of models and is more reliable and faster, but the model specification has been changed slightly. For the details on how to use the lmer() function to fit linear mixed models, and the differences between lmer() and lme(), the interested reader is referred to the cited references for description of the methods (Bates 2005; Bates et al. 2015).

The R package lmerTest (Kuznetsova et al. 2017) (latest version 3.1-3, October 2020) was developed to facilitate the functionality of the “lmerMod” class of the lme4 package, including:
  • Obtain the P-values for the F-and t-tests for objects returned by the lmer() through overloading the anova() and summary() functions of fixed effects.

  • Implement the Satterthwaite’s method for approximating degrees of freedom for the t-and F-tests.

  • Implement the construction of Type I-III ANOVA tables.

  • Obtain the summary and the anova table using the Kenward-Roger approximation for denominator degrees of freedom by using the KRmodcomp() function in the pbkrtest package.

  • Perform backward elimination (a step-down model building method) of both random and fixed non-significant effects.

  • Calculate population means and perform multiple comparison tests as well as plot facilities.

The lme4 package provides F-statistics and the t-statistics in anova() and summary() functions, respectively; however, the P-values for the corresponding F- and t-tests are not provided. The lmerTest package implements and wraps Satterthwaite’s method (Giesbrecht and Burns 1985; Hrong-Tai Fai and Cornelius 1996) into anova() and summary() functions for the object returned by lmer(). Satterthwaite’s method has been implemented in SAS software (SAS Institute Inc. 1978, 2013). The SAS users should be familiar with Satterthwaite’s method. As an alternative, the Kenward-Roger approximation method is also available, which is integrated through the KRmodcomp() function of the pbkrtest package (Halekoh and Højsgaard 2014, 2021). For details, the interested reader is referred to the paper of lmerTest package (Kuznetsova et al. 2017).

15.3.2 Fit LMMs Using the Diversity Index as the Outcome

Example 15.2: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 15.1 Cont.

In this subsection, we use the same dataset from Example 15.1 to assess the association between microbiome diversity and group status (Genotype) by setting the Chao 1 richness index as the outcome and the group and time as the fixed-effect terms.

We fit the LMMs using the R package lme4 (Bates et al. 2015) and perform the corresponding statistical test using the R package lmerTest (Kuznetsova et al. 2017).

For illustration, we select the Chao 1 richness index as the response variable and the group, time, and their interaction as the fixed-effect terms. We also consider group and group by time interaction as random terms. We use the following five steps to perform LMMs and illustrate the functionality of lme4 and LmerTest packages.

The first two steps have done in Chap. 9 (Example 9.​1), we duplicate them here for the reader’s convenience.
  • Step 1: Load taxa abundance data and meta data.

> setwd("~/Documents/QIIME2R/Ch15_LMMs")
> otu_tab <- read.csv("otu_table_L7_MCF7_amp.csv", check.names = FALSE)
> meta_tab <- read.csv("metadata_QtRNA_amp.csv", check.names = FALSE)
  • Step 2: Calculate alpha-diversity indices.

Here we choose Chao 1 richness index as response variable. Similarly, other diversity indices can be applied. Below we use the ampvis2 package to calculate Chao 1 richness index. It can be calculated using other R packages such as vegan and microbiome packages.

> # Combine otu-table and meta-table using the amp_load() function
> library(ggplot2)
> library(ampvis2)
> ds <- amp_load(otutable = otu_tab, metadata = meta_tab)
We can calculate all the alpha-diversity indices including Chao 1 and ACE by specifying richness=TRUE that are available in the ampvis2 package.
> options(width=110,digits=4)
> alpha_amp <- amp_alphadiv(ds,
+ measure =,
+ richness = TRUE,
+ rarefy = NULL
+ )
> head(alpha_amp,3)
SampleID Sample MouseID Genotype Group Time Group4 Total_Read Reads ObservedOTUs Shannon Simpson invSimpson Chao1 ACE
Sun040.BH5 Sun040.BH5 40 BH5 KO 1 Before KO_BEFORE 39824 39813 110 2.385 0.8270 5.779 141.0 156.1
Sun039.BH4 Sun039.BH4 39 BH4 KO 1 Before KO_BEFORE 42068 42056 112 2.378 0.8336 6.009 160.0 170.5
Sun027.BF2 Sun027.BF2 27 BF2 WT 0 Before WT_BEFORE 42738 42729 101 2.276 0.7784 4.512 153.8 162.9
  • Step 3: Run LMMs using the ANOVA method for objects returned by lmer() function.

> install.packages("lmerTest")
> library(lme4)
> library("lmerTest")
> chao1 <- lmer(Chao1 ~ Group*Time + (1 |Group), data = alpha_amp)

The lmerTest package provides Type I, II, and III ANOVA tables as defined in the SAS software (SAS Institute Inc. 1978). The Type II and III tables do not depend on the order in which the effects are entered in the model, whereas the Type I ANOVA table is order dependent which performs the sequential decomposition of the contributions of the fixed-effects. The Type I table is the one produced by the ANOVA method of the lme4 package (Bates et al. 2015; Kuznetsova et al. 2017).

For balanced cases, these three ANOVA tables give same results. By default the lmerTest package provides the Type III ANOVA table, while lme4 provides the sequential (Type I) ANOVA table.
> anova(chao1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Group 405 405 1 36 0.56 0.458
Time 592 592 1 36 0.82 0.370
Group:Time 2376 2376 1 36 3.30 0.077 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The columns DenDF and Pr(>F) refer to denominator degrees of freedom and P-values, which are calculated using the Satterthwaite’s method of approximation. Based on the P-values, the Chao 1 richness index for Group and Time interaction is marginally significant (P = 0.077).

> anova(chao1, type = 1)
Type I Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Group 405 405 1 36 0.56 0.4579
Time 6916 6916 1 36 9.61 0.0037 **
Group:Time 2376 2376 1 36 3.30 0.0775 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Time variable is statistically significant (P = 0.0037) and the Group by Time interaction is marginally statistically significant (P = 0.0775) from the Type I ANOVA table with Satterthwaite’s approximation.

We can require another type of ANOVA by changing the type argument and Kenward-Roger’s method for calculating the F-test. The following commands require the Type I ANOVA table with Kenward-Roger’s approximation.

> anova(chao1, type = 1, ddf = "Kenward-Roger")
Type I Analysis of Variance Table with Kenward-Roger's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Group 405 405 1 60.8 0.56 0.4559
Time 6916 6916 1 36.0 9.61 0.0037 **
Group:Time 2376 2376 1 36.0 3.30 0.0775 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Kenward-Roger’s method achieves the same results.
  • Step 4: Run LMMs using the summary method for objects returned by the lmer() function.

> summary(chao1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Chao1 ~ Group * Time + (1 | Group)
Data: alpha_amp

REML criterion at convergence: 348.2

Scaled residuals:
Min 1Q Median 3Q Max
-1.758 -0.710 -0.276 0.749 2.235
Random effects:
Groups Name Variance Std.Dev.
Group (Intercept) 10.8 3.28
Residual 719.4 26.82
Number of obs: 40, groups: Group, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 170.23 9.09 36.00 18.72 <2e-16 ***
Group -8.16 12.86 36.00 -0.63 0.530
TimePost 10.88 11.99 36.00 0.91 0.370
Group:TimePost 30.83 16.96 36.00 1.82 0.077 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Group TimPst
Group -0.707
TimePost -0.660 0.466
Group:TmPst 0.466 -0.660 -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
unable to evaluate scaled gradient
Hessian is numerically singular: parameters are not uniquely determined

In the output, additional columns of df and Pr(>|t|) in the fixed effects are added to the lme4 package. The df refers to degrees of freedom based on Satterthwaite’s approximation and Pr(>|t|) is the P-value for the t-test with df as degrees of freedom.

The following commands require the Kenward-Roger’s approximation.
> summary(chao1, dff="Kenward-Roger")
  • Step 5: Perform backward elimination using the step method for objects returned by the lmer() function.

To illustrate the step method for backward elimination, here we consider a full model including Group, Time, and their interaction as fixed-effect terms, as well as Group and Group by Time interaction as random-effect terms.

> chao1a <- lmer(Chao1 ~ Group*Time + (1 | Group)+ (1 | Group:Time), data = alpha_amp)
> b_step <- step(chao1a)
> b_step

Backward reduced random-effect table:

Eliminated npar logLik AIC LRT Df Pr(>Chisq)
<none> 6 -174 360
(1 | Group:Time) 1 5 -174 358 5.68e-14 1 1
Backward reduced fixed-effect table:
Eliminated Df Sum of Sq RSS AIC F value Pr(>F)
Group:Time 1 1 2376 28273 268 3.30 0.0775 .
Group 2 1 526 28800 267 0.69 0.4119
Time 0 1 6916 35716 274 9.13 0.0045 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model found:
Chao1 ~ Time

The reduced model showed that Group and Time interaction in fixed-effect terms is marginally significant (P = 0.0775) and Time is significant (P = 0.0045).

15.4 Implement LMMs Using QIIME 2

Example 15.3: Mouse Gut Microbiome Data, Example 9.2, Cont.

In Chap. 9 (Sect. 9.​5.​4), we used this study to illustrate how to obtain core alpha diversities via QIIME 2 and then we used pairwise Kruskal-Wallis test to compare Sex by Time variables interaction. There are total 349 samples (208 for early and 141 for late times). The results showed that later time has larger Shannon diversity than early time with P-value of 0.003227 and Q-value of 0.003227. In Chap. 9, we commented that Kruskal-Wallis test does not consider the dependency of the individuals/subjects and a longitudinal model is more appropriate. Here we use QIIME 2 to implement LMMs to test whether Shannon diversity index changed over time and had gender difference in this study.

15.4.1 Introduction to the QIIME Longitudinal Linear-Mixed-Effects Command

Linear mixed-effects models (LMMs) are more flexible in modeling the association between a single response variable and one or more independent variables in a longitudinal or repeated-measures design. The qiime longitudinal linear mixed-effects perform LMMs to evaluate the covariates effects of “group_columns” and “random_effects” to a single dependent variable (“metric”) and plot line plots of each group column. To implement LMMs, at least one numeric state-column (e.g., time variable) and one or more comma-separated group-columns need to be specified as the fixed effects, i.e., independent variables. These fixed effects may be categorical or numeric metadata columns. After implementing this model, regression plots of the response variable (“metric”) are plotted as a function of the state column and each group column.

15.4.2 Fit LMMs in QIIME 2

First, we need activate QIIME 2. Open the terminal and type: source activate qiime2-2022.2, which is latest version of QIIME 2. When the terminal appears “(qiime2-2022.2),” it indicates qiime2-2022.2 version has been activated. We can work with this version now.

We already create a folder LMM in the path:QIIME2-Biostatistics/longitudinal/LMM, so we can access to this folder using cd QIIME2R-Biostatistics/longitudinal/LMM. If this folder did not exist in your computer, you can create one: mkdir QIIME2R-Biostatistics/longitudinal/LMM. The following commands perform LMMs by specifying Sex and Time as fixed-effects terms, but not requiring the random-effects (Fig. 15.1 for Sex effects and Fig. 15.2 for Time effects).

A scatterplot of Shannon versus D P W plots female and male. An increasing line and an almost horizontal line mark the trend of male and female, respectively.

Fig. 15.1

Regression scatterplots of Shannon diversity by male and female mice

A scatterplot of Shannon versus D P W plots early and late. A decreasing line from (0, 4.9) to (375, 4.8) marks the trend of late, and a decreasing line from (0,4.8) to (55,4.7) marks the trend of early. Values are approximated.

Fig. 15.2

Regression scatterplots of Shannon diversity by early and later times

# Figures 15.1 and 15.2
qiime longitudinal linear-mixed-effects \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv\
--m-metadata-file shannon_vector.qza\
--p-metric shannon\
--p-group-columns Sex,Time\
--p-state-column DPW\
--p-individual-id-column StudyID\
--o-visualization LMMEffectsMiSeq_SOP.qzv

A text reads, saved visualization to, L M M E f f e c t s M i S e q underscore S O P dot q z v.

where the four columns are required: (1) sample metadata (here, SampleMetadataMiSeq_SOP.tsv) which contains individual-id-column; (2) state-column parameter: metadata column containing state (time) variable (here, DPW); (3) individual-id-column parameter, metadata column containing study IDs for individual subjects (here, StudyID), which indicates the individual subject/site that was sampled repeatedly; and (4) the o-visualization output column.

The p-metric is used to specify the response (dependent) variable column name, which must be located in the metadata or feature table files if the feature table is provided as input data. The p-group columns is used to specify the metadata columns as the independent covariates that are used to determine mean structure of “metric.” Several fixed-effects variables can be specified via a comma-separated list. The p-random-effects parameter can be added. The random-effects metadata columns are used to specify the independent covariates that are used to determine the variance and covariance structure (random effects) of “metric.” A random intercept for each individual is set by default, while to specify a random slope, set the variable in the p-state-column, in which the state-column value is passed as input to the random-effects parameter. Same to fixed effects, to specify several random effects, a comma-separated list of random effects can be provided in the p-random-effects column. The o-visualization output is used to name VISUALIZATION outputs. We here named it as LMMEffectsMiSeq_SOP.qzv.

By reviewing LMMEffectsMiSeq_SOP.qzv via qiime2 view, we can see that there were no statistically significant effects of sex (P = 0.088) or sampling period (early vs. late, P = 0.356) on Shannon diversity index, which confirmed the results that were originally reported in Schloss et al. (2012).

The qiime longitudinal linear-mixed-effects command returns several results, including (1) the input parameters at the top of the visualization, (2) the Model summary of descriptive information about the LMMs, (3) the main Model results, which summarizes the effects of each fixed effect (and their interactions) on the dependent variable (here, Shannon diversity). This table shows parameter estimates, including standard errors, z scores, P-values (P>|z|), and 95% confidence interval upper and lower bounds for each parameter, (4) the Regression scatterplots that categorized by each “group column” at the bottom of the visualization, with linear regression lines and 95% confidence interval in gray for each group, and (5) Projected Residuals, the scatterplots of fit vs. residual plots. The plots show the relationship between metric predictions for each sample (on the x-axis), and the residual or observation error (prediction - actual value) for each sample (on the y-axis). They are used for diagnostics. The roughly zero-centered residuals suggest a well-fitted model.

Below we add random-effects parameters and specify DPW as a random effect. For details of how specifying random effects, the reader is referred to Sect. 15.1.5 How to Fit LMMs.

qiime longitudinal linear-mixed-effects \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv\
--m-metadata-file shannon_vector.qza\
--p-metric shannon\
--p-group-columns Sex,Time\
--p-random-effects DPW\
--p-state-column DPW\
--p-individual-id-column StudyID\
--o-visualization LMMEffectsMiSeq_SOP_Random.qzv

A text reads, saved visualization to, L M M E f f e c t s M i S e q underscore S O P underscore R a n d o m dot q z v.

The above regression scatterplots in Figs. 15.1 and 15.2 just provide a quick summary information of the data. To interactively plot the longitudinal data, QIIME 2 recommends using the volatility plot.

15.4.3 Perform Volatility Analysis

Volatility analysis generates interactive line plots to visualize how a dependent variable is volatile over a continuous, independent variable (e.g., time) in group(s). Thus, a volatility plot is a good qualitative tool to identify outliers that disproportionately drive the variance within individuals and groups.

The input data can be metadata files (e.g., alpha and beta diversity artifacts) and FeatureTable[RelativeFrequency] tables. Different dependent variables can be plotted on the y-axis. The following commands examine how variance in Shannon diversity and Sex changes over DPW (day post weaning) (specified in the state-column) (Fig. 15.3).
# Figure 15.3
qiime longitudinal volatility \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv\
--m-metadata-file shannon_vector.qza\
--p-default-metric shannon \
--p-default-group-column Sex\
--p-state-column DPW\
--p-individual-id-column StudyID \
--o-visualization VolatilitySexLMMEffectsMiSeq_SOP.qzv

A line graph of metric column versus D P W plots lines for the female and male. The female line starts at (0, 5.4), fluctuates and drops to (160, 0.4), raises to (290, 5.7), and again drops. The male line starts at (0, 5.2), fluctuates, and ends at (375, 5.5). Values are approximated.

Fig. 15.3

Volatility plots of Shannon diversity by male and female mice

A text reads, saved visualization to, V o l a t i l i t y S e x L M M E f f e c t s M i S e q underscore S O P dot q z v.

The following commands examine how variance in Shannon diversity and Time (early and later periods) changes over DPW (day post weaning) (specified in the state-column) (Fig. 15.4).

A line graph of metric column versus D P W plots lines for early and late. The early line starts at (0, 5.3), fluctuates, and ends at (55, 4.8). The late line starts at (125, 5.3), drops to (160, 0.2), raises, and then decreases. All values are approximated.

Fig. 15.4

Volatility plots of Shannon diversities by early and later times

# Figure 15.4
qiime longitudinal volatility \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv\
--m-metadata-file shannon_vector.qza\
--p-default-metric shannon \
--p-default-group-column Time\
--p-state-column DPW\
--p-individual-id-column StudyID \
--o-visualization VolatilityTimeLMMEffectsMiSeq_SOP.qzv

A text reads, saved visualization to, V o l a t i l i t y T i m e L M M E f f e c t s M i S e q underscore S O P dot q z v.

By reviewing the resulted output visualization using QIIME 2 view, the plot displays a line plot on the left-hand side of and a panel of “Plot Controls” to the right-hand side.

The “Plot Controls” panel is used to interactively adjust variables and parameters for determining how “groups” and “individuals” values change across the specified state-column (a single independent variable). For details on how use the interactive features, the reader is referred to QIIME 2 documents for further references or practicing the output visualization in QIIME 2 view.

15.5 Remarks on LMMs

The development of new methods for analysis of microbiome data often takes the advantage of LMM framework such as in Zhang and Yi (2020) and Hui et al. (2017). However, LMM is not specifically designed for microbiome data, and hence it does not address issues due to characteristics of microbiome data. LMM method is often criticized for using to analyze microbiome data because (1) it needs to transform absolute abundance read account into the relative abundance via a transformation method (e.g., arcsine square root transformation) before fitting the model, and most often treats transformed microbiome abundance as normally distributed responses (such as in Srinivas et al. 2013; Rosa et al. 2014; Leamy et al. 2014; Wang et al. 2015); (2) it does not explicitly handle the excess zeros in the data (Chen and Li 2016), and thus cannot fit the model with zero-inflation and over-dispersion to address the sparsity issue (Zhang and Yi 2020). Thus, directly applying LMM method to analyze microbiome data may be underpowered and have potential bias to identify the dynamic microbiome effects.

Several proposed longitudinal microbiome models have compared their methods to LMM method (Chen and Li 2016; Zhang et al. 2018; Zhang and Yi 2020).

15.6 Summary

This chapter focused on LMMs to microbiome data analysis. Section 15.1 covered the general topics of LMMs including the advantages and disadvantages of using LMMs, definitions of fixed and random effects, and its formulation, statistical hypothesis testing, and the procedures of fitting LMMs. Section 15.2 described how to identify the significant taxa with the nlme package. A general introduction to LMMs in microbiome research and longitudinal microbiome data structure were also described. Section 15.3 introduced the lme4 and LmerTest packages and how to use these two packages to analyze the diversity indices. Section 15.4 described how to implement LMMs in QIIME 2, including introduction to the QIIME longitudinal linear-mixed-effects command and volatility analysis. Section 15.5 provided some general remarks on LMMs in microbiome data.

Chapter 16 will describe the generalized linear mixed models (GLMMs), and Chap. 17 will introduce GLMMs and zero-inflated GLMMs for longitudinal microbiome data.