© Springer Nature Switzerland AG 2023
Y. Xia, J. Sun, Bioinformatic and Statistical Analysis of Microbiome Data, https://doi.org/10.1007/978-3-031-21391-5_17

17. Generalized Linear Mixed Models for Longitudinal Microbiome Data

Yinglin Xia1   and Jun Sun 1
(1)
Department of Medicine, University of Illinois Chicago, Chicago, IL, USA
 

Abstract

Chapter 16 investigated some general topics of generalized linear mixed-effects models (GLMMs). This chapter focuses on some newly developed GLMMs that account for correlated observations with random effects while accommodating over-dispersion and zero-inflation. First, it reviews and discusses some general issues of GLMMs in microbiome research. Then, it introduces three GLMMs that model over-dispersed and zero-inflated longitudinal microbiome data using (1) the glmmTMB package; (2) the GLMMadaptive package; and (3) the NBZIMM package for fast zero-inflated negative binomial mixed modeling (FZINBMM), which was specifically designed for analyzing longitudinal microbiome data. Finally, it provides some remarks on fitting GLMMs.

Keywords
Generalized linear mixed models (GLMMs) · glmmTMB package · GLMMadaptive package · Fast zero-inflated negative binomial mixed modeling (FZINBMM) · NBZIMM package · Over-dispersed · Zero-inflated · pscl package · Zero-hurdle · Negative binomial (NB) model · Over-dispersion · Zero-inflated negative binomial (ZINB) · Zero-hurdle negative binomial (ZHNB) · Poisson · Zero-inflated Poisson (ZIP) · Zero-hurdle Poisson (ZHP) · Zero-inflated generalized linear mixed models (ZIGLMMs) · BIC · Wald t test · F test · Satterthwaite approximation · Kenward-Roger (KR) approximation · TMB (Template Model Builder) package · AICtab() · BICtab() · bbmle package · DHARMa package · plotQQunif() · plotResiduals() · testDispersion() · testZeroinflation() · Conway-Maxwell-Poisson distribution · zicmp · marginal_coefs() · effectPlotData() · devtools package · QQ-plot · glmm.zinb() · lme() · nlme package · stats package · glmmPQL() · MASS package

Generalized linear mixed-effects models (GLMMs) and generalized estimating equations (GEEs) are the two most appealing paradigms in a longitudinal setting. Thus, GEEs and GLMMs not only were adopted for the analysis of longitudinal microbiome data at the beginning stage of microbiome research, but in recent years have also served as a statistical foundation for models that specifically target longitudinal microbiome data. In Chap. 16, we investigated some general topics of GLMMs. Microbiome count data are often over-dispersed and zero-inflated, containing more zeros than would be expected from the typical error distributions. In this chapter, we introduce GLMMs, focusing on some newly developed GLMMs that account for correlated observations with random effects while accommodating over-dispersion and zero-inflation. In Sect. 17.1, we review and discuss some general issues of GLMMs in microbiome research. Then, we introduce three GLMMs that can be used for modeling over-dispersed and zero-inflated longitudinal microbiome data: generalized linear mixed modeling (1) using the glmmTMB package (Sect. 17.2); (2) using the GLMMadaptive package (Sect. 17.3); and (3) fast zero-inflated negative binomial mixed modeling (FZINBMM) using the NBZIMM package (Sect. 17.4), which was specifically designed for analyzing longitudinal microbiome data. In Sect. 17.5, we provide some remarks on fitting GLMMs. Finally, we summarize this chapter in Sect. 17.6.

17.1 Generalized Linear Mixed Models (GLMMs) in Microbiome Research

In longitudinal studies or repeated measurements, the observations on the same individual are often correlated, which is usually modeled with a random effect in generalized linear mixed models (GLMMs). Count data are typically modeled with GLMs and GLMMs using either Poisson or negative binomial distributions. However, microbiome count data are over-dispersed and sparse and often contain many zeros. Thus, the Poisson model is not suitable for microbiome count data because it assumes that the variance equals the mean. The negative binomial model (a mixture of Poisson distributions with Gamma-distributed rates) and its extensions are usually used to analyze over-dispersed microbiome data. To model zero-inflated microbiome data, zero-inflated or zero-hurdle models with the negative binomial distribution can be applied. Xia et al. (2018a, b) introduced over-dispersed and zero-inflated modeling of microbiome data.

In our 2018 book, we mainly adopted two widely used negative binomial (NB) models from the RNA-seq literature, edgeR and DESeq2, to model over-dispersed microbiome data, as well as the widely used pscl package to fit zero-inflated and zero-hurdle GLMs via maximum likelihood estimation (MLE) (Zeileis et al. 2008), along with various model comparisons.
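As a minimal sketch of what such cross-sectional fits look like in practice, the pscl functions zeroinfl() and hurdle() fit ZINB and ZHNB models; the simulated data frame `df` below is hypothetical and stands in for a single-taxon count outcome with one covariate.

```r
# Hypothetical single-taxon example: ZINB and ZHNB GLMs with pscl.
# `y` is a simulated zero-inflated count outcome; `x` is a covariate.
library(pscl)

set.seed(1)
n <- 200
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1) *
     rbinom(n, 1, 0.7)            # roughly 30% structural zeros

df <- data.frame(y = y, x = x)

zinb <- zeroinfl(y ~ x | x, data = df, dist = "negbin")  # count part | zero part
zhnb <- hurdle(y ~ x | x, data = df, dist = "negbin")    # zero-hurdle NB

AIC(zinb, zhnb)  # non-nested comparison by AIC
```

The two-part formula `y ~ x | x` specifies the count sub-model on the left of `|` and the zero (inflation or hurdle) sub-model on the right.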

A negative binomial (NB) model can capture over-dispersion due to clustering but cannot capture the sparsity caused by zero-inflation, which could result in biased parameter estimates (Xia et al. 2018a; Xia 2020). We concluded that the zero-inflated and zero-hurdle GLMs with NB regression, i.e., zero-inflated negative binomial (ZINB) and zero-hurdle negative binomial (ZHNB), are the best-fitting models for over-dispersed and zero-inflated microbiome data among the competing models, including Poisson, NB, zero-inflated Poisson (ZIP), and zero-hurdle Poisson (ZHP) (Xia et al. 2018b).

However, pscl does not support random effects and thus is limited to cross-sectional data; it cannot model longitudinal data or repeated measurements. Ignoring correlation by omitting random effects can make statistical inference anti-conservative (Bolker et al. 2009; Bolker 2015). Here, we introduce two other R packages, glmmTMB and GLMMadaptive, that can be used to model over-dispersed and zero-inflated longitudinal microbiome data. After introducing glmmTMB and GLMMadaptive, we will move on to modeling microbiome data from GLMs and GLMMs to zero-inflated generalized linear mixed models (ZIGLMMs). Before introducing the glmmTMB and GLMMadaptive packages, in this section we review some specific issues in modeling microbiome data with GLMMs, including data transformation, challenges of model selection, and statistical hypothesis testing.

17.1.1 Data Transformation Versus Using GLMMs

Typically three kinds of zeros exist in microbiome data: structural zeros, sampling zeros, and rounded zeros. Excess zeros not only cause the data to be zero-inflated but also cause them to be over-dispersed (Xia et al. 2018c). Such zeros cannot simply be handled by a data-processing step, such as replacing them with a small value or applying a transformation to make the data normally distributed, because even if these steps succeed, the resulting data might violate statistical assumptions or limit the scope of inference (e.g., the estimates of fixed effects cannot be extrapolated to new groups) (Bolker et al. 2009). For example, in practice the log transformation is often employed. The log transformation makes multiplicative models additive, which facilitates data analysis, and it can perfectly remove heteroscedasticity if the relative standard deviation is constant (Kvalheim et al. 1994). However, the log transformation has two main drawbacks: (1) it cannot deal with zero values because log zero is undefined; (2) its effect on values with a large relative standard deviation is limited, which is usually the case in microbiome data at lower taxonomic ranks (e.g., genus and species). It was shown by simulation and real data (Feng et al. 2013, 2014) that the log transformation is often misused:
  • The log transformation usually can only remove or reduce skewness of data that follow a log-normal distribution, or approximately so; in some cases it actually makes the distribution more skewed than the original data.

  • It is not generally true that the log transformation reduces the variability of data, especially if the data include outliers. In fact, whether the log transformation reduces variability depends on the magnitude of the mean of the observations: the larger the mean, the smaller the variability.

  • It is difficult to interpret model estimates from log-transformed data because the results of standard statistical tests on log-transformed data are often not relevant to the original, non-transformed data. In practice, model estimates obtained from fitting the transformed data usually must be translated back to the original scale through exponentiation to give a straightforward biological interpretation (Bland and Altman 1996). However, since no inverse function can map exp(E(log X)) back to the original scale in a meaningful fashion, it has been advised that all interpretations focus on the transformed scale once data are log-transformed (Feng et al. 2013).

  • Fundamentally, statistical hypothesis testing of the equality of (arithmetic) means of two samples is different from testing the equality of (geometric) means of two samples after log transformation of right-skewed data. These two hypothesis tests are equivalent if and only if the two samples have equal standard deviations.

The underlying assumption of the log transformation is that the transformed data have a distribution equal or close to the normal distribution (Feng et al. 2013). Since this assumption is often violated, the log transformation with an added shift parameter on the one hand cannot help reduce variability, and on the other hand could be quite problematic for testing the equality of means of two samples when some values are close to zero. Even nonparametric models still need assumptions (e.g., homogeneity of variance across groups). Thus, over-dispersed and zero-inflated count models are recommended (Xia et al. 2018a, b).
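A quick base-R simulation (illustrative numbers only) makes the last point concrete: for log-normal samples with unequal spread, the ordering of the arithmetic means can be the reverse of the ordering of the geometric means, so a test on the log scale answers a different question than a test on the original scale.

```r
# Simulated illustration: arithmetic vs. geometric means under log transformation.
set.seed(1)
x <- rlnorm(5000, meanlog = 0,   sdlog = 1.5)  # right-skewed, large spread
y <- rlnorm(5000, meanlog = 0.6, sdlog = 0.5)  # shifted, small spread

arith <- c(x = mean(x), y = mean(y))                  # arithmetic means
geo   <- exp(c(x = mean(log(x)), y = mean(log(y))))   # geometric means

# Theoretical arithmetic means: exp(0 + 1.5^2/2) ~ 3.08 vs exp(0.6 + 0.5^2/2) ~ 2.06
# Theoretical geometric means:  exp(0) = 1           vs exp(0.6) ~ 1.82
arith  # x > y on the original scale
geo    # x < y on the log scale: the two comparisons disagree
```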

17.1.2 Model Selection in Microbiome Data

In general, nested models are compared using likelihood-ratio or score tests, for example, to compare ZIP vs. ZINB (ZIP is nested within ZINB) and ZHP vs. ZHNB (ZHP is nested within ZHNB), while non-nested models are evaluated using AIC and/or the Vuong test (Long 1997; Vuong 1989).
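As a base-R sketch of a nested-model comparison (MASS is distributed with R; the data are simulated), the likelihood-ratio statistic is twice the difference in log-likelihoods, referred to a χ2 distribution; because the NB dispersion parameter lies on the boundary under the Poisson null, the p-value is conventionally halved.

```r
# Likelihood-ratio test of a Poisson GLM against its nested NB extension.
library(MASS)  # for glm.nb()

set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rnbinom(200, mu = exp(1 + 0.5 * d$x), size = 1)  # over-dispersed counts

m_poi <- glm(y ~ x, data = d, family = poisson)
m_nb  <- glm.nb(y ~ x, data = d)

lr <- as.numeric(2 * (logLik(m_nb) - logLik(m_poi)))  # LRT statistic (1 df)
p  <- pchisq(lr, df = 1, lower.tail = FALSE) / 2      # halved: boundary null
c(statistic = lr, p.value = p)  # strong evidence for the NB model here
```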

In statistical theory, AIC (Akaike 1973, 1998) and BIC (Schwarz 1978; Wit et al. 2012) have served as common criteria (or golden rules) for model selection since they were proposed (Aho et al. 2014).

In GLMMs, estimates of AIC have been developed based on certain exponential family distributions (Saefken et al. 2014). As we described in Chap. 16 (Sect. 16.5.3, Information Criteria for Model Selection), AIC and BIC have several theoretical properties: consistency in selection, asymptotic efficiency, and minimax-rate optimality. For example, it has been shown that AIC is asymptotically efficient in the nonparametric framework and is also minimax optimal (Barron et al. 1999), while BIC is consistent and asymptotically efficient in the parametric framework. However, AIC and BIC also have their own drawbacks. For example, AIC is inconsistent in a parametric framework (when at least two correct candidate models exist), and hence AIC is not asymptotically efficient in such a framework. It is also difficult to evaluate analytically the properties of AIC for small samples, so its statistical optimality cannot be demonstrated (Shibata 1976). BIC, on the other hand, is neither minimax optimal nor asymptotically efficient in a nonparametric framework (Shao 1997; Foster and George 1994). For details on the strengths and drawbacks of AIC and BIC, the reader is referred to Ding et al. (2018a, b) for general references on these topics. To leverage the strengths of AIC and BIC and overcome their weaknesses, hybrid or adaptive approaches have been proposed (Ing 2007; Yang 2007; Liu and Yang 2011; van Erven et al. 2012; Zhang and Yang 2015; Ding et al. 2018a).

Generally, AIC and BIC are sufficient for choosing between candidate models; the various variants of AIC and BIC may not improve the decision much. However, relying solely on AIC may lead to misinterpreting the importance of variables in the set of candidate models; thus, combining AIC and BIC (BIC is more appropriate for heterogeneous data because of its larger penalty) is recommended when modeling microbiome data with GLMMs.
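To make the penalties concrete, here is a base-R comparison on simulated Poisson fits (hypothetical data): AIC charges 2 per parameter while BIC charges log(n) per parameter, so for n > 7 BIC always penalizes model complexity more heavily.

```r
# AIC vs. BIC on the same fits: identical log-likelihoods, different penalties.
set.seed(3)
d <- data.frame(x = rnorm(100))
d$y <- rpois(100, lambda = exp(0.5 + 0.2 * d$x))

m0 <- glm(y ~ 1, data = d, family = poisson)  # 1 parameter
m1 <- glm(y ~ x, data = d, family = poisson)  # 2 parameters

# BIC = -2*logLik + k*log(n); AIC = -2*logLik + 2*k; here log(100) ~ 4.6 > 2
cbind(AIC = AIC(m0, m1)$AIC, BIC = BIC(m0, m1)$BIC)
```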

Microbiome data and ecology data share some common characteristics. However, microbiome data are more over-dispersed and zero-inflated than ecology data. Model selection for microbiome data has two general characteristics.
  1. Over-dispersed and zero-inflated models are usually more appropriate than the simpler models nested within them. Since NB models generally have clear advantages over their nested Poisson models, and likewise zero-inflated NB and zero-hurdle NB models over their nested zero-inflated Poisson and zero-hurdle Poisson models, the information-theoretic approach is much more important than the null-hypothesis-testing approach.
  2. Most often the information-theoretic approach and the null-hypothesis-testing approach provide consistent results for choosing the best models. Compared to ecology, model selection is relatively simple in microbiome studies. In our experience, the results from AIC, BIC, the Vuong test, and the LRT are usually consistent for microbiome data (Xia et al. 2018b). Considering them jointly should enable choosing a better model.

For the improvements offered by the various variants of AIC and BIC, readers can find a description of their properties and characteristics in Chap. 16 (Sect. 16.5.3).

17.1.3 Statistical Hypothesis Testing in Microbiome Data

In this subsection, we describe some challenges in conducting statistical hypothesis testing with microbiome data. Generally, when using GLMMs, we need to consider two statistical choices: (1) ML versus REML and (2) the Wald test versus the LRT. For details, see Chaps. 15 (Sect. 15.1.5) and 16 (Sect. 16.5). Statistical hypothesis testing of microbiome data using GLMMs is especially challenging. We summarize some challenges as follows.
  • First, the statistical methods that are appropriate for testing over-dispersion and zero-inflation are limited.

In general, the null hypothesis of the Wald Z, χ2, t, and F tests is that a certain quantity is zero (null), i.e., that there is no difference between certain characteristics of a population (or data-generating process or models). However, Wald Z and χ2 tests cannot be used for GLMMs with over-dispersion. When over-dispersion occurs, Wald t and F tests are appropriate for GLMMs because they account for the uncertainty in the estimates of over-dispersion (Littell et al. 2006; Moscatelli et al. 2012). Thus, for over-dispersed microbiome data, Wald t and F tests are required. Similarly, for zero-inflated microbiome data, zero-inflated GLMMs are more appropriate. However, GLMMs capable of modeling both over-dispersed and zero-inflated microbiome data are currently still limited.
  • Second, testing the null hypothesis on the boundary poses more challenge on most statistical tests.

Molenberghs and Verbeke (2007) brought boundary effects to attention: most tests require that standard deviations be ≥ 0, so the null hypothesis for a random effect (σ = 0) violates the assumption that the null values of the parameters do not lie on the boundary of their allowable ranges. Bolker et al. (2009) discussed boundary effects when using GLMMs for ecology and evolution data. Microbiome data contain many zeros and are therefore prone to boundary effects at zero. Additionally, microbiome count data are often normalized into proportions for analysis, which makes the data potentially bounded at 0 and 1. Thus, for microbiome data the boundary issue is especially challenging and urgently needs to be solved.
  • Third, choosing an appropriate method to calculate the degrees of freedom is difficult.

Wald t or F tests and AIC (AICc) need the degrees of freedom (df), which lie between 1 and the number of random-effect levels N − 1. Several approaches are available for calculating df, and different software packages use different ones, but there is no consensus on which is appropriate. The simplest approach is the minimum number of df contributed by the random effects that affect the term being tested; SAS uses this approach by default (Littell et al. 2006). More complicated approaches use the Satterthwaite and Kenward-Roger (KR) approximations of the degrees of freedom to adjust the standard errors (Littell et al. 2006; Schaalje et al. 2001). KR (only available in SAS) was reviewed as having the best overall performance, at least for LMMs (Schaalje et al. 2002; Bolker et al. 2009). Additionally, the residual df can be estimated as the sample size n minus the trace t (i.e., the sum of the diagonal elements) of the hat matrix (Burnham 1998; Spiegelhalter et al. 2002). Due to this complexity, calculating degrees of freedom and boundary effects have been identified as the two particular challenges in statistically testing the results of GLMMs (Bolker et al. 2009).
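The Satterthwaite idea is easiest to see in its simplest setting, the Welch two-sample t-test, where the approximate df are computed from the sample variances; the same moment-matching principle underlies the mixed-model versions. The simulated data below are hypothetical, and base R's t.test() applies the identical approximation by default.

```r
# Welch-Satterthwaite df approximation, computed by hand and via t.test().
set.seed(7)
g1 <- rnorm(10, mean = 0, sd = 1)
g2 <- rnorm(25, mean = 0, sd = 3)   # unequal variances and sample sizes

v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
df_satt <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# t.test() uses the same approximation by default (Welch correction):
all.equal(df_satt, unname(t.test(g1, g2)$parameter))  # TRUE
```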

17.2 Generalized Linear Mixed Modeling Using the glmmTMB Package

In this section, we introduce the glmmTMB package and illustrate its use with microbiome data.

17.2.1 Introduction to glmmTMB

The R package glmmTMB (Brooks et al. 2017) was developed to estimate GLMs and GLMMs and to extend GLMMs to zero-inflated and hurdle GLMMs using ML. glmmTMB can handle a wide range of statistical distributions, including Gaussian, Poisson, binomial, negative binomial, and beta. Because glmmTMB extends GLMMs, it can not only model zero-inflation, heteroscedasticity, and autocorrelation but also handle various types of within-subject correlation structures.

The glmmTMB package wraps all of these estimation capabilities for GLMs, GLMMs, and their extensions, including zero-inflated and hurdle GLMMs, using ML. Currently glmmTMB focuses on zero-inflated counts, although the package can also fit continuous distributions (Brooks et al. 2017).

The zero-inflated (more broadly, zero-altered) models wrapped in glmmTMB were developed under the framework of GLMs and GLMMs, which allows count data to be modeled as a mixture of a point mass at zero and a Poisson or NB distribution. Notably, glmmTMB supports the Conway-Maxwell-Poisson distribution (Huang 2017), which has both mean and dispersion parameters. It is a generalized version of the Poisson distribution with the Bernoulli and geometric distributions as special cases (Sellers and Shmueli 2010). Depending on the dispersion, the Conway-Maxwell-Poisson distribution can have either a longer or shorter upper tail than the Poisson (Sellers and Shmueli 2010). Thus, it can model either over- or under-dispersed count data (Shmueli et al. 2005; Lynch et al. 2014; Barriga and Louzada 2014; Brooks et al. 2017). Dispersion and skewness can also be modeled flexibly through the Sichel and Delaporte distributions (Stasinopoulos et al. 2017).

The glmmTMB package uses the interface and formula syntax of the lme4 package (Bates et al. 2015) and performs estimation via the TMB (Template Model Builder) package (Kristensen et al. 2016). Like lme4, glmmTMB uses MLE and the Laplace approximation to integrate over random effects. Currently, restricted maximum likelihood (REML) is not available in glmmTMB, and the random effects cannot be integrated using Gauss-Hermite quadrature, although both are available in lme4 (Brooks et al. 2017). A fundamental difference in the underlying implementations, however, is that glmmTMB uses TMB, taking advantage of its fast estimation of non-Gaussian models and gaining more flexibility in the classes of distributions it can fit (Brooks et al. 2017).

A glmmTMB model consists of four main components: (1) a conditional model formula, (2) a distribution for the conditional model, (3) a dispersion model formula, and (4) a zero-inflation model formula. Both fixed and random effects models can be specified in the conditional and zero-inflated components of the model. For the dispersion parameter, only the fixed effects models can be specified.

One example syntax for a full zero-inflated negative binomial GLMM is given below:
zinb = glmmTMB(Count ~ Group*Time + (1|Subject), zi= ~Group*Time, data=ds, family=nbinom2)
One example syntax for a full hurdle negative binomial GLMM is given below:
hnb = glmmTMB(Count ~ Group*Time + (1|Subject), zi= ~Group*Time, data=ds, family=truncated_nbinom2)

17.2.1.1 Conditional Model Formula

glmmTMB is flexible enough to fit various GLMs and GLMMs. For example, to fit simple GLMs and GLMMs, we just specify the conditional model while leaving the zero-inflation and dispersion formulas at their default values. To specify the mean of the conditional model, we use a two-sided formula (the syntax is the same as in lme4): the response variable on the left, and the predictors, potentially including random effects and offsets, on the right. One example of a conditional model formula is given below:
Count ~ Taxon + (1 | Subject)
This formula specifies a conditional model for the dependence of mean count on taxon:
  • counts vary by taxon and vary randomly by subject.

One zero-inflated and dispersed GLMMs that can be modeled in glmmTMB is given as follows.
$$ \begin{aligned} y &= E\left({y}_i \mid {b}_i, NSZ\right) = \exp\left({\beta}_0 + {\beta}_1 + {b}_i\right),\\ {b}_i &\sim N\left(0, {\sigma}_{b_i}^2\right),\\ {\sigma}^2 &= \mathrm{Var}\left({y}_i \mid {b}_i, NSZ\right) = y\left(1 + y/\theta\right),\\ \mathrm{logit}(p) &= {\beta}_0^{(zi)} + {\beta}_1^{(zi)},\\ \log\left(\theta\right) &= {\beta}_0^{(disp)} + {\beta}_2^{(disp)}, \end{aligned} $$
(17.1)
where y is the read abundance count, bi is the subject-specific random effect, NSZ denotes a "non-structural zero," p = 1 − Pr(NSZ) is the probability of zero-inflation, and β's are regression coefficients with subscripts denoting covariate/level (0 denoting the intercept).

The above model formulation in (17.1) allows the conditional mean to depend on whether or not a subject was from covariate 1 (group variable, e.g., treatment vs. control) and to vary randomly by subject. It also allows the number of structural (i.e., extra) zeros to depend on covariate 1 (group variable). Additionally, it allows the dispersion parameter to depend on covariate 2 (e.g., time).

17.2.1.2 Distribution for the Conditional Model

We can specify the distribution of the mean of the conditional model via the family argument. For count data, the available families include Poisson (family = poisson), negative binomial (family = nbinom1 or family = nbinom2), and Conway-Maxwell-Poisson, which fits over- and under-dispersed data (family = compois). By default, the link function of the Poisson, Conway-Maxwell-Poisson, and negative binomial distributions is log. We can specify other links through the family argument, such as family = poisson(link = "identity").

glmmTMB provides two parameterizations of the negative binomial. They differ in how the variance (σ2) depends on the mean (μ). The argument family = nbinom1 indicates that the variance increases linearly with the mean, i.e., σ2 = μ(1 + α), with α > 0, while family = nbinom2 indicates that the variance increases quadratically with the mean, i.e., σ2 = μ(1 + μ/θ), with θ > 0 (Hardin et al. 2007). The variance of the Conway-Maxwell-Poisson distribution does not have a closed-form equation (Huang 2017).
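A small numeric check (base R; the parameter values are arbitrary illustrations) shows how the two parameterizations diverge as the mean grows, and that nbinom2 matches the (mu, size) parameterization used by stats::dnbinom.

```r
# Variance under glmmTMB's two NB parameterizations, at two mean values.
alpha <- 0.5  # nbinom1 dispersion (illustrative value)
theta <- 4    # nbinom2 dispersion (illustrative value)

var_nb1 <- function(mu) mu * (1 + alpha)       # linear in the mean
var_nb2 <- function(mu) mu * (1 + mu / theta)  # quadratic in the mean

c(var_nb1(10), var_nb2(10))    # 15 vs 35
c(var_nb1(100), var_nb2(100))  # 150 vs 2600: quadratic growth dominates

# nbinom2 corresponds to stats::rnbinom's (mu, size) form with size = theta:
# Var(Y) = mu + mu^2/size
var_nb2(10) == 10 + 10^2 / theta  # TRUE
```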

17.2.1.3 Dispersion Model Formula

In glmmTMB, the dispersion parameter can be specified as either identical across observations or covariate-dependent. The default dispersion model (dispformula = ~1) treats the dispersion parameter (e.g., α or θ in the NB model) as identical for each observation, while the dispersion parameter can also vary with fixed effects, in which case the dispersion model uses a log link. We can also use the dispersion model to account for heteroskedasticity. For example, to account for more variation (relative to the mean) of the response along a predictor variable, we can specify the one-sided formula dispformula = ~ predictor in an NB distribution. When the conditional and dispersion models include the same variables, the mean-variance relationship can be manipulated, but a non-convergence issue can potentially occur (Brooks et al. 2017). To see the description of the dispersion parameter for each distribution, type ?sigma.glmmTMB in R or RStudio.

17.2.1.4 Zero-Inflation Model Formula

The overall distribution of a zero-inflated generalized model is a mixture of the conditional model and the zero-inflation model (Lambert 1992). The zero-inflation model describes the probability of observing an extra (i.e., structural) zero that is not generated by the conditional model. It creates an extra point mass of zeros in the response distribution, with the zero-inflation probability bounded between zero and one via a logit link.
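The mixture structure can be written out directly in base R: with zero-inflation probability p, the probability of observing a zero is p plus (1 − p) times the conditional model's own probability of zero. The parameter values below are hypothetical.

```r
# Density of a zero-inflated negative binomial as an explicit two-part mixture.
dzinb <- function(y, p, mu, theta) {
  # structural zeros (point mass) + sampling zeros/counts from the NB part
  ifelse(y == 0, p, 0) + (1 - p) * dnbinom(y, mu = mu, size = theta)
}

p <- 0.3; mu <- 5; theta <- 2     # illustrative parameter values
dzinb(0, p, mu, theta)            # 0.3 + 0.7 * (2/7)^2, roughly 0.357
sum(dzinb(0:1000, p, mu, theta))  # probabilities still sum to ~1
```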

To assume that the probability of producing a structural zero is equal for all observations, we can specify an intercept-only model, ziformula = ~1. To model structural zeros that depend on a specific predictor variable, we can specify ziformula = ~ predictor in the zero-inflation model. glmmTMB also allows random effects to be included in both the conditional and zero-inflation models, but not in the dispersion model.

17.2.2 Implement GLMMs via glmmTMB

Example 17.1: Breast Cancer QTRT1 Mouse Gut Microbiome

This dataset was used in previous chapters. Here we continue to use this dataset to illustrate glmmTMB. The 16S rRNA microbiome data were measured at pretreatment and posttreatment times for 10 samples at each time. We are interested in the differences between the genotype factors or groups. We use the example data to illustrate how to fit and compare GLMMs, zero-inflated GLMMs, and hurdle GLMMs using glmmTMB and how to extract results from a model.

Here, we illustrate how to perform model selection to choose the models that best fit this abundance data in detecting the genotype differences. The compared models include Poisson, Conway-Maxwell-Poisson, negative binomial, zero-inflated, and zero-hurdle models. We use the species "D_6__Ruminiclostridium.sp.KB18" for this illustration.

To run the glmmTMB() function to perform GLMMs, we first need to install the glmmTMB package, which is available from the Comprehensive R Archive Network (CRAN) (current version 1.1.2.3, February 2022) and from GitHub (development versions). We can install glmmTMB via CRAN using the command install.packages("glmmTMB") or via GitHub using devtools (Wickham and Chang 2017): devtools::install_github("glmmTMB/glmmTMB/glmmTMB").

  • Step 1: Load taxa abundance data and meta data.

> setwd("~/Documents/QIIME2R/Ch17_GLMMs2")
> otu_tab=read.csv("otu_table_L7_MCF7.csv",row.names=1,check.names=FALSE)
> meta_tab <- read.csv("metadata_QtRNA.csv",row.names=1,check.names = FALSE)
  • Step 2: Check taxa and meta data.

> nrow(otu_tab)#taxa
> ncol(otu_tab)#samples
> nrow(meta_tab)#samples
> ncol(meta_tab)#variables
> head(otu_tab,3)
> head(meta_tab,3)
> otu_tab_t<-t(otu_tab)
> head(otu_tab_t,3)
> dim(otu_tab_t)
  • Step 3: Create analysis dataset including outcomes and covariates.

> colnames(otu_tab_t)
> colnames(meta_tab)
> class(otu_tab_t)
> class(meta_tab)
> outcome =otu_tab_t
> # Match sample IDs in outcome matrix or data frame and meta_tab data frame
> outcome = outcome[rownames(meta_tab), ]
> class(outcome)
> head(outcome,3)
> yy = as.matrix(outcome)
> yy = ifelse(is.na(yy), 0, yy) #Exclude missing outcome variables if any
> dim(yy)
> # Filter abundance taxa data
> # Here we use double filters
> filter1 <- apply(yy, 2, function(x){sum(x>0)>0.4*length(x)})
> filter2 <- apply(yy, 2, function(x){quantile(x,0.9)>1})
> filter12 <-yy[,filter1&filter2]
> cat('after filter:','samples','taxa',dim(filter12),'\n')
> colnames(filter12)
> yy = filter12[rownames(meta_tab), ]
> ds = data.frame(meta_tab,yy)
> head(ds,3)
  • Step 4: Explore the distributions of the outcomes (taxa).

First, we summarize the distribution of the species “D_6__Ruminiclostridium.sp.KB18,” including sample size, mean, sd, min, median, and percentage of zeros for each group at each time point.
> options(digits=4)
> library(FSA)
> # Summarize taxonomy by groups
> Summarize(D_6__Ruminiclostridium.sp..KB18 ~ Group4, data = ds)
Group4 n mean sd min Q1 median Q3 max percZero
1 KO_BEFORE 10 4.1 7.838 0 0.00 0.0 6.00 24 70
2 KO_POST 10 11.8 13.951 0 0.75 7.5 17.75 44 30
3 WT_BEFORE 10 0.3 0.483 0 0.00 0.0 0.75 1 70
4 WT_POST 10 92.9 111.588 0 0.00 46.5 171.75 294 50
The species "D_6__Ruminiclostridium.sp.KB18" has 70% zeros in KO_BEFORE and WT_BEFORE, 30% zeros in KO_POST, and 50% zeros in WT_POST. The variances are greater than their means. These distributions suggest that this species may be over-dispersed and zero-inflated. See the plots below (Figs. 17.1 and 17.2).


Fig. 17.1

Histogram of Ruminiclostridium.sp.KB18 over before and posttreatment in wild-type (WT) and knockout (KO) groups


Fig. 17.2

Distribution of Ruminiclostridium.sp.KB18

> # Figure 17.1
> # Individual plots in panel of 2 columns and 2 rows
> library(lattice)
> histogram(~ D_6__Ruminiclostridium.sp..KB18|Group4, data=ds,layout=c(2,2))
> # Figure 17.2
> # Check outcome distribution and zeros
> par(mfrow=c(1,2))
> plot(table(ds$D_6__Ruminiclostridium.sp..KB18),ylab="Frequencies",main="D_6__Ruminiclostridium.sp..KB18",
+ xlab="Observed read values")
> plot(sort(ds$D_6__Ruminiclostridium.sp..KB18),ylab="Sorted read values",main="D_6__Ruminiclostridium.sp..KB18",
+ xlab="Sample index")
  • Step 5: Run the glmmTMB() function to implement glmmTMB.

For each outcome taxonomy variable (here, D_6__Ruminiclostridium.sp.KB18), we fit fixed effects for the two main factors Group and Time and their interaction term. For the random effects, we specify that taxon counts vary randomly by mouse.

Poisson Model (family = poisson)
The syntax for fitting GLMMs with glmmTMB is quite similar to that of glmer. The log link is the standard link function for the Poisson distribution.
> library(glmmTMB)
# Using offset
> poi = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read))+(1|MouseID), data=ds, family=poisson)
# Without using offset
> poi = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time +(1|MouseID), data=ds, family=poisson)
Conway-Maxwell-Poisson Model (family = compois)
The Conway-Maxwell-Poisson model is fitted using family = compois.
# Using offset
> cmpoi = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), data=ds, family=compois)
# Without using offset
> cmpoi = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), data=ds, family=compois)
Negative Binomial Models
  • (family = nbinom1)

The following commands specify a negative binomial model.
# Using offset
> nb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), data=ds, family=nbinom1)
# Without using offset
> nb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), data=ds, family=nbinom1)
  • (family = nbinom2)

# Using offset
> nb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), data=ds, family=nbinom2)
# Without using offset
> nb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), data=ds, family=nbinom2)
  • (family = nbinom1 and disp = ~Time)

The Poisson distribution has only a mean parameter, while the negative binomial distribution has both mean and dispersion parameters. The following commands allow the taxon counts to become more dispersed (relative to the mean) over time via disp = ~ Time.
# Using offset
> nbdisp1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), disp=~Time, data=ds, family=nbinom1)
# Without using offset
> nbdisp1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), disp=~Time, data=ds, family=nbinom1)
  • (family = nbinom2 and disp = ~Time)

# Using offset
> nbdisp2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), disp=~Time, data=ds,family=nbinom2)
# Without using offset
> nbdisp2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), disp=~Time, data=ds,family=nbinom2)
Zero-Inflated Models
Unlike Poisson and negative binomial models, zero-inflated models can model how the probability of an extra zero (i.e., a structural zero) varies with predictors, which is specified via the ziformula argument (the zi sub-model) in glmmTMB. The zi formula has only a right-hand side because the response has already been specified in the first (conditional) formula; it always models the probability of a structural zero in the response. The probability of zero-inflation is always modeled with a logit link to ensure that its value lies between 0 and 1. In the following commands, we specify the same two main factors Group and Time and their interaction term as in the conditional model. This specification assumes that the probability of a structural zero varies by group, by time, and by their interaction. Zero-inflation can be combined with any of the distributions in glmmTMB, including the Poisson, Conway-Maxwell-Poisson, and negative binomial distributions.
  • ZIP(family = poisson and zi = ~ Group*Time as well as zi = ~ Group)

The following commands specify a zero-inflated Poisson distribution with zero-inflation modeled by group, time, and their interaction.
# Using offset
> zip0 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family= poisson)
Warning message:
In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
# Without using offset
> zip0 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family= poisson)
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; singular convergence (7). See vignette('troubleshooting')
The warning messages tell us that the zero-inflated Poisson model specified in the object zip0 does not converge. Failed convergence is most likely caused by one or more of the following problems:
  • When a model is overparameterized, the data do not contain enough information to estimate the parameters reliably.

  • When a random-effect variance is estimated to be zero, or random-effect terms are estimated to be perfectly correlated (“singular fit”).

  • When zero-inflation is estimated to be near zero (a strongly negative zero-inflation parameter).

  • When dispersion is estimated to be near zero.

  • When complete separation occurs in a binomial model: some categories in the model contain proportions that are either all 0 or all 1.

These problems are typically caused by trying to estimate parameters for which the data contain no information, for example, when a zero-inflated and over-dispersed model is specified for data that have no zero-inflation and/or over-dispersion, or when a random effect is assumed for predictors that do not need one. General model convergence issues are discussed in the glmmTMB troubleshooting vignette; readers can obtain advice on troubleshooting convergence issues via vignette("troubleshooting", package = "glmmTMB"). For the detailed estimation challenges in GLMMs, readers are referred to Sect. 17.1.3.
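When a fit fails to converge, recent versions of glmmTMB also provide a diagnose() function that can help locate the problem; a hedged sketch (the exact output depends on the glmmTMB version installed):

```r
library(glmmTMB)
# Inspect a non-converged fit; diagnose() flags unusually large coefficients,
# large or NA gradient components, and near-zero variance components.
diagnose(zip0)
# Checking the random-effect variance directly can also reveal a singular fit:
VarCorr(zip0)
```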

Models that do not converge should not be considered in model comparison. Since specifying zi = ~ Group*Time causes the model not to converge, we instead specify zi = ~ Group below.
# Using offset
> zip = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group, data=ds, family= poisson)
# Without using offset
> zip = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group, data=ds, family= poisson)
  • ZICMP(family = compois and zi = ~ Group as well as zi = ~ Group*Time)

The following commands specify a zero-inflated Conway-Maxwell-Poisson distribution.
# Using offset
> zicmp0 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family=compois)
# Without using offset
> zicmp0 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family=compois)
# Using offset
> zicmp = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group, data=ds, family=compois)
# Without using offset
> zicmp = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group, data=ds, family=compois)
  • ZINB(family = nbinom1 and zi = ~ Group*Time)

The following commands specify a zero-inflated negative binomial distribution.
# Using offset
> zinb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family=nbinom1)
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
# Without using offset
>zinb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family=nbinom1)
# Using offset
> zinb1a = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group, data=ds, family=nbinom1)
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
# Without using offset
>zinb1a = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group, data=ds, family=nbinom1)
Error in eigen(h) : infinite or missing values in 'x'
In addition: Warning message:
In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
  • ZINB(family = nbinom2 and zi = ~ Group)

# Using offset
> zinb2a = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family=nbinom2)
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; singular convergence (7). See vignette('troubleshooting')
# Without using offset
>zinb2a = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family=nbinom2)
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; singular convergence (7). See vignette('troubleshooting')

Since zinb2a does not converge, we specify zi = ~ Group in zinb2 below instead.

# Using offset
> zinb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group, data=ds, family=nbinom2)
# Without using offset
> zinb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group, data=ds, family=nbinom2)
Hurdle Models
We fit a zero-hurdle Poisson model by using a truncated Poisson distribution, specifying the same predictors in both the conditional model and the zero-inflation model.
  • ZHP(family = truncated_poisson and zi = ~ Group*Time)

# Using offset
> zhp = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family=truncated_poisson)
# Without using offset
> zhp = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family=truncated_poisson)
  • ZHNB (family = truncated_nbinom1/truncated_nbinom2 and zi = ~ Group*Time)

The following commands fit a zero-hurdle negative binomial model by using a truncated negative binomial distribution and specify the same predictors in both the conditional model and the zero-inflation model.
# Using offset
> zhnb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read)) + (1|MouseID), zi=~ Group*Time, data=ds, family= truncated_nbinom1)
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; singular convergence (7). See vignette('troubleshooting')
# Without using offset
> zhnb1 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family= truncated_nbinom1)
# Using offset
> zhnb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read))
+ (1|MouseID), zi=~ Group*Time, data=ds, family= truncated_nbinom2)
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
# Without using offset
> zhnb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family= truncated_nbinom2)
  • Step 6: Perform model comparison and model selection using AIC and BIC.

We can use the functions AICtab() and BICtab() (which produce tables of AIC and BIC model comparisons, respectively) from the bbmle package to compare all the GLMMs, including the zero-inflated and hurdle models (Burnham and Anderson 2002). To test the effect of using an offset, we fit each model both with an offset adjusting for the total read count and without one.
  • Model comparison using AIC and BIC and offset.

#install.packages("bbmle")
> library(bbmle)
# Using offset
> AICtab(poi, cmpoi,nb1,nb2, nbdisp1, nbdisp2, zip, zip0, zicmp,zicmp0, zinb1,zinb2,zhp,zhnb1, zinb1a, zhnb2,zinb2a)
dAIC df
zhp 0.0 9
zinb2a 0.7 10
zicmp0 0.8 10
zinb2 1.1 8
zicmp 1.5 8
zip0 2.4 9
zip 2.5 7
nb2 17.5 6
nbdisp2 19.4 7
nbdisp1 20.9 7
nb1 21.9 6
cmpoi 28.2 6
poi 95.1 5
zinb1 NA 10
zhnb1 NA 10
zinb1a NA 8
zhnb2 NA 10
> # Using offset
>BICtab(poi, cmpoi,nb1,nb2, nbdisp1, nbdisp2, zip,zip0, zicmp,zicmp0, zinb1,zinb2,zhp,zhnb1, zinb1a, zhnb2,zinb2a)
dBIC df
zip 0.0 7
zinb2 0.3 8
zicmp 0.7 8
zhp 0.9 9
zinb2a 3.3 10
zip0 3.3 9
zicmp0 3.4 10
nb2 13.3 6
nbdisp2 16.9 7
nb1 17.7 6
nbdisp1 18.5 7
cmpoi 24.0 6
poi 89.2 5
zinb1 NA 10
zhnb1 NA 10
zinb1a NA 8
zhnb2 NA 10
  • Model comparison using AIC and BIC but without offset.

# Without using offset
> AICtab(poi, cmpoi,nb1,nb2, nbdisp1, nbdisp2, zip, zip0,zicmp,zicmp0, zinb1,zinb2,zhp,zhnb1,zhnb2,zinb2a)
dAIC df
zhnb2 0.0 10
zhp 0.8 9
zicmp0 1.2 10
zinb2 1.7 8
zicmp 1.8 8
zhnb1 2.3 10
zinb1 4.1 10
nb2 19.3 6
nbdisp2 21.2 7
nbdisp1 22.6 7
nb1 23.2 6
zip 28.5 7
cmpoi 29.7 6
poi 94.6 5
zip0 NA 9
zinb1a NA 8
zinb2a NA 10
# Without using offset
> BICtab(poi, cmpoi,nb1,nb2, nbdisp1, nbdisp2, zip,zip0, zicmp,zicmp0, zinb1,zinb2,zhp,zhnb1,zhnb2,zinb2a)
dBIC df
zinb2 0.0 8
zicmp 0.1 8
zhp 0.7 9
zhnb2 1.6 10
zicmp0 2.9 10
zhnb1 3.9 10
zinb1 5.8 10
nb2 14.2 6
nbdisp2 17.8 7
nb1 18.1 6
nbdisp1 19.2 7
cmpoi 24.5 6
zip 25.1 7
poi 87.8 5
zip0 NA 9
zinb1a NA 8
zinb2a NA 10

Both the AICtab() and BICtab() functions report the log-likelihood of the unconverged models as NA at the end of the AIC and BIC tables.

We print the AIC and BIC values from each model in Table 17.1 to compare models and check the effect of using an offset.
Table 17.1

AIC and BIC values in compared models from glmmTMB

Model      No offset           Using offset
           AIC      BIC        AIC      BIC
zhnb2      201.3    218.2      NA       NA
zhp        202.1    217.3      203.2    218.4
zicmp0     202.5    219.4      204.0    220.9
zinb2      203.1    216.6      204.3    217.9
zicmp      203.1    216.6      204.7    218.2
zhnb1      203.6    220.5      NA       NA
zinb1      205.4    222.3      NA       NA
nb2        220.6    230.8      220.7    230.9
nbdisp2    222.5    234.4      222.6    234.5
nbdisp1    223.9    235.7      224.2    236.0
nb1        224.5    234.6      225.1    235.2
zip        229.9    241.7      205.7    217.5
cmpoi      231.0    241.1      231.4    241.5
poi        295.9    304.3      298.3    306.7
zip0       NA       NA         205.6    220.8
zinb1a     NA       NA         NA       NA

From Table 17.1, we can see that using an offset does not improve model fit for this outcome variable: more models fail to converge with an offset, and several models have slightly worse fits. Only the zero-inflated models fit better with an offset than without, but those fits generate warning messages, so their evaluation is suspect.
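The AIC and BIC values reported in Table 17.1 can also be assembled programmatically; a minimal sketch, assuming the fitted model objects from Step 5 exist in the workspace:

```r
# Collect AIC/BIC for several fitted glmmTMB models into one data frame;
# the model objects (poi, nb2, zip, zinb2, zhp) are those fitted in Step 5.
models <- list(poi = poi, nb2 = nb2, zip = zip, zinb2 = zinb2, zhp = zhp)
ic <- data.frame(AIC = sapply(models, AIC), BIC = sapply(models, BIC))
ic[order(ic$AIC), ]  # sort by AIC, as in Table 17.1
```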

Based on the above goodness-of-fit values evaluated by AIC and BIC, the zero-hurdle NB models, the zero-hurdle Poisson model, the zero-inflated NB models, and the zero-inflated Conway-Maxwell-Poisson models fit better than the other competing models. The one-parameter Poisson model performs worst, and the negative binomial models fall in the middle. The performance of zero-inflated Poisson (ZIP) is unstable, and ZIP has some unreliable properties such as underestimating small counts and overestimating intermediate counts (Xia et al. 2012). Additionally, although ZIP has larger power than NB to estimate a larger probability of zeros when the two have the same mean and variance, ZIP is less capable than NB of modeling over-dispersion when the two have the same mean and probability of zeros (Feng et al. 2015).
  • Step 7: Perform residual diagnostics for GLMMs using DHARMa.

We can evaluate the goodness-of-fit of the fitted GLMMs using the procedures described in the DHARMa package (version 0.4.5, 2022-01-16). The R package DHARMa (Hartig 2022) stands for "Diagnostics for HierArchical Regression Models." The name DHARMa (rather than DHARM) avoids confusion with the German word Darm, meaning intestines, and also alludes to the meaning of dharma in Hinduism: behaviors in accord with rta, the order that makes life and the universe possible, including duties, rights, laws, conduct, virtues, and the "right way of living." This meaning is intended to carry over to the package, i.e., to test whether the fitted model is in harmony with the data.

DHARMa uses a simulation-based approach to create readily interpretable scaled (quantile) residuals for fitted (generalized) linear mixed models. Currently the linear and generalized linear (mixed) models supported by DHARMa are from “lme4” (classes “lmerMod,” “glmerMod”), “glmmTMB,” “GLMMadaptive” and “spam,” generalized additive models (“gam” from “mgcv”), “glm” (including “negbin” from “MASS,” but not quasi-distributions), and “lm” model classes.

DHARMa standardizes the resulting residuals to values between 0 and 1. Thus, the residuals can be intuitively interpreted as residuals from a linear regression. There are two basic steps behind the residual method: (1) Simulate new response data from the fitted model for each observation. (2) For each observation, calculate the empirical cumulative density function for the simulated observations, which describes the possible values and their probability given the observed predictors and assuming the fitted model is correct. The residual generated this way represents the value of the empirical density function at the value of the observed data. Thus, a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means that half of the simulated values are larger than the observed value (Dunn and Smyth 1996; Gelman and Hill 2006).
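The two steps behind the residual method can be sketched directly; the following is an illustrative toy example of the idea, not DHARMa's internal code (the function and object names here are hypothetical):

```r
# 'sims' is an n_obs-by-n_sim matrix of responses simulated from a fitted model,
# 'obs' the observed responses -- both are toy objects below.
scaled_residual <- function(obs_i, sims_i) {
  p_lower <- mean(sims_i < obs_i)        # simulated values below the observation
  p_equal <- mean(sims_i == obs_i)       # ties, broken at random for count data
  p_lower + runif(1, 0, p_equal)         # uniform on (0,1) under a correct model
}
set.seed(1)
sims <- replicate(250, rpois(40, lambda = 5))  # 40 observations x 250 simulations
obs  <- rpois(40, lambda = 5)
res  <- mapply(scaled_residual, obs, asplit(sims, 1))
hist(res)  # approximately uniform when the model matches the data
```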

The package provides a number of plot and test functions for diagnosing typical model misspecification problems, including over−/under-dispersion and zero-inflation.

DHARMa can be installed either by running the following command in R or RStudio:
install.packages("DHARMa")
Or through https://​github.​com/​florianhartig/​DHARMa to install a development version.
> library(DHARMa)
Most functions in DHARMa can be implemented directly on the fitted model object. For example, the following command tests for dispersion problems:
> testDispersion(fittedModel)
However, done this way, every test or plot repeats the residual calculation. Thus, DHARMa strongly recommends first calculating the residuals once using the function simulateResiduals(). The syntax is given below:
> SimulatedOutput <- simulateResiduals(fittedModel = fittedModel, plot = F)
This function call calculates the randomized quantile residuals and returns an object of class DHARMa, containing the simulations and the scaled residuals, which we can pass on to all other plot and test functions. If we specify the optional argument plot = T, the standard DHARMa residual plot is displayed directly. We can also plot and test the calculated (scaled) residuals via a number of DHARMa functions, or access them directly by calling the residuals() function:
> residuals(SimulatedOutput)
Here, we illustrate how to evaluate the goodness-of-fit of two fitted GLMMs (the worst model: Poisson, and one of the best models: zinb2) using the procedures described in the DHARMa package.
  • Step 7a: Simulate and plot the scaled residuals from the fitted model using the functions simulateResiduals() and plot.DHARMa().

Here, the following commands simulate and plot the scaled residuals from the fitted Poisson model (Fig. 17.3).
> # Figure 17.3
> SimulatedOutput_poi <- simulateResiduals(fittedModel = poi)
> plot(SimulatedOutput_poi)


Fig. 17.3

QQ-plot residuals (left panel) and the scatterplot of the residuals against the predicted (fitted) values plot (right panel) in fitted Poisson model by the glmmTMB package

The QQ-plot of the observed versus expected residuals is created by the plotQQunif() function to detect overall deviations from the expected distribution. For a well-specified model, the scaled residuals follow a uniform distribution on the interval (0,1). The P-value reported in the plot is obtained from the Kolmogorov-Smirnov (KS) test of uniformity. By default, three tests are generated: the KS test (for correct distribution), the dispersion test (for detecting over- or under-dispersion), and the outlier test (for detecting outliers). The plot generated by plotResiduals() is, by default, the residuals against the predicted values, but plotting the residuals against specific predictors is highly recommended. To aid in detecting deviations from uniformity in the y-direction, the plot function also calculates a quantile regression and compares the empirical 0.25, 0.5, and 0.75 quantile lines across the plot. These lines should be straight, horizontal, and at the y-values of the theoretical 0.25, 0.5, and 0.75 quantiles.

The above plots can be created separately.
> plotQQunif(SimulatedOutput_poi) # Left plot in plot.DHARMa()
> plotResiduals(SimulatedOutput_poi) # Right plot in plot.DHARMa()
  • Step 7b: Perform goodness-of-fit tests on the simulated scaled residuals.

The following commands first use the function plotResiduals() to estimate and test the within-group residuals. They then use the function testDispersion() to test whether the simulated dispersion equals the observed dispersion, and the function testZeroInflation() to test whether there are more zeros in the data than expected from the simulations. The three generated plots are presented in a panel with 1 row and 3 columns (Fig. 17.4).
> # Figure 17.4
> par(mfrow = c(1,3))
> plotResiduals(SimulatedOutput_poi, ds$Group)
> testDispersion(SimulatedOutput_poi)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 2.8e-05, p-value = 0.4
alternative hypothesis: two.sided
> testZeroInflation(SimulatedOutput_poi)
DHARMa zero-inflation test via comparison to expected zeros with
simulation under H0 = fitted model
data: simulationOutput
ratioObsSim = 1.1, p-value = 0.6
alternative hypothesis: two.sided


Fig. 17.4

Levene test for homogeneity of variance (left panel), nonparametric dispersion test (middle panel), and zero-inflation test (right panel) for fitted Poisson model by the glmmTMB package

In this case, neither the dispersion test nor the zero-inflation test is statistically significant.

In DHARMa, several overdispersion tests are available for comparing the dispersion of simulated residuals to the observed residuals. The syntax is given:

> testDispersion(SimulatedOutput, alternative = c("two.sided", "greater",
"less"), plot = T, type = c("DHARMa", "PearsonChisq"))

The default type, DHARMa, is a non-parametric test that compares the variance of the simulated residuals to that of the observed residuals. The alternative, PearsonChisq, implements the Pearson-χ2 test. If residual simulations are done via refit, DHARMa instead compares the Pearson residuals of the re-fitted simulations to the original Pearson residuals; this is also essentially a nonparametric test. The author of the DHARMa package showed that the default test is fast, nearly unbiased for testing both under- and over-dispersion, and reliably powerful.
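For example, the Pearson-χ2 variant can be requested explicitly; a one-sided test for over-dispersion is a common choice (a sketch using the residual object simulated above):

```r
# Pearson chi-square dispersion test on the simulated residual object
testDispersion(SimulatedOutput_poi, type = "PearsonChisq",
               alternative = "greater")
```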

One important source of over-dispersion is zero-inflation. In DHARMa, zero-inflation can be formally tested using the function testZeroInflation(), as in testZeroInflation(SimulatedOutput_poi) here, which compares the distribution of expected zeros from the simulations against the observed zeros.

Based on the AIC and BIC model comparison results, zinb2 is one of the best models. To confirm these results and check whether zinb2 fits the data better than the Poisson model, we simulate the scaled residuals from the fitted zinb2 model and perform goodness-of-fit tests on them below (Figs. 17.5 and 17.6).
> # Figure 17.5
> rest_zinb2 <- simulateResiduals(fittedModel = zinb2, plot = T)
Or
> rest_zinb2 <- simulateResiduals(zinb2, plot = T)


Fig. 17.5

QQ-plot residuals (left panel) and Residuals vs. predicted values plot (right panel) in fitted zinb2 model by the glmmTMB package


Fig. 17.6

Levene test for homogeneity of variance (left panel), nonparametric dispersion test (middle panel), and zero-inflation test (right panel) for fitted zinb2 model by the glmmTMB package

> # Figure 17.6
par(mfrow = c(1,3))
> plotResiduals(rest_zinb2, ds$Group)
> testDispersion(rest_zinb2)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.75, p-value = 0.8
alternative hypothesis: two.sided
> testZeroInflation(rest_zinb2)
DHARMa zero-inflation test via comparison to expected zeros with
simulation under H0 = fitted model
data: simulationOutput
ratioObsSim = 0.98, p-value = 1
alternative hypothesis: two.sided
It is obvious that the fitted zinb2 model outperforms the fitted Poisson model for the outcome variable D_6__Ruminiclostridium.sp.KB18.
  • Step 8: Summarize the modeling results and extract the outputs of selected models.

The following commands implement ZINB(family = nbinom2) and summarize the modeling results (Table 17.2).
Table 17.2

Modeling results from the models that use offset via glmmTMB

Model      Conditional model:   Estimate   Std. Error   z value   P-value
           interaction term
poi        Group:TimePost       −4.939     0.606        −8.15     3.6e-16
cmpoi      Group:TimePost       −4.347     1.032        −4.21     2.6e-05
nb1        Group:TimePost       −2.44      1.20         −2.03     0.0425
nb2        Group:TimePost       −4.905     1.512        −3.24     0.0012
nbdisp1    Group:TimePost       −1.669     1.107        −1.51     0.13
nbdisp2    Group:TimePost       −4.903     1.542        −3.18     0.0015
zip0       Group:TimePost       −6.879     0.645        −10.67    < 2e-16
zip        Group:TimePost       −6.291     0.732        −8.59     < 2e-16
zicmp      Group:TimePost       −5.944     0.975        −6.10     1.1e-09
zicmp0     Group:TimePost       −6.423     0.909        −7.07     1.6e-12
zinb1      Group:TimePost       −5.827     0.939        −6.21     5.4e-10
zinb2      Group:TimePost       −5.936     0.858        −6.92     4.6e-12
zhp        Group:TimePost       −24.8      11825.8      0         1

>zinb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read))
+ (1|MouseID), zi=~ Group, data=ds, family=nbinom2)
> summary(zinb2)
Family: nbinom2 ( log )
Formula:
D_6__Ruminiclostridium.sp..KB18 ~ Group * Time + offset(log(Total.Read)) +
(1 | MouseID)
Zero inflation: ~Group
Data: ds
AIC BIC logLik deviance df.resid
204.3 217.9 -94.2 188.3 32
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
MouseID (Intercept) 1.9e-08 0.000138
Number of obs: 40, groups: MouseID, 20
Dispersion parameter for nbinom2 family (): 3.09
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.595 0.693 -16.73 < 2e-16 ***
Group 3.438 0.785 4.38 1.2e-05 ***
TimePost 5.861 0.739 7.93 2.2e-15 ***
Group:TimePost -5.936 0.858 -6.92 4.6e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.212 0.580 -0.36 0.71
Group 0.201 0.734 0.27 0.78

The following commands implement ZHP(family = truncated_poisson) and summarize the modeling results.

> zhp = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + offset(log(Total.Read))
+ + (1|MouseID), zi=~ Group*Time, data=ds, family=truncated_poisson)
> summary(zhp)
Family: truncated_poisson ( log )
Formula:
D_6__Ruminiclostridium.sp..KB18 ~ Group * Time + offset(log(Total.Read)) +
(1 | MouseID)
Zero inflation: ~Group * Time
Data: ds
AIC BIC logLik deviance df.resid
203.2 218.4 -92.6 185.2 31
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
MouseID (Intercept) 0.366 0.605
Number of obs: 40, groups: MouseID, 20
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.0 11825.8 0 1
Group 22.1 11825.8 0 1
TimePost 24.2 11825.8 0 1
Group:TimePost -24.8 11825.8 0 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.47e-01 6.90e-01 1.23 0.22
Group 5.23e-07 9.76e-01 0.00 1.00
TimePost -8.47e-01 9.36e-01 -0.90 0.37
Group:TimePost -8.47e-01 1.35e+00 -0.63 0.53
A significant Group by TimePost interaction in the conditional model is identified by the nbinom2 model; by contrast, this term is not significant in the truncated Poisson model (Table 17.3).
Table 17.3

Modeling results from the models that do not use offset via glmmTMB

Model      Conditional model:   Estimate   Std. Error   z value   P-value
           interaction term
poi        Group:TimePost       −4.678     0.606        −7.72     1.2e-14
cmpoi      Group:TimePost       −4.135     1.026        −4.03     5.5e-05
nb1        Group:TimePost       −2.230     1.130        −1.97     0.04847
nb2        Group:TimePost       −4.68      1.51         −3.10     0.0019
nbdisp1    Group:TimePost       −1.478     1.114        −1.33     0.18
nbdisp2    Group:TimePost       −4.68      1.54         −3.04     0.0024
zip        Group:TimePost       −6.037     0.627        −9.63     < 2e-16
zicmp      Group:TimePost       −5.732     0.948        −6.05     1.5e-09
zicmp0     Group:TimePost       −6.271     0.857        −7.32     2.5e-13
zinb1      Group:TimePost       −6.247     0.937        −6.67     2.6e-11
zinb2      Group:TimePost       −5.711     0.840        −6.80     1.1e-11
zhp        Group:TimePost       −23.9      8656.4       0         1
zhnb1      Group:TimePost       −23.9      10093.7      0         1
zhnb2      Group:TimePost       −23.8      8597.3       0         1

Based on AIC, the best model is the zero-hurdle negative binomial (zhnb2), which assumes that the variance increases quadratically with the mean. We print the output below.
> zhnb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group*Time, data=ds, family= truncated_nbinom2)
> summary(zhnb2)
Family: truncated_nbinom2 ( log )
Formula:
D_6__Ruminiclostridium.sp..KB18 ~ Group * Time + (1 | MouseID)
Zero inflation: ~Group * Time
Data: ds
AIC BIC logLik deviance df.resid
201.3 218.2 -90.7 181.3 30
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
MouseID (Intercept) 0.0314 0.177
Number of obs: 40, groups: MouseID, 20
Dispersion parameter for truncated_nbinom2 family (): 3.64
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.8 8597.3 0 1
Group 21.4 8597.3 0 1
TimePost 24.0 8597.3 0 1
Group:TimePost -23.8 8597.3 0 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.47e-01 6.90e-01 1.23 0.22
Group 4.26e-05 9.76e-01 0.00 1.00
TimePost -8.47e-01 9.36e-01 -0.90 0.37
Group:TimePost -8.47e-01 1.35e+00 -0.63 0.53

The Group by TimePost interaction in the conditional model is not statistically significant in the truncated nbinom2 model.

Based on BIC, the best model is the zero-inflated negative binomial (zinb2), which assumes that the variance increases quadratically with the mean. We print the output below.
> zinb2 = glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group*Time + (1|MouseID), zi=~ Group, data=ds, family=nbinom2)
> summary(zinb2)
Family: nbinom2 ( log )
Formula:
D_6__Ruminiclostridium.sp..KB18 ~ Group * Time + (1 | MouseID)
Zero inflation: ~Group
Data: ds
AIC BIC logLik deviance df.resid
203.1 216.6 -93.5 187.1 32
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
MouseID (Intercept) 1.19e-08 0.000109
Number of obs: 40, groups: MouseID, 20
Dispersion parameter for nbinom2 family (): 3.39
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.705 0.686 -1.03 0.3
Group 3.310 0.772 4.29 1.8e-05 ***
TimePost 5.930 0.728 8.14 3.9e-16 ***
Group:TimePost -5.711 0.840 -6.80 1.1e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.215 0.580 -0.37 0.71
Group 0.208 0.733 0.28 0.78

The Group by TimePost interaction in the conditional model is statistically significant in the zero-inflated nbinom2 model.

17.2.3 Remarks on glmmTMB

In this section, we demonstrated that glmmTMB can easily fit complicated models, although a maximally complex model might not be necessary and might not converge. The glmmTMB() function has an interface similar to that of the lme4 package. Overall, glmmTMB is a very flexible package for modeling zero-inflated and over-dispersed count data. Although it is not time-efficient, glmmTMB is able to estimate the Conway-Maxwell-Poisson distribution parameterized by the mean (Brooks et al. 2017), a unique feature among packages that fit zero-inflated and over-dispersed mixed models. The benefits of using this package (Brooks et al. 2017) include:
  • Various models including GLMs, GLMMs, zero-inflated GLMMs, and hurdle models can be quickly fitted in a single package.

  • The information criteria provided in the modeling results facilitate model comparison and selection via likelihood-based methods for comparing the fits of the estimated models.

  • At least one simulation study in the microbiome literature (Zhang and Yi 2020b) confirmed that glmmTMB controlled the false-positive rates close to the significance level over different sample sizes, while retaining reasonable empirical power.

However, glmmTMB also has some disadvantages, such as:
  • Estimating the Conway-Maxwell-Poisson distribution is computationally expensive when the model is over-parameterized.

  • Convergence is an issue when the model is over-specified with more predictors than necessary.

Based on AIC and BIC and the ability of the conditional model to detect significant variables of interest, we recommend using zinb (zinb1 and zinb2) and zicmp rather than zhp and zhnb (zhnb1 and zhnb2). The former are robust in model estimation and sensitive in identifying significant variables of interest. For the conceptual considerations in choosing between zero-hurdle and zero-inflated models, the readers are referred to Xia et al. (2018b).

The benefit of using an offset has not been established in microbiome research; like the practice of sequencing depth rarefaction (rarefying sequencing reads to the same depth), which was adopted from macro-ecology, the argument for using an offset has not been solidly validated and is difficult to validate. In this study, we compared the same GLMMs fitted with and without an offset and found, based on AIC and BIC, that the goodness-of-fit was not necessarily improved by using an offset. In fact, several advanced/complicated GLMMs had model convergence problems. The lack of improvement from using an offset may be due to two reasons: (1) for a univariate approach (using one taxon at a time as the response variable), it may be neither appropriate nor necessary to adjust for total sample reads; (2) for advanced/complicated GLMMs, the additional parameter to estimate adds a computational burden.
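For readers who still wish to adjust for sequencing depth, an offset can be added to the conditional formula in glmmTMB; a sketch under the assumption that ds contains a per-sample library-size column (here hypothetically named TotalReads):

```r
library(glmmTMB)
# ZINB model with log total reads as an offset in the conditional model
# (TotalReads is a hypothetical column of per-sample total sequence reads)
zinb2_offset <- glmmTMB(D_6__Ruminiclostridium.sp..KB18 ~ Group * Time +
                          offset(log(TotalReads)) + (1 | MouseID),
                        zi = ~ Group, data = ds, family = nbinom2)
# Compare goodness-of-fit with and without the offset
AIC(zinb2, zinb2_offset)
BIC(zinb2, zinb2_offset)
```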

17.3 Generalized Linear Mixed Modeling Using the GLMMadaptive Package

In this section, we introduce another R package GLMMadaptive and illustrate its use with microbiome data.

17.3.1 Introduction to GLMMadaptive

GLMMadaptive was developed by Dimitris Rizopoulos (2022a) to fit GLMMs for a single grouping factor under maximum likelihood, approximating the integrals over the random effects. Unlike the lme4 and glmmTMB packages, this package uses an adaptive Gauss-Hermite quadrature (AGQ) rule (Pinheiro and Bates 1995). GLMMadaptive provides functions for fitting and post-processing mixed effects models for grouped/repeated/clustered outcomes for which the integral over the random effects in the definition of the marginal likelihood cannot be solved analytically. The package fits mixed effects models for grouped/repeated measurements data that have a non-normal distribution, while allowing for multiple correlated random effects.

In GLMMadaptive, the AGQ rule is efficiently implemented to allow for specifying multiple correlated random effects terms for the grouping factor, including random intercepts, random linear slopes, and random quadratic slopes. The package also offers several utility functions for extracting useful information from fitted mixed effects models. Additionally, GLMMadaptive provides a hybrid optimization procedure: it starts with an EM algorithm, treating the random effects as "missing data," and then performs direct optimization with a quasi-Newton algorithm.

17.3.2 Implement GLMMs via GLMMadaptive

GLMMadaptive implements GLMMs via a single model-fitting function, mixed_model(), with four required arguments. One example syntax is given below.
  • mixed_model(fixed = y ~ x1 + x2, random = ~1 | g, data = df, family = zi.negative.binomial()).

where the argument:
  • fixed is a formula that is used for specifying the fixed effects including the response (outcome) variable.

  • random is a formula that is used for specifying the random effects.

  • family is a family object that is used for specifying the type of the repeatedly measured response variable (e.g., binomial() or poisson()).

  • data is a data frame containing the variables required in fixed and random.

In the mixed effects model, y denotes a grouped/clustered outcome, x1 and x2 denote the covariates, and g denotes the grouping factor. The above example syntax specifies y as the outcome, x1 and x2 as fixed effects, and random intercepts for the grouping factor g. The mixed effects model is fitted by a zero-inflated negative binomial model. If we also want to specify both the intercepts and x1 as random effects, then update to random = ~ x1 | g (e.g., ~ time | id) in the call to the mixed_model() function.

GLMMadaptive provides several standard methods to access the returned modeling results: summary(), anova(), coef(), fixef(), ranef(), vcov(), logLik(), confint(), fitted(), residuals(), predict(), and simulate(). GLMMadaptive can fit various GLMMs. We summarize the family objects and the models that can be fitted by GLMMadaptive in Table 17.4.
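For example, assuming a fitted MixMod object fm (and a nested model fm0 for a likelihood ratio test), these methods can be used as follows:

```r
# fm and fm0 are fitted MixMod objects returned by mixed_model()
summary(fm)        # coefficient table and fit statistics
fixef(fm)          # fixed-effect estimates
ranef(fm)          # empirical Bayes estimates of the random effects
confint(fm)        # confidence intervals for the fixed effects
anova(fm0, fm)     # likelihood ratio test of nested models
```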
Table 17.4

Family objects and models that can be fitted in GLMMadaptive

Family object                        Model to fit
negative.binomial()                  Negative binomial
zi.poisson()                         Zero-inflated Poisson
zi.negative.binomial()               Zero-inflated negative binomial
hurdle.poisson()                     Hurdle Poisson
hurdle.negative.binomial()           Hurdle negative binomial
hurdle.lognormal()                   Two-part/hurdle mixed models for semi-continuous normal data
censored.normal()                    Linear mixed models for censored normal data (right and left censoring)
cr_setup() and cr_marg_probs()       Continuation ratio mixed models for ordinal data
beta.fam()                           Beta
hurdle.beta.fam()                    Hurdle beta
Gamma() or Gamma.fam()               Gamma mixed effects models
For the repeated measurements response variable, we may specify our own log-density function and use the internal algorithms for the optimization. Marginalized coefficients can be calculated via marginal_coefs(), which uses the approach of Hedeker et al. (2018) to obtain marginal coefficients from mixed models with nonlinear link functions. The GLMMadaptive package also provides predictions and their confidence intervals for constructing effects plots via effectPlotData().
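A minimal sketch of both utilities, assuming a fitted MixMod object fm and a data frame nDF of covariate values at which to predict:

```r
# Marginalized (population-averaged) coefficients with standard errors
mc <- marginal_coefs(fm, std_errors = TRUE)
mc
# Predictions and 95% confidence intervals for an effects plot
plot_data <- effectPlotData(fm, newdata = nDF)
head(plot_data)
```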

Example 17.2: Breast Cancer QTRT1 Mouse G Microbiome, Example 17.1 Cont.

Here, we first use the same example data as in Example 17.1 to illustrate how to fit negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson, and hurdle negative binomial models using the GLMMadaptive package. We then illustrate how to perform model comparison and model selection to choose the best models. As we described in Example 17.1, the species “D_6__Ruminiclostridium.sp.KB18” has 30%, 50%, and 70% of zeros in the genotype and wild-type (WT) groups before and post treatment. We use this species to illustrate implementing GLMMs using GLMMadaptive through the following six steps.

  • Step 1: Install package and load data.

To install the GLMMadaptive package (version 0.8-5, 2022-02-07), type the following commands in R or RStudio.

install.packages("GLMMadaptive")

The development version of the GLMMadaptive package and its vignettes are available from GitHub, which can be installed using the devtools package:

devtools::install_github("drizopoulos/GLMMadaptive")
We can install the package with vignettes using the devtools package:
devtools::install_github("drizopoulos/GLMMadaptive", build_opts=NULL)
library(GLMMadaptive)
help(GLMMadaptive)
> setwd("~/Documents/QIIME2R/Ch17_GLMMs2")
> otu_tab=read.csv("otu_table_L7_MCF7.csv",row.names=1,check.names=FALSE)
> meta_tab <- read.csv("metadata_QtRNA.csv",row.names=1,check.names = FALSE)
  • Step 2: Create analysis dataset including outcomes and covariates.

In Sect. 17.2 (Example 17.1), we double filtered and created an analysis dataset in order to effectively implement the models. We repeat the core commands below.
> otu_tab_t<-t(otu_tab)
> outcome =otu_tab_t
> # Match sample IDs in outcome matrix or data frame and meta_tab data frame
> outcome = outcome[rownames(meta_tab), ]
> yy = as.matrix(outcome)
> yy = ifelse(is.na(yy), 0, yy) #Exclude missing outcome variables if any
> # Filter abundance taxa data
> # Here we use double filters
> filter1 <- apply(yy, 2, function(x){sum(x>0)>0.4*length(x)})
> filter2 <- apply(yy, 2, function(x){quantile(x,0.9)>1})
> filter12 <-yy[,filter1&filter2]
> ## Match the number of taxa to the filter12 dataset in the analysis
> yy = filter12[rownames(meta_tab), ]
> ds = data.frame(meta_tab,yy)
  • Step 3: Fit candidate mixed models.

> library(GLMMadaptive)
Poisson Mixed Model (family = poisson())
The following commands call the mixed_model() function to fit a Poisson model with fixed effects of group, time, and their interaction, and with intercepts and time slopes varying randomly across mice.
> poi <- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = poisson())
Negative Binomial Mixed Model (family = negative.binomial())
> nb <- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = negative.binomial())
Zero-Inflated Poisson Mixed Effects Model (family = zi.poisson())
The following commands fit the zero-inflated Poisson mixed model to improve on the simple Poisson model by allowing for excess zeros. We specify an intercept-only zero-inflation part (zi_fixed = ~ 1) and allow the zero-inflation intercept to vary across mice (zi_random = ~ 1 | MouseID).
> zip <- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = zi.poisson(), zi_fixed = ~ 1, zi_random = ~ 1| MouseID)
Zero-Inflated Negative Binomial Mixed Effects Model (family = zi.negative.binomial())
The following commands fit a zero-inflated negative binomial mixed effects model with intercepts as the zero-inflated fixed and random effects.
> zinb <- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = zi.negative.binomial(), zi_fixed = ~1, zi_random = ~ 1| MouseID)
Hurdle Poisson Mixed Effects Model (family = hurdle.poisson())
Hurdle Poisson mixed models can be fitted by mixed_model() using the family object hurdle.poisson(). A hurdle Poisson model typically uses a zero-truncated Poisson distribution for the positive counts.
> zhp<- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = hurdle.poisson(), zi_fixed = ~ 1,iter_EM = 0)
Hurdle Negative Binomial Mixed Effects Model (family = hurdle.negative.binomial())
Hurdle negative binomial mixed models can be fitted by mixed_model() using the family object hurdle.negative.binomial(). A hurdle negative binomial model typically uses a zero-truncated negative binomial distribution for the positive counts.
> zhnb<- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = hurdle.negative.binomial(), zi_fixed = ~ 1,iter_EM = 0, iter_qN_outer = 0)
  • Step 4: Perform model comparisons and model selection.

Model Comparisons Using Likelihood Ratio Test
> # Use a likelihood ratio test to compare NB with Poisson
> anova(poi, nb)
AIC BIC log.Lik LRT df p.value
poi 231.4 238.4 -108.7
nb 224.7 232.6 -104.3 8.79 1 0.003
> # Use a likelihood ratio test to compare ZINB with ZIP
> anova(zip, zinb)
AIC BIC log.Lik LRT df p.value
zip 203.7 214.7 -90.86
zinb 205.8 217.7 -90.88 0.05 1 0.828
> # Use a likelihood ratio test to compare ZHNB with ZHP
> anova(zhp, zhnb)
AIC BIC log.Lik LRT df p.value
zhp 201.1 209.1 -92.57
zhnb 226.2 235.2 -104.10 23.06 1 <0.0001
Warning messages:
1: In anova.MixMod(zhp, zhnb) : it seems that 'zhnb' has not converged.
2: In anova.MixMod(zhp, zhnb) :
it seems that the two objects represent model with different families; are the models nested? If not, you should set 'test' to FALSE.
Model Comparisons Using the bbmle Package
> library(bbmle)
> AICtab(poi,nb,zip,zinb,zhp,zhnb)
dAIC df
zhp 0.0 8
zip 2.6 11
zinb 4.6 12
nb 23.5 8
zhnb 25.1 9
poi 30.3 7
> BICtab(poi,nb,zip, zinb,zhp, zhnb)
dBIC df
zhp 0.0 8
zip 5.6 11
zinb 8.6 12
nb 23.5 8
zhnb 26.1 9
poi 29.3 7
In this case, the hurdle (zero-truncated) Poisson model fits the data slightly better than the zero-inflated Poisson and zero-inflated negative binomial models, suggesting that the variance of the outcome variable “D_6__Ruminiclostridium.sp.KB18” is due more to zero-inflation than to over-dispersion (see Example 17.1: there are 30%, 50%, and 70% of zeros in the samples before filtering) (Table 17.5).
  • Step 5: Evaluate goodness of fit using simulated residuals via the DHARMa package.

Table 17.5

AIC and BIC values in compared models from GLMMadaptive

Model   AIC     BIC
zhp     201.1   209.1
zip     203.7   214.7
zinb    205.8   217.7
nb      224.6   232.6
zhnb    226.2   235.2
poi     231.4   238.4

We can evaluate the goodness-of-fit of mixed models fitted by the mixed_model() function using the procedures described in the DHARMa package (version 0.4.5, 2022-01-16). Currently, DHARMa does not yet support objects of class MixMod; to enable the use of its procedures, a wrapper function for scaled simulated residuals has been developed by Dimitris Rizopoulos (2022b).
# Function for Scaled Simulated Residuals
# resids_plot:
resids_plot <- function (object, y, nsim = 1000,
                         type = c("subject_specific", "mean_subject"),
                         integerResponse = NULL) {
    if (!inherits(object, "MixMod"))
        stop("this function works for 'MixMod' objects.\n")
    type <- match.arg(type)
    if (is.null(integerResponse)) {
        integer_families <- c("binomial", "poisson", "negative binomial",
                              "zero-inflated poisson", "zero-inflated negative binomial",
                              "hurdle poisson", "hurdle negative binomial")
        numeric_families <- c("hurdle log-normal", "beta", "hurdle beta", "Gamma")
        if (object$family$family %in% integer_families) {
            integerResponse <- TRUE
        } else if (object$family$family %in% numeric_families) {
            integerResponse <- FALSE
        } else {
            stop("non built-in family object; you need to specify the 'integerResponse'\n",
                 "\targument indicating whether the outcome variable is integer or not.\n")
        }
    }
    sims <- simulate(object, nsim = nsim, type = type)
    fits <- fitted(object, type = type)
    dharmaRes <- DHARMa::createDHARMa(simulatedResponse = sims, observedResponse = y,
                                      fittedPredictedResponse = fits,
                                      integerResponse = integerResponse)
    DHARMa:::plot.DHARMa(dharmaRes, quantreg = FALSE)
}
In the above function, the argument:
  • object is an object inheriting from class MixMod.

  • y is the observed response vector.

  • nsim is an integer specifying the number of simulated datasets (default is 1000).

  • type specifies what type of fitted values and data to simulate; it controls whether the random effects are included or set to zero.

  • integerResponse is a logical argument; it is set to TRUE for discrete grouped/clustered outcome data. Based on the family chosen for fitting the model, integerResponse is determined automatically by this function, but for user-specified family objects this argument needs to be defined by the user. Here, we evaluate the goodness-of-fit of built-in families, so the integerResponse argument is determined automatically.

We can either store the “resids_plot.R” wrapper function in the R Script panel and call it directly, or store it in the Source File Location (here, “~/Documents/QIIME2R/Ch17_GLMMs”) and call it using the following command.
> source('resids_plot.R')

The following commands call the function resids_plot() and via the package DHARMa to generate two residual plots: QQ plot residuals on the left panel and residuals vs. predicted values (ranks transformed) on the right panel.

In the plot of residuals vs. predicted values, in order to visualize deviations from uniformity in the y-direction, an (optional, default) quantile regression compares the empirical 0.25, 0.5, and 0.75 quantiles in the y-direction (red solid lines) with the theoretical 0.25, 0.5, and 0.75 quantiles (dashed black lines) and provides a P-value for the deviation from the expected quantiles. The significance of the deviation from the expected quantiles is tested, displayed visually, and can additionally be extracted with the testQuantiles() function (Fig. 17.7).


Fig. 17.7

QQ-plot residuals (left panel) and residuals vs. predicted values plot(right panel) in fitted Poisson model by the GLMMadaptive package

> # Figure 17.7
> library(DHARMa)
> source('resids_plot.R')
> resids_plot(poi, ds$D_6__Ruminiclostridium.sp..KB18)
Unlike with glmmTMB, these two plots provide obvious evidence that the Poisson model does not fit the data satisfactorily (Fig. 17.8).


Fig. 17.8

The significance of the deviation of the empirical quantiles to the expected quantiles in fitted Poisson model is tested and visualized with the testQuantiles () function

> # Figure 17.8
> testQuantiles(poi,ds$D_6__Ruminiclostridium.sp..KB18)
Test for location of quantiles via qgam
data: simulationOutput
p-value = 3e-09
alternative hypothesis: both
First, let’s check the simulated residuals from the fitted zero-truncated Poisson distribution (Figs. 17.9 and 17.10).


Fig. 17.9

QQ-plot residuals (left panel) and residuals vs. predicted values plot (right panel) in fitted zero-truncated Poisson model by the GLMMadaptive package


Fig. 17.10

The significance of the deviation of the empirical quantiles to the expected quantiles in fitted zero-truncated Poisson model is tested and visualized with the testQuantiles () function

> # Figure 17.9

> resids_plot(zhp, ds$D_6__Ruminiclostridium.sp..KB18)

> # Figure 17.10
> testQuantiles(zhp,ds$D_6__Ruminiclostridium.sp..KB18)
Test for location of quantiles via qgam
data: simulationOutput
p-value <2e-16
alternative hypothesis: both

The QQ-plot shows that the zero-truncated Poisson model improves the model fit compared to the Poisson model and seems acceptable. However, based on the test of the significance of the deviation of the empirical quantiles from the expected quantiles, the zero-truncated Poisson model still does not fit well.

The following commands check the simulated residuals from the fitted zero-inflated negative binomial distribution (Figs. 17.11 and 17.12).


Fig. 17.11

QQ-plot residuals (left panel) and residuals vs. predicted values plot (right panel) in fitted zero-inflated negative binomial model by the GLMMadaptive package


Fig. 17.12

The significance of the deviation of the empirical quantiles to the expected quantiles in fitted zero-inflated negative binomial model is tested and visualized with the testQuantiles () function

> # Figure 17.11
> resids_plot(zinb, ds$D_6__Ruminiclostridium.sp..KB18)
> # Figure 17.12
> testQuantiles(zinb,ds$D_6__Ruminiclostridium.sp..KB18)
Test for location of quantiles via qgam
data: simulationOutput
p-value = 3e-06
alternative hypothesis: both

The QQ-plot shows that the model fit of the zero-inflated negative binomial model is comparable to that of the zero-truncated Poisson model, but the deviation between the empirical quantiles and the expected quantiles is still significant.

Similarly, the simulated residuals from other fitted models can be checked using the following commands.
> resids_plot(nb, ds$D_6__Ruminiclostridium.sp..KB18)
> resids_plot(zip, ds$D_6__Ruminiclostridium.sp..KB18)
> resids_plot(zhnb, ds$D_6__Ruminiclostridium.sp..KB18)
  • Step 6: Summarize the modeling results and extract the outputs of selected models.

  • The partial output from ZINB is printed below.

> zinb <- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = zi.negative.binomial(), zi_fixed = ~1, zi_random = ~ 1| MouseID)
Model:
family: zero-inflated negative binomial
link: log
Fit statistics:
log.Lik AIC BIC
-90.88 205.8 217.7
Random effects covariance matrix:
StdDev Corr
(Intercept) 0.7268 (Intr) TimPst
TimePost 1.3550 -0.9886
zi_(Intercept) 1.9091 -0.9382 0.9060
Fixed effects:
Estimate Std.Err z-value p-value
(Intercept) -1.125 0.9867 -1.140 0.25422
Group 2.984 0.7790 3.830 0.00013
TimePost 6.673 1.0771 6.196 < 1e-04
Group:TimePost -5.716 0.9889 -5.780 < 1e-04
  • The partial output from ZHP is printed below.

> zhp<- mixed_model(fixed = D_6__Ruminiclostridium.sp..KB18 ~ Group*Time, random = ~ Time | MouseID, data = ds, family = hurdle.poisson(), zi_fixed = ~ 1,iter_EM = 0)
> summary(zhp)
Model:
family: hurdle poisson
link: log
Fit statistics:
log.Lik AIC BIC
-92.57 201.1 209.1
Random effects covariance matrix:
StdDev Corr
(Intercept) 0.5510
TimePost 1.0677 -0.9375
Fixed effects:
Estimate Std.Err z-value p-value
(Intercept) -5.699 13.03 -0.4374 0.66
Group 7.992 13.03 0.6132 0.54
TimePost 10.841 13.04 0.8317 0.41
Group:TimePost -10.538 13.05 -0.8077 0.42
A significant fixed effect of the Group by TimePost interaction is identified in the zero-inflated negative binomial model; by contrast, this term is not significant in the zero-hurdle Poisson model (Table 17.6).
Table 17.6

The modeling results of Group by Time interaction from the selected models

Model   Fixed-effects interaction term   Estimate   Std. Error   z value   P-value
poi     Group:TimePost                   −3.539     1.654        −2.139    0.0324
nb      Group:TimePost                   −4.678     1.5099       −3.098    0.0019
zip     Group:TimePost                   −5.697     0.9800       −5.814    < 1e-04
zinb    Group:TimePost                   −5.716     0.9889       −5.780    < 1e-04
zhp     Group:TimePost                   −10.538    13.05        −0.8077   0.42
zhnb    Group:TimePost                   −4.678     1.957        −2.391    0.01682

17.3.3 Remarks on GLMMadaptive

Various numerical approximating methods and Markov chain Monte Carlo (MCMC) have been increasingly used to estimate GLMMs in practice. For example, MCMC and Hamiltonian Monte Carlo can accurately evaluate the integrals under the Bayesian paradigm. However, none has good properties for all possible models and data. As reviewed in Chap. 16 (Sects. 16.4.1 and 16.4.3), the numerical approximating methods are particularly problematic for binary/dichotomous data and for count data with small counts and few repeated measurements, because in these conditions the accuracy of the approximation is rather low.

In contrast, the numerical integration algorithms used by GLMMadaptive and glmmTMB are considered more accurate than the numerical approximating methods. In particular, the AGQ algorithm is commonly considered the gold-standard numerical approximation method (Pinheiro and Bates 1995). GLMMadaptive has adopted this method and focuses on maximum likelihood estimation. Along with glmmTMB, the GLMMadaptive approach has been confirmed to control false-positive rates close to the significance level over different sample sizes while retaining reasonable empirical power (Zhang and Yi 2020b). However, GLMMadaptive also has weaknesses, including:
  • The AGQ rule is more computationally intensive compared to the numerical approximating methods or MCMC approach.

  • The current version (0.8-5, February 2022) only allows a single grouping factor; i.e., nested or crossed random-effects designs are not supported.

Based on the model fit evaluated by AIC and BIC and the capability of the fixed effects model in detecting significant variables of interest, we recommend using ZINB to analyze microbiome data when using GLMMadaptive.

17.4 Fast Zero-Inflated Negative Binomial Mixed Modeling (FZINBMM)

FZINBMM (Zhang and Yi 2020b) was specifically designed for longitudinal microbiome data. Similar to glmmTMB and GLMMadaptive, FZINBMM belongs to the univariate approach to longitudinal microbiome data analysis. However, unlike glmmTMB and GLMMadaptive, which integrate over the random effects using the Laplace approximation and an adaptive Gauss-Hermite quadrature (AGQ) rule, respectively, FZINBMM employs the EM-IWLS algorithm.

17.4.1 Introduction to FZINBMM

FZINBMM (Zhang and Yi 2020b) was developed to analyze high-dimensional longitudinal metagenomic count data, including both 16S rRNA and whole-metagenome shotgun sequencing data. The goal of FZINBMM is to simultaneously address the main challenges of longitudinal metagenomic data, including high dimensionality, dependence among samples, and zero-inflation of observed counts. FZINBMM has two advantages: (1) built on zero-inflated negative binomial mixed models (ZINBMMs), it can analyze over-dispersed and zero-inflated longitudinal metagenomic count data; (2) it uses a fast and stable EM-iterative weighted least squares (IWLS) model-fitting algorithm to fit the ZINBMMs, which takes advantage of fitting linear mixed models (LMMs). Thus, FZINBMM can handle various types of fixed and random effects and within-subject correlation structures and can analyze many taxa quickly.

17.4.1.1 Zero-Inflated Negative Binomial Mixed Models (ZINBMMs)

Let cijm denote the observed count for the mth taxon from the jth sample of the ith subject in a longitudinal metagenomics setting, and let Tij denote the total sequence reads, where yij = cijm, j = 1, …, ni, i = 1, …, n, n is the total number of subjects, and ni is the total number of samples for each subject. For any given taxon m, assume that yij follows the zero-inflated negative binomial (ZINB) distribution. ZINB is a two-part model, with Part I (a logistic model) used for predicting excess zeros and Part II (a NB model) used for modeling over-dispersed counts. The ZINB distribution is written as:
$$ {y}_{ij}\sim \left\{\begin{array}{ll}0&amp; {p}_{ij}\\ {} NB\left({y}_{ij}|{\mu}_{ij},\theta \right)&amp; 1-{p}_{ij}\end{array}\right., $$
(17.2)
where pij denotes the unknown probability of yij being from the excess-zero state, and 𝜇ij and θ are the mean and the dispersion parameter of the NB distribution, respectively. The relationships between the probabilities and the potential covariates are linked through logit link functions. There are two logit link functions for the logistic model. The logit link function in (17.3) includes only an intercept in Zij, indicating that the model assumes the same probability of belonging to the excess zeros for all observed zeros.
$$ \mathrm{logit}\left({p}_{ij}\right)={Z}_{ij}\alpha . $$
(17.3)
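Note that under (17.2) an observed zero can arise from either part of the model, so the marginal probability of a zero is the mixture

$$ P\left({y}_{ij}=0\right)={p}_{ij}+\left(1-{p}_{ij}\right) NB\left(0|{\mu}_{ij},\theta \right), $$

which is what the logistic part in (17.3) and (17.4) models through pij.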

To include subject-specific effects (random effects), we can extend the above logistic model as follows:

$$ \mathrm{logit}\left({p}_{ij}\right)={Z}_{ij}\alpha +{G}_{ij}{\alpha}_i. $$
(17.4)

In (17.3) and (17.4), Zij denotes the potential covariates associated with the excess zeros, and α is the vector of their effects. αi denotes the subject-specific effects, which are usually assumed to follow a multivariate normal distribution (McCulloch and Searle 2001; Searle and McCulloch 2001; Pinheiro and Bates 2000):

$$ {\alpha}_i\sim N\left(0,{\Psi}_{\alpha}\right), $$
(17.5)
where Ψα is the variance-covariance matrix for the random effects. In the NB model, the means 𝜇ij are assumed to link with the covariates via a logarithmic link function:
$$ \log \left({\mu}_{ij}\right)=\log \left({T}_{ij}\right)+{X}_{ij}\beta +{G}_{ij}{b}_i, $$
(17.6)
where log(Tij) is the offset accounting for the sequencing depth of the total sequence reads; β is the vector of fixed effects (i.e., population-level effects); bi denotes the random effects (i.e., the vector of subject-specific effects). Xij is the vector of covariates for the jth sample of the ith subject. In longitudinal studies, the jth sample of the ith subject is collected at a different time tij. Thus, Xij could be $$ \left(1,{X}_i\right) $$, $$ \left(1,{X}_i,{t}_{ij}\right) $$, or $$ \left(1,{X}_i,{t}_{ij},{X}_i^s{t}_{ij}\right) $$, where $$ {X}_i^s $$ is an indicator variable of interest in Xi, such as an indicator variable for the case-control group. Gij denotes the vector of group-level covariates, with 1 indicating that only a subject-specific random intercept is specified in the model and (1, tij) indicating that a subject-specific random intercept and a random time effect are included in the model.

Similar to αi in (17.4), the subject-specific effects bi (random effects) in (17.6) are assumed to follow a multivariate normal distribution (McCulloch and Searle 2001; Searle and McCulloch 2001; Pinheiro and Bates 2000):

$$ {b}_i\sim N\left(0,{\Psi}_b\right), $$
(17.7)
where Ψb is the variance-covariance matrix for the random effects bi. The matrices Ψα and Ψb are general positive-definite matrices that model the correlation structure among the random effects. In the simplest case, they are restricted to be diagonal (see the introduction to GLMMs in Sect. 16.2 for details).

17.4.1.2 EM-IWLS Algorithm for Fitting the ZINBMMs

The above ZINBMMs are fitted via a proposed fast and stable EM-IWLS algorithm. A vector of latent indicator variables $$ \xi =\left({\xi}_{i1},\dots, {\xi}_{in_i}\right) $$ is used to distinguish the logit part for excess zeros from the NB distribution, with ξij = 1 indicating that yij is from the excess-zero state and ξij = 0 indicating that yij is from the NB distribution. The log-likelihood of the complete data (y, ξ) is written as:
$$ {\displaystyle \begin{array}{l}L\left(\Phi; y,\xi \right)=\sum \limits_{i=1}^n\sum \limits_{j=1}^{n_i}\log \left[{p}_{ij}^{\xi_{ij}}{\left(1-{p}_{ij}\right)}^{1-{\xi}_{ij}}\right]\\ {}+\sum \limits_{i=1}^n\sum \limits_{j=1}^{n_i}\left(1-{\xi}_{ij}\right)\log \left( NB\left({y}_{ij}|{\mu}_{ij},\phi \right)\right),\end{array}} $$
(17.8)
where Φ denotes all the parameters of fixed and random effects in the ZINBMMs. The E-step in the EM-IWLS algorithm replaces the indicator variables ξij by their conditional expectations $$ {\hat{\xi}}_{ij} $$, which can be estimated from the conditional probability of ξij given all the parameters Φ and the data yij:
$$ {\displaystyle \begin{array}{l}{\hat{\xi}}_{ij}=p\left({\xi}_{ij}=1|\Phi, {y}_{ij}\right)\\ {}=\frac{p\left({y}_{ij}|{\mu}_{ij},\phi, {\xi}_{ij}=1\right)p\left({\xi}_{ij}=1|{p}_{ij}\right)}{p\left({y}_{ij}|{\mu}_{ij},\phi, {\xi}_{ij}=0\right)p\left({\xi}_{ij}=0|{p}_{ij}\right)+p\left({y}_{ij}|{\mu}_{ij},\phi, {\xi}_{ij}=1\right)p\left({\xi}_{ij}=1|{p}_{ij}\right)},\end{array}} $$
(17.9)
  • If yij ≠ 0, we have p(yij | μij, ϕ, ξij = 1) = 0, and thus $$ {\hat{\xi}}_{ij}=0 $$.

  • If yij = 0, we have

$$ {\displaystyle \begin{array}{l}{\hat{\xi}}_{ij}={\left[\frac{p\left({\xi}_{ij}=0|{p}_{ij}\right)}{p\left({\xi}_{ij}=1|{p}_{ij}\right)}p\left({y}_{ij}=0|{\mu}_{ij},\phi, {\xi}_{ij}=0\right)+1\right]}^{-1}\\ {}={\left[\frac{1-{p}_{ij}}{p_{ij}} NB\left({y}_{ij}=0|{\mu}_{ij},\phi \right)+1\right]}^{-1},\end{array}} $$
(17.10)
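As a numerical illustration with made-up values, suppose an observed zero has pij = 0.3 and NB(yij = 0| μij, ϕ) = 0.2; then (17.10) gives

$$ {\hat{\xi}}_{ij}={\left[\frac{1-0.3}{0.3}\times 0.2+1\right]}^{-1}\approx 0.68, $$

i.e., this zero is attributed to the excess-zero state with a probability of about 0.68.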

In the M-step, the parameters in the two parts are updated separately by maximizing $$ L\left(\Phi; y,\hat{\xi}\right) $$.

The parameters in the logit part for excess zeros are updated by running a logistic regression with $$ {\hat{\xi}}_{ij} $$ as response in (17.11) without including a random-effect term:

$$ {\hat{\xi}}_{ij}\sim Bin\left(1,{p}_{ij}\right),\kern0.5em \mathrm{logit}\left({p}_{ij}\right)={Z}_{ij}\alpha, $$
(17.11)

If a random-effect term is included in the logit part, then its parameters are updated by fitting the logistic mixed model in (17.12) with $$ {\hat{\xi}}_{ij} $$ as the response:

$$ {\hat{\xi}}_{ij}\sim Bin\left(1,{p}_{ij}\right),\kern0.5em \mathrm{logit}\left({p}_{ij}\right)={Z}_{ij}\alpha +{G}_{ij}{\alpha}_i,{\alpha}_i\sim N\left(0,{\Psi}_a\right). $$
(17.12)

Both the IWLS algorithm and the penalized quasi-likelihood (PQL) procedure can equivalently approximate the logistic regression or logistic mixed model. The PQL procedure iteratively approximates the logistic likelihood by a weighted LMM, which has been used for fitting GLMMs (Breslow and Clayton 1993; McCulloch and Searle 2001; Schall 1991; Venables and Ripley 2002).

FZINBMM updates the parameters in the means of the NB part using an extended IWLS algorithm, which iteratively approximates the weighted NB likelihood (i.e., the second part of (17.8)) by the weighted LMM:

$$ {\displaystyle \begin{array}{l}{z}_{ij}=\log \left({T}_{ij}\right)+{X}_{ij}\beta +{G}_{ij}{b}_i+{\left(1-{\hat{\xi}}_{ij}\right)}^{-1/2}{w}_{ij}^{-1/2}{e}_{ij}\\ {}{b}_i\sim N\left(0,\Psi \right),{e}_i={\left({e}_{i1},\dots, {e}_{in_i}\right)}^{\hbox{'}}\sim N\left(0,{\sigma}^2{R}_i\right),\end{array}} $$
(17.13)

The extended IWLS algorithm was developed based on the standard IWLS (equivalently, PQL) by Zhang et al. (2017). In (17.13), zij and wij are the pseudo-responses and the pseudo-weights, respectively, and Ri is a correlation matrix accounting for within-subject correlation structures. Here zij and wij are calculated in the same way as in Zhang et al. (2017), where the IWLS algorithm is used for fitting NB mixed models.

As in Pinheiro and Bates (2000), various correlation matrices can be specified in the FZINBMM framework, e.g., autoregressive of order 1 [AR(1)] or continuous-time AR(1), in addition to the simplest independence structure obtained by setting Ri to an identity matrix.

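In practice these correlation structures are constructed with nlme's corStruct classes and passed to glmm.zinb() via its correlation argument; the Time and Subject names below are illustrative placeholders.

```r
library(nlme)

# AR(1) for equally spaced time points within each subject
ar1  <- corAR1(value = 0.3, form = ~ 1 | Subject)

# continuous-time AR(1) for irregularly spaced measurement times
car1 <- corCAR1(value = 0.3, form = ~ Time | Subject)

# omitting the correlation argument (correlation = NULL) corresponds
# to the independence structure, i.e., R_i is an identity matrix
```

These objects would then be supplied as, e.g., glmm.zinb(..., correlation = ar1).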
Finally, FZINBMM uses the standard Newton-Raphson algorithm to update the dispersion parameter ϕ by maximizing the NB likelihood:

$$ L\left(\phi \right)=\sum \sum \left(1-{\hat{\xi}}_{ij}\right)\log NB\left({y}_{ij}|{\hat{\mu}}_{ij},\phi \right), $$
(17.14)
where $$ {\hat{\mu}}_{ij}=\exp \left(\log \left({T}_{ij}\right)+{X}_{ij}\hat{\beta}+{G}_{ij}{\hat{b}}_i\right) $$. The algorithm is judged to have converged based on the criterion:
$$ {\displaystyle \begin{array}{l}{\sum}_{i=1}^n{\sum}_{j=1}^{n_i}\left[{\left({\eta}_{ij}^{(t)}-{\eta}_{ij}^{\left(t-1\right)}\right)}^2+{\left({\gamma}_{ij}^{(t)}-{\gamma}_{ij}^{\left(t-1\right)}\right)}^2\right]\\ {}&lt;\varepsilon \left({\sum}_{i=1}^n{\sum}_{j=1}^{n_i}\left[{\left({\eta}_{ij}^{(t)}\right)}^2+{\left({\gamma}_{ij}^{(t)}\right)}^2\right]\right),\end{array}} $$
(17.15)
where $$ {\eta}_{ij}^{(t)}=\log \left({T}_{ij}\right)+{X}_{ij}{\beta}^{(t)}+{G}_{ij}{b}_i^{(t)} $$, $$ {\gamma}_{ij}^{(t)}={Z}_{ij}{\alpha}^{(t)}+{G}_{ij}{\alpha}_i^{(t)} $$, and ε is a small value (say $$ {10}^{-5} $$). All the estimation steps are repeated until convergence.

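The stopping rule (17.15) compares the squared change in the linear predictors between successive iterations with their squared magnitude; a minimal sketch (an illustrative helper, not NBZIMM code):

```r
# Relative-change convergence check of Eq. (17.15)
converged <- function(eta_t, eta_prev, gamma_t, gamma_prev, eps = 1e-5) {
  change <- sum((eta_t - eta_prev)^2 + (gamma_t - gamma_prev)^2)
  size   <- sum(eta_t^2 + gamma_t^2)
  change < eps * size
}

eta <- c(1.2, -0.8, 0.5)
gam <- c(0.1, 0.4, -0.2)
small_step <- converged(eta, eta + 1e-6, gam, gam - 1e-6)  # tiny update
large_step <- converged(eta, eta + 0.5,  gam, gam)         # large update
```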
Once the model has converged, FZINBMM provides the maximum likelihood estimates of the fixed effects βk in the NB part and αk in the zero-inflation part, together with their associated standard errors.

We can then test the two null hypotheses, respectively, in FZINBMM:

$$ {H}_0:{\beta}_k=0\kern0.5em \mathrm{and}\kern0.5em {H}_0:{\alpha}_k=0. $$

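Each row of the reported tTable is a Wald t test of such a null hypothesis: the estimate divided by its standard error, compared against a t distribution. For instance, using the TimePost row printed later in Example 17.3 (estimate -0.2203, SE 0.2418, 18 df), the test can be reproduced by hand:

```r
# Wald t test for a single fixed effect: t = estimate / SE,
# p-value from a t distribution with the model's residual df
est <- -0.2203
se  <-  0.2418
df  <-  18

t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df)
# t_val and p_val agree with the t-value (-0.9111) and
# p-value (3.743e-01) in the printed tTable
```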
17.4.2 Implement FZINBMM Using the NBZIMM Package

FZINBMM was developed under the LMM framework and is implemented via the function glmm.zinb() in the package NBZIMM. It can therefore handle the same types of random effects and within-subject correlation structures as the function lme(). The function glmm.zinb() works by repeatedly calling the function lme() of the package nlme to fit the weighted LMM, and glm() in the stats package or glmmPQL() in the MASS package to fit the logistic regression or logistic mixed model.

The package NBZIMM is freely available from the public GitHub repository (Yi 2021) (current version 1.0, April 2022). An example of the syntax of the glmm.zinb() function is given below:
  • glmm.zinb(fixed, random, data, correlation, zi_fixed = ~1, zi_random = NULL, niter = 30, epsilon = 1e-05, verbose = TRUE, …)

where the arguments:
  • fixed and random are used to specify the formulas for the fixed-effects (including the count outcome) and the random-effects parts of the negative binomial model, respectively. They are the same as in the function lme() in the package nlme.

  • random only contains the right-hand side part, e.g., ~ time | id, where time is a variable, and id is the grouping factor.

  • data is a data.frame containing all the variables that the model uses.

  • correlation is used to specify an optional correlation structure. It is the same as in the function lme() in the package nlme.

  • zi_fixed and zi_random are used to specify the formulas for the fixed and random effects of the zero-inflated part, respectively. They only contain the right-hand side part.

  • niter denotes the maximum number of iterations.

  • epsilon denotes the positive convergence tolerance.

  • verbose is a logical argument, with verbose = TRUE printing out the number of iterations and computational time.

  • … denotes further arguments passed to lme().

After calling the glmm.zinb() function, the package NBZIMM returns fitted model objects for both the count distribution part and the zero-inflation part. The fitted model object is of class lme, so it can be summarized by functions in the package nlme. The object also contains additional components, including the estimate of the dispersion parameter (theta), the zero-state probabilities (zero.prob), the conditional expectations of the zero indicators (zero.indicator), and the fitted logistic (mixed) model for the zero-inflation part (fit.zero).
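For a fitted object, these components are accessed with the usual list syntax. The mock object below only imitates that structure (the numbers are made up) so the access pattern can be shown without fitting a model:

```r
# mock object imitating components returned by glmm.zinb()
# (values are illustrative, not from a real fit)
f <- list(
  theta          = 2.1,                  # estimated dispersion parameter
  zero.prob      = c(0.12, 0.45, 0.08),  # zero-state probabilities
  zero.indicator = c(0.01, 0.83, 0.00),  # conditional E[xi_ij | data]
  fit.zero       = NULL                  # logistic (mixed) model for zeros
)

f$theta
range(f$zero.indicator)  # conditional expectations lie in [0, 1]
```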

Example 17.3: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 17.1 Cont.

In Chap. 15, we used this dataset to illustrate LMMs for identifying species that change significantly over time and for assessing the association of the Chao 1 richness index with group status over time.

In Examples 17.1 and 17.2, we used this dataset to illustrate glmmTMB and GLMMadaptive, respectively. Here, we use it to illustrate how to identify species that change significantly over time by performing FZINBMM.

The data processing steps, including loading and checking the data and creating the analysis dataset, are the same as in the previous examples. For the reader's convenience, we copy the core commands here and omit the detailed interpretation.

  • Step 1: Process data to create analysis dataset consisting of outcomes and covariates.

> setwd("~/Documents/QIIME2R/Ch17_GLMMs2")
> otu_tab=read.csv("otu_table_L7_MCF7.csv",row.names=1,check.names=FALSE)
> meta_tab <- read.csv("metadata_QtRNA.csv",row.names=1,check.names = FALSE)
> otu_tab_t<-t(otu_tab)
> colnames(otu_tab_t)
> outcome =otu_tab_t
> outcome = outcome[rownames(meta_tab), ]
> yy = as.matrix(outcome)
> yy = ifelse(is.na(yy), 0, yy) #Exclude missing outcome variables if any
> filter1 <- apply(yy, 2, function(x){sum(x>0)>0.4*length(x)})
> filter2 <- apply(yy, 2, function(x){quantile(x,0.9)>1})
> filter12 <-yy[,filter1&filter2]
> yy = filter12[rownames(meta_tab), ]
> taxa_all <- colnames(filter12)
> N = meta_tab[, "Total.Read"] # total reads
> mean(N); sd(N)
> mean(log(N)); sd(log(N))
> subject = meta_tab[, "MouseID"]
> group = meta_tab[, "Group"]
> time = meta_tab[, "Time"]
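The two filters in Step 1 keep a taxon only if it is present (count > 0) in more than 40% of samples and its 90th percentile count exceeds 1. A toy count matrix (hypothetical values) makes the logic concrete:

```r
# toy count matrix: 5 samples (rows) x 3 taxa (columns)
yy_toy <- cbind(
  taxonA = c(0, 0, 0, 0, 2),   # present in 1/5 samples -> dropped
  taxonB = c(3, 0, 5, 2, 4),   # present in 4/5, 90th percentile > 1 -> kept
  taxonC = c(1, 1, 1, 0, 1)    # present in 4/5, but 90th percentile = 1 -> dropped
)

# same filters as in Step 1
filter1 <- apply(yy_toy, 2, function(x){sum(x > 0) > 0.4 * length(x)})
filter2 <- apply(yy_toy, 2, function(x){quantile(x, 0.9) > 1})

kept <- colnames(yy_toy)[filter1 & filter2]
kept  # -> "taxonB"
```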
  • Step 2: Run the glmm.zinb() function to implement FZINBMM.

Type the following commands to install the NBZIMM package.

> install.packages("remotes")
> remotes::install_github("nyiuab/NBZIMM")

As when running the lme() function in Chap. 14, we first create a list and matrices to hold the modeling results.

> yy <- yy[, colSums(is.na(yy)) == 0]
> # Create list() and matrix to hold the modeling results
> mod = list()
> rest1 = rest2 = rest3 = matrix(NA, ncol(yy), 1)

Then we call library(NBZIMM) and run the glmm.zinb() function iteratively to fit multiple taxa (outcome variables).

> library(NBZIMM)
> for (j in 1:ncol(yy)){
+ tryCatch({
+ y = as.numeric(yy[, j])
+ df = data.frame(y=y, Group=group, Time = time, N=N, Subject=subject)
+ mod = glmm.zinb(y ~ Group*Time + offset(log(N)),
+ random = ~ 1|Subject, data = df)
+ rest_tab = summary(mod)$tTable
+ rest1[j, ] = summary(mod)$tTable[2, 5]
+ rest2[j, ] = summary(mod)$tTable[3, 5]
+ rest3[j, ] = summary(mod)$tTable[4, 5]
+ }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
+ }
> rest_tab
Value Std.Error DF t-value p-value
(Intercept) -7.7431 0.1837 18 -42.1423 1.920e-19
Group 0.2473 0.2687 18 0.9204 3.695e-01
TimePost -0.2203 0.2418 18 -0.9111 3.743e-01
Group:TimePost -0.4221 0.3492 18 -1.2086 2.424e-01

As in the LMMs of Chap. 15, we include the two main factors, group and time, and their interaction term as fixed effects, and include only the intercept as the random term. Other models can also be fitted. For example, we can keep the same fixed effects (group, time, and their interaction) but fit both intercept and time as random effects, or fit a random intercept with a within-subject AR(1) correlation structure. We can also include other covariates in the above models to test hypotheses of interest.

For LMMs, we first apply an arcsine square-root transformation (asin(sqrt(x))) to the relative abundances of each taxon and then model the transformed values. Unlike LMMs, here we model the read counts of each taxon directly and use the log of total read counts as an offset.
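The contrast between the two routes can be shown with a few made-up counts: the LMM route transforms the proportions, while the FZINBMM route leaves the counts untouched and carries the sequencing depth in the offset.

```r
counts <- c(15, 0, 230)
total  <- 5000           # total reads for this sample (made-up)

# LMM route: arcsine square-root transform of the proportions
prop   <- counts / total
asinsq <- asin(sqrt(prop))

# FZINBMM route: model `counts` directly, supplying log(total)
# as the offset term, e.g., y ~ Group*Time + offset(log(N))
log_offset <- log(total)
```

Note that asin(sqrt(x)) is only defined for x in [0, 1], which is why it is applied to proportions rather than raw counts.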

We can obtain information on the random and fixed effects and the zero model for this fitted ZINBMM using the summary() function, as with models fitted by the nlme package.

> summary(mod)
> fixed(mod)
> random.effects(mod)
> summary(mod$fit.zero)
  • Step 3: Adjust p-values.

> all_zinb <- as.data.frame(cbind(rest1, rest2,rest3))
> # Adjust p-values
> Group_zinb_adj <-round(p.adjust(rest1,'fdr'),4)
> TimePost_zinb_adj <-round(p.adjust(rest2,'fdr'),4)
> Group.TimePost_zinb_adj <-round(p.adjust(rest3,'fdr'),4)
> options(width = 110)
> all_zinb_adj <- as.data.frame(cbind(taxa_all,Group_zinb_adj,TimePost_zinb_adj,Group.TimePost_zinb_adj))
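The FDR adjustment in Step 3 is the Benjamini-Hochberg procedure: each ordered p-value is scaled by n/rank and monotonicity is enforced. A small worked example with made-up p-values:

```r
# Benjamini-Hochberg (FDR) adjustment on a toy p-value vector
p_raw <- c(0.001, 0.012, 0.030, 0.044, 0.800)
p_adj <- p.adjust(p_raw, method = "fdr")
# sorted p * n/rank: 0.005, 0.030, 0.050, 0.055, 0.800 (already monotone)
p_adj
```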
  • Step 4: Write table to contain the results.

Finally we can write out the modeling results into Table 17.7.
> # Write results table
> # Make the table
> library(xtable)
> table <- xtable(all_zinb_adj,caption = "Table of significant taxa",label="sig_taxa_table")
> print.xtable(table,type="html",file = "FZINBMM_Table_Model_QtRNA.html")
> write.csv(all_zinb_adj,file = paste("FZINBMM_Table_Model_QtRNA.csv",sep = ""))
Table 17.7 Modeling results from fast zero-inflated negative binomial mixed model in QtRNA data

     taxa_all                                      Group_zinb_adj  TimePost_zinb_adj  Group.TimePost_zinb_adj
 1   D_6__Anaerotruncus sp. G3(2012)               NA              NA                 NA
 2   D_6__Clostridiales bacterium CIEAF 012        0.8725          0.3586             0.4475
 3   D_6__Clostridiales bacterium CIEAF 015        0.5081          0.0035             0.0263
 4   D_6__Clostridiales bacterium CIEAF 020        NA              NA                 NA
 5   D_6__Clostridiales bacterium VE202–09         0.2278          0.0242             0.1321
 6   D_6__Clostridium sp. ASF502                   NA              NA                 NA
 7   D_6__Enterorhabdus muris                      0.4301          0.0111             0.2015
 8   D_6__Eubacterium sp. 14–2                     0.0048          0                  0.0263
 9   D_6__gut metagenome                           0.8539          0.4978             0.3914
10   D_6__Lachnospiraceae bacterium A4             NA              NA                 NA
11   D_6__Lachnospiraceae bacterium COE1           0.8725          0.236              0.0263
12   D_6__mouse gut metagenome                     NA              NA                 NA
13   D_6__Mucispirillum schaedleri ASF457          0.0034          0.1926             0.0011
14   D_6__Parabacteroides goldsteinii CL02T12C30   NA              NA                 NA
15   D_6__Ruminiclostridium sp. KB18               0.0023          0                  0
16   D_6__Staphylococcus sp. UAsDu23               0.4301          0.009              0.4475
17   D_6__Streptococcus danieliae                  0.5061          0.4117             0.3334

17.4.3 Remarks on FZINBMM

It was demonstrated (Zhang and Yi 2020a) that, like the two ZINBMMs that use numerical integration algorithms, glmmTMB (Brooks et al. 2017) and GLMMadaptive (Rizopoulos 2019), FZINBMM controlled false-positive rates well, and all three methods had comparable empirical power; however, FZINBMM outperformed glmmTMB and GLMMadaptive in computational efficiency.

Zhang and Yi (2020a, b) demonstrated that:
  • FZINBMM outperformed linear mixed models (LMMs), negative binomial mixed models (NBMMs) (Zhang et al. 2017, 2018), and zero-inflated Gaussian mixed models (ZIGMMs).

  • FZINBMM detected a higher proportion of associated taxa than LMMs, NBMMs, and ZIGMMs.

  • Generally, transforming the count data could decrease the power to detect significant taxa. Thus, the methods that directly analyzed the counts (i.e., NBMMs and FZINBMM) performed better than those that analyzed the transformed data (i.e., LMMs and ZIGMMs).

  • The models that can address the zero-inflation problem could have higher power to detect significant microbiome taxa and the dynamic associations between the outcome and the microbiome composition. For example, they found that ZIGMMs and FZINBMM both worked better than LMMs and NBMMs in the cited study.

  • FZINBMM performed similarly to ZIGMMs and NBMMs when the microbiome data were not highly sparse, whereas it performed better when the data were sparse.

However, FZINBMM also has several limitations, such as:
  • As a univariate method, FZINBMM analyzes one taxon at a time and then adjusts for multiple comparisons.

  • The assumption that the subject-specific effects (random effects) follow a multivariate normal distribution is difficult to validate.

  • FZINBMM also shares most other limitations of ZIBR (zero-inflated beta random effects model) as described in Xia (2020).

17.5 Remarks on Fitting GLMMs

In modeling cross-sectional microbiome count data with GLMMs, zero-inflated GLMMs, and hurdle models, we have concluded that, in general, the zero-inflated negative binomial and two-part hurdle negative binomial models are the two best models for analyzing zero-inflated and over-dispersed microbiome data (Xia et al. 2018b). For longitudinal zero-inflated and over-dispersed microbiome count data, although the findings are more complicated because more complicated models must be estimated, the overall conclusion remains the same: zero-inflated and zero-hurdle negative binomial models are the best among the competing models. The Conway-Maxwell-Poisson model is also a good alternative for this kind of microbiome data; however, it is time-consuming to run when the specified model does not match the data. In specific cases, the zero-hurdle Poisson model is also a choice; however, it cannot accommodate over-dispersion if the non-zero part of the data is over-dispersed. Therefore, we recommend using zero-inflated and zero-hurdle negative binomial models for analyzing microbiome data.

We noticed that zero-hurdle Poisson and negative binomial models are not stable across different software and estimation algorithms for longitudinal microbiome data and sometimes fail to detect significant taxa compared to zero-inflated negative binomial (ZINB) models. Thus, we prefer to recommend ZINB for longitudinal microbiome data.

We also recommend performing model comparisons to choose an appropriate model before applying it to microbiome data.
  • It is not necessary to specify a maximally complex model with many parameters: complicated models not only might fail to converge but also might not improve much over simpler models.

  • Model selection is very important. The choice among models should be based on model-fitting criteria rather than on whether more significant taxa are identified (Xia 2020). We emphasize again that it is misleading to claim one model outperforms others because it identifies more significant taxa; the additional significant taxa could be due to a higher false-positive rate of that model (Xia et al. 2012).

  • We diagnose the model fits from glmmTMB and GLMMadaptive through the DHARMa package, which takes the approach of simulating, scaling, and plotting residuals. Readers can use this approach as a reference, but model evaluation should not be fully determined by its results because the simulation approach itself still needs further validation.

17.6 Summary

This chapter introduced GLMMs for the analysis of longitudinal microbiome data. We organized the chapter into six sections. Section 17.1 provided an overview of GLMMs in microbiome research. We compared and discussed two approaches to the analysis of microbiome data (data transformation versus using GLMMs directly), and particularly model selection, as well as the challenges of statistical hypothesis testing in microbiome research. The next three sections (Sects. 17.2, 17.3, and 17.4) investigated three GLMMs or packages and illustrated them with microbiome data. The first two packages (glmmTMB in Sect. 17.2 and GLMMadaptive in Sect. 17.3) were developed based on numerical integration algorithms, while the third (FZINBMM in the NBZIMM package, Sect. 17.4) was developed based on the EM-IWLS algorithm. For each model or package, we commented on its advantages and disadvantages and illustrated its model fits. The modeling results from glmmTMB and GLMMadaptive were evaluated by the information criteria AIC and BIC and/or the LRT, as well as diagnosed by the DHARMa package. In Sect. 17.5, we provided some general remarks on fitting GLMMs, including the zero-inflation and over-dispersion issues and count-based versus transformation-based approaches.

LMMs in Chap. 15, glmmTMB, GLMMadaptive, and FZINBMM in this chapter take the univariate approach to analyze longitudinal microbiome data. In Chap. 18, the last chapter of this book, we will introduce a multivariate longitudinal microbiome data analysis: non-parametric microbial interdependence test (NMIT).