Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_3

3. Basic Data Processing in QIIME 2

Yinglin Xia¹ and Jun Sun¹

(1) Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

This chapter presents some basic data processing in QIIME 2. First it introduces importing and exporting data. Then it introduces extracting data from QIIME 2 archives. Next, it describes how to filter data, review data in QIIME 2, as well as how to communicate between QIIME 2 and R.

Keywords

FASTA; FASTQ; Feature table; Phylogenetic trees; Filter; QIIME 2 View; qiime2R package; .qza file; Demultiplexed; .tsv file; Newick tree format

In the last two chapters, we provided an overview of QIIME 2 and R for microbiome data analysis. Starting with this chapter and until Chap. 6, we will focus on bioinformatic analysis of microbiome data using QIIME 2. In this chapter, we introduce some basic data processing in QIIME 2. We first introduce importing and exporting data in Sects. 3.1 and 3.2, respectively. We then introduce how to extract data from QIIME 2 archives (Sect. 3.3). Next, we describe how to filter data in QIIME 2 (Sect. 3.4). In Sect. 3.5, we introduce reviewing data in QIIME 2. Section 3.6 focuses on communicating between QIIME 2 and R. We complete this chapter with a brief summary (Sect. 3.7).

3.1 Importing Data into QIIME 2

QIIME 2 stores input data in artifacts (i.e., .qza files). Thus, before a QIIME 2 action can be used, all data except some metadata must be imported as a QIIME 2 artifact.

QIIME 2 uses the command qiime tools import to import data. QIIME 2 has dozens of format types, and different format types require different import settings. You can use qiime tools import --show-importable-formats to list all available import formats and qiime tools import --show-importable-types to list all available import types.

Currently, the QIIME 2 command-line interface (q2cli), QIIME 2 Studio (q2studio), or the Artifact API can be used to import input data. Depending on the task you want to implement, importing can be performed at any step in your analysis, although it typically starts with your raw sequence data (e.g., FASTA or FASTQ). For “downstream” statistical analyses, importing typically starts with a feature table in either .biom or .csv format.

QIIME 2 supports importing many types of input data. Type the following command in the terminal to check which formats of input data are importable:

source activate qiime2-2022.2

qiime tools import \
  --show-importable-formats

Type the following command to check which QIIME 2 types you can use to import these formats:

qiime tools import \
  --show-importable-types

QIIME 2 currently provides no detailed documentation on which data formats each QIIME 2 data type requires, although this information is indicated by the names of the formats and types. The most commonly used data formats are FASTA (sequences without quality information), FASTQ (sequences with quality information), feature tables, and phylogenetic trees.

3.1.1 Import FASTA Format Data

FASTA and FASTQ are the two basic and ubiquitous text-based formats for storing nucleotide and protein sequences. Common FASTA/Q file manipulations or processing include converting, searching, filtering, deduplication, splitting, shuffling, and sampling (Shen et al. 2016).

The FASTA sequence file format, or FASTA format for short, was originally introduced by William Pearson in the FASTA software package for DNA and protein sequence alignment (Lipman and Pearson 1985; Pearson and Lipman 1988). Today the FASTA format has become a nearly universal standard in bioinformatics. It represents either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented by single-letter codes.

There is no standard filename extension for FASTA files, although each commonly used extension has its own meaning (Wikipedia 2021). For example, fasta or fa denotes a generic FASTA file; fna (FASTA nucleic acid) is used generically for nucleic acids; ffn (FASTA nucleotide of gene regions) contains the coding regions of a genome; faa (FASTA amino acid) contains amino acid sequences; and frn (FASTA non-coding RNA) contains the non-coding RNA regions of a genome (e.g., tRNA and rRNA) in the DNA alphabet. One typical FASTA format used in QIIME 1 and currently supported by QIIME 2 is the post-split libraries FASTA file format. We cite an example of this format as follows (QIIME 2022):

>PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0

CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTT

In this format, each record consists of exactly two lines: a header (label or description line) and a sequence line. The header line is distinguished by a greater-than (“>”) symbol in the first column.

The label line has five space-separated fields. From left to right, they are (1) the ID in the format <sample-id>_<seq-id> (e.g., PC.634_1), where <sample-id> identifies the sample the sequence belongs to and <seq-id> identifies the sequence within its sample; (2) the unique sequence id (e.g., FLP3FBN01ELBSX); (3) the original barcode (e.g., orig_bc=ACAGAGTCGGCT); (4) the new barcode after error correction (e.g., new_bc=ACAGAGTCGGCT); and (5) the number of positions that differ between the original and new barcodes (e.g., bc_diffs=0). In the sequence line, the letters A (adenine), C (cytosine), G (guanine), and T (thymine) represent the four nucleobases of DNA.

Each sequence must span exactly one line and cannot be split across multiple lines. The ID in each header must follow the <sample-id>_<seq-id> format. The sequences in this data format carry no quality information.
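As a minimal sketch of how the five header fields can be pulled apart in a POSIX shell, the snippet below splits the example header cited above. The variable names (header, id, sample_id, seq_id, bc_diffs) are ours and purely illustrative; the only assumption is that the header follows the post-split libraries layout just described.

```shell
# Example header line copied from the post-split libraries example.
header='>PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0'

# Field 1 without the leading ">" is the ID, <sample-id>_<seq-id>.
id=$(printf '%s\n' "$header" | awk '{ sub(/^>/, "", $1); print $1 }')

# The seq-id follows the last underscore; everything before it is the sample-id.
sample_id=${id%_*}
seq_id=${id##*_}

# Field 5 has the form bc_diffs=<n>; keep the part after "=".
bc_diffs=$(printf '%s\n' "$header" | awk '{ split($5, kv, "="); print kv[2] }')

printf 'sample=%s seq=%s bc_diffs=%s\n' "$sample_id" "$seq_id" "$bc_diffs"
```

For the example header this prints sample=PC.634 seq=1 bc_diffs=0. Splitting at the last underscore keeps sample IDs that themselves contain underscores intact.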

Feature sequence data in FASTA format, whether DNA, RNA, or protein sequences, can be aligned or unaligned. The purpose of aligning sequences is to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. To line up the columns, gaps (typically shown as a dash “-”) are inserted between the residues so that identical or similar characters are aligned in successive columns (Edgar 2004). Thus, all aligned sequences have exactly the same length.
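The equal-length property of aligned sequences can be checked before importing. The following is a sketch only: check_aligned is a hypothetical helper of ours, not a QIIME 2 command, and it assumes a plain-text FASTA file in which a sequence may be wrapped over several lines.

```shell
# check_aligned FILE: succeed (exit status 0) only if every sequence in a
# FASTA file has the same length, counting gap characters ("-").
# Hypothetical helper for illustration; it is not part of QIIME 2.
check_aligned() {
  awk '
    /^>/ { if (seq != "") seen[length(seq)] = 1; seq = ""; next }
         { seq = seq $0 }            # join sequences wrapped over several lines
    END  {
      if (seq != "") seen[length(seq)] = 1
      n = 0
      for (len in seen) n++          # count the distinct sequence lengths
      exit (n == 1 ? 0 : 1)
    }
  ' "$1"
}
```

A call such as check_aligned AlignedSequencesQiime2.fna && echo "looks aligned" would then report whether a file is a plausible candidate for FeatureData[AlignedSequence].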

When importing FASTA format files, QIIME 2 specifies type as “'FeatureData[Sequence]'” for unaligned sequences and type as “'FeatureData[AlignedSequence]'” for aligned sequences. Here, we show how to import unaligned and aligned sequences into QIIME 2, respectively.

Example 3.1: VDR Fasta Data File

The SequencesVDR FASTA data file is from a study of the vitamin D receptor (VDR) and the murine intestinal microbiome (Jin et al. 2015). This study investigates whether VDR status regulates the composition and functions of the intestinal bacterial community. Here, we use the “SequencesVDR.fna” file to illustrate importing FASTA format data.

We take three steps to import this FASTA data into QIIME 2.

Step 1: Create a directory to store the FASTA file.
First, we need to create a directory to store the sequence data file (here, QIIME2R-Bioinformatics/Ch3). We can create the folder directly on the computer or via the Mac terminal: mkdir QIIME2R-Bioinformatics/Ch3. Then, in the terminal, type source activate qiime2-2022.2 (depending on your QIIME 2 version) to activate the QIIME 2 environment, and type cd QIIME2R-Bioinformatics/Ch3 to run the QIIME 2 commands from this folder.
Step 2: Store the FASTA file in the created directory.
We save the data file “SequencesVDR.fna” in the directory “QIIME2R-Bioinformatics/Ch3.”
Step 3: Import the data into a QIIME 2 artifact (i.e., .qza file) using the “qiime tools import” command.
As we described in Chap. 1, all input data to QIIME 2 is in the form of QIIME 2 artifacts, which contain information about the type and source of the data. Thus, we first need to import the sequence data file into a QIIME 2 artifact. For unaligned sequences, the semantic type of the QIIME 2 artifact is FeatureData[Sequence]. We name the output file “SequencesVDR.qza” in “output-path.” The following command can be used to import unaligned sequences into QIIME 2:

qiime tools import \
  --input-path SequencesVDR.fna \
  --output-path SequencesVDR.qza \
  --type 'FeatureData[Sequence]'

In the above command, “qiime tools import” defines the action, “input-path” specifies the input data file path, “output-path” specifies the output data file path, and “type” specifies the semantic type of the artifact. We can see that SequencesVDR.fna was imported to SequencesVDR.qza as DNASequencesDirectoryFormat.

Example 3.2: Aligned Fasta Data File

The following aligned sequences were downloaded from the QIIME 2 website. We extract two sequences from AlignedSequencesQiime2.fna (opened using the SeqKit software) to show what aligned sequences look like.

>New.CleanUp.ReferenceOTU998 M2.Index.L_12921

-CTGGGCCGTATCTCAGTC-CCAATGTGGCCGGTCGCCCT---------CTCAGGCCGGC

TACCCGTCAAGGCC-TTGGTGGG-CCACTA-CCC-C-ACCAACAAGCTGATAGGCCGCGA

-G-ACGATCC-CTGACCGCA------------AAAA-------G----------C-TTT-

-------CCAACAAC-CC-------GG--A---TG--CCCGG-G-AAA------------

---CTG-AATAT-T--CGG-GA-TTA---------------C--CAC-C-T---GTTTCC

--AAG---T--GCT--A--T-ACC-A--AAG-TCA-AG--GG-------CA-CG-TT-C-

--C--TCA-CG-TG-----------------TTACT-C---ACCCGTT-CGCCA-CT---

-----------------------------------------

>New.CleanUp.ReferenceOTU999 M2.Ring.R_1432

-CTGGGCCGTATCTCAGTC-CCAATGTGGCCGGTCACCCT---------CTCAGGCCGGC

TACCCGTCGCCGCC-TTGGTAGG-CCACTA-CCC-C-ACCAACAAGCTGATAGGCCGCGA

-G-TCCATCC-ACAACCGCC------------GGAG------------------C-TTT-

-------CCAACCCC-CA-------CC--A---TG--CAGCA-G-GAG------------

---CA--CATAT-C--CAG-TA-TTA---------------G--CAC-C-A---GTTTCC

--TAG---C--GTT--A--T-CCC-A--AAG-TTG-TG--GG-------CA-GG-TT-A-

--C--TCA-CG-TG-----------------TTACT-C---ACCCG--------------

-----------------------------------------

Now we use this fasta data file to illustrate importing the aligned sequences into QIIME 2. For aligned sequences, the semantic type of QIIME 2 artifact is FeatureData[AlignedSequence]. We can use the following commands.

qiime tools import \
  --input-path AlignedSequencesQiime2.fna \
  --output-path AlignedSequencesQiime2.qza \
  --type 'FeatureData[AlignedSequence]'

3.1.2 Import FASTQ Format Data

The FASTQ sequence file format, or FASTQ format for short, was originally developed at the Wellcome Trust Sanger Institute (Cock et al. 2010) as a simple extension of the FASTA format that stores each nucleotide in a sequence together with its corresponding quality score. In a FASTQ file, each sequence letter and each quality score is encoded with a single ASCII character. In the field of DNA sequencing, the FASTQ format has emerged as the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer and for exchanging data between tools (Cock et al. 2010).

A FASTQ file typically has four lines per sequence:

Line 1 is the title line: it begins with a “@” character, followed by a sequence identifier and an optional description. This is a free-format field with no length limit, so it can include arbitrary annotation or comments.
Line 2 is the sequence line(s): the raw sequence letters (as in the FASTA format).
Line 3 is the separator line: it signals the end of the sequence line(s) and the start of the quality string. It begins with a “+” character and may optionally repeat the sequence identifier (and any description).
Line 4 is the quality line(s): it encodes the quality values for the sequence in Line 2 and must contain the same number of symbols as there are letters in the sequence. The quality values use a subset of the ASCII printable characters (at most ASCII 33–126 inclusive) with a simple offset mapping; note that the “@” marker character (ASCII 64) may appear anywhere in the quality string.
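The four-line layout lends itself to a quick sanity check. The sketch below is ours, not a QIIME 2 tool: fastq_count is a hypothetical helper that assumes an uncompressed file in which every record occupies exactly four lines (no wrapped sequence or quality lines).

```shell
# fastq_count FILE: print the number of records in an uncompressed FASTQ file,
# or fail (exit status 1) if the four-line layout is violated.
# Hypothetical helper; assumes each record occupies exactly four lines.
fastq_count() {
  awk '
    NR % 4 == 1 && !/^@/           { bad = 1 }          # title line must start with "@"
    NR % 4 == 2                    { n = length($0) }   # remember the sequence length
    NR % 4 == 3 && !/^[+]/         { bad = 1 }          # separator line must start with "+"
    NR % 4 == 0 && length($0) != n { bad = 1 }          # quality length must match
    END {
      if (NR == 0 || NR % 4 != 0) bad = 1
      if (bad) exit 1
      print NR / 4
    }
  ' "$1"
}
```

Because only lines 1, 5, 9, and so on are tested for a leading “@”, a quality line that happens to start with “@” is not mistaken for a title line. A compressed file would first need to be decompressed, for example with gunzip -c.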

A FASTQ file containing a single sequence might look like this (only the title and sequence lines are shown):

@M00967:43:000000000-A3JHG:1:1101:18327:1699 1:N:0:188

NACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAG

The Earth Microbiome Project (EMP), founded in 2010, is a systematic effort to characterize global microbial taxonomic and functional diversity across planet Earth (Thompson et al. 2017; Gilbert et al. 2010, 2014). The “EMP protocol” has two fastq formats: multiplexed single-end and multiplexed paired-end. In QIIME 2 terminology, single-end reads are forward or reverse reads in isolation; paired-end reads are forward and reverse reads that have not yet been joined; and joined reads are forward and reverse reads that have already been joined (or merged).

“EMP Protocol” Multiplexed Single-End fastq

Single-end “Earth Microbiome Project (EMP) protocol” formatted reads comprise two fastq.gz files in total: one contains the single-end reads, and the other contains the associated barcode reads. The association between a sequence read and its barcode read is defined by the order of the records in the two files.

“Earth Microbiome Project (EMP) Protocol” Multiplexed Paired-End fastq

EMP paired-end formatted reads have three fastq.gz files total: one contains the forward sequence reads, another contains the reverse sequence reads, and a third contains the associated barcode reads.

The Illumina 1.8 FASTQ format was created and maintained by the Institute for Integrative Genome Biology UC Riverside. Each entry in a FASTQ file consists of four lines: Sequence identifier, Sequence, Quality score identifier line (consisting of a +), Quality score. An example of a valid entry is as follows:

@HWI-ST279:211:C0BFTACXX:3:1101:3469:2181 1:N:0:ACTTGA
GAACTATGCCTGATCAGGTTGAAGTCAGGGGAAACCCTGATGGAGGACCGA
+
CCCFFFFFHHHHHJJJJJIIIJJJHJJJJJJJIJJJJIIIJJJIJJJJJJJ

Casava 1.8 Single-End Demultiplexed fastq

This format has one fastq.gz file for each sample in the study, which contains the single-end reads for that sample. The file name includes the sample identifier and looks like L2S357_15_L001_R1_001.fastq.gz. The underscore-separated fields in this file name are, in order, the sample identifier, the barcode sequence or a barcode identifier, the lane number, the direction of the read (i.e., only R1, because these are single-end reads), and the set number.

Casava 1.8 Paired-End Demultiplexed fastq

This fastq format has two fastq.gz files for each sample in the study, each containing the forward or reverse reads for that sample. The file name includes the sample identifier. The forward and reverse read file names for a single sample might look like:

L2S357_15_L001_R1_001.fastq.gz and L2S357_15_L001_R2_001.fastq.gz, respectively.

The underscore-separated fields in this file name are the sample identifier, the barcode sequence or a barcode identifier, the lane number, the direction of the read (i.e., R1 or R2), and the set number.
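To make the naming convention concrete, the following sketch splits the example forward-read file name into its five fields. The variable names are ours; the approach assumes that the sample identifier itself contains no underscore, otherwise the fields would shift.

```shell
# Example forward-read file name in the Casava 1.8 naming scheme.
fname='L2S357_15_L001_R1_001.fastq.gz'

# Drop the extension, then split the remainder on underscores.
base=${fname%.fastq.gz}
IFS=_ read -r sample barcode lane direction set_number <<EOF
$base
EOF

printf 'sample=%s barcode=%s lane=%s direction=%s set=%s\n' \
  "$sample" "$barcode" "$lane" "$direction" "$set_number"
```

For this file name the snippet prints sample=L2S357 barcode=15 lane=L001 direction=R1 set=001.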

If the data are in neither EMP nor Casava format, they need to be imported into QIIME 2 manually. First you need to create a “manifest” text file and then use the qiime tools import command, whose specifications differ from those used for EMP or Casava data. The manifest file is a tab-separated (i.e., .tsv) text file: the first column defines the sample ID, while the second (and optional third) column gives the absolute file path to the forward (and optional reverse) reads. There are four variants of manifest FASTQ formats in QIIME 2:

(1) SingleEndFastqManifestPhred33V2;
(2) SingleEndFastqManifestPhred64V2;
(3) PairedEndFastqManifestPhred33V2; and
(4) PairedEndFastqManifestPhred64V2.
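A paired-end manifest can be generated rather than typed by hand. The sketch below rests on two assumptions that are not stated in the text: the read files follow a hypothetical naming scheme <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and the column headers are sample-id, forward-absolute-filepath, and reverse-absolute-filepath, which is our understanding of the V2 manifest formats and should be verified against the documentation of your QIIME 2 version.

```shell
# make_manifest DIR: write a paired-end manifest to standard output.
# Hypothetical helper; assumes files are named <sample>_R1.fastq.gz and
# <sample>_R2.fastq.gz (an illustrative naming scheme, not a QIIME 2 rule).
make_manifest() {
  dir=$(cd "$1" && pwd) || return 1      # manifests need absolute paths
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  for fwd in "$dir"/*_R1.fastq.gz; do
    [ -e "$fwd" ] || continue            # skip the unexpanded pattern if nothing matches
    sample=$(basename "$fwd" _R1.fastq.gz)
    printf '%s\t%s\t%s\n' "$sample" "$fwd" "$dir/${sample}_R2.fastq.gz"
  done
}
```

Running make_manifest reads > manifest.tsv (where reads is a directory of your choosing) produces a file that can be passed as --input-path to qiime tools import together with --type 'SampleData[PairedEndSequencesWithQuality]' and a paired-end manifest input format such as PairedEndFastqManifestPhred33V2.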

In the format names, “Phred” refers to the PHRED software. This software reads DNA sequencing trace files, calls bases, and assigns a quality value to each base called (Ewing et al. 1998; Ewing and Green 1998); the PHRED quality score of a base call is defined in terms of the estimated probability of error. To hold these quality scores, PHRED introduced a new file format called the QUAL format. This is a FASTA-like format that holds PHRED scores as space-separated plain-text integers and supplements a corresponding FASTA file containing the associated sequences (Cock et al. 2010).

Phred33 means PHRED scores with an ASCII offset of 33, which is associated with the Sanger FASTQ format. To keep files easily readable and editable by humans, Sanger restricted the encoding to the ASCII printable characters 32–126 (decimal). Since ASCII 32 is the space character, Sanger FASTQ files use ASCII 33–126 to encode PHRED qualities from 0 to 93, which gives a PHRED ASCII offset of 33.

Phred64 means PHRED scores with an ASCII offset of 64, which is associated with Illumina 1.3+ FASTQ format. The Illumina FASTQ format encodes PHRED scores with an ASCII offset of 64, which can hold PHRED scores from 0 to 62 (ASCII 64–126) (Cock et al. 2010).

The encoded quality scores of PHRED 64 are different from PHRED 33; however, the encoded quality scores of PHRED 64 will be converted to those of PHRED 33 during importing.
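The offset mapping amounts to subtracting 33 or 64 from the ASCII code of each quality character. As a small illustration, the hypothetical helper phred_score below (our name, not a QIIME 2 or PHRED function) decodes a single character in a POSIX shell.

```shell
# phred_score CHAR OFFSET: decode one FASTQ quality character.
# Hypothetical helper; OFFSET is 33 (Phred33) or 64 (Phred64).
phred_score() {
  ascii=$(printf '%d' "'$1")   # a leading quote makes printf return the character code
  echo $(( ascii - $2 ))
}

phred_score C 33    # "C" is ASCII 67
phred_score h 64    # "h" is ASCII 104
```

The two calls print 34 and 40, respectively. The same arithmetic gives the ranges quoted above: “~” (ASCII 126) decodes to 93 with an offset of 33 and to 62 with an offset of 64.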

Different types of FASTQ data need different functions to import. Table 3.1 summarizes FASTQ data formats and the importing functions in QIIME 2.

Table 3.1

FASTQ data formats and the importing functions

Data format	Arguments to “qiime tools import”
“EMP protocol” multiplexed single-end fastq	--type EMPSingleEndSequences
“EMP protocol” multiplexed paired-end fastq	--type EMPPairedEndSequences
Casava 1.8 single-end demultiplexed fastq	--type 'SampleData[SequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt
Casava 1.8 paired-end demultiplexed fastq	--type 'SampleData[PairedEndSequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt
SingleEndFastqManifestPhred33V2	--type 'SampleData[SequencesWithQuality]' --input-format SingleEndFastqManifestPhred33V2
PairedEndFastqManifestPhred64V2	--type 'SampleData[PairedEndSequencesWithQuality]' --input-format PairedEndFastqManifestPhred64V2

Example 3.3: “EMP Protocol” Multiplexed Single-End fastq Sequences Data File

We downloaded the “Moving Pictures” example data from the QIIME 2 website, including the single-end reads (“sequences.fastq.gz”) and the associated barcode reads (“barcodes.fastq.gz”), to illustrate this importation.

We take the following three steps to import FASTQ data into QIIME 2.

Step 1: Create a directory to store the two fastq.gz files.
Here, we create a directory called “QIIME2RCh3EMPSingleEndSequences” by typing the following command in a terminal: mkdir QIIME2RCh3EMPSingleEndSequences. The name indicates that the data for Ch3 are “EMP protocol” multiplexed single-end fastq, but you can choose any name for the directory.
Step 2: Store the two fastq.gz files in the created directory.
We save the two data files “sequences.fastq.gz” and “barcodes.fastq.gz” in the directory “QIIME2RCh3EMPSingleEndSequences.”
Step 3: Import the data into a QIIME 2 artifact (i.e., .qza file) using the “qiime tools import” command.
For “EMP protocol” multiplexed single-end fastq, the semantic type of the QIIME 2 artifact is EMPSingleEndSequences. This type contains sequences that are multiplexed, meaning that the sequences have not yet been assigned to samples. Hence we need to include both the sequences.fastq.gz file and the barcodes.fastq.gz file, which contains the barcode read associated with each sequence in sequences.fastq.gz.
With both files stored in the directory “QIIME2RCh3EMPSingleEndSequences,” you can now import these data into a QIIME 2 artifact. In the terminal, first type source activate qiime2-2022.2 to activate QIIME 2, and then type the following command.

qiime tools import \
  --type EMPSingleEndSequences \
  --input-path QIIME2RCh3EMPSingleEndSequences \
  --output-path QIIME2RCh3EMPSingleEndSequences.qza

In the above command, “qiime tools import” defines the action, “type” specifies the data type (in this case, “EMPSingleEndSequences”), “input-path” specifies the input data path, and “output-path” specifies the output data file path. We can see that the data in “QIIME2RCh3EMPSingleEndSequences.qza” are stored in a QIIME 2 artifact with the format “EMPSingleEndDirFmt”.

Similarly, you can import “EMP protocol” multiplexed paired-end fastq, Casava 1.8 single-end demultiplexed fastq, and Casava 1.8 paired-end demultiplexed fastq files.

3.1.3 Import Feature Table

In Chap. 2 (Sect. 2.5), we briefly introduced the BIOM (Biological Observation Matrix) format, which is designed to be a general-use format for representing biological sample by observation contingency tables (McDonald et al. 2012) and is a recognized standard for the Earth Microbiome Project and a Genomics Standards Consortium candidate project.

Currently the BIOM file format has three versions: 1.0.0, 2.0.0, and 2.1.0. Here, we briefly introduce the format specifications for versions 1.0.0 and 2.1.0 and show how to import pre-processed feature tables in BIOM format into QIIME 2. The BIOM v1.0.0 format uses JSON (JavaScript Object Notation) to provide the overall structure of the format (biom-format.org 2020a). The BIOM v2.1.0 format uses HDF5 to provide the overall structure of the format (biom-format.org 2020b).
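Because the two container types differ in their first bytes, a rough guess at which one a .biom file uses can be made before choosing an input format. The sketch below is a heuristic of ours, not part of QIIME 2 or biom-format: the hypothetical helper biom_container assumes that an HDF5 file begins with the standard HDF5 signature at offset 0 (no user block) and that a JSON file begins with “{” without leading whitespace.

```shell
# biom_container FILE: print "hdf5", "json", or "unknown" from the first bytes.
# Hypothetical helper; a heuristic only, not a validator of the BIOM format.
biom_container() {
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case $magic in
    89484446) echo hdf5 ;;      # HDF5 signature starts with 0x89 "H" "D" "F"
    7b*)      echo json ;;      # a JSON object starts with "{" (0x7b)
    *)        echo unknown ;;
  esac
}
```

A result of json suggests BIOMV100Format and a result of hdf5 suggests a version 2.x file such as BIOMV210Format; the helper cannot tell versions 2.0.0 and 2.1.0 apart.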

The BIOM format is widely used across omics fields. For example, in marker-gene surveys, OTU or ASV tables primarily use this format; in metagenomics, metagenome tables use it; and in genomics, a set of genomes can use it too. Currently many projects support the BIOM format, including QIIME 2, Mothur, phyloseq, MG-RAST, PICRUSt, MEGAN, VAMPS, metagenomeSeq, Phinch, RDP Classifier, USEARCH, PhyloToAST, EBI Metagenomics, GCModeller, and MetaPhlAn 2. The phyloseq package includes BIOM format examples covering the four main types of biom files. Its import_biom() function can be used to simultaneously import an associated phylogenetic tree file and reference sequence file (e.g., fasta).

Example 3.4: BIOM Sequences Data File with Version 1.0.0 BIOM Format

The Seq_tableQTRT1.biom file is a BIOM sequences data file in version 1.0.0 BIOM format. The data are from a study of tRNA queuosine (Q) modifications and the gut microbiome in breast cancer (Zhang et al. 2020). This study investigates how the enzyme queuine tRNA ribosyltransferase catalytic subunit 1 (QTRT1) affects tumorigenesis.

To import this file into QIIME 2, we first store it in the folder QIIME2R-Bioinformatics/Ch3. Then we type cd QIIME2R-Bioinformatics/Ch3 and the following commands in the terminal.

qiime tools import \
  --input-path Seq_tableQTRT1.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path Seq_tableQTRT1.qza

Example 3.5: BIOM Sequences Data File with Version 2.1.0 BIOM Format

The data “feature-table-v210.biom” was downloaded from the QIIME 2 website and renamed as “FeatureTablev210.biom,” which was stored in the folder QIIME2R-Bioinformatics/Ch3. We type cd QIIME2R-Bioinformatics/Ch3 and the following commands in the terminal to import it into QIIME 2.

qiime tools import \
  --input-path FeatureTablev210.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path FeatureTablev2.qza

3.1.4 Import Phylogenetic Trees

The Newick (parenthetic) tree format was introduced in the package castor in Sect. 2.4.3 of Chap. 2.

The Newick (parenthetic) tree format standard was adopted on June 26, 1986, by an informal committee consisting of James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, which first met in Durham, New Hampshire. The name comes from the committee's second meeting in 1986, which was held at Newick’s restaurant in Dover, New Hampshire, USA. The adopted format represents a generalization of the format developed by Christopher Meacham in 1984 for the first tree-drawing programs in Felsenstein’s PHYLogeny Inference Package (PHYLIP) (Felsenstein 1981, 2021).

The Newick format defines a tree by creating a minimal representation of nodes and their relationships to each other, which stores spanning-trees with weighted edges and node names in a minimal file format. Gary Olsen in 1990 provided an interpretation of the “Newick’s 8:45” tree format standard (Olsen 1990). Newick formatted files are useful for representing phylogenetic trees and taxonomies.
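To give a feel for the notation, the string ((A:0.1,B:0.2):0.3,C:0.4); is a made-up Newick tree with three leaves (A, B, C) and branch lengths after the colons. The sketch below, built around a hypothetical helper of ours called newick_leaves, performs a basic sanity check on such a string; it is not a Newick parser and assumes that labels contain no quoted commas or parentheses.

```shell
# newick_leaves TREE: print the number of leaves in a Newick string, or fail
# if the string lacks the closing ";" or has unbalanced parentheses.
# Hypothetical helper; assumes labels contain no quoted commas or parentheses.
newick_leaves() {
  case $1 in *';') ;; *) return 1 ;; esac
  open=$(( $(printf '%s' "$1" | tr -cd '(' | wc -c) ))
  close=$(( $(printf '%s' "$1" | tr -cd ')' | wc -c) ))
  commas=$(( $(printf '%s' "$1" | tr -cd ',' | wc -c) ))
  [ "$open" -eq "$close" ] || return 1
  echo $(( commas + 1 ))      # each comma separates two sibling subtrees
}

newick_leaves '((A:0.1,B:0.2):0.3,C:0.4);'
```

The call prints 3. A check of this kind can catch a truncated tree file before it is passed to qiime tools import.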

A phylogenetic tree (a.k.a. phylogeny or evolutionary tree) is a branching diagram that represents evolutionary relationships among various biological species or other organisms based on similarities and differences in their physical or genetic characteristics (Felsenstein 2004). Phylogenetic trees may be rooted or unrooted. In a rooted phylogenetic tree, each node with descendants (called a taxonomic unit) represents the inferred most recent common ancestor of those descendants, and in some trees the edge lengths may be interpreted as time estimates. Unrooted trees, in contrast, illustrate only the relatedness of the leaf nodes and do not require the ancestral root to be known or inferred (NIH 2002).

Example 3.6: Unrooted and Rooted Phylogenetic Trees, Example 2.7, Cont.

In Chap. 2, Example 2.7, we generated two tree files from the Dietswap study via the ape package: Unrooted_tree_dietswap.tre and Rooted_tree_dietswap.tre. Here, we rename them UnrootedTreeDietswap.tre and RootedTreeDietswap.tre, respectively, and use them to illustrate importing phylogenetic trees into QIIME 2. The following commands can be used to import the unrooted tree.

source activate qiime2-2022.2

cd QIIME2R-Bioinformatics/Ch3

qiime tools import \
  --input-path UnrootedTreeDietswap.tre \
  --output-path UnrootedTreeDietswap.qza \
  --type 'Phylogeny[Unrooted]'

If you have a rooted tree, use --type 'Phylogeny[Rooted]' instead. The following command imports the rooted tree.

qiime tools import \
  --input-path RootedTreeDietswap.tre \
  --output-path RootedTreeDietswap.qza \
  --type 'Phylogeny[Rooted]'

3.2 Exporting Data from QIIME 2

With QIIME 2 installed, you can export data from a QIIME 2 artifact in order to analyze the data statistically in R or in other microbiome analysis software. This is done with the qiime tools export command. Below we illustrate how to export a feature table and phylogenetic trees.

3.2.1 Export Feature Table

The qiime tools export command takes a QIIME 2 artifact (.qza) file and an output directory as input. The data in the artifact will be exported to one or more files depending on the specific artifact. A FeatureTable[Frequency] artifact will be exported as a BIOM v2.1.0 formatted file.

Example 3.7: Exporting Feature Table, Example 3.5, Cont.

In Example 3.5, we imported FeatureTablev210.biom as BIOMV210Format into FeatureTablev2.qza. Now we use the following command to export this feature table artifact to the ExportedFeatureTable directory.

qiime tools export \
  --input-path FeatureTablev2.qza \
  --output-path ExportedFeatureTable

3.2.2 Export Phylogenetic Trees

Example 3.8: Exporting Phylogenetic Tree, Example 3.6, Cont.

Both UnrootedTreeDietswap.qza and RootedTreeDietswap.qza, generated in Example 3.6, are stored in the directory QIIME2R-Bioinformatics/Ch3. We can export the unrooted tree data into the directory “ExportedTreeUnrooted” via the following command.

qiime tools export \
  --input-path UnrootedTreeDietswap.qza \
  --output-path ExportedTreeUnrooted

We can export the rooted tree data into the directory folder “ExportedTreeRooted” via the following command.

qiime tools export \
  --input-path RootedTreeDietswap.qza \
  --output-path ExportedTreeRooted

3.3 Extracting Data from QIIME 2 Archives

In Chap. 1, we introduced QIIME 2 .qza and .qzv files as zip file archives (containers) with a defined internal directory structure. The data files stored in these archives can be either exported or extracted; however, do not confuse “extract” with “export.” In QIIME 2, they are two different operations. Exporting an artifact places only the data files in the output directory, whereas extracting places the data files together with QIIME 2’s metadata about the artifact, including its provenance, in plain-text formats in the output directory. The output directory must already exist; if it does not, it must be created before extracting.

There are two ways to extract the data from the archives: one is to use the qiime tools extract command if QIIME 2 and the q2cli command-line interface are installed; the other is to use standard decompression utilities such as unzip, WinZip, or 7zip when QIIME 2 is not installed. We illustrate these two ways in turn below.

3.3.1 Extract Data Using the Qiime Tools Extract Command

To extract a QIIME 2 artifact using the qiime tools extract command, we first create an output directory such as “ExtractedFeatureTable,” and then call qiime tools extract, specifying the artifact file name as input-path (in this case, “FeatureTableMiSeq_SOP.qza”) and the newly created directory “ExtractedFeatureTable” as output-path.

Example 3.9: FeatureTableMiSeq_SOP

The original sequencing data were downloaded from the published paper by Schloss et al. (2012) entitled “Stabilization of the murine gut microbiome following weaning.” We generated the feature table using the bioinformatic tool DADA2 through QIIME 2 in Chap. 4. The objective of this study was to investigate the development and stabilization of the murine microbiome. The 360 fecal samples were collected from 12 mice (6 female and 6 male) longitudinally over the first year of life at 35 time points. Two mock community samples were added to the analysis for estimating the error rate. The raw sequence data are demultiplexed paired-end 16S rRNA gene reads from the V4 region of the 16S gene, generated as highly overlapping reads on Illumina’s MiSeq 2x250 amplicon sequencing platform. The mouse gut dataset has been successfully used for testing new protocols and workflows for microbiome data analysis and new tools for integrative microbiome analysis (Buza et al. 2019; Westcott and Schloss 2015; Callahan et al. 2016). We use this dataset here and in other chapters of this book to illustrate bioinformatic analysis using QIIME 2 and statistical analysis using R. For more information on the MiSeq_SOP dataset, see Example 4.1 (MiSeq_SOP: One sample demultiplexed paired-end FASTQ data).

We first make the directory “ExtractedFeatureTable” with the command mkdir ExtractedFeatureTable. We then use the command qiime tools extract to extract the data file “FeatureTableMiSeq_SOP.qza” to the created directory “ExtractedFeatureTable.” The output directory contains a new directory whose name is the artifact’s UUID (in this case, 46eef13e-a20c-43f2-a7cf-944d36a8ebac). You can check that all artifact data and metadata are stored in this directory.
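The two steps just described can be sketched as follows. The file and directory names are those used in this example; the guard around qiime tools extract is our addition so that the snippet does nothing when QIIME 2 or the artifact is not available, and mkdir -p is used so that an existing directory does not cause an error.

```shell
# Create the output directory (no error if it already exists).
mkdir -p ExtractedFeatureTable

# Extract the artifact only when QIIME 2 and the input file are available.
if command -v qiime >/dev/null 2>&1 && [ -f FeatureTableMiSeq_SOP.qza ]; then
  qiime tools extract \
    --input-path FeatureTableMiSeq_SOP.qza \
    --output-path ExtractedFeatureTable
fi
```

After a successful run, listing ExtractedFeatureTable should show a single subdirectory named after the artifact’s UUID.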

3.3.2 Extract Data Using Unzip Program on macOS

The “FeatureTableMiSeq_SOP.qza” artifact can also be extracted with the unzip program, as shown below:

unzip FeatureTableMiSeq_SOP.qza

Archive: FeatureTableMiSeq_SOP.qza

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/metadata.yaml

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/checksums.md5

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/VERSION

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/metadata.yaml

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/citations.bib

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/VERSION

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/artifacts/18ca53e7-d11f-4b48-9a33-72562f66084c/metadata.yaml

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/artifacts/18ca53e7-d11f-4b48-9a33-72562f66084c/citations.bib

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/artifacts/18ca53e7-d11f-4b48-9a33-72562f66084c/VERSION

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/artifacts/18ca53e7-d11f-4b48-9a33-72562f66084c/action/action.yaml

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/provenance/action/action.yaml

inflating: 46eef13e-a20c-43f2-a7cf-944d36a8ebac/data/feature-table.biom

The above unzip action created a new directory whose name is the UUID of the artifact being unzipped: 46eef13e-a20c-43f2-a7cf-944d36a8ebac. The same can be done on Windows or Linux with any program that reads zip archives.
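Because the artifact is an ordinary zip archive, the same inspection can also be scripted. The following is only a minimal sketch in Python and is not part of QIIME 2: it builds a tiny stand-in archive in memory (the all-zero UUID and the file contents are placeholders), recovers the UUID from the single top-level directory, and lists what sits under data/. The helper names archive_uuid and data_files are hypothetical.

```python
import io
import zipfile


def archive_uuid(zf):
    """Return the single top-level directory name of a QIIME 2 style archive."""
    roots = {name.split("/", 1)[0] for name in zf.namelist()}
    if len(roots) != 1:
        raise ValueError("expected exactly one top-level directory")
    return roots.pop()


def data_files(zf):
    """List the archive members stored under <uuid>/data/."""
    prefix = archive_uuid(zf) + "/data/"
    return sorted(
        name[len(prefix):]
        for name in zf.namelist()
        if name.startswith(prefix) and name != prefix
    )


# Toy stand-in for a .qza file; the UUID and the contents are placeholders.
toy_uuid = "00000000-0000-0000-0000-000000000000"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(toy_uuid + "/metadata.yaml", "uuid: " + toy_uuid + "\n")
    zf.writestr(toy_uuid + "/VERSION", "placeholder\n")
    zf.writestr(toy_uuid + "/data/feature-table.biom", "placeholder")

# Re-open the archive for reading, as one would open a real .qza file by path.
with zipfile.ZipFile(buffer) as zf:
    found_uuid = archive_uuid(zf)
    found_data = data_files(zf)

print(found_uuid)
print(found_data)
```

For a real artifact, zipfile.ZipFile("FeatureTableMiSeq_SOP.qza") would take the place of the in-memory buffer.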

3.4 Filtering Data in QIIME 2

In this section, we will introduce how to filter feature tables, sequences, and distance matrices in QIIME 2.

Example 3.10: FeatureTableMiSeq_SOP. Example 3.9, Cont.

The data that are used to illustrate the filtering functionality in QIIME 2 are FeatureTableMiSeq_SOP.qza (feature table), TaxonomyMiSeq_SOP.qza (taxonomy data), SampleMetadataMiSeq_SOP.tsv (sample metadata), “BrayCurtisDistanceMatrixMiSeq_SOP.qza” (distance matrix), and “sequences.qza” (sequence data).

First, we create a working directory.

mkdir -p QIIME2R-Bioinformatics/Ch3/Filtering

cd QIIME2R-Bioinformatics/Ch3/Filtering

Then, we put all of the above data files into the directory just created.

3.4.1 Filter Feature Table

Filtering a feature table includes filtering (i.e., removing) samples and features from it. A feature table consists of a sample axis and a feature axis, and the filtering operations are generally applicable to both. The filter-samples method filters the sample axis, whereas the filter-features method filters the feature axis. Both methods are implemented in the q2-feature-table plugin. We can also use the filter-table method in the q2-taxa plugin to perform taxonomy-based filtering of features from a feature table.

3.4.1.1 Total-Frequency-Based Filtering

As the name suggests, total-frequency-based filtering removes samples or features based on their total frequency in the feature table. Two common situations are (1) removing a sample whose total frequency is an outlier in the distribution of sample frequencies and (2) setting a cut-off point (a minimum total frequency) and removing samples whose total frequency is below it.

We can use the --p-max-frequency parameter to filter samples and features based on the maximum total frequency. We can also combine the --p-min-frequency and --p-max-frequency parameters to filter samples and features based on lower and upper limits of total frequency.

The following commands filter (i.e., remove) samples with a total frequency less than 1500 from FeatureTableMiSeq_SOP.qza.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--p-min-frequency 1500 \

--o-filtered-table SampleFrequencyFilteredFeatureTableMiSeq_SOP.qza

The following commands remove all features with a total abundance (summed across all samples) of less than 10 from FeatureTableMiSeq_SOP.qza.

qiime feature-table filter-features \

--i-table FeatureTableMiSeq_SOP.qza \

--p-min-frequency 10 \

--o-filtered-table FeatureFrequencyFilteredTable.qza
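The logic behind the two commands above can be sketched on a toy table. The following Python code is an illustration only, not the q2-feature-table implementation: the counts are made up, the function names are hypothetical, and the thresholds are assumed to be inclusive (a sample or feature whose total equals the minimum is kept).

```python
# Toy feature table with made-up counts: feature -> {sample: count}.
table = {
    "featA": {"S1": 900, "S2": 5, "S3": 2000},
    "featB": {"S1": 700, "S2": 3, "S3": 1500},
    "featC": {"S1": 4, "S2": 1, "S3": 0},
}


def sample_totals(table):
    """Sum the counts of every sample across all features."""
    totals = {}
    for counts in table.values():
        for sample, n in counts.items():
            totals[sample] = totals.get(sample, 0) + n
    return totals


def filter_samples(table, min_frequency=0, max_frequency=None):
    """Keep samples whose total count lies within the given limits."""
    keep = {
        sample
        for sample, total in sample_totals(table).items()
        if total >= min_frequency
        and (max_frequency is None or total <= max_frequency)
    }
    return {
        feature: {s: n for s, n in counts.items() if s in keep}
        for feature, counts in table.items()
    }


def filter_features(table, min_frequency=0, max_frequency=None):
    """Keep features whose count summed over all samples lies within the limits."""
    kept = {}
    for feature, counts in table.items():
        total = sum(counts.values())
        if total >= min_frequency and (max_frequency is None or total <= max_frequency):
            kept[feature] = dict(counts)
    return kept


# Sample totals are S1 = 1604, S2 = 9, S3 = 3500; feature totals are 2905, 2203, 5.
samples_kept = sorted(sample_totals(filter_samples(table, min_frequency=1500)))
features_kept = sorted(filter_features(table, min_frequency=10))
print(samples_kept)
print(features_kept)
```

In this toy table, sample S2 falls below the minimum of 1500 and feature featC falls below the minimum of 10, so both are removed.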

3.4.1.2 Contingency-Based Filtering

Features that are present in only one or a few samples may not represent real biological diversity but rather PCR or sequencing errors (such as PCR chimeras). Contingency-based filtering removes features contingent on the number of samples in which they occur, or removes samples contingent on the number of features they contain. The following commands remove the features from FeatureTableMiSeq_SOP.qza that are not contained in at least 2 samples.

qiime feature-table filter-features \

--i-table FeatureTableMiSeq_SOP.qza \

--p-min-samples 2 \

--o-filtered-table SampleContingencyFilteredTable.qza

The following commands remove samples from FeatureTableMiSeq_SOP.qza that contain fewer than 10 features.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--p-min-features 10 \

--o-filtered-table FeatureContingencyFilteredTable.qza

Similar to the total-frequency-based filtering methods, contingency-based filtering methods can use the --p-max-features and --p-max-samples parameters to filter contingent on the maximum number of features or samples. These parameters can optionally be used in combination with --p-min-features and --p-min-samples.
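Contingency-based filtering counts presences rather than summing frequencies. The Python sketch below illustrates this on a made-up table; it is not the QIIME 2 implementation, the function names are hypothetical, and a feature is assumed to be "contained" in a sample when its count there is greater than zero.

```python
# Toy feature table with made-up counts: feature -> {sample: count}.
table = {
    "featA": {"S1": 10, "S2": 0, "S3": 7},
    "featB": {"S1": 0, "S2": 0, "S3": 3},
    "featC": {"S1": 5, "S2": 2, "S3": 1},
}


def filter_features_by_samples(table, min_samples=0, max_samples=None):
    """Keep features observed (count > 0) in an allowed number of samples."""
    kept = {}
    for feature, counts in table.items():
        n_samples = sum(1 for n in counts.values() if n > 0)
        if n_samples >= min_samples and (max_samples is None or n_samples <= max_samples):
            kept[feature] = dict(counts)
    return kept


def filter_samples_by_features(table, min_features=0, max_features=None):
    """Keep samples that contain an allowed number of observed features."""
    n_features = {}
    for counts in table.values():
        for sample, n in counts.items():
            n_features[sample] = n_features.get(sample, 0) + (1 if n > 0 else 0)
    keep = {
        sample
        for sample, k in n_features.items()
        if k >= min_features and (max_features is None or k <= max_features)
    }
    return {
        feature: {s: n for s, n in counts.items() if s in keep}
        for feature, counts in table.items()
    }


# featA occurs in 2 samples, featB in 1, featC in 3.
features_kept = sorted(filter_features_by_samples(table, min_samples=2))
# S1 contains 2 features, S2 contains 1, S3 contains 3.
filtered = filter_samples_by_features(table, min_features=2)
samples_kept = sorted(filtered["featA"])
print(features_kept)
print(samples_kept)
```

Note that a sample with exactly the minimum number of features is kept in this sketch, which is the reading of --p-min-features assumed here.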

3.4.1.3 Identifier-Based Filtering

When we want to keep specific samples or features for analysis, we can list them by their identifiers (IDs) in a QIIME 2 metadata file and then use identifier-based filtering to retain them. Because the IDs are used to identify samples or features, a QIIME 2 metadata file that contains the IDs in the first column is required. The metadata file is passed as input with the --m-metadata-file parameter.

We can either use an existing metadata file or create a new one containing the IDs of the samples to filter by. To illustrate how to remove samples from a feature table using the identifier-based filtering method, below we create a simple QIIME 2 metadata file called SamplesToKeep.tsv that consists of two sample IDs to keep.

echo SampleID > SamplesToKeep.tsv

echo F3D0 >> SamplesToKeep.tsv

echo F3D9 >> SamplesToKeep.tsv

The following commands use the identifier-based filtering method to retain these two samples from FeatureTableMiSeq_SOP.qza.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SamplesToKeep.tsv \

--o-filtered-table IdFilteredTable.qza

After running the filter-samples method with the parameter --m-metadata-file SamplesToKeep.tsv, only the F3D0 and F3D9 samples are retained in the IdFilteredTable.qza file.
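The same idea can be sketched outside QIIME 2. The Python code below is an illustration under stated assumptions: it writes a two-ID file like SamplesToKeep.tsv into a temporary directory, reads the IDs from the first column (treating the first line as the header), and keeps only those columns of a toy table with made-up counts. The helper names read_ids and keep_samples are hypothetical.

```python
import csv
import os
import tempfile


def read_ids(path):
    """Read IDs from the first column of a tab-separated file, skipping the header."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    return [row[0] for row in rows[1:]]


def keep_samples(table, ids):
    """Retain only the listed sample IDs in a feature -> {sample: count} table."""
    wanted = set(ids)
    return {
        feature: {s: n for s, n in counts.items() if s in wanted}
        for feature, counts in table.items()
    }


# Toy table with made-up counts for three of the sample IDs used in this chapter.
table = {
    "featA": {"F3D0": 12, "F3D1": 0, "F3D9": 4},
    "featB": {"F3D0": 3, "F3D1": 8, "F3D9": 0},
}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "SamplesToKeep.tsv")
    with open(path, "w", newline="") as handle:
        handle.write("SampleID\nF3D0\nF3D9\n")
    ids = read_ids(path)

filtered = keep_samples(table, ids)
print(ids)
print(filtered)
```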

3.4.1.4 Metadata-Based Filtering

Similar to identifier-based filtering, metadata-based filtering uses metadata search criteria to filter the feature table to keep the samples that the user wants to retain. This is achieved in QIIME 2 by combining the --p-where parameter and the --m-metadata-file parameter. The following commands filter FeatureTableMiSeq_SOP.qza to contain only samples from Male mice.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Sex='Male'" \

--o-filtered-table MaleFilteredTable.qza

We can also filter samples on multiple values of a single metadata column. As in other programs, such as SAS, the IN clause is used to specify those values. In this example, the Time variable has two values, Early and Later. The following commands retain both Early and Later samples. Because these are all the samples in this dataset, the command does not actually filter out any samples; we use it only to illustrate the syntax.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Time IN ('Early', 'Later')" \

--o-filtered-table TimeFilteredTable.qza

As in other programs, the keywords AND and OR can be used in the --p-where parameter to require that both expressions, or at least one of them, evaluate to true. The following commands retain only those samples that are both Early and Female.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Time='Early' AND Sex='Female'" \

--o-filtered-table EarlyFemaleFilteredTable.qza

The following commands use the OR keyword to retain samples that are Early, Female, or both.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Time='Early' OR Sex='Female'" \

--o-filtered-table EarlyORFemaleFilteredTable.qza

With Time='Early' OR Sex='Female', a sample is retained if it satisfies at least one of the two conditions. All Early samples (both Female and Male) and all Female samples (both Early and Later) are therefore kept, and only the Later Male samples are removed.

The following commands retain only the Early samples that are not Female, that is, the Early Male samples.

qiime feature-table filter-samples \

--i-table FeatureTableMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Time='Early' AND NOT Sex='Female'" \

--o-filtered-table EarlyNonFemaleFilteredTable.qza
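To see which samples each of the above search criteria selects, the clauses can be tried on a small table. The sketch below assumes, as we understand the QIIME 2 documentation, that --p-where follows SQLite WHERE-clause syntax; it uses Python's built-in sqlite3 module on four made-up samples (one per Sex and Time combination) and is not how QIIME 2 itself is invoked. The helper select_ids and the sample IDs are hypothetical.

```python
import sqlite3

# Made-up metadata: one sample for each combination of Sex and Time.
rows = [
    ("M1", "Male", "Early"),
    ("M2", "Male", "Later"),
    ("F1", "Female", "Early"),
    ("F2", "Female", "Later"),
]


def select_ids(where):
    """Return the sample IDs that satisfy a WHERE clause, sorted by ID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (SampleID TEXT, Sex TEXT, Time TEXT)")
    con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    # The clause is pasted in verbatim for illustration only;
    # never build SQL this way from untrusted input.
    query = "SELECT SampleID FROM metadata WHERE " + where + " ORDER BY SampleID"
    ids = [row[0] for row in con.execute(query)]
    con.close()
    return ids


clauses = [
    "Sex='Male'",
    "Time IN ('Early', 'Later')",
    "Time='Early' AND Sex='Female'",
    "Time='Early' OR Sex='Female'",
    "Time='Early' AND NOT Sex='Female'",
]
selected = {clause: select_ids(clause) for clause in clauses}
for clause, ids in selected.items():
    print(clause, "->", ids)
```

On this toy metadata the OR clause keeps three of the four samples and drops only the Later Male sample, while the AND NOT clause keeps only the Early Male sample.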

3.4.2 Taxonomy-Based Tables and Sequences Filtering

The filter-table method in QIIME 2’s q2-taxa plugin is designed to facilitate taxonomy-based filtering, which is one of the most common types of feature-metadata-based filtering. Specific taxa can be retained in or removed from a table using the --p-include or --p-exclude parameters, respectively.

3.4.2.1 Filter Tables Based on Taxonomy

By default (--p-mode contains), search terms are case insensitive. Thus, in the following commands, the --p-exclude parameter removes from the table all features whose annotation contains mitochondria or Mitochondria.

qiime taxa filter-table \

--i-table FeatureTableMiSeq_SOP.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-exclude mitochondria \

--o-filtered-table FeatureTableMiSeq_SOPNoMitochondria.qza

Features can be removed using more than one search term by giving a comma-separated list of search terms.

For example, the following commands remove all features that contain either mitochondria or Rhodobacteraceae in their taxonomic annotation.

qiime taxa filter-table \

--i-table FeatureTableMiSeq_SOP.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-exclude mitochondria,Rhodobacteraceae \

--o-filtered-table FeatureTableMiSeq_SOPNoMitochondriaNoRhodobacteraceae.qza

The --p-include parameter is used to filter a table for retaining only specific features. For example, the following commands include p__ in --p-include parameter to retain only features that contain a phylum-level annotation.

qiime taxa filter-table \

--i-table FeatureTableMiSeq_SOP.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-include p__ \

--o-filtered-table FeatureTableMiSeq_SOPWithPhyla.qza

The --p-include and --p-exclude parameters can be used together. For example, the following commands use the --p-include parameter to retain all features that contain a phylum-level annotation (p__) and use the --p-exclude parameter to exclude all features that contain either mitochondria or Rhodobacteraceae in their taxonomic annotation.

qiime taxa filter-table \

--i-table FeatureTableMiSeq_SOP.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-include p__ \

--p-exclude mitochondria,Rhodobacteraceae \

--o-filtered-table FeatureTableMiSeq_SOPWithPhylaButNoMitochondriaNoRhodobacteraceae.qza

By default, QIIME 2 matches the term(s) provided for --p-include or --p-exclude if they are contained in a taxonomic annotation.

However, sometimes we want to match the terms only if they are the complete taxonomic annotation. The parameter --p-mode exact (which requires an exact match) is designed to achieve this goal. Because the search is an exact match, the search terms are case sensitive when searching with --p-mode exact. Thus, the search term mitochondria would not return the same results as the search term Mitochondria.

The following commands retain features with a phylum-level annotation and remove mitochondrial and chloroplast features. As written, they use the default matching; to require exact matches, add --p-mode exact and give each search term as the complete taxonomic annotation.

qiime taxa filter-table \

--i-table FeatureTableMiSeq_SOP.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-include p__ \

--p-exclude mitochondria,chloroplast \

--o-filtered-table table-with-phyla-no-mitochondria-no-chloroplast.qza

In QIIME 2, we can also use qiime feature-table filter-features with the --p-where parameter to achieve taxonomy-based filtering of tables. The qiime feature-table filter-features method supports more complex filtering queries than qiime taxa filter-table.
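The difference between the default matching and --p-mode exact, and the way include and exclude terms combine, can be sketched as follows. This Python code is only an illustration of the behavior described in this section, not the q2-taxa source: the annotations are made up in a Greengenes-like style, and the function names matches and filter_ids are hypothetical.

```python
# Made-up taxonomic annotations, keyed by hypothetical feature IDs.
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "f2": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
          "o__Rickettsiales; f__mitochondria",
    "f3": "k__Bacteria",
    "f4": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
          "o__Rhodobacterales; f__Rhodobacteraceae",
}


def matches(annotation, terms, mode="contains"):
    """Return True if any search term matches the annotation."""
    if mode == "exact":
        # Exact mode: case-sensitive match against the complete annotation.
        return annotation in terms
    # Contains mode: case-insensitive substring match.
    lowered = annotation.lower()
    return any(term.lower() in lowered for term in terms)


def filter_ids(taxonomy, include=None, exclude=None, mode="contains"):
    """Keep IDs that match an include term (if given) and no exclude term."""
    kept = []
    for feature_id, annotation in taxonomy.items():
        if include and not matches(annotation, include, mode):
            continue
        if exclude and matches(annotation, exclude, mode):
            continue
        kept.append(feature_id)
    return kept


no_mito = filter_ids(taxonomy, exclude=["Mitochondria"])
phyla_only = filter_ids(
    taxonomy, include=["p__"], exclude=["mitochondria", "Rhodobacteraceae"]
)
exact_mito = filter_ids(taxonomy, exclude=["mitochondria"], mode="exact")
print(no_mito)
print(phyla_only)
print(exact_mito)
```

In the last call nothing is removed, because no complete annotation is equal to the bare term mitochondria; this is why an exact search needs the full annotation string.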

3.4.2.2 Filter Sequences Based on Taxonomy

The filter-seqs method in QIIME 2’s q2-taxa plugin is designed to filter FeatureData[Sequence] based on a feature’s taxonomic annotation. Its functionality is very similar to that of qiime taxa filter-table. Below, the filter-seqs method is used to retain all sequences whose features contain a phylum-level annotation but to exclude those that contain either mitochondria or Rhodobacteraceae in their taxonomic annotation.

qiime taxa filter-seqs \

--i-sequences sequences.qza \

--i-taxonomy TaxonomyMiSeq_SOP.qza \

--p-include p__ \

--p-exclude mitochondria,Rhodobacteraceae \

--o-filtered-sequences SequencesMiSeq_SOPWithPhylaButNoMitochondriaNoRhodobacteraceae.qza

For other filtering-sequences methods, we refer the reader to the q2-feature-table and q2-quality-control plugins. The q2-feature-table plugin also has a filter-seqs method, which can be used to remove sequences based on various criteria, including which features are present within a feature table. The q2-quality-control plugin has an exclude-seqs action, which can be used for filtering sequences based on alignment to a set of reference sequences or primers.

3.4.3 Filter Distance Matrices

The q2-diversity plugin provides the filter-distance-matrix method to filter (i.e., remove) samples from a distance matrix. It works the same way as filtering feature tables by identifiers or sample metadata.

3.4.3.1 Filter Distance Matrix Based on Identifiers

The following commands filter the Bray-Curtis distance matrix to retain the two samples specified in SamplesToKeep.tsv above.

qiime diversity filter-distance-matrix \

--i-distance-matrix BrayCurtisDistanceMatrixMiSeq_SOP.qza \

--m-metadata-file SamplesToKeep.tsv \

--o-filtered-distance-matrix IdentifierFilteredBrayCurtisDistanceMatrix.qza

3.4.3.2 Filter Distance Matrix Based on Sample Metadata

The following commands filter the Bray-Curtis distance matrix to retain only samples from Female mice.

qiime diversity filter-distance-matrix \

--i-distance-matrix BrayCurtisDistanceMatrixMiSeq_SOP.qza \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv \

--p-where "Sex='Female'" \

--o-filtered-distance-matrix FemaleFilteredBrayCurtisDistanceMatrix.qza
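Filtering a distance matrix differs from filtering a feature table in that a sample must be dropped from the rows and the columns at the same time. The Python sketch below shows this on a made-up 3 x 3 matrix; it is not the q2-diversity implementation, and the function name and the distances are hypothetical.

```python
# Made-up symmetric distance matrix for three sample IDs.
ids = ["F3D0", "F3D1", "F3D9"]
matrix = [
    [0.0, 0.4, 0.7],
    [0.4, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]


def filter_distance_matrix(ids, matrix, keep):
    """Retain the rows and columns of the samples listed in keep."""
    index = [i for i, sample in enumerate(ids) if sample in keep]
    kept_ids = [ids[i] for i in index]
    kept_matrix = [[matrix[i][j] for j in index] for i in index]
    return kept_ids, kept_matrix


kept_ids, kept_matrix = filter_distance_matrix(ids, matrix, {"F3D0", "F3D9"})
print(kept_ids)
print(kept_matrix)
```

The set of IDs to keep could come from an ID list such as SamplesToKeep.tsv or from a metadata query such as Sex='Female'; the row and column selection is the same in both cases.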

3.5 Introducing QIIME 2 View

QIIME 2 View (https://view.qiime2.org) allows the user to open and read .qza and .qzv files stored on the user’s computer directly in the browser. It thus facilitates sharing the visualizations generated in QIIME 2 with a collaborator, who can explore the results interactively without having QIIME 2 installed. To use QIIME 2 View, open https://view.qiime2.org/ in a browser and drag the .qza or .qzv file onto the page. On a machine with QIIME 2 installed, a .qzv file can also be opened locally with qiime tools view.

3.6 Communicating Between QIIME 2 and R

Several tools have been developed to link QIIME 2 and R so that the two can be used together. Here, we first introduce the qiime2R package and then describe how to prepare a feature table and metadata table in R and import them into QIIME 2.

3.6.1 Export QIIME 2 Artifacts into R Using qiime2R Package

As we reviewed in Chap. 1 and have covered so far in this chapter, the QIIME 2 artifact is a crucial and novel concept in QIIME 2. An artifact stores the inputs and outputs of QIIME 2 together with associated metadata and provenance information about how the object was formed; in reality, an artifact file is a compressed directory with an intuitive structure and the extension .qza. The artifact thus facilitates data storage and delivery. Although QIIME 2 provides an export tool for extracting data such as feature tables and sequences from an artifact, this does not mean that the exported data are easy for R users to import into R.

The qiime2R package was developed for importing QIIME 2 artifacts directly into R (current version 0.99.6, March 2022). The package has two important functions: (1) read_qza() and (2) the qza_to_phyloseq() wrapper. With read_qza(), an artifact can be read into R without discarding any of the associated data. The qza_to_phyloseq() wrapper generates a phyloseq object, which is very useful when you use the phyloseq package to further analyze the data. We briefly introduce these two functions below.

To use this package, we first install this package by entering the following commands in R or RStudio:

install.packages("remotes")

remotes::install_github("jbisanz/qiime2R")

Example 3.11: FeatureTableMiSeq_SOP, Example 3.9, Cont.

We continue to use the data from Example 3.9 to illustrate the qiime2R package.

3.6.1.1 Read a .qza File

To read a .qza file, we first set the working directory and load the qiime2R package:

> setwd("~/Documents/QIIME2R/Ch3_DataProcessing")

> library(qiime2R)

Then, we use read_qza() to read the file:

> feature_tab<-read_qza("FeatureTableMiSeq_SOP.qza")

> names(feature_tab)

[1] "uuid" "type" "format" "contents" "version"

[6] "data" "provenance"

3.6.1.2 Create a phyloseq Object

A phyloseq object consists of at least two of four components: (1) feature table, (2) taxonomy, (3) tree, and (4) metadata. The four corresponding files, (1) FeatureTableMiSeq_SOP.qza, (2) TaxonomyMiSeq_SOP.qza, (3) RootedTreeMiSeq_SOP.qza, and (4) SampleMetadataMiSeq_SOP.tsv, have been saved in the R working directory (“~/Documents/QIIME2R/Ch3_DataProcessing”). Given that the files are available, we now call the function qza_to_phyloseq() to build a phyloseq object as below:

> library(phyloseq)

> phyloseqObj<-qza_to_phyloseq(features="FeatureTableMiSeq_SOP.qza", taxonomy = "TaxonomyMiSeq_SOP.qza", tree = "RootedTreeMiSeq_SOP.qza", metadata="SampleMetadataMiSeq_SOP.tsv")

> phyloseqObj

phyloseq-class experiment-level object

otu_table() OTU Table: [ 392 taxa and 360 samples ]

sample_data() Sample Data: [ 360 samples by 11 sample variables ]

tax_table() Taxonomy Table: [ 392 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree: [ 392 tips and 389 internal nodes ]

> otu<-otu_table(phyloseqObj)

> head(otu,3)

OTU Table: [3 taxa and 360 samples]

taxa are rows

F3D0 F3D1 F3D11 F3D125 F3D13 F3D141 F3D142 F3D143 F3D144 F3D145

b14d7992a4619e3524cad64f88ff8aa8 0 0 0 0 0 0 0 0 0 0

528ba5bd8a07c70f82636810d4a7743b 0 0 0 2 0 0 0 0 0 0

------

> tax<-tax_table(phyloseqObj)

> head(tax,3)

Taxonomy Table: [3 taxa by 7 taxonomic ranks]:

Kingdom Phylum Class Order

b14d7992a4619e3524cad64f88ff8aa8 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"

528ba5bd8a07c70f82636810d4a7743b "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhodobacterales"

5e13b5d5c72d5fb765a27828562246bb "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rickettsiales"

Family Genus Species

------

> sam<-sample_data(phyloseqObj)

> head(sam,3)

Sample Data: [3 samples by 11 sample variables]:

BarcodeSequence ForwardPrimerSequence ReversePrimerSequence ForwardRead

F3D0 <NA> <NA> <NA> F3D0_S188_L001_R1_001.fastq.gz

F3D1 <NA> <NA> <NA> F3D1_S189_L001_R1_001.fastq.gz

F3D11 <NA> <NA> <NA> F3D11_S198_L001_R1_001.fastq.gz

ReverseRead Group Sex Time DayID DPW Description

F3D0 F3D0_S188_L001_R2_001.fastq.gz F3D0 Female Early D000 0 QIIME2RAnalysisSet

F3D1 F3D1_S189_L001_R2_001.fastq.gz F3D1 Female Early D001 1 QIIME2RAnalysisSet

F3D11 F3D11_S198_L001_R2_001.fastq.gz F3D11 Female Early D011 11 QIIME2RAnalysisSet

> tree<-phy_tree(phyloseqObj)

> head(tree,3)

$edge

[,1] [,2]

[1,] 393 394

[2,] 394 395

[3,] 395 1

------

$edge.length

[1] 0.013504585 0.063435583 0.028701321 0.046779347 0.017936212 0.431774093 0.018533412 0.000000005

[9] 0.000000005 0.000000005 0.095598550 0.081652745 0.000000005 0.004416175 0.000000005 0.042783284

[17] 0.038235871 0.046480155 0.004419571 0.000000005 0.021835292 0.076448202 0.162150745 0.022725035

------

$Nnode

[1] 389

3.6.2 Prepare Feature Table and Metadata in R and Import into QIIME 2

When QIIME 2 is used to analyze microbiome data, most artifacts have probably already been generated from a count table. However, when R is used for data analysis, an artifact may not be available. In this section, we demonstrate how to generate an artifact from a count table prepared in R by importing it into QIIME 2. We also demonstrate how to prepare metadata in an appropriate format for QIIME 2.

Example 3.12: QTRT1 Data, Example 3.4, Cont.

In Example 3.4, we used the sequence data from QTRT1 (Zhang et al. 2020) to demonstrate how to import a BIOM data file into QIIME 2. Here, we use this dataset to illustrate how to first generate a feature table and a metadata table in R and then import them into QIIME 2.

Step 1: Generate feature table in R or RStudio.

> setwd("~/QIIME2R-Bioinformatics/Ch3")

> otu_tab <- read.csv("otu_table_genus_QTRT1.csv", check.names = FALSE)

> meta_tab <- read.csv("metadata_QTRT1.csv", check.names = FALSE)

> head(otu_tab,3)

> # Put the row names in the first column, followed by the 40 sample columns

> otu <- cbind(rownames(otu_tab), otu_tab[,2:41])

> head(otu,3)

> dim(otu)

[1] 586 41

> # QIIME 2 needs a featureid column

> colnames(otu)[1] <- "featureid"

> colnames(otu)[1]

[1] "featureid"

> # Remove rowname

> rownames(otu) <- NULL

> head(otu,3)

> write.table(otu, "feature_table_genus_QTRT1.txt", sep = "\t", col.names=TRUE, row.names=FALSE, quote = FALSE)

Step 2: Generate metadata table in R or RStudio.

> head(meta_tab,3)

SampleID Group Time Group4

1 Sun071.PG1 KO Post KO_POST

2 Sun027.BF2 WT Before WT_BEFORE

3 Sun066.PF1 WT Post WT_POST

> write.table(meta_tab, "metadata_QTRT1.txt", sep = "\t", col.names=TRUE, row.names=FALSE, quote = FALSE)

Now we exit R and continue the processing in QIIME 2. We need to make sure that QIIME 2 and R use the same directory (in this case, “QIIME2R-Bioinformatics/Ch3”), because “feature_table_genus_QTRT1.txt” and “metadata_QTRT1.txt” were written into this directory.
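Before switching to QIIME 2, it can be worth checking that the sample IDs in the header of the feature table agree with the IDs in the first column of the metadata file. The Python sketch below is one possible check and is an assumption on our part, not a step required by QIIME 2: it writes two small stand-in files with made-up counts into a temporary directory (in practice the paths would be feature_table_genus_QTRT1.txt and metadata_QTRT1.txt) and reports the IDs missing on either side. The helper names are hypothetical.

```python
import csv
import os
import tempfile


def feature_table_samples(path):
    """Return the sample IDs from the header line of a tab-separated feature table."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]  # the first column holds the feature IDs


def metadata_samples(path):
    """Return the sample IDs from the first column of a tab-separated metadata file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header line
        return [row[0] for row in reader if row]


with tempfile.TemporaryDirectory() as workdir:
    table_path = os.path.join(workdir, "feature_table.txt")
    meta_path = os.path.join(workdir, "metadata.txt")
    # Stand-in files; the counts are made up and one sample is left out on purpose.
    with open(table_path, "w", newline="") as handle:
        handle.write("featureid\tSun071.PG1\tSun027.BF2\n1\t10\t0\n2\t3\t7\n")
    with open(meta_path, "w", newline="") as handle:
        handle.write("SampleID\tGroup\tTime\n")
        handle.write("Sun071.PG1\tKO\tPost\n")
        handle.write("Sun027.BF2\tWT\tBefore\n")
        handle.write("Sun066.PF1\tWT\tPost\n")
    in_table = set(feature_table_samples(table_path))
    in_metadata = set(metadata_samples(meta_path))

missing_from_table = sorted(in_metadata - in_table)
missing_from_metadata = sorted(in_table - in_metadata)
print("In metadata but not in table:", missing_from_table)
print("In table but not in metadata:", missing_from_metadata)
```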

Step 3: Convert feature table into OTU table with biom2.0 format.

# Make sure to activate the conda (QIIME 2) environment

source activate qiime2-2022.2

cd QIIME2R-Bioinformatics/Ch3

# Convert the feature_table_genus_QTRT1 dataset to biom2.0

biom convert -i feature_table_genus_QTRT1.txt -o feature_table_genus_QTRT1.hdf5 --table-type="OTU table" --to-hdf5

Step 4: Import biom2.0 format OTU table into QIIME 2.

# Import the biom2.0 format into QIIME 2

# This makes an artifact to be used.

qiime tools import --input-path feature_table_genus_QTRT1.hdf5 --type 'FeatureTable[Frequency]' --input-format BIOMV210Format --output-path feature_table_genus_QTRT1.qza

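Once the feature table has been imported as an artifact, it can be used in QIIME 2 analyses together with the tab-separated metadata file, which is supplied directly through the --m-metadata-file parameter.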

3.7 Summary

This chapter demonstrated some basic data processing procedures in QIIME 2 with real microbiome datasets. First, importing FASTA and FASTQ format data as well as importing feature tables and phylogenetic trees were described and illustrated. Then, the BIOM format and the Newick tree format were described, and exporting feature tables and exporting rooted and unrooted phylogenetic trees were illustrated. Next, two ways of extracting data from QIIME 2 archives, using the qiime tools extract command and using the unzip program on macOS, were illustrated. After that, various data filtering methods were demonstrated, including filtering feature tables, taxonomy-based filtering of tables and sequences, and filtering distance matrices. QIIME 2 View was also introduced. Finally, two ways of communicating between QIIME 2 and R were introduced and illustrated: exporting QIIME 2 artifacts into R using the qiime2R package and preparing a feature table and metadata in R and then importing them into QIIME 2. In Chap. 4, we will introduce building a feature table and feature representative sequences from raw reads in QIIME 2.

References

biom-format.org. 2020a. The biom file format: Version 1.0. The BIOM Format Development Team. Last modified 05 Nov 2020. Accessed 8 March 2022. http://biom-format.org/documentation/format_versions/biom-1.0.html.
———. 2020b. The biom file format: Version 2.1. The BIOM Format Development Team. Last modified 05 Nov 2020. Accessed 8 March 2022. http://biom-format.org/documentation/format_versions/biom-2.1.html.
Buza, Teresia M., Triza Tonui, Francesca Stomeo, Christian Tiambo, Robab Katani, Megan Schilling, Beatus Lyimo, Paul Gwakisa, Isabella M. Cattadori, Joram Buza, and Vivek Kapur. 2019. iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics 20 (1): 374. https://doi.org/10.1186/s12859-019-2965-4.
Callahan, Ben J., Kris Sankaran, Julia A. Fukuyama, Paul J. McMurdie, and Susan P. Holmes. 2016. Bioconductor workflow for microbiome data analysis: From raw reads to community analyses. F1000Research 5: 1492–1492. https://doi.org/10.12688/f1000research.8986.2. https://www.ncbi.nlm.nih.gov/pubmed/27508062. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4955027/.
Cock, Peter J.A., Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. 2010. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38 (6): 1767–1771. https://doi.org/10.1093/nar/gkp1137. https://www.ncbi.nlm.nih.gov/pubmed/20015970. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/.
Edgar, Robert C. 2004. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5 (1): 113. https://doi.org/10.1186/1471-2105-5-113.
Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8 (3): 186–194.
Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8 (3): 175–185. https://doi.org/10.1101/gr.8.3.175.
Felsenstein, Joseph. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17 (6): 368–376.
———. 2004. Inferring phylogenies. Sunderland: Sinauer Associates, Inc.
———. 2021. The Newick tree format. Accessed January 17. https://evolution.genetics.washington.edu/phylip/newicktree.html.
Gilbert, J.A., F. Meyer, D. Antonopoulos, P. Balaji, C.T. Brown, C.T. Brown, N. Desai, J.A. Eisen, D. Evers, D. Field, W. Feng, D. Huson, J. Jansson, R. Knight, J. Knight, E. Kolker, K. Konstantindis, J. Kostka, N. Kyrpides, R. Mackelprang, A. McHardy, C. Quince, J. Raes, A. Sczyrba, A. Shade, and R. Stevens. 2010. Meeting report: The terabase metagenomics workshop and the vision of an Earth microbiome project. Standards in Genomic Sciences 3 (3): 243–248. https://doi.org/10.4056/sigs.1433550.
Gilbert, Jack A., Janet K. Jansson, and Rob Knight. 2014. The Earth Microbiome project: Successes and aspirations. BMC Biology 12 (1): 69. https://doi.org/10.1186/s12915-014-0069-1.
Jin, Dapeng, Wu Shaoping, Yong-guo Zhang, Lu Rong, Yinglin Xia, Hui Dong, and Jun Sun. 2015. Lack of Vitamin D receptor causes dysbiosis and changes the functions of the murine intestinal microbiome. Clinical Therapeutics 37 (5): 996–1009.e7. https://doi.org/10.1016/j.clinthera.2015.04.004. https://www.sciencedirect.com/science/article/pii/S0149291815002283.
Lipman, D.J., and W.R. Pearson. 1985. Rapid and sensitive protein similarity searches. Science 227 (4693): 1435–1441. https://doi.org/10.1126/science.2983426. https://science.sciencemag.org/content/sci/227/4693/1435.full.pdf.
McDonald, Daniel, Jose C. Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J. Gregory Caporaso. 2012. The Biological Observation Matrix (BIOM) format or: How I learned to stop worrying and love the ome-ome. GigaScience 1 (1): 7. https://doi.org/10.1186/2047-217X-1-7.
NIH. 2002. “Tree” facts: Rooted versus unrooted trees. Last modified revised 15 July 2002. https://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Phylogenetics/phylo9.html.
Olsen, Gary. 1990. Interpretation of “Newick’s 8:45” tree format. Accessed 17 Jan. https://evolution.genetics.washington.edu/phylip/newick_doc.html.
Pearson, W.R., and D.J. Lipman. 1988. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 85 (8): 2444–2448. https://doi.org/10.1073/pnas.85.8.2444. https://www.ncbi.nlm.nih.gov/pubmed/3162770. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC280013/.
QIIME. 2022. Post- split_libraries FASTA File Overview. QIIME.org. Accessed 8 Mar 2022. http://qiime.org/documentation/file_formats.html#post-split-libraries-fasta-file-overview.
Schloss, Patrick D., Alyxandria M. Schubert, Joseph P. Zackular, Kathryn D. Iverson, Vincent B. Young, and Joseph F. Petrosino. 2012. Stabilization of the murine gut microbiome following weaning. Gut Microbes 3 (4): 383–393. https://doi.org/10.4161/gmic.21008. https://www.ncbi.nlm.nih.gov/pubmed/22688727. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3463496/.
Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11 (10): e0163962–e0163962. https://doi.org/10.1371/journal.pone.0163962. https://pubmed.ncbi.nlm.nih.gov/27706213. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051824/.
Thompson, Luke R., Jon G. Sanders, …, Janet K. Jansson, Jack A. Gilbert, Rob Knight, and The Earth Microbiome Project Consortium. 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551: 457. https://doi.org/10.1038/nature24621. https://www.nature.com/articles/nature24621#supplementary-information.
Westcott, Sarah L., and Patrick D. Schloss. 2015. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ 3: e1487–e1487. https://doi.org/10.7717/peerj.1487. https://www.ncbi.nlm.nih.gov/pubmed/26664811. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4675110/.
Wikipedia. 2021. FASTA format. From Wikipedia, the free encyclopedia. Last modified 16 Nov 2021. Accessed 8 Mar 2022. https://en.wikipedia.org/wiki/FASTA_format.
Zhang, J., R. Lu, Y. Zhang, Ż. Matuszek, W. Zhang, Y. Xia, T. Pan, and J. Sun. 2020. tRNA queuosine modification enzyme modulates the growth and microbiome recruitment to breast tumors. Cancers (Basel) 12 (3). https://doi.org/10.3390/cancers12030628.