Introduction to Basic Microbial Ecology Data Analysis Using Phyloseq, Vegan, and DESeq2

In my last post, I walked through the process of analyzing an amplicon sequence dataset with the DADA2 pipeline. At the end of that walkthrough, I combined an OTU table, taxonomy table, and sample metadata together into a Phyloseq object. This post will go through some of the basic data exploration we do in the Buckley lab with microbiome datasets. This is a total jumping off point, and the packages we use have much more depth than I explore here. This is just a walkthrough to show basic functionality instead of a highly specialized analysis pipeline.

Data Source

The example data used in this tutorial comes from forest soils in upstate New York. Three sites in Tompkins Country were sampled; Bald Hill (BH), Carter Creek (CC), and Mount Pleasant (MP). At each of these sites, two 40x40m plots were definied in each of two forest types; primary, or old growth forest (noted as ‘Age 1’ in sample data), and secondary, or forest returned from agriculture (noted as ‘Age 2’). Four replicate soil samples were collected at a depth of 1-5cm from each plot in June 2015. DNA was extracted using a MoBio PowerMicrobiome 96-well extraction kit, and PCR amplified using 16S rRNA V4 primers, as outlined in Kozich et al. (2013).

Setting Up the Workspace

Load Necessary Packages

To begin, we will be using the phyloseq, vegan, and DESeq2 R packages to perform analyses on our dataset. In addition to these, the ggplot2 package will be used for plotting figures, and the plyr and dplyr packages provide functions to more easily work with dataframes. Note: I ran into some issues with phyloseq’s distance() function when DESeq2 was also loaded, so I wait until it is needed to load the DESeq2 package.

library(ggplot2)
library(plyr)
library(dplyr)
library(phyloseq)

Load Phyloseq Object

Previously, we created a phyloseq object using the DADA2 pipeline.

#Load the previously generated phyloseq object using `readRDS`
physeq = readRDS("otu_physeq.rds")
physeq

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1460 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 1460 taxa by 6 taxonomic ranks ]

Make Sample Data Dataframe

You will need to use some metadata variables for some options in phyloseq functions. We will create a dataframe of just the sample data to more easily access these variables.

# For future referencing, create a dataframe of just the sample data from your phyloseq object
physeq.m = sample_data(physeq)
physeq.m

##            Sample Barcode     i1_r     i2_f Primer Site Age Treatment Rep
## BH1C1156 BH1C1156     130 tgagtacg cgttacta    130   BH   1         C   1
## BH1C2156 BH1C2156     131 tgagtacg agagtcac    131   BH   1         C   2
## BH1C3156 BH1C3156     132 tgagtacg tacgagac    132   BH   1         C   3
## BH1C4156 BH1C4156     133 tgagtacg acgtctcg    133   BH   1         C   4
## BH2C1156 BH2C1156     146 tagtctcc cgttacta    146   BH   2         C   1
## BH2C2156 BH2C2156     147 tagtctcc agagtcac    147   BH   2         C   2
## BH2C3156 BH2C3156     148 tagtctcc tacgagac    148   BH   2         C   3
## BH2C4156 BH2C4156     149 tagtctcc acgtctcg    149   BH   2         C   4
## CC1C1156 CC1C1156     163 actacgac agagtcac    163   CC   1         C   1
## CC1C2156 CC1C2156     164 actacgac tacgagac    164   CC   1         C   2
## CC1C3156 CC1C3156     162 actacgac cgttacta    162   CC   1         C   3
## CC1C4156 CC1C4156     165 actacgac acgtctcg    165   CC   1         C   4
## CC2C1156 CC2C1156     178 gtctatga cgttacta    178   CC   2         C   1
## CC2C2156 CC2C2156     180 gtctatga tacgagac    180   CC   2         C   2
## CC2C3156 CC2C3156     179 gtctatga agagtcac    179   CC   2         C   3
## CC2C4156 CC2C4156     181 gtctatga acgtctcg    181   CC   2         C   4
## MP1C1156 MP1C1156      97 cgagagtt ctactata     97   MP   1         C   1
## MP1C2156 MP1C2156      98 cgagagtt cgttacta     98   MP   1         C   2
## MP1C3156 MP1C3156      99 cgagagtt agagtcac     99   MP   1         C   3
## MP1C4156 MP1C4156     100 cgagagtt tacgagac    100   MP   1         C   4
## MP2C1156 MP2C1156     114 acgctact cgttacta    114   MP   2         C   1
## MP2C2156 MP2C2156     115 acgctact agagtcac    115   MP   2         C   2
## MP2C3156 MP2C3156     116 acgctact tacgagac    116   MP   2         C   3
## MP2C4156 MP2C4156     117 acgctact acgtctcg    117   MP   2         C   4
##          Time Temperature Moisture
## BH1C1156  156        15.2      1.7
## BH1C2156  156        15.2      3.1
## BH1C3156  156        15.0      2.0
## BH1C4156  156        15.0      7.6
## BH2C1156  156        15.0      8.5
## BH2C2156  156        15.3      3.8
## BH2C3156  156        15.2      2.6
## BH2C4156  156        15.8      7.5
## CC1C1156  156        15.8      2.3
## CC1C2156  156        15.8      0.6
## CC1C3156  156        15.9      3.0
## CC1C4156  156        15.5       NA
## CC2C1156  156        15.6      3.9
## CC2C2156  156        15.3      2.3
## CC2C3156  156        15.9      1.1
## CC2C4156  156        15.8      2.0
## MP1C1156  156        16.3      7.2
## MP1C2156  156        15.6      7.3
## MP1C3156  156        16.5     10.7
## MP1C4156  156        15.2      8.1
## MP2C1156  156        15.7      7.1
## MP2C2156  156        15.9      3.5
## MP2C3156  156        15.9      2.7
## MP2C4156  156        16.1      8.2

Plot Distribution of Read Counts from Samples

We want to check the distribution of read counts for each sample, to make sure we don’t have any outliers in terms of sequencing depth.

# We can use the `sample_sums()` function of phyloseq to add up total read counts for each sample.
sample_sum_df <- data.frame(sum = sample_sums(physeq))

# Histogram of sample read counts
ggplot(sample_sum_df, aes(x = sum)) + 
  geom_histogram(color = "black", fill = "indianred", binwidth = 2500) +
  ggtitle("Distribution of sample sequencing depth") + 
  xlab("Read counts") +
  theme(axis.title.y = element_blank())

Diversity Analysis

A basic question that nearly all microbial ecology studies seek to answer is ‘do the microbial communities differ between my treatments?’. The following sections will demonstrate how to perform basic analysis of alpha and beta diversity.

Check Alpha Diversity of Samples

A simple check of ‘how many unique observations are in each sample type’ can be useful for some studies. We will first plot different alpha diversty metrics, then perform an ANOVA on the unique observed sequences.

# Originally, the "Age variable was read as numerical, while I want it to be categorical. I use the `as.factor()` function to do this."
sample_data(physeq)$Age <- as.factor(sample_data(physeq)$Age)

#Phyloseq contains the `plot_richness()` function to display multiple alpha diversity measures at once. You can modify the plots based on sample metadata as well.
plot_richness(physeq, x="Site", measures=c("Observed", "Shannon", "Simpson", "Chao1", "InvSimpson"), color="Age") +   theme_bw()

Now we can run an ANOVA on the ‘Observed’ diversity.

# First we will create a dataframe containing all diversity measures from above using the `estimate_richness()` function.
adiv <- estimate_richness(physeq, measures = c("Observed", "Shannon", "Simpson", "Chao1", "InvSimpson"))

# Then we can run an analysis of variance test using the `aov()` function.
summary(aov(adiv$Observed ~ physeq.m$Site))

##               Df Sum Sq Mean Sq F value Pr(>F)
## physeq.m$Site  2   1191   595.3   0.484  0.623
## Residuals     21  25806  1228.9

We can see here that there is no significant difference in the number of observed unique sequences by Site.

Calculate Distance Between Samples and Plot Ordinations

The distance function calculates distance between samples based on a given metric. We commonly use the Bray-Curtis metric, but wieghted and unweighted Unifrac can be used as well. A basic ordination plot is then created with the plot_ordination function.

# First, we normalize the sequence counts by converting from raw abundance to relative abundance. This removes any bias due to total sequence counts per sample.
pn = transform_sample_counts(physeq, function(x) 100 * x/sum(x))

# Next, we use the `distance()` function from phyloseq to generate a distance matrix from our phyloseq object. You can select from multiple methods; we will use Bray-Curtis
iDist <- distance(pn, method = "bray")

# Using the distance matrix created above, we now make an NMDS ordination using the `ordinate()` function.
pn.nmds = ordinate(pn, 
                method = "NMDS", 
                distance = iDist)

# Finally, we create an plot of the previous ordination using the `plot_ordination()` function. The `justDF` option is set to `TRUE`, which indicates that we only want the dataframe created with plot ordination to be returned, not a ggplot object.
plot.pn.nmds = plot_ordination(pn, pn.nmds, justDF = TRUE)

Using ggplot2, you can customize your plot with any metadata available in your sample_data file.

# Next we can display the figure using the `print(ggplot())` command, with any ggplot aesthetic arguments you want. 
print(ggplot(plot.pn.nmds, aes(x = NMDS1, y = NMDS2, color = Site, shape = Age)) + 
        geom_point( size = 4, alpha = 0.75))

Simply observing what looks like a trend in an ordination is not the same as having statistical support, so we will now run a PERMANOVA test using the adonis() function in the vegan package.

library(vegan)
p.df = as(sample_data(pn), "data.frame")
p.d = distance(pn, method = "bray")
p.adonis = adonis(p.d ~ Site + Age + Site*Age, p.df)
p.adonis

## 
## Call:
## adonis(formula = p.d ~ Site + Age + Site * Age, data = p.df) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Site       2    2.2594 1.12968  7.5507 0.38795  0.001 ***
## Age        1    0.2723 0.27232  1.8202 0.04676  0.068 .  
## Site:Age   2    0.5991 0.29956  2.0022 0.10287  0.025 *  
## Residuals 18    2.6930 0.14961         0.46242           
## Total     23    5.8239                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we can see that our communities significantly differ by Site, but not by Age. There is some interaction between the two variables as well. You can easily adjust the model to whatever metadata you desire.

Summarize Taxonomy at the Phylum Scale

Sometimes you want to look at the relative abundance of major phyla in your samples, averaged over a particular metadata category.

# This normalization is redundant from before, but usefull if you want to just copy this chunk of code
pn = transform_sample_counts(physeq, function(x) 100 * x/sum(x))

# Here we use the `tax_glom()` function from phyloseq to collapse our OTU table at the Phylum level. This takes all OTUs under a given phylum and combines them, adding up the total sequence counts. 
phylum = tax_glom(pn, taxrank="Phylum")

# Next, we want to look at just the average community by Site, not each individual sample. We will use the `merge_samples()` function to do this. You can use any categorical metadata variable for ths merge. 
pm  = merge_samples(phylum, "Site")

# Phyloseq does some odd things with sample names when you merge, so this next line of code makes sure that everything will be labelled correctly for our figures.
sample_data(pm)$DeRep <- levels(sample_data(pm)$Sample)

# Again, we make a sample_data dataframe for future reference
pm.m = pm %>% sample_data

# Since each 'sample' in this new object is an agglomerate sample, we need to re-normalize our sequence counts for each OTU. 
pm = transform_sample_counts(pm, function(x) 100 * x/sum(x))

# Finally, we can use the `plot_bar()` function to display our barchart.
plot_bar(pm, "Sample", fill = "Phylum")

An alternative to the stacked barchart is to use the facet function of ggplot.

# Here, we will take a look at the top 5 most abundant phyla using the a faceted plot.

# First, create a list of the top 5 phyla
trimlist = names(sort(taxa_sums(pm), TRUE)[1:5])

# Then we will use the `prune_taxa()` function to select only those top 5 phyla
pm.5 = prune_taxa(trimlist, pm)
    
# Finally, we can plot the abundance of those phyla using the `facet_grid` argument for `plot_bar()`
plot_bar(pm.5, "Sample", fill = "Phylum", facet_grid = ~Phylum)

Differential Abundance Testing

After determining that our microbial communities significantly differ by Site, we might be interested in which unique sequences differ in abundance between our Sites. We will use the DESeq2 package for this analysis.

Set Up Dataframes and Run DESeq2

DESeq2 is an R package originally written to perform analyses of differential expression for RNA-Seq experiments. Fortunately, the methods used for those analysis are the same we need to perform analyses of differential abundnace for our community data. DESeq2 can only work with two conditions at a time, and since we have 3 sites, we will need to create a new phyloseq object with one site removed. You can perform this analysis on each of the three site pairings if you wish.

# Create a new `phyloseq` object with only the BH and MP sites.
mb = subset_samples(physeq, Site == "MP" | Site == "BH") %>%
                      filter_taxa(function(x) sum(x) > 0, TRUE)

# Next, we need to check the order of our Site levels. DESeq2 takes the first level as the 'Control' and the second level as the 'Treatment', and this is needed for downstream interpretation of results.
head(sample_data(mb)$Site)

## [1] BH BH BH BH BH BH
## Levels: BH MP

# Now we can prepare the DESeq2 object and run our analysis.
library(DESeq2)

# First, we will convert the sequence count data from our phyloseq object into the proper format for DESeq2
mb.dds <- phyloseq_to_deseq2(mb, ~ Site)

# Next we need to estimate the size factors for our sequences
gm_mean = function(x, na.rm=TRUE){
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(mb.dds), 1, gm_mean)
mb.dds = estimateSizeFactors(mb.dds, geoMeans = geoMeans)

# Finally, we can run the core DESeq2 algorithm
mb.dds = DESeq(mb.dds, fitType="local")

Format Results Table

Next, we will trim our results and take a look at those OTUs that showed a significant difference in abundance between our two sites.

# Create a new dataframe from the results of our DESeq2 run
mb.res = results(mb.dds)

# Reorder the sequences by their adjusted p-values
mb.res = mb.res[order(mb.res$padj, na.last=NA), ]

# Set your alpha for testing significance, and filter out non-significant results
alpha = 0.01
mb.sigtab = mb.res[(mb.res$padj < alpha), ]

# Add taxonomy information to each sequence
mb.sigtab = cbind(as(mb.sigtab, "data.frame"), as(tax_table(mb)[rownames(mb.sigtab), ], "matrix"))
head(mb.sigtab)

##        baseMean log2FoldChange    lfcSE      stat       pvalue
## OTU.24 6.686493       9.783456 1.069434  9.148253 5.786182e-20
## OTU.12 4.985300      -9.044197 1.127915 -8.018507 1.070383e-15
## OTU.50 3.893791       8.965181 1.177931  7.610957 2.720744e-14
## OTU.52 4.273686       9.081006 1.198890  7.574512 3.604797e-14
## OTU.72 3.304191       8.680833 1.269083  6.840241 7.905980e-12
## OTU.85 2.997141       8.559756 1.246743  6.865696 6.616774e-12
##                padj  Kingdom          Phylum               Class
## OTU.24 1.556483e-17 Bacteria  Proteobacteria Alphaproteobacteria
## OTU.12 1.439665e-13 Bacteria            <NA>                <NA>
## OTU.50 2.424226e-12 Bacteria Verrucomicrobia      Spartobacteria
## OTU.52 2.424226e-12 Bacteria  Actinobacteria      Actinobacteria
## OTU.72 3.544514e-10 Bacteria Verrucomicrobia      Spartobacteria
## OTU.85 3.544514e-10 Bacteria  Actinobacteria      Actinobacteria
##                   Order             Family        Genus
## OTU.24      Rhizobiales  Xanthobacteraceae Pseudolabrys
## OTU.12             <NA>               <NA>         <NA>
## OTU.50             <NA>               <NA>         <NA>
## OTU.52  Actinomycetales Micromonosporaceae         <NA>
## OTU.72             <NA>               <NA>         <NA>
## OTU.85 Acidimicrobiales               <NA>         <NA>

Plot DESeq2 Results

Now we can graph our results from DESeq2. In the plot below, each individual point represents a unique sequence from our dataset that was found to be significantly higher in abundance in one sample type or the other. The log2 fold change value represents the magnitude of that difference in abundance. The higher the value, the greater the difference in abundance in one sample type compared to the other.

At this point we need to consult the information we gathered earlier on the levels of our Site variable. In DESeq2, any sequence that is more abundant in the ‘Control’ variable will have a negative log2 fold change value, and any sequence more abundant in the ‘Treatment’ variable will have a positive value. In our case, sequences more abundant in the ‘BH’ site will have negative values, and those more abundant in ‘MP’ values will have a positive value.

# Set ggplot2 options
theme_set(theme_bw(base_size = 20))
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
# Rearrange the order of our OTUs by Phylum.
x = tapply(mb.sigtab$log2FoldChange, mb.sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
mb.sigtab$Phylum = factor(as.character(mb.sigtab$Phylum), levels=names(x))

# Create and display the plot
p = ggplot(mb.sigtab, aes(x=Phylum, y=log2FoldChange, color=Phylum)) + geom_jitter(size=4, width=0.25) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))
p

References

Sequencing Primers

Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. 2013. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol 79(17): 5112-5120

DADA2

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13, 581-583

Phyloseq

McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 8(4):e61217

DESeq2

Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550

Vegan

Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, and Wagner H. 2016. vegan: Community Ecology Package. R package version 2.4-1.

Written on December 22, 2016