South Green Logo

South Green Trainings pages

Description Hands On Lab Exercises for Metabarcoding
Authors J Orjuela (julie.orjuela@ird.fr), A Dereeper (alexis.dereeper@ird.fr), F Constancias (florentin.constancias@cirad.fr), J Reveilleud (JR) (julie.reveillaud@inra.fr), M Simonin (marie.simonin@ird.fr), F Mahé (frederic.mahe@cirad.fr), A Comte (aurore.comte@ird.fr)
Creation Date 18/04/2018
Last Modified Date 22/05/2019

Summary


Practice 1 : Obtaining an OTU table with FROGS in Galaxy

In this training we will first performed metabarcoding analysis with the FROGS v3.1 pipeline in the Galaxy environment https://github.com/geraldinepascal/FROGS . In a second time, we will perform similar analysis in command line on HPC i-Trop cluster.

We will launch every step of a metabarcoding analysis as follow :


1.1 Preprocess


1.2 Clustering


1.3 Stats on clustering (optional)


1.4 Remove chimera


1.5 OTU Filtering


1.6 Stats on clustering (optional)


1.7 Taxonomic affiliation


1.8 Affiliation stats


1.9 BIOM format standardization

Retrieve a standardize biom file using - FROGS BIOM to std BIOM


1.10 Building a Tree


1.11 Phyloseq stats in FROGSTAT


1. 12 Workflow in Galaxy

Import a preformated FROGS workflow from Galaxy. Go to Shared Data / Workflows /FROGS and import it. This workflow contains the whole of steps used before. Be free of modified it and lauch it if you want.


Practice 2 : Launch FROGs in command line

Pipeline in bash format (for command line use):

https://github.com/SouthGreenPlatform/trainings/blob/gh-pages/files/run_frogs_pipelinev3.sh

mkdir ~/TP-FROGS 
cd ~/TP-FROGS 
wget https://raw.githubusercontent.com/SouthGreenPlatform/trainings/gh-pages/files/launchFROGs_v3.sh 
chmod +x launchFROGs_v3.sh
qsub ./launchFROGs_v3.sh formationX
scp -r formationX@bioinfo-master.ird.fr:/home/formationX/OUTPUT_FROGSV3/ .

Otherwise, use this download link: https://elwe.rhrk.uni-kl.de/outgoing/OUTPUT_FROGSV3.zip (valid until 2019-06-01)


Practice 3 : Tutoriel Phyloseq Formation Metabarcoding

3.1 Setup your environment

Start with a clean session

rm(list=ls())

load the packages

require(phyloseq)
require(tidyverse)
require(reshape2)
require(gridExtra)
require(scales)
require(parallel)
require(permute)
require(lattice)

Let’s define the working directory on your local computer

setwd("your path")

3.2 Building a Phyloseq object

… and load the data generated using FROGS

load("11-phylo_import.Rdata")

Data is a phyloseq object

data

We can access the ‘OTU’ / sample occurence table with the follwing command

head(otu_table(data))

You can also use tidyr syntax to make your code net and tidy

data %>% 
  otu_table() %>%
    head()

in R, type ? and the function name to some help

?otu_table

Question1 : What is the sequencing depth of the samples ?

data %>% 
  otu_table() %>%
  colSums()

Phyloseq has some built-in functions to explore the data

data %>% sample_sums

Let’s plot the sorted sequencing depth

data %>% 
  otu_table() %>%
  colSums() %>%
  sort() %>% 
  barplot(las=2)

Question2 : How many reads are representing each of the first 10 OTU (i.e., swarm’s clusters) ?

sort(rowSums(otu_table(data)), decreasing = T)[1:10]

We can access the taxonomical information of the different OTU with the follwing command

data %>% 
  tax_table() %>%
  head()

Metadata are also stored in data phyloseq object

sample_data(data)$env_material

Phyloseq has some built-in functions

rank_names(data) # taxonimcal ranks
nsamples(data) # number of samples
ntaxa(data) # number of OTU
sample_variables(data) # metadata

3.3 Distribution per OTU and per sample

Let’s plot the sequence distribution per OTU and per sample First, create a dataframe with nreads : the sorted number of reads per OTU, sorted : the index of the sorted OTU and type : OTU

readsumsdf <- data.frame(nreads = sort(taxa_sums(data), TRUE),
                        sorted = 1:ntaxa(data), 
                        type = "OTU")

These are the first rows of our dataframe

readsumsdf %>% head()

We can plot this dataframe using ggplot

ggplot(readsumsdf, 
       aes(x = sorted, y = nreads)) + 
  geom_bar(stat = "identity") + 
  scale_y_log10() 

Now we are going to create another dataframe with the sequencing depth per sample sample_sums()

readsumsdf2 <- data.frame(nreads = sort(sample_sums(data), TRUE), 
                          sorted = 1:nsamples(data), 
                          type = "Samples")

Let’s bind the two tables

readsumsdf3 <- rbind(readsumsdf,readsumsdf2)

Check the first rows

readsumsdf3 %>% head()

Check the last rows

readsumsdf3 %>% tail()

We can plot the data using ggplot and wrap the data according to type column (that’s why we specified OTU and Samples )

p  <-  ggplot(readsumsdf3, 
              aes(x = sorted, y = nreads)) + 
  geom_bar(stat = "identity")
p + ggtitle("Total number of reads before Preprocessing") + scale_y_log10() + facet_wrap(~type, 1, scales = "free")

3.4 Rarefaction curves

Let’s explore the rarefaction curves i.e., OTU richness vs sequencing depth

data %>%
  otu_table() %>%
  t() %>%
  vegan::rarecurve()

We can do something nicer with ggplot

source("https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/master/load-extra-functions.R")
p <- ggrare(data,
            step = 500,
            color = "env_material",
            plot = T,
            parallel = T,
            se = F)
p <- p + 
  facet_wrap(~ env_material ) + 
  geom_vline(xintercept = min(sample_sums(data)), 
             color = "gray60")
plot(p)

3.5 OTU table Filtering

We are now going to filter the OTU table

Explore the Taxonomy at the Kingdom level

tax_table(data)[,c("Kingdom")] %>% unique()

Remove untargeted OTU (we consider Unclassified OTU at the Kingdom level as noise) using subset_taxa

data <- subset_taxa(data, 
                    Kingdom != "Unclassified" &
                    Order !="Chloroplast" &
                    Family != "Mitochondria")

Remove low occurence / abundance OTU i.e., more than 10 sequences in total and appearing in more than 1 sample

data <-  filter_taxa(data, 
                     function(x) sum(x >= 10) > (1), 
                     prune =  TRUE) 

3.6 Normalisation to minumum sequencing depth

Rarefy to en even sequencing depth (i.e., min(colSums(otu_table(data)))

data_rare <- rarefy_even_depth(data, 
                              sample.size = min(colSums(otu_table(data))), 
                              rngseed = 63)

Rarefaction curves on filtered data

p <- ggrare(data_rare, step = 50, color = "env_material", plot = T, parallel = T, se = F)
p 

One can export the filtered OTU table

write.csv(cbind(data.frame(otu_table(data_rare)),
                tax_table(data_rare)), 
          file="filtered_otu_table.csv")

3.7 alpha-diversity

We can now explore the alpha-dviersity on the filtered and rarefied data

p <- plot_richness(data_rare, 
                   x="sample", 
                   color="env_material", 
                   measures=c("Observed","Shannon","ACE"), 
                   nrow = 1)
print(p)

That plot could be nicer Data to plot are stored in p$data

p$data %>% head()

boxplot using ggplot

ggplot(p$data,aes(env_material,value,colour=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=1)

More Complex

ggplot(p$data,aes(env_material,value,colour=env_material,shape=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=0.8, 
               position = position_dodge(width=0.9)) + 
  geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
  ylab("Diversity index")  + xlab(NULL) + theme_bw()

Export the alpha div values into a dataframe in short format

rich.plus <- dcast(p$data,  samples + env_material ~ variable)
write.csv(rich.plus, file="alpha_div.csv")

Alpha-div Stats using TukeyHSD on ANOVA

TukeyHSD_Observed <- TukeyHSD(aov(Observed ~ env_material, data =  rich.plus))
TukeyHSD_Observed_df <- data.frame(TukeyHSD_Observed$env_material)
TukeyHSD_Observed_df$measure = "Observed"
TukeyHSD_Observed_df$shapiro_test_pval = (shapiro.test(residuals(aov(Observed ~ env_material, data =  rich.plus))))$p.value
TukeyHSD_Observed_df

3.8 beta-diversity

Compute dissimilarity

data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
  otu_table() %>%
  t() %>%
  sqrt() %>%
  as.data.frame() %>%
  vegdist(binary=F, method = "bray") -> dist
Ordination

run PCoA ordination on the generated distance

ord <- ordinate(data_rare,"PCoA",dist)

Samples coordinate on the PCoA vecotrs are stored in but plot_ordination can make use of ord object easily

ord$vectors
plot_ordination(data_rare, 
                ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA sqrt Bray curtis", 
                label= "SampleID" ) + 
  geom_point(aes(size=rich.plus$Observed)) +
  theme_bw()

Let’s see if the observed pattern is significant using PERMANOVA i.e., adonis function from vegan

adonis(dist ~ get_variable(data_rare, "env_material"), permutations = 1000)$aov.tab
dispersion

Let’s see if there are difference in dispersion (i.e., variance)

boxplot(betadisper(dist, 
                   get_variable(data_rare, "env_material")),las=2, 
        main=paste0("Multivariate Dispersion Test Bray-Curtis "," pvalue = ", 
                    permutest(betadisper(dist, get_variable(data_rare, "env_material")))$tab$`Pr(>F)`[1]))
ANOSIM

ANOSIM test can also test for differences among group

plot(anosim(dist, get_variable(data_rare, "env_material"))
     ,main="ANOSIM Bray-Curtis "
     ,las=2)

3.9 Composition plot

Now, we would like to plot the distribution of phylum transformed in %

data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
  plot_bar(fill="Phylum") +
  facet_wrap(~ env_material, scales = "free_x", nrow = 1) +
  ggtitle("Bar plot colored by Phylum ") +
  theme(plot.title = element_text(hjust = 0.5))

We can generate a nicer plot using plot_composition function

p <- plot_composition(data_rare,
                      taxaRank1 = "Kingdom",
                      taxaSet1 ="Bacteria",
                      taxaRank2 = "Phylum", 
                      numberOfTaxa = 20, 
                      fill= "Phylum") +
  facet_wrap(~env_material, scales = "free_x", nrow = 1) + 
  theme(plot.title = element_text(hjust = 0.5)) 

3.10 Some exercises

Answer :

data %>%
  subset_taxa(Kingdom == "Archaea") %>% 
  ntaxa()

Answer :

subset_taxa(data, 
            Class == "Alphaproteobacteria") %>%
  plot_richness() -> p

ggplot(subset(p$data,variable=="Observed"),
       aes(env_material,value,colour=env_material,shape=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=0.8, 
               position = position_dodge(width=0.9)) + 
  geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
  ylab("Diversity index")  + xlab(NULL) + theme_bw()

Answer :

data %>% 
  subset_taxa(Class == "Alphaproteobacteria") %>%
  subset_samples(env_material != "Endosphere") -> exo2

exo2 %>%
  otu_table() %>%
  t() %>%
  vegdist(binary=F, method = "morisita") -> dist

ord <- ordinate(exo2,"PCoA",dist)

ord

plot_ordination(physeq = exo2, ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA  Morisita",
                label= "SampleID" ) + theme_bw()

data %>% 
  subset_taxa(Class == "Alphaproteobacteria") %>%
  subset_samples(env_material %in% c("Bulk_Soil", "Rhizosphere")) -> test

sample_data(test)
  
test %>% 
  otu_table() %>%
  t() %>%
  vegdist(binary=F, method = "morisita") -> dist

adonis(dist ~ get_variable(test, "env_material"), permutations = 1000)$aov.tab

Answer :

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Order == "Chloroplast") %>%
  plot_bar(fill="Order") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")

Some surprises ?

Answer:

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Family == "Mitochondria") %>%
  plot_bar(fill="OTU") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")

6a. Plot beta-diversity of Mitochondria and Chloroplasts OTU using Bray-Curtis distance on untransformed table

6b. what is the percentage of Mitochondria and Chloroplasts OTU

6c. plot a basic barplot of it

Answer:

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Family == "Mitochondria" | Order == "Chloroplast") -> cont

sample_sums(cont) %>%
  sort(decreasing = T) %>%
  barplot(las=2)

cont %>%
  otu_table() %>%
  t() %>%
  vegan::vegdist(binary=F, method = "bray") -> dist
  
ord <- ordinate(cont,"PCoA",dist)

ord

plot_ordination(physeq = cont, ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA  bray",
                label= "SampleID") + theme_bw()
                

Metabarcoding

Qiime2 and DADA2

Fred’s-metabarcoding-pipeline


License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).