Metabarcoding Practice

Description	Hands On Lab Exercises for Metabarcoding
Authors	J Orjuela (julie.orjuela@ird.fr), A Dereeper (alexis.dereeper@ird.fr), F Constancias (florentin.constancias@cirad.fr), J Reveilleud (JR) (julie.reveillaud@inra.fr), M Simonin (marie.simonin@ird.fr), F Mahé (frederic.mahe@cirad.fr), A Comte (aurore.comte@ird.fr)
Creation Date	18/04/2018
Last Modified Date	22/05/2019

Summary

Practice 1: Obtaining an OTU table with FROGS in Galaxy
Practice 2: FROGs in command line
Practice 3: Handling and visualizing OTU table using PhyloSeq R package
Links
License

Practice 1 : Obtaining an OTU table with FROGS in Galaxy

In this training we will first performed metabarcoding analysis with the FROGS v3.1 pipeline in the Galaxy environment https://github.com/geraldinepascal/FROGS . In a second time, we will perform similar analysis in command line on HPC i-Trop cluster.

Connect to Galaxy i-Trop with formationN account.
Create a new history and import Metabarcoding sample datasets (paired-end fastq files compressed by tar ) from
Shared Data / Data libraries /formation Galaxy 2019 / Metabarcoding. Recovery DATA_s.tar.gz and Summary.txt
- Fastq files used here are a subset of reads obtained in a metagenomic study of Edwards et al 2015 containing 4 soil compartments: Rhizosphere, Rhizoplane, Endosphere and Bulk_Soil of a rice culture.

We will launch every step of a metabarcoding analysis as follow :

1.1 Preprocess

Merge paired reads and dereplicate using the Preprocessing tool with FLASH as merge software - FROGS Pre-process
- => Read size is 250 pb, expected, minimum and maximun amplicon size are 250,100,350 pb respectively. Use custom sequencing protocol. Use a mistmach rate of 0.15.
- How many sequences have been overlapped?
- How many sequences remain after dereplication?
- What amplicon size is obtained in the majority of merged sequences?

1.2 Clustering

Build Clustering using swarm - FROGS Clustering swarm
- => Use an aggregation distance of 1. Don’t use denoising option.
- The biom file shows the abundance of each cluster.
- The fasta file contains the cluster (OTU) representative sequences.
- The tsv file shows what sequences are contained in each cluster.

1.3 Stats on clustering (optional)

Obtain statistics about abundance of sequences in clusters - FROGS Clusters stat
- How many clusters were obtained by swarm?
- How many sequences are contained in the biggest cluster?
- How many clusters contain only one sequence?
- Observe the cumulative sequence proportion by cluster size
- Observe cluster sharing between samples through hierarchical clustering tree

1.4 Remove chimera

Remove chimera using biom obtained from swarm - FROGS Remove chimera
- What proportion of clusters were kept in this step?

1.5 OTU Filtering

Filters OTUs on several criteria. - FROGS Filters
- Eliminate OTUs with a low number of sequences (abundance at 0.005%) and keep OTUs present in at least two samples.
- How many OTUs were removed in this step?
- How many OTUs were removed because of low abundance?
Relaunch OTU Filtering but using abundance at 0.01%. How many OTUs were removed because of low abundance?

1.6 Stats on clustering (optional)

Rerun statistics of clusters after filtering - FROGS Clusters stat
- Look the effect of the cumulative proportion by cluster size.

1.7 Taxonomic affiliation

Perform taxonomic affiliation of each OTU by BLAST - FROGS Affiliation OTU
- Use the SILVA 132 16S database for taxonomic assignation by BLAST.
- Activate RDP assignation.
- How many OTU were taxonomically assigned to species?
- Visualize the biom file enriched with taxomonic information.

1.8 Affiliation stats

Obtain statistics of affiliation - FROGS Affiliation stat
- Use rarefaction ranks : Family Genus Species
- Observe global distribution of taxonomies by sample.
- Look the rarefaction curve, which is a measure of samples vs diversity.

1.9 BIOM format standardization

Retrieve a standardize biom file using - FROGS BIOM to std BIOM

You have now a standard BIOM file to phyloseq analysis.

1.10 Building a Tree

Build a tree with MAFFT and FastTree FROGS Tree using filter.fasta and filter.biom

1.11 Phyloseq stats in FROGSTAT

Import data in R FROGSSTAT Phyloseq Import using the standard BIOM file and the summary.txt file without normalisation.
Make taxonomic barcharts (kingdom level) FROGSSTAT Phyloseq Composition Visualisation using env_material as grouping variable and the R data objet.
Compute alpha diversity FROGSSTAT Phyloseq Alpha Diversity Calculate Observed, Chao1 and Shannon diversity indices. Use env_material as enviroment variable.
Compute beta diversity FROGSSTAT Phyloseq Beta Diversity. Use env_material as grouping variable and the R data objet and ‘Other methods’: cc, unifrac.
Build a head map plot and ordination FROGSSTAT Phyloseq Structure Visualisation : Use env_material as grouping variable, the R data objet and the beta-diversity unifrac.tsv output.
Hierarchical clustering of samples using Unifrac distance matrix FROGSSTAT Phyloseq Sample Clustering : Use env_material as grouping variable, the R data objet and the beta-diversity unifrac.tsv output.
Calculate a anova using unifrac distance matrix with FROGSSTAT Phyloseq Anova

1. 12 Workflow in Galaxy

Import a preformated FROGS workflow from Galaxy. Go to Shared Data / Workflows /FROGS and import it. This workflow contains the whole of steps used before. Be free of modified it and lauch it if you want.

Practice 2 : Launch FROGs in command line

Pipeline in bash format (for command line use):

https://github.com/SouthGreenPlatform/trainings/blob/gh-pages/files/run_frogs_pipelinev3.sh

Connection to account in IRD i-Trop cluster in ssh mode ssh formationX@bioinfo-master.ird.fr
Input data DATA_s.tar.gz and summary.txt are accessible from nas:/data2/formation/TPMetabarcoding/FROGS/ folder.
Create a TP-FROGS directory in your $HOME and go inside

mkdir ~/TP-FROGS 
cd ~/TP-FROGS

Download LaunchFROGs_v3.sh script and give execution rights

wget https://raw.githubusercontent.com/SouthGreenPlatform/trainings/gh-pages/files/launchFROGs_v3.sh 
chmod +x launchFROGs_v3.sh

Visualise LaunchFROGs_v3.sh script
Launch LaunchFROGsv3.sh in qsub mode. Give your user name to this script as parametter.

qsub ./launchFROGs_v3.sh formationX

Recovery output repertory and transfer it to your local machine using Fillezila or scp
- if scp mode transfert from your local machine terminal as follow

scp -r formationX@bioinfo-master.ird.fr:/home/formationX/OUTPUT_FROGSV3/ .

Otherwise, use this download link: https://elwe.rhrk.uni-kl.de/outgoing/OUTPUT_FROGSV3.zip (valid until 2019-06-01)

Practice 3 : Tutoriel Phyloseq Formation Metabarcoding

3.1 Setup your environment

Start with a clean session

rm(list=ls())

load the packages

require(phyloseq)
require(tidyverse)
require(reshape2)
require(gridExtra)
require(scales)
require(parallel)
require(permute)
require(lattice)

Let’s define the working directory on your local computer

setwd("your path")

3.2 Building a Phyloseq object

… and load the data generated using FROGS

load("11-phylo_import.Rdata")

Data is a phyloseq object

data

We can access the ‘OTU’ / sample occurence table with the follwing command

head(otu_table(data))

You can also use tidyr syntax to make your code net and tidy

data %>% 
  otu_table() %>%
    head()

in R, type ? and the function name to some help

?otu_table

Question1 : What is the sequencing depth of the samples ?

data %>% 
  otu_table() %>%
  colSums()

Phyloseq has some built-in functions to explore the data

data %>% sample_sums

Let’s plot the sorted sequencing depth

data %>% 
  otu_table() %>%
  colSums() %>%
  sort() %>% 
  barplot(las=2)

Question2 : How many reads are representing each of the first 10 OTU (i.e., swarm’s clusters) ?

sort(rowSums(otu_table(data)), decreasing = T)[1:10]

We can access the taxonomical information of the different OTU with the follwing command

data %>% 
  tax_table() %>%
  head()

Metadata are also stored in data phyloseq object

sample_data(data)$env_material

Phyloseq has some built-in functions

rank_names(data) # taxonimcal ranks
nsamples(data) # number of samples
ntaxa(data) # number of OTU
sample_variables(data) # metadata

3.3 Distribution per OTU and per sample

Let’s plot the sequence distribution per OTU and per sample First, create a dataframe with nreads : the sorted number of reads per OTU, sorted : the index of the sorted OTU and type : OTU

readsumsdf <- data.frame(nreads = sort(taxa_sums(data), TRUE),
                        sorted = 1:ntaxa(data), 
                        type = "OTU")

These are the first rows of our dataframe

readsumsdf %>% head()

We can plot this dataframe using ggplot

ggplot(readsumsdf, 
       aes(x = sorted, y = nreads)) + 
  geom_bar(stat = "identity") + 
  scale_y_log10()

Now we are going to create another dataframe with the sequencing depth per sample sample_sums()

readsumsdf2 <- data.frame(nreads = sort(sample_sums(data), TRUE), 
                          sorted = 1:nsamples(data), 
                          type = "Samples")

Let’s bind the two tables

readsumsdf3 <- rbind(readsumsdf,readsumsdf2)

Check the first rows

readsumsdf3 %>% head()

Check the last rows

readsumsdf3 %>% tail()

We can plot the data using ggplot and wrap the data according to type column (that’s why we specified OTU and Samples )

p  <-  ggplot(readsumsdf3, 
              aes(x = sorted, y = nreads)) + 
  geom_bar(stat = "identity")
p + ggtitle("Total number of reads before Preprocessing") + scale_y_log10() + facet_wrap(~type, 1, scales = "free")

3.4 Rarefaction curves

Let’s explore the rarefaction curves i.e., OTU richness vs sequencing depth

data %>%
  otu_table() %>%
  t() %>%
  vegan::rarecurve()

We can do something nicer with ggplot

source("https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/master/load-extra-functions.R")

p <- ggrare(data,
            step = 500,
            color = "env_material",
            plot = T,
            parallel = T,
            se = F)
p <- p + 
  facet_wrap(~ env_material ) + 
  geom_vline(xintercept = min(sample_sums(data)), 
             color = "gray60")
plot(p)

3.5 OTU table Filtering

We are now going to filter the OTU table

Explore the Taxonomy at the Kingdom level

tax_table(data)[,c("Kingdom")] %>% unique()

Remove untargeted OTU (we consider Unclassified OTU at the Kingdom level as noise) using subset_taxa

data <- subset_taxa(data, 
                    Kingdom != "Unclassified" &
                    Order !="Chloroplast" &
                    Family != "Mitochondria")

Remove low occurence / abundance OTU i.e., more than 10 sequences in total and appearing in more than 1 sample

data <-  filter_taxa(data, 
                     function(x) sum(x >= 10) > (1), 
                     prune =  TRUE)

3.6 Normalisation to minumum sequencing depth

Rarefy to en even sequencing depth (i.e., min(colSums(otu_table(data)))

data_rare <- rarefy_even_depth(data, 
                              sample.size = min(colSums(otu_table(data))), 
                              rngseed = 63)

Rarefaction curves on filtered data

p <- ggrare(data_rare, step = 50, color = "env_material", plot = T, parallel = T, se = F)
p

One can export the filtered OTU table

write.csv(cbind(data.frame(otu_table(data_rare)),
                tax_table(data_rare)), 
          file="filtered_otu_table.csv")

3.7 alpha-diversity

We can now explore the alpha-dviersity on the filtered and rarefied data

p <- plot_richness(data_rare, 
                   x="sample", 
                   color="env_material", 
                   measures=c("Observed","Shannon","ACE"), 
                   nrow = 1)
print(p)

That plot could be nicer Data to plot are stored in p$data

p$data %>% head()

boxplot using ggplot

ggplot(p$data,aes(env_material,value,colour=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=1)

More Complex

ggplot(p$data,aes(env_material,value,colour=env_material,shape=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=0.8, 
               position = position_dodge(width=0.9)) + 
  geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
  ylab("Diversity index")  + xlab(NULL) + theme_bw()

Export the alpha div values into a dataframe in short format

rich.plus <- dcast(p$data,  samples + env_material ~ variable)
write.csv(rich.plus, file="alpha_div.csv")

Alpha-div Stats using TukeyHSD on ANOVA

TukeyHSD_Observed <- TukeyHSD(aov(Observed ~ env_material, data =  rich.plus))
TukeyHSD_Observed_df <- data.frame(TukeyHSD_Observed$env_material)
TukeyHSD_Observed_df$measure = "Observed"
TukeyHSD_Observed_df$shapiro_test_pval = (shapiro.test(residuals(aov(Observed ~ env_material, data =  rich.plus))))$p.value
TukeyHSD_Observed_df

3.8 beta-diversity

Compute dissimilarity

data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
  otu_table() %>%
  t() %>%
  sqrt() %>%
  as.data.frame() %>%
  vegdist(binary=F, method = "bray") -> dist

Ordination

run PCoA ordination on the generated distance

ord <- ordinate(data_rare,"PCoA",dist)

Samples coordinate on the PCoA vecotrs are stored in but plot_ordination can make use of ord object easily

ord$vectors

plot_ordination(data_rare, 
                ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA sqrt Bray curtis", 
                label= "SampleID" ) + 
  geom_point(aes(size=rich.plus$Observed)) +
  theme_bw()

Let’s see if the observed pattern is significant using PERMANOVA i.e., adonis function from vegan

adonis(dist ~ get_variable(data_rare, "env_material"), permutations = 1000)$aov.tab

dispersion

Let’s see if there are difference in dispersion (i.e., variance)

boxplot(betadisper(dist, 
                   get_variable(data_rare, "env_material")),las=2, 
        main=paste0("Multivariate Dispersion Test Bray-Curtis "," pvalue = ", 
                    permutest(betadisper(dist, get_variable(data_rare, "env_material")))$tab$`Pr(>F)`[1]))

ANOSIM

ANOSIM test can also test for differences among group

plot(anosim(dist, get_variable(data_rare, "env_material"))
     ,main="ANOSIM Bray-Curtis "
     ,las=2)

3.9 Composition plot

Now, we would like to plot the distribution of phylum transformed in %

data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
  plot_bar(fill="Phylum") +
  facet_wrap(~ env_material, scales = "free_x", nrow = 1) +
  ggtitle("Bar plot colored by Phylum ") +
  theme(plot.title = element_text(hjust = 0.5))

We can generate a nicer plot using plot_composition function

p <- plot_composition(data_rare,
                      taxaRank1 = "Kingdom",
                      taxaSet1 ="Bacteria",
                      taxaRank2 = "Phylum", 
                      numberOfTaxa = 20, 
                      fill= "Phylum") +
  facet_wrap(~env_material, scales = "free_x", nrow = 1) + 
  theme(plot.title = element_text(hjust = 0.5))

3.10 Some exercises

1 - How many OTUs belong to Archaea (in two commands using %>%) ?

Answer :

data %>%
  subset_taxa(Kingdom == "Archaea") %>% 
  ntaxa()

2 - Plot OTU richness (and only richness = ‘Observed’ in phyloseq) of Alphaproteobacteria among samples

Answer :

subset_taxa(data, 
            Class == "Alphaproteobacteria") %>%
  plot_richness() -> p

ggplot(subset(p$data,variable=="Observed"),
       aes(env_material,value,colour=env_material,shape=env_material)) +
  facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
  geom_boxplot(outlier.colour = NA,alpha=0.8, 
               position = position_dodge(width=0.9)) + 
  geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
  ylab("Diversity index")  + xlab(NULL) + theme_bw()

3 - Explore beta diversity of Alphaproteobacteria using “morisita” distance without data transformation and without considering endosphere samples (subset_samples). Are sample from Bulk_Soil and Rhizosphere different in terms of beta-diversity (use %in% c(“Soil”, “Prank”) in order to subset from several categories

Answer :

data %>% 
  subset_taxa(Class == "Alphaproteobacteria") %>%
  subset_samples(env_material != "Endosphere") -> exo2

exo2 %>%
  otu_table() %>%
  t() %>%
  vegdist(binary=F, method = "morisita") -> dist

ord <- ordinate(exo2,"PCoA",dist)

ord

plot_ordination(physeq = exo2, ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA  Morisita",
                label= "SampleID" ) + theme_bw()

data %>% 
  subset_taxa(Class == "Alphaproteobacteria") %>%
  subset_samples(env_material %in% c("Bulk_Soil", "Rhizosphere")) -> test

sample_data(test)
  
test %>% 
  otu_table() %>%
  t() %>%
  vegdist(binary=F, method = "morisita") -> dist

adonis(dist ~ get_variable(test, "env_material"), permutations = 1000)$aov.tab

4 - Plot proportion Chloroplasts

Answer :

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Order == "Chloroplast") %>%
  plot_bar(fill="Order") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")

5 - Plot proportion of OTU belonging to Mitochondria and facet the plot according to Site (i.e., env_material).

Some surprises ?

Answer:

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Family == "Mitochondria") %>%
  plot_bar(fill="OTU") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")

6 - beta-diversity

6a. Plot beta-diversity of Mitochondria and Chloroplasts OTU using Bray-Curtis distance on untransformed table

6b. what is the percentage of Mitochondria and Chloroplasts OTU

6c. plot a basic barplot of it

Answer:

load("11-phylo_import.Rdata")

data %>% 
  transform_sample_counts(function(x) x/sum(x)) %>%
  subset_taxa(Family == "Mitochondria" | Order == "Chloroplast") -> cont

sample_sums(cont) %>%
  sort(decreasing = T) %>%
  barplot(las=2)

cont %>%
  otu_table() %>%
  t() %>%
  vegan::vegdist(binary=F, method = "bray") -> dist
  
ord <- ordinate(cont,"PCoA",dist)

ord

plot_ordination(physeq = cont, ord,
                color = "env_material", 
                shape="env_material", 
                title = "PCoA  bray",
                label= "SampleID") + theme_bw()

7 - Do the filterd-out OTU display alpha / beta diversity patterns?

License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).

Summary

Practice 1 : Obtaining an OTU table with FROGS in Galaxy

1.1 Preprocess

1.2 Clustering

1.3 Stats on clustering (optional)

1.4 Remove chimera

1.5 OTU Filtering

1.6 Stats on clustering (optional)

1.7 Taxonomic affiliation

1.8 Affiliation stats

1.9 BIOM format standardization

1.10 Building a Tree

1.11 Phyloseq stats in FROGSTAT

1. 12 Workflow in Galaxy

Practice 2 : Launch FROGs in command line

Practice 3 : Tutoriel Phyloseq Formation Metabarcoding

3.1 Setup your environment

3.2 Building a Phyloseq object

3.3 Distribution per OTU and per sample

3.4 Rarefaction curves

3.5 OTU table Filtering

3.6 Normalisation to minumum sequencing depth

3.7 alpha-diversity

3.8 beta-diversity

Ordination

dispersion

ANOSIM

3.9 Composition plot

3.10 Some exercises

Links

License