Description | Hands On Lab Exercises for Metabarcoding |
---|---|
Authors | J Orjuela (julie.orjuela@ird.fr), A Dereeper (alexis.dereeper@ird.fr), F Constancias (florentin.constancias@cirad.fr), J Reveilleud (JR) (julie.reveillaud@inra.fr), M Simonin (marie.simonin@ird.fr), F Mahé (frederic.mahe@cirad.fr), A Comte (aurore.comte@ird.fr) |
Creation Date | 18/04/2018 |
Last Modified Date | 22/05/2019 |
Summary
- Practice 1: Obtaining an OTU table with FROGS in Galaxy
- Practice 1.1: Preprocessing
- Practice 1.2: Clustering
- Practice 1.3: Stats on clustering (optional)
- Practice 1.4: Remove chimera
- Practice 1.5 OTU Filtering
- Practice 1.6: Stats on clustering (optional)
- Practice 1.7: Taxonomic affiliation
- Practice 1.8: Affiliation stats
- Practice 1.9: BIOM format standardization
- Practice 1.10: Building a Tree
- Practice 1.12: Workflow in Galaxy
- Practice 2: FROGs in command line
- Practice 3: Handling and visualizing OTU table using PhyloSeq R package
- Links
- License
![]() |
![]() |
![]() |
Practice 1 : Obtaining an OTU table with FROGS in Galaxy
In this training we will first performed metabarcoding analysis with the FROGS v3.1 pipeline in the Galaxy environment https://github.com/geraldinepascal/FROGS
. In a second time, we will perform similar analysis in command line on HPC i-Trop cluster.
- Connect to Galaxy i-Trop with formationN account.
- Create a new history and import Metabarcoding sample datasets (paired-end fastq files compressed by tar ) from
Shared Data / Data libraries /formation Galaxy 2019 / Metabarcoding
. RecoveryDATA_s.tar.gz
andSummary.txt
- Fastq files used here are a subset of reads obtained in a metagenomic study of Edwards et al 2015 containing 4 soil compartments: Rhizosphere, Rhizoplane, Endosphere and Bulk_Soil of a rice culture.
We will launch every step of a metabarcoding analysis as follow :
1.1 Preprocess
- Merge paired reads and dereplicate using the Preprocessing tool with FLASH as merge software -
FROGS Pre-process
- => Read size is 250 pb, expected, minimum and maximun amplicon size are 250,100,350 pb respectively. Use custom sequencing protocol. Use a mistmach rate of 0.15.
- How many sequences have been overlapped?
- How many sequences remain after dereplication?
- What amplicon size is obtained in the majority of merged sequences?
1.2 Clustering
- Build Clustering using swarm -
FROGS Clustering swarm
- => Use an aggregation distance of 1. Don’t use denoising option.
- The biom file shows the abundance of each cluster.
- The fasta file contains the cluster (OTU) representative sequences.
- The tsv file shows what sequences are contained in each cluster.
1.3 Stats on clustering (optional)
- Obtain statistics about abundance of sequences in clusters -
FROGS Clusters stat
- How many clusters were obtained by swarm?
- How many sequences are contained in the biggest cluster?
- How many clusters contain only one sequence?
- Observe the cumulative sequence proportion by cluster size
- Observe cluster sharing between samples through hierarchical clustering tree
1.4 Remove chimera
- Remove chimera using biom obtained from swarm -
FROGS Remove chimera
- What proportion of clusters were kept in this step?
1.5 OTU Filtering
- Filters OTUs on several criteria. -
FROGS Filters
- Eliminate OTUs with a low number of sequences (abundance at 0.005%) and keep OTUs present in at least two samples.
- How many OTUs were removed in this step?
- How many OTUs were removed because of low abundance?
- Relaunch OTU Filtering but using abundance at 0.01%. How many OTUs were removed because of low abundance?
1.6 Stats on clustering (optional)
- Rerun statistics of clusters after filtering -
FROGS Clusters stat
- Look the effect of the cumulative proportion by cluster size.
1.7 Taxonomic affiliation
- Perform taxonomic affiliation of each OTU by BLAST -
FROGS Affiliation OTU
- Use the SILVA 132 16S database for taxonomic assignation by BLAST.
- Activate RDP assignation.
- How many OTU were taxonomically assigned to species?
- Visualize the biom file enriched with taxomonic information.
1.8 Affiliation stats
- Obtain statistics of affiliation -
FROGS Affiliation stat
- Use rarefaction ranks : Family Genus Species
- Observe global distribution of taxonomies by sample.
- Look the rarefaction curve, which is a measure of samples vs diversity.
1.9 BIOM format standardization
Retrieve a standardize biom file using - FROGS BIOM to std BIOM
- You have now a standard BIOM file to phyloseq analysis.
1.10 Building a Tree
- Build a tree with MAFFT and FastTree
FROGS Tree
using filter.fasta and filter.biom
1.11 Phyloseq stats in FROGSTAT
-
Import data in R
FROGSSTAT Phyloseq Import
using the standard BIOM file and thesummary.txt
file without normalisation. -
Make taxonomic barcharts (kingdom level)
FROGSSTAT Phyloseq Composition Visualisation
usingenv_material
as grouping variable and the R data objet. -
Compute alpha diversity
FROGSSTAT Phyloseq Alpha Diversity
Calculate Observed, Chao1 and Shannon diversity indices. Useenv_material
as enviroment variable. -
Compute beta diversity
FROGSSTAT Phyloseq Beta Diversity
. Useenv_material
as grouping variable and the R data objet and ‘Other methods’: cc, unifrac. -
Build a head map plot and ordination
FROGSSTAT Phyloseq Structure Visualisation
: Useenv_material
as grouping variable, the R data objet and the beta-diversity unifrac.tsv output. -
Hierarchical clustering of samples using Unifrac distance matrix
FROGSSTAT Phyloseq Sample Clustering
: Useenv_material
as grouping variable, the R data objet and the beta-diversity unifrac.tsv output. -
Calculate a anova using unifrac distance matrix with
FROGSSTAT Phyloseq Anova
1. 12 Workflow in Galaxy
Import a preformated FROGS workflow from Galaxy. Go to Shared Data / Workflows /FROGS
and import it. This workflow contains the whole of steps used before. Be free of modified it and lauch it if you want.
Practice 2 : Launch FROGs in command line
Pipeline in bash format (for command line use):
https://github.com/SouthGreenPlatform/trainings/blob/gh-pages/files/run_frogs_pipelinev3.sh
-
Connection to account in IRD i-Trop cluster in ssh mode
ssh formationX@bioinfo-master.ird.fr
-
Input data
DATA_s.tar.gz
andsummary.txt
are accessible from nas:/data2/formation/TPMetabarcoding/FROGS/ folder. -
Create a TP-FROGS directory in your $HOME and go inside
mkdir ~/TP-FROGS
cd ~/TP-FROGS
- Download
LaunchFROGs_v3.sh
script and give execution rights
wget https://raw.githubusercontent.com/SouthGreenPlatform/trainings/gh-pages/files/launchFROGs_v3.sh
chmod +x launchFROGs_v3.sh
-
Visualise
LaunchFROGs_v3.sh
script -
Launch
LaunchFROGsv3.sh
in qsub mode. Give your user name to this script as parametter.
qsub ./launchFROGs_v3.sh formationX
-
Recovery output repertory and transfer it to your local machine using Fillezila or
scp
- if
scp
mode transfert from your local machine terminal as follow
- if
scp -r formationX@bioinfo-master.ird.fr:/home/formationX/OUTPUT_FROGSV3/ .
Otherwise, use this download link: https://elwe.rhrk.uni-kl.de/outgoing/OUTPUT_FROGSV3.zip (valid until 2019-06-01)
Practice 3 : Tutoriel Phyloseq Formation Metabarcoding
3.1 Setup your environment
Start with a clean session
rm(list=ls())
load the packages
require(phyloseq)
require(tidyverse)
require(reshape2)
require(gridExtra)
require(scales)
require(parallel)
require(permute)
require(lattice)
Let’s define the working directory on your local computer
setwd("your path")
3.2 Building a Phyloseq object
… and load the data generated using FROGS
load("11-phylo_import.Rdata")
Data is a phyloseq object
data
We can access the ‘OTU’ / sample occurence table with the follwing command
head(otu_table(data))
You can also use tidyr syntax to make your code net and tidy
data %>%
otu_table() %>%
head()
in R, type ? and the function name to some help
?otu_table
Question1 : What is the sequencing depth of the samples ?
data %>%
otu_table() %>%
colSums()
Phyloseq has some built-in functions to explore the data
data %>% sample_sums
Let’s plot the sorted sequencing depth
data %>%
otu_table() %>%
colSums() %>%
sort() %>%
barplot(las=2)
Question2 : How many reads are representing each of the first 10 OTU (i.e., swarm’s clusters) ?
sort(rowSums(otu_table(data)), decreasing = T)[1:10]
We can access the taxonomical information of the different OTU with the follwing command
data %>%
tax_table() %>%
head()
Metadata are also stored in data phyloseq object
sample_data(data)$env_material
Phyloseq has some built-in functions
rank_names(data) # taxonimcal ranks
nsamples(data) # number of samples
ntaxa(data) # number of OTU
sample_variables(data) # metadata
3.3 Distribution per OTU and per sample
Let’s plot the sequence distribution per OTU and per sample First, create a dataframe with nreads : the sorted number of reads per OTU, sorted : the index of the sorted OTU and type : OTU
readsumsdf <- data.frame(nreads = sort(taxa_sums(data), TRUE),
sorted = 1:ntaxa(data),
type = "OTU")
These are the first rows of our dataframe
readsumsdf %>% head()
We can plot this dataframe using ggplot
ggplot(readsumsdf,
aes(x = sorted, y = nreads)) +
geom_bar(stat = "identity") +
scale_y_log10()
Now we are going to create another dataframe with the sequencing depth per sample sample_sums()
readsumsdf2 <- data.frame(nreads = sort(sample_sums(data), TRUE),
sorted = 1:nsamples(data),
type = "Samples")
Let’s bind the two tables
readsumsdf3 <- rbind(readsumsdf,readsumsdf2)
Check the first rows
readsumsdf3 %>% head()
Check the last rows
readsumsdf3 %>% tail()
We can plot the data using ggplot and wrap the data according to type column (that’s why we specified OTU and Samples )
p <- ggplot(readsumsdf3,
aes(x = sorted, y = nreads)) +
geom_bar(stat = "identity")
p + ggtitle("Total number of reads before Preprocessing") + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
3.4 Rarefaction curves
Let’s explore the rarefaction curves i.e., OTU richness vs sequencing depth
data %>%
otu_table() %>%
t() %>%
vegan::rarecurve()
We can do something nicer with ggplot
source("https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/master/load-extra-functions.R")
p <- ggrare(data,
step = 500,
color = "env_material",
plot = T,
parallel = T,
se = F)
p <- p +
facet_wrap(~ env_material ) +
geom_vline(xintercept = min(sample_sums(data)),
color = "gray60")
plot(p)
3.5 OTU table Filtering
We are now going to filter the OTU table
Explore the Taxonomy at the Kingdom level
tax_table(data)[,c("Kingdom")] %>% unique()
Remove untargeted OTU (we consider Unclassified OTU at the Kingdom level as noise) using subset_taxa
data <- subset_taxa(data,
Kingdom != "Unclassified" &
Order !="Chloroplast" &
Family != "Mitochondria")
Remove low occurence / abundance OTU i.e., more than 10 sequences in total and appearing in more than 1 sample
data <- filter_taxa(data,
function(x) sum(x >= 10) > (1),
prune = TRUE)
3.6 Normalisation to minumum sequencing depth
Rarefy to en even sequencing depth (i.e., min(colSums(otu_table(data)))
data_rare <- rarefy_even_depth(data,
sample.size = min(colSums(otu_table(data))),
rngseed = 63)
Rarefaction curves on filtered data
p <- ggrare(data_rare, step = 50, color = "env_material", plot = T, parallel = T, se = F)
p
One can export the filtered OTU table
write.csv(cbind(data.frame(otu_table(data_rare)),
tax_table(data_rare)),
file="filtered_otu_table.csv")
3.7 alpha-diversity
We can now explore the alpha-dviersity on the filtered and rarefied data
p <- plot_richness(data_rare,
x="sample",
color="env_material",
measures=c("Observed","Shannon","ACE"),
nrow = 1)
print(p)
That plot could be nicer Data to plot are stored in p$data
p$data %>% head()
boxplot using ggplot
ggplot(p$data,aes(env_material,value,colour=env_material)) +
facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
geom_boxplot(outlier.colour = NA,alpha=1)
More Complex
ggplot(p$data,aes(env_material,value,colour=env_material,shape=env_material)) +
facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
geom_boxplot(outlier.colour = NA,alpha=0.8,
position = position_dodge(width=0.9)) +
geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
ylab("Diversity index") + xlab(NULL) + theme_bw()
Export the alpha div values into a dataframe in short format
rich.plus <- dcast(p$data, samples + env_material ~ variable)
write.csv(rich.plus, file="alpha_div.csv")
Alpha-div Stats using TukeyHSD on ANOVA
TukeyHSD_Observed <- TukeyHSD(aov(Observed ~ env_material, data = rich.plus))
TukeyHSD_Observed_df <- data.frame(TukeyHSD_Observed$env_material)
TukeyHSD_Observed_df$measure = "Observed"
TukeyHSD_Observed_df$shapiro_test_pval = (shapiro.test(residuals(aov(Observed ~ env_material, data = rich.plus))))$p.value
TukeyHSD_Observed_df
3.8 beta-diversity
Compute dissimilarity
data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
otu_table() %>%
t() %>%
sqrt() %>%
as.data.frame() %>%
vegdist(binary=F, method = "bray") -> dist
Ordination
run PCoA ordination on the generated distance
ord <- ordinate(data_rare,"PCoA",dist)
Samples coordinate on the PCoA vecotrs are stored in but plot_ordination can make use of ord object easily
ord$vectors
plot_ordination(data_rare,
ord,
color = "env_material",
shape="env_material",
title = "PCoA sqrt Bray curtis",
label= "SampleID" ) +
geom_point(aes(size=rich.plus$Observed)) +
theme_bw()
Let’s see if the observed pattern is significant using PERMANOVA i.e., adonis function from vegan
adonis(dist ~ get_variable(data_rare, "env_material"), permutations = 1000)$aov.tab
dispersion
Let’s see if there are difference in dispersion (i.e., variance)
boxplot(betadisper(dist,
get_variable(data_rare, "env_material")),las=2,
main=paste0("Multivariate Dispersion Test Bray-Curtis "," pvalue = ",
permutest(betadisper(dist, get_variable(data_rare, "env_material")))$tab$`Pr(>F)`[1]))
ANOSIM
ANOSIM test can also test for differences among group
plot(anosim(dist, get_variable(data_rare, "env_material"))
,main="ANOSIM Bray-Curtis "
,las=2)
3.9 Composition plot
Now, we would like to plot the distribution of phylum transformed in %
data_rare %>% transform_sample_counts(function(x) x/sum(x)) %>%
plot_bar(fill="Phylum") +
facet_wrap(~ env_material, scales = "free_x", nrow = 1) +
ggtitle("Bar plot colored by Phylum ") +
theme(plot.title = element_text(hjust = 0.5))
We can generate a nicer plot using plot_composition function
p <- plot_composition(data_rare,
taxaRank1 = "Kingdom",
taxaSet1 ="Bacteria",
taxaRank2 = "Phylum",
numberOfTaxa = 20,
fill= "Phylum") +
facet_wrap(~env_material, scales = "free_x", nrow = 1) +
theme(plot.title = element_text(hjust = 0.5))
3.10 Some exercises
- 1 - How many OTUs belong to Archaea (in two commands using
%>%
) ?
Answer :
data %>%
subset_taxa(Kingdom == "Archaea") %>%
ntaxa()
- 2 - Plot OTU richness (and only richness = ‘Observed’ in phyloseq) of Alphaproteobacteria among samples
Answer :
subset_taxa(data,
Class == "Alphaproteobacteria") %>%
plot_richness() -> p
ggplot(subset(p$data,variable=="Observed"),
aes(env_material,value,colour=env_material,shape=env_material)) +
facet_grid(variable ~ env_material, drop=T,scale="free",space="fixed") +
geom_boxplot(outlier.colour = NA,alpha=0.8,
position = position_dodge(width=0.9)) +
geom_point(size=2,position=position_jitterdodge(dodge.width=0.9)) +
ylab("Diversity index") + xlab(NULL) + theme_bw()
- 3 - Explore beta diversity of Alphaproteobacteria using “morisita” distance without data transformation and without considering endosphere samples (subset_samples). Are sample from Bulk_Soil and Rhizosphere different in terms of beta-diversity (use %in% c(“Soil”, “Prank”) in order to subset from several categories
Answer :
data %>%
subset_taxa(Class == "Alphaproteobacteria") %>%
subset_samples(env_material != "Endosphere") -> exo2
exo2 %>%
otu_table() %>%
t() %>%
vegdist(binary=F, method = "morisita") -> dist
ord <- ordinate(exo2,"PCoA",dist)
ord
plot_ordination(physeq = exo2, ord,
color = "env_material",
shape="env_material",
title = "PCoA Morisita",
label= "SampleID" ) + theme_bw()
data %>%
subset_taxa(Class == "Alphaproteobacteria") %>%
subset_samples(env_material %in% c("Bulk_Soil", "Rhizosphere")) -> test
sample_data(test)
test %>%
otu_table() %>%
t() %>%
vegdist(binary=F, method = "morisita") -> dist
adonis(dist ~ get_variable(test, "env_material"), permutations = 1000)$aov.tab
- 4 - Plot proportion Chloroplasts
Answer :
load("11-phylo_import.Rdata")
data %>%
transform_sample_counts(function(x) x/sum(x)) %>%
subset_taxa(Order == "Chloroplast") %>%
plot_bar(fill="Order") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")
- 5 - Plot proportion of OTU belonging to Mitochondria and facet the plot according to Site (i.e., env_material).
Some surprises ?
Answer:
load("11-phylo_import.Rdata")
data %>%
transform_sample_counts(function(x) x/sum(x)) %>%
subset_taxa(Family == "Mitochondria") %>%
plot_bar(fill="OTU") + facet_grid(. ~ env_material, drop=T,scale="free",space="fixed")
- 6 - beta-diversity
6a. Plot beta-diversity of Mitochondria and Chloroplasts OTU using Bray-Curtis distance on untransformed table
6b. what is the percentage of Mitochondria and Chloroplasts OTU
6c. plot a basic barplot of it
Answer:
load("11-phylo_import.Rdata")
data %>%
transform_sample_counts(function(x) x/sum(x)) %>%
subset_taxa(Family == "Mitochondria" | Order == "Chloroplast") -> cont
sample_sums(cont) %>%
sort(decreasing = T) %>%
barplot(las=2)
cont %>%
otu_table() %>%
t() %>%
vegan::vegdist(binary=F, method = "bray") -> dist
ord <- ordinate(cont,"PCoA",dist)
ord
plot_ordination(physeq = cont, ord,
color = "env_material",
shape="env_material",
title = "PCoA bray",
label= "SampleID") + theme_bw()
- 7 - Do the filterd-out OTU display alpha / beta diversity patterns?
Links
- Related courses :
License
data:image/s3,"s3://crabby-images/efd20/efd208514e1e7ec7016e02517156a8abb0fbc1f7" alt=""