South Green Trainings pages

Description	Hands On Lab Exercises for RNASeq
Related-course materials	Transcriptomique
Authors	Julie Orjuela (julie.orjuela@ird.fr), Gautier Sarah (gautier.sarah@inrae.fr), Catherine Breton (c.breton@cgiar.org), Aurore Comte (aurore.comte@ird.fr), Alexis Dereeper (alexis.dereeper@ird.fr), Sebastien Ravel (sebastien.ravel@cirad.fr), Sebastien Cunnac (sebastien.cunnac@ird.fr) Alexandre Soriano (alexandre.soriano@cirad.fr)
Creation Date	15/03/2018
Last Modified Date	31/06/2023

Summary

Practice 1: Reads QC and cleaning with fastp
Practice 2: Mapping against annotated genome reference with STAR
Practice 3: Counting with htseq-count
Practice 4: Differential expression analysis using DIANE
Practice 5: Compare list of DE genes with DIANE EdgeR and DESeq2
Practice 6: Hierarchical Clustering
Practice 7: Visualization of mapped reads agains genes using IGV
Links
License

Preambule. Dataset used during this pratice -

Dataset used in this practical comes from

Origine:

ref : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488244/
data : NCBI SRA database under accession number SRS307298 S. cerevisiae.
Genome size of S. cerevisiae : 12M (12.157.105) (https://www.yeastgenome.org/strain/S288C#genome_sequence)

In this session, we will analyze RNA-seq data from one sample of S. cerevisiae (NCBI SRA SRS307298). It is from two different origin (CENPK and Batch), with three biological replications for each origin (rep1, rep2 and rep3).

Practice 1 : Reads QC and cleaning with fastp

Practice1 will be performed on the biosphere VM using Jupyter notebook

We will perform the reads cleaning

Connect to the VM you previously launched
Connect to the terminal
Download the notebook

wget --no-check-certificate hpc.cirad.fr/rnaseq/Practice.ipynb

You can also download the correction

wget --no-check-certificate hpc.cirad.fr/rnaseq/Correction.ipynb

Choose one sample between SRR453566_500k and SRR453571_500k
Run the downloads cells
Run Fastqc
Check data quality
Run fastp
Run Fastqc again
Check cleaned data quality

Practice 2 : Mapping against annotated genome reference with STAR

Running STAR

First you need to create the genome index
Create the index genome index. What file do you need to create the index?
Run the mapping with STAR
Sort your BAM on read name

Practice 3 : Counting with htseq-count

Run htseq-count on the VM
Have a look to the count table generated

Practice 4 : Differential expression analysis DIANE

Practice 4 : Differential expression analysis DIANE

1.1. Introduction

As you have seen, RNA sequencing is an experimental method for measuring RNA expression levels via high-throughput sequencing. The number of reads that map to a gene is the proxy for its expression level. There are many steps between receiving the raw reads from the sequencer and gaining biological insight. In this part, we will try to find differentially expressed genes between pairs of conditions, using an online tool : DIANE

1.2. A bit of informations about DIANE

DIANE (Dashboard for the Inference and Analysis of Networks from Expression data) is a web application written in R-shiny for the analysis of high throughput gene expression data (RNA-Seq) using state of the art methods. Its function is to extract important regulatory pathways involved in the response to environmental changes, or any perturbation inducing genomic modifications. For advanced users, DIANE can also be used through R command line, as the application is written as an R package.

To cite DIANE in publications use:

Cassan, O., Lèbre, S. & Martin, A. Inferring and analyzing gene regulatory networks from multi-factorial expression data: a complete and interactive suite. BMC Genomics 22, 387 (2021). https://doi.org/10.1186/s12864-021-07659-2

Authors : Océane Cassan, Antoine Martin, Sophie Lèbre

Main developper : Océane Cassan, PhD Student at IPSIM (Institute for Plant Sciences in Montpellier) research unit, SUPAGRO Montpellier (now Postdoc researcher in statistical learning applied to gene regulation in LIRMM - Laboratory of Computer Science, Robotics and Microelectronics of Montpellier), with contributions from Alexandre Soriano.

1.3. Test dataset presentation

The chosen dataset contains the transcriptome of Arabidopsis thaliana plants exposed to global warming related conditions. It was generated for the article “Molecular plant responses to combined abiotic stresses put a spotlight on unknown and abundant genes”, by Sewelam et Al. in Journal of experimental Botany, 2020. (https://academic.oup.com/jxb/advance-article/doi/10.1093/jxb/eraa250/5842162#204457392).

The experimental perturbations studied are heat, salinity and drought in culture soil. Combinations of these perturbations have also been generated.

This is the “demo dataset” included in DIANE, but the data have been reanalyzed using the pipeline presented in this formation, using an expression matrix generated using the tools presented previously.

1.4. Get the input data

Download the data from hpc.cirad.fr/rnaseq_southgreen_2023/data_DIANE_southgreen2023.zip, and unzip the file.

Take a look at the count matrix. Notice that the rows contain genes and the column the different conditions. The names of the conditions follows a specific pattern, which is needed for DIANE : the name of the condition, followed by a “_”, followed by the replicate number. Your can refer to the corresponding slide for a coloured exemple.

for example, for the first replicate of a condition named “reference”, the sample name must look like : reference_1

The first column must also be named “Gene”

1.5. Open the app

Open the application : http://shinyapps.southgreen.fr/app/dianebetapca

note : this is a beta version including some new functionalities. For the legacy version, you can use https://diane.ipsim.inrae.fr.

1.6. discover the user interface

DIANE user interface is divided in two parts. In the left part of the screen, you can see some tabs, each one allowing you to perform a specific analysis using state of the art methods. When you click on one of these tabs, you will be able to carry the corresponding analysis. You will see some green question marks (“?”) at some positions. Click on them to learn some information about a specific part of the analysis process.

1.7. Data import

1.7.1. Input your expression matrix

This is the only mandatory file for any analysis with DIANE.

First, click on the “data import” tab, in the upper left part of the screen. (This is very important, otherwise the count matrix that you will input will not be used.)
click on “Toggle to import your data” in the “expression file upload” box
Select the organism. Here we are working with “Arabidopsis Thaliana”.
click on “browse” under the “Choose CSV/TXT expression file” field. Choose the count matrix you just downloaded.

Now your data should be loaded. A preview of your expression matrix should be displayed in the “Preview of the expression matrix” box. You can also check if the number of genes, conditions and samples is

Note : there is a “other” organism available. Indeed, All the functions of DIANE can be used with any organism. For some functionalities, you’ll have to import files containing information, like functional annotations, for your custom organism. We can discuss this point if you are interested in using DIANE with a non integrated organism.

1.7.2. Input Design Information (facultative)

The design information is used in some steps of the analysis process to provide additional information. It can be interesting when working with complex datasets, such as the one we are using. Note that only a few parts of the analysis make use of this file, you can work without uploading one.

Take a look at the “design.csv” file, downloaded previously.

To import this file inside of DIANE :

click on “browse” in the “Design and gene information files” box, and select the “design.csv” file.

The content of the file should then appear just below.

1.8. Normalization

Now that your data is loaded, we will perform some analysis. But first, we need to normalize our count matrix. To do so, go to the “Normalisation” tab (the third tab in the left part of the screen).

Two normalization methods are included in DIANE. TMM, from the package “edgeR” , and DESeq2, from the package “DESeq2”. These two methods are known to be robust, and should give similar results. You can also tick “prior removal of differentially expressed genes”, to enable a special method that tries to remove differentially expressed genes before computing normalization factors. The two implemented normalization methods indeed rely on the presumption that most of the genes in our dataset are not differentially expressed, and this special method could help reduce biases in some conditions. So select a normalization method and click on “normalize” to proceed with the normalization.

After that, it is advised to remove genes with very low abundance. It is a common practice in RNA-Seq analysis pipelines for several reasons :

They have little biological significance, and could be caused either by noise or mapping errors.
The statistical modeling we are planning to perform next is not well suited for low counts, as they make the mean-variance relationship harder to estimate.

The default value in DIANE is 10 * number_of_conditions. Click on “filter” in the “low count filtering” part to remove low count genes.

Take a look at the plot in the right part of the screen. Is it homogenous ?

1.9. Exploratory analysis

Go to the “Exploratory analysis” tab (the fourth tab in the left part of the screen). You will see a plot representing some PCA (Principal Component Analysis). During PCA, new variables are computed, as linear combinations of your initial variables (e.g. experimental conditions). Those new variables, also called principal components, are designed to carry the maximum of the data variability.

Look at the “PCA summary”. If you want to see the plots one by one, you can select the “Specific PCA plot” tab in the upper part of the screen. These plots should give you hints about the conditions that have a noticeable impact on gene expression.

Look at the first 3 PCA : Which conditions seem to be correlated with Component 1 ? Component 2 ? Component 3 ?

Look at the PCA correlation plot, in the tab “PCA correlation plot”. Does this plot agree with your observations ?

1.10. Differential expression analysis

To detect the genes that have significant changes in their expression caused by experimental perturbations, we use the edgeR package, based on negative binomial models.

The first step is thus to estimate the gene dispersions, which is achieved by pooling genes with similar expression levels, and using empirical Bayes strategies. We will perform a differential expression analysis to see which genes are differentially expressed between two conditions. We will here compare the control and the heat condition. To do so :

Go to Differential Expression tab.
Choose the conditions to compare. Here we want to compare the control (C) and the heat stressed plants (H)
click on detect differentially expressed genes

The results are presented in a table, ordered by adjusted pvalues (FDR). The table contains for each gene with a FDR inferior to a specific threshold and a fold change superior to a specific threshold :

the log2 of the fold changes (logFC)
the average expression in log2 of counts per million (logCPM)
the adjusted pvalue (FDR)
some information about the genes. These informations are available for integrated organisms, and must be imported for custom organisms.

How many genes are detected as differentially expressed ?

You can click on one gene in the table to see his expression level in all the conditions. You can try to click on any gene and see the differences in count between control and heat stress.

You can try to change the adjusted pvalue threshold and the fold change threshold, and rerun the analysis to see the difference in the number of DEG (differentially expressed genes) detected.

1.11. Differentially expressed genes exploration

DIANE offers some tools to explore DEG lists.

1.11.1. MA plot

The MA - Vulcano plot allows you to see log2fc depending on logCPM. Take a look at this plot. What do you observe ?

1.11.2. Pvalue histogram

Now observe the Pvalue histogram. You should observe this plot by setting “0” in the Fold change filter. The density of pvalue should be evenly distributed along the plot, except for a peak near 0.

Is it the case here ?

1.11.3. Gene ontology enrichment

Now, set the pvalue threshold to 0.05 and the fold change threshold to 1. Click again in “Detect differentially expressed genes”. Then, go to the “Gene ontology enrichment” tab. Here you can perform gene ontology enrichment on your DEG list. Select only up-regulated genes in the “Genes to study section”, Click on “Start GO enrichment analysis”. This will display a plot showing the GO terms that are over represented in your gene set. To display this plot as a data table, click on “Data table”. Then sort this data table by pvalue.

What is the enriched GO term with the lowest pvalue ? Is it expected ?

1.11.4. Venn diagram

The last tab allows you to plot venn diagrams showing the common and different genes from multiple DEG lists. You can try to perform some other DEG analysis, and download the gene list common to all of these lists.

1.12. The last test, if you have finished

Now you can try to run all the previous steps using a different normalization method. Are the results very different ?

Links

DIANE : ShinyApp
Related courses : Transcriptomics
Comparison of methods for differential expression: Report
DESeq2: DESeq2
EdgR: EdgeR

Bibliography

Cassan, O., Lèbre, S. & Martin, A. Inferring and analyzing gene regulatory networks from multi-factorial expression data: a complete and interactive suite. BMC Genomics 22, 387 (2021). https://doi.org/10.1186/s12864-021-07659-2

License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).