South Green Logo

South Green Trainings pages

Description Hands On Lab Exercises for RNASeq
Related-course materials Transcriptomique
Related-course materials Trinity and Trinotate
Related-course materials Linux for Dummies
Authors Julie Orjuela (julie.orjuela_AT_irf.fr), Christine Tranchant (christine.tranchant_AT_ird.fr)
Creation Date 15/03/2018
Last Modified Date 03/10/2019

Summary


Practice 0. Going to the cluster - ssh,srun,scp

Dataset used in this practical comes from

In this session, we will analyze RNA-seq data from one sample of S. cerevisiae (NCBI SRA SRS307298). It is from two different origin (CENPK and Batch), with three biological replications for each origin (rep1, rep2 and rep3).

Connection into cluster through ssh mode

We will work on the cluster of the Joseph KI-ZERBI University (UJKZ) using SLURM scheduler.

ssh formationX@YOUR_IP_ADRESS

Opening an interactive bash session on a node srun -p partition --pty bash -i

Read this survival document containig basic commands to SLURM (https://southgreenplatform.github.io/trainings/slurm/)

srun -p PARTITIONNAME --time 08:00:00 --cpus-per-task 2 --pty bash -i

Prepare input files

# declare PATHTODATA variable : in this TP data are in /scratch/SRA_SRS307298
PATHTODATA="/PATH/TO/DATA/" 

# create scratch repertory 
cd /scratch
mkdir formationX
cd formationX

Link files from the shared location from scratch directory (it is essential that all calculations take place here). But keep in mind that your data can be located in your /home or on a NAS for example.

ln -s $PATHTODATA/SRA_SRS307298/RAWDATA/ /scratch/formationX/

When the files transfer is finished, verify by listing the content of the current directory and the subdirectory RAWDATA with the command ls -al. You should see 12 gzipped read files in a listing, a samples.txt file and a adapt-125pbLib.txt file.

[orjuela@node25 RAWDATA]$ more samples.txt 
CENPK	CENPK_rep1	PATH/SRR453569_1.fastq.gz	PATH/SRR453569_2.fastq.gz
CENPK	CENPK_rep2	PATH/SRR453570_1.fastq.gz	PATH/SRR453570_2.fastq.gz
CENPK	CENPK_rep3	PATH/SRR453571_1.fastq.gz	PATH/SRR453571_2.fastq.gz
Batch	Batch_rep1	PATH/SRR453566_1.fastq.gz	PATH/SRR453566_2.fastq.gz
Batch	Batch_rep2	PATH/SRR453567_1.fastq.gz	PATH/SRR453567_2.fastq.gz
Batch	Batch_rep3	PATH/SRR453568_1.fastq.gz	PATH/SRR453568_2.fastq.gz

Practice 1. Check Reads Quality

FastQC perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

# make a fastqc repertoty
cd /scratch/formationX
mkdir FASTQC
cd FASTQC

# charge modules 
module load fastqc/0.11.7/

# run fastqc in the whole of samples (it will take 10min)
fastqc -t 1 /scratch/formationX/RAWDATA/*.gz -o /scratch/formationX/FASTQC/

Multiqc is a modular tool to aggregate results from bioinformatics analyses across many samples into a single report. Use this tool to visualise results of quality. https://multiqc.info/

#charge modules
module load MultiQC/1.7

#launch Multiqc to create a report in html containing the whole of informations generated by FastQC
multiqc .

#transfer results to your cluster home 
scp -r multiqc* master:/home/formationX/

# transfert results FROM your local machine by scp or filezilla
scp -r formationX@:YOUR_IP_ADRESS /home/formationX/multiqc* ./

# open in your favorite web navigator
firefox multiqc_report.html .

In this practice, reads quality is ok. You need to observe sequences and check biases. To remove adaptors and primers you can use Trimmomatic. Use PRINSEQ2 to detect Poly A/T tails and low complexity reads. Remove contaminations with SortMeRNA, riboPicker or DeconSeq.

1.1 Cleanning reads with trimmomatic

# make a fastqc repertoty
cd /scratch/formationX
mkdir TRIMMOMATIC
cd TRIMMOMATIC

# loading modules
module load Trimmomatic/0.38

# Copy samples file and changing PATH to current directory in samples file.
cp ../RAWDATA/samples.txt .
sed -i 's|PATH|'$PWD'|ig' samples.txt  

Running trimmomatic for each sample

In this example, reads of SRR453566 sample are trimmed. (_U untrimmed _P paired)

java -jar /usr/local/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33  /scratch/formationX/RAWDATA/SRR453567_1.fastq.gz /scratch/formationX/RAWDATA/SRR453567_2.fastq.gz SRR453567_1.P.fastq.gz SRR453567_1.U.fastq.gz SRR453567_2.P.fastq.gz SRR453567_2.U.fastq.gz ILLUMINACLIP:/scratch/formationX/RAWDATA/adapt-125pbLib.txt:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25

In this example :

Input Read Pairs: 7615732 Both Surviving: 7504326 (98,54%) Forward Only Surviving: 71456 (0,94%) Reverse Only Surviving: 11202 (0,15%) Dropped: 28748 (0,38%)
TrimmomaticPE: Completed successfully

1.2 Check quality after trimming

Observe effet of trimming in your samples. Run fastqc in the whole of trimmed reads and observe it with MultiQC


Practice 2. Indexing references files

2.1 indexing a reference file

We will index references with kallisto index, samtools index to prepare references to follow analysis.

# load modules
module load R/3.5.0
module load samtools/1.8
module load perl/5.24.0
module load trinity/2.8.5
#module load express/1.5.1
module load kallisto/0.44.0


# copy transcrits fasta reference file
cp -r $PATHTODATA/SRA_SRS307298/REF /scratch/formationX/

# go to REF repertoty
cd /scratch/formationX/REF

# index reference
kallisto index -i GCF_000146045.2_R64_genes.fai GCF_000146045.2_R64_genes.fasta

Practice 3. Mapping against transcriptome reference + counting with Kallisto

3.1 Kallisto

Now, we will perform a transcriptome-based mapping and estimates of transcript levels using Kallisto, and a differential analysis using EdgeR.

# declare bash variables
path_to_trinity=/usr/local/trinityrnaseq-Trinity-v2.8.5/
fasta=/scratch/formationX/REF/GCF_000146045.2_R64_genes.fasta
samplesfile=/scratch/SRA_SRS307298/KALLISTO/samples.txt

# make a KALLISTO repertory and go on
cd /scratch/formationX
mkdir KALLISTO
cd KALLISTO

# symbolic link to trimmed fastq 
ln -s /scratch/SRA_SRS307298/TRIMMOMATIC/*.P.fastq.gz .

# Run the kallisto quant program by providing `GCF_000146045.2_R64_genes.fasta` as transcriptome reference and specifying correctly pairs of input fastq- `kallisto quant`
# Create and modify samples file path
cp /scratch/formationX/RAWDATA/samples.txt .

# changing PATH to current directory in samples file
sed -i 's|PATH|'$PWD'|ig' samples.txt 
sed -i 's|_1|_1.P|ig' samples.txt 
sed -i 's|_2|_2.P|ig' samples.txt

# lauch kallisto using  `align_and_estimate_abundance.pl`
perl $path_to_trinity/util/align_and_estimate_abundance.pl \
--transcripts $fasta \
--seqType fq \
--samples_file $samplesfile \
--est_method kallisto \
--prep_reference > kallisto_align_and_estimate_abundance.log 2>&1
[orjuela@node25 KALLISTO]$ more CENPK_rep1/abundance.tsv 
target_id	length	eff_length	est_counts	tpm
NC_001133.9:1807-2169	362	184.532	1	0.885838
NC_001133.9:2480-2707	227	80.5542	3.70979	7.52813
NC_001133.9:7235-9016	1781	1583.42	0	0
NC_001133.9:11565-11951	386	205.881	0	0
NC_001133.9:12046-12426	380	200.374	0	0
NC_001133.9:13363-13743	380	200.374	0	0
NC_001133.9:21566-21850	284	120.528	6	8.13748
NC_001133.9:22395-22685	290	125.236	9	11.7473
NC_001133.9:24000-27968	3968	3770.42	92.7662	4.02185
NC_001133.9:31567-32940	1373	1175.42	807.109	112.244
NC_001133.9:33448-34701	1253	1055.42	661	102.377
NC_001133.9:35155-36303	1148	950.425	1092	187.815
NC_001133.9:36509-37147	638	443.67	143	52.6869
NC_001133.9:37464-38972	1508	1310.42	195	24.3248
# go to kallisto results repertory
cd /scratch/formationX/KALLISTO/

# Calculate expression matrix (if salmon use quant.sf files, if kallisto use abundance.tsv files)

$path_to_trinity/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--name_sample_by_basedir \
--gene_trans_map none \
CENPK_rep1/abundance.tsv \
CENPK_rep2/abundance.tsv \
CENPK_rep3/abundance.tsv \
Batch_rep1/abundance.tsv \
Batch_rep2/abundance.tsv \
Batch_rep3/abundance.tsv 
#kallisto results
[orjuela@node25 KALLISTO]$ grep 'processed' kallisto_align_and_estimate_abundance.log 
[quant] processed 3,966,369 reads, 3,569,394 reads pseudoaligned
[quant] processed 6,638,859 reads, 4,384,262 reads pseudoaligned
[quant] processed 6,073,246 reads, 5,536,151 reads pseudoaligned
[quant] processed 5,630,114 reads, 5,271,023 reads pseudoaligned
[quant] processed 7,504,326 reads, 7,072,861 reads pseudoaligned
[quant] processed 5,476,735 reads, 5,178,176 reads pseudoaligned

[orjuela@node25 KALLISTO]$  more kallisto.isoform.counts.matrix
	CENPK_rep1	CENPK_rep2	CENPK_rep3	Batch_rep1	Batch_rep2	Batch_rep3
NC_001136.10:872112-873428	0	0	0	0	0	0
NC_001145.3:196628-197950	0	0	0	0	0	0
NC_0011###47.6:230085-231272	264	272	387	219	236	148
NC_001136.10:446967-447578	211	195	297	287	326	234
NC_001148.4:646448-647038	176	211	283	176	229	168
NC_001147.6:78352-79479	69	103	100	99	141	95
NC_001148.4:587189-587518	26	32	42	9	10	15
NC_001147.6:960987-965981	1074	1320	1625	3087	4204	3298

3.2 Examine your data and your experimental replicates before DE

Before differential expression analysis, examine your data to determine if there are any confounding issues. Trinity comes with a ‘PtR’ script that we use to simplify making various charts and plots based on a matrix input file. Run these three commands lines :

# go to kallisto results repertory
cd /scratch/formationX/KALLISTO

#create design file from samples.txt
cut -f1,2 samples.txt > design.txt

[orjuela@node25 salmon_outdir]$ more design.txt 
CENPK	CENPK_rep1
CENPK	CENPK_rep2
CENPK	CENPK_rep3
Batch	Batch_rep1
Batch	Batch_rep2
Batch	Batch_rep3

# Run PtR like so to compare the biological replicates for each of your samples.
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix  --samples design.txt --log2 --min_rowSums 10 --compare_replicates

# Now let's compare replicates across all samples. Run PtR to generate a correlation matrix for all sample replicates like so :
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix --samples design.txt --log2 --min_rowSums 10 --CPM --sample_cor_matrix

# Another important analysis method to explore relationships among the sample replicates is Principal Component Analysis (PCA). You can generate a PCA plot like so:
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix --samples design.txt --log2 --min_rowSums 10  --CPM --center_rows --prin_comp 3 
#transfering plots
cp *.pdf /home/formationX/
# from your local machine
scp formationX@YOURADRESSIP:*.pdf .

As you can see in the above, the replicates are more highly correlated within samples than between samples.

Confirmed by replicate Pearson correlation heatmap:


Results can be also download here


Practice 4 : Mapping against annotated genome reference with Hisat2 + counting with Stringtie

Running Hisat2 and Stringtie with TOGGLe

Connection to account in cluster ssh formationX@YOURIPADRESS

Input data are accessible from :

You can also download from here : RNASeqReadCount.config.txt

Import data from $PATHTODATA to your /scratch/formationX repertory and configure TOGGLe to RNAseq analysis

# Create a TOGGLe-RNASEQ directory in your /scratch 
cd /scratch/formationX/
mkdir TOGGLe-RNASEQ
cd TOGGLe-RNASEQ

# Make à copy of the configuration file used by TOGGLe into TOGGLe-RNASEQ directory
scp $PATHTODATA/RNASeqHisat2Stringtie.config.txt .
$slurm
-p YOURPARTITION

Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the reference genome. TOGGle index genome automatically if indexes are absents in reference folder. In order to save some space think to only keep sorted BAM files one these are created. After mapping, assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without annotation, annotation can be helpful when the number of reads for a transcript is too low for an accurate assembly.

[orjuela@node25 orjuela]$ more RNASeqHisat2Stringtie.config.txt 
$order
#1=fastqc
2=hisat2
3=samtoolsView
4=samtoolsSort
5=stringtie 1
1000=stringtie 2

$samtoolsview
-b
-h

$samtoolssort

$cleaner
3

# PUT YOUR OWN SLURM CONFIGURATION HERE IF AVAILABLE RESSOURCES 
#$slurm
#-p YOURPARTITION

$stringtie 1

$stringtie 2
--merge

$hisat2
--dta

# If your data are in scratch don't activate this option
#$scp
#/scratch/

$env
module load TOGGLE/0.3.7
[orjuela@node25 orjuela]$ more runTOGGLe_RNAseq.sh 
#!/bin/bash -l
#SBATCH -J TOGGLeRNASeq
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p YOURPARTITION

# Defining scratch and destination repertories\n
dir="/scratch/formationX/TOGGLe-RNASEQ/RAWDATA"
out="/scratch/formationX/TOGGLe-RNASEQ/OUT"
config="/scratch/formationX/TOGGLe-RNASEQ/RNASeqHisat2Stringtie.config.txt"
ref="/scratch/formationX/REF/SC_CHR01_genomic.fasta"
gff="/scratch/formationX/REF/SC_CHR01_genomic.gtf"

# Software-specific settings exported to user environment
module load TOGGLE/0.3.7

# running tooglegenerator 
toggleGenerator.pl -d $dir -c $config -r $ref -g $gff -o $out --report --nocheck;

echo "FIN de TOGGLe ^^"
sbatch ./runTOGGLeRNASEQ.sh

Explore output OUT TOGGLe and check if everything was ok.

Open and exploreOUT/finalResults/intermediateResults.STRINGTIEMERGE.gtf

[orjuela@node25 OUT]$ more finalResults/intermediateResults.STRINGTIEMERGE.gtf 
# /usr/local/stringtie-1.3.4/stringtie /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453566.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_init
ialFiles/SRR453567.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453568.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFil
es/SRR453569.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453570.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR
453571.STRINGTIE.gtf --merge -o /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/1000_stringtie_2/intermediateResults.STRINGTIEMERGE.gtf -G /scratch/orjuela/HISAT-STRINGTIE/REF/GCF_000146045.2_R64_
genomic.gtf
# StringTie version 1.3.4d
NC_001133.9	StringTie	transcript	116	348	1000	.	.	gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
NC_001133.9	StringTie	exon	116	348	1000	.	.	gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
NC_001133.9	StringTie	transcript	1807	2169	1000	-	.	gene_id "MSTRG.2"; transcript_id "NM_001180043.1"; gene_name "PAU8"; ref_gene_id "YAL068C"; 
NC_001133.9	StringTie	exon	1807	2169	1000	-	.	gene_id "MSTRG.2"; transcript_id "NM_001180043.1"; exon_number "1"; gene_name "PAU8"; ref_gene_id "YAL068C"; 
NC_001133.9	StringTie	transcript	2480	2707	1000	+	.	gene_id "MSTRG.3"; transcript_id "NM_001184582.1"; ref_gene_id "YAL067W-A"; 
NC_001133.9	StringTie	exon	2480	2707	1000	+	.	gene_id "MSTRG.3"; transcript_id "NM_001184582.1"; exon_number "1"; ref_gene_id "YAL067W-A"; 

Now that we have our assembled transcripts, we can estimate their abundances.

To estimate abundances, we have to run again stringtie using options -B and -e.

prepare data : in a new directory
cd /scratch/formationX/TOGGLe-RNASEQ/OUT/
mkdir stringtieEB
cd stringtieEB 
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/finalResults/intermediateResults.STRINGTIEMERGE.gtf .
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.bam .
Recovery annotations
module load system/python/3.6.5
python3 $PATHTODATA/scripts_utils/gpertea-scripts/mstrg_prep.py intermediateResults.STRINGTIEMERGE.gtf > intermediateResults.STRINGTIEMERGE_prep.gtf

Stringtie -e -B
for i in *bam ; do echo "stringtie -e -B over $i ..."; eval "mkdir ${i/.SAMTOOLSSORT.bam/}; module load bioinfo/stringtie/1.3.4; stringtie" $PWD"/"$i "-G $PWD"/"intermediateResults.STRINGTIEMERGE_prep.gtf -e -B -o $PWD/${i/.SAMTOOLSSORT.bam/}/${i/bam/count}"; done

gffcompare
$PATHTODATA/scripts_utils/gffcompare/gffcompare -r /scratch/formationX/REF/SC_CHR01_genomic.gtf -o gffcompare_out  intermediateResults.STRINGTIEMERGE.gtf
 
 /home/orjuela/Documents/2019/BURKINA_FORMATION/TP-OUAGA/REF/SC_CHR01_genomic.gtf  -o gffcompare_out  ../intermediateResults.STRINGTIEMERGE_prep.gtf 
  101 reference transcripts loaded.
  6446 query transfrags loaded.

Convert to counts table
# create COUNTS repertory
cd /scratch/formationX/TOGGLe-RNASEQ/OUT
mkdir COUNTS
cd COUNTS/
# create a symbolic link from stringtieEB counts results
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/stringtieEB/*/*.count .
# create a GTF list file 
for i in *.count; do echo ${i/.SAMTOOLSSORT.count/} $PWD/$i; done > listGTF.txt

Observe the listGTFtxt file :

[orjuela@node25 COUNTS]$ more listGTF.txt 
SRR453566 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453566.SAMTOOLSSORT.count
SRR453567 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453567.SAMTOOLSSORT.count
SRR453568 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453568.SAMTOOLSSORT.count
SRR453569 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453569.SAMTOOLSSORT.count
SRR453570 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453570.SAMTOOLSSORT.count
SRR453571 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453571.SAMTOOLSSORT.count
python2 $PATHTODATA/scripts_utils/prepDE.py -i listGTF.txt

How many genes have some counts? Remember, we are working only in chr01.

Transfer data from /scratch to your home on cluster
# copy gene counts files
scp -r /scratch/SRA_SRS307298/TOGGLe-RNASEQ/COUNTS/*tsv /home/formationX
# copy gffcompare annotation
scp -r /scratch/SRA_SRS307298/TOGGLe-RNASEQ/OUT/stringtieEB/gffcompare_out.annotated.gtf /home/formationX
# copy references
scp -r /scratch/SRA_SRS307298/REF/ /home/formationX
# copy BAMs files but before index it with `samtools index` 
for file in /scratch/formationX/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.bam; do samtools index $file; done
# and copy BAM and BAI files
scp  /scratch/SRA_SRS307298/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.ba* /home/formationX

Practice 5 : Visualization of mapped reads against genes using IGV

Practice 5 will be performed with Integrated Genome Viewer (IGV).

Focus on a gene that has been found to be differentially expressed and observe the structure of the gene.

orjuela@GFFCOMPARE$ grep  'class_code "u"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	31192	31470	.	+	.	transcript_id "MSTRG.12.1"; gene_id "MSTRG.12"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS4";
NC_001133.9	StringTie	transcript	151618	152052	.	+	.	transcript_id "MSTRG.75.1"; gene_id "MSTRG.75"; xloc "XLOC_000038"; class_code "u"; tss_id "TSS38";
NC_001133.9	StringTie	transcript	210461	212199	.	+	.	transcript_id "MSTRG.106.1"; gene_id "MSTRG.106"; xloc "XLOC_000055"; class_code "u"; tss_id "TSS55";
NC_001133.9	StringTie	transcript	208135	210131	.	-	.	transcript_id "MSTRG.105.1"; gene_id "MSTRG.105"; xloc "XLOC_000109"; class_code "u"; tss_id "TSS110";
NC_001133.9	StringTie	transcript	28548	28968	.	.	.	transcript_id "MSTRG.11.1"; gene_id "MSTRG.11"; xloc "XLOC_000111"; class_code "u"; tss_id "TSS112";

orjuela@GFFCOMPARE$ grep  'class_code "x"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	160119	166158	.	+	.	transcript_id "MSTRG.81.2"; gene_id "MSTRG.81"; gene_name "YAR009C"; xloc "XLOC_000042"; cmp_ref "NM_001178213.1"; class_code "x"; tss_id "TSS42";
NC_001133.9	StringTie	transcript	160119	166167	.	+	.	transcript_id "MSTRG.81.1"; gene_id "MSTRG.81"; gene_name "YAR009C"; xloc "XLOC_000042"; cmp_ref "NM_001178213.1"; class_code "x"; tss_id "TSS42";
NC_001133.9	StringTie	transcript	206812	220171	.	-	.	transcript_id "MSTRG.104.1"; gene_id "MSTRG.104"; gene_name "FLO1"; xloc "XLOC_000108"; cmp_ref "NM_001178230.1"; class_code "x"; tss_id "TSS109";

orjuela@GFFCOMPARE$ grep  'class_code "j"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	147594	151166	.	-	.	transcript_id "MSTRG.74.1"; gene_id "MSTRG.74|YAL001C"; gene_name "TFC3"; xloc "XLOC_000096"; cmp_ref "NM_001178148.1"; class_code "j"; tss_id "TSS96";

6. Differential Expression Analysis (DE) using scripts associated with Trinity software

6.1 Identify differentially expressed genes between the two tissues.

The tool run_DE_analysis.pl is a PERL script that use Bioconductor package edgeR.

# Go to kallisto results
cd /scratch/formationX/KALLISTO

# Run DE analysis
$path_to_trinity/Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix kallisto_trans.isoform.counts.matrix \
--method edgeR \
--samples_file design.txt \
--output edgeR_results

The output files are in the directory edgeR_results. Observe file kallisto_trans.isoform.counts.matrix.Batch_vs_CENPK.edgeR.DE_results. It provides several values for each gene:

1) FDR to indicate whether a gene is differentially expressed or not

2) logFC is the log2 transformed fold change between the two tissues

3) logCPM is the log2 transformed normalized read count of average of the samples.

Usually you have to filter this list of genes/isoforms to FDR <0.05 or below. To be more conservative, you could also use more stringent FDR cutoff (e.g. <0.001), and only keep genes with high logFC (e.g. <-2 and >2) and/or high logCPM (e.g. >1). In the edgeR_results directory there is also a “volcano plot” to visualize the distribution of the DE genes.

[orjuela@node25 salmon_outdir]$ head edgeR_results/kallisto_trans.isoform.counts.matrix.Batch_vs_CENPK.edgeR.DE_results
sampleA	sampleB	logFC	logCPM	PValue	FDR
TRINITY_DN332_c0_g1_i10	Batch	CENPK	-10.359903780639	8.33512668205486	0	0
TRINITY_DN730_c0_g2_i1	Batch	CENPK	-6.4740873961699	8.70152404919916	0	0
TRINITY_DN741_c0_g1_i1	Batch	CENPK	-6.31231314146213	10.2830859720565	0	0
TRINITY_DN287_c1_g1_i1	Batch	CENPK	-6.26650019458591	8.47203996977476	0	0
TRINITY_DN65_c0_g1_i3	Batch	CENPK	-4.13189957112457	10.730475730689	0	0
TRINITY_DN1298_c0_g1_i1	Batch	CENPK	-3.19109633696326	8.91247526735535	0	0
TRINITY_DN1253_c0_g1_i1	Batch	CENPK	-3.03997649547823	8.99502681145441	6.47317750883953e-301	3.75444295512693e-298
TRINITY_DN51_c0_g1_i1	Batch	CENPK	-4.59451625218422	10.4521118447878	5.94332889429699e-289	3.01623941385572e-286
TRINITY_DN708_c0_g1_i1	Batch	CENPK	-4.1838974791094	7.70916912374135	3.84699757684993e-275	1.73542335133453e-272
#transfering plots
cd edgeR_result
cp *.pdf /home/formationX/
# from your local machine
scp formationX@YOURADRESSIP:*.pdf .

Practice 7 : Differential expression analysis using PIVOT

Practice 7 will be performed in PIVOT via R Studio.

PIVOT: Platform for Interactive analysis and Visualization Of Transcriptomics data Qin Zhu, Junhyong Kim Lab, University of Pennsylvania Oct 26th, 2017

Intallation of PIVOT

Dependencies that needs to be manually installed. You may need to paste the following code line by line and choose if previously installed packages should be updated (recommended).

install.packages("devtools") 
library("devtools")
install.packages("BiocManager")
BiocManager::install("BiocUpgrade") 
#BiocManager::install("GO.db")
BiocManager::install("HSMMSingleCell")
#BiocManager::install("org.Mm.eg.db")
#BiocManager::install("org.Hs.eg.db")
BiocManager::install("DESeq2")
BiocManager::install("SingleCellExperiment")
BiocManager::install("scater")
BiocManager::install("monocle")
BiocManager::install("GenomeInfoDb")

Install PIVOT

install_github("qinzhu/PIVOT")
BiocManager::install("BiocGenerics") # You need the latest BiocGenerics >=0.23.3

Launch PIVOT

To run PIVOT, in Rstudio console related to a web shiny interface, use command

library(PIVOT)
pivot()

Go to your web browser.


Introduction

As you have seen, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments. The number of reads that map to a gene is the proxy for its expression level. There are many steps between receiving the raw reads from the sequencer and gaining biological insight. In this tutorial, we will start with a “Table of counts” and end with a “List of differentially expressed genes”.

Step 1 : Data Input

To input expression matrix, select “Counts Table” as input file type. PIVOT expects the count matrix to have rows as genes and samples as columns. Gene names and sample names should be the first column and the first row, respectively.

Step 2 : Input Design Information

The design infomation are used for sample point coloring and differential expression analysis. Users can input the entire sample meta sheet as design information, or manually specify groups or batches for each sample.

In order to have a quick view of your chosen data, look at the summary.

step 3 : Feature Filtering

Step 4 : Select samples

DATA MAP draw a summary of your different analysis, so you can save the history of your analysis.

Step 5 : Basic statistics

First part with EdgeR on PIVOT

Questions :

Step 6 : Differential Expression with EdgeR

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen. In edgeR, it is recommended to remove genes with very low counts. Remove genes (rows) which have zero counts for all samples from the dge data object.

Then Run Modeling

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains logFoldChange, logCountPerMillion, PValue, FDR. This table can be download in order to use it for other analysis. The Mean-Difference Plot show you the up-regulated gene in green, and Down regulated gene in black.

Questions :


Second part with DESeq2 on PIVOT

Questions:


Step 6 : Differential Expression with DESeq2

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen.

Then Run Modeling

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains baseMean, log2FoldChange, PValue, pvalueadjust. This table can be downloaded in order to use it for other analysis. The MA Plot show you the up-regulated gene LFC>0, and Down regulated gene LFC<0, the differential gene are in red.

Questions :


Step 7 : Visualisation Heatmap, Correlation, PCA

You can make correlation between the control and the rep in order to identify library bias if exists. We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix. The heatmap result corresponds to what we know about the data set. First, the samples in group 1 and 2 come from the control and the repetition, so the two groups are very different. Second, since the samples in each group are technical replicates, the within group variance is very low.

Correlation between sample and control

PCA

Question :


Practice 8 : Compare list of DE genes with EdgeR and DESeq2

Practice 8 will be performed with Venn Diagramm implemented on PIVOT.

Questions:

Some other tools are available to compare 2 lists of gene. Venny.


Practice 9 : Hierarchical Clustering

Practice 9 will be performed with PIVOT.



License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).