Description	Hands On Lab Exercises for RNASeq
Related-course materials	Transcriptomique
Related-course materials	Trinity and Trinotate
Related-course materials	Linux for Dummies
Authors	Julie Orjuela (julie.orjuela_AT_irf.fr), Christine Tranchant (christine.tranchant_AT_ird.fr)
Creation Date	15/03/2018
Last Modified Date	03/10/2019

Summary

Preambule 0: Connection into cluster - ssh,srun,scp
Practice 1: Check Reads Quality
Practice 2: Pseudo-mapping against transcriptome reference + counting with Kallisto
Practice 3: Mapping against annotated genome reference with Hisat2 + counting with Stringtie
Practice 4: Visualization of mapped reads against genes using IGV
Practice 5: Differential Expression Analysis (DE) using scripts associated with Trinity software
Practice 6: Differential expression analysis with EdgeR and DESeq2 using PIVOT
Links
License

Practice 0. Going to the cluster - `ssh,srun,scp`

Dataset used in this practical comes from

ref : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488244/
data : NCBI SRA database under accession number SRS307298 S. cerevisiae.
Genome size of S. cerevisiae : 12M (12.157.105) (https://www.yeastgenome.org/strain/S288C#genome_sequence)

In this session, we will analyze RNA-seq data from one sample of S. cerevisiae (NCBI SRA SRS307298). It is from two different origin (CENPK and Batch), with three biological replications for each origin (rep1, rep2 and rep3).

Connection into cluster through `ssh` mode

We will work on the cluster of the Joseph KI-ZERBI University (UJKZ) using SLURM scheduler.

ssh formationX@YOUR_IP_ADRESS

Opening an interactive bash session on a node `srun -p partition --pty bash -i`

Read this survival document containig basic commands to SLURM (https://southgreenplatform.github.io/trainings/slurm/)

srun -p PARTITIONNAME --time 08:00:00 --cpus-per-task 2 --pty bash -i

Prepare input files

Create your subdirectory in the scratch file system /scratch. In the following, please replace X with your own user ID number in formationX.

# declare PATHTODATA variable : in this TP data are in /scratch/SRA_SRS307298
PATHTODATA="/PATH/TO/DATA/" 

# create scratch repertory 
cd /scratch
mkdir formationX
cd formationX

Link files from the shared location from scratch directory (it is essential that all calculations take place here). But keep in mind that your data can be located in your /home or on a NAS for example.

ln -s $PATHTODATA/SRA_SRS307298/RAWDATA/ /scratch/formationX/

When the files transfer is finished, verify by listing the content of the current directory and the subdirectory RAWDATA with the command ls -al. You should see 12 gzipped read files in a listing, a samples.txt file and a adapt-125pbLib.txt file.

[orjuela@node25 RAWDATA]$ more samples.txt 
CENPK	CENPK_rep1	PATH/SRR453569_1.fastq.gz	PATH/SRR453569_2.fastq.gz
CENPK	CENPK_rep2	PATH/SRR453570_1.fastq.gz	PATH/SRR453570_2.fastq.gz
CENPK	CENPK_rep3	PATH/SRR453571_1.fastq.gz	PATH/SRR453571_2.fastq.gz
Batch	Batch_rep1	PATH/SRR453566_1.fastq.gz	PATH/SRR453566_2.fastq.gz
Batch	Batch_rep2	PATH/SRR453567_1.fastq.gz	PATH/SRR453567_2.fastq.gz
Batch	Batch_rep3	PATH/SRR453568_1.fastq.gz	PATH/SRR453568_2.fastq.gz

Practice 1. Check Reads Quality

FastQC perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

# make a fastqc repertoty
cd /scratch/formationX
mkdir FASTQC
cd FASTQC

# charge modules 
module load fastqc/0.11.7/

# run fastqc in the whole of samples (it will take 10min)
fastqc -t 1 /scratch/formationX/RAWDATA/*.gz -o /scratch/formationX/FASTQC/

Multiqc is a modular tool to aggregate results from bioinformatics analyses across many samples into a single report. Use this tool to visualise results of quality. https://multiqc.info/

#charge modules
module load MultiQC/1.7

#launch Multiqc to create a report in html containing the whole of informations generated by FastQC
multiqc .

#transfer results to your cluster home 
scp -r multiqc* master:/home/formationX/

# transfert results FROM your local machine by scp or filezilla
scp -r formationX@:YOUR_IP_ADRESS /home/formationX/multiqc* ./

# open in your favorite web navigator
firefox multiqc_report.html .

In this practice, reads quality is ok. You need to observe sequences and check biases. To remove adaptors and primers you can use Trimmomatic. Use PRINSEQ2 to detect Poly A/T tails and low complexity reads. Remove contaminations with SortMeRNA, riboPicker or DeconSeq.

1.1 Cleanning reads with trimmomatic

# make a fastqc repertoty
cd /scratch/formationX
mkdir TRIMMOMATIC
cd TRIMMOMATIC

# loading modules
module load Trimmomatic/0.38

# Copy samples file and changing PATH to current directory in samples file.
cp ../RAWDATA/samples.txt .
sed -i 's|PATH|'$PWD'|ig' samples.txt

Running trimmomatic for each sample

In this example, reads of SRR453566 sample are trimmed. (_U untrimmed _P paired)

ILUMINACLIP is used to find and remove Illumina adapters.
SLIDINGWINDOW performs a sliding window trimming, cutting once the average quality within the window falls below a threshold. By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read.
LEADING removes low quality bases from the beginning. As long as a base has a value below this threshold the base is removed and the next base will be investigated.
TRAILING specifies the minimum quality required to keep a base
MINLEN removes reads that fall below the specified minimal length

java -jar /usr/local/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33  /scratch/formationX/RAWDATA/SRR453567_1.fastq.gz /scratch/formationX/RAWDATA/SRR453567_2.fastq.gz SRR453567_1.P.fastq.gz SRR453567_1.U.fastq.gz SRR453567_2.P.fastq.gz SRR453567_2.U.fastq.gz ILLUMINACLIP:/scratch/formationX/RAWDATA/adapt-125pbLib.txt:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25

In this example :

Input Read Pairs: 7615732 Both Surviving: 7504326 (98,54%) Forward Only Surviving: 71456 (0,94%) Reverse Only Surviving: 11202 (0,15%) Dropped: 28748 (0,38%)
TrimmomaticPE: Completed successfully

1.2 Check quality after trimming

Observe effet of trimming in your samples. Run fastqc in the whole of trimmed reads and observe it with MultiQC

Practice 2. Indexing references files

2.1 indexing a reference file

We will index references with kallisto index, samtools index to prepare references to follow analysis.

# load modules
module load R/3.5.0
module load samtools/1.8
module load perl/5.24.0
module load trinity/2.8.5
#module load express/1.5.1
module load kallisto/0.44.0


# copy transcrits fasta reference file
cp -r $PATHTODATA/SRA_SRS307298/REF /scratch/formationX/

# go to REF repertoty
cd /scratch/formationX/REF

# index reference
kallisto index -i GCF_000146045.2_R64_genes.fai GCF_000146045.2_R64_genes.fasta

Practice 3. Mapping against transcriptome reference + counting with Kallisto

3.1 Kallisto

Now, we will perform a transcriptome-based mapping and estimates of transcript levels using Kallisto, and a differential analysis using EdgeR.

Transfert references of transcriptome (cDNA) to be used as reference and index it, prepare a samples file and run kallisto for each sample

# declare bash variables
path_to_trinity=/usr/local/trinityrnaseq-Trinity-v2.8.5/
fasta=/scratch/formationX/REF/GCF_000146045.2_R64_genes.fasta
samplesfile=/scratch/SRA_SRS307298/KALLISTO/samples.txt

# make a KALLISTO repertory and go on
cd /scratch/formationX
mkdir KALLISTO
cd KALLISTO

# symbolic link to trimmed fastq 
ln -s /scratch/SRA_SRS307298/TRIMMOMATIC/*.P.fastq.gz .

# Run the kallisto quant program by providing `GCF_000146045.2_R64_genes.fasta` as transcriptome reference and specifying correctly pairs of input fastq- `kallisto quant`

For every sample you have to run kallisto quant (example with SRR453568 sample) by using kallisto quant -i GCF_000146045.2_R64_genes.fai -o SRR453568 <(zcat SRR453568_1.P.fastq.gz) <(zcat SRR453568_2_P.fastq.gz) command for example BUT we propose you use a trinity script align_and_estimate_abundance.pl to lauch the whole of samples in a command line.

# Create and modify samples file path
cp /scratch/formationX/RAWDATA/samples.txt .

# changing PATH to current directory in samples file
sed -i 's|PATH|'$PWD'|ig' samples.txt 
sed -i 's|_1|_1.P|ig' samples.txt 
sed -i 's|_2|_2.P|ig' samples.txt

# lauch kallisto using  `align_and_estimate_abundance.pl`
perl $path_to_trinity/util/align_and_estimate_abundance.pl \
--transcripts $fasta \
--seqType fq \
--samples_file $samplesfile \
--est_method kallisto \
--prep_reference > kallisto_align_and_estimate_abundance.log 2>&1

Visualize data abundance results

[orjuela@node25 KALLISTO]$ more CENPK_rep1/abundance.tsv 
target_id	length	eff_length	est_counts	tpm
NC_001133.9:1807-2169	362	184.532	1	0.885838
NC_001133.9:2480-2707	227	80.5542	3.70979	7.52813
NC_001133.9:7235-9016	1781	1583.42	0	0
NC_001133.9:11565-11951	386	205.881	0	0
NC_001133.9:12046-12426	380	200.374	0	0
NC_001133.9:13363-13743	380	200.374	0	0
NC_001133.9:21566-21850	284	120.528	6	8.13748
NC_001133.9:22395-22685	290	125.236	9	11.7473
NC_001133.9:24000-27968	3968	3770.42	92.7662	4.02185
NC_001133.9:31567-32940	1373	1175.42	807.109	112.244
NC_001133.9:33448-34701	1253	1055.42	661	102.377
NC_001133.9:35155-36303	1148	950.425	1092	187.815
NC_001133.9:36509-37147	638	443.67	143	52.6869
NC_001133.9:37464-38972	1508	1310.42	195	24.3248

Convert kallisto outputs into one single file that can be used as input for EdgeR -Trinity scripts

# go to kallisto results repertory
cd /scratch/formationX/KALLISTO/

# Calculate expression matrix (if salmon use quant.sf files, if kallisto use abundance.tsv files)

$path_to_trinity/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--name_sample_by_basedir \
--gene_trans_map none \
CENPK_rep1/abundance.tsv \
CENPK_rep2/abundance.tsv \
CENPK_rep3/abundance.tsv \
Batch_rep1/abundance.tsv \
Batch_rep2/abundance.tsv \
Batch_rep3/abundance.tsv

Check reads percentage pseudo-mapped to fasta reference using kallisto and observe count matrix file. Last can be used in DE analysis.

#kallisto results
[orjuela@node25 KALLISTO]$ grep 'processed' kallisto_align_and_estimate_abundance.log 
[quant] processed 3,966,369 reads, 3,569,394 reads pseudoaligned
[quant] processed 6,638,859 reads, 4,384,262 reads pseudoaligned
[quant] processed 6,073,246 reads, 5,536,151 reads pseudoaligned
[quant] processed 5,630,114 reads, 5,271,023 reads pseudoaligned
[quant] processed 7,504,326 reads, 7,072,861 reads pseudoaligned
[quant] processed 5,476,735 reads, 5,178,176 reads pseudoaligned

[orjuela@node25 KALLISTO]$  more kallisto.isoform.counts.matrix
	CENPK_rep1	CENPK_rep2	CENPK_rep3	Batch_rep1	Batch_rep2	Batch_rep3
NC_001136.10:872112-873428	0	0	0	0	0	0
NC_001145.3:196628-197950	0	0	0	0	0	0
NC_0011###47.6:230085-231272	264	272	387	219	236	148
NC_001136.10:446967-447578	211	195	297	287	326	234
NC_001148.4:646448-647038	176	211	283	176	229	168
NC_001147.6:78352-79479	69	103	100	99	141	95
NC_001148.4:587189-587518	26	32	42	9	10	15
NC_001147.6:960987-965981	1074	1320	1625	3087	4204	3298

3.2 Examine your data and your experimental replicates before DE

Before differential expression analysis, examine your data to determine if there are any confounding issues. Trinity comes with a ‘PtR’ script that we use to simplify making various charts and plots based on a matrix input file. Run these three commands lines :

# go to kallisto results repertory
cd /scratch/formationX/KALLISTO

#create design file from samples.txt
cut -f1,2 samples.txt > design.txt

[orjuela@node25 salmon_outdir]$ more design.txt 
CENPK	CENPK_rep1
CENPK	CENPK_rep2
CENPK	CENPK_rep3
Batch	Batch_rep1
Batch	Batch_rep2
Batch	Batch_rep3

# Run PtR like so to compare the biological replicates for each of your samples.
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix  --samples design.txt --log2 --min_rowSums 10 --compare_replicates

# Now let's compare replicates across all samples. Run PtR to generate a correlation matrix for all sample replicates like so :
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix --samples design.txt --log2 --min_rowSums 10 --CPM --sample_cor_matrix

# Another important analysis method to explore relationships among the sample replicates is Principal Component Analysis (PCA). You can generate a PCA plot like so:
$path_to_trinity/Analysis/DifferentialExpression/PtR  --matrix kallisto.isoform.counts.matrix --samples design.txt --log2 --min_rowSums 10  --CPM --center_rows --prin_comp 3

TRANSFERT : Observe plots. Remember, you have to transfert *.pdf files to your home before of transfering into your local machine.

#transfering plots
cp *.pdf /home/formationX/
# from your local machine
scp formationX@YOURADRESSIP:*.pdf .

As you can see in the above, the replicates are more highly correlated within samples than between samples.

Confirmed by replicate Pearson correlation heatmap:

In this example, the replicates cluster tightly according to sample type, which is very reassuring. If you have replicates that are clear outliers, you might consider removing them from your study as potential confounders.

Results can be also download here

Practice 4 : Mapping against annotated genome reference with Hisat2 + counting with Stringtie

Running Hisat2 and Stringtie with TOGGLe

Connection to account in cluster ssh formationX@YOURIPADRESS

Input data are accessible from :

Input data : $PATHTODATA/DATA
Reference : $PATHTODATA/REF/SC_CHR01_genomic.fasta
Annotation : $PATHTODATA/REF/SC_CHR01_genomic.gtf
Config file: $PATHTODATA/RNASeqReadCount.config.txt

You can also download from here : RNASeqReadCount.config.txt

Import data from $PATHTODATA to your /scratch/formationX repertory and configure TOGGLe to RNAseq analysis

# Create a TOGGLe-RNASEQ directory in your /scratch 
cd /scratch/formationX/
mkdir TOGGLe-RNASEQ
cd TOGGLe-RNASEQ

# Make à copy of the configuration file used by TOGGLe into TOGGLe-RNASEQ directory
scp $PATHTODATA/RNASeqHisat2Stringtie.config.txt .

Change SLURM key $slurm as below using a texte editor like nano nano RNASeqHisat2Stringtie.config.txt and check the whole of parameters

$slurm
-p YOURPARTITION

in TOGGLe configuration file use /scratch in $scp key to launch your job from scratch folder. Your data are now in /scratch so deactivate this option. But activate $env key using module load TOGGLE/0.3.7 module installed on cluster. Check parameters of every step in /scratch/formationX/TOGGLe-RNASeq/RNASeqHisat2Stringtie.config.txt as recommended by https://www.nature.com/articles/nprot.2016.095.

Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the reference genome. TOGGle index genome automatically if indexes are absents in reference folder. In order to save some space think to only keep sorted BAM files one these are created. After mapping, assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without annotation, annotation can be helpful when the number of reads for a transcript is too low for an accurate assembly.

[orjuela@node25 orjuela]$ more RNASeqHisat2Stringtie.config.txt 
$order
#1=fastqc
2=hisat2
3=samtoolsView
4=samtoolsSort
5=stringtie 1
1000=stringtie 2

$samtoolsview
-b
-h

$samtoolssort

$cleaner
3

# PUT YOUR OWN SLURM CONFIGURATION HERE IF AVAILABLE RESSOURCES 
#$slurm
#-p YOURPARTITION

$stringtie 1

$stringtie 2
--merge

$hisat2
--dta

# If your data are in scratch don't activate this option
#$scp
#/scratch/

$env
module load TOGGLE/0.3.7

Now, create a runTOGGLeRNASEQ.sh bash script to launch TOGGLe as follow :

[orjuela@node25 orjuela]$ more runTOGGLe_RNAseq.sh 
#!/bin/bash -l
#SBATCH -J TOGGLeRNASeq
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p YOURPARTITION

# Defining scratch and destination repertories\n
dir="/scratch/formationX/TOGGLe-RNASEQ/RAWDATA"
out="/scratch/formationX/TOGGLe-RNASEQ/OUT"
config="/scratch/formationX/TOGGLe-RNASEQ/RNASeqHisat2Stringtie.config.txt"
ref="/scratch/formationX/REF/SC_CHR01_genomic.fasta"
gff="/scratch/formationX/REF/SC_CHR01_genomic.gtf"

# Software-specific settings exported to user environment
module load TOGGLE/0.3.7

# running tooglegenerator 
toggleGenerator.pl -d $dir -c $config -r $ref -g $gff -o $out --report --nocheck;

echo "FIN de TOGGLe ^^"

Convert runTOGGLeRNASEQ in an executable file with chmod +x runTOGGLeRNASEQ.sh
Launch runTOGGLeRNASEQ.sh in sbatch mode

sbatch ./runTOGGLeRNASEQ.sh

Explore output OUT TOGGLe and check if everything was ok.

Open and exploreOUT/finalResults/intermediateResults.STRINGTIEMERGE.gtf

[orjuela@node25 OUT]$ more finalResults/intermediateResults.STRINGTIEMERGE.gtf 
# /usr/local/stringtie-1.3.4/stringtie /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453566.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_init
ialFiles/SRR453567.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453568.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFil
es/SRR453569.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR453570.STRINGTIE.gtf /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/0_initialFiles/SRR
453571.STRINGTIE.gtf --merge -o /scratch/orjuela/HISAT-STRINGTIE/OUT/output/intermediateResults/1000_stringtie_2/intermediateResults.STRINGTIEMERGE.gtf -G /scratch/orjuela/HISAT-STRINGTIE/REF/GCF_000146045.2_R64_
genomic.gtf
# StringTie version 1.3.4d
NC_001133.9	StringTie	transcript	116	348	1000	.	.	gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
NC_001133.9	StringTie	exon	116	348	1000	.	.	gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
NC_001133.9	StringTie	transcript	1807	2169	1000	-	.	gene_id "MSTRG.2"; transcript_id "NM_001180043.1"; gene_name "PAU8"; ref_gene_id "YAL068C"; 
NC_001133.9	StringTie	exon	1807	2169	1000	-	.	gene_id "MSTRG.2"; transcript_id "NM_001180043.1"; exon_number "1"; gene_name "PAU8"; ref_gene_id "YAL068C"; 
NC_001133.9	StringTie	transcript	2480	2707	1000	+	.	gene_id "MSTRG.3"; transcript_id "NM_001184582.1"; ref_gene_id "YAL067W-A"; 
NC_001133.9	StringTie	exon	2480	2707	1000	+	.	gene_id "MSTRG.3"; transcript_id "NM_001184582.1"; exon_number "1"; ref_gene_id "YAL067W-A";

Now that we have our assembled transcripts, we can estimate their abundances.

To estimate abundances, we have to run again stringtie using options -B and -e.

prepare data : in a new directory

Create symbolics links to order data before transfering them to /scratch

cd /scratch/formationX/TOGGLe-RNASEQ/OUT/
mkdir stringtieEB
cd stringtieEB 
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/finalResults/intermediateResults.STRINGTIEMERGE.gtf .
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.bam .

Recovery annotations

Before merging gtf files obtained by stringtie, we have to recover the annotations in order to see the known genes names in gtf file. Stringtie annotate transcripts using gene id ‘MSTRG.1’ nomenclature . See https://github.com/gpertea/stringtie/issues/179

module load system/python/3.6.5
python3 $PATHTODATA/scripts_utils/gpertea-scripts/mstrg_prep.py intermediateResults.STRINGTIEMERGE.gtf > intermediateResults.STRINGTIEMERGE_prep.gtf

Stringtie -e -B

Finally, we launch stringtie to obtain de novo transcrits assemblies :

for i in *bam ; do echo "stringtie -e -B over $i ..."; eval "mkdir ${i/.SAMTOOLSSORT.bam/}; module load bioinfo/stringtie/1.3.4; stringtie" $PWD"/"$i "-G $PWD"/"intermediateResults.STRINGTIEMERGE_prep.gtf -e -B -o $PWD/${i/.SAMTOOLSSORT.bam/}/${i/bam/count}"; done

gffcompare

Let’s compare the StringTie transcripts to known transcripts using gffcompare https://github.com/gpertea/gffcompare and explore results. Observe statistics. How many “J”, “U” and “=” do you obtain?. gffcompare_out.annotated.gtf file will be visualised with IGV later.

$PATHTODATA/scripts_utils/gffcompare/gffcompare -r /scratch/formationX/REF/SC_CHR01_genomic.gtf -o gffcompare_out  intermediateResults.STRINGTIEMERGE.gtf
 
 /home/orjuela/Documents/2019/BURKINA_FORMATION/TP-OUAGA/REF/SC_CHR01_genomic.gtf  -o gffcompare_out  ../intermediateResults.STRINGTIEMERGE_prep.gtf 
  101 reference transcripts loaded.
  6446 query transfrags loaded.

Convert to counts table

Convert stringtie output in counts using prepDE.py

# create COUNTS repertory
cd /scratch/formationX/TOGGLe-RNASEQ/OUT
mkdir COUNTS
cd COUNTS/
# create a symbolic link from stringtieEB counts results
ln -s /scratch/formationX/TOGGLe-RNASEQ/OUT/stringtieEB/*/*.count .
# create a GTF list file 
for i in *.count; do echo ${i/.SAMTOOLSSORT.count/} $PWD/$i; done > listGTF.txt

Observe the listGTFtxt file :

[orjuela@node25 COUNTS]$ more listGTF.txt 
SRR453566 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453566.SAMTOOLSSORT.count
SRR453567 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453567.SAMTOOLSSORT.count
SRR453568 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453568.SAMTOOLSSORT.count
SRR453569 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453569.SAMTOOLSSORT.count
SRR453570 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453570.SAMTOOLSSORT.count
SRR453571 /scratch/formationX/TOGGLe-RNASEQ/OUT/COUNTS/SRR453571.SAMTOOLSSORT.count

… and launch preDE.py to obtain a global counts file. this global file can be used in DE analysis

python2 $PATHTODATA/scripts_utils/prepDE.py -i listGTF.txt

In this step, you have obtained gene_count_matrix.csv and transcript_count_matrix.csv files

How many genes have some counts? Remember, we are working only in chr01.

Transfer data from /scratch to your home on cluster

Don’t forget scp files to your /home/formationX. Transfer reference files fasta, gff and gffcompare_out.annotated.gtf to use it later with IGV.

# copy gene counts files
scp -r /scratch/SRA_SRS307298/TOGGLe-RNASEQ/COUNTS/*tsv /home/formationX
# copy gffcompare annotation
scp -r /scratch/SRA_SRS307298/TOGGLe-RNASEQ/OUT/stringtieEB/gffcompare_out.annotated.gtf /home/formationX
# copy references
scp -r /scratch/SRA_SRS307298/REF/ /home/formationX
# copy BAMs files but before index it with `samtools index` 
for file in /scratch/formationX/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.bam; do samtools index $file; done
# and copy BAM and BAI files
scp  /scratch/SRA_SRS307298/TOGGLe-RNASEQ/OUT/output/*/4_samToolsSort/*SAMTOOLSSORT.ba* /home/formationX

Transfert data from your home on cluster to your LOCAL machine (filezilla or scp)

Practice 5 : Visualization of mapped reads against genes using IGV

Practice 5 will be performed with Integrated Genome Viewer (IGV).

Focus on a gene that has been found to be differentially expressed and observe the structure of the gene.

Run igv : igv.sh &
Load reference genome, GFF annotation file, BAMs files and the gffCompare gffcompare_out.annotated.gtf output.
Recovery some ID to visualise it in IGV from gffcompare_out.annotated.gtf

orjuela@GFFCOMPARE$ grep  'class_code "u"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	31192	31470	.	+	.	transcript_id "MSTRG.12.1"; gene_id "MSTRG.12"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS4";
NC_001133.9	StringTie	transcript	151618	152052	.	+	.	transcript_id "MSTRG.75.1"; gene_id "MSTRG.75"; xloc "XLOC_000038"; class_code "u"; tss_id "TSS38";
NC_001133.9	StringTie	transcript	210461	212199	.	+	.	transcript_id "MSTRG.106.1"; gene_id "MSTRG.106"; xloc "XLOC_000055"; class_code "u"; tss_id "TSS55";
NC_001133.9	StringTie	transcript	208135	210131	.	-	.	transcript_id "MSTRG.105.1"; gene_id "MSTRG.105"; xloc "XLOC_000109"; class_code "u"; tss_id "TSS110";
NC_001133.9	StringTie	transcript	28548	28968	.	.	.	transcript_id "MSTRG.11.1"; gene_id "MSTRG.11"; xloc "XLOC_000111"; class_code "u"; tss_id "TSS112";

orjuela@GFFCOMPARE$ grep  'class_code "x"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	160119	166158	.	+	.	transcript_id "MSTRG.81.2"; gene_id "MSTRG.81"; gene_name "YAR009C"; xloc "XLOC_000042"; cmp_ref "NM_001178213.1"; class_code "x"; tss_id "TSS42";
NC_001133.9	StringTie	transcript	160119	166167	.	+	.	transcript_id "MSTRG.81.1"; gene_id "MSTRG.81"; gene_name "YAR009C"; xloc "XLOC_000042"; cmp_ref "NM_001178213.1"; class_code "x"; tss_id "TSS42";
NC_001133.9	StringTie	transcript	206812	220171	.	-	.	transcript_id "MSTRG.104.1"; gene_id "MSTRG.104"; gene_name "FLO1"; xloc "XLOC_000108"; cmp_ref "NM_001178230.1"; class_code "x"; tss_id "TSS109";

orjuela@GFFCOMPARE$ grep  'class_code "j"' gffcompare_out.annotated.gtf | grep ^'NC_001133.9' 
NC_001133.9	StringTie	transcript	147594	151166	.	-	.	transcript_id "MSTRG.74.1"; gene_id "MSTRG.74|YAL001C"; gene_name "TFC3"; xloc "XLOC_000096"; cmp_ref "NM_001178148.1"; class_code "j"; tss_id "TSS96";

Search a “unique” gene “MSTRG.12.1” in IGV. Show this loci.
Search other gene, for example, “MSTRG.81.2” from class “x” or “YAL014C” from class “=”.

6. Differential Expression Analysis (DE) using scripts associated with Trinity software

6.1 Identify differentially expressed genes between the two tissues.

The tool run_DE_analysis.pl is a PERL script that use Bioconductor package edgeR.

# Go to kallisto results
cd /scratch/formationX/KALLISTO

# Run DE analysis
$path_to_trinity/Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix kallisto_trans.isoform.counts.matrix \
--method edgeR \
--samples_file design.txt \
--output edgeR_results

The output files are in the directory edgeR_results. Observe file kallisto_trans.isoform.counts.matrix.Batch_vs_CENPK.edgeR.DE_results. It provides several values for each gene:

1) FDR to indicate whether a gene is differentially expressed or not

2) logFC is the log2 transformed fold change between the two tissues

3) logCPM is the log2 transformed normalized read count of average of the samples.

Usually you have to filter this list of genes/isoforms to FDR <0.05 or below. To be more conservative, you could also use more stringent FDR cutoff (e.g. <0.001), and only keep genes with high logFC (e.g. <-2 and >2) and/or high logCPM (e.g. >1). In the edgeR_results directory there is also a “volcano plot” to visualize the distribution of the DE genes.

[orjuela@node25 salmon_outdir]$ head edgeR_results/kallisto_trans.isoform.counts.matrix.Batch_vs_CENPK.edgeR.DE_results
sampleA	sampleB	logFC	logCPM	PValue	FDR
TRINITY_DN332_c0_g1_i10	Batch	CENPK	-10.359903780639	8.33512668205486	0	0
TRINITY_DN730_c0_g2_i1	Batch	CENPK	-6.4740873961699	8.70152404919916	0	0
TRINITY_DN741_c0_g1_i1	Batch	CENPK	-6.31231314146213	10.2830859720565	0	0
TRINITY_DN287_c1_g1_i1	Batch	CENPK	-6.26650019458591	8.47203996977476	0	0
TRINITY_DN65_c0_g1_i3	Batch	CENPK	-4.13189957112457	10.730475730689	0	0
TRINITY_DN1298_c0_g1_i1	Batch	CENPK	-3.19109633696326	8.91247526735535	0	0
TRINITY_DN1253_c0_g1_i1	Batch	CENPK	-3.03997649547823	8.99502681145441	6.47317750883953e-301	3.75444295512693e-298
TRINITY_DN51_c0_g1_i1	Batch	CENPK	-4.59451625218422	10.4521118447878	5.94332889429699e-289	3.01623941385572e-286
TRINITY_DN708_c0_g1_i1	Batch	CENPK	-4.1838974791094	7.70916912374135	3.84699757684993e-275	1.73542335133453e-272

#transfering plots
cd edgeR_result
cp *.pdf /home/formationX/
# from your local machine
scp formationX@YOURADRESSIP:*.pdf .

Practice 7 : Differential expression analysis using PIVOT

Practice 7 will be performed in PIVOT via R Studio.

PIVOT: Platform for Interactive analysis and Visualization Of Transcriptomics data Qin Zhu, Junhyong Kim Lab, University of Pennsylvania Oct 26th, 2017

Intallation of PIVOT

Dependencies that needs to be manually installed. You may need to paste the following code line by line and choose if previously installed packages should be updated (recommended).

install.packages("devtools") 
library("devtools")
install.packages("BiocManager")
BiocManager::install("BiocUpgrade") 
#BiocManager::install("GO.db")
BiocManager::install("HSMMSingleCell")
#BiocManager::install("org.Mm.eg.db")
#BiocManager::install("org.Hs.eg.db")
BiocManager::install("DESeq2")
BiocManager::install("SingleCellExperiment")
BiocManager::install("scater")
BiocManager::install("monocle")
BiocManager::install("GenomeInfoDb")

Install PIVOT

install_github("qinzhu/PIVOT")
BiocManager::install("BiocGenerics") # You need the latest BiocGenerics >=0.23.3

Launch PIVOT

To run PIVOT, in Rstudio console related to a web shiny interface, use command

library(PIVOT)
pivot()

Go to your web browser.

Introduction

As you have seen, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments. The number of reads that map to a gene is the proxy for its expression level. There are many steps between receiving the raw reads from the sequencer and gaining biological insight. In this tutorial, we will start with a “Table of counts” and end with a “List of differentially expressed genes”.

Step 1 : Data Input

To input expression matrix, select “Counts Table” as input file type. PIVOT expects the count matrix to have rows as genes and samples as columns. Gene names and sample names should be the first column and the first row, respectively.

Go to file Tab.
Take the count file gene_count_matrix.csv generated previously.
Import this file into Data input and then Input file type.
Add Use Tab separator to Skip Rows.
Check if yours data are imported in the rigth window.

Step 2 : Input Design Information

The design infomation are used for sample point coloring and differential expression analysis. Users can input the entire sample meta sheet as design information, or manually specify groups or batches for each sample.

Go to DESIGN.
Go to Designed Table Upload. Upload design.txt
Verify that the header of the info file corresponds to the count file.
Choose the Separator : Space or the appropriate separator.
Verify on the Design Table Preview and submit design.
Choose the Normalieation Method :
- for Edge R you can use DESeq, Trimmed Mean of M-values TMM, or Upperquartile.
- for DESeq you can use DESeq, Modifed DESeq

In order to have a quick view of your chosen data, look at the summary.

step 3 : Feature Filtering

There are currently 3 types of feature filter in PIVOT: the expression filter, which filters based on various expression statistics;
You can choose the filter criteria.

Step 4 : Select samples

Go To SAMPLE
Select your sample and condition,
- Subset by sample and or subset by list, as you can create different experiment analysis.

DATA MAP draw a summary of your different analysis, so you can save the history of your analysis.

Step 5 : Basic statistics

If you want to keep the count table, upload it.
Check the distribution of each condition in the standard deviation graph, the dispersion graph.
If needed, you can download the “Variably Expressed Genes”, and on the graph, you can see the dispersion of your data.

First part with EdgeR on PIVOT

Run the EdgeR program for differential analysis - EdgeR
Check relevance of normalized expression values provided by EdgeR
Observe MDS plot of experimental conditions. Observe Smear plot.

Questions :

Using filters parameters, determine how many genes are found to be differentally expressed using a minimum pvalue <= 0.05, 0.1? Using a minimum FDR-adjusted pvalue <= 0.05, 0.1?

Step 6 : Differential Expression with EdgeR

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen. In edgeR, it is recommended to remove genes with very low counts. Remove genes (rows) which have zero counts for all samples from the dge data object.

Go to Differential Expression, choose edgeR
According to your normalized method choice in step 1, notify the Data Normalization method, and choose Experiment Design, and condition.
For this dataset choose condition, condition
Choose the Test Method, Exact test, GLM likelihood ratio test, and GLM quasi-likelihood F-test.
For this dataset choose Exact test

Then Run Modeling

Choose the P adjustement adapted
For this dataset choose False discovery rate or Bonferroni correction
Choose FDR cutoff 0.1 and 0.05
Choose Term 1 and Term 2

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains logFoldChange, logCountPerMillion, PValue, FDR. This table can be download in order to use it for other analysis. The Mean-Difference Plot show you the up-regulated gene in green, and Down regulated gene in black.

Keep this file edgeR_results.csv
Change the name of the file according to your analysis.
Example : edgeR_results_FDR_01.csv to be able to recognize the different criteria used.

Questions :

Do you have the sample number of differential gene for a FDR cutoff 0.1 and 0.05?
According to you, what is the best cutoff?
When you change the order of the Term1 and Term2, what are the consequences on the Results Table?

Second part with DESeq2 on PIVOT

Run the DESeq program for differential analysis - DESeq2
Check relevance of normalized expression values provided by DESeq2
Observe MDS plot of experimental conditions. Observe Smear plot.

Questions:

Using filters parameters, determine how many genes are found to be differentially expressed using a minimum pvalue <= 0.05, 0.1? Using a minimum FDR-adjusted pvalue <= 0.05, 0.1?

Step 6 : Differential Expression with DESeq2

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen.

Go to Differential Expression, choose DEseq2
According to your normalized method choice in step 1, notify the Data Normalization method, and choose Experiment Design, and condition.
For this dataset choose condition, condition
Choose the Test Method, Exact test, GLM likelihood ratio test, and GLM quasi-likelihood F-test.
For this dataset choose Exact test

Then Run Modeling

Choose the P adjustement adapted
For this dataset choose FDR cutoff
Choose FDR cutoff 0.1 and 0.05

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains baseMean, log2FoldChange, PValue, pvalueadjust. This table can be downloaded in order to use it for other analysis. The MA Plot show you the up-regulated gene LFC>0, and Down regulated gene LFC<0, the differential gene are in red.

Keep this file deseq_results.csv
Change the name of the file according to your analysis.
Example : deseq_results_FDR_01.csv to be able to recognize the different criteria used.

Questions :

Do you have the sample number of defferential gene for a FDR cutoff 0.1 and 0.05?
According to you, what is the best cutoff?
When you change the order of the Term1 and Term2, what are the consequence on the Results Table?

Step 7 : Visualisation Heatmap, Correlation, PCA

You can make correlation between the control and the rep in order to identify library bias if exists. We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix. The heatmap result corresponds to what we know about the data set. First, the samples in group 1 and 2 come from the control and the repetition, so the two groups are very different. Second, since the samples in each group are technical replicates, the within group variance is very low.

Correlation between sample and control

Go to Correlation.
1 Pairwaise Scatterplot show the pairwaise comparison between your samples.
2 Sample correlation with 3 methods, pearson, sperman or kendal, with the agglomeration method show you how are linked your samples.
3 Feature Correlation represented with a heatMap ordered by variance of the expression.

PCA

To separate your sample a PCA is way to make a dimension reduction.

Question :

Can you describe the structure of your sample, with a correlation analysis, and a PCA?

Practice 8 : Compare list of DE genes with EdgeR and DESeq2

Practice 8 will be performed with Venn Diagramm implemented on PIVOT.

Compare lists of DE genes with the two approches.
Go to Toolkit.
Upload your lists of gene obtained with edegR edgeR_results_FDR_01.csv , and DESeq2 deseq_results_FDR_01.csv.
Upload the first list then Add List and upload the second list.
You can see the Venn diagram and download the common list of gene between the both methods.

Questions:

Look at the expression values for a gene found DE with EdgeR and not with DESeq2, and vice-versa, give the pvalue of each gene?

Some other tools are available to compare 2 lists of gene. Venny.

Practice 9 : Hierarchical Clustering

Practice 9 will be performed with PIVOT.

Connect to your PIVOT interface.
Go to Clustering.
For each analysis EdgeR or DESeq2 specify the Data Input (count, log…).
Choose the distance Euclidean or an other, the Agglomeration method Wardand the number of cluster.

License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).

Summary

Practice 0. Going to the cluster - ssh,srun,scp

Connection into cluster through ssh mode

Opening an interactive bash session on a node srun -p partition --pty bash -i

Prepare input files

Practice 1. Check Reads Quality

1.1 Cleanning reads with trimmomatic

Running trimmomatic for each sample

1.2 Check quality after trimming

Practice 2. Indexing references files

2.1 indexing a reference file

Practice 3. Mapping against transcriptome reference + counting with Kallisto

3.1 Kallisto

3.2 Examine your data and your experimental replicates before DE

Practice 4 : Mapping against annotated genome reference with Hisat2 + counting with Stringtie

Running Hisat2 and Stringtie with TOGGLe

Now that we have our assembled transcripts, we can estimate their abundances.

prepare data : in a new directory

Recovery annotations

Stringtie -e -B

gffcompare

Convert to counts table

Transfer data from /scratch to your home on cluster

Practice 5 : Visualization of mapped reads against genes using IGV

6. Differential Expression Analysis (DE) using scripts associated with Trinity software

6.1 Identify differentially expressed genes between the two tissues.

Practice 7 : Differential expression analysis using PIVOT

Intallation of PIVOT

Install PIVOT

Launch PIVOT

Introduction

Step 1 : Data Input

Step 2 : Input Design Information

step 3 : Feature Filtering

Step 4 : Select samples

Step 5 : Basic statistics

First part with EdgeR on PIVOT

Step 6 : Differential Expression with EdgeR

Second part with DESeq2 on PIVOT

Step 6 : Differential Expression with DESeq2

Step 7 : Visualisation Heatmap, Correlation, PCA

Correlation between sample and control

PCA

Practice 8 : Compare list of DE genes with EdgeR and DESeq2

Practice 9 : Hierarchical Clustering

Links

License

Practice 0. Going to the cluster - `ssh,srun,scp`

Connection into cluster through `ssh` mode

Opening an interactive bash session on a node `srun -p partition --pty bash -i`