RNA Seq differential expression

This page describes a serie of tools and linux commands used to manipulate raw data (fastq file) for differential expression.

We need, in this tutorial:

a directory with fastq files
a reference file used for the mapping step.

Author(s)

Authors	Gaetan Droc
Research Unit	UMR AGAP
Institut

Keywords

fastqc, cutadapt, hitsat2, stringtie, ballgown, bedtools

Files format

fastq, sam, bam, bed

Date

16/03/2017

Summary

Performing a quality control check with fastqc
Using cutadapt to remove adapters and to trim reads based on quality
Mapping reads with hisat2
Convert and sort SAM to BAM with samtools
Transcript assembly and quantification with StringTie

Performing a quality control check with `fastqc`

FastQC perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

`fastqc` command

For one file

[droc@cc2-login RNASeq_fastq_MGX]$ mkdir FastQC
[droc@cc2-login RNASeq_fastq_MGX]$ qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir FastQC Leaf_CGATGT_L006_R2_017.fastq

Loop on every file for a directory and print command on terminal

[droc@cc2-login RNASeq_fastq_MGX]$ for i in *fastq; do echo qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir ~/work/FastQC $i;done
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_001.fastq
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_002.fastq
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_003.fastq
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_004.fastq
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_005.fastq
qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir /homedir/droc/work/FastQC Leaf_CGATGT_L006_R1_006.fastq

Loop on every file for a directory and write command on a file (cmd_fastqc.sh)

[droc@cc2-login RNASeq_fastq_MGX]$ for i in *fastq; do echo qsub -b y -q normal.q -N fastqc -V fastqc --format fastq --outdir ~/work/FastQC $i >> cmd_fastqc.sh;done

[droc@cc2-login RNASeq_fastq_MGX]$ ./cmd_fastqc.sh

HTML Report

[droc@cc2-login RNASeq_fastq_MGX]$ cd FastQC
Leaf_CGATGT_L006_R2_017_fastqc.html  Leaf_CGATGT_L006_R2_017_fastqc.zip

Using `cutadapt` to remove adapters and to trim reads based on quality

Cutadapt is a tool specifically designed to remove adapters from NGS data. https://code.google.com/p/cutadapt/

`cutadadapt` command

[droc@cc2-login RNASeq_fastq_MGX]$ module load bioinfo/cutadapt/1.8.1
[droc@cc2-login RNASeq_fastq_MGX]$ for i in *fastq; do echo qsub -q normal.q -b yes -V -N CUTADAPT cutadapt -a AGATCGGAAGAGCG -O 10 -q 30,30 -f fastq -m 30 -o /work/droc/sugarcane/cutadapt/$i.cutadapt.fastq /work/NGSwaiting4newNAS/sugarcane/BackupCarine/RNASeq/RNASeq_fastq_MGX/$i >> cmd_cutadapt.sh ;done
./cmd_cutadapt.sh

-q 30, 30 : by default, only the 3’ end of each read is quality-trimmed. If you want to trim the 5’ end as well, use the -q option with two comma-separated cutoffs

Mapping reads with `hisat2`

https://ccb.jhu.edu/software/hisat2/index.shtml

There are several steps involved in mapping sequence reads.

Creating an index of the reference genome if necessary

[droc@cc2-login RNASeq_fastq_MGX]$ module load bioinfo/hisat2/2.0.5
[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -b y -q normal.q -N index hisat2-build reference_genome reference_genome

Performing mapping

[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -b y -q bigmem.q -N hisat2 hisat2 -x reference_genome -1 Leaf_CGATGT_L006_R1_001.fastq -2  Leaf_CGATGT_L006_R2_001.fastq -S Leaf_CGATGT_L006.sam

Here is a description for the contents of the SAM file: https://samtools.github.io/hts-specs/SAMv1.pdf

Do the same thing for all the library. For this you can use and adapt this Perl script

[droc@cc2-login RNASeq_fastq_MGX]$ nedit fastq2tab.pl&

#!/usr/bin/perl
use File::Basename;
# Get the directory of fastq
my $pwd = shift;

# List all of the file
open(IN,"ls $pwd/*fastq|");
my %fastq;
while(<IN>){
    chomp;
    my $file = fileparse($_);
    # In this example, $file = Leaf_CGATGT_L006_R1_006.fastq
    # Split the name by _
    my ($tissu,$library,$strand,$number) = (split(/\_/,$file))[0,2,3,4];
    # Create an uniq identifier $tag
    my $tag = join("_",$tissu,$library,$number);
    # Create a hash with 2 keys, the uniq name ($tag) and strand (R1 or R2)
    $fastq{$tag}{$strand} = $_;
}
# Next foreach
foreach my $tag (keys %fastq) {
    if ($fastq{$tag}{R1} && $fastq{$tag}{R2}) {
        print "qsub -V -b y -q bigmem.q -N hisat2 hisat2 -x reference_genome -1 $fastq{$tag}{R1} -2  $fastq{$tag}{R2} -S $tag.sam\n";
    }
}

[droc@cc2-login RNASeq_fastq_MGX]$ perl fastq2tab.pl /work/NGSwaiting4newNAS/sugarcane/BackupCarine/RNASeq/RNASeq_fastq_MGX/ > cmd_hisat2.sh
[droc@cc2-login RNASeq_fastq_MGX]$ ./cmd_hisat2.sh

Convert and sort SAM to BAM with `samtools`

http://samtools.sourceforge.net/

[droc@cc2-login RNASeq_fastq_MGX]$ module load bioinfo/samtools/1.3
[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -q normal.q -N samtools -b y "samtools view -bS Leaf_CGATGT_L006.sam -o Leaf_CGATGT_L006.bam"
[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -q normal.q -N samtools -b y "samtools sort Leaf_CGATGT_L006.bam -o Leaf_CGATGT_L006_sorted.bam"

To run on batch, you can use the command for

[droc@cc2-login RNASeq_fastq_MGX]$ for i in *sam; do echo qsub -V -q normal.q -N samtools -b y samtools view -bS $i -o $i.bam >> cmd_samtools.sh;done;
[droc@cc2-login RNASeq_fastq_MGX]$ rm -f *sam
[droc@cc2-login RNASeq_fastq_MGX]$ for i in *bam; do echo qsub -V -q normal.q -N samtools -b y samtools sort $i -o $i.sorted.bam >> cmd_samtools_sorted.sh;done;

Transcript assembly and quantification with `StringTie`

https://ccb.jhu.edu/software/stringtie/

[droc@cc2-login RNASeq_fastq_MGX]$ module load bioinfo/stringtie/1.2.1
[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -b y -q normal.q -N stringtie stringtie  -p 16 -o Leaf_CGATGT_L006.gtf Leaf_CGATGT_L006_sorted.bam

For all sample

[droc@cc2-login RNASeq_fastq_MGX]$ for i in *sorted.bam; do echo qsub -V -b y -q normal.q -N stringtie stringtie  -p 16 -o $i.gtf $i >> cmd_stringtie.sh;done

Merge transcripts from all sample

[droc@cc2-login RNASeq_fastq_MGX]$ qsub -V -b y -q normal.q -N merge stringtie --merge -p 8 -o stringtie_merged.gtf mergelist.txt

Here mergelist.txt is a text file that has the names of the gene transfer format (GTF) files created in the previous step, with each file name on a single line.

RNA Seq differential expression

Author(s)

Keywords

Files format

Date

Summary

Performing a quality control check with fastqc

fastqc command

Using cutadapt to remove adapters and to trim reads based on quality

cutadadapt command

Mapping reads with hisat2