1.2. Quality trimming and adapter removal using Trimmomatic
If you are confident in your read quality and trimming parameters, you can run Trimmomatic and then assemble the reads directly with Trinity.
The same applies to normalisation.
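As a sketch, a typical paired-end Trimmomatic run looks like the following. File names, the adapter file and the trimming thresholds are placeholders to adapt to your own data:

```shell
# Paired-end trimming: clip TruSeq adapters, remove low-quality leading and
# trailing bases, apply a sliding-window quality filter, drop short reads.
java -jar trimmomatic.jar PE -threads 8 -phred33 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
    sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```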
1.3. Removing Ribosomal RNA using sortmerna
i. Indexing the rRNA databases
ii. Merging reads
iii. Filtering out rRNA from reads
iv. Unmerging reads
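The four steps above can be sketched as follows, assuming SortMeRNA 2.x; the bundled merge/unmerge helper scripts and the database files are placeholders that may differ in your installation:

```shell
# i. Index an rRNA database (repeat for each database you use)
indexdb_rna --ref silva-euk-18s.fasta,silva-euk-18s.idx

# ii. Interleave the paired reads into a single file
merge-paired-reads.sh sample_R1.fq sample_R2.fq sample_merged.fq

# iii. Filter: rRNA reads go to 'rRNA_reads', everything else to 'non_rRNA_reads'
sortmerna --ref silva-euk-18s.fasta,silva-euk-18s.idx \
    --reads sample_merged.fq --paired_in --fastx \
    --aligned rRNA_reads --other non_rRNA_reads -a 8

# iv. Split the rRNA-free interleaved file back into R1/R2
unmerge-paired-reads.sh non_rRNA_reads.fq sample_clean_R1.fq sample_clean_R2.fq
```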
1.4. Normalisation using Trinity
If you don’t have biological replicates, you can directly run a standalone normalisation of the reads per sample.
If you have biological replicates, you can run the Trinity assembly with the option --normalize_by_read_set in section 2, or give the R1 and R2 reads for each condition to insilico_read_normalization.pl :
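A sketch of a per-condition call to insilico_read_normalization.pl; TRINITY_HOME, the memory budget and the coverage target are placeholders:

```shell
# Normalise the pooled R1/R2 reads of one condition to a maximum coverage of 30
$TRINITY_HOME/util/insilico_read_normalization.pl \
    --seqType fq --JM 50G --max_cov 30 --CPU 8 \
    --left  condA_rep1_R1.fq,condA_rep2_R1.fq \
    --right condA_rep1_R2.fq,condA_rep2_R2.fq \
    --pairs_together --PARALLEL_STATS \
    --output condA_normalized
```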
2. Generating a Trinity de novo RNA-Seq assembly
You can assemble reads from a single sample :
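For a single sample, a minimal Trinity command might look like this (memory, CPU and file names are placeholders):

```shell
# De novo assembly of one paired-end sample
Trinity --seqType fq --max_memory 50G --CPU 8 \
    --left  sample_R1.fq.gz --right sample_R2.fq.gz \
    --output trinity_sample_out
```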
If you want to assemble reads using all the samples of a species (e.g. several tissues of one species without biological replicates), OR
if you have biological replicates in your experiment and you want to obtain one transcriptome per condition :
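In that case the usual approach is to describe the samples in a samples file and let Trinity read it. A sketch:

```shell
# Assemble all samples/replicates listed in samples.txt in a single run
Trinity --seqType fq --max_memory 50G --CPU 8 \
    --samples_file samples.txt \
    --output trinity_all_samples_out
```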
Remember that it is possible to run Trimmomatic, normalisation and assembly in a single command line :
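For instance (a sketch; the same samples.txt drives all three steps):

```shell
# --trimmomatic runs quality/adapter trimming first, and
# --normalize_by_read_set normalises each read set separately
# before the joint assembly.
Trinity --seqType fq --max_memory 50G --CPU 8 \
    --samples_file samples.txt \
    --trimmomatic --normalize_by_read_set \
    --output trinity_full_pipeline_out
```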
samples.txt file example (tab-delimited file)
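One way to create such a file: four tab-separated columns per line (condition, replicate name, left reads, right reads); all file names below are placeholders.

```shell
# Four tab-separated columns: condition, replicate, R1 file, R2 file
printf 'cond_A\tcond_A_rep1\tA_rep1_R1.fq.gz\tA_rep1_R2.fq.gz\n'  > samples.txt
printf 'cond_A\tcond_A_rep2\tA_rep2_R1.fq.gz\tA_rep2_R2.fq.gz\n' >> samples.txt
printf 'cond_B\tcond_B_rep1\tB_rep1_R1.fq.gz\tB_rep1_R2.fq.gz\n' >> samples.txt

# Sanity check: every line must have exactly 4 tab-separated fields
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad }' samples.txt && echo "samples.txt OK"
```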
2.1. Evaluating the quality of the assembly
Read mapping-back rate :
A typical ‘good’ assembly has ~80% of reads mapping to the assembly, with ~80% properly paired.
We suggest visualising the mapping with IGV. Retrieve the BAM and Trinity.fasta files and import them into the IGV browser. You must index the BAM files beforehand, using samtools index on each BAM.
If you don’t have replicates and you only want to map the reads against the transcriptome obtained by Trinity, use :
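A sketch using bowtie2 and samtools; index names, read files and thread counts are placeholders:

```shell
# Build a bowtie2 index on the assembly, align the reads, and produce a
# coordinate-sorted, indexed BAM that can be loaded into IGV.
bowtie2-build Trinity.fasta Trinity.fasta
bowtie2 -p 8 -q --no-unal -k 20 -x Trinity.fasta \
    -1 sample_R1.fq.gz -2 sample_R2.fq.gz 2> align_stats.txt \
    | samtools sort -@ 8 -o bowtie2.coordSorted.bam
samtools index bowtie2.coordSorted.bam
```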
To get alignment statistics, run the following:
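For example, assuming the alignment step produced a BAM called bowtie2.coordSorted.bam and captured the bowtie2 log in align_stats.txt (both names are placeholders):

```shell
# bowtie2 writes its overall alignment rate to stderr, captured in the log
cat align_stats.txt
# samtools flagstat reports the fraction of mapped and properly paired reads
samtools flagstat bowtie2.coordSorted.bam
```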
Expression matrix construction
You have to obtain two matrices: the first one containing the estimated counts, and the second one containing the TPM expression values that are cross-sample normalized using the TMM method (Trinity_trans.TMM.EXPR.matrix). TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property.
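A sketch of the matrix construction with abundance_estimates_to_matrix.pl, assuming salmon was used for abundance estimation; the per-sample quant.sf paths are placeholders:

```shell
# Builds both the counts matrix and the TMM-normalised TPM expression matrix
$TRINITY_HOME/util/abundance_estimates_to_matrix.pl \
    --est_method salmon \
    --gene_trans_map Trinity.fasta.gene_trans_map \
    --name_sample_by_basedir \
    --out_prefix Trinity_trans \
    cond_A_rep1/quant.sf cond_A_rep2/quant.sf \
    cond_B_rep1/quant.sf cond_B_rep2/quant.sf
```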
Compute N50 based on the top-most highly expressed transcripts (Ex50)
If you want to know how many transcripts correspond to the Ex90 peak, you could:
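Both steps can be sketched as follows; note that the exact column layout of the ExN50 stats file varies between Trinity versions, so check the header before adapting the filter:

```shell
# Compute the ExN50 statistics from the TMM expression matrix and the assembly
$TRINITY_HOME/util/misc/contig_ExN50_statistic.pl \
    Trinity_trans.TMM.EXPR.matrix Trinity.fasta > ExN50.stats

# Inspect the Ex90 line; one of its columns reports the number of transcripts
awk '$1 ~ /90/' ExN50.stats
```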
To avoid redundant transcripts, we kept the longest isoform for each “gene” identified by Trinity (unigene) using the get_longest_isoform_seq_per_trinity_gene.pl utility in Trinity:
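For instance (output file name is a placeholder):

```shell
# Keep only the longest isoform per Trinity 'gene'
$TRINITY_HOME/util/misc/get_longest_isoform_seq_per_trinity_gene.pl \
    Trinity.fasta > Trinity.longest_isoform.fasta
```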
Validation using Transrate
Validation using BUSCO
Validation using BLASTX
First, we downloaded and indexed the database:
Then, we ran BLASTX to get the top match hit:
Finally, we examined the percent of alignment coverage:
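The three steps above can be sketched as follows; the SwissProt URL and the e-value threshold are the usual ones for this kind of analysis, but verify them against the current UniProt release:

```shell
# 1. Download and index SwissProt as a protein BLAST database
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -dbtype prot

# 2. BLASTX the assembly, keeping the single best hit per transcript
blastx -query Trinity.fasta -db uniprot_sprot.fasta \
    -out blastx.outfmt6 -evalue 1e-20 \
    -num_threads 8 -max_target_seqs 1 -outfmt 6

# 3. Summarise the percent alignment coverage of the top database hits
$TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl \
    blastx.outfmt6 Trinity.fasta uniprot_sprot.fasta
```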
If you generate assemblies at a range of different read depths up to and including your assembly leveraging all available reads, you can perform this full-length transcript analysis separately for each of your assemblies, and then plot the number of full-length transcripts vs. number of input RNA-Seq fragments.
Extracting differentially expressed transcripts and generating heatmaps
Extract those differentially expressed (DE) transcripts that are at least 4-fold differentially expressed (-C 2, i.e. a fold change of 2^2 = 4) at a significance of <= 0.001 (-P 1e-3) in any of the pairwise sample comparisons
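A sketch of the corresponding call, run from the directory containing the DE analysis results; matrix and samples file names are placeholders:

```shell
# -C 2 means a minimum 2^2 = 4-fold change; -P 1e-3 is the significance cutoff
$TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
    --matrix Trinity_trans.TMM.EXPR.matrix \
    --samples samples.txt \
    -P 1e-3 -C 2
```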
Extract transcript clusters by expression profile by cutting the dendrogram
Extract clusters of transcripts with similar expression profiles by cutting the transcript cluster dendrogram at a given percent of its height (ex. 60%), like so:
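For example; the .RData file is produced by the DE extraction step, and its exact name depends on the -P and -C values used there:

```shell
# Cut the transcript dendrogram at 60% of its height to define clusters
$TRINITY_HOME/Analysis/DifferentialExpression/define_clusters_by_cutting_tree.pl \
    --Ptree 60 -R diffExpr.P1e-3_C2.matrix.RData
```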
Most downstream analyses should be applied to the entire set of assembled transcripts, including functional annotation and differential expression analysis.
If you decide that you want to filter transcripts to exclude those that are lowly expressed, you can use the following script:
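A hedged sketch; the minimum expression threshold of 1 TPM is only an example value:

```shell
# Discard transcripts whose expression never reaches 1 TPM in any sample
$TRINITY_HOME/util/filter_low_expr_transcripts.pl \
    --matrix Trinity_trans.TMM.EXPR.matrix \
    --transcripts Trinity.fasta \
    --min_expr_any 1 > Trinity.filtered.fasta
```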
3. Functional annotation of transcripts using Trinotate and predicting coding regions using TransDecoder
Transcripts assembled using Trinity can easily be annotated using Trinotate https://github.com/Trinotate/Trinotate.github.io/wiki.
Trinotate uses different methods for functional annotation, including homology search against known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (SignalP/TMHMM), and it takes advantage of annotation databases (eggNOG/GO/KEGG). These data are integrated into an SQLite database, which makes it possible to create an annotation report for a transcriptome.
Two bash scripts were created to obtain all the files required to build the SQLite database and create the reports.
The first one, trinotate-JAv1.0.sh https://github.com/julieaorjuela/scripts/blob/master/trinotate-JAv1.0.sh, takes as input a directory containing the fasta files you want to annotate. It generates three directories (Trinotate, sh and trash) and a submitQsub.sge file that launches every fasta analysis in job-array mode. The sh directory contains the scripts created automatically for every fasta file, the Trinotate directory contains the annotation results, and the trash directory contains the log files for every step of the process.
To understand the steps run by trinotate-JAv1.0.sh, we can view a script generated from the HNglobal fasta file as an example :
Building a Sqlite database and report
The second bash script, build_Sqlite_trinotate_database_and_report-JAv1.2.0.sh https://github.com/julieaorjuela/scripts/blob/master/build_Sqlite_trinotate_database_and_report-JAv1.2.0.sh, takes as input the assembled transcripts and the directory containing all the results obtained by trinotate-JAv1.0.sh in the previous step.