data : NCBI SRA database under accession number SRS307298 S. cerevisiae
Genome size of S. cerivisiae : 12M (12.157.105) (https://www.yeastgenome.org/strain/S288C#genome_sequence)
In this session, we will analyze RNA-seq data from one sample of S. cerevisiae (NCBI SRA
SRS307298). It is from two different origin (CENPK and Batch), with three biological replications for each
origin (rep1, rep2 and rep3).
Connection to the i-Trop Cluster through ssh mode
We will work on the i-Trop Cluster with a “supermem” node using SLURM scheduler.
Opening an interactive bash session on the node25 (supermem partition) - srun -p partition --pty bash -i
Read this survival document containig basic commands to SLURM (https://southgreenplatform.github.io/trainings/slurm/)
Prepare input files
Create your subdirectory in the scratch file system /scratch. In the following, please replace X with your own user ID number in formationX.
Copy the exercise files from the shared location to your scratch directory (it is essential that all
calculations take place here)
When the files transfer is finished, verify by listing the content of the current directory and the subdirectory RAWDATA with the
command ls -al. You should see 12 gzipped read files in a listing, the samples.txt file and the run_trinity.sh bash script.
1. Check Reads Quality
FastQC perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Multiqc is a modular tool to aggregate results from bioinformatics analyses across many samples into a single report. Use this tool to visualise results of quality. https://multiqc.info/
In this practice, reads quality is ok. You need to observe sequences and check biases. To remove adaptors and primers you can use Trimmomatic. Use PRINSEQ2 to detect Poly A/T tails and low complexity reads. Remove contaminations with SortMeRNA, riboPicker or DeconSeq.
2. Performing a de novo RNA-Seq assembly with trinity
2.1 Running Trinity with trimmomatic and reads normalisation
Preparing assembly sample file and check parameters of trinity assembler
Observe run_trinity.sh script and adapt to your formation number. This script is not sending in sbatch mode because it could be take time in this training. If you need launch it in a slurm cluster you can use this version containing SBATCH commands run_trinity.slurm or adapt it to SGE.
Running assembly :
All screen output (info messages and error messages, if any) will be saved in the file
../trinity.log. The script will start executing in the background (the & at the end), so
that the terminal will return to the prompt right after you hit “Enter”.
You can use jobs to monitor jobs but if you logout the program does not keep running.
While running you can examine steps tail -f ../trinity.log.
After trimmomatic and reads normalisation, Three stages are done by Trinity
Jellyfish: Extracts and counts K-mers (K=25) from reads
Inchworm: Assembles initial contigs by “greedily” extending sequences with most abundant K-mers
Chrysalis: Clusters overlapping Inchworm contigs, builds de Bruijn graphs for each cluster, partitions reads between clusters
Butterfly: resolves alternatively spliced and paralogous transcripts independently for each cluster (in parallel)
Ending up assembly and downloading assembly results
WARNING !: This job is expected to run 12 hours. Kill your job using `fg` and `ctl+c`
Recover assembly results (only Trinity.fasta) generated by trinity from the shared location to your scratch directory :
Upon successful completion of Trinity, the assembled transcriptome is written to the FASTA file called
Trinity.fasta located in the output directory /scratch/formationX/TRINITY_OUT.
3. Assessing transcriptome assembly quality
3.1 Getting basic Assembly metrics with the trinity script TrinityStats.pl
Running trinity script
Trinity.fasta contains transcripts to be evaluated, annotated, and used in downstream analysis of
expression. In this exercise, we only concentrate on basis statistics of the assembled transcriptome,
which can be obtained using a Trinity utility script TrinityStats.pl.
Parsing the results generated by the script
The output file generated (trinityStats.log) will contain basic information about contig length distributions, based on all transcripts and only on the longest isoform per gene.
Besides average and median contig lengths, also given are quantities N10 through N50. Nx is the smallest contig length such that (x/100)% of all assembled bases are in contigs longer than Nx. Specifically, N50 is the contig length such that half of all assembly sequence is contained in contigs longer than that. In whole genome assembly, N50 is often used as a measure (one of many) of assembly quality, since the longer the contigs, the better the assembly. In the case of transcriptome, contig lengths should be correct, which does not imply “large”. If it falls in the right ballpark (about 1000-1,500), N50 can still be used as a check on overall “sanity” of the transcriptome assembly.
3.2 Reads mapping back rate and abundance estimation using the trinity script align_and_estimate_abundance.pl
Read congruency is an important measure in determining assembly accuracy. Clusters of read pairs that align incorrectly are strong indicators of mis-assembly. A typical “good” assembly has ~80% reads mapping to the assembly and ~80% are properly paired.
Several tools can be used to calculate reads mapping back rate over Trinity.fasta assembly : bwa, bowtie2 (mapping), kallisto, salmon (pseudo mapping). Quantify read counts for each gene/isoform can be calculate. Mapping and quantification can be obtained by using the –est_method argument into the align_and_estimate_abundance.pl script.
We will performing this analyses step successively with the align_and_estimate_abundance.pl script :
Pseudomapping methods (kallisto or salmon) are faster than mapping based. So firstly we will use salmon to pseudoalign reads from sample to the reference and quantify abondance.
Then, we will use bowtie2 and RSEM to align and quantify read counts for each gene/isoform.
Preparing data and environment for our following analysis
Launch Salmon
Check reads percentage mapped to Trinity.fasta reference by sample for salmon and bowtie-rsem est method.
OPTIONAL: Launch bowtie2-rsem
WARNING !: this job can take a lot ~1h30 by sample. This step will not run in this practice.
Here, we show you results with bowtie2-rsem OPTIONAL part
3.3. Expression matrix construction
Combine read count from all samples into a matrix, and normalize the read count using the TMM method. This command will take in RSEM output files from each sample, and combine them into a single matrix file.
You have to obtain two matrices:
The firts one containing the estimated counts ̀Trinity_trans.isoform.counts.matrix and the second one containing the TPM expression values that are cross-sample normalized using the TMM method Trinity_trans.TMM.EXPR.matrix.
TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property.
In both files, each column represents a sample, and each row represents a gene, the values are either the raw read counts or normalized FPKM values. The “counts” file will be used for differentially expressed gene identification, and the “fpkm” file will be used for clustering analysis. By default, the fpkm file is normalized with TMM method.
3.4 Compute N50 based on the top-most highly expressed transcripts (Ex50)
TRANSFERT: Observe plots. Remember, you have to transfert *.pdf files to your home before of transfering into your local machine.
3.5 Quantifying completness using BUSCO
Assessing gene space is a core aspect of knowing whether or not you have a good assembly.
Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes. BUSCO groups were selected from each major radiation of the species phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade.
Running Busco on Trinity.fasta assembly
Displaying the short summary table generated by BUSCO
3.6 BLASTX comparison to known protein sequences database
Performing a blastx againt the swissprot database (the manually annotated and reviewed section of the UniProt Knowledgebase)
WARNING !: This step will not run in this practice because it will be inclued in annotation script !! But we give you commands lines to lauch with your data.
Analyzing blast results
8616 transcripts from Trinity.fasta file were blasted against the swissprot database.
440 hits were reported in SwissProt_1E20_Trinity_blastx.outfmt6 file.
In SwissProt_1E20_Trinity_blastx.outfmt6.grouped.output file we can observed for example that 242 sequences were found with a 100% identity to an uniprot protein (count_in_bin). bin_below column represent a accumulative number of sequences.
4. Differential Expression Analysis (DE)
4.1 Examine your data and your experimental replicates before DE
Before differential expression analysis, examine your data to determine if there are any confounding issues. Trinity comes with a ‘PtR’ script that we use to simplify making various charts and plots based on a matrix input file. Run these three commands lines :
Run PtR scripts
TRANSFERT : Observe plots. Remember, you have to transfert *.pdf files to your home before of transfering into your local machine.
4.2 Identify differentially expressed genes between the two tissues.
The tool run_DE_analysis.pl is a PERL script that use Bioconductor package edgeR.
The output files are in the directory edgeR_results. Observe file Trinity_trans.isoform.counts.matrix.Batch_vs_CENPK.edgeR.DE_results. It provides several values for each gene:
1) FDR to indicate whether a gene is differentially expressed or not
2) logFC is the log2 transformed fold change between the two tissues
3) logCPM is the log2 transformed normalized read count of average of the samples.
Usually you have to filter this list of genes/isoforms to FDR <0.05 or below. To be more conservative, you could also use more stringent FDR cutoff (e.g. <0.001), and only keep genes with high logFC (e.g. <-2 and >2) and/or high logCPM (e.g. >1). In the edgeR_results directory there is also a “volcano plot” to visualize the distribution of the DE genes.
4.3 Clustering analysis
Hierarchical clustering and k-means clustering for samples and genes can be done using trinity scripts.
The clustering will be performed only on differentially expressed genes, with FDR and logFC cutoff defined by -P and -C parameters.
In this example, we set K=4 for k-means analysis. The genes will be separated into 4 groups based on expression pattern.
There are two prefiltered files produced: *DE_results.P1e-3_C2.Batch-UP.subset and *DE_results.P1e3_C2.CENPK-UP.subset, with differentially expressed genes (FDR cutoff 0.001, logFC cutoff 2 and -2).
Before finish …
Go to here to recover the scripts used in this training.
License
The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).