Transcripts assembled with Trinity can be easily annotated using Trinotate (https://github.com/Trinotate/Trinotate.github.io/wiki).
Trinotate uses several methods for functional annotation, including homology search against known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), and protein signal peptide and transmembrane domain prediction (signalP/tmHMM). It also draws on annotation databases (eggNOG/GO/KEGG). These data are integrated into a SQLite database, from which an annotation report for the transcriptome can be generated.
Two bash scripts were written to produce all the files required to build the SQLite database and create the reports.
0. Connection to the i-Trop Cluster through ssh mode
We will work on the i-Trop cluster, on a “supermem” node, using the SLURM scheduler.
Connection to supermem partition
Connect to node25 (supermem partition) without opening an interactive bash session
Prepare input files
1. Trinotate pipeline: first part (run_trinotate.slurm script)
Let’s run run_trinotate.slurm to predict ORFs and search for sequence homology and conserved domains, among other analyses.
WARNING: this job can run for about 12 h.
You can copy the results with scp from `nas:/data2/formation/TP-trinity/TRINITY_OUT/ANNOTATION/results_sacharomyces` to your `/scratch/formationX/ANNOTATION` directory.
What does this script do? The most important steps are explained here:
Determining longest Open Reading Frames (ORFs)
The first step in annotating transcripts is to determine their open reading frames (ORFs), which will then be annotated. Use TransDecoder to identify likely coding sequences in the following steps:
Running TransDecoder is a two-step process. First run the step that identifies all long ORFs (TransDecoder.LongOrfs), then the step that predicts which of those ORFs are likely to be coding (TransDecoder.Predict). Once you have the predicted protein sequences, you can start looking for sequence or domain homologies.
Now, run the step that predicts which ORFs are likely to be coding.
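To make concrete what "identifying long ORFs" means, here is a toy sketch in Python. It is not TransDecoder's algorithm: it only scans the three forward reading frames for stretches running from an ATG to the first in-frame stop codon, whereas TransDecoder also handles the reverse strand, partial ORFs and coding-potential scoring. The function names and the `min_len_nt` threshold are hypothetical and chosen for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len_nt=30):
    """Return (start, end, frame) tuples for ATG..stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    ORFs that reach the end of the sequence without a stop are ignored here.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG of the current open stretch
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len_nt:
                    orfs.append((start, end, frame))
                start = None  # look for the next ATG in this frame
    return orfs


def longest_orf(seq, min_len_nt=30):
    """Return the longest ORF found by find_orfs, or None if there is none."""
    orfs = find_orfs(seq, min_len_nt)
    return max(orfs, key=lambda orf: orf[1] - orf[0], default=None)
```

For example, `longest_orf(transcript_sequence)` returns the coordinates of the longest complete forward-strand ORF of a single transcript, which is the kind of candidate that the second step then evaluates for coding potential.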
Sequence homology searches from predicted protein sequences
Now, let’s look for sequence homologies by searching with just our predicted protein sequences rather than with the entire transcripts.
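Once the protein search has finished, a common follow-up is to keep the best hit per predicted protein. The sketch below assumes the search wrote BLAST+ tabular output (`-outfmt 6`, 12 tab-separated columns). The source does not state which format run_trinotate.slurm uses, so treat this as an assumption. The example rows are invented.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Keep, for each query sequence, the hit with the highest bit score."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Invented example rows: identifiers and scores are illustrative only.
example = io.StringIO(
    "orf1\tsubjectA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "orf1\tsubjectB\t60.0\t90\t36\t0\t1\t90\t5\t94\t1e-10\t60\n"
    "orf2\tsubjectC\t75.5\t120\t29\t1\t3\t122\t1\t120\t2e-30\t130\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

With a real result file, replace the `io.StringIO` object with `open(path)` on the tabular output.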
Search conserved domains
Using our predicted protein sequences, let’s also run a HMMER search against the Pfam database to identify conserved domains that might be indicative or suggestive of function.
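The domain hits can then be summarised per protein. This sketch assumes the search was run with hmmscan and its per-domain table output (`--domtblout`), in which the target is the Pfam domain and the query is the protein. The file name, the `max_ievalue` cutoff and the example rows are hypothetical.

```python
import io
from collections import defaultdict


def domains_per_protein(handle, max_ievalue=1e-5):
    """Group hmmscan --domtblout hits by protein (query name).

    Column indices follow the HMMER domain table layout:
    0 = target name, 1 = target accession, 3 = query name,
    12 = independent E-value of the domain.
    """
    domains = defaultdict(list)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.split(None, 22)  # the description (last field) may contain spaces
        if len(cols) < 22:
            continue  # not a complete domain table row
        domain_name, accession, protein = cols[0], cols[1], cols[3]
        i_evalue = float(cols[12])
        if i_evalue <= max_ievalue:
            domains[protein].append((domain_name, accession, i_evalue))
    return dict(domains)


# Invented rows in domtblout layout: names and values are illustrative only.
example = io.StringIO(
    "# target name accession tlen query name ...\n"
    "ToyDomainA PF00000.0 264 orf1 - 300 1e-60 200.1 0.0 1 1 1e-63 2e-60 "
    "199.5 0.0 1 264 10 280 10 281 0.95 Toy domain A\n"
    "ToyDomainB PF00000.1 80 orf1 - 300 0.5 10.2 0.1 1 1 0.01 0.2 "
    "9.8 0.1 5 60 285 298 284 299 0.70 Toy domain B\n"
)
pfam = domains_per_protein(example)
print(pfam)
```

Only the first toy domain passes the cutoff here. A looser or stricter threshold is a design choice that depends on how conservative the annotation should be.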
Computational prediction of sequence features
Signal peptide search
The signalP and tmhmm software tools are very useful for predicting signal peptides (secretion signals) and transmembrane domains, respectively.
Question: How many of your proteins are predicted to contain signal peptides?
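One way to answer this question is to count the distinct protein identifiers in the signalP result file. The sketch below assumes a GFF-style, tab-separated output with one line per protein that has a predicted signal peptide and `#` for comment lines. The exact layout depends on the signalP version and options used by the script, so check your file first. The example lines are invented.

```python
import io


def count_signal_peptides(handle):
    """Count distinct sequence IDs (first column) in a GFF-style signalP file."""
    ids = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        ids.add(line.split("\t")[0])
    return len(ids)


# Invented GFF-style lines: identifiers and scores are illustrative only.
example = io.StringIO(
    "##gff-version 2\n"
    "orf1\tSignalP\tSIGNAL\t1\t22\t0.85\t.\t.\tYES\n"
    "orf7\tSignalP\tSIGNAL\t1\t18\t0.61\t.\t.\tYES\n"
)
n_signal = count_signal_peptides(example)
print(n_signal, "proteins with a predicted signal peptide")
```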
Running RNAMMER to detect rRNA
The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project.
2. Trinotate pipeline: second part (annotation report script)
Now we need to allocate 10 GB of RAM and 2 CPUs with srun to continue with this practical.
WARNING: don't forget that you can scp the results of the first part of this practical from `nas:/data2/formation/TP-trinity/TRINITY_OUT/ANNOTATION/results_sacharomyces`!
Loading results into a Trinotate SQLite database and generating a report.
Generating a Trinotate annotation report involves first loading all of our bioinformatics computational results into a Trinotate SQLite database. The Trinotate software provides a boilerplate SQLite database called Trinotate.sqlite that comes pre-populated with a lot of generic data about SWISSPROT records and Pfam domains. This database is then populated with all the results computed earlier, together with the expression data, to build the final report.
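Because the result is an ordinary SQLite file, it can be inspected with any SQLite client. The sketch below uses Python's standard `sqlite3` module to list the tables of a database and count the rows of one of them. It makes no assumption about Trinotate's actual schema: the demo runs on a throwaway database with a hypothetical `toy_hits` table.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the table names of a SQLite database, sorted alphabetically.

    Note: sqlite3.connect creates an empty file if db_path does not exist.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


def row_count(db_path, table):
    """Return the number of rows in one table of the database."""
    if table not in list_tables(db_path):
        raise ValueError(f"unknown table: {table}")
    con = sqlite3.connect(db_path)
    try:
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    finally:
        con.close()
    return count


# Demo on a throwaway database; 'toy_hits' is NOT part of Trinotate's schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.sqlite")
    con = sqlite3.connect(demo_db)
    con.execute("CREATE TABLE toy_hits (protein TEXT, hit TEXT)")
    con.executemany("INSERT INTO toy_hits VALUES (?, ?)",
                    [("orf1", "subjectA"), ("orf2", "subjectC")])
    con.commit()
    con.close()
    demo_tables = list_tables(demo_db)
    demo_rows = row_count(demo_db, "toy_hits")
    print(demo_tables, demo_rows)
```

On the cluster, `list_tables("Trinotate.sqlite")` run before and after loading is a quick way to see which tables exist and whether they have been filled.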
Run the second bash script, build_sqlite_trinotateDB.sh. This script takes as input the directory containing all the results produced by run_trinotate.slurm (option -r) and the file of transcripts assembled by Trinity (option -f).
What does the build_sqlite_trinotateDB.sh script run?
The report can be found in the sacharomyces_annotation_report_filtered.xls file. For details on the generated report, see https://github.com/Trinotate/Trinotate.github.io/wiki/Loading-generated-results-into-a-Trinotate-SQLite-Database-and-Looking-the-Output-Annotation-Report
To visualise the GO annotations, go to the WEGO site (http://wego.genomics.org.cn/) and import your sacharomyces_go_annotations_rfm.txt file after replacing the commas with tabs.
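The comma-to-tab replacement can be sketched in Python as follows. It assumes each line holds an identifier followed by a comma-separated list of GO terms and that no other field contains commas. The function names and the output file name in the usage note are hypothetical.

```python
def commas_to_tabs(line):
    """Replace every comma in a line with a tab character."""
    return line.replace(",", "\t")


def convert_file(src_path, dst_path):
    """Write a copy of src_path to dst_path with commas replaced by tabs."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(commas_to_tabs(line))
```

For example, `convert_file("sacharomyces_go_annotations_rfm.txt", "sacharomyces_go_annotations_wego.txt")` writes a tab-separated copy and leaves the original file untouched.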
License
The resource material is licensed under the Creative Commons Attribution 4.0 International License.