Goal of the exercice
Annotate the genes of the whole Arabidopsis genome from the following dataset with Eugene-EP appliance / VM from IFB cloud
Lineage (full): cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis
116M TAIR_genome.fasta (135 Mbp 5 chr)
30M TAIR_est2.fasta
RIKEN Arabidopsis full-length cDNA clones (RAFL clones) http://epd.brc.riken.jp/en/pdna/rafl_clones
8.3M uniprot_sp_viridiplantae_not_camelineae_short_header.fna
UniprotKB Swiss-Prot taxonomy:”Viridiplantae [33090]” NOT taxonomy:”Camelineae [980083]” => 23706
63M uniprot_trembl_brassiceae_short_header.fna
UniprotKB TrEMBL taxonomy:”Brassiceae 981071” => 171467
Prerequisites
You need an account on the IFB cloud with an ACTIVE SSH-KEY. If you don’t have any please refer to the following documentation:
Create a user account on the IFB cloud & ACTIVATE YOUR SSH-KEY (if not already done)
You need to be connected to an EuGène appliance with 8 CPU et 32 Go de RAM . If it is not the case, please refer to the following documentation:
Launch a Eugene-EP appliance on the IFB cloud
EGNEP run
From an Eugene IFB cloud appliance
1) Preparing your data
From your the root terminal of your appliance, index the databanks
cd /root/bank_tair/
makeblastdb -in TAIR_est2.fasta -dbtype nucl
makeblastdb -in repbase20.05_aaSeq_cleaned_TE.fa -dbtype prot
makeblastdb -in uniprot_sp_viridiplantae_not_camelineae_short_header.fna -dbtype prot -parse_seqids
makeblastdb -in uniprot_trembl_brassiceae_short_header.fna -dbtype prot -parse_seqids
2) Running & checking
Run EGNEP (without –no_red option)
cd
nohup time $EGNEP /bin/int/egn-euk.pl --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root/work_dir/ > & pipeline.txt &
Check the progress of the run
less pipeline.txt
tail -f pipeline.txt
################################################################################
# /usr/bin/egnep-1.4/bin/int/egn-euk.pl --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root/work_dir/
# EuGene Pipeline EUK - version 1.4
# EUGENEDIR /usr/bin/eugene-4.2a
# EGNEP /usr/bin/egnep-1.4
# Log file /root/work_dir/logger.1536944185.2045.txt
################################################################################
################################################################################
Create tree.....................................................................started
Create tree.....................................................................done
######################### Protein database cleaning ##########################
##################### Protein sequence similarity search #####################
BlastX uniprot_sp_viridiplantae_not_camelineae_short_header.fna uniprot_trembl_brassiceae_short_header.fnastarted
BLASTX PARAMETERS = -outfmt 6 -evalue 0.01 -gapopen 9 -gapextend 2 -max_target_seqs 500000 -max_intron_length 15000 -seg yes
UBLAST PARAMETERS = -threads 8 -evalue 1 -lopen 9 -lext 2 -accel
Understanding EGNEP run
1) What are the value of environment variable $EGNEP and $EUGENEDIR ?
echo $EGNEP
/usr/bin/egnep-1.4
echo $EUGENEDIR
/usr/bin/eugene-4.2a
2) Where is the EGN-EP configuration file and how to set the data parameters ?
gedit bank_tair/egnep-test.cfg &
blastx_db_list = 1 2
blastx_db_1_file = /root/bank_tair/uniprot_sp_viridiplantae_not_camelineae_short_header.fna
blastx_db_2_file = /root/bank_tair/uniprot_trembl_brassiceae_short_header.fna
est_list = 1
est_1_file = /root/bank_tair/TAIR_est2.fasta
repeat_sequence_db = /root/bank_tair/repbase20.05_aaSeq_cleaned_TE.fa
3) Where is the EGN-EP executable?
/usr/bin/egnep-1.4/bin/int/egn-euk.pl
4) Where is the Log file?
/root/work_dir/logger.1536944185.2045.txt
Understanding EGNEP results
EGN-EP
1) Where are the analysis results of Chr4?
/root/work_dir/0001/Chr4
ls * .gff3
Chr4.blast1.gff3 Chr4.est1.gff3 Chr4.masked.blastrep.gff3 Chr4.repet_noexpressed_nosimprot.gff3 Chr4.rnammer.gff3
Chr4.blast2.gff3 Chr4.ltrharvest.gff3 Chr4.red.gff3 Chr4.rfamscan.gff3 Chr4.trnascan.gff3
2) Where are the eugenev0.par and eugenev1.par, what are the difference?
/root/work_dir/egn_param
diff eugenev0.par eugenev1.par
< Sensor.AnnotaStruct.use 3
---
> AnnotaStruct.FileExtension[3] repet_noexpressed_nosimprot
> AnnotaStruct.TranscriptFeature[3] transcript
> AnnotaStruct.Start* [ 3] 0
> AnnotaStruct.StartType[3] s
> AnnotaStruct.Stop* [ 3] 0
> AnnotaStruct.StopType[3] s
> AnnotaStruct.Acc* [ 3] 0
> AnnotaStruct.AccType[3] s
> AnnotaStruct.Don* [ 3] 0
> AnnotaStruct.DonType[3] s
> AnnotaStruct.TrStart* [ 3] 0
> AnnotaStruct.TrStartType[3] s
> AnnotaStruct.TrStop* [ 3] 0
> AnnotaStruct.TrStopType[3] s
> AnnotaStruct.TrStartNpc* [ 3] 0
> AnnotaStruct.TrStartNpcType[3] s
> AnnotaStruct.TrStopNpc* [ 3] 0
> AnnotaStruct.TrStopNpcType[3] s
> AnnotaStruct.Exon* [ 3] 0
> AnnotaStruct.Intron* [ 3] 1
> AnnotaStruct.CDS* [ 3] 0
> AnnotaStruct.npcRNA* [ 3] 0
> AnnotaStruct.Intergenic* [ 3] 2
> AnnotaStruct.format[3] GFF3
> Sensor.AnnotaStruct.use 4
3) Where are the sensor priorities of the eugenev1.par?
Sensor.MarkovIMM 1
Sensor.SignalWAM 10
Sensor.AnnotaStruct 30
Sensor.BlastX 20
Sensor.Est 20
Sensor.MarkovIMM.use 1
Sensor.SignalWAM.use 2 ( acceptor, donor)
Sensor.AnnotaStruct.use 4 ( trna, rrna, ncrna, repeat)
Sensor.Est.use 1
Sensor.BlastX.use 1
4) Where is the report file?
/root/output_dir
more report.1536944185.2045.txt
## Transcriptome mapping information
Nb transcriptome seq_number mapped_sequence_number( raw gmap result) mapped_filtered_sequence_number( after filtering) mapped_filtered_seque
nce__percentage
1 /root/bank_tair/TAIR_est2.fasta 20683 20668 20625 99.7
## Splicing sites read in the training dataset
Canonical acceptor AG 72636 sites
Canonical donor GT 71768 sites
Non canonical donor GC 771 sites 1.1% of the canonical site number
## Arabidopsis Thaliana specific repeat domains
File = /root/work_dir/db/SpeciesRepeatDomain.fa
Repeat domain number = 1274 Repeat domain length = 684374 nt ( 0.6% of genomic sequences)
## LTR masking
LTR region length = 2054544 nt ( 1.7% of genomic sequences)
## Red repeat predictions
Red region length = 23443746 nt ( 19.6% of genomic sequences)
## Repeat regions (LTR + species specific repeat domains, where no expression and no protein similarity)
Repeat region number = 10735 Repeat region length = 21406343 nt ( 17.9% of genomic sequences)
5) Where are the general statistics?
/root/output_dir
root@machine068d5c96-f666-4443-aea3-7c6d0c83170a:~/output_dir# more sequences.general_statistics.xls
Number of nucleotides ( without 'N' ) 119482012
Per cent GC 36.06
Total number of genes 25968
Total nucleotides ( bp) 51117321
** Protein coding genes
Number of protein coding genes 23786
Mean gene length ( bp) 2132.27
Coding nucleotides ( bp) 29924318
Per cent genes with introns 78
Per cent genes with five UTR 59
Per cent genes with three UTR 65
Exons
Mean number per gene 5.22
Mean length ( bp) 280.80
GC per cent 42.85
Introns
Mean number per gene 4.22
Mean length ( bp) 157.66
GC per cent 32.52
CDS
Mean length ( bp) 1258.06
Min length ( bp) 123.00
Max length ( bp) 15234.00
GC per cent 44.14
five_prime_UTR
Mean length ( bp) 131.84
GC per cent 38.45
three_prime_UTR
Mean length ( bp) 201.22
GC per cent 33.01
** Non protein coding genes
Number of non protein coding genes 2182
Mean ncRNA gene length ( bp) 182.93
Min length ( bp) 39
Max length ( bp) 7615
GC per cent 46.45
Per cent ncRNA genes with introns 0
Mean exon number per ncRNA gene 1.00
** Intergenic ( inter protein-coding genes)
Mean length 2639.79
GC per cent 33.26
6) Where are the gene annotation file and the polypeptide sequence file?
/root/output_dir
root@machine068d5c96-f666-4443-aea3-7c6d0c83170a:~/output_dir# more sequences.gff3
##gff-version 3
##sequence-region Chr1 1 30427671
Chr1 EuGene gene 3634 5894 . + . ID = gene:Chr1g0000001; Name = Chr1g0000001
Chr1 EuGene mRNA 3634 5894 . + . ID = mRNA:Chr1g0000001; Name = Chr1g0000001; Parent = gene:Chr1g0000001
Chr1 EuGene exon 3634 3913 . + . ID = exon:Chr1g0000001.1; Parent = mRNA:Chr1g0000001
Chr1 EuGene exon 3996 4276 . + 2 ID = exon:Chr1g0000001.2; Parent = mRNA:Chr1g0000001
Chr1 EuGene exon 4486 4605 . + 0 ID = exon:Chr1g0000001.3; Parent = mRNA:Chr1g0000001
Chr1 EuGene exon 4706 5095 . + 0 ID = exon:Chr1g0000001.4; Parent = mRNA:Chr1g0000001
Chr1 EuGene exon 5174 5326 . + 0 ID = exon:Chr1g0000001.5; Parent = mRNA:Chr1g0000001
Chr1 EuGene exon 5439 5894 . + 0 ID = exon:Chr1g0000001.6; Parent = mRNA:Chr1g0000001
Chr1 EuGene five_prime_UTR 3634 3759 . + . ID = five_prime_UTR:Chr1g0000001.0; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0
.0
Chr1 EuGene CDS 3760 3913 . + 0 ID = CDS:Chr1g0000001.1; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene CDS 3996 4276 . + 2 ID = CDS:Chr1g0000001.2; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene CDS 4486 4605 . + 0 ID = CDS:Chr1g0000001.3; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene CDS 4706 5095 . + 0 ID = CDS:Chr1g0000001.4; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene CDS 5174 5326 . + 0 ID = CDS:Chr1g0000001.5; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene CDS 5439 5630 . + 0 ID = CDS:Chr1g0000001.6; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons = 0.0
Chr1 EuGene three_prime_UTR 5631 5894 . + . ID = three_prime_UTR:Chr1g0000001.12; Parent = mRNA:Chr1g0000001; est_cons = 100.0; est_incons
= 0.0
grep -c '>' sequences_prot.fna
23786
Eugene
7) Where to find and what is the command line to run eugene?
/root/work_dir
more logger.1536944185.2045.txt
export PARALOOP = /usr/bin/egnep-1.4/bin/ext/paraloop ; /usr/bin/egnep-1.4/bin/ext/paraloop/bin/paraloop.pl --clean --wait --ncpus = 7 --interleaved --program = Shell --input /root/work_dir/annotationV1//raw_eugene/EGN_ANNOT_1536944185.2045/eugene.cmd.paraloop --output /root/work_dir/annotationV1//raw_eugene/EGN_ANNOT_1536944185.2045/paraloop.output --clean > /dev/null 2>&1
more /root/work_dir/annotationV1//raw_eugene/EGN_ANNOT_1536944185.2045/eugene.cmd.paraloop
export EUGENEDIR = /usr/bin/eugene-4.2a; /usr/bin/eugene-4.2a/bin/eugene -A /root/work_dir/egn_param//eugenev1.par -m /root/work_dir/egn_param//eugene.mat -pg
-O /root/work_dir/annotationV1//raw_eugene/ /root/work_dir/0001/Chr1/Chr1 > /root/work_dir/annotationV1//raw_eugene/Chr1.eugene.stdout 2> /root/work_dir/ann
otationV1//raw_eugene/Chr1.eugene.stderr
export EUGENEDIR = /usr/bin/eugene-4.2a; /usr/bin/eugene-4.2a/bin/eugene -A /root/work_dir/egn_param//eugenev1.par -m /root/work_dir/egn_param//eugene.mat -pg
-O /root/work_dir/annotationV1//raw_eugene/ /root/work_dir/0001/Chr5/Chr5 > /root/work_dir/annotationV1//raw_eugene/Chr5.eugene.stdout 2> /root/work_dir/ann
otationV1//raw_eugene/Chr5.eugene.stderr
8) Where are the intron parameters ?
/usr/bin/eugene-4.2a/models
root@machine068d5c96-f666-4443-aea3-7c6d0c83170a:/usr/bin/eugene-4.2a/models# more intron.dist
40 0.0
41 0.0
If you change it you need to recompile (make; make install)
9) Where are the splice signal WAM files?
/usr/bin/eugene-4.2a/models/WAM/plant
Busco run
run_BUSCO.py -i output_dir/sequences_prot.fna -o BUSCO_output -sp arabidopsis -c 4 -l bank_tair/embryophyta_odb9 -m proteins
10) What are the Busco sumary results?
INFO ****************** Start a BUSCO 3.0.2 analysis, current time : 09/20/2018 08:54:17 ******************
INFO Configuration loaded from /usr/bin/BUSCO/config/config.ini
INFO Init tools...
INFO Check dependencies...
INFO Check input file...
INFO To reproduce this run: python /usr/bin/BUSCO/scripts/run_BUSCO.py -i output_dir/sequences_prot.fna -o BUSCO_output -l /usr/bin/BUSCO/scripts/LINEAGE/embryophyta_odb9/ -m proteins -c 4
INFO Mode is: proteins
INFO The lineage dataset is: embryophyta_odb9 ( eukaryota)
INFO Temp directory is ./tmp/
INFO Running HMMER on the proteins:
INFO [ hmmsearch] 144 of 1440 task( s) completed at 09/20/2018 08:56:25
INFO [ hmmsearch] 288 of 1440 task( s) completed at 09/20/2018 08:59:06
INFO [ hmmsearch] 432 of 1440 task( s) completed at 09/20/2018 09:00:54
INFO [ hmmsearch] 576 of 1440 task( s) completed at 09/20/2018 09:02:27
INFO [ hmmsearch] 720 of 1440 task( s) completed at 09/20/2018 09:03:28
INFO [ hmmsearch] 864 of 1440 task( s) completed at 09/20/2018 09:03:58
INFO [ hmmsearch] 1008 of 1440 task( s) completed at 09/20/2018 09:04:12
INFO [ hmmsearch] 1152 of 1440 task( s) completed at 09/20/2018 09:04:26
INFO [ hmmsearch] 1296 of 1440 task( s) completed at 09/20/2018 09:04:36
INFO [ hmmsearch] 1440 of 1440 task( s) completed at 09/20/2018 09:04:44
INFO Results:
INFO C:93.2%[S:91.9%,D:1.3%],F:3.2%,M:3.6%,n:1440
INFO 1342 Complete BUSCOs ( C)
INFO 1324 Complete and single-copy BUSCOs ( S)
INFO 18 Complete and duplicated BUSCOs ( D)
INFO 46 Fragmented BUSCOs ( F)
INFO 52 Missing BUSCOs ( M)
INFO 1440 Total BUSCO groups searched
INFO BUSCO analysis done with WARNING( s) . Total running time : 628.484469891 seconds
INFO Results written in /root/run_BUSCO_output/
MyGenomeBrowser
If you need to launch myGenomeBrowser, run the script script_myGenomeBrowser.sh (/root) (login/password generated)
TO COMPLETE SSB
EGNEP errors
1) Variables
The environment variables should be already set
# export EUGENEDIR=/usr/bin/eugene-4.2a
# export EGNEP=/usr/bin/egnep-1.4
2) Index databanks
The database “/root/bank_tair/uniprot-thaliana_swiss2.fasta” does not exist or isn’t indexed. (Use ‘makeblastdb’ program to index).
The database “/root/bank_tair/uniprot-thaliana_trembl2.fasta” does not exist or isn’t indexed. (Use ‘makeblastdb’ program to index).
For software licence reasons, transfer the transposable element polypeptide file, for instance
Downloads SIDIBEBOCS$ scp repbase20.05_aaSeq_cleaned_TE.fa root@134.158.247.40:/root/bank_tair/
3) Program missing
=>The value of the parameter prg_rnammer is >/usr/bin/egnep-1.4/bin/ext/rnammer< which is not a name of an existing and non empty file at /usr/bin/egnep-1.4/bin/int/egn-euk.pl line 2207.
Command exited with non-zero status 25
cd /usr/bin/egnep-1.4/bin/ext/
lrwxrwxrwx 1 root root 17 Aug 27 15:14 bedtools2 -> bedtools2-2.24.0/
drwxrwxr-x 11 root root 4096 Aug 27 15:12 bedtools2-2.24.0
drwxr-xr-x 4 root root 4096 Aug 27 15:12 bin
-rwxrwxr-x 1 339 ubuntu 55318 Feb 24 2017 BioFileConverter.pl
drwxrwxr-x 5 339 ubuntu 4096 Feb 24 2017 blast-2.2.26
-rwxrwxr-x 1 339 ubuntu 2509 Feb 24 2017 convert_rfam2gff3.pl
-rwxrwxr-x 1 339 ubuntu 2280 Feb 24 2017 convert_rnammer2gff3.pl
-rwxrwxr-x 1 339 ubuntu 2813 Feb 24 2017 convert_trnascan2gff3.pl
lrwxrwxrwx 1 root root 17 Aug 27 15:12 genometools -> genometools-1.5.6
drwxrwxr-x 15 root root 4096 Aug 27 15:12 genometools-1.5.6
lrwxrwxrwx 1 root root 15 Aug 27 15:22 gmap -> gmap-2017-02-15
drwxr-xr-x 7 11414 1279 4096 Aug 27 15:22 gmap-2017-02-15
drwxrwxr-x 13 990287 93203 4096 Mar 4 2015 hmmer-3.1b2-linux-intel-x86_64
drwxr-xr-x 3 root root 4096 Aug 27 15:12 include
drwxr-xr-x 10 41650 93203 4096 Aug 27 15:08 infernal-1.1.1
drwxr-xr-x 3 2314 cdrom 4096 Jul 4 2006 lib
-rwxr-xr-x 1 2314 cdrom 0 Jul 24 2007 LICENSE
-rwxrwxr-x 1 339 ubuntu 9931 Feb 24 2017 lipm_bed_filter.pl
-rwxrwxr-x 1 339 ubuntu 3351 Feb 24 2017 lipm_bed_split_by_sequence.pl
-rwxrwxr-x 1 339 ubuntu 26402 Feb 24 2017 lipm_bed_to_expr.pl
-rwxrwxr-x 1 339 ubuntu 6591 Feb 24 2017 lipm_bed_to_gff3.pl
-rwxrwxr-x 1 339 ubuntu 4284 Feb 24 2017 lipm_dbprot_remove_repbase.pl
-rwxrwxr-x 1 339 ubuntu 1600 Feb 24 2017 lipm_fasta2overlappingwins.pl
-rwxrwxr-x 1 339 ubuntu 15028 Feb 24 2017 lipm_fasta2tree.pl
-rwxrwxr-x 1 339 ubuntu 10965 Feb 24 2017 lipm_fastafilter.pl
-rwxrwxr-x 1 339 ubuntu 3700 Feb 24 2017 lipm_fastasplitter.pl
-rwxrwxr-x 1 339 ubuntu 18852 Feb 24 2017 lipm_genome_statistics.pl
-rwxrwxr-x 1 339 ubuntu 12125 Feb 24 2017 lipm_m8_to_gff3.pl
-rwxrwxr-x 1 339 ubuntu 5885 Feb 24 2017 lipm_m8tom8plus.pl
-rwxrwxr-x 1 339 ubuntu 7770 Feb 24 2017 lipm_N50.pl
-rwxrwxr-x 1 339 ubuntu 1047496 Feb 24 2017 lipm_nrdb
-rwxrwxr-x 1 339 ubuntu 1172 Feb 24 2017 lipm_smp.pl
-rwxrwxr-x 1 339 ubuntu 13886 Feb 24 2017 lipm_transfer_gff3_attributes.pl
-rwxrwxr-x 1 339 ubuntu 10551 Feb 24 2017 lipm_wig_to_expr.pl
-rwxrwxr-x 1 339 ubuntu 9792 Feb 24 2017 MapWithBlast.pl
lrwxrwxrwx 1 root root 18 Aug 27 15:14 ncbi-blast -> ncbi-blast-2.2.31+
drwxr-xr-x 4 11236 13030 4096 Jun 2 2015 ncbi-blast-2.2.31+
lrwxrwxrwx 1 339 ubuntu 13 Feb 24 2017 paraloop -> paraloop-1.3/
drwxrwxr-x 7 339 ubuntu 4096 Feb 24 2017 paraloop-1.3
lrwxrwxrwx 1 root root 9 Aug 27 15:22 red -> redUnix64
drwxr-x--- 2 10194 10194 4096 Jun 18 2015 redUnix64
-rwxr-xr-x 1 2314 cdrom 0 Aug 27 15:22 rnammer
-rw-r--r-- 1 root root 0 Feb 5 2015 rnammer-1.2.src.tar.Z
-rwxr-xr-x 1 2314 cdrom 8849 Aug 27 15:22 rnammere
drwxr-xr-x 4 root root 4096 Aug 27 15:10 share
lrwxrwxrwx 1 339 ubuntu 17 Feb 24 2017 tRNAscan-SE -> tRNAscan-SE-1.3.1
drwxr-x--- 5 2841 users 4096 Aug 27 15:08 tRNAscan-SE-1.3.1
-rw-r--r-- 1 root root 740960 Jun 16 07:54 tRNAscan-SE.tar.gz
lrwxrwxrwx 1 root root 24 Aug 27 15:22 usearch -> usearch9.2.64_i86linux32
-rwxr-xr-x 1 root root 0 Aug 27 15:22 usearch9.2.64_i86linux32
-rwxr-xr-x 1 2314 cdrom 0 Feb 6 2007 xml2fsa
-rwxr-xr-x 1 2314 cdrom 0 May 22 2007 xml2gff
# rm rnammer
# mv rnammere rnammer
=>The value of the parameter prg_usearch is >/usr/bin/egnep-1.4/bin/ext/usearch< which is not a name of an existing and non empty file at /usr/bin/egnep-1.4/bin/int/egn-euk.pl line 2307.
Command exited with non-zero status 25
scp sidibebocs@cc2-login.cirad.fr:/homedir/sidibebocs/work/ganoderma/egnep-1.4/bin/ext/usearch9.2.64_i86linux32 .
4) Red error
You can try with the no_red argument will disable the repeat-masking and thus will require less memory to run.
However, it is not recommanded to use this argument as it will potentially have negative effect on gene prediction.
nohup $EGNEP /bin/int/egn-euk.pl --no_red --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root
/work_dir > & pipeline.txt &
5) Empty result file?
# more pipeline.txt
nohup : ignoring input
################################################################################
################################################################################
# /usr/bin/egnep-1.4/bin/int/egn-euk.pl --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root/work_dir/
# EuGene Pipeline EUK - version 1.4
# EUGENEDIR /usr/bin/eugene-4.2a
# EGNEP /usr/bin/egnep-1.4
# Log file /root/work_dir/logger.1535964229.7865.txt
################################################################################
################################################################################
Create tree.....................................................................started
Create tree.....................................................................done
######################### Protein database cleaning ##########################
##################### Protein sequence similarity search #####################
BlastX uniprot-thaliana_swiss2.fasta uniprot-thaliana_trembl2.fasta.............started
BLASTX PARAMETERS = -outfmt 6 -evalue 0.01 -gapopen 9 -gapextend 2 -max_target_seqs 500000 -max_intron_length 15000 -seg yes
UBLAST PARAMETERS = -threads 8 -evalue 1 -lopen 9 -lext 2 -accel 1
/usr/bin/egnep-1.4/bin/ext/gmap/bin/gmap_build -d sequences -D /root/work_dir/db/GMAP_INDEX /root/work_dir/sequences 2> /root/work_dir/db/GMAP_INDEX/gmap_idx.7865.stde
BlastX uniprot-thaliana_swiss2.fasta uniprot-thaliana_trembl2.fasta.............done
########################### Transcriptome mapping ############################
Gmap TAIR_est2.fasta............................................................started
PARAMETERS = -n0 -B 5 -t 8 -L 100000 --min-intronlength = 35 -K 25000 --trim-end-exons = 25
FILTERS = EST length percentage > 50, identity percentage > 95
Gmap TAIR_est2.fasta............................................................done
############################# IMM model building #############################
Build IMM models................................................................started
BlastX TAIR_est2.fasta.filterlen300 uniprot-thaliana_swiss2.fasta...........started
PARAMETERS = -outfmt 6 -evalue 0.01 -gapopen 9 -gapextend 2 -max_target_seqs 500000 -max_intron_length 15000 -seg yes
BlastX TAIR_est2.fasta.filterlen300 uniprot-thaliana_swiss2.fasta...........done
BLASTX FILTERS = HSP_length > 100 AA, identity percentage > 50, e-value > 0.0001
Gmap TAIR_est2.fasta.filterlen300...........................................started
PARAMETERS = -n0 -B 5 -t 8 -L 100000 --min-intronlength = 35 -K 25000 --trim-end-exons = 25
FILTERS = EST length percentage > 95, identity percentage > 95
ERROR: no data to train eugene IMM ( because no result for mapping of the reference transcriptome to the genomic sequence) : choose an other reference transcriptome and launch again.
Gmap TAIR_est2.fasta.filterlen300...........................................done
EGNEP kill
You know the process identifier (PID = 26448)
# nohup $EGNEP/bin/int/egn-euk.pl --no_red --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root/work_dir >& pipeline.txt &
[ 1] 26448
You can see all the subprocesses
ps -edf | grep egn
root 4507 26367 0 16:11 pts/0 00:00:00 grep --color = auto egn
root 26448 26367 0 15:47 pts/0 00:00:01 /usr/bin/perl /usr/bin/egnep-1.4/bin/int/egn-euk.pl --no_red --indir /root/input_dir/ --outdir /root/output_dir/ --cfg /root/bank_tair/egnep-test.cfg --workingdir /root/work_dir
root 31805 26697 0 15:55 pts/0 00:00:00 /usr/bin/perl /usr/bin/egnep-1.4/bin/int/get_BlastX.pl --sequence /root/work_dir/0001/Chr1/Chr1 --cfg /root/bank_tair/egnep-test.cfg --db /root/work_dir/db/uniprot-thaliana_trembl2.fasta --outfile /root/work_dir/0001/Chr1/Chr1.blast2 --workingdir /root/work_dir/0001/Chr1/work.1536421660.26448/
root 31806 31805 0 15:55 pts/0 00:00:00 /usr/bin/perl /usr/bin/egnep-1.4/bin/ext/MapWithBlast.pl --sequence /root/work_dir/0001/Chr1/Chr1 --db /root/work_dir/db/uniprot-thaliana_trembl2.fasta --output /root/work_dir/0001/Chr1/Chr1.blast2 --workingdir /root/work_dir/0001/Chr1/work.1536421660.26448/ --cfg /root/bank_tair/egnep-test.cfg
root 31807 31806 0 15:55 pts/0 00:00:00 sh -c export PARALOOP = /usr/bin/egnep-1.4/bin/ext/paraloop ; /usr/bin/egnep-1.4/bin/ext/paraloop/bin/paraloop.pl --clean --wait --ncpus = 7 --interleaved --program = Shell --input /root/work_dir/0001/Chr1/work.1536421660.26448//BlastX.31806.1536422142/Chr1_cmd.31806.1536422142 --output /root/work_dir/0001/Chr1/work.1536421660.26448//BlastX.31806.1536422142/Chr1_cmd.31806.1536422142.output --clean
root 31808 31807 0 15:55 pts/0 00:00:00 /usr/bin/perl /usr/bin/egnep-1.4/bin/ext/paraloop/bin/paraloop.pl --clean --wait --ncpus = 7 --interleaved --program = Shell --input /root/work_dir/0001/Chr1/work.1536421660.26448//BlastX.31806.1536422142/Chr1_cmd.31806.1536422142 --output /root/work_dir/0001/Chr1/work.1536421660.26448//BlastX.31806.1536422142/Chr1_cmd.31806.1536422142.output --clean
You need to kill at least
# kill -9 26448
# kill -9 26697
Before rerunning
# rm pipeline.txt
# rm -fr work_dir
# mkdir work_dir
How to install EGNEP
See Eugene page of the Elixir GAA wiki