Handsome bacterial comparative genomics#

First commands for preparing the working environment#

First create your working directory:

mkdir -p /scratch3/TP_bcg/$USER/training_bcg
cd /scratch3/TP_bcg/$USER/training_bcg

We can now download the git repository of this training that contains material needed for the training.

git clone https://github.com/SouthGreenPlatform/training_bacterial_comparative_genomics.git

We also need some raw data:

mkdir raw_data
cd raw_data
wget https://panexplorer.southgreen.fr/data/mysample.fastq.gz

Question

Using a unix command, count how many reads are in the raw Fastq file?

Solution

zgrep -c 'barcode=' mysample.fastq.gz

Genome Assembly (from Oxford Nanopore Technologies (ONT) long reads) (using Flye)#

We start by creating and moving into a directory dedicated for the task:

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/assembly
cd /scratch3/TP_bcg/$USER/training_bcg/assembly

We will use Flye to perform the genome assembly.
Load the appropriate module for flye (module load) or install the tool locally with conda

conda create -n flye -c bioconda flye
conda activate flye

We can now run the assembly

flye --nano-raw /scratch3/TP_bcg/$USER/training_bcg/raw_data/mysample.fastq.gz -o assembly >>flye.log 2>&1

As this task is a high time consuming step, we can stop the tool with CTRL-C and download the expected output:

cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/data/precomputed_assembly/assembly_precomputed.fasta .

Question

How many contigs could be assembled from reads?

Solution

grep -c ">" assembly_precomputed.fasta

Separate chromosomal and plasmid scaffolds (using MOB-Suite)#

Let's start by creating and moving into a directory dedicated for the task:

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/mob_recon
cd /scratch3/TP_bcg/$USER/training_bcg/mob_recon

We will run MOB-suite using a singularity image.

Run the mob_recon utility using singularity image (/scratch3/TP_bcg/mob_suite_3.0.3.sif) to separate chromosome and plasmid contigs
First, load the appropriate module for running singularity Use the --bind option to bind your current directory on the host to /mnt in the container

module load system/singularity/3.6.0
cp -rf /scratch3/TP_bcg/mob_suite_3.0.3.sif .
cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/data/precomputed_assembly/assembly_precomputed.fasta .
singularity exec --bind /scratch3/TP_bcg/$USER/training_bcg/mob_recon/:/tmp mob_suite_3.0.3.sif mob_recon -i /tmp/assembly_precomputed.fasta -o /tmp/assembly.mob_recon

The results is made by 4 files chromosome.fasta contig_report.txt mobtyper_results.txt plasmid_AD399.fasta

We can have a look at the report:

Question

Looking at the MOB-SUITE report, does our sequenced sample contain plasmid sequence?

Solution

more assembly.mob_recon/contig_report.txt

sample_id   molecule_type   primary_cluster_id  secondary_cluster_id    contig_id   size    gc  md5 circularity_status  rep_type(s) rep_type_accession(s)   relaxase_type(s)    relaxase_type_accession(s)  mpf_type    mpf_type_accession(s)   orit_type(s)    orit_accession(s)   predicted_mobility  mash_nearest_neighbor   mash_neighbor_distance  mash_neighbor_identification    repetitive_dna_id   repetitive_dna_type filtering_reason
assembly_precomputed    chromosome  -   -   contig_1_circular_rotated   4682534 63.833983907004196  cf955654136e9041810a80b8191fa41f    not tested  -   -   MOBP    NC_007507_00032 -   -   -   -   -   -   -   -   -   -   -
assembly_precomputed    plasmid AD399   -   contig_2_circular_rotated   69058   61.861044339540676  475037e138b4ade1836aa05f421c47b8    not tested  rep_cluster_1289    000607__CP000620_00033  MOBP    CP022994_00148  -   -   -   -   -   CP033195    0.0537328   Xanthomonas oryzae pv. oryzae   -   -   -

Genome annotation (using Prokka)#

We will now annotate the chromosome. Let's start by creating and moving into a directory dedicated for the task.

Annotation will be performed on the chromosome file only (not plasmid). A symbolic link can be created to have the chromosome.fasta file in the current directory.

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/prokka
cd /scratch3/TP_bcg/$USER/training_bcg/prokka
ln -s /scratch3/TP_bcg/$USER/training_bcg/mob_recon/assembly.mob_recon/chromosome.fasta

We will use Prokka to perform the genome annotation. Load the appropriate module for prokka (module load) or install the tool locally with conda

module load bioinfo/prokka/1.14.6

or

conda create -n prokka -c conda-forge -c bioconda prokka
conda activate prokka

List the database that will be used by prokka for annotation

prokka --listdb

We can now launch the annotation:

prokka chromosome.fasta --prefix assembly --force --outdir prokka_out >> prokka.log 2>&1

You can now deactivate the prokka conda environment

conda deactivate

Let's have a look at the Genebank output file:

head -40 prokka_out/assembly.gbk

Solution

LOCUS       contig_1_circular_rotated4682534 bp   DNA  linear       10-JUN-2022
DEFINITION  Genus species strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Genus species
ORGANISM  Genus species
            Unclassified.
COMMENT     Annotated using prokka 1.14.6 from
            https://github.com/tseemann/prokka.
FEATURES             Location/Qualifiers
    source          1..4682534
                    /organism="Genus species"
                    /mol_type="genomic DNA"
                    /strain="strain"
    CDS             1..1329
                    /gene="dnaA"
                    /locus_tag="GBGACGDJ_00001"
                    /inference="ab initio prediction:Prodigal:002006"
                    /inference="similar to AA sequence:UniProtKB:P03004"
                    /codon_start=1
                    /transl_table=11
                    /product="Chromosomal replication initiator protein DnaA"
                    /db_xref="COG:COG0593"
                    /translation="MDAWPRCLERLEAEFPPEDVHTWLKPLQAEDRGDSIVLYAPNAF
                    IVDQVRERYLPRIRELLAYFAGNREVALAVGSRPRAPEPEPAPVAATIAPQAAPIAPF
                    AGNLDSHYTFANFVEGRSNQLGLAAAIQAAQKPGDRAHNPLLLYGSTGLGKTHLMFAA
                    GNALRQAKPAAKVMYLRSEQFFSAMIRALQDKAMDQFKRQFQQIDALLIDDIQFFAGK
                    DRTQEEFFHTFNALFDGRQQIILTCDRYPREVEGLEPRLKSRLAWGLSVAIDPPDFET
                    RAAIVLAKARERGAEIPDDVAFLIAKKMRSNVRDLEGALNTLVARANFTGRSITVEFA
                    QETLRDLLRAQQQAIGIPNIQKTVADYYGLQMKDLLSKRRTRSLARPRQVAMALAKEL
                    TEHSLPEIGDAFAGRDHTTVLHACRQIRTLMEADGKLREDWEKLIRKLSE"
    CDS             1607..2389
                    /gene="dnaN_1"
                    /locus_tag="GBGACGDJ_00002"
                    /inference="ab initio prediction:Prodigal:002006"
                    /inference="similar to AA sequence:UniProtKB:Q9I7C4"
                    /codon_start=1
                    /transl_table=11
                    /product="Beta sliding clamp"

Now let's have a look at the GFF output file:

head -10 prokka_out/assembly.gff

Solution

##gff-version 3
##sequence-region contig_1_circular_rotated 1 4682534
contig_1_circular_rotated   Prodigal:002006 CDS 1   1329    .   +   0   ID=GBGACGDJ_00001;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=GBGACGDJ_00001;product=Chromosomal replication initiator protein DnaA
contig_1_circular_rotated   Prodigal:002006 CDS 1607    2389    .   +   0   ID=GBGACGDJ_00002;Name=dnaN_1;db_xref=COG:COG0592;gene=dnaN_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=GBGACGDJ_00002;product=Beta sliding clamp
contig_1_circular_rotated   Prodigal:002006 CDS 2346    2708    .   +   0   ID=GBGACGDJ_00003;Name=dnaN_2;db_xref=COG:COG0592;gene=dnaN_2;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A988;locus_tag=GBGACGDJ_00003;product=Beta sliding clamp
contig_1_circular_rotated   Prodigal:002006 CDS 3777    4883    .   +   0   ID=GBGACGDJ_00004;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=GBGACGDJ_00004;product=DNA replication and repair protein RecF
contig_1_circular_rotated   Prodigal:002006 CDS 4999    7443    .   +   0   ID=GBGACGDJ_00005;eC_number=5.6.2.2;Name=gyrB;db_xref=COG:COG0187;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=GBGACGDJ_00005;product=DNA gyrase subunit B
contig_1_circular_rotated   Prodigal:002006 CDS 7511    8347    .   +   0   ID=GBGACGDJ_00006;inference=ab initio prediction:Prodigal:002006;locus_tag=GBGACGDJ_00006;product=hypothetical protein
contig_1_circular_rotated   Prodigal:002006 CDS 8579    9340    .   +   0   ID=GBGACGDJ_00007;eC_number=3.4.-.-;Name=bepA_1;gene=bepA_1;inference=ab initio prediction:Prodigal:002006,protein motif:HAMAP:MF_00997;locus_tag=GBGACGDJ_00007;product=Beta-barrel assembly-enhancing protease
contig_1_circular_rotated   Prodigal:002006 CDS 9617    10810   .   +   0   ID=GBGACGDJ_00008;inference=ab initio prediction:Prodigal:002006;locus_tag=GBGACGDJ_00008;product=hypothetical protein

Check that prokka assign a functional COG annotation in genbank or gff files.

grep COG prokka_out/assembly.gff | head -5

Solution

contig_1_circular_rotated   Prodigal:002006 CDS 1   1329    .   +   0   ID=GBGACGDJ_00001;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=GBGACGDJ_00001;product=Chromosomal replication initiator protein DnaA
contig_1_circular_rotated   Prodigal:002006 CDS 1607    2389    .   +   0   ID=GBGACGDJ_00002;Name=dnaN_1;db_xref=COG:COG0592;gene=dnaN_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=GBGACGDJ_00002;product=Beta sliding clamp
contig_1_circular_rotated   Prodigal:002006 CDS 2346    2708    .   +   0   ID=GBGACGDJ_00003;Name=dnaN_2;db_xref=COG:COG0592;gene=dnaN_2;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A988;locus_tag=GBGACGDJ_00003;product=Beta sliding clamp
contig_1_circular_rotated   Prodigal:002006 CDS 3777    4883    .   +   0   ID=GBGACGDJ_00004;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=GBGACGDJ_00004;product=DNA replication and repair protein RecF
contig_1_circular_rotated   Prodigal:002006 CDS 4999    7443    .   +   0   ID=GBGACGDJ_00005;eC_number=5.6.2.2;Name=gyrB;db_xref=COG:COG0187;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=GBGACGDJ_00005;product=DNA gyrase subunit B

Using awk for extracting information, prepare two files named "genes_plus.txt" and "genes_minus.txt" for representing genes with Circos, located respectively in positive and negative strand. Circos input file must respect the following format with space separator
Chr start end

Solution

awk {'if ($3 == "CDS" && $7 == "+")print "Chr "$4" "$5'} prokka_out/assembly.gff >genes_plus.txt
awk {'if ($3 == "CDS" && $7 == "-")print "Chr "$4" "$5'} prokka_out/assembly.gff >genes_minus.txt

GC content analysis (SkewIT)#

We will now use SkewIT for analyzing GC Skew.

GC skew is a statistical method for measuring strand-specific guanine overrepresentation.
The GC skew is proven to be useful as the indicator of the DNA leading strand, lagging strand, replication origin, and replication terminal.
The GC skew is positive and negative in the leading strand and in the lagging strand respectively; therefore, it is expected to see a switch in GC skew sign just at the point of DNA replication origin and terminus

GC skew = (G - C)/(G + C)

Let's start by creating and moving into a directory dedicated for the task:

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/skewit
cd /scratch3/TP_bcg/$USER/training_bcg/skewit
ln -s /scratch3/TP_bcg/$USER/training_bcg/mob_recon/chromosome.fasta

Let's start by installing the tool:

git clone https://github.com/jenniferlu717/SkewIT.git

We can now run the analysis:

module load system/python/3.8.12
python SkewIT/src/gcskew.py -i chromosome.fasta -o gcskew.txt -k 500

Let's have a look at the result:

head -5 gcskew.txt

Solution

Sequence    Index   GC Skew (0kb)
contig_1_circular_rotated   0   -0.05952381
contig_1_circular_rotated   500 0.00958466
contig_1_circular_rotated   1000    -0.00327869
contig_1_circular_rotated   1500    -0.01010101

Using awk, prepare a file named "gcskew.circos.txt" for representing GC skew with Circos
Circos input file must respect the following format with space separator
Chr position value

Solution

grep -v 'Sequence' gcskew.txt | awk {'print "Chr "$2" "$2+500" "$3'} > gcskew.circos.txt

Visualize genome annotation (using Circos)#

Create a directory for circos and move into this directory dedicated for the task.

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/circos
cd /scratch3/TP_bcg/$USER/training_bcg/circos

Install Circos by cloning the GitHub repository

git clone https://github.com/vigsterkr/circos.git
cd circos
./install-unix
cd ..

Create a karyotype file for Circos, indicating the name, size (4.6Mb) and color of the chromosome.

echo "chr - Chr 1 0 4600000 black" >karyotype.txt

Copy the template of circos configuration file located in the training material.

cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/data/circos1.conf ./circos1.conf

Edit the Circos configuration file (circos1.conf) to adapt and customize your Circos image.

Launch circos by specifying your circos configuration file with -conf option

circos/bin/circos -conf circos1.conf

cp -rf circos.png circos1.png

This should result in a circular representation of genes along your chromosome as two distinct tracks.
An additional track shows the GC skew.

Exercise

Follow the same process to include a new track for tRNA in your Circos

Retrieve public genomes available for comparison#

Public genomes can be retrived using NCBI : https://www.ncbi.nlm.nih.gov/datasets/genomes/

Questions

How many public genomes of Xanthomonas have been released in Genbank?

How many genomes are complete?

Alternatively, one can use the remote file to explore Genbank releases of genomes. https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

Let's start by creating a new directory "public_genomes" and moving into this directory dedicated for the task:

mkdir -p /scratch3/TP_bcg/$USER/training_bcg/public_genomes
cd /scratch3/TP_bcg/$USER/training_bcg/public_genomes

Using the wget command, retrieve the prokaryotes.txt file hosted at NCBI.

Questions

Can we find the same number of Xanthomonas genomes from this file?  
How many public genomes of Xanthomonas oryzae species have been released in Genbank?  
How many genomes are complete?  
How many public complete genomes of Xanthomonas oryzae, pathovar oryzae are available?

Solution

wget https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

grep "Xanthomonas" prokaryotes.txt

grep "Xanthomonas" prokaryotes.txt | grep -c 'chromosome:'

grep "Xanthomonas oryzae pv. oryzae" prokaryotes.txt | grep -c 'chromosome:'

grep "Xanthomonas oryzae pv. oryzicola" prokaryotes.txt | grep -c 'chromosome:'

Using metadata contained in the file, retrieve the genome files (fasta and annotations) of the first 3 complete genome of Xoo (Xanthomonas oryzae pv. oryzae) (as they appear in the file).
Name each genome file with the prefix Xoo_

Solution

grep "Xanthomonas oryzae pv. oryzicola" prokaryotes.txt | grep 'chromosome:' | head -3 
Xanthomonas oryzae pv. oryzicola    129394  PRJNA248159 248159  Proteobacteria  Gammaproteobacteria 4.43083 64.1    chromosome:NZ_CP007810.1/CP007810.1 -   1   4137    3486    2015/06/08  2021/11/23  Complete Genome Key Laboratory of Agricultural Biodiversity for Plant Disease Management under the Ministry of Education    SAMN02793996    GCA_001021915.1 REPR    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/021/915/GCA_001021915.1_ASM102191v1  -   YM15
Xanthomonas oryzae pv. oryzicola    129394  PRJNA237971 237971  Proteobacteria  Gammaproteobacteria 5.0801  64  chromosome:NZ_CP007221.1/CP007221.1 -   1   4690    3924    2015/03/04  2022/01/21  Complete Genome Cornell University  SAMN02640212    GCA_000940825.1 -   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/940/825/GCA_000940825.1_ASM94082v1   27148456    CFBP7342
Xanthomonas oryzae pv. oryzicola    129394  PRJNA280380 280380  Proteobacteria  Gammaproteobacteria 5.01777 63.9    chromosome:NZ_CP011959.1/CP011959.1 -   1   4564    3815    2015/06/30  2021/11/23  Complete Genome Bogdanove Lab, Plant Pathology and Plant-Microbe Biology, Cornell University    SAMN03612287    GCA_001042835.1 -   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/042/835/GCA_001042835.1_ASM104283v1  -   CFBP7341

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/021/915/GCA_001021915.1_ASM102191v1/GCA_001021915.1_ASM102191v1_genomic.fna.gz -O Xoc_GCA_001021915.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/940/825/GCA_000940825.1_ASM94082v1/GCA_000940825.1_ASM94082v1_genomic.fna.gz -O Xoc_GCA_000940825.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/042/835/GCA_001042835.1_ASM104283v1/GCA_001042835.1_ASM104283v1_genomic.fna.gz -O Xoc_GCA_001042835.fna.gz

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/021/915/GCA_001021915.1_ASM102191v1/GCA_001021915.1_ASM102191v1_genomic.gbff.gz -O Xoc_GCA_001021915.gbff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/940/825/GCA_000940825.1_ASM94082v1/GCA_000940825.1_ASM94082v1_genomic.gbff.gz -O Xoc_GCA_000940825.gbff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/042/835/GCA_001042835.1_ASM104283v1/GCA_001042835.1_ASM104283v1_genomic.gbff.gz -O Xoc_GCA_001042835.gbff.gz

Do the same for retrieving the first 3 complete genomes of Xoc (Xanthomonas oryzae pv. oryzicola) with prefix Xoc_

Solution

grep "Xanthomonas oryzae pv. oryzae" prokaryotes.txt | grep 'chromosome:' | head -3

Xanthomonas oryzae pv. oryzae   64187   PRJNA485465 485465  Proteobacteria  Gammaproteobacteria 4.99067 63.7    chromosome:NZ_CP031697.1/CP031697.1 -   1   4647    3652    2019/02/04  2022/03/27  Complete Genome Martin Luther University Halle-Wittenberg   SAMN09791860    GCA_004136375.1 -   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/136/375/GCA_004136375.1_ASM413637v1  -   ICMP3125
Xanthomonas oryzae pv. oryzae   64187   PRJNA269487 269487  Proteobacteria  Gammaproteobacteria 5.09305 63.7    chromosome:NZ_CP040687.1/CP040687.1 -   1   4757    3787    2019/05/28  2022/03/20  Complete Genome Bacterial Genomics and Evolution Laboratory, CSIR-Institute of Microbial Technology, Sector 39-A, Chandigarh, India SAMN03252586    GCA_001929235.2 -   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/929/235/GCA_001929235.2_ASM192923v2  -   IXO1088
Xanthomonas oryzae pv. oryzae   64187   PRJNA497605 497605  Proteobacteria  Gammaproteobacteria 5.05039 63.7    chromosome:NZ_CP033192.1/CP033192.1 -   1   4713    3695    2019/03/19  2022/04/07  Complete Genome University of Florida   SAMN10261817    GCA_004355825.1 -   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/355/825/GCA_004355825.1_ASM435582v1  -   NX0260

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/136/375/GCA_004136375.1_ASM413637v1/GCA_004136375.1_ASM413637v1_genomic.fna.gz -O Xoo_GCA_004136375.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/929/235/GCA_001929235.2_ASM192923v2/GCA_001929235.2_ASM192923v2_genomic.fna.gz -O Xoo_GCA_001929235.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/355/825/GCA_004355825.1_ASM435582v1/GCA_004355825.1_ASM435582v1_genomic.fna.gz -O Xoo_GCA_004355825.fna.gz

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/136/375/GCA_004136375.1_ASM413637v1/GCA_004136375.1_ASM413637v1_genomic.gbff.gz -O Xoo_GCA_004136375.gbff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/929/235/GCA_001929235.2_ASM192923v2/GCA_001929235.2_ASM192923v2_genomic.gbff.gz -O Xoo_GCA_001929235.gbff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/355/825/GCA_004355825.1_ASM435582v1/GCA_004355825.1_ASM435582v1_genomic.gbff.gz -O Xoo_GCA_004355825.gbff.gz

Convert GenBank annotation files to GFF format#

We will make use of Perl scripts for converting genbank files to GFF format: https://github.com/bioperl/bioperl-live.git

Clone this repository and run the dedicated bp_genbank2gff3 tool on each of the 6 Xanthomonas genomes.

git clone https://github.com/bioperl/bioperl-live.git
chmod 755 bioperl-live/bin/*

perl bioperl-live/bin/bp_genbank2gff3 Xoo_GCA_004136375.gbff.gz
perl bioperl-live/bin/bp_genbank2gff3 Xoo_GCA_001929235.gbff.gz
perl bioperl-live/bin/bp_genbank2gff3 Xoo_GCA_004355825.gbff.gz
perl bioperl-live/bin/bp_genbank2gff3 Xoc_GCA_001021915.gbff.gz
perl bioperl-live/bin/bp_genbank2gff3 Xoc_GCA_000940825.gbff.gz
perl bioperl-live/bin/bp_genbank2gff3 Xoc_GCA_001042835.gbff.gz

mv Xo*gff annotations

Have a look into the content of a GFF file.

Comparison between 2 genomes (with Minimap2 and circos)#

Compare 2 Xoo

gunzip Xoo*fna.gz

ls Xoo*fna

Using minimap2, compare 2 genomes of Xoo (Xanthomonas oryzae pv.oryzae): Xoo_GCA_001929235 and Xoo_GCA_004136375.

Generate an output in paf format named "minimap2.paf".

Solution

module load bioinfo/minimap2/2.24
minimap2 Xoo_GCA_001929235.fna Xoo_GCA_004136375.fna -o minimap2.paf

Exercises

Using awk command applied on the paf output of minimap2, we can try to filter matches between the two genomes in oreder to visualize connections with circos. Generate a new file named "links.txt" that contains links between the two genomes. 
On the other hand, in order to highlight inversion, do the same for matches that are inverted between the two genomes.

Solution

awk {'if (($4-$3)>100000 && ($5 =="+"))print NR-1" "$1" "$3" "$4"\n"NR-1" "$6" "$8" "$9'} minimap2.paf >links.txt
awk {'if (($4-$3)>100000 && ($5 =="-"))print NR-1" "$1" "$3" "$4"\n"NR-1" "$6" "$9" "$8'} minimap2.paf >links_inverted.txt

Visualize matches between genomes with Circos

We will use and adapt a template for circos configuration file for visualizing circos with links.

cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/data/circos2.conf circos2.conf

Edit the Circos configuration file to add informations about links

Solution

<link segdup-bundle34>
z            = 50
color        = nyt_red
thickness    = 3
file         = /scratch3/TP_bcg/$USER/training_bcg/public_genomes/links.txt
bezier_radius_purity = 0.2
ribbon      = yes
crest = 1
</link>

<link segdup-bundle35>
z            = 50
color        = nyt_blue
thickness    = 3
file         = /scratch3/TP_bcg/$USER/training_bcg/public_genomes/links_inverted.txt
bezier_radius_purity = 0.2
ribbon      = yes
crest = 1
</link>

Edit a karyotype file with the two genomes and their size

Make one of the genome in the opposite way by entering this line in the Circos configuration file

chromosomes_reverse = CP040687.1

Launch circos for visualization

circos/bin/circos -conf circos2.conf

Alternatively, you can align the two Xoo genomes with the web application DGenies for displaying dot plot large genomes in an interactive, efficient and simple way https://dgenies.toulouse.inra.fr/run

Do the same process for the comparing one Xoo and one Xoc

Compare genome similarity using Average Nucleotide Identity (ANI) (fastANI)#

Go to the directory "public_genomes"

cd /scratch3/TP_bcg/$USER/training_bcg/public_genomes

Create a file that contains the list of genomes to be taken as input

ls Xo*fna* >list_genomes.txt

module load bioinfo/fastani/1.33
fastANI --rl list_genomes.txt --ql list_genomes.txt -o fastani.out --matrix

Check that genomes are closer within the same pathovar than between pathovars.

Questions

Which are the closest genomes?

Solution

more fastani.out.matrix

6  
Xoc_GCA_000940825.fna  
Xoc_GCA_001021915.fna   99.149597  
Xoc_GCA_001042835.fna   99.411034   99.196701  
Xoo_GCA_001929235.fna   97.226959   97.219826   97.288483  
Xoo_GCA_004136375.fna   97.313660   97.230530   97.365417   99.582520  
Xoo_GCA_004355825.fna   97.236938   97.235214   97.313324   99.704956   99.590271

How to visualize ANI as heatmap matrix?

cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/scripts/convertANI.pl .
cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/scripts/heatmaply.R .
perl convertANI.pl fastani.out.matrix >fastani.out.matrix.completed

Install using the R interface the library optparse and heatmaply

module load bioinfo/R/4.2.2
R

install.packages("optparse")
install.packages("heatmaply")
quit()

Launch the R script to generate a figure showing a heatmap of ANI values

Rscript heatmaply.R -f fastani.out.matrix.completed

It generates a HTML output showing a heatmap of ANI values for each genome pairwise comparison.

Try to make the same process including our newly assembled genome

Gene content comparison / Pangenome analysis (using Roary)#

Have a look to the help page of the roary program.

cd /scratch3/TP_bcg/$USER/training_bcg
module load bioinfo/roary/3.12.0
roary --help

Launch roary for comparing the gene content of the 6 genomes.

roary -f roary_out public_genomes/*gff

Look at the content of the roary output:

Gene presence absence file
summary statistics

head -10 roary_out/gene_presence_absence.csv  
"Gene","Non-unique Gene name","Annotation","No. isolates","No. sequences","Avg sequences per isolate","Genome Fragment","Order within Fragment","Accessory Fragment","Accessory Order with Fragment","QC","Min group size nuc","Max group size nuc","Avg group size nuc","Xoc_GCA_000940825.gbff.gz","Xoc_GCA_001021915.gbff.gz","Xoc_GCA_001042835.gbff.gz","Xoo_GCA_001929235.gbff.gz","Xoo_GCA_004136375.gbff.gz","Xoo_GCA_004355825.gbff.gz"
"IXO1088_008755","","response regulator","6","6","1","1","1683","","","","404","404","404","BE73_06245.p01","FE36_14125.p01","ACU17_06495.p01","10bd397a0f88aca11976f4a4f6631125_6321","4e95a45ca635958849417c3d368d746d_7522","cc532a0030968c6887cea9dec5a101c0_11970"
"IXO1088_008760","","hypothetical protein","6","6","1","1","1682","","","","497","497","497","BE73_06250.p01","FE36_14120.p01","ACU17_06500.p01","10bd397a0f88aca11976f4a4f6631125_6325","4e95a45ca635958849417c3d368d746d_7526","cc532a0030968c6887cea9dec5a101c0_11966"
"EBA21_16420","","hypothetical protein","6","6","1","1","1673","","","","512","512","512","BE73_06290.p01","FE36_14080.p01","ACU17_06540.p01","10bd397a0f88aca11976f4a4f6631125_6351","4e95a45ca635958849417c3d368d746d_7554","cc532a0030968c6887cea9dec5a101c0_11934"
"pqqD","","pyrroloquinoline quinone biosynthesis peptide chaperone PqqD","6","6","1","1","1671","","","","278","278","278","BE73_06300.p01","FE36_14070.p01","ACU17_06550.p01","10bd397a0f88aca11976f4a4f6631125_6359","4e95a45ca635958849417c3d368d746d_7562","cc532a0030968c6887cea9dec5a101c0_11926"
"pqqB","","pyrroloquinoline quinone biosynthesis protein PqqB","6","6","1","1","1669","","","","899","899","899","BE73_06310.p01","FE36_14060.p01","ACU17_06560.p01","10bd397a0f88aca11976f4a4f6631125_6367","4e95a45ca635958849417c3d368d746d_7570","cc532a0030968c6887cea9dec5a101c0_11918"
"DZA53_10525","","hypothetical protein","6","6","1","1","1664","","","","524","524","524","BE73_06335.p01","FE36_14035.p01","ACU17_06585.p01","10bd397a0f88aca11976f4a4f6631125_6391","4e95a45ca635958849417c3d368d746d_7594","cc532a0030968c6887cea9dec5a101c0_11894"
"DZA53_10545","","energy transducer TonB","6","6","1","1","1661","","","","875","875","875","BE73_06355.p01","FE36_14015.p01","ACU17_06605.p01","10bd397a0f88aca11976f4a4f6631125_6407","4e95a45ca635958849417c3d368d746d_7610","cc532a0030968c6887cea9dec5a101c0_11878"
"DZA53_10580","","chemotaxis protein","6","6","1","1","1654","","","","1202","1208","1205","BE73_06390.p01","FE36_13980.p01","ACU17_06640.p01","10bd397a0f88aca11976f4a4f6631125_6435","4e95a45ca635958849417c3d368d746d_7638","cc532a0030968c6887cea9dec5a101c0_11850"
"ACU17_06675","","copper homeostasis protein CutC","6","6","1","1","1646","","","","731","731","731","BE73_06425.p01","FE36_13950.p01","ACU17_06675.p01","10bd397a0f88aca11976f4a4f6631125_6467","4e95a45ca635958849417c3d368d746d_7672","cc532a0030968c6887cea9dec5a101c0_11818"

more roary_out/summary_statistics.txt  
Core genes  (99% <= strains <= 100%)    2565
Soft core genes (95% <= strains < 99%)  0
Shell genes (15% <= strains < 95%)  3763
Cloud genes (0% <= strains < 15%)   0
Total genes (0% <= strains <= 100%) 6328

In order to explore and represent graphs with roary outputs, we will make use of scripts available in Roary repository.

git clone https://github.com/sanger-pathogens/Roary.git

python Roary/contrib/roary_plots/roary_plots.py roary_out/accessory_binary_genes.fa.newick roary_out/gene_presence_absence.csv

query_pan_genome -g roary_out/clustered_proteins -a difference --input_set_one public_genomes/Xoc_GCA_000940825.gbff.gz.gff,public_genomes/Xoc_GCA_001021915.gbff.gz.gff,public_genomes/Xoc_GCA_001021915.gbff.gz.gff --input_set_two public_genomes/Xoo_GCA_001929235.gbff.gz.gff,public_genomes/Xoo_GCA_004136375.gbff.gz.gff,public_genomes/Xoo_GCA_004355825.gbff.gz.gff

Association tests with metadata (Pan-GWAS) (with Scoary)#

Scoary is designed to take the gene_presence_absence.csv file from Roary as well as a traits file created by the user and calculate the assocations between all genes in the accessory genome and the traits. It reports a list of genes sorted by strength of association per trait.

For this task, we will use a larger file precalculated with Roary for the comparison of 90 strains. For each strain, we collected the pathovar information and we will use it to define genes potentially associated with pathovar Xoo/Xoc.

cp -rf /scratch3/TP_bcg/$USER/training_bcg/training_bacterial_comparative_genomics/data/pangenome_100xantho /scratch3/TP_bcg/$USER/training_bcg
cd /scratch3/TP_bcg/$USER/training_bcg/pangenome_100xantho

Look into the input files gene_presence_absence.csv and traits.txt

more gene_presence_absence.csv

more traits.txt

Look at the help page of scoary and run it for defining genes associated with pathovar.

module load bioinfo/scoary/1.6.16
scoary -t traits.txt -g gene_presence_absence.csv

Questions

Considering a p-value threshold 0.01, how many gene clusters are associated with pathovar?  
How many clusters are specifically present in pathovar oryzae (absent in oryzicola)?  
How many clusters are specifically present in pathovar oryzicola (absent in oryzae)?

Solution

awk -F "\",\"" {'if ($12<0.01)print $_'} pathovar_*.results.csv | wc -l

awk -F "\",\"" {'if ($4==0 && $5==75)print $_'} pathovar_*.results.csv | wc -l

awk -F "\",\"" {'if ($4==15 && $5==0)print $_'} pathovar_*.results.csv | wc -l

Explore pan-genome with PanExplorer web application#

Upload your 6 genomes into the web application and choose the PanACoTA software for pangenome reconstruction. https://panexplorer.southgreen.fr/

Exercises

How many genes are strain-specific of each of the 3 Xoo strains?

Construct a pangenome graph (using minigraph2)#

Pangenome analysis based on gene clustering approach will result in a pangene atlas, construction of a presence/absence matrix of genes.

Pangenome can also be reconstructed directly on genomic sequences using pangenome graph approach using minigraph2.

cd /scratch3/TP_bcg/$USER/training_bcg/public_genomes
module load bioinfo/minigraph/0.15
minigraph -xggs -t16 Xoo_GCA_001929235.fna Xoo_GCA_004136375.fna >Xoo_pangenome.gfa

Convert GFA into FASTA of the pangenome

module load bioinfo/gfatools/0.5
gfatools gfa2fa -s Xoo_pangenome.gfa >Xoo_pangenome.fa

Download the bandage software and load the GFA pangenome graph to be visualized