
South Green tutorials pages

Name Commands to manipulate VCF files.
Description This page describes a series of tools and Linux commands used to manipulate VCF files.
Authors Christine Tranchant-Dubreuil (christine.tranchant@ird.fr)
Creation Date 10/03/2017
Last Modified Date 25/03/2018

We need, in this tutorial:

Keywords

gatk,bcftools


Summary


Extracting the list of samples from a VCF file

One line with all samples, with grep | cut (the sample names are columns 10 onwards of the #CHROM header line)
$ grep "#CHROM" inputFileName.vcf | cut -f 10-

One line per sample, with grep | cut | xargs
$ grep "#CHROM" inputFileName.vcf | cut -f 10- | xargs -n 1

Getting the number of samples
$ grep "#CHROM" inputFileName.vcf | cut -f 10- | xargs -n 1 | wc -l
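The grep commands above read the whole file even though the sample names sit on a single header line. As a minimal sketch, the same extraction can be written with awk so that it stops at the #CHROM line. The function name list_samples and the tiny demo VCF are made up for illustration.

```shell
# Hypothetical helper: print one sample name per line from a VCF header.
list_samples() {
    # The #CHROM line is the last header line; columns 10 onwards are the samples.
    # "exit" stops reading as soon as that line has been processed.
    awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Tiny made-up VCF, used only for illustration.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3\n' >> "$demo_vcf"
printf 'Chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\n' >> "$demo_vcf"

list_samples "$demo_vcf"           # one sample name per line
list_samples "$demo_vcf" | wc -l   # number of samples (3 for the demo file)
```

On a large VCF this avoids scanning the genotype lines, since the header always comes first.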

Extracting a subset of samples from a multi-sample VCF file

Select two samples out of a VCF with many samples with GATK SelectVariants
java -Xmx12g -jar /usr/local/gatk-3.6/GenomeAnalysisTK.jar -T SelectVariants -R reference.fa -V inputFileName.vcf -o outputFilename.vcf -sn sample1 -sn sample2

Note: if you get the error message “Fasta dict file … for reference … does not exist”, the reference is missing its sequence dictionary; see https://www.broadinstitute.org/gatk/guide/article?id=1601

Select genotypes from a file containing a list of samples to include with GATK SelectVariants
java -Xmx12g -jar /usr/local/gatk-3.6/GenomeAnalysisTK.jar -T SelectVariants -R reference.fa -V inputFileName.vcf -o outputFileName.vcf --sample_file barthii.only.RG.list  --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES

Select genotypes from a file containing a list of samples to exclude with GATK SelectVariants
java -Xmx12g -jar /usr/local/gatk-3.6/GenomeAnalysisTK.jar -T SelectVariants -R reference.fa -V inputFileName.vcf -o outputFileName.vcf --exclude_sample_file barthii.only.RG.list  --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES

Note: if you get the error message “Bad input: Samples entered on command line (through -sf or -sn)) that are not present in the VCF”, run with --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES
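Before silencing this error, it can help to see which names in the list are actually missing from the VCF, since a typo in a sample name would otherwise be ignored without warning. The following is a minimal sketch using only grep, cut and tr. The function name missing_samples and the demo files are hypothetical.

```shell
# Hypothetical helper: print the names from a sample list that are absent from the VCF header.
# $1 = VCF file, $2 = sample list (one name per line)
missing_samples() {
    vcf_names=$(mktemp)
    # Sample names of the VCF, one per line.
    grep -m 1 '^#CHROM' "$1" | cut -f 10- | tr '\t' '\n' > "$vcf_names"
    # -F fixed strings, -x whole-line match, -f patterns from file, -v keep non-matching lines.
    grep -vxFf "$vcf_names" "$2"
    rm -f "$vcf_names"
}

# Made-up demo data: sampleX is in the list but not in the VCF.
demo_vcf=$(mktemp)
demo_list=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3\n' > "$demo_vcf"
printf 'sample1\nsampleX\n' > "$demo_list"

missing_samples "$demo_vcf" "$demo_list"   # prints sampleX
```

An empty result means every name in the list is present in the VCF.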

Select genotypes from a file containing a list of samples to include with bcftools
bcftools view -S barthii.only.RG.list inputFileName.vcf --force-samples -o outputFilename.vcf
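To make explicit what sample subsetting does to the file, here is a minimal awk sketch that keeps the 9 fixed VCF columns plus the genotype columns of the listed samples. It is an illustration only, not a replacement for GATK or bcftools: it does not update INFO fields such as allele counts, and names absent from the VCF are simply ignored. The function name subset_samples and the demo files are hypothetical.

```shell
# Hypothetical helper: keep the 9 fixed VCF columns plus the samples named in a list.
# $1 = sample list (one name per line), $2 = VCF file
subset_samples() {
    awk -F '\t' -v OFS='\t' '
        FILENAME == ARGV[1] { keep[$0] = 1; next }   # first file: names to keep
        /^##/ { print; next }                        # meta-information lines pass through
        /^#CHROM/ {                                  # header line: record wanted column indexes
            n = 0
            for (i = 10; i <= NF; i++) if ($i in keep) cols[++n] = i
        }
        {
            line = $1
            for (i = 2; i <= 9; i++) line = line OFS $i
            for (j = 1; j <= n; j++) line = line OFS $(cols[j])
            print line
        }
    ' "$1" "$2"
}

# Made-up demo data.
demo_vcf=$(mktemp)
demo_list=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3\n' >> "$demo_vcf"
printf 'Chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\n' >> "$demo_vcf"
printf 'sample1\nsample3\n' > "$demo_list"

subset_samples "$demo_list" "$demo_vcf"   # same VCF with only the sample1 and sample3 columns
```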

Calculating the nucleotide diversity from a vcf file with vcftools

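Compute the nucleotide diversity in 100 kb windows (the result is written to outputFilename.PI.windowed.pi)
vcftools --vcf inputFilename.vcf  --out outputFilename.PI  --window-pi 100000 --remove-filtered-all

Average the PI column (5th column) over all windows, after removing the header line
grep -v "PI" outputFilename.PI.windowed.pi | awk '{ sum+=$5; print $5,"; ",sum , "* ", NR ; } END { print "PI average :", sum / NR; }'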
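The averaging step can also be written in awk alone. The sketch below skips the header by line number instead of matching the text "PI", and prints only the final mean. The function name mean_pi and the demo values are made up; the column layout (CHROM, BIN_START, BIN_END, N_VARIANTS, PI) is assumed to be that of a .windowed.pi file.

```shell
# Hypothetical helper: mean of the PI column (5th column) over all windows.
mean_pi() {
    # NR > 1 skips the header line; LC_ALL=C keeps "." as the decimal separator.
    LC_ALL=C awk 'NR > 1 { sum += $5; n++ }
                  END { if (n > 0) printf "PI average: %.6f\n", sum / n }' "$1"
}

# Made-up demo data in the .windowed.pi layout.
demo_pi=$(mktemp)
printf 'CHROM\tBIN_START\tBIN_END\tN_VARIANTS\tPI\n' > "$demo_pi"
printf 'Chr2\t1\t100000\t12\t0.002\n' >> "$demo_pi"
printf 'Chr2\t100001\t200000\t30\t0.004\n' >> "$demo_pi"

mean_pi "$demo_pi"   # prints "PI average: 0.003000"
```

This is an unweighted mean over the windows present in the file, the same quantity as the sum / NR computed above.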