Name
Polymorphism detection from fastq files on a linux terminal.
Description
This page describes a serie of tools and linux commands used to detect polymorphism from raw data (fastq files) to polymorphism file (vcf).
Authors
christine Tranchant-Dubreuil (christine.tranchant@ird.fr)
Creation Date
10/03/2017
Last Modified Date
06/08/2018
We need, in this tutorial:
2 fastq files paired
a reference file used for the mapping step.
Keywords
fastq-stats, fastqc, cutadapt, bwa, samtools, gatk, picardtools
fastq, sam, bam, vcf
Summary
How to get fastq file size? du -sh
[ tranchant@master0 ~]$ du -sh * .fastq
311M C3KB2ACXX_5_12_11_debar.fastq
525M C3KB2ACXX_5_12_12_debar.fastq
How to get sequences number by fastq files (uncompressed files only)?
[ tranchant@master0 ~]$ wc -l * .fastq | awk '{ print $2" \t "$1/4}'
C3KB2ACXX_5_12_11_debar.fastq 1354891
C3KB2ACXX_5_12_12_debar.fastq 2249353
Getting quickly a report with various statistics about fastq file with ea-utils
This software runs quickly, faster than fastqc and the output can be parsed and formatted with some basics linux command
ea-utils website
How to use fastq-stats
with one fastq file
[ tranchant@node11 FASTQ]$ fastq-stats -D PdFIE94_R1.fq.gz
reads 15034838
len 144
len mean 144.0000
len stdev 0.0000
len min 144
phred 33
window-size 2000000
cycle-max 35
qual min 2
qual max 42
qual mean 40.2191
qual stdev 4.8129
%A 25.5314
%C 26.3730
%G 22.4570
%T 25.6209
%N 0.0178
total bases 2165016672
How to use fastq-stats
on several fastq files
[ tranchant@node2 tranchant]$ ls
DV.UNMAPPED_1.fastq DV.UNMAPPED_2.fastq
DT.UNMAPPED_1.fastq DT.UNMAPPED_2.fastq
# you can execute every fastq-stats command by using for loop (bash)
[ tranchant@node2 tranchant]$ for file in * fastq; do fastq-stats -D $file > $file .fastq-stats ; done ;
#Check that the files have been created
[ tranchant@node2 tranchant]$ ls -l * stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DT.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 247 8 juin 14:55 DT.UNMAPPED_2.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DV.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DV.UNMAPPED_2.fastq.fastq-stats
#check one file
[ tranchant@node2 tranchant]$ cat DT.UNMAPPED_1.fastq.fastq-stats
reads 11043989
len 101
len mean 99.3377
len stdev 7.4780
len min 30
phred 33
window-size 2000000
cycle-max 35
qual min 2
qual max 41
qual mean 37.7341
qual stdev 3.8379
%A 29.2366
%C 20.8381
%G 20.7269
%T 29.1967
%N 0.0017
total bases 1097084447
How to get a tab file from all files generated by the command ls | cut | sed | sh
# To generate a header
[ tranchant@node11 FASTQ]$ cut -f1 PdFIE94_R1.fq.gz.fastq-stats | paste -s - - > all.stats
# To display header file
[ tranchant@node11 FASTQ]$ head all.stats
reads len len mean len stdev len min phred window-size cycle-max qual min qual max qual mean qual stdev %A %C %G %T %N total bases
# To parse every file and to create a tab-delimited file
[ tranchant@node11 FASTQ]$ awk 'BEGIN { num=0; } { num++; if (num == 1) { print FILENAME } print $NF; if (num == 18) {print "\n"; num=0; } } END { } ' * fastq-stats | awk ' BEGIN { RS="\n\n"; ;} { print $1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19;}' >> all.stats
# Display 10 first lines of the file all.stats
[ tranchant@node11 FASTQ]$ head all.stats
reads len len mean len stdev len min phred window-size cycle-max qual min qual max qual mean qual stdev %A %C %G %T %N total bases
PdFIE94_R1.fq.gz.fastq-stats 144 144.0000 0.0000 144 33 2000000 35 2 42 40.2191 4.8129 25.5314 26.3730 22.4570 25.6209 0.0178 2165016672
PdFIE94_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.8602 10.4600 26.6998 23.6108 22.8715 26.4631 0.3548 2270260538
PdFIE95_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1422 4.9354 25.2075 26.6116 22.5737 25.5869 0.0202 1873211197
PdFIE95_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0387 10.3787 26.5003 24.0084 23.0377 26.0990 0.3546 1978006229
PdFIE96_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1133 4.9931 25.0246 26.6680 22.6115 25.6756 0.0204 1488540911
PdFIE96_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.9494 10.4096 26.5905 24.0838 23.1924 25.7780 0.3554 1571815927
PdFIE98_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.2191 4.8173 25.3874 26.4649 22.4244 25.7031 0.0201 1443785200
PdFIE98_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0267 10.3824 26.6316 23.8662 22.8668 26.2791 0.3564 1524556400
to perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it. fastqc website
[ tranchant@master0 ~]$ fastqc * .fastq -o .
[ tranchant@master0 ~]$ ls
total 543577236
drwxr-xr-x 4 tranchant ggr 4096 9 mars 22:11 PdFIE94_R1.fq_fastqc
-rw-r--r-- 1 tranchant ggr 225516 9 mars 22:11 PdFIE94_R1.fq_fastqc.zip
FASTQ cleaning
Using cutadapt
to remove adapters and to trim reads based on quality
cutadapt -q 30,30 -m 35 -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG-o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz
Cutadapt
supports trimming of multiple types of adapters:
Adapter type
Command-line option
3’ adapter
-a ADAPTER
5’ adapter
-g ADAPTER
5’ or 3’ (both possible)
-b ADAPTER
-q 30, 30 : by default, only the 3’ end of each read is quality-trimmed. If you want to trim the 5’ end as well, use the -q option with two comma-separated cutoffs
-p is the short form of –paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.
Mapping step with bwa
Creating an index of the reference genome with bwa index
bwa index -a is
is
for short genome
bwtsw
for genome >2Gb
Getting sai files with bwa aln
bwa aln reference file_forward.fastq > file_forward.sai
bwa aln reference file_reverse.fastq > file_reverse.sai
Getting sam file with bwa sampe
(paired)
Here is a description for the contents of the SAM file: https://samtools.github.io/hts-specs/SAMv1.pdf
bwa sampe reference file_forward.sai file_reverse.sai file_forward.fastq file_reverse.fastq -f file.sam -r '@RG ID:RG SM:RG PL:Illumina'
Getting sam file with bwa samse
(single)
bwa samse reference file.sai file.fastq file_reverse.fastq -f file.sam -r '@RG ID:RG SM:RG PL:Illumina'
bwa mem reference file.fastq file_reverse.fastq -R '@RG\tID:RC3\tSM:RC3\tPL:Illumina' > file.sam
/usr/bin/java -Xmx8g -jar /us/local/picard-tools-2.5.0/picard.jar CreateSequenceDictionary REFERENCE = Reference.fasta OUTPUT = Reference.dict
samtools faidx Rreference.fasta
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar SortSam CREATE_INDEX = TRUE VALIDATION_STRINGENCY = SILENT SORT_ORDER = coordinate INPUT = file.BWASAMPE.sam OUTPUT = file.PICARDTOOLSSORT.bam
The bam index (.bai ) is created auomtically by picardtools
(samtools index
command unnecessary)
#command
$samtools flagstat file.PICARDTOOLSSORT.bam
#output
14524094 + 0 in total ( QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
11219302 + 0 mapped ( 77.25%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 prorubyy paired ( -nan %:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons ( -nan %:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr ( mapQ>= 5)
#command
$samtools idxstats file.PICARDTOOLSSORT.bam
#The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. #It is written to stdout.
Chromosome_8.1 7978604 2274010 0
Chromosome_8.2 8319966 2274086 0
Chromosome_8.3 6606598 1774006 0
Chromosome_8.4 5546968 1483879 0
Chromosome_8.5 4490059 1217779 0
Chromosome_8.6 4133993 1132535 0
Chromosome_8.7 3415785 945505 0
Chromosome_8.8 535760 117502 0
* 0 0 3304792
#command
# -bg option : reports in BEDGRAPH format only for those regions of the genome that actually have coverage.
# This option produces genome-wide coverage output in BEDGRAPH format much more concise since consecutive positions with the same coverage are reported as a single output line describing the start and end coordinate of the interval having the coverage level, followed by the coverage level itself.
# -bga option : a complete report including the regions with zero coverage.
$bedtools genomecov -ibam AAOSW.PICARDTOOLSSORT.bam -g /scratch/tranchant/referenceFiles/OglaRS2ADWL02.fa -bga > AAOSW.PICARDTOOLSSORT.bam.BEDTOOLSGENOMECOV
#The output is TAB-delimited with each line consisting of reference sequence name, start position, end position, coverage level
$less AAOSW.PICARDTOOLSSORT.bam.BEDTOOLSGENOMECOV
Chr01 0 2 1
Chr01 2 7 3
Chr01 7 13 6
Chr01 13 14 8
Chr01 14 15 9
Chr01 15 16 11
Chr01 16 18 12
Chr01 18 19 13
Chr01 19 23 14
Chr01 23 29 16
Chr01 29 32 18
Chr01 32 33 19
Chr01 33 34 20
Chr01 34 36 21
Chr01 36 38 22
Chr01 38 39 24
Chr01 39 40 26
# Reads correctly mapped extracted
samtools view -h -b -f = 0* 02 -o file.SAMTOOLSVIEW-MAPPED.bam file.PICARDTOOLSSORT.bam
# Unmapped reads extracted
samtools view -h -b -F = 0* 02 -o file.SAMTOOLSVIEW-UNMAPPED.bam file.PICARDTOOLSSORT.bam
#Command
samtools index file.SAMTOOLSVIEW-MAPPED.bam
… if they haven’t been added precedently
#Command
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar AddOrReplaceReadGroups INPUT = 20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT = 20135.PICARDTOOLSADDORREPLACEREADGROUPS.bam RGPL = Illumina RGLB = 20135 RGPU = 20135 RGSM = 20135 CREATE_INDEX = True VALIDATION_STRINGENCY = LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
RGPL= Read Group Platform (i.e. Illumina)
RGLB= Read Group Library (i.e. DNA preparation library identity i.e “HG19”)
RGPU= Read Group Platform Unit (i.e Barcode)
RGSM= Read Group Sample Name (i.e. Sample Name)
</a
… if it is necessary
#Command
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar MarkDuplicates INPUT = 20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT = 20135.PICARDTOOLSMARKDUPLICATES.bam METRICS_FIL = metrics.txt CREATE_INDEX = True VALIDATION_STRINGENCY = LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
Local realignment around INDELS using GATK
Creating a target list of intervals which need to be realigned with gatk realignerTargetCreator
#command
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference -I file.bam -o file.GATKREALIGNERTARGETCREATOR.intervals
#Command
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T IndelRealigner -R reference -I file.bam -targetIntervals file.GATKREALIGNERTARGETCREATOR.intervals -o file.GATKINDELREALIGNER.bam
SNP/INDEL calling
with GATK haplotypeCaller
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar -T HaplotypeCaller -R reference -I file-ind1.PICARDTOOLSMARKDUPLICATES.bam -I file-ind2.PICARDTOOLSMARKDUPLICATES.bam -o file.GATKHAPLOTYPECALLER.vcf
with GATK unifiedGenotyper
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar -T UnifiedGenotyper -R reference -I file.bam -o file.GATKUNIFIEDGENOTYPER.vcf
Basic Filtering vcf files using GATK
with gatk variantFiltration
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration --filterName 'FILTER-DP' --filterExpression 'DP<10 || DP>600' --filterName 'LowQual' --filterExpression 'QUAL<30' -R /reference -o file.GATKVARIANTFILTRATION.vcf --variant file.GATKHAPLOTYPECALLER.vcf
with gatk selectVariants
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -selectType SNP -T SelectVariants -R reference --variant file.GATKVARIANTFILTRATION.vcf -o file.GATKSELECTVARIANT.vcf
NGS workflow Managers and linux trainings realized by the South Green Platform
License
The resource material is licensed under the Creative Commons Attribution 4.0 International License (
here ).