South Green Trainings pages

Description: Hands On Lab Exercises for Linux
Related-course materials: Linux for Jedi
Authors: Christine Tranchant-Dubreuil (christine.tranchant@ird.fr) & Gautier Sarah (gautier.sarah
Creation Date: 11/03/2018
Last Modified Date: 18/04/2022
Modified by: Christine Tranchant-Dubreuil

Summary


Preamble


Practice 1 : Connecting to a Linux server by ssh

In MobaXterm:

  1. Click the session button, then click SSH.
    • In the remote host text box, type: HOSTNAME (see table below)
    • Check the specify username box and enter your user name
  2. In the console, enter the password when prompted. Once you are successfully logged in, you will use this console for the rest of the lecture.

Cluster HPC    Hostname
IRD HPC        bioinfo-inter.ird.fr

Practice 2 : Preparing the working environment


Practice 3 : Using the && separator

# Download the archive from the web, then extract it only if the download succeeded
wget http://itrop.ird.fr/LINUX-TP/LINUX4JEDI-TP.tar.gz && tar -xzvf LINUX4JEDI-TP.tar.gz
bash-4.2# tree -L 2  
.
|-- 1-fastq
|   |-- SRR8517015_1.10000.fastq
|   |-- SRR8517015_2.10000.fastq
|   |-- SRX5320622_1.10000.fastq
|   |-- SRX5320622_2.10000.fastq
|   |-- SRX5320631_1.10000.fastq
|   `-- SRX5320631_2.10000.fastq
|-- 2-bam
|   |-- B1.starMSU7.chr1.sorted.bam
|   |-- B1.starMSU7.chr1.sorted.bam.bai
|   |-- B2.starMSU7.chr1.sorted.bam
|   |-- B2.starMSU7.chr1.sorted.bam.bai
|   |-- G1.starMSU7.chr1.sorted.bam
|   |-- G1.starMSU7.chr1.sorted.bam.bai
|   |-- G2.starMSU7.chr1.sorted.bam
|   `-- G2.starMSU7.chr1.sorted.bam.bai
|-- 3-RNAseqCount
|   |-- erz340_suppl_supplementary_table_s5.csv
|   `-- erz340_suppl_supplementary_table_s5_new.csv
|-- 4-vcf
|   `-- OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION.shuf.100000.vcf.gz
|-- 9-denovoAssembly
|   |-- DAOSW_abyss-contigs.fa -> Ob/DAOSW_abyss-contigs.fa
|   |-- Ob
|   |-- Og
|   `-- TOG5681_abyss-contigs.fa -> Og/TOG5681_abyss-contigs.fa
|-- Bank
|   |-- all.con
|   `-- all.seq
|-- Other
|   |-- abcd.txt
|   |-- contact.txt
|   |-- example.txt
|   `-- test.list
|-- Script
|   |-- helloworld-var.sh
|   |-- helloworld.sh
|   |-- q
|   |-- script.sh
|   `-- testNum.sh
|-- erz340.pdf
`-- erz340_suppl_supplementary_table_s1.csv
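
The behaviour of `&&` can be checked without any download. The sketch below is only an illustration: the directory and file names are hypothetical and live in a temporary directory. It shows that the command on the right of `&&` runs only when the command on the left succeeds.

```shell
# Work in a throwaway directory so that nothing real is touched (hypothetical names).
workdir=$(mktemp -d)

# Success case: mkdir succeeds, so the command after && is executed.
mkdir "$workdir/demo" && echo "created" > "$workdir/demo/status.txt"

# Failure case: ls fails on a missing file, so the command after && is skipped.
ls "$workdir/missing.txt" 2>/dev/null && echo "never written" > "$workdir/skipped.txt"

# Only the 'demo' directory exists; skipped.txt was never created.
ls "$workdir"
```

This is why the wget and tar commands are chained with `&&`: tar is not started on a missing or partial download when wget reports a failure.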


Practice 4 : Monitoring processes

Displaying the list of processes

Kill a process - downloading files from SRA in two ways

We want to download one fastq file from NCBI SRA (available here https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=518559) using the SRA Toolkit, as below:

module load bioinfo/sratoolkit/2.9.2 
fastq-dump --gzip --split-files SRXXXX

This downloads the SRA file (in sra format) and then converts it to a fastq.gz file. More details at https://isugenomics.github.io/bioinformatics-workbook/dataAcquisition/fileTransfer/sra.html
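
Stopping a running process can be rehearsed safely before trying it on a real download. In this minimal sketch, `sleep 300` is only a stand-in for a long fastq-dump run (an assumption made for the demonstration). In a real session, `ps -u "$USER"` or `top` would be used to find the PID of the download.

```shell
# 'sleep' stands in for a long-running download started in the background.
sleep 300 &
pid=$!                        # PID of the most recent background job

# 'kill -0' sends no signal; it only checks that the process exists.
kill -0 "$pid" && echo "process $pid is running"

kill "$pid"                   # send SIGTERM, the default signal

# Collect the exit status: 128 + signal number for a process ended by a signal.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"
```

If a process ignores SIGTERM, `kill -9` sends SIGKILL, which cannot be caught; it is best kept as a last resort because the process gets no chance to clean up.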


Practice 5 : Searching for text using https://regex101.com/

xkn59438
yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp
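
Patterns tried on regex101 can be replayed on the command line with `grep -E`. The sketch below copies the strings above into a temporary file and tests three example patterns; the patterns are illustrations chosen here, not the expected answers of the exercise.

```shell
# Put the practice strings in a temporary file.
strings_file=$(mktemp)
cat > "$strings_file" <<'EOF'
xkn59438
yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp
EOF

# Lines that start with a digit.
starts_with_digit=$(grep -E '^[0-9]' "$strings_file")

# Number of lines that end with a digit (-c counts matching lines).
ends_with_digit=$(grep -cE '[0-9]$' "$strings_file")

# Number of lines made of lowercase letters followed only by digits.
letters_then_digits=$(grep -cE '^[a-z]+[0-9]+$' "$strings_file")

echo "start with a digit : $starts_with_digit"
echo "end with a digit   : $ends_with_digit"
echo "letters then digits: $letters_then_digits"
```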

Practice 6 : Searching for text using grep

Info: The file all.con contains the sequence of the Asian rice genome (fasta format) and all.pep contains the sequence of all the genes annotated on the rice genome (fasta format).

From a gff file

We have the genome reference (all.con, fasta file) and we want to download the annotation of our genome reference (gff format).

wget PUT_GFF_URL

Practice 7 : Displaying lines with sed

For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq
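
As a warm-up, the `-n` option and the `p` command of sed can be tried on a tiny hand-made file. The two reads below are invented for the demonstration and are not taken from the course data.

```shell
# Toy fastq file with two made-up reads (4 lines per read).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
HHHH
EOF

# -n suppresses the default output; 'p' prints only the selected lines.
first_record=$(sed -n '1,4p' "$fq")     # lines 1 to 4: the first read
second_seq=$(sed -n '6p' "$fq")         # line 6: sequence of the second read
from_line_5=$(sed -n '5,$p' "$fq")      # from line 5 to the end of the file

printf '%s\n' "$first_record"
echo "second sequence: $second_seq"
```

The same addresses apply unchanged to SRR8517015_1.10000.fastq.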


Practice 8 : Deleting lines with sed

For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq
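
The `d` command of sed removes the addressed lines and prints everything else. A minimal sketch on a toy file (the reads are invented for the demonstration); the input file itself is never modified because `-i` is not used.

```shell
# Toy fastq file with two made-up reads (4 lines per read).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
HHHH
EOF

without_first_record=$(sed '1,4d' "$fq")   # delete lines 1 to 4
without_separators=$(sed '/^+$/d' "$fq")   # delete lines that contain only '+'
without_last_line=$(sed '$d' "$fq")        # delete the last line

printf '%s\n' "$without_first_record"
```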


Practice 9 : File parsing with sed using regexp (regular expression)

Fastq file

For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq

vcf file

For this exercise, you will work with the vcf file LINUX4JEDI-TP/4-vcf/OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION.shuf.100000.vcf.gz
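
Because the vcf file is gzip-compressed, it has to be decompressed on the fly before sed sees it. The sketch below builds a toy compressed VCF (the header and the two records are made up) and applies regular-expression addresses to it.

```shell
tmp=$(mktemp -d)

# Toy VCF: two '##' meta lines, the '#CHROM' column header and two made-up records.
printf '##fileformat=VCFv4.2\n##source=toy\n#CHROM\tPOS\tID\tREF\tALT\nChr6\t1204\t.\tA\tG\nChr6\t5310\t.\tC\tT\n' \
  | gzip > "$tmp/toy.vcf.gz"

# Print only the column header line.
header=$(gzip -dc "$tmp/toy.vcf.gz" | sed -n '/^#CHROM/p')

# Delete every header line and count the remaining variant records.
n_variants=$(gzip -dc "$tmp/toy.vcf.gz" | sed '/^#/d' | wc -l)

# Delete header lines, rewrite the 'Chr' prefix, keep the first record.
first_record=$(gzip -dc "$tmp/toy.vcf.gz" | sed -e '/^#/d' -e 's/^Chr/chr/' | sed -n '1p')

echo "$header"
echo "variants: $n_variants"
echo "$first_record"
```

`gzip -dc` is used here rather than `zcat` only because it behaves the same way on most systems; on the cluster either should work.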


Practice 10 : File modification with sed

From fasta files in LINUX-TP/Fasta

# File DAOSW_abyss-contigs.fa initially
>0 71 531
CTTTTTGAACTTTTTCATTCCGGTCAAAAAAATATCGCACCCGTGGGGGCTCAATATATGCCAATATTGGC
>2 217 449


# File DAOSW_abyss-contigs.renamed.fasta
>DAOSW_0 71 531
CTTTTTGAACTTTTTCATTCCGGTCAAAAAAATATCGCACCCGTGGGGGCTCAATATATGCCAATATTGGC
>DAOSW_2 217 449

Note : First test the sed command on one file, then store the results in new files named DAOSW_abyss-contigs.renamed.fasta and TOG5681_abyss-contigs.renamed.fasta

BONUS : try to modify each line starting with >, for example turning >0 71 531 into >DAOSW_0
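
One possible approach, shown as a sketch on the three lines quoted above rather than as the reference solution: a plain substitution anchored on `^>` adds the prefix, and an extended regular expression with a capture group keeps only the first word of the header.

```shell
# Toy fasta built from the lines shown above.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>0 71 531
CTTTTTGAACTTTTTCATTCCGGTCAAAAAAATATCGCACCCGTGGGGGCTCAATATATGCCAATATTGGC
>2 217 449
EOF

# Add the prefix right after '>' on header lines; sequence lines are untouched.
renamed=$(sed 's/^>/>DAOSW_/' "$fa")

# Bonus variant: keep only the first word of each header (-E enables extended regexps).
short=$(sed -E 's/^>([^ ]+).*/>DAOSW_\1/' "$fa")

printf '%s\n' "$renamed"
printf '%s\n' "$short"
```

Redirecting the output (`> DAOSW_abyss-contigs.renamed.fasta`) stores the result in a new file and leaves the original untouched.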

vcf file LINUX4JEDI-TP/4-vcf/OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION.shuf.100000.vcf.gz

With fastq files in LINUX4JEDI-TP/1-fastq/


Practice 11 : Manipulating files with awk

From a fasta file

seqtk

Seqtk (https://github.com/lh3/seqtk) is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.

We are going to use seqtk comp to get the nucleotide composition of FASTA/Q files and to print the size of the genome.

From the gff file previously downloaded
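
When seqtk is not at hand, the number of sequences and the total size can be cross-checked with grep and awk alone. This is a minimal sketch on a made-up two-sequence fasta file; assuming the second column of the seqtk comp output holds the sequence length, summing that column with awk should give the same total.

```shell
# Toy fasta with two made-up sequences; the first one spans two lines.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>seq1
ACGTAC
GTA
>seq2
TTGACA
EOF

# Number of sequences: count the header lines.
n_seq=$(grep -c '^>' "$fa")

# Total size: add up the length of every line that is not a header.
total=$(awk '!/^>/ { total += length($0) } END { print total }' "$fa")

echo "sequences: $n_seq"
echo "total bases: $total"
```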


Practice 12 : For loop with bash

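# Test the command on a single file first
fastq-stats -D SRR8517015_1.10000.fastq

# Then run it on every fastq file, writing one result file per input
for file in *.fastq; do
  fastq-stats -D "$file" > "$file.fastq-stats"
done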

Practice 13 : Last but not least

A bash script to download fastq files from a file that contains a list of accessions

Write a bash script that :

# Use the following code to read the file (variable $filename) line by line.
# IFS= and -r keep leading spaces and backslashes in each line intact.
while IFS= read -r line
do
  echo "$line"
done < "$filename"
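
Before launching real downloads, the loop can be rehearsed as a dry run that only prints the command it would execute. In this sketch the accession list is a temporary file whose content is borrowed from the file names in 1-fastq; both the list and the variable names are assumptions made for the demonstration.

```shell
# Hypothetical accession list, one accession per line.
list=$(mktemp)
printf '%s\n' SRR8517015 SRX5320622 SRX5320631 > "$list"

# Dry run: build the command for each accession without executing it.
commands=$(
  while IFS= read -r accession; do
    # Skip empty lines in the list.
    [ -z "$accession" ] && continue
    echo "fastq-dump --gzip --split-files $accession"
  done < "$list"
)

printf '%s\n' "$commands"
```

Once the printed commands look right, replacing the `echo "..."` line with the command itself turns the dry run into the real script.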

A bash script to get basic statistics on each fastq file (in the directory 1-fastq) using fastq-stats

Before writing the script, we will test the fastq-stats command on a bash terminal.

# fastq-stats output
reads	10000
len	125
len mean	125.0000
len stdev	0.0000
len min	125
phred	33
window-size	10000
cycle-max	35
qual min	2
qual max	38
qual mean	36.1021
qual stdev	4.2358
%A	25.5594
%C	24.3560
%G	26.1111
%T	23.8691
%N	0.1043
total bases	1250000

# We want this format
10000	125	125.0000	0.0000	125	33	10000	35	2	38	36.1021	4.2358	25.5594	24.3560	26.1111	23.8691	0.1043	1250000
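
One way to go from the two-column output to a single line, sketched here on a shortened copy of the output and assuming the label and the value are separated by a tab: `cut` keeps the second column and `paste -s` joins all the lines with tabs.

```shell
# Shortened copy of the fastq-stats output (label<TAB>value), for illustration only.
stats=$(mktemp)
printf 'reads\t10000\nlen\t125\nlen mean\t125.0000\ntotal bases\t1250000\n' > "$stats"

# Keep the value column, then merge all lines into one tab-separated line.
one_line=$(cut -f2 "$stats" | paste -s -)

printf '%s\n' "$one_line"
```

Labels such as "len mean" contain a space, which is why the column is selected on tabs (the default delimiter of cut) rather than on whitespace.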

Write a bash script

Write a bash script that :

Bonus

On a terminal, use awk to parse all files created by the previous bash script and to generate the following output:

SRR8517015_1.10000.fastq.stats          10000        125        125.0000        0.0000        125        33        10000        35        2        38        36.1021        4.2358        25.5594        24.3560        26.1111        23.8691        0.1043        1250000
SRR8517015_2.10000.fastq.stats          10000        125        125.0000        0.0000        125        33        10000        35        2        38        34.4527        7.0727        23.5631        25.5657        25.6063        25.2649        0.0000        1250000
SRX5320622_1.10000.fastq.stats          10000        125        125.0000        0.0000        125        33        10000        35        2        38        36.4891        3.6410        26.3371        24.0457        24.8703        24.6883        0.0586        1250000
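
A possible awk approach relies on the built-in variable FILENAME, which holds the name of the file currently being read. The sketch assumes that each .stats file contains a single tab-separated line; the two files below are shortened, hypothetical examples.

```shell
tmp=$(mktemp -d)

# Two hypothetical one-line result files with shortened content.
printf '10000\t125\t1250000\n' > "$tmp/sampleA.fastq.stats"
printf '10000\t125\t1250000\n' > "$tmp/sampleB.fastq.stats"

# Prefix every line with the name of the file it comes from.
report=$(cd "$tmp" && awk '{ print FILENAME "\t" $0 }' *.fastq.stats)

printf '%s\n' "$report"
```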

Analysis of the read count file 3-RNAseqCount/erz340_suppl_supplementary_table_s5.csv

Goal : Get the chromosome and the positions (start-stop) of some differentially expressed genes, using the read count file and the downloaded gff file

Get the first ten rows with the lowest p-value
[tranchant@node6 3-RNAseqCount]$ head erz340_suppl_supplementary_table_s5.csv 
gene_id;log2FoldChange;lfcSE;pvalue;padj;symbols;MsuAnnotation
LOC_Os06g06750;4,02391987172844;0,291852309336462;4,76E-32;1,20E-27;MADS5;OsMADS5 - MADS-box family gene with MIKCc type-box, expressed
LOC_Os03g11614;6,14058803847572;0,534044090654195;2,40E-25;3,03E-21;LHS1;OsMADS1 - MADS-box family gene with MIKCc type-box, expressed
LOC_Os04g43580;-2,32647724746766;0,178467573376802;1,70E-22;1,43E-18;G1L4;DUF640 domain containing protein, putative, expressed
LOC_Os02g45770;4,57158531291166;0,416805180501083;1,13E-21;7,08E-18;MFO1;OsMADS6 - MADS-box family gene with MIKCc type-box, expressed
LOC_Os03g14140;5,01958570820803;0,502245405032766;1,05E-18;5,29E-15;;POEI16 - Pollen Ole e I allergen and extensin family protein precursor, expressed
Get the columns chr, start, stop, info from the gff file
[tranchant@node6 3-RNAseqCount]$ head all.gene.loc.csv 
Chr1 gene 2903 10817 LOC_Os01g01010
Chr1 gene 11218 12435 LOC_Os01g01019
Chr1 gene 12648 15915 LOC_Os01g01030
Chr1 gene 16292 20323 LOC_Os01g01040
Join the lines of the two files previously created on the common field (locus identifier) with the Linux command join
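
join expects both inputs to be sorted on the join field and to share the same field separator. The sketch below uses toy data: the locus lines are copied from the listing above, while the p-values are invented for the demonstration. The locus identifier is moved to the first column of both files before joining.

```shell
tmp=$(mktemp -d)

# Toy count file (made-up p-values) and toy location file.
cat > "$tmp/counts.csv" <<'EOF'
gene_id;pvalue
LOC_Os01g01030;0.001
LOC_Os01g01010;0.02
EOF
cat > "$tmp/all.gene.loc.csv" <<'EOF'
Chr1 gene 2903 10817 LOC_Os01g01010
Chr1 gene 11218 12435 LOC_Os01g01019
Chr1 gene 12648 15915 LOC_Os01g01030
EOF

# Use the same collation order for sort and join.
export LC_ALL=C

# Drop the header, turn ';' into spaces, sort on the locus identifier.
tail -n +2 "$tmp/counts.csv" | tr ';' ' ' | sort -k1,1 > "$tmp/counts.sorted"

# Put the locus identifier first, then sort on it.
awk '{ print $5, $1, $3, $4 }' "$tmp/all.gene.loc.csv" | sort -k1,1 > "$tmp/loc.sorted"

# Keep only the loci present in both files.
joined=$(join "$tmp/counts.sorted" "$tmp/loc.sorted")
printf '%s\n' "$joined"
```

Only the two loci present in both files appear in the result; LOC_Os01g01019 has no counterpart in the count file and is dropped.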


License

The resource material is licensed under the Creative Commons Attribution 4.0 International License.