Description | Hands On Lab Exercises for Linux |
---|---|
Related-course materials | Linux for Jedi |
Authors | Christine Tranchant-Dubreuil (christine.tranchant@ird.fr) & Gautier Sarah (gautier.sarah |
Creation Date | 11/03/2018 |
Last Modified Date | 18/04/2022 |
Modified by | Christine Tranchant-Dubreuil |
Summary
- Preamble: Software to install before connecting to a remote Linux server
- Practice 1: Connecting to a Linux server with `ssh`
- Practice 2: Preparing the working environment
- Practice 3: Using the `&&` separator
- Practice 4: Monitoring processes with `w`, `ps`, `kill`, `top`
- Practice 5: Searching for text using regex101.com
- Practice 6: Searching for text using `grep`
- Practice 7: Displaying lines with `sed`
- Practice 8: Deleting lines with `sed`
- Practice 9: Parsing files with `sed` using regexp
- Practice 10: Modifying files with `sed`
- Practice 11: Manipulating files with `awk`
- Practice 12: For loop with bash
- Links
- License
Preamble
- List of software to install before connecting to a remote Linux server (more information)
Practice 1 : Connecting to a Linux server with ssh
In MobaXterm:
- Click the session button, then click SSH.
- In the remote host text box, type: HOSTNAME (see table below)
- Check the specify username box and enter your user name
- In the console, enter the password when prompted. Once you are successfully logged in, you will use this console for the rest of the lecture.
Cluster HPC | hostname |
---|---|
IRD HPC | bioinfo-inter.ird.fr |
- Connect to the HPC
Practice 2 : Preparing the working environment
- Move into the directory /scratch2
- Create a working directory such as Formation-X (X corresponds to your login id/number)
- Move into the directory you just created and check the current working directory by looking at the prompt
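The three steps above might look like the following sketch. Two assumptions are made so that the commands also run outside the cluster: a temporary directory is used when /scratch2 is missing or not writable, and the login name is used as the "X" suffix.

```shell
# Base directory: /scratch2 on the cluster. A temporary directory is used
# as a stand-in when /scratch2 is absent or not writable (assumption).
BASE=/scratch2
[ -d "$BASE" ] && [ -w "$BASE" ] || BASE=$(mktemp -d)

# Working directory name; the login is used as the "X" suffix (assumption).
WORKDIR="$BASE/Formation-${USER:-demo}"

mkdir -p "$WORKDIR"   # create the directory (no error if it already exists)
cd "$WORKDIR"         # move into it
pwd                   # print the current working directory
```

The prompt usually shows the same path that `pwd` prints, so either one can be used for the check.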
Practice 3 : Using the && separator
- On the console, type the following two Linux commands to get the data needed for the next exercises:
- Check the content of your home directory on the server now
- Delete the file LINUX4JEDI.tar.gz on the server - `rm`
- Execute the `tree` command
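As a reminder of what the `&&` separator does, here is a minimal sketch on a throwaway directory (the directory names are hypothetical and unrelated to the course data): the command on the right runs only when the command on the left succeeds.

```shell
demo_dir=$(mktemp -d)

# The second command runs only if the first one succeeds (exit status 0).
mkdir "$demo_dir/data" && cd "$demo_dir/data"
echo "now in: $PWD"

# mkdir fails here because the directory already exists,
# so the echo placed after && is skipped.
skipped=$(mkdir "$demo_dir/data" 2>/dev/null && echo "this line is not printed")

# By contrast, ";" runs the second command whatever the first one returned.
always=$(mkdir "$demo_dir/data" 2>/dev/null ; echo "printed even after a failure")
echo "$always"
```

Chaining a download and an extraction with `&&` avoids trying to unpack an archive that failed to download.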
Practice 4 : Monitoring processes
Displaying the list of processes
- Type the command `w` in 2 consoles: one connected to bioinfo-master, the other connected to a node
- Type (on the node) the command `ps` without options, then with the options `u`, `ua`, `uax`
- Type the command `top` on the node
- Then use the “option” c to display the full command line of each process
- Then use the “option” u to display only your processes
Kill a process - downloading files from SRA through two ways
- Go into the directory `LINUX4JEDI-TP/1-fastq`
- Display the size of all fastq files - `ls -lh`, `du -h`
We want to download one fastq file from NCBI SRA (available here https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=518559) using SRAtoolkit as below:
This will download the SRA file (in sra format) and then convert it to fastq.gz files. More details at https://isugenomics.github.io/bioinformatics-workbook/dataAcquisition/fileTransfer/sra.html
- Download the fastq file in the directory `LINUX4JEDI-TP/1-fastq` - `fastq-dump`, `&`
- Check that 2 fastq files are downloading - `ls -lhrt`, `watch -n 5 -d`
- Display the list of processes - `ps -ux`, `jobs`
- Kill your process “fastq-dump” directly from bioinfo-master - `kill -9`
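The background-job and kill steps can be rehearsed without downloading anything. In this sketch `sleep 300` is only a stand-in for a long-running `fastq-dump` (an assumption made so the example runs anywhere).

```shell
# Start a long-running command in the background with "&".
# sleep is a stand-in for fastq-dump (assumption).
sleep 300 &
pid=$!                     # PID of the last background job

jobs                       # background jobs of the current shell
ps -p "$pid"               # the same process as seen by ps

kill -9 "$pid"             # send SIGKILL to the process
wait "$pid" 2>/dev/null    # collect the exit status of the killed job
status=$?                  # 128 + signal number, so 137 for SIGKILL
echo "exit status: $status"
```

In a real session the PID would be read from the output of `ps -ux` rather than from `$!`.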
Practice 5 : Searching for text using https://regex101.com/
- Go to the web site https://regex101.com/
- Copy the following accession gene names and paste them into the `test string` field
- Print only the accession names that satisfy the following criteria (treat each criterion separately):
  - contain the number 5
  - contain the letter d or e
  - contain the letters d and e in that order
  - contain the letters d and e in that order with a single letter between them
  - contain both the letters d and e in any order
  - start with x or y
  - start with x or y and end with e
  - contain three or more digits in a row
  - end with d followed by either a, r or p
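Patterns tested on regex101.com can also be tried on the command line with `grep -E`. The sketch below covers some of the criteria above; the accession names are hypothetical and only serve as test strings.

```shell
# Hypothetical accession names, for illustration only.
names=$(mktemp)
printf '%s\n' xkn59438 yhdck2 eihd39d9 chdsye847 \
              hedle3455 xjhd53e 45da de37dp > "$names"

grep -E '5' "$names"            # contain the number 5
grep -E 'd.*e' "$names"         # d and e, in that order
grep -E 'd[a-z]e' "$names"      # d and e with a single letter between them
grep -E 'd.*e|e.*d' "$names"    # both d and e, in any order
grep -E '^[xy].*e$' "$names"    # start with x or y and end with e
grep -E '[0-9]{3}' "$names"     # three or more digits in a row
grep -E 'd[arp]$' "$names"      # end with d followed by a, r or p
```

`[0-9]{3}` is enough for "three or more" because grep reports a line as soon as any three consecutive digits are found.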
Practice 6 : Searching for text using grep
- List the content of the directory `LINUX4JEDI-TP/Bank`
- Display the first 10 lines of all the files that are in the `Bank` directory - `head`
- Display the last 20 lines of all the files - `tail`
- Count the number of sequences in the two files that are in the `Bank` directory - `grep`
- Print the lines that contain the gene name `DEFL` - `grep regexp`, all.seq
- Print the lines that contain the gene name `DEFL` followed by exactly one digit - `grep regexp`, all.seq
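The counting and searching steps above can be sketched on a tiny fasta file. The three entries below are hypothetical; on the cluster the same commands would be pointed at the files of the `Bank` directory.

```shell
# Tiny hypothetical fasta file, for illustration only.
seqs=$(mktemp)
cat > "$seqs" <<'EOF'
>LOC_Os01g01010.1 DEFL1 - hypothetical entry
ATGGCT
>LOC_Os01g01020.1 DEFL12 - hypothetical entry
ATGCCA
>LOC_Os01g01030.1 expressed protein
ATGTTT
EOF

# Each sequence has exactly one header line starting with ">".
n_seq=$(grep -c '^>' "$seqs")
echo "sequences: $n_seq"

# Lines containing the gene name DEFL.
grep 'DEFL' "$seqs"

# DEFL followed by exactly one digit: the digit must be followed
# by a non-digit character or by the end of the line.
grep -E 'DEFL[0-9]([^0-9]|$)' "$seqs"
```

Without the `([^0-9]|$)` part, `DEFL[0-9]` would also match `DEFL12`, since the pattern only looks at the first digit.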
Info: The file all.con contains the sequence of the Asian rice genome (fasta format) and all.pep contains the sequence of all the genes annotated on the rice genome (fasta format).
From a gff file
We have the genome reference (all.con, fasta file) and we want to download the annotation of our genome reference (gff format).
- Go to the following page: http://rice.uga.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir
- Copy the URL of the rice genome annotation file (all.gff3); we will use it to download the file directly on the server
- Go to the `bank` directory and type the following command:
- Count the number of genes annotated in the genome reference (lines with the word `gene` in the gff file) - `grep`
- Search for the nbs-lrr genes - `grep`
- Count the number of genes named `DEFL` followed by exactly one digit - `grep regexp`
- Count the number of genes named `DEFL` followed by a one- or two-digit number ranging from 1 to 50 - `grep regexp`
- Count the number of mRNAs on chromosome 1 - `grep -c regexp`
- Count the number of mRNAs on the first five chromosomes - `grep -c regexp`
- Count the number of genes per chromosome - `grep`, `cut`, `sort`, `uniq`
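Several of the counts above rely on anchoring the pattern to a gff column. The sketch below uses a tiny hypothetical gff extract; the chromosome names `Chr1`, `Chr2`, `Chr12` and the locus identifiers are assumptions, so check the naming used in all.gff3 before reusing the patterns.

```shell
gff=$(mktemp)
TAB=$(printf '\t')

# Hypothetical extract: seqid, source, type, start, end, ".", "+", ".", attributes
printf '%s\t%s\t%s\t%s\t%s\t.\t+\t.\t%s\n' \
  Chr1  MSU gene 100  900  ID=LOC_A \
  Chr1  MSU mRNA 100  900  ID=LOC_A.1 \
  Chr1  MSU gene 2000 5000 ID=LOC_B \
  Chr1  MSU mRNA 2000 5000 ID=LOC_B.1 \
  Chr2  MSU gene 300  700  ID=LOC_C \
  Chr2  MSU mRNA 300  700  ID=LOC_C.1 \
  Chr12 MSU gene 10   60   ID=LOC_D \
  Chr12 MSU mRNA 10   60   ID=LOC_D.1 > "$gff"

# Genes: the word "gene" surrounded by tabs, i.e. in the type column.
n_genes=$(grep -c "${TAB}gene${TAB}" "$gff")

# mRNAs on chromosome 1: the tab after Chr1 keeps Chr10, Chr11, Chr12 out.
n_mrna_chr1=$(grep -c "^Chr1${TAB}.*${TAB}mRNA${TAB}" "$gff")

# mRNAs on the first five chromosomes.
n_mrna_chr1_5=$(grep -cE "^Chr[1-5]${TAB}.*${TAB}mRNA${TAB}" "$gff")

# Genes per chromosome: keep gene lines, keep column 1, sort, count duplicates.
grep "${TAB}gene${TAB}" "$gff" | cut -f1 | sort | uniq -c

echo "$n_genes genes, $n_mrna_chr1 mRNAs on Chr1, $n_mrna_chr1_5 on Chr1-5"
```

A plain `grep -c gene` would also count any line whose attributes column mentions the word, which is why the pattern is tied to the tabs around the type column.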
Practice 7 : Displaying lines with sed
For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq
- Print the first 8 lines
- Print lines 5 to 12
- Print only the sequence IDs
- Print only the sequence IDs and the nucleotide sequences
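One possible way to approach these four tasks, shown on a three-read hypothetical fastq file. The `first~step` address is a GNU sed extension, which is assumed to be available on the server.

```shell
# Hypothetical fastq file with three reads (4 lines per read).
fq=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'TTGA' '+' 'IIII' \
              '@read3' 'GGCC' '+' 'IIII' > "$fq"

sed -n '1,8p' "$fq"         # first 8 lines
sed -n '5,12p' "$fq"        # lines 5 to 12
sed -n '1~4p' "$fq"         # lines 1, 5, 9, ...: sequence IDs (GNU sed)
sed -n '1~4p;2~4p' "$fq"    # IDs and nucleotide sequences (GNU sed)
```

Selecting by line number is safer than `sed -n '/^@/p'`, because a quality line may also start with `@`.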
Practice 8 : Deleting lines with sed
For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq
- Delete the end of the file, starting from line 9
- Delete the lines containing only a `+`
- Delete the lines containing only a `+` and the quality sequences
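A minimal sketch of the three deletions on a hypothetical three-read fastq file. The `addr,+N` range is a GNU sed extension (assumed available); none of these commands modifies the input file, since `-i` is not used.

```shell
# Hypothetical fastq file with three reads (4 lines per read).
fq=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'TTGA' '+' 'IIII' \
              '@read3' 'GGCC' '+' 'IIII' > "$fq"

sed '9,$d' "$fq"         # delete from line 9 to the end of the file
sed '/^+$/d' "$fq"       # delete the lines containing only a +
sed '/^+$/,+1d' "$fq"    # delete each + line and the line after it (GNU sed)
```

In the last command the quality line is removed because of its position after the `+` line, not because of its content.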
Practice 9 : File parsing with sed using regexp (regular expression)
Fastq file
For this exercise, you will work on the fastq file LINUX4JEDI-TP/1-fastq/SRR8517015_1.10000.fastq
- Print only read sequences using a regular expression (print only lines with the letters ATCG)
vcf file
For this exercise, you will work with the vcf file LINUX4JEDI-TP/4-vcf/OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION.shuf.100000.vcf.gz
- Print only the lines corresponding to the header (lines starting with #) or to polymorphisms passing all filters (lines with the tag `PASS`)
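Both filters can be sketched with `sed -n` and a regular expression. The fastq and VCF contents below are hypothetical and much simpler than the course files; the VCF is gzip-compressed to mimic the `.vcf.gz` file.

```shell
work=$(mktemp -d)

# Hypothetical fastq file with two reads.
printf '%s\n' '@read1' 'ACGTA' '+' 'IIIII' \
              '@read2' 'TTGAC' '+' 'FFFFF' > "$work/sample.fastq"

# Print only the lines made exclusively of the letters A, T, C, G.
sed -n -E '/^[ATCG]+$/p' "$work/sample.fastq"

# Hypothetical, heavily simplified VCF, compressed with gzip.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n'
  printf 'Chr6\t100\t.\tA\tG\t50\tPASS\n'
  printf 'Chr6\t200\t.\tC\tT\t10\tLowQual\n'
} | gzip > "$work/sample.vcf.gz"

# Decompress to stdout (gzip -dc behaves like zcat), then keep
# the header lines (starting with #) or the lines containing PASS.
gzip -dc "$work/sample.vcf.gz" | sed -n -E '/^#|PASS/p'
```

The fastq pattern can also match a quality line that happens to contain only those four letters, so it is a quick filter rather than a strict parser.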
Practice 10 : File modification with sed
From fasta files in LINUX-TP/Fasta
- In the `LINUX4JEDI-TP/9-denovoAssembly` directory, there are two files: `DAOSW_abyss-contigs.fa` and `TOG5681_abyss-contigs.fa`. Before merging both libraries into a single file, we would like to tag each sequence with its origin. In each file, add the respective tag DAOSW_ / TOG5681_ just before the identifier.
Note: First test the sed command on one file, then store the results in new files named DAOSW_abyss-contigs.renamed.fasta and TOG5681_abyss-contigs.renamed.fasta
BONUS: try to modify each line starting with > so that, for example, `>0 71 531` becomes `>DAOSW_0`
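A possible approach to the tagging step, sketched on a two-contig hypothetical file that reuses the `>0 71 531` header style shown above. The second file would be handled the same way with the TOG5681_ tag.

```shell
work=$(mktemp -d)

# Hypothetical two-contig file, for illustration only.
printf '%s\n' '>0 71 531' 'ACGT' '>1 80 640' 'TTGA' \
  > "$work/DAOSW_abyss-contigs.fa"

# Test first on the screen: insert the tag right after ">".
sed 's/^>/>DAOSW_/' "$work/DAOSW_abyss-contigs.fa"

# Then store the result in a new file.
sed 's/^>/>DAOSW_/' "$work/DAOSW_abyss-contigs.fa" \
  > "$work/DAOSW_abyss-contigs.renamed.fasta"

# Bonus: keep only the first word of the header,
# so ">0 71 531" becomes ">DAOSW_0".
bonus=$(sed -E 's/^>([^ ]+).*/>DAOSW_\1/' "$work/DAOSW_abyss-contigs.fa")
echo "$bonus"
```

Redirecting to a new file leaves the original untouched, which makes it easy to rerun the command if the pattern was wrong.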
vcf file LINUX4JEDI-TP/4-vcf/OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION.shuf.100000.vcf.gz
- Now, in the VCF file, we would like to replace the genotypes by the allelic dose. This means that we should replace the whole field by `0` when the genotype is `0/0`, by `1` when the genotype is `0/1` and by `2` when the genotype is `1/1`
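The genotype replacement could be sketched as three substitutions, one per genotype. The VCF line below is hypothetical and simplified (three samples, a `GT:DP` format); for the real file the same `sed` would be fed by `zcat`. Only the unphased genotypes `0/0`, `0/1` and `1/1` are handled.

```shell
TAB=$(printf '\t')
vcf=$(mktemp)

# Hypothetical variant line: 9 fixed columns followed by three samples.
printf 'Chr6\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:8\t1/1:5\n' > "$vcf"

# For each genotype, replace the whole field (the genotype and everything
# up to the next tab) by the allelic dose. "#" is used as the sed delimiter
# so that the "/" of the genotype needs no escaping.
dose=$(sed "s#${TAB}0/0[^${TAB}]*#${TAB}0#g
            s#${TAB}0/1[^${TAB}]*#${TAB}1#g
            s#${TAB}1/1[^${TAB}]*#${TAB}2#g" "$vcf")
echo "$dose"
```

Requiring a tab before the genotype ties the match to the start of a field, so a value such as `10/0` elsewhere in the line would not be touched.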
With fastq files in LINUX4JEDI-TP/1-fastq/
- Transform the file SRR8517015_1.10000.fastq into fasta format
- In one command line, transform all fastq files of the directory into fasta (back up the files first) - `sed -i`
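A fastq record becomes a fasta record by keeping lines 1 and 2 of every group of four and turning the leading `@` into `>`. The sketch below assumes GNU sed (for `first~step` addresses and `-i`) and uses a hypothetical two-read file.

```shell
dir=$(mktemp -d)   # stands in for the 1-fastq directory

# Hypothetical fastq file with two reads.
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'TTGA' '+' 'IIII' > "$dir/sample.fastq"

# One file: ID lines get ">" instead of "@", sequence lines are printed
# as they are, and the + and quality lines are dropped (-n).
sed -n '1~4s/^@/>/p;2~4p' "$dir/sample.fastq" > "$dir/sample.fasta"
cat "$dir/sample.fasta"

# All fastq files at once, edited in place; -i.bak keeps a copy of each
# original file with the .bak suffix.
sed -i.bak -n '1~4s/^@/>/p;2~4p' "$dir"/*.fastq
ls "$dir"
```

With `-i` the files keep their `.fastq` name even though they now hold fasta records, so renaming them afterwards avoids confusion.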
Practice 11 : Manipulating files with awk
From a fasta file
seqtk
Seqtk (https://github.com/lh3/seqtk) is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
We are going to use seqtk comp (get the nucleotide composition of FASTA/Q) to print the size of the genome.
- Run seqtk comp on the file `Bank/all.con` - `seqtk comp all.con`
- Using awk, first print the whole line of the output generated by seqtk, then print only columns 1 and 2 - `| awk`
- Print columns 1 and 2 only for chr1 to chr12
- Calculate the genome size in bp
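The awk steps can be rehearsed on a saved copy of the seqtk output. The file below is a hypothetical extract limited to six columns, assuming that column 1 is the sequence name, column 2 its length, and that chromosomes are named `Chr1` to `Chr12`; on the cluster the same awk programs would be placed after `seqtk comp all.con |`.

```shell
comp=$(mktemp)

# Hypothetical extract of a seqtk comp output: name, length, #A, #C, #G, #T
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  Chr1  1000 250 250 250 250 \
  Chr2  800  200 200 200 200 \
  ChrUn 50   10  15  15  10 > "$comp"

awk '{print $0}' "$comp"        # whole line
awk '{print $1, $2}' "$comp"    # columns 1 and 2 only

# Columns 1 and 2 for Chr1 to Chr12 only (naming is an assumption).
awk '$1 ~ /^Chr([1-9]|1[0-2])$/ {print $1, $2}' "$comp"

# Genome size: sum of column 2 over all lines, printed at the end.
genome_size=$(awk '{sum += $2} END {print sum}' "$comp")
echo "genome size: $genome_size bp"
```

The `END` block runs once after the last line, which is where an accumulated total is usually printed.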
From the gff file previously downloaded
- Extract the coordinates from the gff file
- Calculate the mean of the gene length
- Calculate the mean of the gene length for chromosome 1
- Count the number of genes longer than 2000 bp
- Bonus: calculate the mean of gene length for each chromosome in one command line
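These gene-length questions all follow the same awk pattern: filter on column 3, compute `end - start + 1`, then accumulate. The gff extract below is hypothetical (three genes, chromosome names assumed to be `Chr1` and `Chr2`).

```shell
gff=$(mktemp)

# Hypothetical extract: seqid, source, type, start, end, ".", "+", ".", attributes
printf '%s\t%s\t%s\t%s\t%s\t.\t+\t.\t%s\n' \
  Chr1 MSU gene 100  900  ID=LOC_A \
  Chr1 MSU mRNA 100  900  ID=LOC_A.1 \
  Chr1 MSU gene 2000 5000 ID=LOC_B \
  Chr2 MSU gene 300  700  ID=LOC_C > "$gff"

# Coordinates of the genes: chromosome, start, end.
awk -F'\t' '$3 == "gene" {print $1, $4, $5}' "$gff"

# Mean gene length; gff coordinates are 1-based and inclusive,
# hence the +1.
mean_all=$(awk -F'\t' '$3 == "gene" {sum += $5 - $4 + 1; n++}
                       END {if (n) print sum / n}' "$gff")

# Same, restricted to chromosome 1.
mean_chr1=$(awk -F'\t' '$3 == "gene" && $1 == "Chr1" {sum += $5 - $4 + 1; n++}
                        END {if (n) print sum / n}' "$gff")

# Number of genes longer than 2000 bp.
n_long=$(awk -F'\t' '$3 == "gene" && $5 - $4 + 1 > 2000 {n++}
                     END {print n + 0}' "$gff")

# Bonus: mean gene length per chromosome, using arrays indexed by column 1.
awk -F'\t' '$3 == "gene" {sum[$1] += $5 - $4 + 1; n[$1]++}
            END {for (c in sum) print c, sum[c] / n[c]}' "$gff" | sort

echo "mean: $mean_all, mean Chr1: $mean_chr1, genes > 2000 bp: $n_long"
```

The final `sort` is there because awk does not guarantee the order in which `for (c in sum)` visits the chromosomes.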
Practice 12 : For loop with bash
- Go into the directory `LINUX4JEDI-TP/1-fastq/`
- List the directory content
- Run the fastq-stats program (more information) to get stats about the fastq file `SRR8517015_1.10000.fastq`
- Use a `for` loop to run fastq-stats on each fastq file in the directory
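A minimal sketch of the loop, on a temporary directory holding two hypothetical fastq files. When `fastq-stats` is not installed, a simple read count is printed instead; that fallback is an assumption added so the loop can be tried anywhere.

```shell
dir=$(mktemp -d)   # stands in for LINUX4JEDI-TP/1-fastq/

# Two hypothetical fastq files.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' > "$dir/a.fastq"
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTGA' '+' 'IIII' > "$dir/b.fastq"

n_files=0
for f in "$dir"/*.fastq; do
    echo "== $f"
    if command -v fastq-stats >/dev/null 2>&1; then
        fastq-stats "$f"
    else
        # Fallback when fastq-stats is not installed: 4 lines per read.
        awk 'END {print "reads", NR / 4}' "$f"
    fi
    n_files=$((n_files + 1))
done
echo "$n_files files processed"
```

Quoting `"$f"` keeps the loop working when a file name contains spaces.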
Practice 13 : The last but not the least practice
A bash script to download fastq files from a file that contains a list of accessions
Write a bash script that :
- takes as argument a file that contains a list of accessions (/scratch/accession.list)
- reads this file and downloads the fastq files (reverse and forward) for each accession - `fastq-dump`
A bash script to get basic statistics on each fastq file (in the directory 1-fastq) using fastq-stats
Before writing the script, we will test the fastq-stats command on a bash terminal
- Run fastq-stats on a fastq file
- Run fastq-stats on a fastq file and get the column 2 of the output of the command
- Run fastq-stats on a fastq file, get the column 2 of the output of the command and turn the column into a single row - linux command: `paste -s`
- Save the output of the command in the file
Write a bash script
Write a bash script that :
- takes as argument a directory (absolute path) that contains fastq files
- executes the command fastq-stats as seen just before; the output is saved into a file
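One way such a script might be organised is sketched below. It is written to a temporary file so the whole block can be pasted in a terminal. Several points are assumptions: the fastq-stats output is taken to be tab-separated with the values in column 2, the `.stats` output name is arbitrary, and `STATS_CMD` is a hypothetical variable that only exists to swap in another command for testing.

```shell
script=$(mktemp)
cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Usage: bash this_script.sh /absolute/path/to/fastq_directory
# Writes one <name>.stats file next to each <name>.fastq file.

dir=$1
if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "Usage: $0 <directory containing fastq files>" >&2
    exit 1
fi

# STATS_CMD is a hypothetical override; fastq-stats is used by default.
stats_cmd=${STATS_CMD:-fastq-stats}

for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue    # the directory holds no fastq file
    # Keep column 2 of the output and turn the column into a single row.
    "$stats_cmd" "$f" | cut -f2 | paste -s - > "${f%.fastq}.stats"
done
EOF

# Example call on the cluster (path to adapt):
#   bash "$script" /absolute/path/to/LINUX4JEDI-TP/1-fastq
echo "script written to $script"
```

Checking the argument before the loop gives a readable error message instead of a loop that silently does nothing.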
Bonus
On a terminal, use awk to parse all files created by the previous bash script and to generate the following output:
Analysis of the read count file 3-RNAseqCount/erz340_suppl_supplementary_table_s5.csv
Goal : Get the chromosome and the positions (start-stop) of some differentially expressed genes, using the read count file and the gff file downloaded
Get the first ten rows with the lowest p-value
- display the first lines of this file
- substitute the “;” by the “\t”
- As the file is already sorted, extract the first ten lines and save the result in a new file called `my_10_genes.tab`
- Sort the file on the locus name and save the result into a new file `my_10_genes.sorted.tab`
Get the columns chr, start, stop, info from the gff file
- print the columns chr, start, stop, info of the gff file
- print the columns chr, start, stop, info of the gff file but only for lines with the word `gene` in the gff file
- print only the locus identifier of each line of the gff file (eg : ID=LOC_Os01g01010)
- print only the locus identifier of each line of the gff file (eg : LOC_Os01g01010)
- generate the following file :
- sort the file all.gene.loc.csv on the locus name and save the output in a new file
Join the lines of the two files previously created on the common field (locus identifier) - linux command `join`
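`join` expects both inputs to be sorted on the join field, which is why the two files are sorted first. In this sketch the file contents, the coordinates, the column order of all.gene.loc.csv (locus first) and the name of its sorted copy are all assumptions made for illustration.

```shell
work=$(mktemp -d)
TAB=$(printf '\t')

# Hypothetical inputs: locus + p-value, and locus + chr + start + stop.
printf '%s\t%s\n' \
  LOC_Os01g01030 0.002 \
  LOC_Os01g01010 0.001 > "$work/my_10_genes.tab"
printf '%s\t%s\t%s\t%s\n' \
  LOC_Os01g01019 Chr1 11218 12435 \
  LOC_Os01g01010 Chr1 2903  10817 \
  LOC_Os01g01030 Chr1 12648 15915 > "$work/all.gene.loc.csv"

# Sort both files on column 1 with the same locale as the one used by join.
LC_ALL=C sort -t "$TAB" -k1,1 "$work/my_10_genes.tab" \
  > "$work/my_10_genes.sorted.tab"
LC_ALL=C sort -t "$TAB" -k1,1 "$work/all.gene.loc.csv" \
  > "$work/all.gene.loc.sorted.csv"

# Join on the first field of each file; only loci present in both are kept.
LC_ALL=C join -t "$TAB" \
  "$work/my_10_genes.sorted.tab" "$work/all.gene.loc.sorted.csv" \
  > "$work/joined.tab"
cat "$work/joined.tab"
```

Each output line starts with the locus identifier, followed by the remaining columns of the first file and then those of the second.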
Links
- Related courses : Linux for Jedi
- Tutorials : Linux Command-Line Cheat Sheet