Parameters
In a typical bioinformatics project, considerable efforts are spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to change in the shell scripts themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params
keyword.
rule some_rule:
output:
"..."
input:
"..."
params:
cutoff=2.5
shell:
"""
some_program --cutoff {params.cutoff} {input} {output}
"""
We run most of the programs with default settings in our workflow. However, there is one parameter in the rule get_SRA_by_accession
that we use for determining how many reads we want to retrieve from SRA for each sample (-X 25000
). Change in this rule to use the parameter max_reads
instead and set the value to 20000. If you need help, click to show the solution below.
Click to show the solution
rule get_SRA_by_accession:
"""
Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
"""
output:
"data/raw_internal/{sample_id}.fastq.gz"
params:
max_reads = 20000
shell:
"""
fastq-dump {wildcards.sample_id} -X {params.max_reads} --readids \
--dumpbase --skip-technical --gzip -Z > {output}
"""
Now run through the workflow. Because there's been changes to the get_SRA_by_accession
rule this will trigger a re-run of the rule for all three accessions. In addition all downstream rules that depend on output from get_SRA_by_accession
are re-run.
The parameter values we set in the params
section don't have to be static, they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config
. Let's modify get_SRA_by_accession
to look something like this in order to make use of this dictionary:
rule get_SRA_by_accession:
"""
Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
"""
output:
"data/raw_internal/{sample_id}.fastq.gz"
params:
max_reads = config["max_reads"]
shell:
"""
fastq-dump {wildcards.sample_id} -X {params.max_reads} --readids \
--dumpbase --skip-technical --gzip -Z > {output}
"""
Note that Snakemake now expects there to be a key named max_reads
in the config dictionary. If we don't populate the dictionary somehow the dictionary will be empty so if you were to run the workflow now it would trigger a KeyError
(try running snakemake -s snakefile_mrsa.smk -n
to see for yourself). In order to populate the config dictionary with data for the workflow we could use the snakemake --config KEY=VALUE
syntax directly from the command line. However, from a reproducibility perspective, it's not optimal to set parameters from the command line, since it's difficult to keep track of which parameter values that were used.
A much better alternative is to use the --configfile FILE
option to supply a configuration file to Snakemake. In this file we can collect all the project-specific settings, sample ids and so on. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in yaml format. Create the file below and save it as config.yml
.
max_reads: 25000
If we now run Snakemake with --configfile config.yml
, it will parse this file to form the config
dictionary. If you want to overwrite a parameter value, e.g. for testing, you can still use the --config KEY=VALUE
flag, as in --config max_reads=1000
.
Tip
Rather than supplying the config file from the command line you could also add the line configfile: "config.yml"
to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the file config.yml
is not present.
Quick recap
In this section we've learned:
- How to set parameter values with the
params
directive. - How to run Snakemake with the
config
variable and with a configuration file.