Snakefiles and Rules¶
A Snakemake workflow defines a data analysis in terms of rules that are listed in so-called Snakefiles. Most importantly, a rule can consist of a name, input files, output files, and a shell command to generate the output from the input, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
shell: "somecommand {input} {output}"
The name is optional and can be left out, creating an anonymous rule; it can also be overridden with name.
Inside the shell command, all local and global variables, especially input and output files, can be accessed via their names in the Python format minilanguage. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile).
From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") will let Snakemake quote each of the list or tuple elements that contains whitespace.
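As a minimal sketch of this behavior (the rule name and file paths here are made up for illustration; the path containing spaces would be quoted in the resulting command):
rule quote_example:
    input: "path with spaces/inputfile"
    output: "results/outputfile.txt"
    shell: "somecommand {input:q} {output:q}"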
Instead of a shell command, a rule can run some python code to generate the output:
rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", somename = "path/to/another/outputfile"
    run:
        for f in input:
            ...
        with open(output[0], "w") as out:
            out.write(...)
        with open(output.somename, "w") as out:
            out.write(...)
As can be seen, instead of accessing input and output as a whole, we can also access them by index (output[0]) or by keyword (output.somename).
Note that, when adding keywords or names for input or output files, their order won’t be preserved when accessing them as a whole via e.g. {output} in a shell command.
Shell commands like the one above can also be invoked inside a Python-based rule, via the function shell, which takes a string with the command and allows the same formatting as in the rule above, e.g.:
shell("somecommand {output.somename}")
Further, this combination of python and shell commands allows us to iterate over the output of the shell command, e.g.:
for line in shell("somecommand {output.somename}", iterable=True):
... # do something in python
Note that shell commands in Snakemake use the bash shell in strict mode by default.
Wildcards¶
Usually, it is useful to generalize a rule to make it applicable to a number of datasets, for example. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Here, we define two wildcards, dataset and group. By this, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. the wildcards are replaced by the regular expression .+. If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, which is here also used in the shell command. The wildcards object can be accessed in the same way as input and output, which is described above.
For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A.
Thus, it requests the file 101/inputfile as input and executes the command somecommand --group A < 101/inputfile > 101/file.A.txt.
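For instance, a downstream target rule requesting that file could look like the following sketch (the rule name is hypothetical and only serves to make the example concrete):
rule request_example:
    input:
        "101/file.A.txt"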
Of course, the input file might have to be generated by another rule with different wildcards.
Importantly, wildcards that appear in both the input and the output must have identical names. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not in the input section.
Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case. Hence, wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only, using \d+ as the corresponding regular expression.
With Snakemake 3.8.0, there are three ways to constrain wildcards.
First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints:
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
wildcard_constraints:
dataset="\d+"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Finally, you can also define global wildcard constraints that apply for all rules:
wildcard_constraints:
dataset="\d+"
rule a:
...
rule b:
...
See the Python documentation on regular expressions for detailed information on regular expression syntax.
Aggregation¶
Input files can be Python lists, making it easy to aggregate over parameters or samples:
rule aggregate:
input:
["{dataset}/a.txt".format(dataset=dataset) for dataset in DATASETS]
output:
"aggregated.txt"
shell:
...
The above expression can be simplified in two ways.
The expand function¶
rule aggregate:
input:
expand("{dataset}/a.txt", dataset=DATASETS)
output:
"aggregated.txt"
shell:
...
Note that dataset is NOT a wildcard here because it is resolved by Snakemake due to the expand statement.
The expand function also allows us to combine different variables, e.g.
rule aggregate:
input:
expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
output:
"aggregated.txt"
shell:
...
If FORMATS=["txt", "csv"] contains a list of desired output formats, then expand will automatically combine any dataset with any of these extensions.
Furthermore, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.
expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], dataset=DATASETS, ext=FORMATS)
leads to
["ds1/a.txt", "ds1/b.txt", "ds2/a.txt", "ds2/b.txt", "ds1/a.csv", "ds1/b.csv", "ds2/a.csv", "ds2/b.csv"]
Per default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. zip:
expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], zip, dataset=DATASETS, ext=FORMATS)
leads to
["ds1/a.txt", "ds1/b.txt", "ds2/a.csv", "ds2/b.csv"]
You can also mask a wildcard expression in expand such that it will be kept, e.g. expand("{{dataset}}/a.{ext}", ext=FORMATS) will create strings with all values for ext but starting with the wildcard "{dataset}".
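For illustration, assuming FORMATS=["txt", "csv"] as in the earlier example, the masked call evaluates as follows:
FORMATS = ["txt", "csv"]
expand("{{dataset}}/a.{ext}", ext=FORMATS)
# -> ["{dataset}/a.txt", "{dataset}/a.csv"]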
The multiext function¶
multiext provides a simplified variant of expand that allows us to define a set of output or input files that just differ by their extension:
rule plot:
input:
...
output:
multiext("some/plot", ".pdf", ".svg", ".png")
shell:
...
The effect is the same as if you would write expand("some/plot.{ext}", ext=["pdf", "svg", "png"]), however, using a simpler syntax.
Moreover, defining output with multiext is the only way to use between-workflow caching for rules with multiple output files.
Targets and aggregation¶
By default, Snakemake executes the first rule in the Snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build targets, similar to GNU Make:
rule all:
input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]
Here, for each dataset in a Python list DATASETS defined before, the file {dataset}/file.A.txt is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion shown above.
Threads¶
Further, a rule can be given a number of threads to use, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
shell: "somecommand --threads {threads} {input} {output}"
Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built-in variable threads rather than hardcoding it into the shell command. In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores), with cores being the number of cores specified at the command line (option --cores).
Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won’t help to further speed it up.
This is often the case, and should be evaluated carefully for production workflows.
Also, setting a threads: maximum is required to achieve parallelism in tools that (often implicitly and without the user knowing) rely on an environment variable for the maximum number of cores to use. For example, this is the case for many linear algebra libraries and for OpenMP. Snakemake limits the respective environment variables to one core by default, to avoid unexpected and unlimited core-grabbing, but will override this with the threads: you specify in a rule (the environment variables set to the value of threads:, or defaulting to 1, are: OMP_NUM_THREADS, GOTO_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS).
If it is certain that no maximum for efficient parallelism exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: workflow.cores * 0.75
shell: "somecommand --threads {threads} {input} {output}"
The number of given cores is globally available in the Snakefile as an attribute of the workflow object: workflow.cores. Any arithmetic operation can be performed to derive a number of threads from this. E.g., in the above example, we reserve 75% of the given cores for the rule.
Snakemake will always round the calculated value down (while enforcing a minimum of 1 thread).
Starting from version 3.7, threads can also be a callable that returns an int value. The signature of the callable should be callable(wildcards[, input]) (input is an optional parameter). It is also possible to refer to a predefined variable (e.g., threads: threads_max) so that the number of cores for a set of rules can be changed with a single edit, by altering the value of the variable threads_max.
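A minimal sketch of the latter pattern (the variable name threads_max and its value are arbitrary choices for this example):
threads_max = 16  # hypothetical global maximum, changed in one place

rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    threads: threads_max
    shell: "somecommand --threads {threads} {input} {output}"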
Resources¶
In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:
rule:
input: ...
output: ...
resources:
mem_mb=100
shell:
"..."
If limits for the resources are given via the command line, e.g.
$ snakemake --resources mem_mb=100
the scheduler will ensure that the given resources are not exceeded by running jobs.
If no limits are given, the resources are ignored in local execution.
In cluster or cloud execution, resources are always passed to the backend, even if --resources is not specified.
Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs), this allows us to control scheduling in various ways, e.g. to limit IO-heavy jobs by assigning an artificial IO resource to them and limiting it via the --resources flag.
Resources must be int or str values. Note that you are free to choose any names for the given resources.
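As a sketch of the artificial IO resource mentioned above (the resource name io is arbitrary and chosen only for this example):
rule heavy_io:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    resources:
        io=1
    shell:
        "somecommand {input} {output}"
Invoking Snakemake with, e.g., --resources io=2 would then limit the number of such jobs running concurrently to two.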
Standard Resources¶
There are two standard resources for memory and disk usage, though: mem_mb and disk_mb.
When defining memory constraints, it is advised to use mem_mb, because some execution modes make direct use of this information (e.g., when using Kubernetes).
Since it would be cumbersome to define them for every rule, you can set default values at the terminal or in a profile. This works via the command line flag --default-resources; see snakemake --help for more information. If those resource definitions are mandatory for a certain execution mode, Snakemake will fail with a hint if they are missing. Any resource definitions inside a rule override what has been defined with --default-resources.
Resources can also be callables that return int values. The signature of the callable has to be callable(wildcards [, input] [, threads] [, attempt]) (input, threads, and attempt are optional parameters).
The parameter attempt allows us to adjust resources based on how often the job has been restarted (see All Options, option --restart-times). This is handy when executing a Snakemake workflow in a cluster environment, where jobs can fail, e.g., because of too limited resources. When Snakemake is executed with --restart-times 3, it will try to restart a failed job 3 times before it gives up. Thereby, the parameter attempt will contain the current attempt number (starting from 1). This can be used to adjust the required memory as follows:
rule:
input: ...
output: ...
resources:
mem_mb=lambda wildcards, attempt: attempt * 100
shell:
"..."
Here, the first attempt will require 100 MB of memory, the second attempt 200 MB, and so on. When passing memory requirements to the cluster engine, this lets you automatically try out larger nodes if it turns out to be necessary.
Preemptible Virtual Machine¶
You can specify the parameters preemptible-rules and preemption-default to request a Google Cloud preemptible virtual machine for use with the Google Life Sciences Executor. There are several ways to go about doing this. This first example will use preemptible instances for all rules, with 10 repeats (restarts of the instance if it stops unexpectedly):
snakemake --preemption-default 10
If your preference is to set a default but then overwrite some rules with a custom value, this is where you can use --preemptible-rules:
snakemake --preemption-default 10 --preemptible-rules map_reads=3 call_variants=0
The above statement says that we want to use preemptible instances for all steps, defaulting to 10 retries, but for the steps “map_reads” and “call_variants” we want to apply 3 and 0 retries, respectively. The final option is to not use preemptible instances by default, but only for a particular rule:
snakemake --preemptible-rules map_reads=10
Note that this is currently implemented for the Google Life Sciences API.
GPU Resources¶
The Google Life Sciences API currently has support for NVIDIA GPUs, meaning that you can request a number of NVIDIA GPUs explicitly by adding nvidia_gpu or gpu to your Snakefile resources for a step:
rule a:
output:
"test.txt"
resources:
nvidia_gpu=1
shell:
"somecommand ..."
A specific GPU model can be requested using gpu_model and lowercase identifiers like nvidia-tesla-p100 or nvidia-tesla-p4, for example: gpu_model="nvidia-tesla-p100". If you don’t specify gpu or nvidia_gpu with a count, but you do specify a gpu_model, the count will default to 1.
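A sketch combining both settings (the rule name and output file are hypothetical; the model identifier is taken from the list above):
rule gpu_model_example:
    output:
        "gpu_test.txt"
    resources:
        gpu=1,
        gpu_model="nvidia-tesla-p100"
    shell:
        "somecommand ..."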
Messages¶
When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
message: "Executing somecommand with {threads} threads on the following files {input}."
shell: "somecommand --threads {threads} {input} {output}"
Note that access to wildcards is also possible via the variable wildcards (e.g., {wildcards.sample}), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.
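For example, a rule with a sample wildcard (the rule and paths here are made up for illustration) could report that wildcard in its message:
rule process_sample:
    input: "data/{sample}.txt"
    output: "results/{sample}.txt"
    message: "Processing sample {wildcards.sample} into {output}."
    shell: "somecommand {input} {output}"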
Priorities¶
Snakemake allows for rules that specify numeric priorities:
rule:
input: ...
output: ...
priority: 50
shell: ...
Per default, each rule has a priority of 0. Any rule that specifies a higher priority will be preferred by the scheduler over all rules that are ready to execute at the same time but do not have at least the same priority.
Furthermore, the --prioritize or -P command line flag allows you to specify files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible.
The --dry-run (equivalently --dryrun) or -n option allows you to see the scheduling plan including the assigned priorities.
Log-Files¶
Each rule can specify a log file where information about the execution is written to:
rule abc:
input: "input.txt"
output: "output.txt"
log: "logs/abc.log"
shell: "somecommand --log {log} {input} {output}"
Log files can be used as input for other rules, just like any other output file. However, unlike output files, log files are not deleted upon error. This is obviously necessary in order to discover causes of errors which might become visible in the log file.
The variable log can be used inside a shell command to tell the used tool to which file to write the logging information.
The log file has to use the same wildcards as output files, e.g.
log: "logs/abc.{dataset}.log"
For programs that do not have an explicit log parameter, you may always use 2> {log} to redirect standard error to a file (here, the log file) on Linux-based systems.
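For example (somecommand is a placeholder, as in the other examples), such a rule could redirect the tool’s standard error to the log file like this:
rule abc_with_redirect:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand {input} > {output} 2> {log}"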
Note that it is also possible to specify multiple (named) log files:
rule abc:
input: "input.txt"
output: "output.txt"
log: log1="logs/abc.log", log2="logs/xyz.log"
shell: "somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}"
Non-file parameters for rules¶
Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the params keyword for this purpose:
rule:
input:
...
params:
prefix="somedir/{sample}"
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
The params keyword allows you to specify additional parameters depending on the wildcard values. This lets you circumvent the need to use run: and Python code for non-standard commands like in the above case. Here, the command somecommand expects the prefix of the output file instead of the actual one. The params keyword helps here, since you cannot simply add the prefix as an output file (as that file won’t be created, Snakemake would throw an error after execution of the rule).
Furthermore, for enhanced readability and clarity, the params section is also an excellent place to name and assign parameters and variables for your subsequent command.
Similar to input, params can take functions as well (see Functions as Input Files), e.g. you can write
rule:
input:
...
params:
prefix=lambda wildcards, output: output[0][:-4]
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
to get the same effect as above. Note that in contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources.
From the Python perspective, they can be seen as optional keyword arguments without a default value. Their order does not matter, apart from the fact that wildcards has to be the first argument. In the example above, this allows you to derive the prefix name from the output file.
External scripts¶
A rule can also point to an external script instead of a shell command or inline Python code, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"scripts/script.py"
Note
It is possible to refer to wildcards and params in the script path, e.g. by specifying "scripts/{params.scriptname}.py" or "scripts/{wildcards.scriptname}.py".
The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory). It is recommended to put all scripts into a subfolder scripts, as above.
Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0] to access the first input file of the above rule.
Apart from Python scripts, this mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"scripts/script.R"
In the R script, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes, with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].
Alternatively, it is possible to integrate Julia scripts, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"path/to/script.jl"
In the Julia script, a snakemake object is available, which can be accessed similarly to the Python case (see above), with the only difference being that you have to index from 1 instead of 0.
For technical reasons, scripts are executed in .snakemake/scripts. The original script directory is available as scriptdir in the snakemake object. A convenience method, snakemake@source(), acts as a wrapper for the normal R source() function, and can be used to source files relative to the original script directory.
An example external Python script could look like this:
def do_something(data_path, out_path, threads, myparam):
    # python code
    ...

do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])
You can use the Python debugger from within the script if you invoke Snakemake with --debug.
An equivalent script written in R would look like this:
do_something <- function(data_path, out_path, threads, myparam) {
# R code
}
do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])
To debug R scripts, you can save the workspace with save.image() and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.
An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/report.html",
script:
"path/to/report.Rmd"
In the R Markdown file you can insert output from an R command and access variables stored in the S4 object named snakemake:
---
title: "Test Report"
author:
- "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
rmd: "report.Rmd"
output:
html_document:
highlight: tango
number_sections: no
theme: default
toc: yes
toc_depth: 3
toc_float:
collapsed: no
smooth_scroll: yes
---
## R Markdown
This is an R Markdown document.
Test include from snakemake `r snakemake@input`.
## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>
A link to the R Markdown document with the snakemake object can be inserted. For this, a variable called rmd needs to be added to the params section in the header of the report.Rmd file. The generated R Markdown file with the snakemake object will be saved in the file specified in this rmd variable. This file can be embedded into the HTML document using base64 encoding, and a link can be inserted as shown in the example above.
Other input and output files can also be embedded in this way to make a portable report. Note that the above method with a data URI only works for small files. An experimental technology to embed larger files is the Javascript Blob object.
Jupyter notebook integration¶
Instead of plain scripts (see above), one can integrate Jupyter Notebooks. This enables the interactive development of data analysis components (e.g. for plotting). Integration works as follows (note the use of notebook: instead of script:):
rule hello:
output:
"test.txt"
log:
# optional path to the processed notebook
notebook="logs/notebooks/processed_notebook.ipynb"
notebook:
"notebooks/hello.py.ipynb"
It is recommended to prefix the .ipynb suffix with either .py or .r to indicate the notebook language.
In the notebook, a snakemake object is available, which can be accessed in the same way as with the script integration. In other words, you have access to input files via snakemake.input (in the Python case) and snakemake@input (in the R case), etc.
Optionally, it is possible to automatically store the processed notebook. This can be achieved by adding a named logfile notebook=... to the log directive.
Note
It is possible to refer to wildcards and params in the notebook path, e.g. by specifying "notebook/{params.name}.py" or "notebook/{wildcards.name}.py".
In order to simplify the coding of notebooks given the automatically inserted snakemake object, Snakemake provides an interactive edit mode for notebook rules. Let us assume you have written the above rule, but the notebook does not yet exist. By running
snakemake --cores 1 --edit-notebook test.txt
you instruct Snakemake to allow interactive editing of the notebook needed to create the file test.txt.
Snakemake will run all dependencies of the notebook rule, such that all input files are present.
Then, it will start a jupyter notebook server with an empty draft of the notebook, in which you can interactively program everything needed for this particular step.
Once done, you should save the notebook from the Jupyter web interface, go to the Jupyter dashboard and hit the Quit button at the top right in order to shut down the Jupyter server. Snakemake will detect that the server is closed and automatically store the drafted notebook into the path given in the rule (here hello.py.ipynb).
If the notebook already exists, the above procedure can be used to easily modify it.
Note that Snakemake requires local execution for the notebook edit mode.
On a cluster or the cloud, you can generate all dependencies of the notebook rule via
snakemake --cluster ... --jobs 100 --until test.txt
Then, the notebook rule can easily be executed locally.
Protected and Temporary Files¶
A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected:
rule NAME:
input:
"path/to/inputfile"
output:
protected("path/to/outputfile")
shell:
"somecommand {input} {output}"
A protected file will be write-protected after the rule that produces it is completed.
Further, an output file marked as temp is deleted after all rules that use it as an input are completed:
rule NAME:
input:
"path/to/inputfile"
output:
temp("path/to/outputfile")
shell:
"somecommand {input} {output}"
Directories as outputs¶
Sometimes it can be convenient to have directories, rather than files, as outputs of a rule. As of version 5.2.0, directories as outputs have to be explicitly marked with directory. This is primarily for safety reasons; since all outputs are deleted before a job is executed, we don’t want to risk deleting important directories if the user makes some mistake. Marking the output as directory makes the intent clear, and the output can be safely removed. Another reason comes down to how modification time for directories works. The modification time on a directory changes when a file or a subdirectory is added, removed or renamed. This can easily happen in not-quite-intended ways, such as when Apple macOS or MS Windows add .DS_Store or thumbs.db files to store parameters for how the directory contents should be displayed. When the directory flag is used, a hidden file called .snakemake_timestamp is created in the output directory, and its modification time is used when determining whether the rule output is up to date or whether it needs to be rerun. Always consider whether you can formulate your workflow using normal files before resorting to directory().
rule NAME:
input:
"path/to/inputfile"
output:
directory("path/to/outputdir")
shell:
"somecommand {input} {output}"
Ignoring timestamps¶
For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file. This behavior can be overridden by marking an input file as ancient. The timestamp of such files is ignored and always assumed to be older than any of the output files:
rule NAME:
input:
ancient("path/to/inputfile")
output:
"path/to/outputfile"
shell:
"somecommand {input} {output}"
Here, this means that the file path/to/outputfile will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future. Note that any flag that forces re-creation of files still also applies to files marked as ancient.
Shadow rules¶
Shadow rules cause each execution of the rule to be run in an isolated temporary directory. This “shadow” directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don’t want to manually clean up in your Snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or to simplify your workflow by not having to worry about unique filenames for all outputs of all rules.
By setting shadow: "shallow", the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting shadow: "full" fully shadows the entire subdirectory structure of the current workdir. The setting shadow: "minimal" only symlinks the inputs to the rule. Once the rule successfully executes, the output file will be moved, if necessary, to the real path as indicated by output.
Typically, you will not need to modify your rule for compatibility with shadow, unless you reference parent directories relative to your workdir in a rule.
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
shadow: "shallow"
shell: "somecommand --other_outputs other.txt {input} {output}"
Shadow directories are stored one per rule execution in .snakemake/shadow/, and are cleared on successful execution. Consider running with the --cleanup-shadow argument every now and then to remove any remaining shadow directories from aborted jobs. The base shadow directory can be changed with the --shadow-prefix command line argument.
Flag files¶
Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by “touching” empty files that denote that a certain task was completed. Snakemake supports this via the touch flag:
rule all:
input: "mytask.done"
rule mytask:
output: touch("mytask.done")
shell: "mycommand ..."
With the touch flag, Snakemake touches (i.e. creates or updates) the file mytask.done after mycommand has finished successfully.
Job Properties¶
When executing a workflow on a cluster using the --cluster parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params, etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:
The following shows an example job submission wrapper:
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties["threads"]
# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
Dynamic Files¶
Snakemake provides experimental support for dynamic files. Dynamic files can be used whenever one has a rule for which the number of output files is unknown before the rule was executed. This is useful for example with certain clustering algorithms:
rule cluster:
input: "afile.csv"
output: dynamic("{clusterid}.cluster.csv")
run: ...
Now the results of the rule can be used in Snakemake although it does not know how many files will be present before executing the rule cluster, e.g. by:
rule all:
input: dynamic("{clusterid}.cluster.plot.pdf")
rule plot:
input: "{clusterid}.cluster.csv"
output: "{clusterid}.cluster.plot.pdf"
run: ...
Here, Snakemake determines the input files for the rule all after the rule cluster was executed, and then dynamically inserts jobs of the rule plot into the DAG to create the desired plots.
Functions as Input Files¶
Instead of specifying strings or lists of strings as input files, Snakemake can also make use of functions that return single input files or lists of input files:
def myfunc(wildcards):
    return [... a list of input files depending on given wildcards ...]

rule:
    input: myfunc
    output: "someoutput.{somewildcard}.txt"
    shell: "..."
The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files. Note that you can also use lambda expressions instead of full function definitions. By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.
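A sketch of that last idea (the dictionary, file paths and wildcard name here are made up for illustration):
# hypothetical mapping from wildcard values to files scattered across the filesystem
SAMPLE_TO_PATH = {
    "a": "/data/project1/a.fastq",
    "b": "/archive/old_runs/b.fastq",
}

rule:
    input: lambda wildcards: SAMPLE_TO_PATH[wildcards.sample]
    output: "results/{sample}.txt"
    shell: "somecommand {input} > {output}"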
Note that the function will be executed when the rule is evaluated and before the workflow actually starts to execute. Further note that using a function as input overrides the default mechanism of replacing wildcards with their values inferred from the output files. You have to take care of that yourself with the given wildcards object.
Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have. In particular, input files should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.
For a practical example, see the Snakemake Tutorial (Step 3: Input functions).
Input Functions and unpack()¶
In some cases, you might want to have your input functions return named input files. This can be done by having them return dict() objects, with the names as the dict keys and the file names as the dict values, and by using the unpack() keyword.
def myfunc(wildcards):
    return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}

rule:
    input: unpack(myfunc)
    output: "someoutput.{token}.txt"
    shell: "..."
Note that unpack() is only necessary for input functions returning dict. While it also works for list, remember that lists (and nested lists) of strings are automatically flattened.
Also note that if you do not pass a function into the input list but directly call a function, then you should not use unpack(). Here, you can simply use Python’s double-star (**) operator for unpacking the parameters. Note that, as Snakefiles are translated into Python for execution, the same rules as for using the star and double-star unpacking Python operators apply. These restrictions do not apply when using unpack().
def myfunc1():
    return ['foo.txt']

def myfunc2():
    return {'foo': 'nowildcards.txt'}

rule:
    input:
        *myfunc1(),
        **myfunc2(),
    output: "..."
    shell: "..."
Version Tracking¶
Rules can specify a version that is tracked by Snakemake together with the output files. When the version changes, Snakemake informs you when using the flag --summary or --list-version-changes.
The version can be specified by the version directive, which takes a string:
rule:
input: ...
output: ...
version: "1.0"
shell: ...
The version can of course also be filled with the output of a shell command, e.g.:
import subprocess
SOMECOMMAND_VERSION = subprocess.check_output("somecommand --version", shell=True)
rule:
version: SOMECOMMAND_VERSION
Alternatively, you might want to use file modification times in case of local scripts:
import os
SOMECOMMAND_VERSION = str(os.path.getmtime("path/to/somescript"))
rule:
version: SOMECOMMAND_VERSION
A re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-version-changes`
With the availability of the conda directive (see Integrated Package Management), the version directive has become obsolete in favor of defining isolated software environments that can be automatically deployed via the conda package manager.
Code Tracking¶
Snakemake tracks the code that was used to create your files.
In combination with --summary or --list-code-changes, this can be used to see which files may need a re-run because the implementation changed. A re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-code-changes`
Onstart, onsuccess and onerror handlers¶
Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user).
With Snakemake 3.2.1, this is possible via the onsuccess and onerror keywords:
onsuccess:
    print("Workflow finished, no error")

onerror:
    print("An error occurred")
    shell("mail -s 'an error occurred' youremail@provider.com < {log}")
The onsuccess handler is executed if the workflow finished without error. Otherwise, the onerror handler is executed. In both handlers, you have access to the variable log, which contains the path to a logfile with the complete Snakemake output.
Snakemake 3.6.0 adds an onstart handler, which will be executed before the workflow starts.
Note that dry-runs do not trigger any of the handlers.
Rule dependencies¶
From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:
rule a:
input: "path/to/input"
output: "path/to/output"
shell: ...
rule b:
input: rules.a.output
output: "path/to/output/of/b"
shell: ...
Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows us to resolve dependencies that are ambiguous when using filenames.
Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:
rule a:
input: "path/to/input"
output: a = "path/to/output", b = "path/to/output2"
shell: ...
rule b:
input: rules.a.output.a
output: "path/to/output/of/b"
shell: ...
Handling Ambiguous Rules¶
When two rules can produce the same output file, Snakemake cannot decide which one to use without additional guidance. Hence an AmbiguousRuleException is thrown.
Note: ruleorder is not intended to bring rules into the correct execution order (this is solely guided by the names of input and output files you use); it only helps Snakemake decide which rule to use when multiple ones can create the same output file!
To deal with such ambiguity, provide a ruleorder for the conflicting rules, e.g.
ruleorder: rule1 > rule2 > rule3
Here, rule1 is preferred over rule2 and rule3, and rule2 is preferred over rule3. Only if rule1 and rule2 cannot be applied (e.g. due to missing input files) is rule3 used to produce the desired output file.
Alternatively, rule dependencies (see above) can also resolve ambiguities.
Another (quick and dirty) possibility is to tell Snakemake to allow ambiguity via a command line option:
$ snakemake --allow-ambiguity
such that, similar to GNU Make, the first matching rule is always used. In this case, a warning that summarizes the decision of Snakemake is printed to the terminal.
Local Rules¶
When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target rule like all, see Targets and aggregation). The keyword localrules allows you to mark a rule as local, so that it is not submitted to the cluster and instead executed on the host node:
localrules: all, foo
rule all:
input: ...
rule foo:
...
rule bar:
...
Here, only jobs from the rule bar will be submitted to the cluster, whereas all and foo will be run locally.
Note that you can use the localrules directive multiple times. The result will be the union of all declarations.
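For example, using the rules from above, the following two declarations are equivalent to the single declaration localrules: all, foo:
localrules: all
localrules: foo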
Benchmark Rules¶
Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule
rule benchmark_command:
input:
"path/to/input.{sample}.txt"
output:
"path/to/output.{sample}.txt"
benchmark:
"benchmarks/somecommand/{sample}.tsv"
shell:
"somecommand {input} {output}"
benchmarks the CPU and wall clock time of the command somecommand for the given output and input files. For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark tsv file (which will contain a tab-separated table of run times and memory usage in MiB). Per default, Snakemake executes the job once, generating one run time. However, the benchmark file can be annotated with the desired number of repeats, e.g.,
rule benchmark_command:
input:
"path/to/input.{sample}.txt"
output:
"path/to/output.{sample}.txt"
benchmark:
repeat("benchmarks/somecommand/{sample}.tsv", 3)
shell:
"somecommand {input} {output}"
will instruct Snakemake to run each job of this rule three times and store all measurements in the benchmark file. The resulting tsv file can be used as input for other rules, just like any other output file.
Note
Note that benchmarking is only possible in a reliable fashion for subprocesses (thus for tasks run through the shell, script, and wrapper directives).
In the run block, the variable bench_record is available, which you can pass to shell() as bench_record=bench_record. When using shell(..., bench_record=bench_record), the maximum of all measurements across all shell() calls will be used, while the recorded running time is that of the entire rule execution, including any Python code.
Defining scatter-gather processes¶
Via Snakemake’s powerful and arbitrary Python-based aggregation abilities (via the expand function and arbitrary Python code, see here), scatter-gather workflows are well supported. Nevertheless, it can sometimes be handy to use Snakemake’s specific scatter-gather support, which avoids boilerplate and offers additional configuration options.
Scatter-gather processes can be defined via a global scattergather directive:
scattergather:
split=8
Each process thereby defines a name (here e.g. split) and a default number of scatter items. Then, scattering and gathering can be implemented by using the globally available scatter and gather objects:
rule all:
input:
"gathered/all.txt"
rule split:
output:
scatter.split("splitted/{scatteritem}.txt")
shell:
"touch {output}"
rule intermediate:
input:
"splitted/{scatteritem}.txt"
output:
"splitted/{scatteritem}.post.txt"
shell:
"cp {input} {output}"
rule gather:
input:
gather.split("splitted/{scatteritem}.post.txt")
output:
"gathered/all.txt"
shell:
"cat {input} > {output}"
Thereby, scatter.split("splitted/{scatteritem}.txt") yields a list of paths "splitted/0.txt", "splitted/1.txt", …, depending on the number of scatter items defined. Analogously, gather.split("splitted/{scatteritem}.post.txt") yields a list of paths "splitted/0.post.txt", "splitted/1.post.txt", …, which request the application of the rule intermediate to each scatter item.
The default number of scatter items can be overwritten via the command line interface. For example
snakemake --set-scatter split=2
would set the number of scatter items for the split process defined above to 2 instead of 8. This allows parallelization to be adapted to the needs of the underlying computing platform and the analysis at hand.
Defining groups for execution¶
From Snakemake 5.0 on, it is possible to assign rules to groups. Such groups will be executed together in cluster or cloud mode, as a so-called group job, i.e., all jobs of a particular group will be submitted at once, to the same computing node. When executing locally, group definitions are ignored.
Groups can be defined via the group keyword. This way, queueing and execution time can be saved, in particular if one or several short-running rules are involved.
samples = [1,2,3,4,5]
rule all:
input:
"test.out"
rule a:
output:
"a/{sample}.out"
group: "mygroup"
shell:
"touch {output}"
rule b:
input:
"a/{sample}.out"
output:
"b/{sample}.out"
group: "mygroup"
shell:
"touch {output}"
rule c:
input:
expand("b/{sample}.out", sample=samples)
output:
"test.out"
shell:
"touch {output}"
Here, jobs from rules a and b end up in one group mygroup, whereas jobs from rule c are executed separately.
Note that Snakemake always determines a connected subgraph with the same group id to be a group job. Here, this means that, e.g., the jobs creating a/1.out and b/1.out will be in one group, and the jobs creating a/2.out and b/2.out will be in a separate group. However, if we added group: "mygroup" to rule c, all jobs would end up in a single group, including the one spawned from rule c, because c connects all the other jobs.
Alternatively, groups can be defined via the command line interface. This makes it possible to partition the DAG almost arbitrarily, e.g. in order to save network traffic; see here.
Piped output¶
From Snakemake 5.0 on, it is possible to mark output files as pipes, via the pipe flag, e.g.:
rule all:
input:
expand("test.{i}.out", i=range(2))
rule a:
output:
pipe("test.{i}.txt")
shell:
"for i in {{0..2}}; do echo {wildcards.i} >> {output}; done"
rule b:
input:
"test.{i}.txt"
output:
"test.{i}.out"
shell:
"grep {wildcards.i} < {input} > {output}"
If an output file is marked as a pipe, then Snakemake will first create a named pipe with the given name and then execute the creating job simultaneously with the consuming job, inside a group job (see above). This works in all execution modes: local, cluster, and cloud. Naturally, a pipe output may only have a single consumer. It is possible to combine explicit group definitions as above with pipe outputs. Thereby, pipe jobs can live within, or (automatically) extend, existing groups. However, the two jobs connected by a pipe may not exist in conflicting groups.
Parameter space exploration¶
The basic Snakemake functionality already provides everything needed to handle parameter spaces in any way (sub-spacing for certain rules, even depending on wildcard values, the ability to read or generate spaces on the fly or from files via pandas, etc.). However, it usually requires some boilerplate code for translating a parameter space into wildcard patterns and for translating it back into concrete parameters for scripts and commands. From Snakemake 5.31 on (inspired by JUDI), this is solved via the Paramspace helper, which can be used as follows:
from snakemake.utils import Paramspace
import pandas as pd
# declare a dataframe to be a paramspace
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))
rule all:
input:
# Aggregate over entire parameter space (or a subset thereof if needed)
# of course, something like this can happen anywhere in the workflow (not
# only at the end).
expand("results/plots/{params}.pdf", params=paramspace.instance_patterns)
rule simulate:
output:
# format a wildcard pattern like "alpha~{alpha}/beta~{beta}/gamma~{gamma}"
# into a file path, with alpha, beta, gamma being the columns of the data frame
f"results/simulations/{paramspace.wildcard_pattern}.tsv"
params:
# automatically translate the wildcard values into an instance of the param space
# in the form of a dict (here: {"alpha": ..., "beta": ..., "gamma": ...})
simulation=paramspace.instance
script:
"scripts/simulate.py"
rule plot:
input:
f"results/simulations/{paramspace.wildcard_pattern}.tsv"
output:
f"results/plots/{paramspace.wildcard_pattern}.pdf"
shell:
"touch {output}"
Given that params.tsv contains:
alpha beta gamma
1.0 0.1 0.99
2.0 0.0 3.9
This workflow will run as follows:
[Fri Nov 27 20:57:27 2020]
rule simulate:
output: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
jobid: 4
wildcards: alpha=2.0, beta=0.0, gamma=3.9
[Fri Nov 27 20:57:27 2020]
rule simulate:
output: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
jobid: 2
wildcards: alpha=1.0, beta=0.1, gamma=0.99
[Fri Nov 27 20:57:27 2020]
rule plot:
input: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
output: results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
jobid: 3
wildcards: alpha=2.0, beta=0.0, gamma=3.9
[Fri Nov 27 20:57:27 2020]
rule plot:
input: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
output: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf
jobid: 1
wildcards: alpha=1.0, beta=0.1, gamma=0.99
[Fri Nov 27 20:57:27 2020]
localrule all:
input: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf, results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
jobid: 0
Naturally, it is possible to create sub-spaces from Paramspace objects, simply by applying all the usual methods and attributes that Pandas data frames provide (e.g. .loc[...], .filter(), etc.).
Further, the form of the created wildcard_pattern can be controlled via additional arguments of the Paramspace constructor (see Additional utils).
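One way to sketch such a sub-space, using the params.tsv file and the alpha column from the example above, is to filter the underlying data frame before wrapping it:
import pandas as pd
from snakemake.utils import Paramspace

df = pd.read_csv("params.tsv", sep="\t")
# hypothetical sub-space: restrict to parameter combinations with alpha == 1.0
subspace = Paramspace(df[df["alpha"] == 1.0])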
Data-dependent conditional execution¶
From Snakemake 5.4 on, conditional re-evaluation of the DAG of jobs based on the content of outputs is possible. The key idea is that rules can be declared as checkpoints, e.g.,
checkpoint somestep:
input:
"samples/{sample}.txt"
output:
"somestep/{sample}.txt"
shell:
"somecommand {input} > {output}"
Snakemake allows the DAG to be re-evaluated after the successful execution of every job spawned from a checkpoint. For this, every checkpoint is registered by its name in a globally available checkpoints object. The checkpoints object can be accessed by input functions. Assuming that the checkpoint is named somestep as above, the output files for a particular job can be retrieved with
checkpoints.somestep.get(sample="a").output
Thereby, the get method throws snakemake.exceptions.IncompleteCheckpointException if the checkpoint has not yet been executed for these particular wildcard value(s). Inside an input function, the exception is automatically handled by Snakemake and leads to a re-evaluation after the checkpoint has been successfully passed.
To illustrate the possibilities of this mechanism, consider the following complete example:
# a target rule to define the desired final output
rule all:
input:
"aggregated/a.txt",
"aggregated/b.txt"
# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint somestep:
input:
"samples/{sample}.txt"
output:
"somestep/{sample}.txt"
shell:
# simulate some output value
"echo {wildcards.sample} > somestep/{wildcards.sample}.txt"
# intermediate rule
rule intermediate:
input:
"somestep/{sample}.txt"
output:
"post/{sample}.txt"
shell:
"touch {output}"
# alternative intermediate rule
rule alt_intermediate:
input:
"somestep/{sample}.txt"
output:
"alt/{sample}.txt"
shell:
"touch {output}"
# input function for the rule aggregate
def aggregate_input(wildcards):
    # decision based on content of output file
    # Important: use the method open() of the returned file!
    # This way, Snakemake is able to automatically download the file if it is generated in
    # a cloud environment without a shared filesystem.
    with checkpoints.somestep.get(sample=wildcards.sample).output[0].open() as f:
        if f.read().strip() == "a":
            return "post/{sample}.txt"
        else:
            return "alt/{sample}.txt"
rule aggregate:
input:
aggregate_input
output:
"aggregated/{sample}.txt"
shell:
"touch {output}"
As can be seen, the rule aggregate uses an input function. Inside the function, we first retrieve the output files of the checkpoint somestep with the wildcards, passing through the value of the wildcard sample. Upon execution, if the checkpoint is not yet complete, Snakemake will record somestep as a direct dependency of the rule aggregate. Once somestep has finished for a given sample, the input function will automatically be re-evaluated and the method get will no longer raise an exception. Instead, the output file will be opened, and depending on its contents either "post/{sample}.txt" or "alt/{sample}.txt" will be returned by the input function. This way, the DAG becomes conditional on some produced data.
It is also possible to use checkpoints for cases where the output files are unknown before execution. A typical example is a clustering process with an unknown number of clusters, where each cluster shall be saved into a separate file. Consider the following example:
# a target rule to define the desired final output
rule all:
input:
"aggregated/a.txt",
"aggregated/b.txt"
# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint clustering:
input:
"samples/{sample}.txt"
output:
clusters=directory("clustering/{sample}")
shell:
"mkdir clustering/{wildcards.sample}; "
"for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"
# an intermediate rule
rule intermediate:
input:
"clustering/{sample}/{i}.txt"
output:
"post/{sample}/{i}.txt"
shell:
"cp {input} {output}"
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    return expand("post/{sample}/{i}.txt",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)
# an aggregation over all produced clusters
rule aggregate:
input:
aggregate_input
output:
"aggregated/{sample}.txt"
shell:
"cat {input} > {output}"
Here, our checkpoint simulates a clustering. We pretend that the number of clusters is unknown beforehand. Hence, the checkpoint only defines an output directory. The rule aggregate again uses the checkpoints object to retrieve the output of the checkpoint. This time, instead of explicitly writing
checkpoints.clustering.get(sample=wildcards.sample).output[0]
we use the shorthand
checkpoints.clustering.get(**wildcards).output[0]
which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking). If the checkpoint has not yet been executed, accessing checkpoints.clustering.get(**wildcards) ensures that Snakemake records the checkpoint as a direct dependency of the rule aggregate. Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed. Here, we retrieve the values of the wildcard i based on all files named {i}.txt in the output directory of the checkpoint. These values are then used to expand the pattern "post/{sample}/{i}.txt", such that the rule intermediate is executed for each of the determined clusters.
This mechanism can be used to replace the use of the dynamic flag, which will be deprecated in Snakemake 6.0.