# Rules

Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    shell: "somecommand {input} {output}"


Inside the shell command, all local and global variables, especially input and output files, can be accessed via their names in the Python format minilanguage. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile). From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") will let Snakemake quote each of the list or tuple elements that contains whitespace.

Instead of a shell command, a rule can run some Python code to generate the output:

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", somename = "path/to/another/outputfile"
    run:
        for f in input:
            ...
        with open(output[0], "w") as out:
            out.write(...)
        with open(output.somename, "w") as out:
            out.write(...)


As can be seen, instead of accessing input and output as a whole, we can also access by index (output[0]) or by keyword (output.somename). Note that, when adding keywords or names for input or output files, their order won’t be preserved when accessing them as a whole via e.g. {output} in a shell command.
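
The access patterns described so far (as a whole, by index, by keyword, and with the :q formatting instruction) can be imitated in plain Python. The following is a minimal sketch, not Snakemake's actual implementation; the class name NamedFiles and its constructor signature are made up for illustration, and it assumes that quoting behaves like shlex.quote.

```python
import shlex


class NamedFiles(list):
    """Hypothetical stand-in for the input/output objects of a rule."""

    def __init__(self, *files, **named_files):
        # Positional files come first, keyword files are appended after them.
        super().__init__(list(files) + list(named_files.values()))
        self._names = dict(named_files)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, e.g. output.somename.
        names = self.__dict__.get("_names", {})
        if name in names:
            return names[name]
        raise AttributeError(name)

    def __format__(self, spec):
        # "{output}" joins with spaces; "{output:q}" additionally shell-quotes.
        if spec == "q":
            return " ".join(shlex.quote(str(f)) for f in self)
        return " ".join(str(f) for f in self)


output = NamedFiles("path/to/outputfile", somename="path/to/another output")
command = "somecommand {output:q}".format(output=output)
print(command)
```

Here output[0] and output.somename return single paths, while formatting the whole object yields all paths separated by spaces. Only the path containing a space ends up quoted in the command.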

Shell commands like the one above can also be invoked inside a Python-based rule via the function shell, which takes a string with the command and allows the same formatting as in the rule above, e.g.:

shell("somecommand {output.somename}")


Further, this combination of Python and shell commands makes it possible to iterate over the output of the shell command, e.g.:

for line in shell("somecommand {output.somename}", iterable=True):
    ... # do something in python
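
Outside of a Snakefile, the same idea of consuming a command's output line by line can be approximated with the standard library. This is only a sketch of the pattern, not of how shell(..., iterable=True) is implemented; the helper name iter_command_lines is hypothetical, and stripping the trailing newline is a choice of this sketch.

```python
import subprocess
import sys


def iter_command_lines(args):
    """Yield the stdout lines of a command one by one, raising on failure."""
    with subprocess.Popen(args, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, args)


# The Python interpreter itself serves as a portable example command.
child = [sys.executable, "-c", "print('first'); print('second')"]
lines = list(iter_command_lines(child))
```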


Note that shell commands in Snakemake use the bash shell in strict mode by default.

## Wildcards

It is often useful to generalize a rule so that it applies to a number of items, e.g. datasets. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    shell:
        "somecommand --group {wildcards.group} < {input} > {output}"


Here, we define two wildcards, dataset and group. By this, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. the wildcards are replaced by the regular expression .+. If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, that is here also used in the shell command. The wildcards object can be accessed in the same way as input and output, which is described above.

For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A. Thus, it requests the file 101/inputfile as input and executes the command somecommand --group A < 101/inputfile > 101/file.A.txt. Of course, the input file might have to be generated by another rule with different wildcards.
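
The resolution step described here can be sketched with Python's re module: each wildcard becomes a named group matching .+, and the matched values are then substituted into the input pattern. This is a simplified illustration under the assumption that every wildcard occurs only once in the output pattern; the function names are hypothetical and this is not Snakemake's own code.

```python
import re


def output_pattern_to_regex(pattern):
    """Turn "{name}" placeholders into named groups that match ".+"."""
    regex, last = "", 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[last:m.start()])  # literal text in between
        regex += "(?P<{}>.+)".format(m.group(1))
        last = m.end()
    regex += re.escape(pattern[last:])
    return re.compile(regex)


def resolve(requested_file, output_pattern, input_pattern):
    """Return (wildcards, input file) if the rule can produce the file."""
    match = output_pattern_to_regex(output_pattern).fullmatch(requested_file)
    if match is None:
        return None
    wildcards = match.groupdict()
    # Propagate the matched substrings to the input file pattern.
    return wildcards, input_pattern.format(**wildcards)


wildcards, input_file = resolve(
    "101/file.A.txt", "{dataset}/file.{group}.txt", "{dataset}/inputfile"
)
```

For the requested file 101/file.A.txt this yields dataset=101 and group=A, and the input file 101/inputfile, matching the walk-through in the text.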

Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.

Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only using \d+ as the corresponding regular expression. With Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:

output: "{dataset,\d+}.{group}.txt"


Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints:

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    wildcard_constraints:
        dataset="\d+"
    shell:
        "somecommand --group {wildcards.group}  < {input}  > {output}"


Finally, you can also define global wildcard constraints that apply for all rules:

wildcard_constraints:
    dataset="\d+"

rule a:
    ...

rule b:
    ...


See the Python documentation on regular expressions for detailed information on regular expression syntax.
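
To see why a constraint removes the ambiguity from the 101.B.normal.txt example, the pattern can be translated to a regular expression by hand. The sketch below is an illustration only: compile_pattern is a hypothetical helper, the precedence (inline constraint first, then the constraints dictionary, then .+) is an assumption of this sketch, and constraints containing braces are not supported.

```python
import re


def compile_pattern(pattern, constraints=None):
    """Translate "{name}" and "{name,regex}" placeholders into a regex."""
    constraints = constraints or {}
    regex, last = "", 0
    for m in re.finditer(r"\{(\w+)(?:,([^{}]+))?\}", pattern):
        name, inline = m.group(1), m.group(2)
        # Inline constraint wins, then the constraints dict, then ".+".
        wildcard_regex = inline or constraints.get(name, ".+")
        regex += re.escape(pattern[last:m.start()])
        regex += "(?P<{}>{})".format(name, wildcard_regex)
        last = m.end()
    regex += re.escape(pattern[last:])
    return re.compile(regex)


filename = "101.B.normal.txt"
unconstrained = compile_pattern("{dataset}.{group}.txt").fullmatch(filename)
inline = compile_pattern(r"{dataset,\d+}.{group}.txt").fullmatch(filename)
by_keyword = compile_pattern(
    "{dataset}.{group}.txt", constraints={"dataset": r"\d+"}
).fullmatch(filename)
```

Without a constraint, the greedy .+ assigns dataset=101.B and group=normal. With dataset restricted to \d+, the only possible match is dataset=101 and group=B.normal.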

## Targets

By default, Snakemake executes the first rule in the Snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build targets, similar to GNU Make:

rule all:
    input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]


Here, for each dataset in a python list DATASETS defined before, the file {dataset}/file.A.txt is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion shown above.

The above expression can be simplified to the following:

rule all:
    input: expand("{dataset}/file.A.txt", dataset=DATASETS)


The expand function also allows combining different variables, e.g.

rule all:
    input: expand("{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS)


If now PLOTFORMATS=["pdf", "png"] contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.

Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.

expand(["{dataset}/plot1.{ext}", "{dataset}/plot2.{ext}"], dataset=DATASETS, ext=PLOTFORMATS)


leads to


["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.pdf", "ds2/plot2.pdf", "ds1/plot1.png", "ds1/plot2.png", "ds2/plot1.png", "ds2/plot2.png"]


By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. zip:

expand("{dataset}/plot1.{ext} {dataset}/plot2.{ext}".split(), zip, dataset=DATASETS, ext=PLOTFORMATS)


leads to


["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.png", "ds2/plot2.png"]


You can also mask a wildcard expression in expand such that it will be kept, e.g.

expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)


will create strings with all values for ext but starting with "{dataset}".
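
The behaviour of expand can be approximated in a few lines of plain Python, which may help when reasoning about which file names a target rule requests. The function expand_sketch below is a hypothetical re-implementation for illustration, not the Snakemake function itself; in particular, the order of the returned strings is a property of this sketch and need not match Snakemake's.

```python
from itertools import product


def expand_sketch(patterns, combinator=product, **wildcards):
    """Fill each pattern with every combination of the wildcard values."""
    if isinstance(patterns, str):
        patterns = [patterns]
    names = list(wildcards)
    # A single string value is treated as a one-element list.
    values = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combination)))
        for pattern in patterns
        for combination in combinator(*values)
    ]


DATASETS = ["ds1", "ds2"]
PLOTFORMATS = ["pdf", "png"]

all_combinations = expand_sketch(
    "{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS
)
pairwise = expand_sketch(
    "{dataset}/file.A.{ext}", zip, dataset=DATASETS, ext=PLOTFORMATS
)
masked = expand_sketch("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)
```

With product, every dataset is combined with every extension (four strings). With zip, the values are paired position by position (two strings). The doubled braces survive formatting as a literal {dataset} wildcard.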

Further, a rule can be given a number of threads to use, i.e.

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    shell: "somecommand --threads {threads} {input} {output}"


Snakemake can alter the number of cores available based on command line options. Therefore, it is useful to propagate it via the built-in variable threads rather than hardcoding it into the shell command. In particular, the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores), with cores being the number of cores specified at the command line (option --cores). On a cluster node, Snakemake uses as many cores as are available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by --local-cores, which only applies to jobs running on the master node.

Starting from version 3.7, threads can also be a callable that returns an int value.
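
The adjustment threads = min(threads, cores) can be written out as a small function. This is a sketch of the stated formula only; the name effective_threads is hypothetical, and passing a wildcards dictionary to a callable threads value is an assumption made for this example.

```python
def effective_threads(threads, cores, wildcards=None):
    """Return the number of threads a job would get given the core limit."""
    # threads may be a plain int or a callable returning an int.
    requested = threads(wildcards or {}) if callable(threads) else threads
    # The declared value is a maximum; it is capped by the available cores.
    return min(int(requested), cores)


capped = effective_threads(8, cores=4)
unchanged = effective_threads(8, cores=16)
from_callable = effective_threads(
    lambda wc: 2 if wc.get("group") == "A" else 6, cores=4, wildcards={"group": "B"}
)
```

A rule declaring 8 threads thus runs with 4 threads when only 4 cores are provided, and with 8 threads when 16 cores are provided.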

## Resources

In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:

rule:
    input:     ...
    output:    ...
    resources: gpu=1
    shell: "..."


If limits for the resources are given via the command line, e.g.



## Code Tracking

Snakemake tracks the code that was used to create your files. In combination with --summary or --list-code-changes this can be used to see what files may need a re-run because the implementation changed. Re-run can be automated by invoking Snakemake as follows:

$ snakemake -R `snakemake --list-code-changes`


## Onstart, onsuccess and onerror handlers

Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user). With Snakemake 3.2.1, this is possible via the onsuccess and onerror keywords:

onsuccess:
    print("Workflow finished, no error")

onerror:
    print("An error occurred")
    shell("mail -s 'an error occurred' youremail@provider.com < {log}")


The onsuccess handler is executed if the workflow finished without error. Otherwise, the onerror handler is executed. In both handlers, you have access to the variable log, which contains the path to a logfile with the complete Snakemake output. Snakemake 3.6.0 adds an onstart handler that will be executed before the workflow starts. Note that dry-runs do not trigger any of the handlers.

## Rule dependencies

From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

rule a:
    input:  "path/to/input"
    output: "path/to/output"
    shell:  ...

rule b:
    input:  rules.a.output
    output: "path/to/output/of/b"
    shell:  ...


Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also makes it possible to resolve dependencies that are ambiguous when using filenames.

## Handling Ambiguous Rules

When two rules can produce the same output file, Snakemake cannot decide by default which one to use. Hence an AmbiguousRuleException is thrown.

Note: ruleorder is not intended to bring rules into the correct execution order (this is solely guided by the names of the input and output files you use); it only helps Snakemake to decide which rule to use when multiple ones can create the same output file!

The proposed strategy to deal with such ambiguity is to provide a ruleorder for the conflicting rules, e.g.

ruleorder: rule1 > rule2 > rule3


Here, rule1 is preferred over rule2 and rule3, and rule2 is preferred over rule3. Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file.

Alternatively, rule dependencies (see above) can also resolve ambiguities.

Another (quick and dirty) possibility is to tell Snakemake to allow ambiguity via a command line option

$ snakemake --allow-ambiguity


such that, similar to GNU Make, the first matching rule is always used. In this case, a warning that summarizes Snakemake's decision is printed to the terminal.
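
The preference logic behind a ruleorder declaration can be sketched as follows. This is a simplified model for illustration, not Snakemake's resolution algorithm: choose_rule and AmbiguousRuleError are hypothetical names, and the sketch assumes the list of rules whose input files are available has already been determined.

```python
class AmbiguousRuleError(Exception):
    """Stand-in for the AmbiguousRuleException mentioned in the text."""


def choose_rule(applicable_rules, ruleorder=()):
    """Pick the most preferred rule among those able to produce a file."""
    if not applicable_rules:
        raise ValueError("no rule can produce the requested file")
    if len(applicable_rules) == 1:
        return applicable_rules[0]
    unordered = [rule for rule in applicable_rules if rule not in ruleorder]
    if unordered:
        # Several candidates and no declared preference for some of them.
        raise AmbiguousRuleError(", ".join(applicable_rules))
    # Earlier position in the ruleorder means higher preference.
    return min(applicable_rules, key=ruleorder.index)


order = ["rule1", "rule2", "rule3"]  # ruleorder: rule1 > rule2 > rule3
preferred = choose_rule(["rule3", "rule2"], order)
fallback = choose_rule(["rule3"], order)
```

If rule1 is not applicable, rule2 is chosen over rule3, and rule3 is used only when it is the sole remaining candidate.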

## Local Rules

When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target rule like all, see Targets). The keyword localrules allows marking a rule as local, so that it is not submitted to the cluster and is instead executed on the host node:

localrules: all, foo

rule all:
    input: ...

rule foo:
    ...

rule bar:
    ...


Here, only jobs from the rule bar will be submitted to the cluster, whereas all and foo will be run locally. Note that you can use the localrules directive multiple times. The result will be the union of all declarations.
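
The effect of several localrules declarations can be pictured as a set union followed by a split of the rules into local and submitted ones. The helper split_rules below is hypothetical and only models that bookkeeping.

```python
def split_rules(all_rules, *localrules_declarations):
    """Return (local rules, cluster rules), preserving the original order."""
    # Multiple declarations are merged into one set of local rule names.
    local = set().union(*localrules_declarations)
    local_rules = [rule for rule in all_rules if rule in local]
    cluster_rules = [rule for rule in all_rules if rule not in local]
    return local_rules, cluster_rules


# Equivalent to writing "localrules: all" and "localrules: foo" separately.
local_rules, cluster_rules = split_rules(["all", "foo", "bar"], ["all"], ["foo"])
```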

## Benchmark Rules

Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule

rule benchmark_command:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        "benchmarks/somecommand/{sample}.txt"
    shell:
        "somecommand {input} {output}"


benchmarks the CPU and wall clock time of the command somecommand for the given output and input files. For this, the shell or run body of the rule is executed on that data, and all run times are stored in the given benchmark txt file (which will contain a tab-separated table of run times). By default, Snakemake executes the job once, generating one run time. With snakemake --benchmark-repeats, this number can be changed, e.g. to generate timings for two or three runs. The resulting txt file can be used as input for other rules, just like any other output file.
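
The idea of repeating a command and storing its run times in a tab-separated file can be sketched with the standard library. This is an assumption-laden illustration: the function benchmark_command, the column names, and the file layout are hypothetical and do not reproduce Snakemake's benchmark format, and only wall clock time is measured here.

```python
import os
import subprocess
import sys
import tempfile
import time


def benchmark_command(args, benchmark_path, repeats=1):
    """Run a command `repeats` times and write one timing row per run."""
    with open(benchmark_path, "w") as out:
        out.write("run\twall_clock_seconds\n")  # hypothetical header
        for run in range(1, repeats + 1):
            start = time.perf_counter()
            subprocess.run(args, check=True)
            elapsed = time.perf_counter() - start
            out.write("{}\t{:.4f}\n".format(run, elapsed))


# The Python interpreter serves as a portable placeholder for somecommand.
benchmark_file = os.path.join(tempfile.mkdtemp(), "somecommand.txt")
benchmark_command([sys.executable, "-c", "pass"], benchmark_file, repeats=3)

with open(benchmark_file) as table:
    rows = [line.rstrip("\n").split("\t") for line in table]
```

The resulting table has one header row plus one row per repetition, so it can be read by a downstream step like any other tab-separated file.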