Snakefiles and Rules

A Snakemake workflow defines a data analysis in terms of rules that are specified in the Snakefile. Most commonly, rules consist of a name, input files, output files, and a shell command to generate the output from the input:

rule myrule:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile",
    output:
        "path/to/outputfile",
        "path/to/another/outputfile",
    shell:
        "somecommand {input} {output}"

However, rules can be much more complex: they may use plain Python or boilerplate-free scripting in various languages, contain wildcards, and define non-file parameters, log files, and much more (see below).

Inside the shell command, all local and global variables, in particular the input and output files, can be accessed by name using the Python format minilanguage. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile). From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") makes Snakemake quote each list or tuple element that contains whitespace.

Note

Note that any placeholders in the shell command (like {input}) are always evaluated and replaced when the corresponding job is executed, even if they occur inside a comment. To avoid evaluation and replacement, mask the braces by doubling them, i.e. {{input}}.
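The formatting behavior described above can be imitated in plain Python. The following is a minimal sketch, not Snakemake's implementation: the class FileList is a hypothetical stand-in for the objects behind {input} and {output}, and the quoting via shlex.quote is an assumption about what :q roughly does.

```python
import shlex


class FileList(list):
    """Hypothetical stand-in for the objects behind {input} and {output}."""

    def __format__(self, spec):
        # "q" mimics the :q instruction: quote elements that need quoting.
        if spec == "q":
            return " ".join(shlex.quote(str(item)) for item in self)
        # Default: a plain space-separated list.
        return " ".join(str(item) for item in self)


inputs = FileList(["path/to/inputfile", "path/with space/inputfile"])

plain = "somecommand {input}".format(input=inputs)
quoted = "somecommand {input:q}".format(input=inputs)
# Doubled braces survive formatting as literal single braces.
masked = "echo {{input}} {input}".format(input=FileList(["a.txt"]))

print(plain)
print(quoted)
print(masked)
```

Without :q, the second path would be split into two arguments by the shell, which is the situation the quoting instruction is meant to prevent.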

By default, shell commands are invoked with the bash shell in so-called strict mode (unless the workflow specifies something else, see Shell settings).

Wildcards

It is often useful to generalize a rule so that it applies to a number of items, e.g. datasets. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    shell:
        "somecommand --group {wildcards.group} < {input} > {output}"

Here, we define two wildcards, dataset and group. With these, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. each wildcard is replaced by the regular expression .+. If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, which is also used in the shell command here. The wildcards object can be accessed in the same way as input and output, as described above.

For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A. Thus, it requests file 101/inputfile as input and executes the command somecommand --group A < 101/inputfile > 101/file.A.txt. Of course, the input file might have to be generated by another rule with different wildcards.
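The matching step in this example can be sketched with Python's re module. This is an illustration under assumptions, not Snakemake's own code: pattern_to_regex is a hypothetical helper that turns every {name} into a named group matching .+ and treats the rest of the pattern literally.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Turn a pattern like "{dataset}/file.{group}.txt" into a regex.

    Hypothetical helper: each {name} becomes a named group matching .+
    unless a constraint is given; everything else is matched literally.
    """
    constraints = constraints or {}
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        name = m.group(1)
        parts.append(f"(?P<{name}>{constraints.get(name, '.+')})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


output_pattern = "{dataset}/file.{group}.txt"
match = pattern_to_regex(output_pattern).fullmatch("101/file.A.txt")
wildcards = match.groupdict()

# Propagate the matched values to the input pattern.
input_file = "{dataset}/inputfile".format(**wildcards)
print(wildcards, input_file)
```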

Importantly, a wildcard must have the same name in input and output. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output and not in the input section.

Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.

Hence, wildcards can be constrained to given regular expressions. Here, we could restrict the wildcard dataset to consist of digits only, using \d+ as the corresponding regular expression. Since Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma (you might want to use the r prefix for a raw string to avoid having to escape backslashes, particularly for more complex regular expressions):

output: r"{dataset,\d+}.{group}.txt"

Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints:

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    wildcard_constraints:
        dataset=r"\d+"
    shell:
        "somecommand --group {wildcards.group} < {input} > {output}"

Finally, you can also define global wildcard constraints that apply for all rules:

wildcard_constraints:
    dataset=r"\d+"

rule a:
    ...

rule b:
    ...

See the Python documentation on regular expressions for detailed information on regular expression syntax.
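The ambiguity of {dataset}.{group}.txt and the effect of a \d+ constraint can be reproduced directly with Python regular expressions. The sketch below only shows how Python's regex engine reads the two patterns; it makes no claim about which reading Snakemake would pick for the unconstrained case.

```python
import re

filename = "101.B.normal.txt"

# Unconstrained: both wildcards match ".+"; the regex engine settles on
# one of the two possible readings.
unconstrained = re.fullmatch(r"(?P<dataset>.+)\.(?P<group>.+)\.txt", filename)

# Constrained: dataset may only consist of digits, so only one reading remains.
constrained = re.fullmatch(r"(?P<dataset>\d+)\.(?P<group>.+)\.txt", filename)

print(unconstrained.groupdict())
print(constrained.groupdict())
```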

Aggregation

Input files can be Python lists, which makes it easy to aggregate over parameters or samples:

rule aggregate:
    input:
        ["{dataset}/a.txt".format(dataset=dataset) for dataset in DATASETS]
    output:
        "aggregated.txt"
    shell:
        ...

While the above expression can be very powerful as arbitrary Python code can be used, Snakemake offers various helper functions to simplify aggregations (see Helpers for defining rules).

Input functions

Instead of specifying strings or lists of strings as input files, Snakemake can also make use of functions that return a single input file or a list of input files:

def myfunc(wildcards):
    return [... a list of input files depending on given wildcards ...]

rule:
    input:
        myfunc
    output:
        "someoutput.{somewildcard}.txt"
    shell:
        "..."

The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files. Note that you can also use lambda expressions instead of full function definitions. This way, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. For example, you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.
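The dictionary-based approach mentioned above might look as follows. This is a minimal sketch: the mapping SAMPLE_PATHS, its paths, the wildcard name sample, and the function name get_sample_input are all hypothetical, and a SimpleNamespace stands in for the wildcards object that Snakemake would pass.

```python
from types import SimpleNamespace

# Hypothetical mapping from a wildcard value to a file location.
SAMPLE_PATHS = {
    "A": "/data/run1/sampleA.fastq",
    "B": "/archive/2019/sampleB.fastq",
}


def get_sample_input(wildcards):
    """Input function: receives the wildcards object, returns the input path."""
    return SAMPLE_PATHS[wildcards.sample]


# Stand-in for the wildcards object of a job with sample=B.
wildcards = SimpleNamespace(sample="B")
chosen_input = get_sample_input(wildcards)
print(chosen_input)
```

In a rule, the function itself (not its result) would be given in the input section, as in the example above.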

Note

Input functions can themselves return input functions again (this also holds for functions given to params and resources.) Such nested evaluation is allowed for a depth up to 10. Afterwards, an exception will be thrown.

In addition to a single wildcards argument, input functions can optionally take a groupid (with exactly that name) as second argument, see Group-local jobs for details.

Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have. In particular, input functions should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.

For a practical example, see the Tutorial: General use (Step 3: Input functions).

Input Functions and `unpack()`

In some cases, you might want to have your input functions return named input files. This can be done by having them return dict objects, with the names as keys and the file names as values, and wrapping the function in unpack().

def myfunc(wildcards):
    return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}

rule:
    input:
        unpack(myfunc)
    output:
        "someoutput.{token}.txt"
    shell:
        "..."

Note that unpack() is only necessary for input functions returning dict. While it also works for list, remember that lists (and nested lists) of strings are automatically flattened.

Also note that if you do not pass a function to the input section but call the function directly, you should not use unpack(). In that case, you can simply use Python’s double-star (**) operator to unpack the returned dict.

Note that as Snakefiles are translated into Python for execution, the same rules as for using the star and double-star unpacking Python operators apply. These restrictions do not apply when using unpack().

def myfunc1():
    return ['foo.txt']

def myfunc2():
    return {'foo': 'nowildcards.txt'}

rule:
    input:
        *myfunc1(),
        **myfunc2(),
    output:
        "..."
    shell:
        "..."

Helpers for defining rules

Snakemake provides a number of helpers that can be used to define rules and that are considerably simpler than input functions or plain Python expressions. Below, we first describe two basic helper functions for specifying aggregations and multiple output files. Afterwards, we show a set of semantic helper functions that should increase readability and simplify code (see Semantic helpers).

The expand function

Instead of specifying input files via a Python list comprehension, Snakemake offers a helper function expand().

rule aggregate:
    input:
        expand("{dataset}/a.txt", dataset=DATASETS)
    output:
        "aggregated.txt"
    shell:
        ...

Note that dataset is NOT a wildcard here because it is resolved by Snakemake due to the expand statement. The expand function also allows us to combine different variables, e.g.

rule aggregate:
    input:
        expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
    output:
        "aggregated.txt"
    shell:
        ...

If FORMATS=["txt", "csv"] contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions. Furthermore, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.

expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], dataset=DATASETS, ext=FORMATS)

leads to

["ds1/a.txt", "ds1/b.txt", "ds2/a.txt", "ds2/b.txt", "ds1/a.csv", "ds1/b.csv", "ds2/a.csv", "ds2/b.csv"]

By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. zip:

expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], zip, dataset=DATASETS, ext=FORMATS)

leads to

["ds1/a.txt", "ds1/b.txt", "ds2/a.csv", "ds2/b.csv"]

You can also mask a wildcard expression in expand such that it will be kept, e.g.

expand("{{dataset}}/a.{ext}", ext=FORMATS)

will create strings with all values for ext but starting with the wildcard "{dataset}".
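To make the role of the combinatoric function concrete, here is a minimal re-implementation sketch in plain Python. The function expand_sketch is hypothetical and simplified (no function arguments, no deferred evaluation), and the order of its results is not meant to mirror Snakemake's.

```python
from itertools import product


def expand_sketch(patterns, combinator=product, **wildcards):
    """Fill every pattern with every combination of wildcard values."""
    if isinstance(patterns, str):
        patterns = [patterns]
    names = list(wildcards)
    results = []
    for combination in combinator(*(wildcards[name] for name in names)):
        values = dict(zip(names, combination))
        for pattern in patterns:
            # str.format turns doubled braces into literal single braces,
            # which is what keeps a masked wildcard intact.
            results.append(pattern.format(**values))
    return results


DATASETS = ["ds1", "ds2"]
FORMATS = ["txt", "csv"]

all_combinations = expand_sketch("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
pairwise = expand_sketch("{dataset}/a.{ext}", zip, dataset=DATASETS, ext=FORMATS)
masked = expand_sketch("{{dataset}}/a.{ext}", ext=FORMATS)

print(all_combinations)
print(pairwise)
print(masked)
```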

Finally, argument values passed to expand can also be functions or lists of functions if the return value of expand or expand itself is used within input, or params. Depending on the context, that function has to accept the same arguments as functions for input (see Input functions) or functions for params (see Non-file parameters for rules). If that is the case, expand returns a function again, the evaluation of which is deferred to the point in time when the wildcards of the respective job are known.

The multiext function

multiext provides a simplified variant of expand that allows us to define a set of output or input files that just differ by their extension:

rule plot:
    input:
        ...
    output:
        multiext("some/plot", ".pdf", ".svg", ".png")
    shell:
        ...

The effect is the same as writing expand("some/plot{ext}", ext=[".pdf", ".svg", ".png"]), but with a simpler syntax. Moreover, defining output with multiext is the only way to use between-workflow caching for rules with multiple output files. It is also possible to get named input/output files in the following way:

rule plot:
    input:
        ...
    output:
        multiext("some/plot", out1=".pdf", out2=".svg"),
        "some_other_output",
        named_output="another_output"
    shell:
        """
        somecommand > {output.out1}
        othercommand > {output.out2}
        anothercommand > {output[2]}
        finalcommand > {output.named_output}
        """

Do note that all the multiext extensions should be named, or all of them should be unnamed (not both). Additionally, if additional input/output statements are given, multiext should be treated as positional arguments (before other named input/output files).
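The basic idea of multiext, including the rule that named and unnamed extensions must not be mixed, can be sketched in a few lines of Python. The function multiext_sketch is hypothetical; in particular, returning a dict for the named case is an assumption made for illustration and not a statement about Snakemake's return type.

```python
def multiext_sketch(prefix, *extensions, **named_extensions):
    """Append each extension to the common prefix."""
    if extensions and named_extensions:
        raise ValueError("use either unnamed or named extensions, not both")
    if named_extensions:
        return {name: prefix + ext for name, ext in named_extensions.items()}
    return [prefix + ext for ext in extensions]


unnamed = multiext_sketch("some/plot", ".pdf", ".svg", ".png")
named = multiext_sketch("some/plot", out1=".pdf", out2=".svg")

print(unnamed)
print(named)
```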

Path variables

Certain components in input and output file paths tend to recur across many rules. Via so-called pathvars, Snakemake allows you to define such components globally, make them configurable via the config file, and change them per module or even per rule. Apart from saving boilerplate code, pathvars can be used to make modules intended for reuse in multiple contexts more flexible. Pathvars can be used as generic placeholders for their actual values inside of input, output, log, and benchmark paths, using angle brackets, e.g. <results>. They behave similarly to Python string interpolation but only allow predefined placeholders, with precedence and configuration handled by Snakemake.

Pathvar usage

An example rule using pathvars is the following:

rule somerule:
    input:
        "<results>/something/{sample}.txt"
    output:
        "<results>/processed/{sample}.txt"
    shell:
        "somecommand {input} {output}"

Pathvars are resolved when the rule is parsed (before wildcard resolution and DAG construction). The values of pathvars can thereby even contain wildcards themselves.

Pathvar defaults

By default, Snakemake offers the pathvars results, stats, reports, temp, resources, logs, benchmarks. Each of them is set to its respective name (i.e. the output file "<results>/processed/{sample}.txt" will be interpreted as "results/processed/{sample}.txt").

Pathvar definition

Beyond the defaults, it is possible to define additional pathvars or customize the default definitions. This can happen in multiple ways, with the following precedence (from highest to lowest):

1. For individual rules, via the pathvars keyword.
2. For module config, via the pathvars key in a config dict explicitly passed to the module (this applies recursively to nested modules).
3. For modules, via the pathvars keyword to the module directive (this applies recursively to nested modules).
4. Globally, via the pathvars key in the config file or passed to the --config command line arguments.
5. Globally, via the pathvars keyword at the top level of the Snakefile.

Thereby, if two definitions share the same precedence, the last one wins.
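The precedence order can be pictured as a chain of mappings that is searched from the most specific to the most general definition. The following sketch is an illustration in plain Python, not Snakemake's implementation: the helper resolve_pathvars, the use of ChainMap, and the example values are assumptions, and module-level definitions are left out for brevity.

```python
import re
from collections import ChainMap

# Built-in defaults: each pathvar is set to its own name.
defaults = {
    name: name
    for name in ["results", "stats", "reports", "temp", "resources", "logs", "benchmarks"]
}
global_pathvars = {"per": "{sample}"}            # top-level pathvars keyword
config_pathvars = {"results": "example-folder"}  # pathvars key in the config
rule_pathvars = {"results": "custom-folder"}     # pathvars keyword in a rule

# ChainMap searches its mappings left to right, so the leftmost definition wins.
effective = ChainMap(rule_pathvars, config_pathvars, global_pathvars, defaults)


def resolve_pathvars(path, pathvars):
    """Replace each <name> placeholder with its value from pathvars."""
    return re.sub(r"<(\w+)>", lambda m: pathvars[m.group(1)], path)


resolved = resolve_pathvars("<results>/processed/<per>.txt", effective)
print(resolved)
```

Note that the resolved path still contains the wildcard {sample}, since pathvar values may themselves contain wildcards.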

Apart from module pathvars, the most common way is to define them globally via the pathvars keyword:

pathvars:
    per="{sample}"

Above, we define a pathvar per and set it to the value {sample}, thus defining a wildcard. Such a pattern can be helpful if you write a workflow that shall be reused as a module in various different ways within other workflows (e.g. thereby processing different items, samples, or something else).

In order to overwrite pathvars for individual rules, they can be specified via the pathvars keyword inside a rule:

rule somerule:
    input:
        "<results>/something/<per>.txt"
    output:
        "<results>/processed/<per>.txt"
    pathvars:
        results="custom-folder"
    shell:
        "somecommand {input} {output}"

This way, the pathvar <results> in the input and output path would be replaced with custom-folder just for this rule. Per-rule pathvar definitions can also be combined with rule inheritance. This allows you to quickly write and reuse rules with generic input or output files.

Finally, it is possible to overwrite pathvars via the workflow configuration (configfile or --config). For this purpose, define a key pathvars in the config, with a mapping between pathvars and their values below it, e.g.

pathvars:
    results: example-folder

Note that defining pathvars in the config should be considered a rare, discouraged, and advanced use case, since the user must know the workflow’s internal pathvar expectations. Workflow authors can explicitly forbid the modification of particular pathvars via config file schemas and validation.

Semantic helpers

The collect function

The collect function is an alias for the expand function with exactly the same behavior. It can be used to express more explicitly that a rule collects a set of files from upstream jobs.

The lookup function

The lookup function can be used to look up a value in a python mapping (e.g. a dict) or a pandas dataframe or series. It is especially useful for looking up information based on wildcard values. The lookup function has the signature

lookup(
    dpath: Optional[str | Callable] = None,
    query: Optional[str | Callable] = None,
    cols: Optional[List[str]] = None,
    is_nrows: Optional[int] = None,
    within=None,
    default=NODEFAULT
)

The required within parameter takes either a python mapping, a pandas dataframe, or a pandas series. For the former case, it expects the dpath argument, for the latter two cases, it expects the query argument to be given.

In case of a pandas dataframe, the query parameter is passed to DataFrame.query(). If the query results in multiple rows, the result is returned as a list of named tuples with the column names as attributes. If the query results in a single row, the result is returned as a single named tuple with the column names as attributes. If the query or dpath parameter is given a function, the function will be evaluated with wildcards passed as the first argument. If the dpath is not found or the query returns no matching rows, the default fallback value is returned if provided. Otherwise, a LookupError is raised (for dpath) or an empty list is returned (for query). Note: None is also a valid default value.

In both cases (dpath and query), the result can be used by the expand or collect function, e.g.

collect("results/{item.sample}.txt", item=lookup(query="someval > 2", within=samples))

Here, we take the file "results/{item.sample}.txt" with {item.sample} being replaced by the sample names that occur in all rows of the dataframe samples where the value of the someval column is greater than 2.

Since the result, in any case, also evaluates to True if it is not empty when interpreted as a boolean by Python, it can also be used as a condition for the branch function, e.g.

branch(lookup(query="sample == '{sample}' & someval > 2", within=samples), then="foo", otherwise="bar")

In case your dataframe has an index, you can also access the index within the query, e.g. for faster, constant time lookups:

lookup(query="index.loc[{sample}]", within=samples)

Further, it is possible to constrain the output to a list of columns, e.g.

lookup(query="sample == '{sample}'", within=samples, cols=["somecolumn"])

or to a single column, e.g.

lookup(query="sample == '{sample}'", within=samples, cols="somecolumn")

In the latter case, just a list of items in that column is returned (e.g. ["a", "b", "c"]).

The argument is_nrows allows testing for a given number of rows in the queried dataframe. If it is used, lookup just returns a boolean value indicating whether the number of rows in the queried dataframe matches the given number:

lookup(query="sample == '{sample}'", within=samples, is_nrows=5)

In case of a pandas series, the series is converted into a dataframe via Series.to_frame() and the same logic as for a dataframe is applied.

In case of a python mapping, the dpath parameter is passed to dpath.values() (see https://github.com/dpath-maintainers/dpath-python).
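For the mapping case, the idea of addressing a nested value by a slash-separated path, together with the default handling described above, can be sketched without any third-party library. The function lookup_path_sketch and the example config are hypothetical; the real lookup delegates to the dpath package, which supports more than the plain key paths handled here.

```python
_MISSING = object()  # sentinel, so that None remains a valid default


def lookup_path_sketch(mapping, dpath, default=_MISSING):
    """Walk a nested mapping along a "/"-separated path."""
    current = mapping
    for key in dpath.split("/"):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif default is not _MISSING:
            return default
        else:
            raise LookupError(f"path not found: {dpath}")
    return current


# Hypothetical config dictionary.
config = {"tools": {"sometool": True}, "tool": {"to": {"use": "sometool"}}}

found_flag = lookup_path_sketch(config, "tools/sometool")
found_name = lookup_path_sketch(config, "tool/to/use")
fallback = lookup_path_sketch(config, "tools/othertool", default=None)

print(found_flag, found_name, fallback)
```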

query, dpath, and cols may contain wildcards (e.g. {sample}). In that case, this function returns an input function which takes wildcards as its only argument and will be evaluated by Snakemake once the wildcard values are known if the lookup is used within an input file statement.

In addition to wildcard values, dpath, query and cols may refer via the same syntax to auxiliary namespace arguments given to the lookup function, e.g.

lookup(
    query="cell_type == '{sample.cell_type}'",
    within=samples,
    sample=lookup(query="sample == '{sample}'", within=samples)
)

This way, one can e.g. pass additional variables or chain lookups into more complex queries.

The branch function

The branch function allows choosing different input files based on a given condition. It has the signature

branch(
    condition: Union[Callable, bool],
    then: Optional[Union[str, list[str], Callable]] = None,
    otherwise: Optional[Union[str, list[str], Callable]] = None,
    cases: Optional[Mapping] = None
)

The condition argument has to be either a function or an expression that can be evaluated as a bool (which is virtually everything in Python). If it is a function, it has to take wildcards as its only parameter. Similarly, then, otherwise and the values of the cases mapping (e.g. a python dict) can be such functions.

If any such function is given to any of those arguments, this function returns a derived input function that will be evaluated once the wildcards are known (e.g. when used in the context of an input definition) (see Input functions).

If then and optionally otherwise are specified, it does the following: If the condition is (or evaluates to) True, return the value of the then parameter. Otherwise, return the value of the otherwise parameter.

If cases is specified, it does the following: Retrieve the value of the cases mapping using the return value of the condition (if it is a function), or the condition value itself as a key.
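The decision logic described above can be summarized in a short sketch. The function branch_sketch is a hypothetical simplification: it only defers evaluation when the condition is a function, and it does not handle functions passed as then, otherwise, or cases values.

```python
def branch_sketch(condition, then=None, otherwise=None, cases=None):
    """Pick a value based on a condition, optionally deferred until wildcards are known."""

    def choose(wildcards=None):
        value = condition(wildcards) if callable(condition) else condition
        if cases is not None:
            # Use the condition value as a key into the cases mapping.
            return cases[value]
        return then if value else otherwise

    # Defer evaluation when the condition depends on wildcards.
    return choose if callable(condition) else choose()


direct = branch_sketch(True, then="foo", otherwise="bar")
by_case = branch_sketch("sometool", cases={"sometool": "a.txt", "someothertool": "b.txt"})
deferred = branch_sketch(lambda wc: wc["sample"] == "100", then="a.txt", otherwise="b.txt")

print(direct, by_case, deferred({"sample": "100"}), deferred({"sample": "7"}))
```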

An example of using branch in combination with lookup from a config dictionary can look as follows:

branch(
    lookup(dpath="tools/sometool", within=config),
    then="results/sometool/{dataset}.txt",
    otherwise="results/someresult/{dataset}.txt"
)

Here, the semantics are as follows: if the lookup returns a value that evaluates to True, the input is results/sometool/{dataset}.txt, otherwise it is results/someresult/{dataset}.txt.

Since condition can be a function, a rule in which the use of the tool sometool depends on wildcard values can pass a function instead of a boolean value to branch, which then acts as an input function.

def use_sometool(wildcards):
    # determine whether the tool shall be used based on the wildcard values.
    ...

rule a:
    input:
        branch(
            use_sometool,
            then="results/sometool/{dataset}.txt",
            otherwise="results/someresult/{dataset}.txt"
        )

Above, the semantics are as follows: if use_sometool returns True for the given wildcard values, the input is results/sometool/{dataset}.txt, otherwise it is results/someresult/{dataset}.txt.

An example for using the cases argument could look as follows:

branch(
    lookup(dpath="tool/to/use", within=config),
    cases={
        "sometool": "results/sometool/{dataset}.txt",
        "someothertool": "results/someothertool/{dataset}.txt"
    }
)

The evaluate function

The evaluate function allows quick evaluation of a Python expression that contains wildcard values. It has the signature evaluate(expr: str). Within the expression, wildcards can be specified via the usual syntax, e.g. {sample}. Upon evaluation, the wildcards are replaced by their values as strings and the expression is evaluated as Python code with access to any global variables defined in the workflow. Consider the following example:

rule a:
    input:
        branch(evaluate("{sample} == '100'"), then="a/{sample}.txt", otherwise="b/{sample}.txt"),
    output:
        "c/{sample}.txt",
    shell:
        ...

The semantics are as follows: if the sample wildcard is 100, the input is a/100.txt; otherwise it is b/{sample}.txt, with {sample} replaced by the actual value of the sample wildcard.
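How such an expression might be evaluated can be sketched in plain Python. This is a sketch under a stated assumption: each wildcard value is inserted as a quoted string literal (via repr) before the expression is evaluated, which is consistent with the comparison against '100' in the example above but is not confirmed here as Snakemake's exact mechanism. The function evaluate_sketch is hypothetical.

```python
def evaluate_sketch(expr):
    """Return a function that evaluates expr once wildcard values are known."""

    def inner(wildcards):
        # Assumption: insert each wildcard value as a quoted string literal.
        formatted = expr.format(**{name: repr(value) for name, value in wildcards.items()})
        return eval(formatted)

    return inner


is_hundred = evaluate_sketch("{sample} == '100'")
is_large = evaluate_sketch("int({sample}) > 50")

print(is_hundred({"sample": "100"}), is_hundred({"sample": "7"}))
print(is_large({"sample": "100"}))
```

Since wildcard values are strings, numeric comparisons need an explicit conversion such as int(...) inside the expression, as in the second example.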

The exists function

The exists function checks whether a file exists, while properly considering remote storage settings provided to Snakemake. For example, if Snakemake has been configured to consider all input and output files to be located in an S3 bucket, exists will check whether the file exists in the S3 bucket. It has the signature exists(path), with path being the path to a file or directory, or an explicit storage object. The function returns True if the file exists, and False otherwise. It can, for example, be used to make some behavior in the workflow conditional on the existence of a file before the workflow is executed:

rule all:
    input:
        # only expect the output if test.txt is present before workflow execution
        "out.txt" if exists("test.txt") else [],

rule b:
    input:
        "test.txt"
    output:
        "out.txt"
    shell:
        "cp {input} {output}"

The parse_input function

The parse_input function parses an input file and returns a value. It has the signature parse_input(input_item, parser, **kwargs), with input_item being the key of an input file, parser being a callable that extracts the desired information, and kwargs being extra arguments passed to the parser. The function returns the extracted value, which can, for example, be used as a parameter.

rule a:
    input:
        samples="samples.tsv",
    output:
        "samples.id",
    params:
        id=parse_input(input.samples, parser=extract_id)
    shell:
        "echo {params.id} > {output}"

The extract_checksum function

The extract_checksum function parses an input file and returns the checksum of the given file. It has the signature extract_checksum(infile, file), with infile being the input file, and file the filename to search for. The function will return the checksum of file present in infile.

rule a:
    input:
        checksum="samples.md5",
    output:
        tsv="{a}.tsv",
    params:
        checksum=parse_input(input.checksum, parser=extract_checksum, file=output.tsv)
    shell:
        "echo {params.checksum} > {output}"

The prepend_param function

The prepend_param function takes one or more input files and prepends a string to each. This allows easier use of tools that require adding a flag before each filename they are given. For example:

params:
    data=prepend_param("--input", input.data)
input:
    data=["a.txt", "b.txt", "c.txt"],
shell:
    "somecommand {params.data}"

will run the command somecommand --input a.txt --input b.txt --input c.txt.
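The string that prepend_param produces can be reproduced with a few lines of plain Python. The function prepend_param_sketch is a hypothetical stand-in that works on concrete file names only (no deferred rule item access); its space argument mirrors the keyword of the same name covered in this section.

```python
def prepend_param_sketch(prefix, files, space=True):
    """Put prefix in front of every file name and join the result."""
    if isinstance(files, str):
        files = [files]
    separator = " " if space else ""
    return " ".join(f"{prefix}{separator}{name}" for name in files)


with_space = prepend_param_sketch("--input", ["a.txt", "b.txt", "c.txt"])
without_space = prepend_param_sketch("-i", ["a.txt", "b.txt", "c.txt"], space=False)

print(with_space)
print(without_space)
```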

If spaces are not required between the prefix and the filename, set the space keyword argument to False:

params:
    data=prepend_param("-i", input.data, space=False)
input:
    data=["a.txt", "b.txt", "c.txt"],
shell:
    "somecommand {params.data}"  # Runs somecommand -ia.txt -ib.txt -ic.txt

Rule item access helpers

Via functions (e.g. for Non-file parameters for rules or Resources) it is possible to access other items of the same rule in a deferred way, at the point in time when they are actually known. For this, functions like

def get_file_foo_from_input(wildcards, input):
    return input.foo

can be written. If such a function is passed to e.g. a params or resources statement, Snakemake knows that the item shall be evaluated by passing the input files in addition to the wildcards (which are always required as the first argument of any such function). To simplify such logic in certain situations, Snakemake provides the globally available objects input, output, resources, and threads, which can be used in place of the corresponding function definitions. For example, the global input.foo returns a function that is equivalent to get_file_foo_from_input above. (This global input is not the input argument inside that function, which holds the concrete file paths.) Using these objects makes most sense inside a rule definition, for example to access a subpath of an input or output file or directory (see Sub-path access). For example, we could write

rule a:
    input:
        foo="results/something/foo.txt"
    output:
        "results/something-else/out.txt"
    params:
        directory=subpath(input.foo, parent=True)
    shell:
        "somecommand {params.directory} {output}"

Sub-path access

In some cases, it is useful to access a sub-path of an input or output file or directory. For this purpose, Snakemake provides the subpath function. It has the signature subpath(path_or_func, strip_suffix=None, with_suffix=None, basename=False, parent=False, ancestor=None). If a path is given as first argument (of type str or pathlib.Path), the function directly returns the sub-path of the given path. Thereby, the sub-path is determined depending on the other arguments.

If a str is given to strip_suffix, this suffix is stripped from the path (a ValueError is thrown if the path does not have the suffix).

subpath("results/test.txt", strip_suffix=".txt") # returns "results/test"

If a str is given to with_suffix, this suffix is added to the path.

subpath("results/test.txt", with_suffix=".gz") # returns "results/test.txt.gz"

The two arguments strip_suffix and with_suffix can be used in combination, e.g.

subpath("results/test.txt", strip_suffix=".txt", with_suffix=".csv") # returns "results/test.csv"

If basename is set to True, the basename of the path is returned (e.g. test.txt in case the path is results/test.txt).

subpath("results/test.txt", basename=True) # returns "test.txt"

If parent is set to True, the parent directory of the path is returned (e.g. results in case the path is results/test.txt).

subpath("results/test.txt", parent=True) # returns "results"

If ancestor is set to an integer greater than 0, the ancestor directory at the given level is returned (e.g. results in case the path is results/foo/test.txt and ancestor=2).

subpath("results/foo/test.txt", ancestor=2) # returns "results"

The arguments basename, parent, and ancestor are mutually exclusive.
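The path manipulations listed above map closely onto Python's pathlib. The sketch below is a hypothetical re-implementation for concrete path strings only (no functions as first argument); the order in which suffix handling and basename/parent/ancestor are applied is an assumption.

```python
from pathlib import PurePosixPath


def subpath_sketch(path, strip_suffix=None, with_suffix=None,
                   basename=False, parent=False, ancestor=None):
    """Return a sub-path of the given path, following the rules listed above."""
    if sum([bool(basename), bool(parent), ancestor is not None]) > 1:
        raise ValueError("basename, parent and ancestor are mutually exclusive")
    path = str(path)
    if strip_suffix:
        if not path.endswith(strip_suffix):
            raise ValueError(f"{path} does not end with {strip_suffix}")
        path = path[: -len(strip_suffix)]
    if with_suffix:
        path += with_suffix
    pure = PurePosixPath(path)
    if basename:
        return pure.name
    if parent:
        return str(pure.parent)
    if ancestor is not None:
        # parents[0] is the direct parent, so level n is parents[n - 1].
        return str(pure.parents[ancestor - 1])
    return path


print(subpath_sketch("results/test.txt", strip_suffix=".txt", with_suffix=".csv"))
print(subpath_sketch("results/foo/test.txt", ancestor=2))
```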

The subpath function can be very handy in combination with Snakemake’s rule item access helpers, e.g.

rule a:
    input:
        "results/something/foo.txt"
    output:
        foo="results/something-else/out.txt"
    params:
        basename=subpath(output.foo, basename=True),
        outdir=subpath(output.foo, parent=True)
    shell:
        "somecommand {input} --name {params.basename} --outdir {params.outdir}"

The flatten function

When selecting input files, sometimes you might end up with an irregular list of lists. To flatten it, you can use:

flatten([1, "a", [2,"b"], ["c","d",["e", 3]]]) # returns ["1", "a", "2", "b", "c", "d", "e", "3"]
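Flattening an irregularly nested list is a short recursive exercise. The generator flatten_sketch below is a hypothetical illustration rather than Snakemake's code; it converts every element to a string so that its result agrees with the return value shown above.

```python
def flatten_sketch(items):
    """Yield the elements of arbitrarily nested lists or tuples as strings."""
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten_sketch(item)
        else:
            yield str(item)


flat = list(flatten_sketch([1, "a", [2, "b"], ["c", "d", ["e", 3]]]))
print(flat)
```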

The choose_f(ile/older) functions

In some cases, you might need to choose a valid file or folder from a given list at execution time. For that, you can use the choose_f family of functions. Their arguments are very similar:

(file/folder/tmp)_list: list of paths (files or folders) to choose from (List[Union[Path, AnnotatedString, str]])
read: whether the input paths have to be readable (bool)
write: whether the input paths have to be writeable (bool)
execute/open: whether the input paths have to be executable/openable (bool)
creatable: whether the input paths have to be creatable (bool)

Conditions are booleans and they all have to be met; if you want to ignore one condition set it to None. For example, choosing a file where the user has read/write access (but it does not matter if the user can execute or create it):

choose_file(["foo", "~/.bashrc"], read=True, write=True, execute=None, creatable=None)  # "~/.bashrc"

The same logic applies for choosing a folder where the user has read/write/open access (but it does not matter if the user can create it):

choose_folder(["foo/bar", "/proc", "~"], read=True, write=True, open=True, creatable=None)  # "~"

temp folders:

There is also a specific function for temporary folders, where the system temporary folder is returned if none is valid:

choose_tmp(["foo/bar", "/tmp"], read=True, write=True, open=True, creatable=True)  # "foo/bar"
choose_tmp(["/foo/bar", "/tmp", "/tmp/jobid"], read=True, write=True, open=True, creatable=None)  # "/tmp"
choose_tmp(["/foo/bar", "/tmp/$USER", "/tmp/"], read=True, write=True, open=True, creatable=True)  # "/tmp/$USER"
choose_tmp(["/foo", "/bar"], read=True, write=True, open=True, creatable=True)  # "system_tmpdir"
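The selection logic behind these functions can be approximated with os.access. The sketch below is hypothetical and deliberately reduced: it only handles existing folders, ignores the creatable condition and the system temporary folder fallback, and its defaults (all conditions True) are an assumption.

```python
import os
import tempfile


def choose_folder_sketch(folders, read=True, write=True, open=True):
    """Return the first existing folder that meets all requested conditions.

    A condition set to None is ignored.
    """
    checks = [(read, os.R_OK), (write, os.W_OK), (open, os.X_OK)]
    for folder in folders:
        path = os.path.expanduser(str(folder))
        if not os.path.isdir(path):
            continue
        if all(cond is None or os.access(path, mode) == cond for cond, mode in checks):
            return folder
    return None


with tempfile.TemporaryDirectory() as tmp:
    # The first candidate does not exist, so the temporary folder is chosen.
    chosen = choose_folder_sketch(["/no/such/folder/at/all", tmp])
    chosen_matches = chosen == tmp

print(chosen_matches)
```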

as_py_module

When running a Python script under active development that relies on relative imports, you may want to have the script as an input file but call it by its module name. The as_py_module function translates a given script filename into a module name that can be used with python -m. For example,

rule:
    params:
        module=as_py_module(),
    input:
        script="some_package/some_subpackage/some_module.py",
    output:
        "..."
    shell:
        "python -m {params.module} --output_file {output}"

The helper by default looks at input.script. Other values may be used by using Snakemake’s rule item access helpers, e.g.

rule:
    params:
        module=as_py_module(input.my_script),
    input:
        my_script="some_package/some_subpackage/some_module.py",
    output:
        "..."
    shell:
        "python -m {params.module} --output_file {output}"
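
The translation from a script path to a module name can be pictured with the following plain-Python sketch. It only illustrates the idea and is not the helper's actual code; the function name script_path_to_module is hypothetical, and the path is assumed to be relative to the package root.

```python
from pathlib import PurePosixPath


def script_path_to_module(script_path: str) -> str:
    """Turn 'pkg/sub/mod.py' into the dotted module name 'pkg.sub.mod'."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"not a Python script: {script_path}")
    # Drop the extension and join the remaining path components with dots.
    return ".".join(path.with_suffix("").parts)


module = script_path_to_module("some_package/some_subpackage/some_module.py")
print(module)  # some_package.some_subpackage.some_module
```

The resulting string is what python -m expects, which is why relative imports inside the script keep working.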

Target rules

By default, Snakemake always wants to execute the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:

rule all:
    input:
        expand("{dataset}/file.A.txt", dataset=DATASETS)

Here, for each dataset in a python list DATASETS defined before, the file {dataset}/file.A.txt is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion shown above.
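
For a single wildcard, the list of requested files is equivalent to the following plain-Python sketch. The dataset names are hypothetical placeholders; in a real workflow, DATASETS would be defined earlier in the Snakefile.

```python
DATASETS = ["D1", "D2", "D3"]  # hypothetical dataset names

# One target path per dataset, filling the {dataset} placeholder each time.
targets = ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]
print(targets)  # ['D1/file.A.txt', 'D2/file.A.txt', 'D3/file.A.txt']
```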

It is possible to override the behavior of using the first rule as the default target by explicitly marking a rule as the default target via the default_target directive:

rule xy:
    input:
        expand("{dataset}/file.A.txt", dataset=DATASETS)
    default_target: True

Regardless of where this rule appears in the Snakefile, it will be the default target. Usually, it is still recommended to keep the default target rule (and in fact all other rules that could act as optional targets) at the top of the file, such that it can be easily found. The default_target directive becomes particularly useful when combining several pre-existing workflows.

Shell settings

By default, Snakemake uses the bash shell. This can be overridden in two ways. First, by globally setting the shell executable (e.g. to zsh) via

shell.executable("/bin/zsh")

Note that this is usually not recommended, as it requires others who want to use the workflow to have that shell installed. Second, by setting the shell executable via the resources directive of a rule, e.g.

rule a:
    input: ...
    output: ...
    resources:
        shell_exec="zsh"
    shell:
        "echo 'hello world' > {output}"

This can be particularly important in case you use a container image for the rule which does not contain bash, e.g.

rule a:
    output:
        "test.out"
    resources:
        shell_exec="sh"
    # image does not have bash, hence this would fail if shell_exec is not set to sh
    container: "docker://busybox:1.33"
    shell:
        "echo 'hello world' > {output}"

Shell behavior

In case of bash shell, Snakemake always uses the so-called strict mode. For individual rules, you can deactivate aspects of the strict mode by unsetting them at the beginning of the shell command. Further, it is possible to set global prefixes and suffixes for all shell commands via

shell.prefix("some prefix command;")
shell.suffix("; some suffix command")

anywhere in your snakefile (preferably at the beginning for clarity). This can sometimes be useful for debugging, but is not recommended for production workflows and releases because it might hamper reproducibility and readability.

Threads

Further, a rule can be given a number of threads to use, i.e.

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    shell: "somecommand --threads {threads} {input} {output}"

Note

On a cluster node, Snakemake uses as many cores as available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by --local-cores, which only applies to jobs running on the main node.

Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built in variable threads rather than hardcoding it into the shell command. In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores) with cores being the number of cores specified at the command line (option --cores).
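
The adjustment described above can be written out as a small plain-Python sketch; effective_threads is a hypothetical name used only for illustration.

```python
def effective_threads(rule_threads: int, cores: int) -> int:
    """Threads a job actually receives: the rule's maximum, capped by --cores."""
    return min(rule_threads, cores)


print(effective_threads(8, 4))   # 4
print(effective_threads(8, 16))  # 8
```

This is why {threads} should be passed to the tool instead of a hardcoded number: the value seen by the command already reflects the cap.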

Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won’t help to further speed it up. This is often the case, and should be evaluated carefully for production workflows. Also, setting a threads: maximum is required to achieve parallelism in tools that (often implicitly and without the user knowing) rely on an environment variable for the maximum of cores to use. For example, this is the case for many linear algebra libraries and for OpenMP. Snakemake limits the respective environment variables to one core by default, to avoid unexpected and unlimited core-grabbing, but will override this with the threads: you specify in a rule (the parameters set to threads:, or defaulting to 1, are: OMP_NUM_THREADS, GOTO_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS).
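
When testing such a tool outside of Snakemake, it can help to reproduce this environment by hand. The following sketch builds one, assuming only the variable names listed above; thread_limited_env is a hypothetical helper, not part of Snakemake.

```python
import os
from typing import Dict

THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "GOTO_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def thread_limited_env(threads: int = 1) -> Dict[str, str]:
    """Copy the current environment and pin each thread variable to `threads`."""
    env = dict(os.environ)
    for name in THREAD_ENV_VARS:
        env[name] = str(threads)
    return env


env = thread_limited_env(threads=8)
print(env["OMP_NUM_THREADS"])  # 8
```

The returned dictionary can be passed as the env argument of subprocess.run to start the tool with the same limits.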

If it is certain that no maximum for efficient parallelism exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: workflow.cores * 0.75
    shell: "somecommand --threads {threads} {input} {output}"

The number of given cores is globally available in the Snakefile as an attribute of the workflow object: workflow.cores. Any arithmetic operation can be performed to derive a number of threads from this. E.g., in the above example, we reserve 75% of the given cores for the rule. Snakemake will always round the calculated value down (while enforcing a minimum of 1 thread).
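
The rounding rule can be illustrated with a short plain-Python sketch; threads_from_fraction is a hypothetical name and only mirrors the behavior described above.

```python
import math


def threads_from_fraction(cores: int, fraction: float) -> int:
    """Round the requested share of cores down, but never below one thread."""
    return max(1, math.floor(cores * fraction))


print(threads_from_fraction(8, 0.75))   # 6
print(threads_from_fraction(10, 0.75))  # 7
print(threads_from_fraction(1, 0.75))   # 1
```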

Starting from version 3.7, threads can also be a callable that returns an int value. The signature of the callable should be callable(wildcards[, input]) (input is an optional parameter). It is also possible to refer to a predefined variable (e.g., threads: threads_max), so that the number of cores for a set of rules can be changed in one place by altering the value of the variable threads_max.

Threads can be defined (or overwritten) upon invocation (without modifying the workflow code) via --set-threads (see All Options) and via workflow profiles (see Profiles). To quickly exemplify the latter, you could provide the following workflow profile in a file profiles/default/profile.yaml relative to the Snakefile or the current working directory:

set-threads:
    b: 16

to set the (maximum) number of threads rule b uses to 16.

Resources

In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:

rule a:
    input:     ...
    output:    ...
    resources:
        mem_mb=100
    shell:
        "..."

If workflow-wide limits for the resources are given via the command line, e.g.

$ snakemake --resources mem_mb=200

the scheduler will ensure that the given resources are not exceeded by running jobs. Resources are always meant to be specified as total per job, not by thread (i.e. above mem_mb=100 in rule a means that any job from rule a will require 100 megabytes of memory in total, and not per thread).

Importantly, there are some standard resources that should be considered before making up your own.

In general, resources are just names to the Snakemake scheduler, i.e., Snakemake does not check on the resource consumption of jobs in real time. Instead, resources are used to determine which jobs can be executed at the same time without exceeding the limits specified at the command line. Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs) this allows us to control scheduling in various ways, e.g. to limit IO-heavy jobs by assigning an artificial IO-resource to them and limiting it via the --resources flag. If no limits are given, the resources are ignored in local execution.
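
The following plain-Python sketch shows the idea of admitting jobs only while their summed resources stay within the given limits. It is a deliberately simple greedy illustration, not Snakemake's scheduler; select_jobs and the job names are hypothetical.

```python
from typing import Dict, List, Tuple


def select_jobs(
    ready_jobs: List[Tuple[str, Dict[str, int]]],
    limits: Dict[str, int],
) -> List[str]:
    """Pick jobs in order as long as no limited resource would be exceeded."""
    used = {name: 0 for name in limits}
    selected = []
    for job_name, needs in ready_jobs:
        # Only resources with a limit are constrained; all others are ignored.
        fits = all(
            used[res] + needs.get(res, 0) <= limit for res, limit in limits.items()
        )
        if fits:
            for res in limits:
                used[res] += needs.get(res, 0)
            selected.append(job_name)
    return selected


jobs = [
    ("a1", {"mem_mb": 100}),
    ("a2", {"mem_mb": 100}),
    ("a3", {"mem_mb": 100}),
]
print(select_jobs(jobs, {"mem_mb": 200}))  # ['a1', 'a2']
print(select_jobs(jobs, {}))               # ['a1', 'a2', 'a3']
```

With a limit of mem_mb=200, only two of the three 100 MB jobs run at once; without limits, the resource declarations do not restrict anything.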

Resources can have any arbitrary name, and must be assigned int or str values. In case of None, the resource is considered to be unset (i.e. ignored) in the rule.

Standard Resources

There are several standard resources, for total memory, disk usage, runtime, and the temporary directory of a job: mem, disk, runtime, and tmpdir. All of these resources have specific meanings understood by snakemake and are treated in varying unique ways:

The tmpdir resource automatically leads to setting the $TMPDIR variable for shell commands, scripts, wrappers and notebooks. In cluster or cloud setups, its evaluation is delayed until the actual execution of the job. This way, it can dynamically react to the context of the node of execution.
The runtime resource indicates the amount of wall clock time a job needs to run. It can be given as string defining a time span or as integer defining minutes. In the former case, the time span can be defined as a string with a number followed by a unit (ms, s, m, h, d, w, y for milliseconds, seconds, minutes, hours, days, weeks, and years, respectively). The interpretation happens via the humanfriendly package. Cluster or cloud backends may use this to constrain the allowed execution time of the submitted job. See the section below for more information.
disk and mem define the amount of disk space and memory needed by the job. They are given as strings with a number followed by a unit (B, KB, MB, GB, TB, PB, KiB, MiB, GiB, TiB, PiB). The interpretation of the definition happens via the humanfriendly package. Alternatively, the two can be directly defined as integers via the resources disk_mb and mem_mb (to which disk and mem are also automatically translated internally). They are both locally scoped by default, a fact important for cluster and compute execution. See below for more info. They are usually passed to execution backends, e.g. to allow the selection of appropriate compute nodes for the job execution.
gpu, gpu_manufacturer, and gpu_model define the number of GPUs, the manufacturer of the GPUs, and the gpu model needed by the job. The gpu resource is an integer and the other two are strings. Please check the executor plugin docs in order to see whether and how these resources are supported and properly interpreted by the executor. For example, the kubernetes executor plugin accepts the terms nvidia or amd for the gpu_manufacturer resource.

Because of these special meanings, the above names should always be used instead of possible synonyms (e.g. tmp, time, temp, etc).

Default Resources

Since it could be cumbersome to define these standard resources for every rule, you can set default values via the command line flag --default-resources or in a profile. As with --set-resources, this can be done dynamically, using the variables specified for the callables in the section on Dynamic Resources. If those resource definitions are mandatory for a certain execution mode, Snakemake will fail with a hint if they are missing. Any resource definitions inside a rule override what has been defined with --default-resources. If --default-resources are specified without any further arguments, Snakemake uses 'mem_mb=min(max(2*input.size_mb, 1000), 8000)', 'disk_mb=max(2*input.size_mb, 1000) if input else 50000', and 'tmpdir=system_tmpdir'.

The tmpdir value points to whatever is the default of the operating system or specified by any of the environment variables $TMPDIR, $TEMP, or $TMP as outlined here.
The rationale for the default value of disk_mb is the following: if there are input files, we assume the rule will use at most twice their size during execution. If there are no input files, we cannot know what the rule will need, hence we assume a conservative default of 50GB.
The rationale for the default value of mem_mb is the following: we try to scale the required memory with the input file size (conservatively assuming that they are loaded entirely into memory). However, we stop at 8GB, in order to avoid artificially high requests. Tools that read very large files rather tend to stream them instead of fully loading them into memory.
If --default-resources is specified with some definitions, but any of the above defaults (e.g. mem_mb) is omitted, these are still used. In order to explicitly unset these defaults, assign them a value of None, e.g. --default-resources mem_mb=None.
Of course, any rule specifying concrete resources either via the rule definition or via --set-resources will override the defaults.
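
The two default expressions quoted above can be evaluated by hand with this plain-Python sketch. The function names are hypothetical; the formulas are copied from the defaults stated in this section.

```python
def default_mem_mb(input_size_mb: float) -> float:
    """Twice the input size, at least 1000 MB and at most 8000 MB."""
    return min(max(2 * input_size_mb, 1000), 8000)


def default_disk_mb(input_size_mb: float, has_input: bool = True) -> float:
    """Twice the input size (at least 1000 MB), or 50000 MB without inputs."""
    return max(2 * input_size_mb, 1000) if has_input else 50000


print(default_mem_mb(100), default_disk_mb(100))      # 1000 1000
print(default_mem_mb(3000), default_disk_mb(3000))    # 6000 6000
print(default_mem_mb(10000), default_disk_mb(10000))  # 8000 20000
print(default_disk_mb(0, has_input=False))            # 50000
```

Note how memory is capped at 8000 MB for large inputs, while the disk estimate keeps growing with the input size.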

Dynamic Resources

It is often useful to determine resource specifications dynamically during workflow execution. A common example is determining the amount of memory that a job needs, based on the input file size of that particular rule instance. To enable this, resource specifications can also be callables (for example functions or lambda expressions) that return int, str or None values. The signature of the callable must be callable(wildcards [, input] [, threads] [, attempt]) (input, threads, and attempt are optional parameters). Such callables are evaluated immediately before the job is executed (or printed during a dry-run).

The example described above, using the input size to determine memory requirements, could be realized via a lambda expression (here also providing a minimum value of 300 MB memory):

rule:
    input:    ...
    output:   ...
    resources:
        mem_mb=lambda wc, input: max(2.5 * input.size_mb, 300)
    shell:
        "..."

In order to make this work with a dry-run, where the input files are not yet present, Snakemake automatically converts a FileNotFoundError that is raised by the callable into a placeholder called <TBD> that will be displayed during dry-run in such a case.
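
The conversion can be pictured with the following plain-Python sketch. It only illustrates the described behavior; resource_for_display and input_size_mb are hypothetical names, and the path is assumed not to exist.

```python
import os


def input_size_mb(path: str) -> float:
    """Size of a file in MB; raises FileNotFoundError if the file is missing."""
    return os.path.getsize(path) / 1e6


def resource_for_display(func, *args):
    """Evaluate a resource callable, showing a placeholder for missing inputs."""
    try:
        return func(*args)
    except FileNotFoundError:
        return "<TBD>"


shown = resource_for_display(input_size_mb, "not/yet/created/input.txt")
print(shown)  # <TBD>
```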

The parameter attempt allows us to adjust resources based on how often the job has been restarted (see All Options, option --retries). This is handy when executing a Snakemake workflow in a cluster environment, where jobs can e.g. fail because of too limited resources. When Snakemake is executed with --retries 3, it will try to restart a failed job 3 times before it gives up. Thereby, the parameter attempt will contain the current attempt number (starting from 1). This can be used to adjust the required memory as follows

def get_mem_mb(wildcards, attempt):
    return attempt * 100

rule:
    input:    ...
    output:   ...
    resources:
        mem_mb=get_mem_mb
    shell:
        "..."

Here, the first attempt will require 100 MB memory, the second attempt will require 200 MB memory and so on. When memory requirements are passed to the cluster engine, this lets you automatically try out larger nodes if it turns out to be necessary.

Another application of callables as resources is when memory usage depends on the number of threads:

def get_mem_mb(wildcards, threads):
    return threads * 150

rule b:
    input:     ...
    output:    ...
    threads: 8
    resources:
        mem_mb=get_mem_mb
    shell:
        "..."

Here, the value that the function get_mem_mb returns, grows linearly with the number of threads. Of course, any other arithmetic could be performed in that function.

Both threads and resources can be defined (or overwritten) upon invocation (without modifying the workflow code) via --set-threads and --set-resources, see All Options. Or they can be defined via workflow Profiles, with the variables listed above in the signature for usable callables. You could, for example, provide the following workflow profile in a file profiles/default/profile.yaml relative to the Snakefile or the current working directory:

set-threads:
    b: 3
set-resources:
    b:
        mem_mb: 1000

to set the requirements for rule b to 3 threads and 1000 MB.

Another case of dynamic resources is tmpdir. Depending on the architecture of your system, you might have different servers with different “ideal” temp folders. For example, some might have small but fast NVMe disks, others might have a job-specific temp folder (e.g. /scratch/$SLURM_JOB_ID), while on others the only option might be the slow NFS. The function choose_tmp allows the user to specify a list of temp folders that are evaluated on each job submission. The first valid path is selected; if no path is valid, Snakemake’s internal system_tmpdir is used. For example, one can specify the function in a profile:

default_resources:
  tmpdir: choose_tmp(["/scratch/nvme", "/home/$USER/scratch", "/scratch/$SLURM_JOB_ID"])

and each job, depending on the hardware specification of each node, will select the first valid path.

Resources and Remote Execution

New to Snakemake 7.11. In cluster or cloud execution, resources may represent either a global constraint across all submissions (e.g. number of API calls per second), or a constraint local to each specific job submission (e.g. the amount of memory available on a node). Snakemake distinguishes between these two types of constraints using resource scopes. By default, mem_mb, disk_mb, and threads are all considered "local" resources, meaning specific to individual submissions. So if a constraint of 16G of memory is given to snakemake (e.g. snakemake --resources mem_mb=16000), each group job will be allowed 16G of memory. All other resources are considered "global", meaning they are tracked across all jobs across all submissions. For example, if api_calls was limited to 5 and each job scheduled used 1 api call, only 5 jobs would be scheduled at a time, even if more job submissions were available.

These resource scopes may be modified both in the Snakefile and via the CLI parameter --set-resource-scopes. The CLI parameter takes priority. Modification in the Snakefile uses the following syntax:

resource_scopes:
    gpu="local",
    foo="local",
    disk_mb="global"

Here, we set both gpu and foo as local resources, and we changed disk_mb from its default to be a global resource. These options could be overridden at the command line using:

$ snakemake --set-resource-scopes gpu=global disk_mb=local

Resources and Group Jobs

New to Snakemake 7.11. When submitting group jobs to the cluster, Snakemake calculates how many resources to request by first determining which component jobs can be run in parallel, and which must be run in series. For most resources, such as mem_mb or threads, a sum will be taken across each parallel layer. The layer requiring the most resource (i.e. max()) will determine the final amount requested. The only exception is runtime. For it, max() will be used within each layer, then the total amount of time across all layers will be summed. If resource constraints are provided (via --resources or --cores) Snakemake will prevent group jobs from requesting more than the constraint. Jobs that could otherwise be run in parallel will be run in series to prevent the violation of resource constraints.
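
The aggregation rule can be worked through with a small plain-Python sketch. It assumes the jobs have already been arranged into layers (lists of jobs that can run in parallel) and ignores the constraint handling; group_resources is a hypothetical name, not Snakemake's internal function.

```python
from typing import Dict, List

Job = Dict[str, int]


def group_resources(layers: List[List[Job]]) -> Dict[str, int]:
    """Sum within each layer, take the max over layers; runtime works inversely."""
    totals: Dict[str, int] = {}
    runtime_total = 0
    for layer in layers:
        layer_sum: Dict[str, int] = {}
        layer_runtime = 0
        for job in layer:
            for name, value in job.items():
                if name == "runtime":
                    # Parallel jobs overlap in time: the slowest one counts.
                    layer_runtime = max(layer_runtime, value)
                else:
                    layer_sum[name] = layer_sum.get(name, 0) + value
        for name, value in layer_sum.items():
            totals[name] = max(totals.get(name, 0), value)
        # Layers run one after another, so their runtimes add up.
        runtime_total += layer_runtime
    totals["runtime"] = runtime_total
    return totals


layers = [
    [
        {"mem_mb": 1000, "threads": 2, "runtime": 10},
        {"mem_mb": 2000, "threads": 2, "runtime": 30},
    ],
    [{"mem_mb": 4000, "threads": 1, "runtime": 5}],
]
requested = group_resources(layers)
print(requested)  # {'mem_mb': 4000, 'threads': 4, 'runtime': 35}
```

In this example the first layer needs 3000 MB and 4 threads, the second 4000 MB and 1 thread, so the request is 4000 MB and 4 threads, with a runtime of 30 + 5 minutes.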

Preemptible Jobs

You can specify parameters preemptible-rules and preemption-default to request a Google Cloud preemptible virtual machine for use with the Google Life Sciences Executor. There are several ways to go about doing this. This first example will use preemptible instances for all rules, with 10 repeats (restarts of the instance if it stops unexpectedly).

snakemake --preemption-default 10

If your preference is to set a default but then overwrite some rules with a custom value, this is where you can use --preemptible-rules:

snakemake --preemption-default 10 --preemptible-rules map_reads=3 call_variants=0

The above statement says that we want to use preemptible instances for all steps, defaulting to 10 retries, but for the steps “map_reads” and “call_variants” we want to apply 3 and 0 retries, respectively. The final option is to not use preemptible instances by default, but only for a particular rule:

snakemake --preemptible-rules map_reads=10

Note that this is currently implemented for the Google Life Sciences API.

Messages

When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    message: "Executing somecommand with {threads} threads on the following files {input}."
    shell: "somecommand --threads {threads} {input} {output}"

Note that access to wildcards is also possible via the variable wildcards (e.g., {wildcards.sample}), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.

Priorities

Snakemake allows for rules that specify numeric and/or callable priorities:

rule:
  input: ...
  output: ...
  priority: 50
  shell: ...

By default, each rule has a priority of 0. Any rule that specifies a higher priority will be preferred by the scheduler over all rules that are ready to execute at the same time but do not have at least the same priority.

Priority may also be specified with a callable. The callable receives wildcards as its first positional argument, and may optionally accept input, attempt, and rulename keyword arguments (similar to param functions). It has to return the priority as an integer or float (will be rounded):

rule sort:
    input: "{dataset}.txt"
    output: "{dataset}.sorted"
    priority: lambda wildcards, input: input.size_mb
    shell: "sort {input} > {output}"

This allows the scheduler to dynamically prioritise jobs based on, e.g., input file size so that larger jobs start first.
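
As a rough illustration of the effect, the following plain-Python sketch orders ready jobs by a size-based priority. It ignores everything else the scheduler takes into account (such as cores and resources); order_by_priority and the file names are hypothetical.

```python
from typing import List, Tuple


def order_by_priority(ready_jobs: List[Tuple[str, float]]) -> List[str]:
    """Sort (name, priority) pairs so that the highest rounded priority is first."""
    ranked = sorted(ready_jobs, key=lambda job: round(job[1]), reverse=True)
    return [name for name, _ in ranked]


# Priorities here stand in for input sizes in MB.
ready = [("small.txt", 12.4), ("large.txt", 900.0), ("medium.txt", 140.2)]
print(order_by_priority(ready))  # ['large.txt', 'medium.txt', 'small.txt']
```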

Furthermore, the --prioritize or -P command line flag allows you to specify files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible. The --dry-run (equivalently --dryrun) or -n option allows you to see the scheduling plan including the assigned priorities.

Log-Files

Each rule can specify a log file where information about the execution is written to:

rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand --log {log} {input} {output}"

Log files can be used as input for other rules, just like any other output file. However, unlike output files, log files are not deleted upon error. This is obviously necessary in order to discover causes of errors which might become visible in the log file.

The variable log can be used inside a shell command to tell the used tool to which file to write the logging information. The log file has to use the same wildcards as output files, e.g.

log: "logs/abc.{dataset}.log"

Note

Using the log directive will not automatically redirect the rule’s output towards the log file - this you will still need to facilitate yourself! The log directive merely prevents Snakemake from deleting the log file upon rule failure.

For programs that do not have an explicit log parameter, you may always use 2> {log} to redirect stderr to a file (here, the log file) in Linux-based systems. Note that it is also possible to have multiple named log files, which could be used to capture stdout and stderr:

rule abc:
    input: "input.txt"
    output: "output.txt"
    log: stdout="logs/foo.stdout", stderr="logs/foo.stderr"
    shell: "somecommand {input} {output} > {log.stdout} 2> {log.stderr}"

Non-file parameters for rules

Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the params keyword for this purpose:

rule:
    input:
        ...
    params:
        threshold=0.4
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand --threshold {params.threshold} -o {output}"

The params section is an excellent place to name and assign parameters and variables for your subsequent command. Similar to input, params can take functions as well (see Input functions), e.g. you can write

rule:
    input:
        ...
    params:
        threshold=lambda wildcards: config["thresholds"][wildcards.sample]
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand --threshold {params.threshold} -o {output}"

The above example mimics a case where one would have to look up the value of some threshold in a config dictionary. Note that in contrast to the input directive, functions passed to the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. Their order does not matter, apart from the fact that wildcards has to be the first argument. This way, params can be used to dynamically adjust those values into whatever format is needed for your command or script.
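
Such a lookup function can be tried out in isolation, as in the following plain-Python sketch. The config contents and sample names are hypothetical, and SimpleNamespace merely stands in for the wildcards object that Snakemake passes in.

```python
from types import SimpleNamespace

# Hypothetical config dictionary, as it might be loaded from a config file.
config = {"thresholds": {"sampleA": 0.4, "sampleB": 0.7}}


def get_threshold(wildcards):
    """Look up the threshold for the sample named by the wildcards."""
    return config["thresholds"][wildcards.sample]


wildcards = SimpleNamespace(sample="sampleB")
threshold = get_threshold(wildcards)
print(threshold)  # 0.7
```

A named function like this can be used in place of the lambda expression in the params directive, which tends to be easier to test and reuse.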

The params directive is particularly powerful in combination with Snakemake’s semantic helper functions.

Plain python rules

Instead of a shell command, a rule can run some python code to generate the output. It is highly advisable to limit such code to a few lines. Otherwise, use Snakemake’s script support.

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile",
    output:
        "path/to/outputfile",
        somename="path/to/another/outputfile",
    run:
        for f in input:
            ...
            with open(output[0], "w") as out:
                out.write(...)
        with open(output.somename, "w") as out:
            out.write(...)

As can be seen, instead of accessing input and output as a whole, we can also access by index (output[0]) or by keyword (output.somename). Note that, when adding keywords or names for input or output files, their order won’t be preserved when accessing them as a whole via e.g. {output} in a shell command.

Shell commands like above can also be invoked inside a python based rule, via the function shell that takes a string with the command and allows the same formatting like in the rule above, e.g.:

shell("somecommand {output.somename}")

Further, this combination of python and shell commands allows us to iterate over the output of the shell command, e.g.:

for line in shell("somecommand {output.somename}", iterable=True):
    ... # do something in python

External scripts

A rule can also point to an external script instead of a shell command or inline Python code, e.g.

Python

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "scripts/script.py"

Note

It is possible to refer to wildcards and params in the script path, e.g. by specifying "scripts/{params.scriptname}.py" or "scripts/{wildcards.scriptname}.py".

The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory). It is recommended to put all scripts into a subfolder scripts as above. Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0] to access the first input file of above rule. To enable code completion, linting and type checking your python code in IDEs, we recommend using the typing module’s TYPE_CHECKING variable and the typing stub provided in the snakemake.iocontainers module (see below for how).

An example external Python script could look like this:

def do_something(data_path, out_path, threads, myparam):
    # python code
    ...

do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])

For type checking, it is possible to import a correctly typed stub for the snakemake object:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snakemake.iocontainers import snakemake

def do_something(data_path, out_path, threads, myparam):
    # python code
    ...

do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])

You can use the Python debugger from within the script if you invoke Snakemake with --debug.

Xonsh

Because Xonsh is a superset of Python, you can use a Xonsh script as you would a Python script, but with all the additional shell primitives that Xonsh provides.

For example, with this rule:

rule get_variants_in_genes:
    input:
        vcf="input.vcf",
        gene_locations="genes.bed",
    output:
        "output.tsv"
    conda:
        "envs/variant_calling.yaml"
    log:
        "logs/get_variants_in_genes.log"
    script:
        "scripts/get_variants_in_genes.xsh"

the Xonsh script might look like this:

$XONSH_TRACEBACK_LOGFILE = snakemake.log[0]

annotations = ", ".join(
    f'ANN["{field}"]' for field in ["Consequence", "SYMBOL", "Feature", "BIOTYPE"]
)

bcftools view -R @(snakemake.input.gene_locations) @(snakemake.input.vcf) \
| vembrane table --output @(snakemake.output[0]) @(f'CHROM, POS, ID, {annotations}')

Hy

Hy allows you to interact with Python using a Lisp-like syntax.

For example, with this rule:

rule get_sum_of_odd_numbers:
    input:
        "list_of_numbers.txt"
    output:
        results_file="sum_of_odd_numbers.txt"
    conda:
        "envs/hy.yaml"
    script:
        "scripts/sum_odd_numbers.hy"

the Hy script might look like this:

(require hyrule [-> ->>])

(defn is-odd? [n] (!= (% n 2) 0))

(setv result
      (->> (get snakemake.input 0)
           open
           .readlines
           (map int)
           (filter is-odd?)
           sum))

(with [f
       (-> (get snakemake.output "results_file")
           (open "w"))]
  (print result :file f))

R and R Markdown

Apart from Python scripts, this mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "scripts/script.R"

In the R script, an S4 object named snakemake analogous to the Python case above is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

An equivalent script (to the Python one above) written in R would look like this:

do_something <- function(data_path, out_path, threads, myparam) {
    # R code
}

do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])

To debug R scripts, you can save the workspace with save.image(), and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable. It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule. A convenience method, snakemake@source(), acts as a wrapper for the normal R source() function, and can be used to source files relative to the original script directory.

An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/report.html",
    script:
        "path/to/report.Rmd"

In the R Markdown file you can insert output from an R command and access variables stored in the S4 object named snakemake:

---
title: "Test Report"
author:
    - "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
   rmd: "report.Rmd"
output:
  html_document:
    highlight: tango
    number_sections: no
    theme: default
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: yes
---

## R Markdown

This is an R Markdown document.

Test include from snakemake `r snakemake@input`.

## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>

A link to the R Markdown document with the snakemake object can be inserted. To do so, add a variable called rmd to the params section in the header of the report.Rmd file. The generated R Markdown file, including the snakemake object, is saved to the file specified in this rmd variable. This file can then be embedded into the HTML document using base64 encoding, and a link can be inserted as shown in the example above. Other input and output files can be embedded in the same way to make the report portable. Note that the data URI method shown above only works for small files. An experimental technique for embedding larger files is the JavaScript Blob object.

Julia

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "path/to/script.jl"

In the Julia script, a snakemake object is available, which can be accessed similar to the Python case, with the only difference that you have to index from 1 instead of 0.

Rust

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile",
        named_input="path/to/named/inputfile",
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    params:
        seed=4
    conda:
        "rust.yaml"
    log:
        stdout="path/to/stdout.log",
        stderr="path/to/stderr.log",
    script:
        "path/to/script.rs"

The ability to execute Rust scripts is facilitated by rust-script. As such, the script must be a valid rust-script script and rust-script (plus OpenSSL and a C compiler toolchain, provided by Conda packages openssl, c-compiler, pkg-config) must be available in the environment the rule is run in. The minimum required rust-script version is 0.35.0, so in the example above, the contents of rust.yaml might look like this:

channels:
  - conda-forge
  - bioconda
dependencies:
  - rust-script>=0.35.0
  - openssl
  - c-compiler
  - pkg-config

Some example scripts can be found in the tests directory.

In the Rust script, a snakemake instance is available, which is automatically generated from the python snakemake object using json_typegen. It usually looks like this:

pub struct Snakemake {
    input: Input,
    output: Output,
    params: Params,
    wildcards: Wildcards,
    threads: u64,
    log: Log,
    resources: Resources,
    config: Config,
    rulename: String,
    bench_iteration: Option<usize>,
    scriptdir: String,
}

Any named parameter is translated to a corresponding field_name: Type, such that params.seed from the example above can be accessed just like in python, i.e.:

let seed = snakemake.params.seed;
assert_eq!(seed, 4);

Positional arguments for input, output, log and wildcards can be accessed by index and iterated over:

let input = &snakemake.input;

// Input implements Index<usize>
let inputfile = &input[0];
assert_eq!(inputfile, "path/to/inputfile");

// Input implements IntoIterator
//
// prints
// > 'path/to/inputfile'
// > 'path/to/other/inputfile'
for f in input {
    println!("> '{}'", &f);
}

It is also possible to redirect stdout and stderr:

println!("This will NOT be written to path/to/stdout.log");
// redirect stdout to "path/to/stdout.log"
let _stdout_redirect = snakemake.redirect_stdout(snakemake.log.stdout)?;
println!("This will be written to path/to/stdout.log");

// redirect stderr to "path/to/stderr.log"
let _stderr_redirect = snakemake.redirect_stderr(snakemake.log.stderr)?;
eprintln!("This will be written to path/to/stderr.log");
drop(_stderr_redirect);
eprintln!("This will NOT be written to path/to/stderr.log");

Redirection of stdout/stderr is only “active” as long as the returned Redirect instance is alive; in order to stop redirecting, drop the respective instance.

In order to work, rust-script support for snakemake has some dependencies enabled by default:

anyhow=1, for its Result type
gag=1, to enable stdout/stderr redirects
json_typegen=0.6, for generating rust structs from a json representation of the snakemake object
lazy_static=1.4, to make a snakemake instance easily accessible
serde=1.0, explicit dependency of json_typegen
serde_derive=1.0, explicit dependency of json_typegen
serde_json=1.0, explicit dependency of json_typegen

If your script uses any of these packages, you do not need to import them with a use declaration in your script. Adding a use declaration for them will cause a compilation error.

Bash

Bash scripts work much the same as the other script languages above, but with some important differences. Access to the rule’s directives is provided through the use of associative arrays - requiring Bash version 4.0 or greater. One “limitation” of associative arrays is they cannot be nested. As such, the following rule directives are found in a separate variable, named as snakemake_<directive>:

input
output
log
wildcards
resources
params
config

Access to the input directive is facilitated through the bash associative array named snakemake_input. The remaining directives can be found in the variable snakemake.

Note

As arrays cannot be nested in Bash, use of python’s dict in directives is not supported. So, adding a params key of data={"foo": "bar"} will not be reflected - ${snakemake_params[data]} actually only returns "foo".
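The flattening described in this note can be pictured with a small Python sketch. It is not Snakemake's actual implementation: the helpers flatten_value and to_bash_assoc are hypothetical, and the sketch assumes that any non-string container is joined into one space-separated string by iterating over it (which, for a dict, yields only its keys). Quoting and escaping of special characters are deliberately ignored.

```python
def flatten_value(value):
    """Collapse a value into a single string, since Bash arrays cannot nest."""
    if isinstance(value, str):
        return value
    try:
        # Iterating a list yields its items; iterating a dict yields only its keys.
        return " ".join(str(item) for item in value)
    except TypeError:
        # Scalars such as integers are not iterable.
        return str(value)


def to_bash_assoc(name, positional=(), named=None):
    """Render a hypothetical 'declare -A' line for one directive."""
    entries = {str(i): flatten_value(v) for i, v in enumerate(positional)}
    entries.update({k: flatten_value(v) for k, v in (named or {}).items()})
    body = " ".join(f'[{key}]="{val}"' for key, val in entries.items())
    return f"declare -A {name}=( {body} )"


# A dict-valued param loses its values: only the key "foo" survives.
print(to_bash_assoc("snakemake_params", named={"data": {"foo": "bar"}}))
# A list-valued input becomes one space-separated string.
print(to_bash_assoc("snakemake_input", named={"reads": ["a_R1.fq", "a_R2.fq"]}))
```

Under these assumptions the first call prints declare -A snakemake_params=( [data]="foo" ), which matches the behavior the note warns about.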

Bash Example 1

rule align:
    input:
        "{sample}.fq",
        reference="ref.fa",
    output:
        "{sample}.sam"
    params:
        opts="-a -x map-ont",
    threads: 4
    log:
        "align/{sample}.log"
    conda:
        "envs/align.yaml"
    script:
        "scripts/align.sh"

align.sh

#!/usr/bin/env bash

echo "Aligning sample ${snakemake_wildcards[sample]} with minimap2" > "${snakemake_log[0]}"

minimap2 ${snakemake_params[opts]} -t ${snakemake[threads]} "${snakemake_input[reference]}" \
    "${snakemake_input[0]}" > "${snakemake_output[0]}" 2>> "${snakemake_log[0]}"

If you don’t add a shebang, the default #!/usr/bin/env bash will be inserted for you. A tutorial on how to use associative arrays can be found here.

You may also have noticed the mixed use of double-quotes when accessing some variables. It is generally good practice in Bash to double-quote variables for which you want to prevent word splitting; generally, you will want to double-quote any variable that could contain a file name. However, in some cases, word splitting is desired, such as ${snakemake_params[opts]} in the above example.

Bash Example 2

rule align:
    input:
        reads=["{sample}_R1.fq", "{sample}_R2.fq"],
        reference="ref.fa",
    output:
        "{sample}.sam"
    params:
        opts="-M",
    threads: 4
    log:
        "align/{sample}.log"
    conda:
        "envs/align.yaml"
    script:
        "scripts/align.sh"

In this example, the input variable reads, which is a python list, actually gets stored as a space-separated string in Bash because, you guessed it, you can’t nest arrays in Bash! So in order to access the individual members, we turn the string into an array; allowing us to access individual elements of the list/array. See this stackoverflow question for other solutions.

align.sh

#!/usr/bin/env bash

exec 2> "${snakemake_log[0]}"  # send all stderr from this script to the log file

reads=(${snakemake_input[reads]})  # don't double-quote this - we want word splitting

r1="${reads[0]}"
r2="${reads[1]}"

bwa index "${snakemake_input[reference]}"
bwa mem ${snakemake_params[opts]} -t ${snakemake[threads]} \
    "${snakemake_input[reference]}" "$r1" "$r2" > "${snakemake_output[0]}"

If, in the above example, the fastq reads were not in a named variable, but were instead just a list, they would be available as "${snakemake_input[0]}" and "${snakemake_input[1]}".

For technical reasons, scripts are executed in .snakemake/scripts. The original script directory is available as scriptdir in the snakemake object.

Jupyter notebook integration

Instead of plain scripts (see above), one can integrate Jupyter Notebooks. This enables the interactive development of data analysis components (e.g. for plotting). Integration works as follows (note the use of notebook: instead of script:):

rule hello:
    output:
        "test.txt"
    log:
        # optional path to the processed notebook
        notebook="logs/notebooks/processed_notebook.ipynb"
    notebook:
        "notebooks/hello.py.ipynb"

Note

Consider Jupyter notebook integration as a way to get the best of both worlds: a modular, readable workflow definition with Snakemake, and the ability to quickly explore and plot data with Jupyter. The benefit will be greatest when integrating many small notebooks that each do a particular job, which lets you move away from large, monolithic, and therefore unreadable notebooks.

It is recommended to prefix the .ipynb suffix with either .py or .r to indicate the notebook language. In the notebook, a snakemake object is available, which can be accessed in the same way as with the script integration. In other words, you have access to input files via snakemake.input (in the Python case) and snakemake@input (in the R case), etc. Optionally, the processed notebook can be stored automatically. This can be achieved by adding a named logfile notebook=... to the log directive.

Note

It is possible to refer to wildcards and params in the notebook path, e.g. by specifying "notebook/{params.name}.py" or "notebook/{wildcards.name}.py".

Normally, notebooks are executed headlessly (without a Jupyter interface being presented to you). This is achieved with Papermill if that is installed in your software environment, or nbconvert otherwise. The latter will be installed automatically along with Jupyter, but will not output an executed (logfile) notebook until the entire execution is complete, and won’t output a notebook if execution encounters an error.

In order to simplify the coding of notebooks given the automatically inserted snakemake object, Snakemake provides an interactive edit mode for notebook rules. Let us assume you have written the above rule, but the notebook does not yet exist. By running

snakemake --cores 1 --edit-notebook test.txt

you instruct Snakemake to allow interactive editing of the notebook needed to create the file test.txt. Snakemake will run all dependencies of the notebook rule, such that all input files are present. Then, it will start a Jupyter notebook server with an empty draft of the notebook, in which you can interactively program everything needed for this particular step. Once done, you should save the notebook from the Jupyter web interface, go to the Jupyter dashboard and hit the Quit button on the top right in order to shut down the Jupyter server. Snakemake will detect that the server is closed and automatically store the drafted notebook into the path given in the rule (here hello.py.ipynb). If the notebook already exists, the above procedure can be used to easily modify it. Note that Snakemake requires local execution for the notebook edit mode. On a cluster or the cloud, you can generate all dependencies of the notebook rule via

snakemake --cluster ... --jobs 100 --until test.txt

Then, the notebook rule can easily be executed locally.

Finally, it is advisable to combine the notebook directive with the conda directive (see Integrated Package Management) in order to define a software stack to use. At least, this software stack should contain jupyter and the language to use (e.g. Python or R). For the above case, this means

rule hello:
    output:
        "test.txt"
    conda:
        "envs/hello.yaml"
    notebook:
        "notebooks/hello.py.ipynb"

with

channels:
  - conda-forge
dependencies:
  - python =3.8
  - jupyter =1.0
  - jupyterlab_code_formatter =1.4

The last dependency is advisable in order to enable autoformatting of notebook cells when editing. When using other languages than Python in the notebook, one needs to additionally add the respective kernel, e.g. r-irkernel for R support.

When using an IDE with built-in Jupyter support, an alternative to --edit-notebook is --draft-notebook. Instead of firing up a notebook server, --draft-notebook just creates a skeleton notebook for editing within the IDE. In addition, it prints instructions for configuring the IDE’s notebook environment to use the interpreter from the Conda environment defined in the corresponding rule. For example, running

snakemake --cores 1 --draft-notebook test.txt --software-deployment-method conda

or the short form

snakemake -c 1 --draft-notebook test.txt --sdm conda

will generate skeleton code in notebooks/hello.py.ipynb and additionally print instructions on how to open and execute the notebook in VSCode.

Protected and Temporary Files

A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        protected("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

A protected file will be write-protected after the rule that produces it is completed.
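What "write-protected" means at the file-system level can be illustrated with a short Python sketch. This is only an illustration under the assumption that protection amounts to clearing the write permission bits; the helper write_protect is hypothetical and Snakemake's exact behavior may differ.

```python
import os
import stat
import tempfile


def write_protect(path):
    """Clear the write permission bits for user, group and others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


with tempfile.TemporaryDirectory() as workdir:
    output = os.path.join(workdir, "outputfile")
    with open(output, "w") as handle:
        handle.write("expensive result\n")
    write_protect(output)
    # Read the permission bits back to confirm no write bit is left.
    protected_mode = stat.S_IMODE(os.stat(output).st_mode)
    print(oct(protected_mode))
```

The read and execute bits are left untouched, so downstream rules can still consume the file.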

Further, an output file marked as temp is deleted after all rules that use it as an input are completed:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

Auto-grouping via temp files upon remote execution

For performance reasons, it is sometimes useful to write intermediate files on a faster storage, e.g., attached locally on the cluster compute node rather than shared over the network (and thus neither visible to the main snakemake process that submits jobs to the cluster, nor to other nodes of the cluster). Snakemake (since version 9.0) allows files marked as temp to use the option group_jobs to indicate that rules creating and consuming them should be automatically grouped together so Snakemake will schedule them to run on the same physical node:

rule NAME1:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/intermediatefile", group_jobs=True)
    shell:
        "somecommand {input} {output}"

rule NAME2:
    input:
        "path/to/intermediatefile"
    output:
        "path/to/outputfile"
    shell:
        "someothercommand {input} {output}"

Directories as outputs

Sometimes it can be convenient to have directories, rather than files, as outputs of a rule. As of version 5.2.0, directories as outputs have to be explicitly marked with directory. This is primarily for safety reasons; since all outputs are deleted before a job is executed, we don’t want to risk deleting important directories if the user makes some mistake. Marking the output as directory makes the intent clear, and the output can be safely removed. Another reason comes down to how modification times for directories work. The modification time on a directory changes when a file or a subdirectory is added, removed or renamed. This can easily happen in not-quite-intended ways, such as when Apple macOS or MS Windows add .DS_Store or thumbs.db files to store parameters for how the directory contents should be displayed. When the directory flag is used, a hidden file called .snakemake_timestamp is created in the output directory, and the modification time of that file is used when determining whether the rule output is up to date or needs to be rerun. Always consider whether you can formulate your workflow using normal files before resorting to directory().

rule NAME:
    input:
        "path/to/inputfile"
    output:
        directory("path/to/outputdir")
    shell:
        "somecommand {input} {output}"

Ignoring timestamps

For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file. Please note, however, that for small input files (by default up to 1 MB, controlled by --max-checksum-file-size), Snakemake instead records and compares file checksums and only reruns the rule if the input file checksum has changed, even if the timestamp of the input file is newer than that of the output file(s). This overall behavior can be overridden by marking an input file as ancient. The timestamp of such a file is ignored and always assumed to be older than any of the output files:

rule NAME:
    input:
        ancient("path/to/inputfile")
    output:
        "path/to/outputfile"
    shell:
        "somecommand {input} {output}"

Here, this means that the file path/to/outputfile will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future. Note that any flag that forces re-creation of files still also applies to files marked as ancient.
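The decision logic described in this section can be summarized in a simplified Python model. It is a sketch, not Snakemake's code: the function needs_rerun, the use of SHA-256 as the checksum, and the byte value of the size threshold are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile

MAX_CHECKSUM_FILE_SIZE = 1_000_000  # assumed threshold in bytes (about 1 MB)


def file_checksum(path):
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def needs_rerun(input_path, output_path, recorded_checksum=None, ancient=False):
    """Decide whether the output is outdated with respect to one input file."""
    if ancient:
        # ancient(): the input is always treated as older than the output.
        return False
    if os.path.getmtime(input_path) <= os.path.getmtime(output_path):
        return False
    small = os.path.getsize(input_path) <= MAX_CHECKSUM_FILE_SIZE
    if small and recorded_checksum is not None:
        # Small input: a newer timestamp only matters if the content changed.
        return file_checksum(input_path) != recorded_checksum
    return True


with tempfile.TemporaryDirectory() as workdir:
    inp = os.path.join(workdir, "inputfile")
    out = os.path.join(workdir, "outputfile")
    with open(inp, "w") as handle:
        handle.write("data\n")
    with open(out, "w") as handle:
        handle.write("result\n")
    recorded = file_checksum(inp)
    # Pretend the input was touched 10 seconds after the output was written.
    out_mtime = os.path.getmtime(out)
    os.utime(inp, (out_mtime + 10, out_mtime + 10))

    touched_only = needs_rerun(inp, out, recorded_checksum=recorded)
    no_checksum = needs_rerun(inp, out)
    marked_ancient = needs_rerun(inp, out, ancient=True)
    print(touched_only, no_checksum, marked_ancient)  # False True False
```

In this model, a touched but unchanged small input does not trigger a rerun, a newer input without a recorded checksum does, and an ancient input never does.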

Ensuring output file properties like non-emptiness or checksum compliance

It is possible to annotate certain additional criteria for output files to be ensured after they have been generated successfully. For example, this can be used to check for output files to be non-empty, or to compare them against a given sha256 checksum. If this functionality is used, Snakemake will check such annotated files before considering a job to be successful. Non-emptiness can be checked as follows:

rule NAME:
    output:
        ensure("test.txt", non_empty=True)
    shell:
        "somecommand {output}"

Above, the output file test.txt is marked as non-empty. If the command somecommand happens to generate an empty output, the job will fail with an error listing the unexpected empty file.

A sha256 (or md5 or sha1) checksum can be compared as follows (using the corresponding keyword arguments sha256=, md5=, or sha1=):

my_checksum = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

rule NAME:
    output:
        ensure("test.txt", sha256=my_checksum)
    shell:
        "somecommand {output}"

In addition to providing the checksum as plain string, it is possible to provide a pointer to a function (similar to input functions). The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files:

def get_checksum(wildcards):
    # e.g., look up the checksum with the value of the wildcard sample
    # in some dictionary
    return my_checksums[wildcards.sample]

rule NAME:
    output:
        ensure("test/{sample}.txt", sha256=get_checksum)
    shell:
        "somecommand {output}"

Note that you can also use lambda expressions instead of full function definitions.
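The checksum given to ensure has to be known beforehand. One way to obtain it, sketched here with Python's standard hashlib module, is to hash a trusted copy of the file. The helper name sha256_of_file is hypothetical. The value of my_checksum used in the example above is the SHA-256 digest of the four bytes "test" (without a trailing newline), which the sketch reproduces.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so that large files need not fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "test.txt")
    with open(path, "wb") as handle:
        handle.write(b"test")  # four bytes, no trailing newline
    checksum = sha256_of_file(path)
    is_non_empty = os.path.getsize(path) > 0
    print(checksum)
```

For md5= or sha1=, hashlib.md5() or hashlib.sha1() can be substituted for hashlib.sha256().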

Often, it is a good idea to combine ensure annotations with retry definitions, e.g. for retrying upon invalid checksums or empty files.

Shadow rules

Shadow rules cause each execution of the rule to run in an isolated temporary directory. This “shadow” directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don’t want to manually clean up in your Snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or simplify your workflow by not having to worry about unique filenames for all outputs of all rules.

By setting shadow: "shallow", the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting shadow: "full" fully shadows the entire subdirectory structure of the current workdir. The setting shadow: "minimal" only symlinks the inputs to the rule, and shadow: "copy-minimal" copies the inputs instead of just creating symlinks. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by output.

Typically, you will not need to modify your rule for compatibility with shadow, unless you reference parent directories relative to your workdir in a rule.

rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shadow: "shallow"
    shell: "somecommand --other_outputs other.txt {input} {output}"

Shadow directories are stored one per rule execution in .snakemake/shadow/, and are cleared on successful execution. Consider running with the --cleanup-shadow argument every now and then to remove any remaining shadow directories from aborted jobs. The base shadow directory can be changed with the --shadow-prefix command line argument.
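The idea behind the "minimal" shadow mode can be sketched in a few lines of Python. This is a rough model under simplifying assumptions, not Snakemake's implementation: the function run_in_minimal_shadow is hypothetical, the shadow directory is created directly inside the workdir rather than under .snakemake/shadow/, and the command is a Python callable that receives the shadow directory as its working directory.

```python
import os
import shutil
import tempfile


def run_in_minimal_shadow(workdir, inputs, output, command):
    """Symlink the inputs into a fresh directory, run, then move the output back."""
    shadow = tempfile.mkdtemp(prefix="shadow-", dir=workdir)
    try:
        for rel in inputs:
            link = os.path.join(shadow, rel)
            os.makedirs(os.path.dirname(link), exist_ok=True)
            os.symlink(os.path.abspath(os.path.join(workdir, rel)), link)
        command(shadow)
        target = os.path.join(workdir, output)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(os.path.join(shadow, output), target)
    finally:
        # Everything else the command wrote is discarded with the shadow directory.
        shutil.rmtree(shadow)


def somecommand(cwd):
    """Stand-in for a tool that also writes an unwanted side file."""
    with open(os.path.join(cwd, "in.txt")) as src:
        data = src.read()
    with open(os.path.join(cwd, "other.txt"), "w") as junk:
        junk.write("scratch\n")
    with open(os.path.join(cwd, "out.txt"), "w") as dst:
        dst.write(data.upper())


with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "in.txt"), "w") as handle:
        handle.write("hello\n")
    run_in_minimal_shadow(workdir, ["in.txt"], "out.txt", somecommand)
    remaining = sorted(os.listdir(workdir))
    print(remaining)  # ['in.txt', 'out.txt']
```

The side file other.txt never reaches the workdir, which is the effect the shadow directive is meant to achieve.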

Defining retries for fallible rules

Sometimes, rules may be expected to fail occasionally. For example, this can happen when a rule downloads some online resources. For such cases, it is possible to define a number of automatic retries for each job from that particular rule via the retries directive:

rule a:
    output:
        "test.txt"
    retries: 3
    shell:
        "curl https://some.unreliable.server/test.txt > {output}"

Often, it is a good idea to combine retry functionality with ensure annotations, e.g. for retrying upon invalid checksums or empty files.

Note that it is also possible to define retries globally (via the --retries command line option, see All Options). The local definition of the rule thereby overrides the global definition.

Importantly, the retries directive is meant to be used for defining platform-independent behavior (like adding robustness to the above download command). For dealing with unreliable cluster or cloud systems, you should use the --retries command line option.
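The interplay of retries with an output check such as non-emptiness can be modeled in plain Python. The sketch below is not how Snakemake schedules jobs; run_with_retries and flaky_download are hypothetical, and it assumes that retries: 3 means one initial attempt plus up to three further attempts.

```python
def run_with_retries(job, retries, is_valid=lambda result: True):
    """Run job(attempt) until it succeeds and validates, at most 1 + retries times."""
    last_error = None
    for attempt in range(1, retries + 2):
        try:
            result = job(attempt)
            if is_valid(result):
                return result, attempt
            last_error = ValueError(f"attempt {attempt} produced invalid output")
        except Exception as error:
            last_error = error
    raise RuntimeError(f"job failed after {retries + 1} attempts") from last_error


def flaky_download(attempt):
    """Hypothetical job: fails once, returns empty data once, then succeeds."""
    if attempt == 1:
        raise ConnectionError("server unavailable")
    if attempt == 2:
        return b""
    return b"payload"


content, attempts_used = run_with_retries(
    flaky_download, retries=3, is_valid=lambda data: len(data) > 0
)
print(attempts_used)  # 3
```

Here the validity check plays the role of an ensure annotation: an empty result counts as a failure and triggers another attempt.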

Flag files

Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by “touching” empty files that denote that a certain task was completed. Snakemake supports this via the touch flag:

rule all:
    input: "mytask.done"

rule mytask:
    output: touch("mytask.done")
    shell: "mycommand ..."

With the touch flag, Snakemake touches (i.e. creates or updates) the file mytask.done after mycommand has finished successfully.

Job Properties

Note

If there are more than 100 input and/or output files for a job, None will be used instead of listing all values. This is to prevent the jobscript from becoming larger than Slurm jobscript size limits.

When executing a workflow on a cluster using the --cluster parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:

#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# do something useful with the threads
threads = job_properties["threads"]

# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]

os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
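To make the idea of properties stored as JSON inside the job script more concrete, the following sketch parses them by hand. It assumes the properties appear on a comment line of the form "# properties = {...}"; that layout, the function parse_job_properties and the example job script are assumptions for illustration only. In a real wrapper, snakemake.utils.read_job_properties should be used instead.

```python
import json
import re

# Assumed layout of the embedded properties line.
PROPERTIES_PATTERN = re.compile(r"# properties = (.*)")


def parse_job_properties(jobscript_text):
    """Return the embedded job properties as a dict, or None if absent."""
    for line in jobscript_text.splitlines():
        match = PROPERTIES_PATTERN.match(line)
        if match:
            return json.loads(match.group(1))
    return None


# Hypothetical job script content.
example_jobscript = """#!/bin/sh
# properties = {"rule": "align", "threads": 4, "cluster": {"time": "01:00:00"}}
echo "job body would follow here"
"""

properties = parse_job_properties(example_jobscript)
# Building the command as a list avoids shell quoting issues when it is
# later passed to subprocess.run().
submit_command = ["qsub", "-t", str(properties["threads"]), "jobscript.sh"]
print(submit_command)  # ['qsub', '-t', '4', 'jobscript.sh']
```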

Code Tracking

Snakemake tracks the code that was used to create your files. In combination with --summary or --list-code-changes this can be used to see what files may need a re-run because the implementation changed. Re-run can be automated by invoking Snakemake as follows:

$ snakemake -R `snakemake --list-code-changes`

Onstart, onsuccess and onerror handlers

Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user). With Snakemake 3.2.1, this is possible via the onsuccess and onerror keywords:

onsuccess:
    print("Workflow finished, no error")

onerror:
    print("An error occurred")
    shell('mail -s "an error occurred" youremail@provider.com < {log}')

The onsuccess handler is executed if the workflow finished without error. Otherwise, the onerror handler is executed. In both handlers, you have access to the variable log, which contains the path to a logfile with the complete Snakemake output. Snakemake 3.6.0 adds an onstart handler, that will be executed before the workflow starts. Note that dry-runs do not trigger any of the handlers.

When you are using Modules, only the onstart, onsuccess and onerror handlers of the top-level Snakefile are executed. Handlers defined inside module Snakefiles are not triggered automatically. To access the handlers from a specific module’s Snakefile, you can use module_name.onstart, module_name.onsuccess and module_name.onerror.

module test1:
    snakefile:
        "module1/Snakefile"

use rule * from test1 as module1_*

onstart:
    test1.onstart()
onsuccess:
    test1.onsuccess()
onerror:
    test1.onerror()

Rule dependencies

From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

rule a:
    input:  "path/to/input"
    output: "path/to/output"
    shell:  ...

rule b:
    input:  rules.a.output
    output: "path/to/output/of/b"
    shell:  ...

Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows us to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:

rule a:
    input:  "path/to/input"
    output: a = "path/to/output", b = "path/to/output2"
    shell:  ...

rule b:
    input:  rules.a.output.a
    output: "path/to/output/of/b"
    shell:  ...

Handling Ambiguous Rules

When two rules can produce the same output file, Snakemake cannot decide which one to use without additional guidance. Hence an AmbiguousRuleException is thrown. To deal with such ambiguity, provide a ruleorder for the conflicting rules. Note that ruleorder is not intended to bring rules into the correct execution order (this is solely guided by the names of the input and output files you use); it only helps Snakemake decide which rule to use when several can create the same output file. For example:

ruleorder: rule1 > rule2 > rule3

Here, rule1 is preferred over rule2 and rule3, and rule2 is preferred over rule3. Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file.
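The preference logic can be expressed as a tiny Python function. This is only a model of the behavior described above, not Snakemake's resolution algorithm; choose_rule is a hypothetical helper and the set of applicable rules is assumed to be known already.

```python
def choose_rule(applicable, ruleorder):
    """Return the applicable rule that comes first in the given rule order."""
    for name in ruleorder:
        if name in applicable:
            return name
    raise LookupError("no applicable rule")


ruleorder = ["rule1", "rule2", "rule3"]

# rule1 cannot be applied (e.g. missing input files), so rule2 wins over rule3.
print(choose_rule({"rule2", "rule3"}, ruleorder))  # rule2
```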

Alternatively, rule dependencies (see above) can also resolve ambiguities.

Another (quick and dirty) possibility is to tell snakemake to allow ambiguity via a command line option

$ snakemake --allow-ambiguity

such that, similar to GNU Make, the first matching rule is always used. In this case, a warning that summarizes Snakemake's decision is printed to the terminal.

Local Rules

When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target-rule like all, see Target rules). The keyword localrules allows you to mark a rule as local, so that it is not submitted to the cluster and instead executed on the host node:

localrules: all, foo

rule all:
    input: ...

rule foo:
    ...

rule bar:
    ...

Here, only jobs from the rule bar will be submitted to the cluster, whereas all and foo will be run locally. Note that you can use the localrules directive multiple times. The result will be the union of all declarations.

Alternatively, you can also use the rule directive localrule:

rule all:
    input: ...
    localrule: True

rule foo:
    ...
    localrule: True

rule bar:
    ...

Benchmark Rules

Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule

rule benchmark_command:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        "benchmarks/somecommand/{sample}.tsv"
    shell:
        "somecommand {input} {output}"

benchmarks the command somecommand for the given output and input files and records the following metrics:

s: Wall clock time (in seconds),
h:m:s: Wall clock time (in hour:minutes:seconds),
max_rss: Max RSS memory usage (in megabytes),
max_vms: Max VMS memory usage (in megabytes),
max_uss: Max USS memory usage (in megabytes),
max_pss: Max PSS memory usage (in megabytes),
io_in: I/O read (in bytes),
io_out: I/O written (in bytes),
mean_load: CPU load = CPU time (cpu_usage) divided by wall clock time (s),
cpu_time: CPU time user+system (seconds).

Since version 8.11.0, extra benchmark metrics can be recorded with the command line option --benchmark-extended:

jobid: Internal job ID,
rule_name: Rule name,
wildcards: Job wildcards,
params: Job parameters,
threads: Number of threads requested for this job,
cpu_usage: Total CPU load,
resources: Resources requested for this job,
input_size_mb: Size of input files (MiB).

For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark tsv file (which will contain a tab-separated table of run times and memory usage in MiB). By default, Snakemake executes the job once, generating one run time. However, the benchmark file can be annotated with the desired number of repeats, e.g.,

rule benchmark_command:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        repeat("benchmarks/somecommand/{sample}.tsv", 3)
    shell:
        "somecommand {input} {output}"

will instruct Snakemake to run each job of this rule three times and store all measurements in the benchmark file. The resulting tsv file can be used as input for other rules, just like any other output file.
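Because the benchmark file is an ordinary tab-separated table, a downstream rule or script can summarize it with the Python standard library. The sketch below uses made-up numbers and only a subset of the columns listed above (s, h:m:s and max_rss); the table content is hypothetical and stands in for a file produced with three repeats.

```python
import csv
import io
import statistics

# Hypothetical benchmark table with one row per repeat.
benchmark_tsv = (
    "s\th:m:s\tmax_rss\n"
    "12.5\t0:00:12\t101.2\n"
    "11.5\t0:00:11\t99.8\n"
    "12.0\t0:00:12\t100.4\n"
)

rows = list(csv.DictReader(io.StringIO(benchmark_tsv), delimiter="\t"))
mean_seconds = statistics.mean(float(row["s"]) for row in rows)
peak_rss = max(float(row["max_rss"]) for row in rows)
print(len(rows), mean_seconds, peak_rss)  # 3 12.0 101.2
```

With a real file, io.StringIO(benchmark_tsv) would be replaced by an open file handle on the path given in the benchmark directive.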

Since version 8.11.0, it is also possible to have the benchmark metrics in different formats (depending on the extension); currently only the .jsonl extension (JSONL format; i.e. one JSON record per line) is supported and all other extensions will be treated as TSV.

Note

Note that benchmarking is only possible in a reliable fashion for subprocesses (thus for tasks run through the shell, script, and wrapper directives). In the run block, the variable bench_record is available that you can pass to shell() as bench_record=bench_record. When using shell(..., bench_record=bench_record), the maximum of all measurements across all shell() calls is recorded, except for the running time, which is that of the entire rule execution including any Python code.

Defining scatter-gather processes

Via Snakemake’s powerful and arbitrary Python based aggregation abilities (via the expand function and arbitrary Python code, see here), scatter-gather workflows are well supported. Nevertheless, it can sometimes be handy to use Snakemake’s specific scatter-gather support, which allows you to avoid boilerplate and offers additional configuration options. Scatter-gather processes can be defined via a global scattergather directive:

scattergather:
    split=8

Each process thereby defines a name (here e.g. split) and a default number of scatter items. Then, scattering and gathering can be implemented by using globally available scatter and gather objects:

rule all:
    input:
        "gathered/all.txt"


rule split:
    output:
        scatter.split("split/{scatteritem}.txt")
    shell:
        "touch {output}"


rule intermediate:
    input:
        "split/{scatteritem}.txt"
    output:
        "split/{scatteritem}.post.txt"
    shell:
        "cp {input} {output}"


rule gather:
    input:
        gather.split("split/{scatteritem}.post.txt")
    output:
        "gathered/all.txt"
    shell:
        "cat {input} > {output}"

Thereby, scatter.split("split/{scatteritem}.txt") yields a list of paths "split/1-of-n.txt", "split/2-of-n.txt", …, depending on the number n of scatter items defined. Analogously, gather.split("split/{scatteritem}.post.txt") yields a list of paths "split/1-of-n.post.txt", "split/2-of-n.post.txt", …, which request the application of the rule intermediate to each scatter item.
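
The "i-of-n" naming of scatter items can be mimicked in a few lines of plain Python. This is only a sketch of the resulting file names, not Snakemake's implementation; the helper name scatter_paths is hypothetical:

```python
def scatter_paths(pattern, n):
    """Expand the {scatteritem} wildcard for n scatter items (hypothetical helper)."""
    return [pattern.format(scatteritem=f"{i}-of-{n}") for i in range(1, n + 1)]


print(scatter_paths("split/{scatteritem}.txt", 2))
# prints ['split/1-of-2.txt', 'split/2-of-2.txt']
```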

The default number of scatter items can be overwritten via the command line interface. For example

snakemake --set-scatter split=2

would set the number of scatter items for the split process defined above to 2 instead of 8. This makes it possible to adapt parallelization to the needs of the underlying computing platform and the analysis at hand.

For more complex workflows it’s possible to define multiple processes, for example:

scattergather:
    split_a=8,
    split_b=3,

The calls to scatter and gather would need to reference the appropriate process name, e.g. scatter.split_a and gather.split_a to use the split_a settings.

Defining groups for execution

From Snakemake 5.0 on, it is possible to assign rules to groups. Such groups will be executed together in cluster or cloud mode, as a so-called group job, i.e., all jobs of a particular group will be submitted at once, to the same computing node. When executing locally, group definitions are ignored.

Groups can be defined via the group keyword. This way, queueing and execution time can be saved, in particular if one or several short-running rules are involved.

samples = [1,2,3,4,5]


rule all:
    input:
        "test.out"


rule a:
    output:
        "a/{sample}.out"
    group: "mygroup"
    shell:
        "touch {output}"


rule b:
    input:
        "a/{sample}.out"
    output:
        "b/{sample}.out"
    group: "mygroup"
    shell:
        "touch {output}"


rule c:
    input:
        expand("b/{sample}.out", sample=samples)
    output:
        "test.out"
    shell:
        "touch {output}"

Here, jobs from rule a and b end up in one group mygroup, whereas jobs from rule c are executed separately. Note that Snakemake always determines a connected subgraph with the same group id to be a group job. Here, this means that, e.g., the jobs creating a/1.out and b/1.out will be in one group, and the jobs creating a/2.out and b/2.out will be in a separate group. However, if we added group: "mygroup" to rule c, all jobs would end up in a single group, including the one spawned from rule c, because c connects all the other jobs.
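
The "connected subgraph with the same group id" criterion can be illustrated with a small union-find sketch over the job graph of this example. The function group_jobs and the job names (a1, b1, …, c) are hypothetical stand-ins, not Snakemake internals:

```python
from collections import defaultdict


def group_jobs(jobs, edges):
    """Partition grouped jobs into group jobs (connected components per group id).

    jobs: dict mapping job name -> group id (or None for ungrouped jobs)
    edges: iterable of (upstream, downstream) job name pairs
    """
    parent = {job: job for job in jobs}

    def find(job):
        while parent[job] != job:
            parent[job] = parent[parent[job]]  # path halving
            job = parent[job]
        return job

    for up, down in edges:
        # only merge directly connected jobs that share the same group id
        if jobs[up] is not None and jobs[up] == jobs[down]:
            parent[find(up)] = find(down)

    components = defaultdict(set)
    for job, group in jobs.items():
        if group is not None:
            components[find(job)].add(job)
    return sorted(sorted(c) for c in components.values())


samples = [1, 2, 3, 4, 5]
edges = [(f"a{s}", f"b{s}") for s in samples] + [(f"b{s}", "c") for s in samples]

jobs = {f"{r}{s}": "mygroup" for r in "ab" for s in samples}
jobs["c"] = None
print(len(group_jobs(jobs, edges)))  # 5: one group job per sample

jobs["c"] = "mygroup"
print(len(group_jobs(jobs, edges)))  # 1: c connects everything into one group job
```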

Alternatively, groups can be defined via the command line interface. This makes it possible to partition the DAG almost arbitrarily, e.g. in order to save network traffic, see here.

For execution on the cloud using the Google Life Sciences API and preemptible instances, we expect all rules in the group to be homogeneously set as preemptible instances (e.g., with command-line option --preemptible-rules), such that a preemptible VM is requested for the execution of the group job.

Group-local jobs

From Snakemake 7.0 on, it is further possible to ensure that jobs from a certain rule are executed separately within each job group. For this purpose we use input functions, which, in addition to the wildcards argument, can expect a groupid argument. In such a case, Snakemake passes the ID of the corresponding group job to the input function. Consider the following example:

rule all:
    input:
        expand("bar{i}.txt", i=range(3))


rule grouplocal:
    output:
        "foo.{groupid}.txt"
    group:
        "foo"
    shell:
        "echo test > {output}"


def get_input(wildcards, groupid):
    return f"foo.{groupid}.txt"


rule consumer:
    input:
        get_input
    output:
        "bar{i}.txt"
    group:
        "foo"
    shell:
        "cp {input} {output}"

Here, the value of groupid that is passed by Snakemake to the input function is a UUID that uniquely identifies the group job in which each instance of the rule consumer is contained. In the input function get_input we use this ID to request the desired input file from the rule grouplocal. Since the value of the corresponding wildcard groupid is now always a group specific unique ID, it is ensured that the rule grouplocal will run for every group job spawned from the group foo (remember that group jobs by default only span one connected component, and that this can be configured via the command line, see Job Grouping). Of course, the above example would also work if the groups were not specified via the rule definition but entirely via the command line.

Piped output

From Snakemake 5.0 on, it is possible to mark output files as pipes, via the pipe flag, e.g.:

rule all:
    input:
        expand("test.{i}.out", i=range(2))


rule a:
    output:
        pipe("test.{i}.txt")
    shell:
        "for i in {{0..2}}; do echo {wildcards.i} >> {output}; done"


rule b:
    input:
        "test.{i}.txt"
    output:
        "test.{i}.out"
    shell:
        "grep {wildcards.i} < {input} > {output}"

If an output file is marked to be a pipe, then Snakemake will first create a named pipe with the given name and then execute the creating job simultaneously with the consuming job, inside a group job (see above). This works in all execution modes, local, cluster, and cloud. Naturally, a pipe output may only have a single consumer. It is possible to combine explicit group definition as above with pipe outputs. Thereby, pipe jobs can live within, or (automatically) extend existing groups. However, the two jobs connected by a pipe may not exist in conflicting groups.
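
The underlying mechanism (a named pipe whose producer and consumer must run at the same time) can be sketched in plain Python on Unix-like systems. Here a thread stands in for the producing job and the main thread for the consuming job; the function names are hypothetical and this is not how Snakemake itself schedules the jobs:

```python
import os
import tempfile
import threading


def producer(path, value):
    # opening a FIFO for writing blocks until a reader has opened it as well
    with open(path, "w") as pipe:
        for _ in range(3):
            pipe.write(f"{value}\n")


def consume(path, value):
    # reads until the producer closes its end of the pipe
    with open(path) as pipe:
        return [line.strip() for line in pipe if value in line]


def run_piped(value):
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, f"test.{value}.txt")
        os.mkfifo(fifo)  # POSIX only
        writer = threading.Thread(target=producer, args=(fifo, value))
        writer.start()
        lines = consume(fifo, value)
        writer.join()
        return lines
```

Because neither side can make progress without the other, running producer and consumer one after the other would block forever, which is why both jobs have to be executed simultaneously.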

As with other groups, Snakemake will automatically calculate the required resources for the group job (see resources).

Service rules/jobs

From Snakemake 7.0 on, it is possible to define so-called service rules. Jobs spawned from such rules provide at least one special output file that is marked as service, which means that it is considered to provide a resource that shall be kept available until all consuming jobs are finished. This can for example be the socket of a database, a shared memory device, a ramdisk, and so on. It can even just be a dummy file, and access to the service might happen via a different channel (e.g. a local http port). Service jobs are expected to not exit after creating that resource, but instead wait until Snakemake terminates them (e.g. via SIGTERM on Unixoid systems).

Consider the following example:

rule the_service:
    output:
        service("foo.socket")
    shell:
        # here we simulate some kind of server process that provides data via a socket
        "ln -s /dev/random {output}; sleep 10000"


rule consumer1:
    input:
        "foo.socket"
    output:
        "test.txt"
    shell:
        "head -n1 {input} > {output}"


rule consumer2:
    input:
        "foo.socket"
    output:
        "test2.txt"
    shell:
        "head -n1 {input} > {output}"

Snakemake will schedule the service with all consumers to the same physical node (in the future we might provide further controls and other modes of operation). Once all consumer jobs are finished, the service job will be terminated automatically by Snakemake, and the service output will be removed.

Group-local service jobs

Since Snakemake supports arbitrary partitioning of the DAG into so-called job groups, one should consider what this implies for service jobs when running a workflow in a cluster or cloud context: since each group job spans at least one connected component (see job groups and the Snakemake paper <https://doi.org/10.12688/f1000research.29032.2>), this means that the service job will automatically connect all consumers into one big group. This can be undesired, because depending on the number of consumers that group job can become too big for efficient execution on the underlying architecture. In case of local execution, this is not a problem because here DAG partitioning has no effect.

However, to make a workflow portable across different backends, this behavior should always be considered. In order to circumvent it, it is possible to model service jobs as group-local, i.e. ensuring that each group job gets its own instance of the service rule. This works by combining the service job pattern from above with the group-local pattern as follows:

rule the_service:
    output:
        service("foo.{groupid}.socket")
    shell:
        # here we simulate some kind of server process that provides data via a socket
        "ln -s /dev/random {output}; sleep 10000"


def get_socket(wildcards, groupid):
    return f"foo.{groupid}.socket"


rule consumer1:
    input:
        get_socket
    output:
        "test.txt"
    shell:
        "head -n1 {input} > {output}"


rule consumer2:
    input:
        get_socket
    output:
        "test2.txt"
    shell:
        "head -n1 {input} > {output}"

Parameter space exploration

The basic Snakemake functionality already provides everything to handle parameter spaces in any way (sub-spacing for certain rules and even depending on wildcard values, the ability to read or generate spaces on the fly or from files via pandas, etc.). However, it usually would require some boilerplate code for translating a parameter space into wildcard patterns, and translating it back into concrete parameters for scripts and commands. From Snakemake 5.31 on (inspired by JUDI), this is solved via the Paramspace helper, which can be used as follows:

from snakemake.utils import Paramspace
import pandas as pd

# declare a dataframe to be a paramspace
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))


rule all:
    input:
        # Aggregate over entire parameter space (or a subset thereof if needed)
        # of course, something like this can happen anywhere in the workflow (not
        # only at the end).
        expand("results/plots/{params}.pdf", params=paramspace.instance_patterns)


rule simulate:
    output:
        # format a wildcard pattern like "alpha~{alpha}/beta~{beta}/gamma~{gamma}"
        # into a file path, with alpha, beta, gamma being the columns of the data frame
        f"results/simulations/{paramspace.wildcard_pattern}.tsv"
    params:
        # automatically translate the wildcard values into an instance of the param space
        # in the form of a dict (here: {"alpha": ..., "beta": ..., "gamma": ...})
        simulation=paramspace.instance
    script:
        "scripts/simulate.py"


rule plot:
    input:
        f"results/simulations/{paramspace.wildcard_pattern}.tsv"
    output:
        f"results/plots/{paramspace.wildcard_pattern}.pdf"
    shell:
        "touch {output}"

In the above example, please note the Python f-string formatting (the f before the initial quotes) applied to the input and output file strings that contain paramspace.wildcard_pattern. This means that the file that is registered as input or output file by Snakemake does not contain a wildcard {paramspace.wildcard_pattern}, but instead this item is replaced by a pattern of multiple wildcards derived from the columns of the parameter space dataframe. This is done by the Python f-string formatting before the string is registered in the rule. Given that params.tsv contains:

alpha   beta    gamma
1.0     0.1     0.99
2.0     0.0     3.9

This workflow will run as follows:

[Fri Nov 27 20:57:27 2020]
rule simulate:
    output: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
    jobid: 4
    wildcards: alpha=2.0, beta=0.0, gamma=3.9

[Fri Nov 27 20:57:27 2020]
rule simulate:
    output: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
    jobid: 2
    wildcards: alpha=1.0, beta=0.1, gamma=0.99

[Fri Nov 27 20:57:27 2020]
rule plot:
    input: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
    output: results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
    jobid: 3
    wildcards: alpha=2.0, beta=0.0, gamma=3.9


[Fri Nov 27 20:57:27 2020]
rule plot:
    input: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
    output: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf
    jobid: 1
    wildcards: alpha=1.0, beta=0.1, gamma=0.99


[Fri Nov 27 20:57:27 2020]
localrule all:
    input: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf, results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
    jobid: 0

Naturally, it is possible to create sub-spaces from Paramspace objects, simply by applying all the usual methods and attributes that Pandas data frames provide (e.g. .loc[...], .filter() etc.). Further, the form of the created wildcard_pattern can be controlled via additional arguments of the Paramspace constructor. In particular, using the argument single_wildcard the default behavior of encoding each column as a wildcard can be replaced with a single given wildcard name. This can be handy in case a rule shall serve multiple param spaces with different sets of columns.
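
To make the translation between parameter rows and wildcard patterns more tangible, the following plain-Python sketch reproduces the default "column~value" encoding for the table shown above. It is an illustration only and does not use or mirror the Paramspace implementation; all function names are hypothetical, and values are kept as strings:

```python
import csv
import io

PARAMS_TSV = "alpha\tbeta\tgamma\n1.0\t0.1\t0.99\n2.0\t0.0\t3.9\n"


def read_rows(text):
    """Parse the tab-separated parameter table into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def wildcard_pattern(columns):
    """Build a pattern such as 'alpha~{alpha}/beta~{beta}/gamma~{gamma}'."""
    return "/".join(f"{col}~{{{col}}}" for col in columns)


def parse_instance(path):
    """Translate 'alpha~1.0/beta~0.1/...' back into a dict of parameter values."""
    return dict(part.split("~", 1) for part in path.split("/"))


rows = read_rows(PARAMS_TSV)
pattern = wildcard_pattern(rows[0].keys())
instances = [pattern.format(**row) for row in rows]
print(pattern)    # alpha~{alpha}/beta~{beta}/gamma~{gamma}
print(instances)  # ['alpha~1.0/beta~0.1/gamma~0.99', 'alpha~2.0/beta~0.0/gamma~3.9']
```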

Data-dependent conditional execution

From Snakemake 5.4 on, conditional reevaluation of the DAG of jobs based on the content of outputs is possible. The key idea is that rules can be declared as checkpoints, e.g.,

checkpoint somestep:
    input:
        "samples/{sample}.txt"
    output:
        "somestep/{sample}.txt"
    shell:
        "somecommand {input} > {output}"

Snakemake allows re-evaluating the DAG after the successful execution of every job spawned from a checkpoint. For this, every checkpoint is registered by its name in a globally available checkpoints object. The checkpoints object can be accessed by input functions. Assuming that the checkpoint is named somestep as above, the output files for a particular job can be retrieved with

checkpoints.somestep.get(sample="a").output

Note

Note that output files of checkpoints that are accessed via this mechanism will not be marked as temporary. Even if you try to mark them as temporary, Snakemake will ignore the label and keep the output files of the checkpoint. Reruns will not be triggered if the output files do not exist.

Thereby, the get method throws snakemake.exceptions.IncompleteCheckpointException if the checkpoint has not yet been executed for these particular wildcard value(s). Inside an input function, the exception will be automatically handled by Snakemake, and leads to a re-evaluation after the checkpoint has been successfully passed.

To illustrate the possibilities of this mechanism, consider the following complete example:

# a target rule to define the desired final output
rule all:
    input:
        "aggregated/sample1.txt",
        "aggregated/sample2.txt"


# generate per-sample input files; filenames are sample IDs,
# while file content ("a" or "b") controls downstream branching
rule generate_sample_input:
    output:
        "samples/{sample}.txt"
    run:
        import random

        with open(output[0], "w") as f:
            f.write(random.choice(["a", "b"]))


# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint somestep:
    input:
        "samples/{sample}.txt"
    output:
        "somestep/{sample}.txt"
    shell:
        # propagate generated value into checkpoint output
        "cp {input} {output}"


# intermediate rule
rule intermediate:
    input:
        "somestep/{sample}.txt"
    output:
        "post/{sample}.txt"
    shell:
        "touch {output}"


# alternative intermediate rule
rule alt_intermediate:
    input:
        "somestep/{sample}.txt"
    output:
        "alt/{sample}.txt"
    shell:
        "touch {output}"


# input function for the rule aggregate
def aggregate_input(wildcards):
    # decision based on content of output file
    # Important: use the method open() of the returned file!
    # This way, Snakemake is able to automatically download the file if it is generated in
    # a cloud environment without a shared filesystem.
    with checkpoints.somestep.get(sample=wildcards.sample).output[0].open() as f:
        if f.read().strip() == "a":
            return "post/{sample}.txt"
        else:
            return "alt/{sample}.txt"


rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated/{sample}.txt"
    shell:
        "touch {output}"

As can be seen, the rule aggregate uses an input function.

Note

You don’t need to use the checkpoint mechanism to determine parameter or resource values of downstream rules that would be based on the output of previous rules. In fact, it won’t even work because the checkpoint mechanism is only considered for input functions. Instead, you can simply use normal parameter or resource functions that just assume that those output files are there. Snakemake will evaluate them immediately before the job is scheduled, when the required files from upstream rules are already present.

Inside the function, we first retrieve the output files of the checkpoint somestep with the wildcards, passing through the value of the wildcard sample. Upon execution, if the checkpoint is not yet complete, Snakemake will record somestep as a direct dependency of the rule aggregate. Once somestep has finished for a given sample, the input function will automatically be re-evaluated and the method get will no longer raise an exception. Instead, the output file will be opened, and depending on its contents either "post/{sample}.txt" or "alt/{sample}.txt" will be returned by the input function. This way, the DAG becomes conditional on some produced data.
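
The control flow described here (an exception on first evaluation, then a re-evaluation once the checkpoint has run) can be mimicked with a toy model in plain Python. All class and function names below are hypothetical stand-ins; the real handling happens inside Snakemake's DAG code:

```python
class IncompleteCheckpoint(Exception):
    """Stand-in for the exception raised while a checkpoint has not run yet."""


class Checkpoint:
    """Toy checkpoint that copies a value into its 'output' when executed."""

    def __init__(self, source):
        self.source = source
        self.output = None

    def get(self):
        if self.output is None:
            raise IncompleteCheckpoint()
        return self.output

    def run(self):
        self.output = self.source


def aggregate_input(checkpoint, sample):
    # decision based on the content of the checkpoint output
    if checkpoint.get() == "a":
        return f"post/{sample}.txt"
    return f"alt/{sample}.txt"


def resolve_input(checkpoint, sample):
    """Evaluate the input function, running the checkpoint first if required."""
    try:
        return aggregate_input(checkpoint, sample)
    except IncompleteCheckpoint:
        checkpoint.run()  # the checkpoint becomes a dependency and is executed
        return aggregate_input(checkpoint, sample)  # re-evaluation


print(resolve_input(Checkpoint("a"), "sample1"))  # post/sample1.txt
print(resolve_input(Checkpoint("b"), "sample2"))  # alt/sample2.txt
```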

It is also possible to use checkpoints for cases where the output files are unknown before execution. Consider the following example where an arbitrary number of files is generated by a rule before being aggregated:

# a target rule to define the desired final output
rule all:
    input:
        "aggregated.txt"


# the checkpoint that shall trigger re-evaluation of the DAG
# a number of files are created in a defined directory
checkpoint somestep:
    output:
        directory("my_directory/")
    shell:'''
    mkdir -p my_directory/
    cd my_directory
    for i in 1 2 3; do touch $i.txt; done
    '''



# input function for rule aggregate, return paths to all files produced by the checkpoint 'somestep'
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
    return expand("my_directory/{i}.txt",
                i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated.txt"
    shell:
        "cat {input} > {output}"

Because the number of output files is unknown beforehand, the checkpoint only defines an output directory. This time, instead of explicitly writing

checkpoints.somestep.get(sample=wildcards.sample).output[0]

we use the shorthand

checkpoints.somestep.get(**wildcards).output[0]

which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking). If the checkpoint has not yet been executed, accessing checkpoints.somestep.get(**wildcards) ensures that Snakemake records the checkpoint as a direct dependency of the rule aggregate. Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed. Here, we retrieve the values of the wildcard i based on all files named {i}.txt in the output directory of the checkpoint. Because the wildcard i is evaluated only after completion of the checkpoint, it is necessary to use directory to declare its output, instead of using the full wildcard patterns as output.
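
What the second line of the input function does can be approximated without Snakemake: list the directory, match each file name against {i}.txt, and build the input paths from the captured values. This is a simplified sketch (the helper glob_ids is hypothetical, and glob_wildcards itself is more general):

```python
import os
import re
import tempfile


def glob_ids(directory, suffix=".txt"):
    """Collect the values of i for all files named {i}<suffix> in a directory."""
    pattern = re.compile(r"(?P<i>.+)" + re.escape(suffix) + r"$")
    matches = (pattern.match(name) for name in sorted(os.listdir(directory)))
    return [m.group("i") for m in matches if m]


with tempfile.TemporaryDirectory() as my_directory:
    # stand-in for the files created by the checkpoint
    for i in (1, 2, 3):
        open(os.path.join(my_directory, f"{i}.txt"), "w").close()
    ids = glob_ids(my_directory)
    inputs = [f"my_directory/{i}.txt" for i in ids]

print(inputs)
# prints ['my_directory/1.txt', 'my_directory/2.txt', 'my_directory/3.txt']
```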

A more practical example building on the previous one is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file. In this example the clusters are being processed by an intermediate rule before being aggregated:

# a target rule to define the desired final output
rule all:
    input:
        "aggregated/a.txt",
        "aggregated/b.txt"


# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint clustering:
    input:
        "samples/{sample}.txt"
    output:
        clusters=directory("clustering/{sample}")
    shell:
        "mkdir clustering/{wildcards.sample}; "
        "for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"


# an intermediate rule
rule intermediate:
    input:
        "clustering/{sample}/{i}.txt"
    output:
        "post/{sample}/{i}.txt"
    shell:
        "cp {input} {output}"


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    return expand("post/{sample}/{i}.txt",
           sample=wildcards.sample,
           i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


# an aggregation over all produced clusters
rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated/{sample}.txt"
    shell:
        "cat {input} > {output}"

Here a new directory will be created for each sample by the checkpoint. After completion of the checkpoint, the aggregate_input function is re-evaluated as previously. The values of the wildcard i are this time used to expand the pattern "post/{sample}/{i}.txt", such that the rule intermediate is executed for each of the determined clusters.

Rule inheritance

With Snakemake 6.0 and later, it is possible to inherit from previously defined rules, or in other words, reuse an existing rule in a modified way. This works via the use rule statement that also allows declaring the usage of rules from external modules (see Modules). Consider the following example:

rule a:
    output:
        "test.out"
    shell:
        "echo test > {output}"


use rule a as b with:
    output:
        "test2.out"

As can be seen, we first declare a rule a, and then we reuse the rule a as rule b, while changing only the output file and keeping everything else the same. In reality, one will often change more. Analogously to the use rule from external modules, any properties of the rule (input, output, log, params, benchmark, threads, resources, pathvars, etc.) can be modified, except the actual execution step (shell, notebook, script, cwl, or run). All unmodified properties are inherited from the parent rule.

Pathvars become particularly powerful in combination with such rule inheritance, as they allow introducing generic items in the parent rule that can be specified in the child rule:

rule transform_something:
    input:
        "<results>/<instep>/<per>.txt"
    output:
        "<results>/<outstep>/<per>.txt"
    shell:
        "somecommand {input} > {output}"

use rule transform_something as something1_to_something2 with:
    pathvars:
        instep="something1",
        outstep="something2",
        per="{sample}"

use rule transform_something as something5_to_something6 with:
    pathvars:
        instep="something5",
        outstep="something6",
        per="{sample}.{replicate}"

In other words, here we define a potentially complex rule only once, and explicitly use it in two different parts of the workflow, even with different kinds of wildcards, all by just configuring the pathvars.

Important

A rule cannot be redefined without renaming it using the as clause. Otherwise, you will have two versions of the same rule, which might be unintended (a common symptom of such unintended repeated uses would be ambiguous rule exceptions thrown by Snakemake). However, it is allowed to create multiple modified versions of the same rule, as long as each has a unique name. The only exception is when a rule was previously imported via a general use rule * from statement; such rules may be further modified once under the same final name for convenience (see Modules).

Note

Modification of params allows the replacement of single keyword arguments. Keyword params arguments of the original rule that are not defined after with are inherited. Positional params arguments of the original rule are overwritten, if positional params arguments are given after with. All other properties (input, output, …) are entirely overwritten with the values specified after with.
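
The params semantics stated in this note can be summarized in a short plain-Python sketch. The function inherit_params is a hypothetical illustration of the described behavior, not part of Snakemake:

```python
def inherit_params(parent_args, parent_kwargs, child_args=(), child_kwargs=None):
    """Combine params of a parent rule with those given after `with` (sketch)."""
    # positional params are overwritten only if the child provides some
    args = tuple(child_args) if child_args else tuple(parent_args)
    # keyword params are replaced one by one; the rest is inherited
    kwargs = {**parent_kwargs, **(child_kwargs or {})}
    return args, kwargs


args, kwargs = inherit_params(
    parent_args=("x",),
    parent_kwargs={"foo": 0.1, "bar": "keep"},
    child_kwargs={"foo": 0.5},
)
print(args, kwargs)
# prints ('x',) {'foo': 0.5, 'bar': 'keep'}
```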

Accessing auxiliary source files

Snakemake workflows can refer to various other source files via paths relative to the current Snakefile. This happens for example with the script directive or the conda directive. Sometimes, it is necessary to access further source files that are in a directory relative to the current Snakefile. Since workflows can be imported from remote locations (e.g. when using modules), it is important to not do this manually, so that Snakemake has the chance to cache these files locally before they are accessed. This can be achieved by accessing their path via workflow.source_path, which (a) computes the correct path relative to the current Snakefile such that the file can be accessed from any working directory, and (b) downloads remote files to a local cache:

rule a:
    input:
        json=workflow.source_path("../resources/test.json")
    output:
        "test.out"
    shell:
        "somecommand {input.json} > {output}"

Note

Note that if such source paths are specified as input files, they are automatically considered to be non-storage files. This means that Snakemake will not try to map them to a default storage provider, should one be specified (see Storage support). Further, note that workflow.source_path should not be used from params: but only from input:. The reason is that it returns a cached path that may change between Snakemake runs, thereby triggering spurious reruns if referred to via params: (since Snakemake would think that the parameter has changed).

Template rendering integration

Sometimes, data analyses entail the dynamic rendering of internal configuration files that are required for certain steps. From Snakemake 7 on, such template rendering is directly integrated such that it can happen with minimal code and maximum performance. Consider the following example:

rule render_jinja2_template:
    input:
        "some-jinja2-template.txt"
    output:
        "results/{sample}.rendered-version.txt"
    params:
        foo=0.1
    template_engine:
        "jinja2"

Here, Snakemake will automatically use the specified template engine Jinja2 to render the template given as input file into the given output file. The template_engine instruction has to be specified at the end of the rule. Template rendering rules may only have a single output file. If the rule needs more than one input file, there has to be one input file called template, pointing to the main template to be used for the rendering:

rule render_jinja2_template:
    input:
        template="some-jinja2-template.txt",
        other_file="some-other-input-file-used-by-the-template.txt"
    output:
        "results/{sample}.rendered-version.txt"
    params:
        foo=0.1
    template_engine:
        "jinja2"

The template itself has access to input, params, wildcards, and config, which are the same objects you can use for example in the shell or run directive, and the same objects as can be accessed from script or notebook directives (but in the latter two cases they are stored behind the snakemake object which serves as a dedicated namespace to avoid name clashes).

An example Jinja2 template could look like this:

This is some text and now we access {{ params.foo }}.

Apart from Jinja2, Snakemake supports YTE (YAML template engine), which is particularly designed to support templating of the ubiquitous YAML file format:

rule render_yte_template:
    input:
        "some-yte-template.yaml"
    output:
        "results/{sample}.rendered-version.yaml"
    params:
        foo=0.1
    template_engine:
        "yte"

Analogously to the Jinja2 case, YTE has access to params, wildcards, and config:

?if params.foo < 0.5:
  x:
    - 1
    - 2
    - 3
?else:
  y:
    - a
    - b
    - ?config["threshold"]

By default, template rendering rules are executed locally, without submission to cluster or cloud processes (since templating is usually not resource intensive). However, if a storage plugin is used, a template rule can theoretically leak paths to local copies of the storage files into the rendered template. This can happen if the template inserts the path of an input file into the rendered output. Snakemake tries to detect such cases by checking the template output. To avoid such leaks (only required if your template does something like that with an input file path), you can assign the same group to your template rule and the consuming rule, and in addition mark the template output as temp(), i.e.:

rule render_yte_template:
    input:
        "some-yte-template.yaml"
    output:
        temp("results/{sample}.rendered-version.yaml")
    params:
        foo=0.1
    group: "some-group"
    template_engine:
        "yte"

rule consume_template:
    input:
        "results/{sample}.rendered-version.yaml"
    output:
        "results/some-output.txt"
    group: "some-group"
    shell:
        "sometool {input} {output}"

Setting default flags

Snakemake allows the annotation of input and output files via so-called flags (see e.g. Protected and Temporary Files). Sometimes, it can be useful to define that a certain flag shall be applied to all input or output files of a workflow. This can be achieved via the global inputflags and outputflags directives. Consider the following example:

outputflags:
    temp

rule a:
    output:
        "test.out"
    shell:
        "echo test > {output}"

This would automatically mark the output file of rule a as temporary. The most convenient use case of this mechanism occurs in combination with access pattern annotation. In this case, the default access pattern can be set globally for all output files of a workflow. Only the few cases that differ then have to deal with explicit access pattern annotation (see Access pattern annotation for an example). Whenever a rule defines a flag for a file, this flag will override the default flag of the same kind or any contradicting default flags (e.g. temp will override protected).

Such default input and output flag specifications are always valid for all rules that follow them in the workflow definition. Importantly, they are also “namespaced” per module, meaning that inputflags and outputflags directives in a module only apply to the rules defined in that module.

MPI support

Some highly parallel programs or scripts implement the message passing interface (MPI), which enables a program to span work across multiple compute nodes on a compute cluster (where a node is an individual machine, which will usually have multiple CPUs nowadays). Let us assume we have such a program that can parallelize using MPI, named calc-pi-mpi. To run such a program, the user will usually launch it via the mpirun command from Open MPI. But because it can make sense to use another MPI launch command in some circumstances, we recommend the following pattern for a snakemake rule using an MPI-parallelized tool:

rule calc_pi:
  output:
      "pi.calc",
  log:
      "logs/calc_pi.log",
  resources:
      tasks=10,
      mpi="mpirun",
  shell:
      "{resources.mpi} -n {resources.tasks} calc-pi-mpi 10 > {output} 2> {log}"

Here, you provide the MPI wrapper command used to launch the program under resources: mpi=. This enables users to override this command if their execution environment requires this, via providing the mpi resource in Profiles or via the command line option --set-resources. While mpirun should work in most compute environments, including cluster systems like Slurm, LSF or PBS, the exact MPI wrapper command to launch programs may differ on your system. To find out if and which command your execution environment provides, you will have to consult local documentation, check out if any known mpi wrapper commands are available or ask your system’s administrators. A good reference point for getting mpirun to work on your execution environment is the documentation of the mpirun prerequisites.

While a number of cluster scheduling systems are able to figure the tasks resource out for you, other execution environments will require setting -n manually. This includes running a snakemake workflow with an MPI program on a single host or in a non-scheduled environment via ssh, but will also include snakemake remote execution plugins for cluster systems that don’t integrate handling this for you. It is thus good practice to provide this explicitly. To understand how a remote execution plugin for a particular cluster scheduling system supports MPI job execution, please consult the documentation for the respective plugin.

In addition to overriding the MPI wrapper command, you can also provide extra parameters to the MPI wrapper command with the above construct, should your execution environment require it, for example:

$ snakemake --set-resources calc_pi:mpi="mpirun -arch x86" ...

or:

$ snakemake --set-resources calc_pi:mpi="srun --hint nomultithread" ...
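
To see what such an override does to the command line of the job, the placeholder expansion can be imitated in plain Python. This is only a sketch: it uses str.format with a SimpleNamespace as a stand-in for Snakemake's resources object, and render_command is a hypothetical helper that is not part of Snakemake.

```python
from types import SimpleNamespace

# Shell template taken from the calc_pi rule above.
TEMPLATE = "{resources.mpi} -n {resources.tasks} calc-pi-mpi 10 > {output} 2> {log}"


def render_command(mpi="mpirun", tasks=10):
    """Hypothetical helper: fill the template roughly the way the rule's placeholders are filled."""
    # Stand-in for the resources object (an assumption made for this sketch).
    resources = SimpleNamespace(mpi=mpi, tasks=tasks)
    return TEMPLATE.format(resources=resources, output="pi.calc", log="logs/calc_pi.log")


if __name__ == "__main__":
    # Default launcher as declared in the rule.
    print(render_command())
    # Launcher overridden, as with --set-resources on the command line.
    print(render_command(mpi="srun --hint nomultithread"))
```

Because the whole launcher string is a resource, extra launcher options travel with it and the rule itself never has to change.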

Continuously updated input

From Snakemake 8.2 on, it is possible to define rules that continuously accept new input files during workflow execution. This is useful for scenarios like streaming data analysis. The feature works by defining a synchronized Python queue for obtaining input files via the helper function from_queue:

rule myrule:
    input:
        from_queue(all_results, finish_sentinel=...)
    ...

Rules with input marked as from_queue may not define any wildcards. When new items arrive in the queue:

The input files list for the rule is updated
The DAG of jobs is updated, potentially generating new dependencies for the rule
Any dependent rules that need to process the new input files are automatically created and executed

It is required to define a finish sentinel, which is a special value that signals the end of the queue. Once the finish sentinel is encountered, Snakemake will allow all remaining dependent jobs to finish and complete execution of the workflow.
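
The role of the finish sentinel can be pictured with plain Python, independent of any workflow. The following is a minimal sketch of the general producer/consumer pattern, not Snakemake's internal implementation; drain_until_sentinel, produce, and demo are hypothetical names introduced for illustration.

```python
import queue
import threading

# A unique object that can never be mistaken for a file name.
finish_sentinel = object()


def drain_until_sentinel(q, sentinel):
    """Hypothetical consumer: collect items until the sentinel shows up."""
    seen = []
    while True:
        item = q.get()   # blocks until the producer has put something
        q.task_done()    # lets a producer waiting in q.join() continue
        if item is sentinel:
            return seen
        seen.append(item)


def produce(q, sentinel, n=3):
    """Hypothetical producer: announce n file names, then signal the end."""
    for i in range(n):
        q.put(f"test{i}.txt")
    q.put(sentinel)
    q.join()  # returns once every queued item has been marked as done


def demo():
    q = queue.Queue()
    producer = threading.Thread(target=produce, args=(q, finish_sentinel))
    producer.start()
    files = drain_until_sentinel(q, finish_sentinel)
    producer.join()
    return files


if __name__ == "__main__":
    print(demo())
```

The sentinel is compared by identity (is), so any string, including an empty one, remains a valid queue item.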

Consider the following complete toy example:

import threading, queue, time

# the finish sentinel
finish_sentinel = object()
# a synchronized queue for the input files
all_results = queue.Queue()

# a thread that fills the queue with input files to be considered
def update_results():
    try:
        for i in range(10):
            all_results.put(f"test{i}.txt")
            time.sleep(1)
        all_results.put(finish_sentinel)
        all_results.join()
    except (KeyboardInterrupt, SystemExit):
        return

update_thread = threading.Thread(target=update_results)
update_thread.start()


# target rule which will be continuously updated until the queue is finished
rule all:
    input:
        from_queue(all_results, finish_sentinel=finish_sentinel)


# job that generates the requested input files
rule generate:
    output:
        "test{i}.txt"
    shell:
        "echo {wildcards.i} > {output}"

Updating existing output files

By default, Snakemake deletes already existing output files before a job is executed. This is usually very convenient, because many tools will fail if their output files already exist. However, from Snakemake 8.7 on, it is possible to declare an output file/directory to be updated by a job instead of rewritten from scratch. Consider the following example:

rule update:
    input:
        "in.txt"
    output:
        update("test.txt")
    shell:
        "echo test >> {output}"

Here, the statement test is appended to the output file test.txt. Hence, we declare it as being updated via the update flag. This way, Snakemake will not delete the file or directory before the job is executed. Furthermore, Snakemake will restore the previous version of the file/directory if the job fails.
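
The keep-and-restore behavior described above can be approximated in plain Python. This is a sketch under stated assumptions, not Snakemake's implementation: run_with_update, append_line, and failing_job are hypothetical names, only regular files are handled (no directories or storage), and removing a partial file that did not exist before the job is a choice made for this sketch.

```python
import os
import shutil
import tempfile


def run_with_update(path, job):
    """Hypothetical helper: run job(path) without deleting path first and
    restore the previous content if the job raises."""
    backup = None
    if os.path.exists(path):
        fd, backup = tempfile.mkstemp()
        os.close(fd)
        shutil.copy2(path, backup)  # remember the previous version
    try:
        job(path)
    except Exception:
        if backup is not None:
            shutil.copy2(backup, path)  # roll back to the previous version
        elif os.path.exists(path):
            os.remove(path)  # sketch assumption: drop partial new output
        raise
    finally:
        if backup is not None:
            os.remove(backup)  # the backup is no longer needed


def append_line(path):
    """Mirrors the shell command 'echo test >> {output}'."""
    with open(path, "a") as fh:
        fh.write("test\n")


def failing_job(path):
    """Writes to the file and then fails, to trigger the rollback."""
    append_line(path)
    raise RuntimeError("job failed after writing")
```

Running run_with_update(path, append_line) repeatedly grows the file by one line each time, whereas a call with failing_job leaves the file as it was before that call.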

If another rule has to consume such a file/directory as input in its state before the update, the input can be marked as before_update. This ensures that Snakemake does not search for a producing job but instead considers the file as it is on disk or in the storage:

rule do_something:
    input:
        before_update("test.txt")
    output:
        "in.txt"
    shell:
        "cp {input} {output}"

rule update:
    input:
        "in.txt"
    output:
        update("test.txt")
    shell:
        "echo test >> {output}"

As can be seen, this way it is even possible to break a cyclic dependency. An important helper for setting up the logic of before_update is the exists function, which allows you, for example, to make the use of the pre-update file conditional on whether it actually exists before the update.

Procedural rule definition

The name of a rule is optional and can be left out, creating an anonymous rule. It can also be overridden by setting a rule's name attribute, which makes it possible to define rules procedurally, for example in a loop:

for tool in ["bcftools", "freebayes"]:
    rule:
        name:
            f"call_variants_{tool}"
        input:
            f"path/to/{tool}/inputfile"
        output:
            f"path/to/{tool}/outputfile"
        shell:
            f"{tool} {{input}} > {{output}}"
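
The doubled braces in the shell string of the loop above matter because the f-string is evaluated first, when the rule is defined. A plain-Python sketch of the two stages follows; shell_template is a hypothetical helper, and str.format only stands in for the later placeholder replacement done at job execution.

```python
def shell_template(tool):
    # Same f-string as in the rule above: {tool} is filled in immediately,
    # while {{input}} and {{output}} become the literal text {input} and {output}.
    return f"{tool} {{input}} > {{output}}"


# Stage 1: one template per generated rule, keyed by the rule name.
templates = {
    f"call_variants_{tool}": shell_template(tool)
    for tool in ["bcftools", "freebayes"]
}

# Stage 2 (stand-in): the remaining placeholders are filled for a concrete job.
command = templates["call_variants_bcftools"].format(
    input="path/to/bcftools/inputfile",
    output="path/to/bcftools/outputfile",
)

if __name__ == "__main__":
    print(templates)
    print(command)
```

With single braces, Python would try to resolve input and output as variables while the f-string is built, which is not what the rule intends.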
