Welcome to Snakemake’s documentation!


Snakemake is an MIT-licensed workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.

Quick Example

rule targets:
    input:
        "plots/dataset1.pdf",
        "plots/dataset2.pdf"

rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    shell:
        "somecommand {input} {output}"
  • Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.
  • For each target and intermediate file, you create rules that define how they are created from input files.
  • Snakemake determines the rule dependencies by matching file names.
  • Input and output files can contain multiple named wildcards (see the sketch after this list).
  • Rules can either use shell commands, plain Python code or external Python or R scripts to create output files from input files.
  • Snakemake workflows can be executed on workstations and clusters without modification. The job scheduling can be constrained by arbitrary resources, e.g., available CPU cores, memory or GPUs.
  • Snakemake can use Amazon S3, Google Storage, Dropbox, FTP and SFTP to access input or output files and further access input files via HTTP and HTTPS.
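
For instance, a rule may combine two named wildcards in its output file. The following is a hypothetical sketch (the file layout and the extra region wildcard are made up for illustration, extending the quick example above):

rule plot_region:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.{region}.pdf"
    shell:
        "somecommand --region {wildcards.region} {input} {output}"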

Getting started

To get started, consider the Snakemake Tutorial, the introductory slides, and the FAQ.

Support

Publications using Snakemake

In the following, you will find an incomplete list of publications that make use of Snakemake for their analyses. Please consider adding your own.

Installation

Snakemake is available on PyPI as well as through Bioconda and also from source code. You can use any of the following ways to install Snakemake.

Installation via Conda

On Linux and MacOS X, this is the recommended way to install Snakemake, because it also enables Snakemake to handle the software dependencies of your workflow.

First, you have to install the Miniconda Python3 distribution. See here for installation instructions. Make sure to ...

  • Install the Python 3 version of Miniconda.
  • Answer yes to the question whether conda shall be put into your PATH.

Then, you can install Snakemake with

$ conda install -c bioconda snakemake

from the Bioconda channel.
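
If you prefer to keep Snakemake out of your base installation, you can instead create a dedicated Conda environment for it (the environment name snakemake below is an arbitrary choice):

$ conda create -c bioconda -n snakemake snakemake
$ source activate snakemake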

Global Installation

With a working Python 3 setup, installation of Snakemake can be performed by issuing

$ easy_install3 snakemake

or

$ pip3 install snakemake

in your terminal.

Installing in Virtualenv

To create an installation in a virtual environment, use the following commands:

$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install snakemake

Installing from Source

We recommend installing Snakemake into a virtualenv instead of globally. Use the following commands to create a virtualenv and install Snakemake. Note that this will install the development version; as you are installing from the source code, we trust that you know what you are doing and how to check out individual versions/tags.

$ git clone https://bitbucket.org/snakemake/snakemake.git
$ cd snakemake
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ python setup.py install

You can also use python setup.py develop to create a “development installation” in which no files are copied but a link is created and changes in the source code are immediately visible in your snakemake commands.

Examples

Most of the examples below assume that Snakemake is executed in a project-specific root directory. The paths in the Snakefiles below are relative to this directory. We follow the convention to use different subdirectories for different intermediate results, e.g., mapped/ for mapped sequence reads in .bam files, etc.

Building a C Program

GNU Make is primarily used to build C/C++ code. Snakemake can do the same, while providing superior readability thanks to fewer obscure variables inside the rules.

The following example Makefile was adapted from http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/.

IDIR=../include
ODIR=obj
LDIR=../lib

LIBS=-lm

CC=gcc
CFLAGS=-I$(IDIR)

_HEADERS = hello.h
HEADERS = $(patsubst %,$(IDIR)/%,$(_HEADERS))

_OBJS = hello.o hellofunc.o
OBJS = $(patsubst %,$(ODIR)/%,$(_OBJS))

# build the executable from the object files
hello: $(OBJS)
        $(CC) -o $@ $^ $(CFLAGS) $(LIBS)

# compile a single .c file to an .o file
$(ODIR)/%.o: %.c $(HEADERS)
        $(CC) -c -o $@ $< $(CFLAGS)


# clean up temporary files
.PHONY: clean
clean:
        rm -f $(ODIR)/*.o *~ core $(IDIR)/*~

A Snakefile can be easily written as

from os.path import join

IDIR = '../include'
ODIR = 'obj'
LDIR = '../lib'

LIBS = '-lm'

CC = 'gcc'
CFLAGS = '-I' + IDIR


_HEADERS = ['hello.h']
HEADERS = [join(IDIR, hfile) for hfile in _HEADERS]

_OBJS = ['hello.o', 'hellofunc.o']
OBJS = [join(ODIR, ofile) for ofile in _OBJS]


rule hello:
    """build the executable from the object files"""
    output:
        'hello'
    input:
        OBJS
    shell:
        "{CC} -o {output} {input} {CFLAGS} {LIBS}"

rule c_to_o:
    """compile a single .c file to an .o file"""
    output:
        temp('{ODIR}/{name}.o')
    input:
        src='{name}.c',
        headers=HEADERS
    shell:
        "{CC} -c -o {output} {input.src} {CFLAGS}"

rule clean:
    """clean up temporary files"""
    shell:
        "rm -f   *~  core  {IDIR}/*~"

As can be seen, the shell calls become more readable, e.g. "{CC} -c -o {output} {input.src} {CFLAGS}" instead of $(CC) -c -o $@ $< $(CFLAGS). Further, Snakemake automatically deletes .o-files once they are not needed anymore, since they are marked as temp.

C Workflow DAG

Building a Paper with LaTeX

Building a scientific paper can be automated by Snakemake as well. Apart from compiling LaTeX code and invoking BibTeX, we provide a special rule to zip the needed files for online submission.

We first provide a Snakefile tex.rules that contains rules that can be shared for any LaTeX build task:

ruleorder:  tex2pdf_with_bib > tex2pdf_without_bib

rule tex2pdf_with_bib:
    input:
        '{name}.tex',
        '{name}.bib'
    output:
        '{name}.pdf'
    shell:
        """
        pdflatex {wildcards.name}
        bibtex {wildcards.name}
        pdflatex {wildcards.name}
        pdflatex {wildcards.name}
        """

rule tex2pdf_without_bib:
    input:
        '{name}.tex'
    output:
        '{name}.pdf'
    shell:
        """
        pdflatex {wildcards.name}
        pdflatex {wildcards.name}
        """

rule texclean:
    shell:
        "rm -f  *.log *.aux *.bbl *.blg *.synctex.gz"

Note how we distinguish between a .tex file with and without a corresponding .bib with the same name. Assuming that both paper.tex and paper.bib exist, an ambiguity arises: Both rules are, in principle, applicable. This would lead to an AmbiguousRuleException, but since we have specified an explicit rule order in the file, it is clear that in this case the rule tex2pdf_with_bib is to be preferred. If the paper.bib file does not exist, that rule is not even applicable, and the only option is to execute rule tex2pdf_without_bib.

Assuming that the above file is saved as tex.rules, the actual documents are then built from a specific Snakefile that includes these common rules:

DOCUMENTS = ['document', 'response-to-editor']
TEXS = [doc+".tex" for doc in DOCUMENTS]
PDFS = [doc+".pdf" for doc in DOCUMENTS]
FIGURES = ['fig1.pdf']

include:
    'tex.rules'

rule all:
    input:
        PDFS

rule zipit:
    output:
        'upload.zip'
    input:
        TEXS, FIGURES, PDFS
    shell:
        'zip -T {output} {input}'

rule pdfclean:
    shell:
        "rm -f  {PDFS}"

Hence, the user can perform four different tasks. Build all PDFs:

$ snakemake

Create a zip-file for online submissions:

$ snakemake zipit

Clean up all PDFs:

$ snakemake pdfclean

Clean up latex temporary files:

$ snakemake texclean

The following DAG of jobs would be executed upon a full run:

LaTeX Workflow DAG

Snakemake Tutorial

This tutorial introduces the text-based workflow system Snakemake. Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.

Snakemake sets itself apart from existing text-based workflow systems in the following way. Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python with syntax to define rules and workflow specific properties. This allows combining the flexibility of a plain scripting language with a pythonic workflow definition. The Python language is known to be concise yet readable and can appear almost like pseudo-code. The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake's scheduling algorithm can be constrained by priorities, provided cores and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single core workstations and multi-core servers to cluster or batch systems.

The examples presented in this tutorial come from bioinformatics. However, Snakemake is a general-purpose workflow management system for any discipline. We ensured that no bioinformatics knowledge is needed to understand the tutorial.

Also have a look at the corresponding slides.

Setup

Requirements

To go through this tutorial, you need the following software installed:

The easiest way to set up these prerequisites is to use the Miniconda Python 3 distribution. The tutorial assumes that you are using either Linux or MacOS X. Both Snakemake and Miniconda also work under Windows, but the Windows shell is too different to be able to provide generic examples.

Setup a Linux VM with Vagrant under Windows

If you already use Linux or MacOS X, go on with Step 1. If you use Windows, you can setup a Linux virtual machine (VM) with Vagrant. First, install Vagrant following the installation instructions in the Vagrant Documentation. Then, create a reasonable new directory you want to share with your Linux VM, e.g., create a folder vagrant-linux somewhere. Open a command line prompt, and change into that directory. Here, you create a 64-bit Ubuntu Linux environment with

> vagrant init hashicorp/precise64
> vagrant up

If you decide to use a 32-bit image, you will need to download the 32-bit version of Miniconda in the next step. The contents of the vagrant-linux folder will be shared with the virtual machine that is set up by vagrant. You can log into the virtual machine via

> vagrant ssh

If this command tells you to install an SSH client, you can follow the instructions in this blog post. Now, you can follow the steps of our tutorial from within your Linux VM.

Step 1: Installing Miniconda 3

First, please open a terminal or make sure you are logged into your Vagrant Linux VM. Assuming that you have a 64-bit system, on Linux, download and install Miniconda 3 with

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

On MacOS X, download and install with

$ curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh

For a 32-bit system, URLs and file names are analogous but without the _64. When you are asked the question

Do you wish the installer to prepend the Miniconda3 install location to PATH ...? [yes|no]

answer with yes. Along with a minimal Python 3 environment, Miniconda contains the package manager Conda. After opening a new terminal, you can use the new conda command to install software packages and create isolated environments to, e.g., use different versions of the same package. We will later use Conda to create an isolated environment with all required software for this tutorial.

Step 2: Preparing a working directory

First, create a new directory snakemake-tutorial at a reasonable place. If you use a Vagrant Linux VM from Windows as described above, create that directory under /vagrant/, so that the contents are shared with your host system (you can then edit all files from within Windows with an editor that supports Unix line breaks). Then, change into the newly created directory in your terminal. In this directory, we will later create an example workflow that illustrates the Snakemake syntax and execution environment. To begin, we download some example data on which the workflow shall be executed:

$ wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
$ tar -xf v3.11.0.tar.bz2 --strip 1

This will create a folder data and a file environment.yaml in the working directory.

Step 3: Creating an environment with the required software

The environment.yaml file can be used to install all required software into an isolated Conda environment with the name snakemake-tutorial via

$ conda env create --name snakemake-tutorial --file environment.yaml
Step 4: Activating the environment

To activate the snakemake-tutorial environment, execute

$ source activate snakemake-tutorial

Now you can use the installed tools. Execute

$ snakemake --help

to test this and get information about the command-line interface of Snakemake. To exit the environment, you can execute

$ source deactivate

but don’t do that now, since we finally want to start working with Snakemake :-).

Basics: An example workflow

Please make sure that you have activated the environment we created before, and that you have an open terminal in the working directory you have created.

A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.

The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented and consists of multiple lines. The resulting syntax resembles that of original Python constructs.
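
As a small illustration (a made-up rule, shown only to demonstrate the two block styles), the code block can follow the keyword on the same line or be indented on the following lines:

rule example:
    output: "results/summary.txt"   # code block on the same line as the keyword
    shell:                          # code block indented on the following line(s)
        "echo done > {output}"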

In the following, we will introduce the Snakemake syntax by creating an example workflow. The workflow comes from the domain of genome analysis. It maps sequencing reads to a reference genome and calls variants on the mapped reads. The tutorial does not require you to know what this is about. Nevertheless, we provide some background in the following.

Background

The genome of a living organism encodes its hereditary information. It serves as a blueprint for proteins, which form living cells, carry information and drive chemical reactions. Differences between populations, species, cancer cells and healthy tissue, as well as syndromes or diseases can be reflected and sometimes caused by changes in the genome. This makes the genome a major target of biological and medical research. Today, it is often analyzed with DNA sequencing, producing gigabytes of data from a single biological sample (e.g. a biopsy of some tissue). For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called reads. In order to recover the genome of the sample, one has to map these reads against a known reference genome (e.g., the human one obtained during the famous human genome project). This task is called read mapping. Often, it is of interest where an individual genome differs from the species-wide consensus represented with the reference genome. Such differences are called variants. They are responsible for harmless individual differences (like eye color), but can also cause diseases like cancer. By investigating the differences between all mapped reads and the reference sequence at one position, variants can be detected. This is a statistical challenge, because they have to be distinguished from artifacts generated by the sequencing process.

Step 1: Mapping reads

Our first Snakemake rule maps reads of a given sample to a given reference genome (see Background). For this, we will use the tool bwa, specifically the subcommand bwa mem. In the working directory, create a new file called Snakefile with an editor of your choice. We propose to use the Atom editor, since it provides out-of-the-box syntax highlighting for Snakemake. In the Snakefile, define the following rule:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

A Snakemake rule has a name (here bwa_map) and a number of directives, here input, output and shell. The input and output directives are followed by lists of files that are expected to be used or created by the rule. In the simplest case, these are just explicit Python strings. The shell directive is followed by a Python string containing the shell command to execute. In the shell command string, we can refer to elements of the rule via braces notation (similar to the Python format function). Here, we refer to the output file by specifying {output} and to the input files by specifying {input}. Since the rule has multiple input files, Snakemake will concatenate them, separated by whitespace. In other words, Snakemake will replace {input} with data/genome.fa data/samples/A.fastq before executing the command. The shell command invokes bwa mem with reference genome and reads, and pipes the output into samtools, which creates a compressed BAM file containing the alignments. The output of samtools is redirected into the output file defined by the rule.

When a workflow is executed, Snakemake tries to generate given target files. Target files can be specified via the command line. By executing

$ snakemake -np mapped_reads/A.bam

in the working directory containing the Snakefile, we tell Snakemake to generate the target file mapped_reads/A.bam. Since we used the -n (or --dryrun) flag, Snakemake will only show the execution plan instead of actually performing the steps. The -p flag instructs Snakemake to also print the resulting shell command for illustration. To generate the target files, Snakemake applies the rules given in the Snakefile in a top-down way. The application of a rule to generate a set of output files is called a job. For each input file of a job, Snakemake again (i.e. recursively) determines rules that can be applied to generate it. This yields a directed acyclic graph (DAG) of jobs where the edges represent dependencies. So far, we only have a single rule, and the DAG of jobs consists of a single node. Nevertheless, we can execute our workflow with

$ snakemake mapped_reads/A.bam

Note that, after completion of the above command, Snakemake will not try to create mapped_reads/A.bam again, because it is already present in the file system. Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.

Step 2: Generalizing the read mapping rule

Obviously, the rule will only work for a single sample with reads in the file data/samples/A.fastq. However, Snakemake allows generalizing rules by using named wildcards. Simply replace the A in the second input file and in the output file with the wildcard {sample}, leading to

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

When Snakemake determines that this rule can be applied to generate a target file by replacing the wildcard {sample} in the output file with an appropriate value, it will propagate that value to all occurrences of {sample} in the input files and thereby determine the necessary input for the resulting job. Note that you can have multiple wildcards in your file paths; however, to avoid conflicts with other jobs of the same rule, all output files of a rule have to contain exactly the same wildcards.

When executing

$ snakemake -np mapped_reads/B.bam

Snakemake will determine that the rule bwa_map can be applied to generate the target file by replacing the wildcard {sample} with the value B. In the output of the dry-run, you will see how the wildcard value is propagated to the input files and all filenames in the shell command. You can also specify multiple targets, e.g.:

$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam

Some Bash magic can make this particularly handy. For example, you can alternatively compose the two targets above in a single expression via

$ snakemake -np mapped_reads/{A,B}.bam

Note that this is not a special Snakemake syntax. Bash is just expanding the given path into two, one for each element of the set {A,B}.

In both cases, you will see that Snakemake only proposes to create the output file mapped_reads/B.bam. This is because you already executed the workflow before (see the previous step) and no input file is newer than the output file mapped_reads/A.bam. You can update the file modification date of the input file data/samples/A.fastq via

$ touch data/samples/A.fastq

and see how Snakemake wants to re-run the job to create the file mapped_reads/A.bam by executing

$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Step 3: Sorting read alignments

For later steps, we need the read alignments in the BAM files to be sorted. This can be achieved with the samtools command. We add the following rule beneath the bwa_map rule:

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

This rule will take the input file from the mapped_reads directory and store a sorted version in the sorted_reads directory. Note that Snakemake automatically creates missing directories before jobs are executed. For sorting, samtools requires a prefix specified with the flag -T. Here, we need the value of the wildcard sample. Snakemake allows accessing wildcards in the shell command via the wildcards object, which has an attribute holding the value of each wildcard.

When issuing

$ snakemake -np sorted_reads/B.bam

you will see how Snakemake wants to run first the rule bwa_map and then the rule samtools_sort to create the desired target file: as mentioned before, the dependencies are resolved automatically by matching file names.

Step 4: Indexing read alignments and visualizing the DAG of jobs

Next, we need to use samtools again to index the sorted read alignments for random access. This can be done with the following rule:

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

With three steps in place, now is a good time to take a closer look at the resulting DAG of jobs. By executing

$ snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

we create a visualization of the DAG using the dot command provided by Graphviz. For the given target files, Snakemake specifies the DAG in the dot language and pipes it into the dot command, which renders the definition into SVG format. The rendered DAG is piped into the file dag.svg and will look similar to this:

_images/dag_index.png

The DAG contains a node for each job and edges representing the dependencies. Jobs that don’t need to be run because their output is up-to-date are dashed. For rules with wildcards, the value of the wildcard for the particular job is displayed in the job node.

Exercise
  • Run parts of the workflow using different targets. Recreate the DAG and see how different rules become dashed because their output is present and up-to-date.
Step 5: Calling genomic variants

The next step in our workflow will aggregate the mapped reads from all samples and jointly call genomic variants on them (see Background). For the variant calling, we will combine the two utilities samtools and bcftools. Snakemake provides a helper function for collecting input files, which helps us describe the aggregation in this step. With

expand("sorted_reads/{sample}.bam", sample=SAMPLES)

we obtain a list of files where the given pattern "sorted_reads/{sample}.bam" was formatted with the values in a given list of samples SAMPLES, i.e.

["sorted_reads/A.bam", "sorted_reads/B.bam"]

The function is particularly useful when the pattern contains multiple wildcards. For example,

expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])

would create the product of all elements of SAMPLES and the list [0, 1], yielding

["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]

Here, we use only the simple case of expand. We first let Snakemake know which samples we want to consider. Remember that Snakemake works top-down; it does not automatically infer this from, e.g., the fastq files in the data folder. Also remember that Snakefiles are in principle Python code enhanced by some declarative statements to define workflows. Hence, we can define the list of samples ad-hoc in plain Python at the top of the Snakefile:

SAMPLES = ["A", "B"]
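
With this list defined, the simple expand call from above behaves roughly like the following plain Python list comprehension (a sketch for intuition only, not how Snakemake implements it internally):

# roughly what expand("sorted_reads/{sample}.bam", sample=SAMPLES) returns
files = ["sorted_reads/{}.bam".format(sample) for sample in SAMPLES]
# -> ["sorted_reads/A.bam", "sorted_reads/B.bam"]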

Later, we will learn about more sophisticated ways like config files. Now, we can add the following rule to our Snakefile:

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

With multiple input or output files, it is sometimes handy to refer to them separately in the shell command. This can be done by specifying names for input or output files (here, e.g., fa=...). The files can then be referred to in the shell command via, e.g., {input.fa}. For long shell commands like this one, it is advisable to split the string over multiple indented lines. Python will automatically merge it into one. Further, you will notice that the input or output file lists can contain arbitrary Python expressions, as long as they return a string or a list of strings. Here, we invoke our expand function to aggregate over the aligned reads of all samples.

Exercise
  • Obtain the updated DAG of jobs for the target file calls/all.vcf; it should look like this:
_images/dag_call.png
Step 6: Writing a report

Although Snakemake workflows are already self-documenting to a certain degree, it is often useful to summarize the obtained results and performed steps in a comprehensive report. With Snakemake, such reports can be composed easily with the built-in report function. It is best practice to create reports in a separate rule that takes all desired results as input files and provides a single HTML file as output.

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

First, we notice that this rule does not entail a shell command. Instead, we use the run directive, which is followed by plain Python code. Similar to the shell case, we have access to input and output files, which we can handle as plain Python objects.

We go through the run block line by line. First, we import the report function from snakemake.utils. Second, we open the VCF file by accessing it via its index in the input files (i.e. input[0]), and count the number of non-header lines (which is equivalent to the number of variant calls). Of course, this is only a silly example of what to do with variant calls. Third, we create the report using the report function. The function takes a string that contains RestructuredText markup. In addition, we can use the familiar braces notation to access any Python variables (here the n_calls variable we have defined before). The second argument of the report function is the path where the report will be stored (the function creates a single HTML file). Then, report expects any number of keyword arguments referring to files that shall be embedded into the report. Technically, this means that the file will be stored as a Base64 encoded data URI within the HTML file, making reports entirely self-contained. Importantly, you can refer to the files from within the report via the given keywords followed by an underscore (here T1_). Hence, reports can be used to semantically connect and explain the obtained results.

When having many result files, it is sometimes handy to define the names already in the list of input files and unpack these into keyword arguments as follows:

report("""...""", output[0], **input)

Further, you can add metadata in the form of any string that will be displayed in the footer of the report, e.g.

report("""...""", output[0], metadata="Author: Johannes Köster (koester@jimmy.harvard.edu)", **input)
Step 7: Adding a target rule

So far, we always executed the workflow by specifying a target file at the command line. Apart from filenames, Snakemake also accepts rule names as targets if the referred rule does not have wildcards. Hence, it is possible to write target rules collecting particular subsets of the desired results or all results. Moreover, if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target. Hence, it is best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.

Here, this means that we add a rule

rule all:
    input:
        "report.html"

to the top of our workflow. When executing Snakemake with

$ snakemake -n

the execution plan for creating the file report.html, which contains and summarizes all our results, will be shown. Note that, apart from Snakemake considering the first rule of the workflow as the default target, the order in which rules appear in the Snakefile is arbitrary and does not influence the DAG of jobs.

Exercise
  • Create the DAG of jobs for the complete workflow.
  • Execute the complete workflow and have a look at the resulting report.html in your browser.
  • Snakemake provides handy flags for forcing re-execution of parts of the workflow. Have a look at the command line help with snakemake --help and search for the flag --forcerun. Then, use this flag to re-execute the rule samtools_sort and see what happens.
  • With --reason it is possible to display the execution reason for each job. Try this flag together with a dry-run and the --forcerun flag to understand the decisions of Snakemake.
Summary

In total, the resulting workflow looks like this:

SAMPLES = ["A", "B"]


rule all:
    input:
        "report.html"


rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

Advanced: Decorating the example workflow

Now that the basic concepts of Snakemake have been illustrated, we can introduce advanced topics.

Step 1: Specifying the number of used threads

For some tools, it is advisable to use more than one thread in order to speed up the computation. Snakemake can be made aware of the threads a rule needs with the threads directive. In our example workflow, it makes sense to use multiple threads for the rule bwa_map:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

The number of threads can be propagated to the shell command with the familiar braces notation (i.e. {threads}). If no threads directive is given, a rule is assumed to need 1 thread.

When a workflow is executed, the number of threads the jobs need is considered by the Snakemake scheduler. In particular, the scheduler ensures that the sum of the threads of all running jobs does not exceed a given number of available CPU cores. This number can be given with the --cores command line argument (per default, Snakemake uses only 1 CPU core). For example

$ snakemake --cores 10

would execute the workflow with 10 cores. Since the rule bwa_map needs 8 threads, only one job of the rule can run at a time, and the Snakemake scheduler will try to saturate the remaining cores with other jobs like, e.g., samtools_sort. The threads directive in a rule is interpreted as a maximum: when fewer cores than threads are provided, the number of threads a rule uses will be reduced to the number of given cores.

Exercise
  • With the flag --forceall you can enforce a complete re-execution of the workflow. Combine this flag with different values for --cores and examine how the scheduler selects jobs to run in parallel.
Step 2: Config files

So far, we specified the samples to consider in a Python list within the Snakefile. However, often you want your workflow to be customizable, so that it can be easily adapted to new data. For this purpose, Snakemake provides a config file mechanism. Config files can be written in JSON or YAML, and loaded with the configfile directive. In our example workflow, we add the line

configfile: "config.yaml"

to the top of the Snakefile. Snakemake will load the config file and store its contents into a globally available dictionary named config. In our case, it makes sense to specify the samples in config.yaml as

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

Now, we can remove the statement defining SAMPLES from the Snakefile and change the rule bcftools_call to

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
Step 3: Input functions

Since we have stored the path to the FASTQ files in the config file, we can also generalize the rule bwa_map to use these paths. This case is different from the rule bcftools_call we modified above. To understand this, it is important to know that Snakemake workflows are executed in three phases.

  • In the initialization phase, the workflow is parsed and all rules are instantiated.
  • In the DAG phase, the DAG of jobs is built by filling wildcards and matching input files to output files.
  • In the scheduling phase, the DAG of jobs is executed.

The expand functions in the list of input files of the rule bcftools_call are executed during the initialization phase. In this phase, we don't know about jobs, wildcard values and rule dependencies. Hence, we cannot determine the FASTQ paths for rule bwa_map from the config file in this phase, because we don't even know which jobs will be generated from that rule. Instead, we need to defer the determination of input files to the DAG phase. This can be achieved by specifying an input function instead of a string inside of the input directive. For the rule bwa_map this works as follows:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

Here, we use an anonymous function, also called a lambda expression. Any normal function would work as well. Input functions take a single argument, a wildcards object, which allows access to the wildcard values via attributes (here wildcards.sample). They have to return a string or a list of strings, which are interpreted as paths to input files (here, we return the path that is stored for the sample in the config file). Input functions are evaluated once the wildcard values of a job are determined.
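
If you prefer a named function over a lambda, the same input function could be written as follows (purely a stylistic variant of the rule above):

def bwa_map_input(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        bwa_map_input
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"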

Exercise
  • In the data/samples folder, there is an additional sample C.fastq. Add that sample to the config file and see how Snakemake wants to recompute the part of the workflow belonging to the new sample, when invoking with snakemake -n --reason --forcerun bcftools_call.
Step 4: Rule parameters

Sometimes, shell commands are not only composed of input and output files and some static flags. In particular, it can happen that additional parameters need to be set depending on the wildcard values of the job. For this, Snakemake allows defining arbitrary parameters for rules with the params directive. In our workflow, it is reasonable to annotate aligned reads with so-called read groups, which contain metadata like the sample name. We modify the rule bwa_map accordingly:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    threads: 8
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

Similar to input and output files, params can be accessed from the shell command or the Python based run block (see Step 6: Writing a report).
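
For illustration, a hypothetical rule using a run block (the rule name and file paths below are made up) could access the same kind of parameter via the params object, just like input and output:

rule write_read_group:
    output:
        "read_groups/{sample}.txt"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    run:
        # params behaves like input and output: one attribute per named parameter
        with open(output[0], "w") as out:
            out.write(params.rg + "\n")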

Exercise
  • Variant calling can consider a lot of parameters. A particularly important one is the prior mutation rate (1e-3 per default). It is set via the flag -P of the bcftools call command. Consider making this flag configurable via adding a new key to the config file and using the params directive in the rule bcftools_call to propagate it to the shell command.
Step 5: Logging

When executing a large workflow, it is usually desirable to store the output of each job persistently in files instead of just printing it to the terminal. For this purpose, Snakemake allows specifying log files for rules. Log files are defined via the log directive and handled similarly to output files, but they are not subject to rule matching and are not cleaned up when a job fails. We modify our rule bwa_map as follows:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

The shell command is modified to collect STDERR output of both bwa and samtools and pipe it into the file referred by {log}. Log files must contain exactly the same wildcards as the output files to avoid clashes.

Exercise
  • Add a log directive to the bcftools_call rule as well.
  • Time to re-run the whole workflow (remember the command line flags to force re-execution). See how log files are created for variant calling and read mapping.
  • The ability to track the provenance of each generated result is an important step towards reproducible analyses. Apart from the report functionality discussed before, Snakemake can summarize various provenance information for all output files of the workflow. The flag --summary prints a table associating each output file with the rule used to generate it, the creation date, and optionally the version of the tool used for creation. Further, the table informs about updated input files and changes to the source code of the rule after creation of the output file. Invoke Snakemake with --summary to examine the information for our example.
Step 6: Temporary and protected files

In our workflow, we create two BAM files for each sample, namely the output of the rules bwa_map and samtools_sort. When not dealing with examples, the underlying data is usually huge. Hence, the resulting BAM files need a lot of disk space and their creation takes some time. Snakemake allows marking output files as temporary, such that they are deleted once every consuming job has been executed, in order to save disk space. We use this mechanism for the output file of the rule bwa_map:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

This results in the deletion of the BAM file once the corresponding samtools_sort job has been executed. Since the creation of BAM files via read mapping and sorting is computationally expensive, it is reasonable to protect the final BAM file from accidental deletion or modification. We modify the rule samtools_sort by marking its output file as protected:

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

After execution of the job, Snakemake will write-protect the output file in the filesystem, so that it can’t be overwritten or deleted accidentally.

Exercise
  • Re-execute the whole workflow and observe how Snakemake handles the temporary and protected files.
  • Run Snakemake with the target mapped_reads/A.bam. Although the file is marked as temporary, you will see that Snakemake does not delete it because it is specified as a target file.
  • Try to re-execute the whole workflow again with the dry-run option. You will see that it fails (as intended) because Snakemake cannot overwrite the protected output files.
Summary

The final version of our workflow looks like this:

configfile: "config.yaml"


rule all:
    input:
        "report.html"


rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

Additional features

In the following, we introduce some features that are beyond the scope of the above example workflow. For details and even more features, see Writing Workflows, Frequently Asked Questions and the command line help (snakemake --help).

Benchmarking

With the benchmark directive, Snakemake can be instructed to measure the wall clock time of a job. We activate benchmarking for the rule bwa_map:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        "benchmarks/{sample}.bwa.benchmark.txt"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

The benchmark directive takes a string that points to the file where benchmarking results shall be stored. Similar to output files, the path can contain wildcards (these must be the same wildcards as in the output files). When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store them in the file in tab-delimited format. With the command line flag --benchmark-repeats, Snakemake can be instructed to perform repeated measurements by executing benchmark jobs multiple times. The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.

We can include the benchmark results into our report:

rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input.T1) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        Benchmark results for BWA can be found in the tables T2_.
        """, output[0], **input)

We use the expand function to collect the benchmark files for all samples. Here, we directly provide names for the input files. In particular, we can also name the whole list of benchmark files returned by the expand function as T2. When invoking the report function, we just unpack input into keyword arguments (resulting in T1 and T2). In the text, we refer with T2_ to the list of benchmark files.

Exercise
  • Re-execute the workflow and benchmark bwa_map with 3 repeats. Open the report and see how the list of benchmark files is presented in the HTML report.
Modularization

In order to re-use building blocks or simply to structure large workflows, it is sometimes reasonable to split a workflow into modules. For this, Snakemake provides the include directive to include another Snakefile into the current one, e.g.:

include: "path/to/other.snakefile"

Alternatively, Snakemake allows defining sub-workflows. A sub-workflow refers to a working directory with a complete Snakemake workflow. Output files of that sub-workflow can be used in the current Snakefile. When executing, Snakemake ensures that the output files of the sub-workflow are up-to-date before executing the current workflow. This mechanism is particularly useful when you want to extend a previous analysis without modifying it. For details about sub-workflows, see the documentation.
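
A minimal sketch of this mechanism might look as follows (the sub-workflow path and file names are made up for illustration; see the sub-workflow documentation for the full set of options):

subworkflow other_workflow:
    workdir: "../path/to/otherworkflow"

rule use_subworkflow_result:
    input:
        # refers to an output file of the sub-workflow, which is kept up-to-date
        other_workflow("results/summary.txt")
    output:
        "combined/summary.txt"
    shell:
        "cp {input} {output}"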

Exercise
  • Put the read mapping related rules into a separate Snakefile and use the include directive to make them available in our example workflow again.
Using custom scripts

Using the run directive as above is only reasonable for short Python scripts. As soon as your script becomes larger, it is reasonable to separate it from the workflow definition. For this purpose, Snakemake offers the script directive. Using this, the report rule from above could instead look like this:

rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    script:
        "scripts/report.py"

The actual Python code to generate the report is now hidden in the script scripts/report.py. Script paths are always relative to the referring Snakefile. In the script, all properties of the rule like input, output, wildcards, params, threads etc. are available as attributes of a global snakemake object:

from snakemake.utils import report

with open(snakemake.input.T1) as vcf:
    n_calls = sum(1 for l in vcf if not l.startswith("#"))

report("""
An example variant calling workflow
===================================

Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.

This resulted in {n_calls} variants (see Table T1_).
Benchmark results for BWA can be found in the tables T2_.
""", snakemake.output[0], **snakemake.input)

Although there are other strategies to invoke separate scripts from your workflow (e.g., invoking them via shell commands), the benefit of this is obvious: the script logic is separated from the workflow logic (and can even be shared between workflows), while boilerplate code like the parsing of command line arguments is unnecessary.

Apart from Python scripts, it is also possible to use R scripts. In R scripts, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

For details and examples, see the External scripts section in the Documentation.

Automatic deployment of software dependencies

In order to get a fully reproducible data analysis, it is not sufficient to be able to execute each step and document all used parameters. The used software tools and libraries have to be documented as well. In this tutorial, you have already seen how Conda can be used to specify an isolated software environment for a whole workflow. With Snakemake, you can go one step further and specify Conda environments per rule. This way, you can even make use of conflicting software versions (e.g. combine Python 2 with Python 3).

In our example, instead of using an external environment we can specify environments per rule, e.g.:

rule samtools_index:
  input:
      "sorted_reads/{sample}.bam"
  output:
      "sorted_reads/{sample}.bam.bai"
  conda:
      "envs/samtools.yaml"
  shell:
      "samtools index {input}"

with envs/samtools.yaml defined as

channels:
  - bioconda
dependencies:
  - samtools =1.3

When Snakemake is executed with

snakemake --use-conda

it will automatically create required environments and activate them before a job is executed. It is best practice to specify at least the major and minor version of any packages in the environment definition. Specifying environments per rule in this way has two advantages. First, the workflow definition also documents all used software versions. Second, a workflow can be re-executed (without admin rights) on a vanilla system, without installing any prerequisites apart from Snakemake and Miniconda.

Tool wrappers

In order to simplify the utilization of popular tools, Snakemake provides a repository of so-called wrappers (the Snakemake wrapper repository). A wrapper is a short script that wraps (typically) a command line application and makes it directly addressable from within Snakemake. For this, Snakemake provides the wrapper directive that can be used instead of shell, script, or run. For example, the rule bwa_map could alternatively look like this:

rule bwa_mem:
  input:
      ref="data/genome.fa",
      sample=lambda wildcards: config["samples"][wildcards.sample]
  output:
      temp("mapped_reads/{sample}.bam")
  log:
      "logs/bwa_mem/{sample}.log"
  params:
      "-R '@RG\tID:{sample}\tSM:{sample}'"
  threads: 8
  wrapper:
      "0.15.3/bio/bwa/mem"

The wrapper directive expects a (partial) URL that points to a wrapper in the repository. These can be looked up in the corresponding database. The first part of the URL is a Git version tag. Upon invocation, Snakemake will automatically download the requested version of the wrapper. Furthermore, in combination with --use-conda (see Automatic deployment of software dependencies), the required software will be automatically deployed before execution.

Cluster execution

By default, Snakemake executes jobs on the local machine it is invoked on. Alternatively, it can execute jobs in distributed environments, e.g., compute clusters or batch systems. If the nodes share a common file system, Snakemake supports three alternative execution modes.

In cluster environments, compute jobs are usually submitted as shell scripts via commands like qsub. Snakemake provides a generic mode to execute on such clusters. By invoking Snakemake with

$ snakemake --cluster qsub --jobs 100

each job will be compiled into a shell script that is submitted with the given command (here qsub). The --jobs flag limits the number of concurrently submitted jobs to 100. This basic mode assumes that the submission command returns immediately after submitting the job. Some clusters allow running the submission command in synchronous mode, such that it waits until the job has been executed. In such cases, we can invoke e.g.

$ snakemake --cluster-sync "qsub -sync yes" --jobs 100

The specified submission command can also be decorated with additional parameters taken from the submitted job. For example, the number of used threads can be accessed in braces similarly to the formatting of shell commands, e.g.

$ snakemake --cluster "qsub -pe threaded {threads}" --jobs 100

Alternatively, Snakemake can use the Distributed Resource Management Application API (DRMAA). This API provides a common interface to control various resource management systems. The DRMAA support can be activated by invoking Snakemake as follows:

$ snakemake --drmaa --jobs 100

If available, DRMAA is preferable over the generic cluster modes because it provides better control and error handling. To support additional cluster specific parametrization, a Snakefile can be complemented by a Cluster Configuration file.
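
For example, a cluster configuration file (assumed here to be called cluster.json; the resource names and qsub options are scheduler specific and purely illustrative) could provide per-rule submission parameters, which are then referenced in the cluster command via the cluster object:

{
    "__default__": {"time": "01:00:00"},
    "bwa_map":     {"time": "04:00:00"}
}

With this file, Snakemake might be invoked as

$ snakemake --cluster "qsub -l h_rt={cluster.time}" --cluster-config cluster.json --jobs 100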

Constraining wildcards

Snakemake uses regular expressions to match output files to input files and determine dependencies between the jobs. Sometimes it is useful to constrain the values a wildcard can have. This can be achieved by adding a regular expression that describes the set of allowed wildcard values. For example, the wildcard sample in the output file "sorted_reads/{sample}.bam" can be constrained to only allow alphanumeric sample names as "sorted_reads/{sample,[A-Za-z0-9]+}.bam". Constraints may be defined per rule or globally using the wildcard_constraints keyword, as demonstrated in Wildcards (see also the sketch at the end of this section). This mechanism helps to solve two kinds of ambiguity.

  • It can help to avoid ambiguous rules, i.e. two or more rules that can be applied to generate the same output file. Other ways of handling ambiguous rules are described in the Section Handling Ambiguous Rules.
  • It can help to guide the regular expression based matching so that wildcards are assigned to the right parts of a file name. Consider the output file {sample}.{group}.txt and assume that the target file is A.1.normal.txt. It is not clear whether sample="A.1" and group="normal" or sample="A" and group="1.normal" is the right assignment. Here, constraining the sample wildcard by {sample,[A-Z]+}.{group} solves the problem.

When dealing with ambiguous rules, it is best practice to first try to solve the ambiguity by using a proper file structure, for example, by separating the output files of different steps in different directories.
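
As an illustration of the inline form, a minimal sketch of such a constrained rule (the rule name and input path are only placeholders; the Wildcards section below shows the rule-level and global wildcard_constraints syntax in full):

rule sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample,[A-Za-z0-9]+}.bam"
    shell:
        "somecommand {input} {output}"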

The Snakemake Executable

This part of the documentation describes the snakemake executable. Snakemake is primarily a command-line tool, and the snakemake executable is the main way to execute, debug, and visualize workflows.

Useful Command Line Arguments

If called without parameters, i.e.

$ snakemake

Snakemake tries to execute the workflow specified in a file called Snakefile in the same directory (alternatively, the Snakefile can be given via the parameter -s).

By issuing

$ snakemake -n

a dry-run can be performed. This is useful to test if the workflow is defined properly and to estimate the amount of needed computation. Further, the reason for each rule execution can be printed via

$ snakemake -n -r

Importantly, Snakemake can automatically determine which parts of the workflow can be run in parallel. By specifying the number of available cores, i.e.

$ snakemake -j 4

one can tell Snakemake to use up to 4 cores and solve a binary knapsack problem to optimize the scheduling of jobs. If the number is omitted (i.e., only -j is given), the number of used cores is determined as the number of available CPU cores in the machine.

Cluster Execution

Snakemake can make use of cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine). In this case, Snakemake simply needs to be given a submit command that accepts a shell script as the first positional argument:

$ snakemake --cluster qsub -j 32

Here, -j denotes the number of jobs being submitted to the cluster at the same time (here 32). The cluster command can be decorated with job specific information, e.g.

$ snakemake --cluster "qsub {threads}"

Thereby, all keywords of a rule are allowed (e.g. params, input, output, threads, priority, ...). For example, you could encode the expected running time into params:

rule:
    input:  ...
    output: ...
    params: runtime="4h"
    shell: ...

and forward it to the cluster scheduler:

$ snakemake --cluster "qsub --runtime {params.runtime}"

If your cluster system supports DRMAA, Snakemake can make use of that to increase the control over jobs. E.g. jobs can be cancelled upon pressing Ctrl+C, which is not possible with the generic --cluster support. With DRMAA, no qsub command needs to be provided, but system specific arguments can still be given as a string, e.g.

$ snakemake --drmaa " -q username" -j 32

Note that the string has to contain a leading whitespace. Else, the arguments will be interpreted as part of the normal Snakemake arguments, and execution will fail.

Job Properties

When executing a workflow on a cluster using the --cluster parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:

#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# do something useful with the threads
threads = job_properties["threads"]

# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]

os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))

Visualization

To visualize the workflow, one can use the option --dag. This creates a representation of the DAG in the graphviz dot language which has to be postprocessed by the graphviz tool dot. E.g. to visualize the DAG that would be executed, you can issue:

$ snakemake --dag | dot | display

For saving this to a file, you can specify the desired format:

$ snakemake --dag | dot -Tpdf > dag.pdf

To visualize the whole DAG regardless of the eventual presence of files, the --forceall option can be used:

$ snakemake --forceall --dag | dot -Tpdf > dag.pdf

Of course the visual appearance can be modified by providing further command line arguments to dot.

All Options

All command line options can be printed by calling snakemake -h.

Bash Completion

Snakemake supports bash completion for filenames, rulenames and arguments. To enable it globally, just append

`snakemake --bash-completion`

including the backticks to your .bashrc. This only works if the snakemake command is in your path.

Writing Workflows

In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.

Grammar

The Snakefile syntax obeys the following grammar, given in extended Backus-Naur form (EBNF)

snakemake  = statement | rule | include | workdir
rule       = "rule" (identifier | "") ":" ruleparams
include    = "include:" stringliteral
workdir    = "workdir:" stringliteral
ni         = NEWLINE INDENT
ruleparams = [ni input] [ni output] [ni params] [ni log] [ni benchmark] [ni message] [ni threads] [ni resources] [ni version] [ni (run | shell)] NEWLINE snakemake
input      = "input" ":" parameter_list
output     = "output" ":" parameter_list
params     = "params" ":" parameter_list
log        = "log" ":" parameter_list
benchmark  = "benchmark" ":" statement
message    = "message" ":" stringliteral
threads    = "threads" ":" integer
resources  = "resources" ":" parameter_list
version    = "version" ":" statement
run        = "run" ":" ni statement
shell      = "shell" ":" stringliteral

while all non-terminals not defined above map to their Python equivalents.

Depend on a Minimum Snakemake Version

From Snakemake 3.2 on, if your workflow depends on a minimum Snakemake version, you can easily ensure that at least this version is installed via

from snakemake.utils import min_version

min_version("3.2")

given that your minimum required version of Snakemake is 3.2. The statement will raise a WorkflowError (and therefore abort the workflow execution) if the version is not met.

Rules

Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    shell: "somecommand {input} {output}"

Inside the shell command, all local and global variables, especially input and output files, can be accessed via their names in the Python format mini-language. Here input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile). From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") will let Snakemake quote each of the list or tuple elements that contain whitespace. Instead of a shell command, a rule can run some python code to generate the output:

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", somename = "path/to/another/outputfile"
    run:
        for f in input:
            ...
            with open(output[0], "w") as out:
                out.write(...)
        with open(output.somename, "w") as out:
            out.write(...)

As can be seen, instead of accessing input and output as a whole, we can also access by index (output[0]) or by keyword (output.somename). Note that, when adding keywords or names for input or output files, their order won’t be preserved when accessing them as a whole via e.g. {output} in a shell command.

Shell commands like the one above can also be invoked inside a python based rule, via the function shell that takes a string with the command and allows the same formatting as in the rule above, e.g.:

shell("somecommand {output.somename}")

Further, this combination of python and shell commands allows iterating over the output of the shell command, e.g.:

for line in shell("somecommand {output.somename}", iterable=True):
    ... # do something in python

Note that shell commands in Snakemake use the bash shell in strict mode by default.

Wildcards

Usually, it is useful to generalize a rule so that it can be applied to a number of datasets. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    shell:
        "somecommand --group {wildcards.group} < {input} > {output}"

Here, we define two wildcards, dataset and group. By this, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. the wildcards are replaced by the regular expression .+. If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, that is here also used in the shell command. The wildcards object can be accessed in the same way as input and output, which is described above.

For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A. Thus, it requests file 101/inputfile as input and executes the command somecommand --group A  < 101/inputfile  > 101/file.A.txt. Of course, the input file might have to be generated by another rule with different wildcards.

Importantly, wildcards in input and output must be named identically. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not the input section.

Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.

Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only using \d+ as the corresponding regular expression. With Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:

output: "{dataset,\d+}.{group}.txt"

Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints:

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    wildcard_constraints:
        dataset="\d+"
    shell:
        "somecommand --group {wildcards.group}  < {input}  > {output}"

Finally, you can also define global wildcard constraints that apply for all rules:

wildcard_constraints:
    dataset="\d+"

rule a:
    ...

rule b:
    ...

See the Python documentation on regular expressions for detailed information on regular expression syntax.

Targets

By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:

rule all:
  input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]

Here, for each dataset in a python list DATASETS defined before, the file {dataset}/file.A.txt is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion shown above.
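
For instance, DATASETS could simply be a plain Python list defined earlier in the Snakefile (the concrete names are only placeholders):

DATASETS = ["ds1", "ds2"]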

The above expression can be simplified to the following:

rule all:
  input: expand("{dataset}/file.A.txt", dataset=DATASETS)

This may be used for “aggregation” rules for which files from multiple or all datasets are needed to produce a specific output (say, allSamplesSummary.pdf). Note that dataset is NOT a wildcard here because it is resolved by Snakemake due to the expand statement (see below also for more information).

The expand function thereby allows also to combine different variables, e.g.

rule all:
  input: expand("{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS)

If now PLOTFORMATS=["pdf", "png"] contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.

Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.

expand(["{dataset}/plot1.{ext}", "{dataset}/plot2.{ext}"], dataset=DATASETS, ext=PLOTFORMATS)

leads to

["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.pdf", "ds2/plot2.pdf", "ds1/plot1.png", "ds1/plot2.png", "ds2/plot1.png", "ds2/plot2.png"]

Per default, expand uses the python itertools function product that yields all combinations of the provided wildcard values. However by inserting a second positional argument this can be replaced by any combinatoric function, e.g. zip:

expand("{dataset}/plot1.{ext} {dataset}/plot2.{ext}".split(), zip, dataset=DATASETS, ext=PLOTFORMATS)

leads to

["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.png", "ds2/plot2.png"]

You can also mask a wildcard expression in expand such that it will be kept, e.g.

expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)

will create strings with all values for ext but starting with "{dataset}".

Threads

Further, a rule can be given a number of threads to use, i.e.

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    shell: "somecommand --threads {threads} {input} {output}"

Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built in variable threads rather than hardcoding it into the shell command. In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores) with cores being the number of cores specified at the command line (option --cores). On a cluster node, Snakemake uses as many cores as available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by --local-cores, which only applies to jobs running on the master node.

Starting from version 3.7, threads can also be a callable that returns an int value. The signature of the callable should be callable(wildcards, [input]) (input is an optional parameter). It is also possible to refer to a predefined variable (e.g., threads: threads_max) so that the number of cores for a set of rules can be changed with one change only by altering the value of the variable threads_max.
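
A minimal sketch of such a callable (the heuristic, rule, and file names are purely illustrative):

def dynamic_threads(wildcards):
    # hypothetical heuristic: samples whose name starts with "big" get more threads
    return 8 if wildcards.sample.startswith("big") else 2

rule NAME:
    input: "path/to/{sample}.txt"
    output: "path/to/{sample}.out.txt"
    threads: dynamic_threads
    shell: "somecommand --threads {threads} {input} {output}"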

Resources

In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:

rule:
    input:     ...
    output:    ...
    resources: gpu=1
    shell: "..."

If limits for the resources are given via the command line, e.g.

$ snakemake --resources gpu=2

the scheduler will ensure that the given resources are not exceeded by running jobs. If no limits are given, the resources are ignored. Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs), this allows controlling scheduling in various ways, e.g. limiting IO-heavy jobs by assigning them an artificial IO resource and restricting it via the --resources flag. Resources must be int values.

Resources can also be callables that return int values. The signature of the callable should be callable(wildcards, [input]) (input is an optional parameter).
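
A minimal sketch of a per-rule resource callable (the resource name mem_mb, the lookup table, and the file paths are only examples):

# hypothetical per-sample memory requirements in MiB
MEMORY_PER_SAMPLE = {"A": 8000, "B": 16000}

def mem_for_sample(wildcards):
    # fall back to 4000 MiB if the sample is not listed
    return MEMORY_PER_SAMPLE.get(wildcards.sample, 4000)

rule:
    input:     "data/{sample}.txt"
    output:    "results/{sample}.txt"
    resources: mem_mb=mem_for_sample
    shell:     "..."

The corresponding limit could then be set on the command line, e.g. via --resources mem_mb=32000.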

Messages

When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:

rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    message: "Executing somecommand with {threads} threads on the following files {input}."
    shell: "somecommand --threads {threads} {input} {output}"

Note that access to wildcards is also possible via the variable wildcards (e.g., {wildcards.sample}), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.

Priorities

Snakemake allows rules to specify numeric priorities:

rule:
  input: ...
  output: ...
  priority: 50
  shell: ...

Per default, each rule has a priority of 0. Any rule that specifies a higher priority will be preferred by the scheduler over all other rules that are ready to execute at the same time but do not have at least the same priority.

Furthermore, the --prioritize or -P command line flag allows to specify files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible. The --dryrun or -n option allows you to see the scheduling plan including the assigned priorities.
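
For example, to give a (hypothetical) target file top priority and preview the resulting schedule, one could issue:

$ snakemake --prioritize results/sample1.pdf --dryrun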

Log-Files

Each rule can specify a log file where information about the execution is written to:

rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand --log {log} {input} {output}"

The variable log can be used inside a shell command to tell the used tool to which file to write the logging information. Of course the log file can use the same wildcards as input and output files, e.g.

log: "logs/abc.{dataset}.log"

For programs that do not have an explicit log parameter, you may always use 2> {log} to redirect standard error to a file (here, the log file) on Linux-based systems. Note that it is also possible to specify multiple (named) log files:

rule abc:
    input: "input.txt"
    output: "output.txt"
    log: log1="logs/abc.log", log2="logs/xyz.log"
    shell: "somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}"

Non-file parameters for rules

Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the params keyword for this purpose:

rule:
    input:
        ...
    params:
        prefix="somedir/{sample}"
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"

The params keyword allows you to specify additional parameters depending on the wildcards values. This allows you to circumvent the need to use run: and python code for non-standard commands like in the above case. Here, the command somecommand expects the prefix of the output file instead of the actual one. The params keyword helps here since you cannot simply add the prefix as an output file (as the file won’t be created, Snakemake would throw an error after execution of the rule).

Furthermore, for enhanced readability and clarity, the params section is also an excellent place to name and assign parameters and variables for your subsequent command.

Similar to input, params can take functions as well (see Functions as Input Files), e.g. you can write

rule:
    input:
        ...
    params:
        prefix=lambda wildcards, output: output[0][:-4]
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"

to get the same effect as above. Note that in contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. From the Python perspective, they can be seen as optional keyword arguments without a default value. Their order does not matter, apart from the fact that wildcards has to be the first argument. In the example above, this allows you to derive the prefix name from the output file.

External scripts

A rule can also point to an external script instead of a shell command or inline Python code, e.g.

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "path/to/script.py"

The script path is always relative to the Snakefile (in contrast to the input and output file paths, which are relative to the working directory). Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0] to access the first input file of above rule.

Apart from Python scripts, this mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "path/to/script.R"

In the R script, an S4 object named snakemake analogous to the Python case above is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

An example external Python script could look like this:

def do_something(data_path, out_path, threads, myparam):
    # python code

do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])

You can use the Python debugger from within the script if you invoke Snakemake with --debug. An equivalent script written in R would look like this:

do_something <- function(data_path, out_path, threads, myparam) {
    # R code
}

do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])

To debug R scripts, you can save the workspace with save.image(), and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable. It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.

An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/report.html",
    script:
        "path/to/report.Rmd"

In the R Markdown file you can insert output from an R command, and access variables stored in the S4 object named snakemake:

---
title: "Test Report"
author:
    - "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
   rmd: "report.Rmd"
output:
  html_document:
    highlight: tango
    number_sections: no
    theme: default
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: yes
---

## R Markdown

This is an R Markdown document.

Test include from snakemake `r snakemake@input`.

## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>

A link to the R Markdown document with the snakemake object can be inserted. To this end, a variable called rmd needs to be added to the params section in the header of the report.Rmd file. The generated R Markdown file with the snakemake object will be saved in the file specified in this rmd variable. This file can be embedded into the HTML document using base64 encoding and a link can be inserted as shown in the example above. Also other input and output files can be embedded in this way to make a portable report. Note that the above method with a data URI only works for small files. An experimental technology to embed larger files is using JavaScript Blob objects.

Protected and Temporary Files

A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        protected("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

A protected file will be write-protected after the rule that produces it is completed.

Further, an output file marked as temp is deleted after all rules that use it as an input are completed:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

Ignoring timestamps

For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file. This behavior can be overridden by marking an input file as ancient. The timestamp of such files is ignored and always assumed to be older than any of the output files:

rule NAME:
    input:
        ancient("path/to/inputfile")
    output:
        "path/to/outputfile"
    shell:
        "somecommand {input} {output}"

Here, this means that the file path/to/outputfile will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future. Note that any flag that forces re-creation of files still also applies to files marked as ancient.

Shadow rules

Shadow rules result in each execution of the rule being run in an isolated temporary directory. This “shadow” directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don’t want to clean up manually in your snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or to simplify your workflow by not having to worry about unique filenames for all outputs of all rules.

By setting shadow: "shallow", the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting shadow: "full" fully shadows the entire subdirectory structure of the current workdir. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by output.

Shadow directories are stored one per rule execution in .snakemake/shadow/, and are cleared on subsequent snakemake invocations unless the --keep-shadow command line argument is used.

Typically, you will not need to modify your rule for compatibility with shadow, unless you reference parent directories relative to your workdir in a rule.

rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shadow: "shallow"
    shell: "somecommand --other_outputs other.txt {input} {output}"

Flag files

Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by “touching” empty files that denote that a certain task was completed. Snakemake supports this via the touch flag:

rule all:
    input: "mytask.done"

rule mytask:
    output: touch("mytask.done")
    shell: "mycommand ..."

With the touch flag, Snakemake touches (i.e. creates or updates) the file mytask.done after mycommand has finished successfully.

Dynamic Files

Snakemake provides experimental support for dynamic files. Dynamic files can be used whenever one has a rule for which the number of output files is unknown before the rule was executed. This is useful for example with certain clustering algorithms:

rule cluster:
    input: "afile.csv"
    output: dynamic("{clusterid}.cluster.csv")
    run: ...

Now the results of the rule can be used in Snakemake although it does not know how many files will be present before executing the rule cluster, e.g. by:

rule all:
    input: dynamic("{clusterid}.cluster.plot.pdf")

rule plot:
    input: "{clusterid}.cluster.csv"
    output: "{clusterid}.cluster.plot.pdf"
    run: ...

Here, Snakemake determines the input files for the rule all after the rule cluster was executed, and then dynamically inserts jobs of the rule plot into the DAG to create the desired plots.

Functions as Input Files

Instead of specifying strings or lists of strings as input files, snakemake can also make use of functions that return a single input file or a list of input files:

def myfunc(wildcards):
    return [... a list of input files depending on given wildcards ...]

rule:
    input: myfunc
    output: "someoutput.{somewildcard}.txt"
    shell: "..."

The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files. Note that you can also use lambda expressions instead of full function definitions. By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.
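
A minimal sketch of the dictionary-based lookup described above (the dictionary contents and paths are purely illustrative):

# hypothetical mapping of sample names to files in different parts of the filesystem
SAMPLE_TO_PATH = {
    "A": "/data/current_run/A.fastq",
    "B": "/archive/old_runs/B.fastq",
}

rule:
    input: lambda wildcards: SAMPLE_TO_PATH[wildcards.sample]
    output: "someoutput.{sample}.txt"
    shell: "..."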

Note that the function will be executed when the rule is evaluated and before the workflow actually starts to execute. Further note that using a function as input overrides the default mechanism of replacing wildcards with their values inferred from the output files. You have to take care of that yourself with the given wildcards object.

Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have. In particular, input files should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.

For a practical example, see the Snakemake Tutorial (Step 3: Input functions).

Input Functions and unpack()

In some cases, you might want to have your input functions return named input files. This can be done by having them return dict() objects with the names as the dict keys and the file names as the dict values and using the unpack() keyword.

def myfunc(wildcards):
    return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}

rule:
    input: unpack(myfunc)
    output: "someoutput.{token}.txt"
    shell: "..."

Note that unpack() is only necessary for input functions returning dict. While it also works for list, remember that lists (and nested lists) of strings are automatically flattened.

Also note that if you do not pass a function into the input list but instead call a function directly, then you don’t use unpack() either. Here, you can simply use Python’s double-star (**) operator for unpacking the parameters.

Note that as Snakefiles are translated into Python for execution, the same rules as for using the star and double-star unpacking Python operators apply. These restrictions do not apply when using unpack().

def myfunc1():
    return ['foo.txt']

def myfunc2():
    return {'foo': 'nowildcards.txt'}

rule:
    input:
        *myfunc1(),
        **myfunc2(),
    output: "..."
    shell: "..."

Version Tracking

Rules can specify a version that is tracked by Snakemake together with the output files. When the version changes, snakemake informs you when using the flag --summary or --list-version-changes. The version can be specified by the version directive, which takes a string:

rule:
    input:   ...
    output:  ...
    version: "1.0"
    shell:   ...

The version can of course also be filled with the output of a shell command, e.g.:

SOMECOMMAND_VERSION = subprocess.check_output("somecommand --version", shell=True)

rule:
    version: SOMECOMMAND_VERSION

Alternatively, you might want to use file modification times in case of local scripts:

SOMECOMMAND_VERSION = str(os.path.getmtime("path/to/somescript"))

rule:
    version: SOMECOMMAND_VERSION

A re-run can be automated by invoking Snakemake as follows:

$ snakemake -R `snakemake --list-version-changes`

With the availability of the conda directive (see Integrated Package Management) the version directive has become obsolete in favor of defining isolated software environments that can be automatically deployed via the conda package manager.

Code Tracking

Snakemake tracks the code that was used to create your files. In combination with --summary or --list-code-changes, this can be used to see what files may need a re-run because the implementation changed. A re-run can be automated by invoking Snakemake as follows:

$ snakemake -R `snakemake --list-code-changes`

Onstart, onsuccess and onerror handlers

Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user). With Snakemake 3.2.1, this is possible via the onsuccess and onerror keywords:

onsuccess:
    print("Workflow finished, no error")

onerror:
    print("An error occurred")
    shell("mail -s "an error occurred" youremail@provider.com < {log}")

The onsuccess handler is executed if the workflow finished without error. Else, the onerror handler is executed. In both handlers, you have access to the variable log, which contains the path to a logfile with the complete Snakemake output. Snakemake 3.6.0 adds an onstart handler that will be executed before the workflow starts. Note that dry-runs do not trigger any of the handlers.
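
For example, a minimal onstart handler could simply announce the start of the workflow:

onstart:
    print("Workflow started")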

Rule dependencies

From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

rule a:
    input:  "path/to/input"
    output: "path/to/output"
    shell:  ...

rule b:
    input:  rules.a.output
    output: "path/to/output/of/b"
    shell:  ...

Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:

rule a:
    input:  "path/to/input"
    output: a = "path/to/output", b = "path/to/output2"
    shell:  ...

rule b:
    input:  rules.a.output.a
    output: "path/to/output/of/b"
    shell:  ...

Handling Ambiguous Rules

When two rules can produce the same output file, snakemake cannot decide per default which one to use. Hence an AmbiguousRuleException is thrown. The proposed strategy to deal with such ambiguity is to provide a ruleorder for the conflicting rules, e.g.

ruleorder: rule1 > rule2 > rule3

Here, rule1 is preferred over rule2 and rule3, and rule2 is preferred over rule3. Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file. Note that ruleorder is not intended to bring rules into the correct execution order (this is solely guided by the names of input and output files you use); it only helps snakemake to decide which rule to use when multiple rules can create the same output file.

Alternatively, rule dependencies (see above) can also resolve ambiguities.

Another (quick and dirty) possibility is to tell snakemake to allow ambiguity via a command line option

$ snakemake --allow-ambiguity

such that, similar to GNU Make, the first matching rule is always used. Here, a warning that summarizes the decision of snakemake is provided at the terminal.

Local Rules

When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target-rule like all, see Targets). The keyword localrules allows marking a rule as local, so that it is not submitted to the cluster and instead executed on the host node:

localrules: all, foo

rule all:
    input: ...

rule foo:
    ...

rule bar:
    ...

Here, only jobs from the rule bar will be submitted to the cluster, whereas all and foo will be run locally. Note that you can use the localrules directive multiple times. The result will be the union of all declarations.

Benchmark Rules

Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule

rule benchmark_command:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        "benchmarks/somecommand/{sample}.txt"
    shell:
        "somecommand {input} {output}"

benchmarks the CPU and wall clock time of the command somecommand for the given output and input files. For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark txt file (which will contain a tab-separated table of run times and memory usage in MiB). Per default, Snakemake executes the job once, generating one run time. With snakemake --benchmark-repeats, this number can be changed to e.g. generate timings for two or three runs. The resulting txt file can be used as input for other rules, just like any other output file.

Note

Note that benchmarking is only possible in a reliable fashion for subprocesses (thus for tasks run through the shell, script, and wrapper directives). In the run block, the variable bench_record is available and can be passed to shell() as bench_record=bench_record. When using shell(..., bench_record=bench_record), the maximum of all measurements across all shell() calls will be used, whereas the running time covers the whole rule execution including any Python code.
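
A sketch of this pattern, reusing the rule from above with a run block instead of a shell directive:

rule benchmark_command:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        "benchmarks/somecommand/{sample}.txt"
    run:
        # pass bench_record so that the shell() call is included in the benchmark
        shell("somecommand {input} {output}", bench_record=bench_record)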

Configuration

Snakemake allows you to use configuration files for making your workflows more flexible and also for abstracting away direct dependencies to a fixed HPC cluster scheduler.

Standard Configuration

Snakemake directly supports the configuration of your workflow. A configuration is provided as a JSON or YAML file and can be loaded with:

configfile: "path/to/config.json"

The config file can be used to define a dictionary of configuration parameters and their values. In the workflow, the configuration is accessible via the global variable config, e.g.

rule all:
    input:
        expand("{sample}.{yourparam}.output.pdf", sample=config["samples"], param=config["yourparam"])

If the configfile statement is not used, the config variable provides an empty dictionary. In addition to the configfile statement, config values can be overwritten via the command line or the Snakemake API, e.g.:

$ snakemake --config yourparam=1.5

Further, you can manually alter the config dictionary using any Python code outside of your rules. Changes made from within a rule won’t be seen from other rules. Finally, you can use the --configfile command line argument to overwrite values from the configfile statement. Note that any values parsed into the config dictionary with any of the above mechanisms are merged, i.e., all keys defined via a configfile statement, or the --configfile and --config command line arguments, will end up in the final config dictionary, but if two methods define the same key, the command line overwrites the configfile statement.

For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:

shell:
    "mycommand {config[foo]} ..."

Cluster Configuration

Snakemake supports a separate configuration file for execution on a cluster. A cluster config file allows you to specify cluster submission parameters outside the Snakefile. The cluster config is a JSON- or YAML-formatted file that contains objects that match names of rules in the Snakefile. The parameters in the cluster config are then accessed by the cluster.* wildcard when you are submitting jobs. For example, say that you have the following Snakefile:

rule all:
    input: "input1.txt", "input2.txt"

rule compute1:
    output: "input1.txt"
    shell: "touch input1.txt"

rule compute2:
    output: "input2.txt"
    shell: "touch input2.txt"

This Snakefile can then be configured by a corresponding cluster config, say “cluster.json”:

{
    "__default__" :
    {
        "account" : "my account",
        "time" : "00:15:00",
        "n" : 1,
        "partition" : "core"
    },
    "compute1" :
    {
        "time" : "00:20:00"
    }
}

Any string in the cluster configuration can be formatted in the same way as shell commands, e.g. {rule}.{wildcards.sample} is formatted to a.xy if the rule name is a and the wildcard value is xy. Here __default__ is a special object that specifies default parameters; these will be inherited by the other configuration objects. The compute1 object here changes the time parameter, but keeps the other parameters from __default__. The rule compute2 does not have any configuration, and will therefore use the default configuration. You can then run the Snakefile with the following command on a SLURM system.

$ snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n}  -t {cluster.time}"

For cluster systems using LSF/BSUB, a cluster config may look like this:

{
    "__default__" :
    {
        "queue"     : "medium_priority",
        "nCPUs"     : "16",
        "memory"    : 20000,
        "resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\"",
        "name"      : "JOBNAME.{rule}.{wildcards}",
        "output"    : "logs/cluster/{rule}.{wildcards}.out",
        "error"     : "logs/cluster/{rule}.{wildcards}.err"
    },


    "trimming_PE" :
    {
        "memory"    : 30000,
        "resources" : "\"select[mem>30000] rusage[mem=30000] span[hosts=1]\"",
    }
}

The advantage of this setup is that it is already pretty general by exploiting the wildcard possibilities that Snakemake provides via {rule} and {wildcards}. So job names, output and error files all have reasonable and trackable default names, only the directories (logs/cluster) and job names (JOBNAME) have to be adjusted accordingly. If a rule named bamCoverage is executed with the wildcard basename = sample1, for example, the output and error files will be bamCoverage.basename=sample1.out and bamCoverage.basename=sample1.err, respectively.

Configure Working Directory

All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behavior can be overridden by specifying a workdir in the snakefile:

workdir: "path/to/workdir"

Usually, it is preferred to only set the working directory via the command line, because above directive limits the portability of Snakemake workflows.

Modularization

Modularization in Snakemake comes at different levels.

  1. The most fine-grained level is wrappers. They are available and can be published at the Snakemake Wrapper Repository. These wrappers can then be composed and customized according to your needs, by copying skeleton rules into your workflow. In combination with conda integration, wrappers also automatically deploy the needed software dependencies into isolated environments.
  2. For larger, reusable parts that shall be integrated into a common workflow, it is recommended to write small Snakefiles and include them into a master Snakefile via the include statement. In such a setup, all rules share a common config file.
  3. The third level of separation is subworkflows. Importantly, these are rather meant as links between otherwise separate data analyses.

Wrappers

With Snakemake 3.5.5, the wrapper directive is introduced (experimental). This directive allows having re-usable wrapper scripts around e.g. command line tools. In contrast to modularization strategies like include or subworkflows, the wrapper directive allows re-wiring the DAG of jobs. For example

rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.0.8/bio/samtools_sort"

This rule refers to the wrapper "0.0.8/bio/samtools_sort" to create the output from the input. Snakemake will automatically download the wrapper from the Snakemake Wrapper Repository. Thereby, 0.0.8 can be replaced with the git version tag you want to use, or a commit id (see here). This ensures reproducibility since changes in the wrapper implementation won’t be propagated automatically to your workflow. Alternatively, e.g., for development, the wrapper directive can also point to full URLs, including URLs to local files with absolute paths file:// or relative paths file:. Examples for each wrapper can be found in the READMEs located in the wrapper subdirectories at the Snakemake Wrapper Repository.

The Snakemake Wrapper Repository is meant as a collaborative project and pull requests are very welcome.

Includes

Another Snakefile with all its rules can be included into the current:

include: "path/to/other/snakefile"

The default target rule (often called the all-rule) won’t be affected by the include. I.e. it will always be the first rule in your Snakefile, no matter how many includes you have above your first rule. Includes are relative to the directory of the Snakefile in which they occur. For example, if above Snakefile resides in the directory my/dir, then Snakemake will search for the include at my/dir/path/to/other/snakefile, regardless of the working directory.

Sub-Workflows

In addition to including rules of another workflow, Snakemake allows depending on the output of other workflows as sub-workflows. A sub-workflow is executed independently before the current workflow is executed. Thereby, Snakemake ensures that all files the current workflow depends on are created or updated if necessary. This allows creating links between otherwise separate data analyses.

subworkflow otherworkflow:
    workdir: "../path/to/otherworkflow"
    snakefile: "../path/to/otherworkflow/Snakefile"

rule a:
    input:  otherworkflow("test.txt")
    output: ...
    shell:  ...

Here, the subworkflow is named “otherworkflow” and it is located in the working directory ../path/to/otherworkflow. The snakefile is in the same directory and called Snakefile. If snakefile is not defined for the subworkflow, it is assumed to be located in the workdir location and called Snakefile, hence, above we could have left the snakefile keyword out as well. If workdir is not specified, it is assumed to be the same as the current one. Files that are output from the subworkflow that we depend on are marked with the otherworkflow function (see the input of rule a). This function automatically determines the absolute path to the file (here ../path/to/otherworkflow/test.txt).

When executing, snakemake first tries to create (or update, if necessary) test.txt (and all other possibly mentioned dependencies) by executing the subworkflow. Then the current workflow is executed. This can also happen recursively, since the subworkflow may have its own subworkflows as well.

Remote files

Remote files are supported in Snakemake versions >= 3.5.

The Snakefile supports a wrapper function, remote(), indicating a file is on a remote storage provider (this is similar to temp() or protected()). In order to use all types of remote files, the Python packages boto, moto, filechunkio, pysftp, dropbox, requests, ftputil, XRootD, and biopython must be installed.

During rule execution, a remote file (or object) specified is downloaded to the local cwd, within a sub-directory bearing the same name as the remote provider. This sub-directory naming lets you have multiple remote origins with reduced likelihood of name collisions, and allows Snakemake to easily translate remote objects to local file paths. You can think of each local remote sub-directory as a local mirror of the remote system. The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.

Snakemake includes the following remote providers, supported by the corresponding classes:

  • Amazon Simple Storage Service (AWS S3): snakemake.remote.S3
  • Google Cloud Storage (GS): snakemake.remote.GS
  • File transfer over SSH (SFTP): snakemake.remote.SFTP
  • Read-only web (HTTP[S]): snakemake.remote.HTTP
  • File transfer protocol (FTP): snakemake.remote.FTP
  • Dropbox: snakemake.remote.dropbox
  • XRootD: snakemake.remote.XRootD
  • GenBank / NCBI Entrez: snakemake.remote.NCBI

Amazon Simple Storage Service (S3)

This section describes usage of the S3 RemoteProvider, and also provides an intro to remote files and their usage.

It is important to note that you must have credentials (access_key_id and secret_access_key) which permit read/write access. If a file only serves as input to a Snakemake rule, read access is sufficient. You may specify credentials as environment variables or in the file ~/.aws/credentials, prefixed with AWS_*, as with a standard boto config. Credentials may also be explicitly listed in the Snakefile, as shown below:

For the Amazon S3 and Google Cloud Storage providers, the sub-directory used must be the bucket name.

Using remote files is easy (AWS S3 shown):

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

rule all:
    input:
        S3.remote("bucket-name/file.txt")

Expand still works as expected, just wrap the expansion:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider()

rule all:
    input:
        S3.remote(expand("bucket-name/{letter}-2.txt", letter=["A", "B", "C"]))

It is possible to use S3-compatible storage by specifying a different endpoint address as the host kwarg in the provider, as the kwargs used in instantiating the provider are passed in to boto:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET", host="mystorage.example.com")

rule all:
    input:
        S3.remote("bucket-name/file.txt")

Only remote files needed to satisfy the DAG build are downloaded for the workflow. By default, remote files are downloaded prior to rule execution and are removed locally as soon as no rules depend on them. Remote files can be explicitly kept by setting the keep_local=True keyword argument:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

rule all:
    input: S3.remote('bucket-name/prefix{split_id}.txt', keep_local=True)

If you wish to have a rule to simply download a file to a local copy, you can do so by declaring the same file path locally as is used by the remote file:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

rule all:
    input:
        S3.remote("bucket-name/out.txt")
    output:
        "bucket-name/out.txt"
    run:
        shell("cp {output[0]} ./")

In some cases the rule can use the data directly on the remote provider, in these cases stay_on_remote=True can be set to avoid downloading/uploading data unnecessarily. Additionally, if the backend supports it, any potentially corrupt output files will be removed from the remote. The default for stay_on_remote and keep_local can be configured by setting these properties on the remote provider object:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET", keep_local=True, stay_on_remote=True)

The remote provider also supports a new glob_wildcards() (see How do I run my rule on all files of a certain directory?) which acts the same as the local version of glob_wildcards(), but for remote files:

from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
S3.glob_wildcards("bucket-name/{file_prefix}.txt")

# (the result looks just as if the local glob_wildcards() function were used on a local folder called "bucket-name")

Google Cloud Storage (GS)

Using Google Cloud Storage (GS) is a simple import change, though since GS support is based on boto, GS must be accessed via Google’s “interoperable” credentials. Usage of the GS provider is the same as the S3 provider. You may specify credentials as environment variables or in the file ~/.aws/credentials, prefixed with AWS_*, as with a standard boto config, or explicitly in the Snakefile.

from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

rule all:
    input:
        GS.remote("bucket-name/file.txt")

File transfer over SSH (SFTP)

Snakemake can use files on remote servers accessible via SFTP (i.e. most *nix servers). It uses pysftp for the underlying support of SFTP, so the same connection options exist. Assuming you have SSH keys already set up for the server you are using in the Snakefile, usage is simple:

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()

rule all:
    input:
        SFTP.remote("example.com/path/to/file.bam")

The remote file addresses used must be specified with the host (domain or IP address) and the absolute path to the file on the remote server. A port may be specified if the SSH daemon on the server is listening on a port other than 22, in either the RemoteProvider or in each instance of remote():

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(port=4040)

rule all:
    input:
        SFTP.remote("example.com/path/to/file.bam")

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()

rule all:
    input:
        SFTP.remote("example.com:4040/path/to/file.bam")

The standard keyword arguments used by pysftp may be provided to the RemoteProvider to specify credentials (either password or private key):

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", private_key="/Users/myusername/.ssh/particular_id_rsa")

rule all:
    input:
        SFTP.remote("example.com/path/to/file.bam")

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")

rule all:
    input:
        SFTP.remote("example.com/path/to/file.bam")

If you share credentials between servers but connect to one on a different port, the alternate port may be specified in the remote() wrapper:

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")

rule all:
    input:
        SFTP.remote("some-example-server-1.com/path/to/file.bam"),
        SFTP.remote("some-example-server-2.com:2222/path/to/file.bam")

There is a glob_wildcards() function:

from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
SFTP.glob_wildcards("example.com/path/to/{sample}.bam")

Read-only web (HTTP[s])

Snakemake can access web resources via a read-only HTTP(S) provider. This provider can be helpful for including public web data in a workflow.

Web addresses must be specified without protocol, so if your URI looks like this:

http://server3.example.com/path/to/myfile.tar.gz

The URI used in the Snakefile must look like this:

server3.example.com/path/to/myfile.tar.gz

It is straightforward to use the HTTP provider to download a file to the cwd:

import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("www.example.com/path/to/document.pdf", keep_local=True)
    run:
        outputName = os.path.basename(input[0])
        shell("mv {input} {outputName}")

To connect on a different port, specify the port as part of the URI string:

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("www.example.com:8080/path/to/document.pdf", keep_local=True)

By default, the HTTP provider always uses HTTPS (TLS). If you need to connect to a resource with regular HTTP (no TLS), you must explicitly include insecure as a kwarg to remote():

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("www.example.com/path/to/document.pdf", insecure=True, keep_local=True)

If the URI used includes characters not permitted in a local file path, you may include them as part of the additional_request_string in the kwargs for remote(). This may also be useful for including additional parameters you do not want to be part of the local filename (since the URI string becomes the local file name).

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("example.com/query.php", additional_request_string="?range=2;3")

If the file requires authentication, you can specify a username and password for HTTP Basic Auth with the RemoteProvider, or with each instance of remote(). For other types of authentication, you can pass in a Python requests.auth object (see here) via the auth kwarg.

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider(username="myusername", password="mypassword")

rule all:
    input:
        HTTP.remote("example.com/interactive.php", keep_local=True)

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("example.com/interactive.php", username="myusername", password="mypassword", keep_local=True)

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("example.com/interactive.php", auth=requests.auth.HTTPDigestAuth("myusername", "mypassword"), keep_local=True)

Since remote servers do not present directory contents uniformly, glob_wildcards() is not supported by the HTTP provider.

File Transfer Protocol (FTP)

Snakemake can work with files stored on regular FTP. Currently supported are authenticated FTP and anonymous FTP, excluding FTP via TLS.

Usage is similar to the SFTP provider; however, the paths specified are relative to the FTP home directory (since this is typically a chroot):

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider

FTP = FTPRemoteProvider(username="myusername", password="mypassword")

rule all:
    input:
        FTP.remote("example.com/rel/path/to/file.tar.gz")

The port may be specified in either the provider, or in each instance of remote():

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider

FTP = FTPRemoteProvider(username="myusername", password="mypassword", port=2121)

rule all:
    input:
        FTP.remote("example.com/rel/path/to/file.tar.gz")

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider

FTP = FTPRemoteProvider(username="myusername", password="mypassword")

rule all:
    input:
        FTP.remote("example.com:2121/rel/path/to/file.tar.gz")

Anonymous download of FTP resources is possible:

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()

rule all:
    input:
        # only keeping the file so we can move it out to the cwd
        FTP.remote("example.com/rel/path/to/file.tar.gz", keep_local=True)
    run:
        shell("mv {input} ./")

glob_wildcards():

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")

print(FTP.glob_wildcards("example.com/somedir/{file}.txt"))

Setting immediate_close=True allows the use of a large number of remote FTP input files in a job where the endpoint server limits the number of concurrent connections. When immediate_close=True, Snakemake will terminate FTP connections after each remote file action (exists(), size(), download(), mtime(), etc.). This is in contrast to the default behavior, which caches FTP details and leaves the connection open across actions to improve performance (closing the connection upon job termination):

from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()

rule all:
    input:
        # only keep the file so we can move it out to the cwd
        # This server limits the number of concurrent connections so we need to have Snakemake close each after each FTP action.
        FTP.remote(expand("ftp.example.com/rel/path/to/{file}", file=large_list), keep_local=True, immediate_close=True)
    run:
        shell("mv {input} ./")

Dropbox

The Dropbox remote provider allows you to upload and download from your Dropbox account without having the client installed on your machine. In order to use the provider you first need to register an “app” on the Dropbox developer website, with access to the Full Dropbox. After registering, generate an OAuth2 access token. You will need the token to use the Snakemake Dropbox remote provider.

Using the Dropbox provider is straightforward:

from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")

rule all:
    input:
        DBox.remote("path/to/input.txt")

glob_wildcards() is supported:

from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")

DBox.glob_wildcards("path/to/{title}.txt")

Note that Dropbox paths are case-insensitive.

XRootD

Snakemake can be used with XRootD backed storage, provided that the Python bindings are installed. This is typically most useful when combined with the stay_on_remote flag to minimise local storage requirements. This flag can be overridden on a file-by-file basis as described in the S3 remote. Additionally, glob_wildcards() is supported:

from snakemake.remote.XRootD import RemoteProvider as XRootDRemoteProvider

XRootD = XRootDRemoteProvider(stay_on_remote=True)
file_numbers = XRootD.glob_wildcards("root://eospublic.cern.ch//eos/opendata/lhcb/MasterclassDatasets/D0lifetime/2014/mclasseventv2_D0_{n}.root")

rule all:
    input:
        XRootD.remote(expand("local_data/mclasseventv2_D0_{n}.root", n=file_numbers))

rule make_data:
    input:
        XRootD.remote("root://eospublic.cern.ch//eos/opendata/lhcb/MasterclassDatasets/D0lifetime/2014/mclasseventv2_D0_{n}.root")
    output:
        'local_data/mclasseventv2_D0_{n}.root'
    shell:
        'xrdcp {input[0]} {output[0]}'

GenBank / NCBI Entrez

Snakemake can directly source input files from GenBank and other NCBI Entrez databases if the Biopython library is installed.

from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
NCBI = NCBIRemoteProvider(email="someone@example.com") # email required by NCBI to prevent abuse

rule all:
    input:
        "size.txt"

rule download_and_count:
    input:
        NCBI.remote("KY785484.1.fasta", db="nuccore")
    output:
        "size.txt"
    run:
        shell("wc -c {input} > {output}")

The output format and source database of a record retrieved from GenBank are inferred from the file extension specified. For example, NCBI.RemoteProvider().remote("KY785484.1.fasta", db="nuccore") will download a FASTA file, while NCBI.RemoteProvider().remote("KY785484.1.gb", db="nuccore") will download a GenBank-format file. If the options are ambiguous, Snakemake will raise an exception and inform the user of possible format choices. To see the available formats, consult the Entrez EFetch documentation. To view the valid file extensions for these formats, access NCBI.RemoteProvider()._gb.valid_extensions, or instantiate an NCBI.NCBIHelper() and access NCBI.NCBIHelper().valid_extensions (this is a property).
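
For example, the valid extensions can be inspected from within a Snakefile as follows (a minimal sketch; the email address is a placeholder):

from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
NCBI = NCBIRemoteProvider(email="someone@example.com")  # email required by NCBI to prevent abuse

# print the file extensions that map to supported Entrez output formats
print(NCBI._gb.valid_extensions)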

When used in conjunction with NCBI.RemoteProvider().search(), Snakemake and NCBI.RemoteProvider().remote() can be used to find accessions by query and download them:

from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
NCBI = NCBIRemoteProvider(email="someone@example.com") # email required by NCBI to prevent abuse

# get accessions for the first 3 results in a search for full-length Zika virus genomes
# the query parameter accepts standard GenBank search syntax
query = '"Zika virus"[Organism] AND (("9000"[SLEN] : "20000"[SLEN]) AND ("2017/03/20"[PDAT] : "2017/03/24"[PDAT])) '
accessions = NCBI.search(query, retmax=3)

# give the accessions a file extension to help the RemoteProvider determine the
# proper output type.
input_files = expand("{acc}.fasta", acc=accessions)

rule all:
    input:
        "sizes.txt"

rule download_and_count:
    input:
        # Since *.fasta files could come from several different databases, specify the database here.
        # if the input files are ambiguous, the provider will alert the user with possible options
        # standard options like "seq_start" are supported
        NCBI.remote(input_files, db="nuccore", seq_start=5000)

    output:
        "sizes.txt"
    run:
        shell("wc -c {input} > sizes.txt")

Normally, all accessions for a query are returned from NCBI.RemoteProvider.search(). To truncate the results, specify retmax=<desired_number>. Standard Entrez fetch query options are supported as kwargs, and may be passed in to NCBI.RemoteProvider.remote() and NCBI.RemoteProvider.search().

Remote cross-provider transfers

It is possible to use Snakemake to transfer files between remote providers (using the local machine as an intermediary), as long as the sub-directory (bucket) names differ:

from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

GS = GSRemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")
S3 = S3RemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")

fileList, = S3.glob_wildcards("source-bucket/{file}.bam")
rule all:
    input:
        GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
rule transfer_S3_to_GS:
    input:
        S3.remote( expand("source-bucket/{file}.bam", file=fileList) )
    output:
        GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
    run:
        shell("cp -R source-bucket/ destination-bucket/")

Utils

The module snakemake.utils provides a collection of helper functions for common tasks in Snakemake workflows. Details can be found in Additional utils.

Reports

The report function provides an easy mechanism to write reports containing your results. A report is written in reStructuredText and compiled to HTML. The function allows you to embed your generated tables and plots into the HTML file. By referencing the files from your text, you can easily provide a semantic connection between them. To use this function, you need to have the docutils package installed.

from snakemake.utils import report

SOMECONSTANT = 42

rule report:
    input:  F1="someplot.pdf",
            T1="sometable.txt"
    output: html="report.html"
    run:
        report("""
        =======================
        The title of the report
        =======================

        Write your report here, explaining your results. Don't fear to use math
        it will be rendered correctly in any browser using MathJAX,
        e.g. inline :math:`\sum_{{j \in E}} t_j \leq I`,
        or even properly separated:

        .. math::

            |cq_{{0ctrl}}^i - cq_{{nt}}^i| > 0.5

        Include your files using their keyword name and an underscore: F1_, T1_.

        Access your global and local variables like within shell commands, e.g. {SOMECONSTANT}.
        """, output.html, metadata="Johannes Köster (johannes.koester@uni-due.de)", **input)

The optional metadata argument allows you to provide arbitrary additional information to the report, e.g. the author name. Unpacking the input files (**input) in the report function generates a list of keyword args that can be referenced inside the document with the mentioned underscore notation. The files will be embedded into the HTML file using data URLs, thus making the report fully portable and not dependent on your local filesystem structure.

Scripting with R

The R function allows you to use R code in your rules. It relies on rpy2:

from snakemake.utils import R

SOMECONSTANT = 42

rule:
    input:  ...
    output: ...
    run:
        R("""
        # write your R code here
        # Access any global or local variables from the Snakefile with the braces notation
        sqrt({SOMECONSTANT});
        # be sure to mask braces used in R control flow by doubling them:
        if(TRUE) {{
            # do something
        }}
        """)

If you compiled your Python installation from source, make sure that Python was built with sqlite support, which is needed for rpy2.

Workflow Distribution and Deployment

It is recommended to store each workflow in a dedicated git repository of the following structure:

├── .gitignore
├── README.md
├── LICENSE.md
├── config.yaml
├── environment.yaml
├── scripts
│   ├── __init__.py
│   ├── script1.py
│   └── script2.R
└── Snakefile

Then, a workflow can be deployed to a new system via the following steps:

# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir

# edit config and workflow as needed
vim config.yaml

# install dependencies into isolated environment
conda env create -n myworkflow --file environment.yaml

# activate environment
source activate myworkflow

# execute workflow
snakemake -n

Importantly, git branching and pull requests can be used to modify and possibly re-integrate workflows.

Integrated Package Management

With Snakemake 3.9.0 it is possible to define isolated software environments per rule. Upon execution of a workflow, the Conda package manager is used to obtain and deploy the defined software packages in the specified versions. Packages will be installed into your working directory, without requiring any admin/root privileges. Given that conda is available on your system (see Miniconda), to use the Conda integration, add the --use-conda flag to your workflow execution command, e.g. snakemake --cores 8 --use-conda. When --use-conda is activated, Snakemake will automatically create software environments for any used wrapper (see Wrappers). Further, you can manually define environments via the conda directive, e.g.:

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"

with the following environment definition:

channels:
 - r
dependencies:
 - r=3.3.1
 - r-ggplot2=2.1.0

Snakemake will store the environment persistently in .snakemake/conda/$hash with $hash being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected. Note that you need to clean up environments manually for now. However, in many cases they are lightweight and consist of symlinks to your central conda installation.

Sustainable and reproducible archiving

With Snakemake 3.10.0 it is possible to archive a workflow into a tarball (.tar, .tar.gz, .tar.bz2, .tar.xz), via

snakemake --archive my-workflow.tar.gz

If the above layout is followed, this will archive any code and config files that are under git version control. Further, all input files will be included in the archive. Finally, the software packages of each defined Conda environment are included. This results in a self-contained workflow archive that can be re-executed on a vanilla machine that only has Conda and Snakemake installed, via

tar -xf my-workflow.tar.gz
snakemake -n

Note that the archive is platform specific. For example, if created on Linux, it will run on any Linux distribution at least as new as the minimum version supported by the Conda packages used at the time of archiving (e.g. CentOS 6).

A useful pattern when publishing data analyses is to create such an archive, upload it to Zenodo and thereby obtain a DOI. Then, the DOI can be cited in manuscripts, and readers are able to download and reproduce the data analysis at any time in the future.

The Snakemake API

snakemake.snakemake(snakefile, listrules=False, list_target_rules=False, cores=1, nodes=1, local_cores=1, resources={}, config={}, configfile=None, config_args=None, workdir=None, targets=None, dryrun=False, touch=False, forcetargets=False, forceall=False, forcerun=[], until=[], omit_from=[], prioritytargets=[], stats=None, printreason=False, printshellcmds=False, debug_dag=False, printdag=False, printrulegraph=False, printd3dag=False, nocolor=False, quiet=False, keepgoing=False, cluster=None, cluster_config=None, cluster_sync=None, drmaa=None, drmaa_log_dir=None, jobname='snakejob.{rulename}.{jobid}.sh', immediate_submit=False, standalone=False, ignore_ambiguity=False, snakemakepath=None, lock=True, unlock=False, cleanup_metadata=None, force_incomplete=False, ignore_incomplete=False, list_version_changes=False, list_code_changes=False, list_input_changes=False, list_params_changes=False, list_resources=False, summary=False, archive=None, detailed_summary=False, latency_wait=3, benchmark_repeats=1, wait_for_files=None, print_compilation=False, debug=False, notemp=False, keep_remote_local=False, nodeps=False, keep_target_files=False, keep_shadow=False, allowed_rules=None, jobscript=None, timestamp=False, greediness=None, no_hooks=False, overwrite_shellcmd=None, updated_files=None, log_handler=None, keep_logger=False, max_jobs_per_second=None, restart_times=0, verbose=False, force_use_threads=False, use_conda=False, conda_prefix=None, mode=0, wrapper_prefix=None, default_remote_provider=None, default_remote_prefix='')[source]

Run snakemake on a given snakefile.

This function provides access to the whole snakemake functionality. It is not thread-safe.

Parameters:
  • snakefile (str) – the path to the snakefile
  • listrules (bool) – list rules (default False)
  • list_target_rules (bool) – list target rules (default False)
  • cores (int) – the number of provided cores (ignored when using cluster support) (default 1)
  • nodes (int) – the number of provided cluster nodes (ignored without cluster support) (default 1)
  • local_cores (int) – the number of provided local cores if in cluster mode (ignored without cluster support) (default 1)
  • resources (dict) – provided resources, a dictionary assigning integers to resource names, e.g. {"gpu": 1, "io": 5} (default {})
  • config (dict) – override values for workflow config
  • workdir (str) – path to working directory (default None)
  • targets (list) – list of targets, e.g. rule or file names (default None)
  • dryrun (bool) – only dry-run the workflow (default False)
  • touch (bool) – only touch all output files if present (default False)
  • forcetargets (bool) – force given targets to be re-created (default False)
  • forceall (bool) – force all output files to be re-created (default False)
  • forcerun (list) – list of files and rules that shall be re-created/re-executed (default [])
  • prioritytargets (list) – list of targets that shall be run with maximum priority (default [])
  • stats (str) – path to file that shall contain stats about the workflow execution (default None)
  • printreason (bool) – print the reason for the execution of each job (default False)
  • printshellcmds (bool) – print the shell command of each job (default False)
  • printdag (bool) – print the dag in the graphviz dot language (default False)
  • printrulegraph (bool) – print the graph of rules in the graphviz dot language (default False)
  • printd3dag (bool) – print a D3.js compatible JSON representation of the DAG (default False)
  • nocolor (bool) – do not print colored output (default False)
  • quiet (bool) – do not print any default job information (default False)
  • keepgoing (bool) – keep going upon errors (default False)
  • cluster (str) – submission command of a cluster or batch system to use, e.g. qsub (default None)
  • cluster_config (str,list) – configuration file for cluster options, or list thereof (default None)
  • cluster_sync (str) – blocking cluster submission command (like SGE ‘qsub -sync y’) (default None)
  • drmaa (str) – if not None use DRMAA for cluster support, str specifies native args passed to the cluster when submitting a job
  • drmaa_log_dir (str) – the path to stdout and stderr output of DRMAA jobs (default None)
  • jobname (str) – naming scheme for cluster job scripts (default “snakejob.{rulename}.{jobid}.sh”)
  • immediate_submit (bool) – immediately submit all cluster jobs, regardless of dependencies (default False)
  • standalone (bool) – kill all processes very rudely in case of failure (do not use this if you use this API) (default False) (deprecated)
  • ignore_ambiguity (bool) – ignore ambiguous rules and always take the first possible one (default False)
  • snakemakepath (str) – Deprecated parameter whose value is ignored. Do not use.
  • lock (bool) – lock the working directory when executing the workflow (default True)
  • unlock (bool) – just unlock the working directory (default False)
  • cleanup_metadata (bool) – just cleanup metadata of output files (default False)
  • force_incomplete (bool) – force the re-creation of incomplete files (default False)
  • ignore_incomplete (bool) – ignore incomplete files (default False)
  • list_version_changes (bool) – list output files with changed rule version (default False)
  • list_code_changes (bool) – list output files with changed rule code (default False)
  • list_input_changes (bool) – list output files with changed input files (default False)
  • list_params_changes (bool) – list output files with changed params (default False)
  • summary (bool) – list summary of all output files and their status (default False)
  • archive (str) – archive workflow into the given tarball
  • latency_wait (int) – how many seconds to wait for an output file to appear after the execution of a job, e.g. to handle filesystem latency (default 3)
  • benchmark_repeats (int) – number of repeated runs of a job if declared for benchmarking (default 1)
  • wait_for_files (list) – wait for given files to be present before executing the workflow
  • list_resources (bool) – list resources used in the workflow (default False)
  • summary – list summary of all output files and their status (default False). If no option is specified, a basic summary will be output. If ‘detailed’ is added as an option, e.g. --summary detailed, extra info about the input and shell commands will be included
  • detailed_summary (bool) – list summary of all input and output files and their status (default False)
  • print_compilation (bool) – print the compilation of the snakefile (default False)
  • debug (bool) – allow to use the debugger within rules
  • notemp (bool) – ignore temp file flags, e.g. do not delete output files marked as temp after use (default False)
  • keep_remote_local (bool) – keep local copies of remote files (default False)
  • nodeps (bool) – ignore dependencies (default False)
  • keep_target_files (bool) – Do not adjust the paths of given target files relative to the working directory.
  • keep_shadow (bool) – Do not delete the shadow directory on snakemake startup.
  • allowed_rules (set) – Restrict allowed rules to the given set. If None or empty, all rules are used.
  • jobscript (str) – path to a custom shell script template for cluster jobs (default None)
  • timestamp (bool) – print time stamps in front of any output (default False)
  • greediness (float) – set the greediness of scheduling. This value between 0 and 1 determines how careful jobs are selected for execution. The default value (0.5 if prioritytargets are used, 1.0 else) provides the best speed and still acceptable scheduling quality.
  • overwrite_shellcmd (str) – a shell command that shall be executed instead of those given in the workflow. This is for debugging purposes only.
  • updated_files (list) – a list that will be filled with the files that are updated or created during the workflow execution
  • verbose (bool) – show additional debug output (default False)
  • max_jobs_per_second (int) – maximal number of cluster/drmaa jobs per second, None to impose no limit (default None)
  • restart_times (int) – number of times to restart failing jobs (default 0)
  • force_use_threads – whether to force use of threads over processes. helpful if shared memory is full or unavailable (default False)
  • use_conda (bool) – create conda environments for each job (defined with conda directive of rules)
  • conda_prefix (str) – the directories in which conda environments will be created (default None)
  • mode (snakemake.common.Mode) – Execution mode
  • wrapper_prefix (str) – Prefix for wrapper script URLs (default None)
  • default_remote_provider (str) – Default remote provider to use instead of local files (S3, GS)
  • default_remote_prefix (str) – Prefix for default remote provider (e.g. name of the bucket).
  • log_handler (function) –

    redirect snakemake output to this custom log handler, a function that takes a log message dictionary (see below) as its only argument (default None). The log message dictionary for the log handler has the following entries:

    level: the log level (“info”, “error”, “debug”, “progress”, “job_info”)

    for level=”info”, “error” or “debug”:
        msg: the log message

    for level=”progress”:
        done: number of already executed jobs
        total: number of total jobs

    for level=”job_info”:
        input: list of input files of a job
        output: list of output files of a job
        log: path to log file of a job
        local: whether a job is executed locally (i.e. ignoring cluster)
        msg: the job message
        reason: the job reason
        priority: the job priority
        threads: the threads of the job
Returns:

True if workflow execution was successful.

Return type:

bool
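
For illustration, a minimal sketch of a custom log handler passed to this function could look as follows (the handler name and the handled levels are merely examples):

from snakemake import snakemake

def my_log_handler(msg):
    # msg is a dictionary with the entries described above; "level" is always present
    if msg["level"] == "progress":
        print("{done} of {total} jobs done".format(**msg))
    elif msg["level"] == "error":
        print("error:", msg["msg"])

success = snakemake("Snakefile", cores=4, log_handler=my_log_handler)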

Additional utils

class snakemake.utils.AlwaysQuotedFormatter(quote_func=<function quote>, *args, **kwargs)[source]

Subclass of QuotedFormatter that always quotes.

Usage is identical to QuotedFormatter, except that it always acts like “q” was appended to the format spec.

class snakemake.utils.QuotedFormatter(quote_func=<function quote>, *args, **kwargs)[source]

Subclass of string.Formatter that supports quoting.

Using this formatter, any field can be quoted after formatting by appending “q” to its format string. By default, shell quoting is performed using “shlex.quote”, but you can pass a different quote_func to the constructor. The quote_func simply has to take a string argument and return a new string representing the quoted form of the input string.

Note that if an element after formatting is the empty string, it will not be quoted.

snakemake.utils.R(code)[source]

Execute R code

This function executes the R code given as a string. The function requires rpy2 to be installed.

Parameters:code (str) – R code to be executed
class snakemake.utils.SequenceFormatter(separator=' ', element_formatter=<string.Formatter object>, *args, **kwargs)[source]

string.Formatter subclass with special behavior for sequences.

This class delegates formatting of individual elements to another formatter object. Non-list objects are formatted by calling the delegate formatter’s “format_field” method. List-like objects (list, tuple, set, frozenset) are formatted by formatting each element of the list according to the specified format spec using the delegate formatter and then joining the resulting strings with a separator (space by default).

format_element(elem, format_spec)[source]

Format a single element

For sequences, this is called once for each element in a sequence. For anything else, it is called on the entire object. It is intended to be overridden in subclasses.

snakemake.utils.available_cpu_count()[source]

Return the number of available virtual or physical CPUs on this system. The number of available CPUs can be smaller than the total number of CPUs when the cpuset(7) mechanism is in use, as is the case on some cluster systems.

Adapted from http://stackoverflow.com/a/1006301/715090

snakemake.utils.format(_pattern, *args, stepout=1, _quote_all=False, **kwargs)[source]

Format a pattern in Snakemake style.

This means that keywords embedded in braces are replaced by any variable values that are available in the current namespace.

snakemake.utils.linecount(filename)[source]

Return the number of lines of given file.

Parameters:filename (str) – the path to the file
snakemake.utils.listfiles(pattern, restriction=None, omit_value=None)[source]

Yield a tuple of existing filepaths for the given pattern.

Wildcard values are yielded as the second tuple item.

Parameters:
  • pattern (str) – a filepattern. Wildcards are specified in snakemake syntax, e.g. “{id}.txt”
  • restriction (dict) – restrict to wildcard values given in this dictionary
  • omit_value (str) – wildcard value to omit
Yields:

tuple – The next file matching the pattern, and the corresponding wildcards object
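
For example (a minimal sketch using a placeholder pattern):

from snakemake.utils import listfiles

# print every existing file matching the pattern together with its wildcard value
for filename, wildcards in listfiles("thedir/{id}.fastq"):
    print(filename, wildcards.id)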

snakemake.utils.makedirs(dirnames)[source]

Recursively create the given directory or directories without reporting errors if they already exist.

snakemake.utils.min_version(version)[source]

Require minimum snakemake version, raise workflow error if not met.

snakemake.utils.read_job_properties(jobscript, prefix='# properties', pattern=re.compile('# properties = (.*)'))[source]

Read the job properties defined in a snakemake jobscript.

This function is a helper for writing custom wrappers for the snakemake --cluster functionality. Applying this function to a jobscript will return a dict containing information about the job.
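
A minimal sketch of such a wrapper could look as follows (the qsub invocation and its options are placeholders for your cluster system):

#!/usr/bin/env python3
import sys
from subprocess import call

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# use the job properties, e.g. the requested number of threads, to build the submission command
threads = job_properties.get("threads", 1)
call("qsub -pe smp {threads} {jobscript}".format(threads=threads, jobscript=jobscript), shell=True)

Such a wrapper would then be passed to Snakemake, e.g. via snakemake --cluster ./submit.py.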

snakemake.utils.report(text, path, stylesheet='/home/docs/checkouts/readthedocs.org/user_builds/snakemake/checkouts/v3.13.3/snakemake/report.css', defaultenc='utf8', template=None, metadata=None, **files)[source]

Create an HTML report using python docutils.

Attention: This function needs Python docutils to be installed for the python installation you use with Snakemake.

All keywords not listed below are interpreted as paths to files that shall be embedded into the document. The keywords will be available as link targets in the text. E.g. append a file as keyword arg via F1=input[0] and put a download link in the text like this:

report('''
==============
Report for ...
==============

Some text. A link to an embedded file: F1_.

Further text.
''', outputpath, F1=input[0])

Instead of specifying each file as a keyword arg, you can also expand
the input of your rule if it is completely named, e.g.:

report('''
Some text...
''', outputpath, **input)
Parameters:
  • text (str) – The “restructured text” as it is expected by python docutils.
  • path (str) – The path to the desired output file
  • stylesheet (str) – An optional path to a css file that defines the style of the document. This defaults to <your snakemake install>/report.css. Use the default to get a hint how to create your own.
  • defaultenc (str) – The encoding that is reported to the browser for embedded text files, defaults to utf8.
  • template (str) – An optional path to a docutils HTML template.
  • metadata (str) – E.g. an optional author name or email address.
snakemake.utils.set_protected_output(*rules)[source]

Set the output of rules to protected

snakemake.utils.set_temporary_output(*rules)[source]

Set the output of rules to temporary

snakemake.utils.simplify_path(path)[source]

Return a simplified version of the given path.

snakemake.utils.update_config(config, overwrite_config)[source]

Recursively update dictionary config with overwrite_config.

See http://stackoverflow.com/questions/3232943/update-value-of-a-nested-dictionary-of-varying-depth for details.

Parameters:
  • config (dict) – dictionary to update
  • overwrite_config (dict) – dictionary whose items will overwrite those in config
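
For example (a minimal sketch with placeholder dictionaries):

from snakemake.utils import update_config

config = {"threads": 4, "params": {"a": 1}}
overwrite_config = {"params": {"b": 2}}

# config is updated in place; nested dictionaries are merged rather than replaced
update_config(config, overwrite_config)
# config is now {"threads": 4, "params": {"a": 1, "b": 2}}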

Citing and Citations

This section gives instructions on how to cite Snakemake and lists citing articles.

Citing Snakemake

When using Snakemake for a publication, please cite the following article in your paper:

Project Pages

If you publish a Snakemake workflow, consider adding this badge to your project page:

https://img.shields.io/badge/snakemake-≥3.5.2-brightgreen.svg?style=flat-square

The markdown syntax is

[![Snakemake](https://img.shields.io/badge/snakemake-≥3.5.2-brightgreen.svg?style=flat-square)](https://snakemake.bitbucket.io)

Replace the 3.5.2 with the minimum required Snakemake version. You can also change the style.
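
To enforce the minimum Snakemake version from within the workflow itself, you can additionally use min_version (see Additional utils), e.g.:

from snakemake.utils import min_version

# raise a workflow error if the installed Snakemake is older than the badge's minimum version
min_version("3.5.2")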

Frequently Asked Questions

Contents

What is the key idea of Snakemake workflows?

The key idea is very similar to GNU Make. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake:

Snakemake idea

When you start using Snakemake, please make sure to walk through the official tutorial. It is crucial to understand how to properly use the system.

My shell command fails with errors about an “unbound variable”, what’s wrong?

This often happens when calling virtual environments from within Snakemake. Snakemake uses bash strict mode to ensure e.g. proper error behavior of shell scripts. Unfortunately, virtualenv and some other tools violate bash strict mode. The quick fix for virtualenv is to temporarily deactivate the check for unbound variables:

set +u; source /path/to/venv/bin/activate; set -u

For more details on bash strict mode, see here.

How do I run my rule on all files of a certain directory?

In Snakemake, similar to GNU Make, the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, ..., and you want to produce files 1.bam, 2.bam, 3.bam, ... you should specify these as target files, using the ids 1,2,3,.... You could end up with at least two rules like this (or any number of intermediate steps):

IDS = "1 2 3 ...".split() # the list of desired ids

# a pseudo-rule that collects the target files
rule all:
    input:  expand("otherdir/{id}.bam", id=IDS)

# a general rule using wildcards that does the work
rule:
    input:  "thedir/{id}.fastq"
    output: "otherdir/{id}.bam"
    shell:  "..."

Snakemake will then go down the line and determine which files it needs from your initial directory.

In order to infer the IDs from present files, Snakemake provides the glob_wildcards function, e.g.

IDS, = glob_wildcards("thedir/{id}.fastq")

The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, that is the list of values for the wildcard {id}.
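
If the pattern contains several wildcards, the named tuple contains one list of values per wildcard; for example (a hypothetical pattern with two wildcards):

# returns one list of matched values per wildcard in the pattern
SAMPLES, LANES = glob_wildcards("thedir/{sample}_L{lane}.fastq")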

Snakemake complains about a cyclic dependency or a PeriodicWildcardError. What can I do?

One limitation of Snakemake is that the graph of jobs has to be acyclic (similar to GNU Make), i.e. no path in the graph may be a cycle. Although you might have considered this when designing your workflow, Snakemake sometimes runs into situations where a cyclic dependency cannot be avoided without further information, although the solution seems obvious to the developer. Consider the following example:

rule all:
    input:
        "a"

rule unzip:
    input:
        "{sample}.tar.gz"
    output:
        "{sample}"
    shell:
        "tar -xf {input}"

If this workflow is executed with

snakemake -n

two things may happen.

  1. If the file a.tar.gz is present in the filesystem, Snakemake will propose the following (expected and correct) plan:

    rule unzip:
        input: a.tar.gz
        output: a
        wildcards: sample=a
    localrule all:
        input: a
    Job counts:
        count   jobs
        1       all
        1       unzip
        2
    
  2. If the file a.tar.gz is not present and cannot be created by any rule other than rule unzip, Snakemake will try to apply rule unzip again, this time with {sample}=a.tar.gz, and so on recursively without end. Snakemake detects this case and produces a PeriodicWildcardError.

In summary, a PeriodicWildcardError hints at a problem where a rule or a set of rules can be applied to create its own input. If you are lucky, Snakemake can avoid the error by stopping the recursion as soon as a file exists in the filesystem. Importantly, however, bugs upstream of that rule can also manifest as a PeriodicWildcardError, although in reality just a file is missing or named differently. In such cases, it is best to restrict the wildcard of the output file(s) (for example via a regular expression, as sketched below), or follow the general rule of putting output files of different rules into unique subfolders of your working directory. This way, you can discover the true source of your error.
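
As a minimal sketch, the cyclic case above can be avoided by restricting the wildcard with a regular expression directly in the output file (the chosen expression is just one possibility):

rule unzip:
    input:
        "{sample}.tar.gz"
    output:
        # the constraint prevents {sample} from matching names containing a dot,
        # so the rule can no longer be applied to its own input (e.g. a.tar.gz)
        "{sample,[^.]+}"
    shell:
        "tar -xf {input}"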

Is it possible to pass variable values to the workflow via the command line?

Yes, this is possible. Have a look at Configuration. Previously, it was necessary to use environment variables for this, e.g. write

$ SAMPLES="1 2 3 4 5" snakemake

and have in the Snakefile some Python code that reads this environment variable, i.e.

SAMPLES = os.environ.get("SAMPLES", "10 20").split()

I get a NameError with my shell command. Are braces unsupported?

You can use the entire Python format minilanguage in shell commands. Braces in shell commands that are not intended to insert variable values thus have to be escaped by doubling them:

...
shell: "awk '{{print $1}}' {input}"

Here, the double braces are escapes, i.e. single braces will remain in the final command. In contrast, {input} is replaced with an input filename.

How do I incorporate files that do not follow a consistent naming scheme?

The best solution is to have a dictionary that translates a sample id to the inconsistently named files and use a function (see Functions as Input Files) to provide an input file like this:

FILENAME = dict(...)  # map sample ids to the irregular filenames here

rule:
    # use a function as input to delegate to the correct filename
    input: lambda wildcards: FILENAME[wildcards.sample]
    output: "somefolder/{sample}.csv"
    shell: ...

How do I force Snakemake to rerun all jobs from the rule I just edited?

This can be done by invoking Snakemake with the --forcerun or -R flag, followed by the rules that should be re-executed:

$ snakemake -R somerule

This will cause Snakemake to re-run all jobs of that rule and everything downstream (i.e. everything that directly or indirectly depends on the rule’s output).

How do I enable syntax highlighting in Vim for Snakefiles?

A vim syntax highlighting definition for Snakemake is available here. You can copy that file to the $HOME/.vim/syntax directory and add

au BufNewFile,BufRead Snakefile set syntax=snakemake
au BufNewFile,BufRead *.rules set syntax=snakemake
au BufNewFile,BufRead *.snakefile set syntax=snakemake
au BufNewFile,BufRead *.snake set syntax=snakemake

to your $HOME/.vimrc file. Highlighting can be forced in a vim session with :set syntax=snakemake.

I want to import some helper functions from another python file. Is that possible?

Yes, from version 2.4.8 on, Snakemake allows you to import Python modules (and also plain Python files) from the same directory in which the Snakefile resides.

How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?

This can be achieved by submitting the main Snakemake invocation as a job to the cluster. If it is not allowed to submit a job from a non-head cluster node, you can provide a submit command that goes back to the head node before submitting:

qsub -N PIPE -cwd -j yes python snakemake --cluster "ssh user@headnode_address 'qsub -N pipe_task -j yes -cwd -S /bin/sh ' " -j

This hint was provided by Inti Pedroso.

I would like to receive a mail upon snakemake exit. How can this be achieved?

On unix, you can make use of the commonly pre-installed mail command:

snakemake 2> snakemake.log
mail -s "snakemake finished" youremail@provider.com < snakemake.log

In case your administrator does not provide you with a proper configuration of the sendmail framework, you can configure mail to work e.g. via Gmail (see here).

I want to pass variables between rules. Is that possible?

Because of the cluster support and the ability to resume a workflow where you stopped last time, Snakemake should generally be used in a way that stores information in the output files of your jobs. Sometimes, though, it might be handy to have a kind of persistent storage for simple values between jobs and rules. Using plain Python objects like a global dict for this will not work, as each job is run in a separate process by Snakemake. What helps here is the PersistentDict from the pytools package. Here is an example of a Snakemake workflow using this facility:

from pytools.persistent_dict import PersistentDict

storage = PersistentDict("mystorage")

rule a:
    input: "test.in"
    output: "test.out"
    run:
        myvar = storage.fetch("myvar")
        # do stuff

rule b:
    output: temp("test.in")
    run:
        storage.store("myvar", 3.14)

Here, the output of rule b has to be marked as temp in order to ensure that myvar is stored in each run of the workflow, as rule a relies on it. In other words, the PersistentDict is persistent between the job processes, but not between different runs of this workflow. If you need to conserve information between different runs, use output files for it.

Why do my global variables behave strangely when I run my job on a cluster?

This is closely related to the question above. Any Python code you put outside of a rule definition is normally run once before Snakemake starts to process rules, but on a cluster it is re-run again for each submitted job, because Snakemake implements jobs by re-running itself.

Consider the following...

from mydatabase import get_connection

dbh = get_connection()
latest_parameters = dbh.get_params().latest()

rule a:
    input: "{foo}.in"
    output: "{foo}.out"
    shell: "do_op -params {latest_parameters}  {input} {output}"

When run on a single machine, you will see a single connection to your database and get a single value for latest_parameters for the duration of the run. On a cluster, you will see a connection attempt from the cluster node for each submitted job, regardless of whether it happens to involve rule a or not, and the parameters will be recalculated for each job.

I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?

You can set a prefix that will be prepended to all shell commands by adding, e.g.,

shell.prefix("set -o pipefail; ")

to the top of your Snakefile. Make sure that the prefix ends with a semicolon, such that it will not interfere with the subsequent commands. To simulate a bash login shell, you can do the following:

shell.executable("/bin/bash")
shell.prefix("source ~/.bashrc; ")

Some command line arguments like --config cannot be followed by rule or file targets. Is that intended behavior?

This is a limitation of the argparse module, which cannot distinguish between a further value for --config and a target. As a solution, you can put --config at the end of your invocation, or prepend the target with a separating --, i.e.

$ snakemake --config foo=bar -- mytarget
$ snakemake mytarget --config foo=bar

How do I make my rule fail if an output file is empty?

Snakemake expects shell commands to behave properly, meaning that failures should cause an exit status other than zero. If a command exits with status zero, Snakemake assumes everything worked fine, even if output files are empty. This is because empty output files are also a reasonable tool to indicate progress where no real output was produced. However, sometimes you will have to deal with tools that do not properly report their failure with an exit status. Here, the recommended way is to use bash to check for non-empty output files, e.g.:

rule:
    input:  ...
    output: "my/output/file.txt"
    shell:  "somecommand {input} {output} && [[ -s {output} ]]"

How does Snakemake lock the working directory?

By default, Snakemake locks the working directory by output and input files. Two Snakemake instances that want to create the same output file cannot run at the same time, while two instances creating disjoint sets of output files can. With the command line option --nolock, you can disable this mechanism at your own risk. With --unlock, you can remove a stale lock. Stale locks can appear if your machine is powered off with a running Snakemake instance.

Snakemake does not trigger re-runs if I add additional input files. What can I do?

Snakemake has a kind of “lazy” policy regarding added input files: if their modification date is older than that of the output files, they do not trigger a re-run. One reason is that the information about what to do cannot be inferred from the input and output files alone; additional information about the last run would need to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:

$ snakemake -n -R `snakemake --list-input-changes`

Here, snakemake --list-input-changes returns the list of output files with changed input files, which is fed into -R to trigger a re-run.

How do I trigger re-runs for rules with updated code or parameters?

Similar to the solution above, you can use

$ snakemake -n -R `snakemake --list-params-changes`

and

$ snakemake -n -R `snakemake --list-code-changes`

Again, the list commands in backticks return the list of output files with changes, which are fed into -R to trigger a re-run.

How do I remove all files created by snakemake, i.e. like make clean?

To remove all files created by snakemake as output files to start from scratch, you can use

rm $(snakemake --summary | tail -n+2 | cut -f1)

Why can’t I use the conda directive with a run block?

The run block of a rule (see Rules) has access to anything defined in the Snakefile, outside of the rule. Hence, it has to share the conda environment with the main Snakemake process. To avoid confusion we therefore disallow the conda directive together with the run block. It is recommended to use the script directive instead (see External scripts).

My workflow is very large, how do I stop Snakemake from printing all this rule/job information in a dry-run?

Indeed, the information for each individual job can slow down a dryrun if there are tens of thousands of jobs. If you are just interested in the final summary, you can use the --quiet flag to suppress this.

$ snakemake -n --quiet

Git is messing up the modification times of my input files, what can I do?

When you checkout a git repository, the modification times of updated files are set to the time of the checkout. If you rely on these files as input and output files in your workflow, this can cause trouble. For example, Snakemake could think that a certain (git-tracked) output has to be re-executed, just because its input has been checked out a bit later. In such cases, it is advisable to set the file modification dates to the last commit date after an update has been pulled. See here for a solution to achieve this.

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://bitbucket.org/snakemake/snakemake/issues

If you are reporting a bug, please include:

  • Your operating system name and version.
  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.
Fix Bugs

Look through the Bitbucket issues for bugs. If you want to start working on a bug, please write a short message on the issue tracker to prevent duplicate work.

Implement Features

Look through the Bitbucket issues for features. If you want to start working on an issue, please write a short message on the issue tracker to prevent duplicate work.

Write Documentation

Snakemake could always use more documentation, whether as part of the official Snakemake docs, in docstrings, or even on the web in blog posts, articles, and such.

Snakemake uses Sphinx for the user manual (that you are currently reading). See project_info-doc_guidelines on how the documentation reStructuredText is used.

Submit Feedback

The best way to send feedback is to file an issue at https://bitbucket.org/snakemake/snakemake/issues

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Pull Request Guidelines

To update the documentation, fix bugs or add new features, you need to create a Pull Request (PR). A PR is a change you make to your local copy of the code for us to review and potentially integrate into the code base.

To create a Pull Request you need to do these steps:

  1. Create a Bitbucket account.
  2. Fork the repository (see the left sidebar on the main Bitbucket Snakemake page).
  3. Clone your fork (go to your copy of the repository at https://bitbucket.org/<your_username>/snakemake and click clone. This gives you the command you need to paste into your shell).
  4. Go to the snakemake folder with cd snakemake.
  5. Create a new branch with git checkout -b <descriptive_branch_name>.
  6. Make your changes to the code or documentation.
  7. Run git add . to add all the changed files to the commit (to see what files will be added you can run git add . --dry-run).
  8. To commit the added files, use git commit. (This will open a command line editor to write a commit message. The message should have a descriptive header of at most 80 characters, followed by an empty line, and then a description of what you did and why. To use your command line text editor of choice, use (for example) export GIT_EDITOR=vim before running git commit).
  9. Now you can push your changes to your Bitbucket copy of Snakemake by running git push origin <descriptive_branch_name>.
  10. If you now go to the webpage for your Bitbucket copy of Snakemake you should see a link in the sidebar called “Create Pull Request”.
  11. Now you need to choose your PR from the menu and click the “Create pull request” button. Be sure to change the pull request target branch to <descriptive_branch_name>!

If you want to create more pull requests, first run git checkout master and then start at step 5 with a new branch name.

Feel free to ask questions about this if you want to contribute to Snakemake :)

Testing Guidelines

To ensure that you do not introduce bugs into Snakemake, you should test your code thoroughly.

To have integration tests run automatically when committing code changes to Bitbucket, you need to sign up on wercker.com and register a user.

The easiest way to run your development version of Snakemake is perhaps to go to the folder containing your local copy of Snakemake and call

conda env create -f environment.yml -n snakemake-testing
source activate snakemake-testing
pip install -e .

This will make your development version of Snakemake the one called when running snakemake. You do not need to re-run this command each time you make code changes.

From the base snakemake folder you call python setup.py nosetests to run all the tests. (If it complains that you do not have nose installed, which is the testing framework we use, you can simply install it by running pip install nose.)

If you introduce a new feature you should add a new test to the tests directory. See the folder for examples.

Documentation Guidelines

For the documentation, please adhere to the following guidelines:

  • Put each sentence on its own line; this makes tracking changes through Git SCM easier.
  • Provide hyperlink targets, at least for the first two section levels. For this, use the format <document_part>-<section_name>, e.g., project_info-doc_guidelines.
  • Use the section structure from below.
.. _document_part-heading_1:

=========
Heading 1
=========


.. _document_part-heading_2:

---------
Heading 2
---------


.. _document_part-heading_3:

Heading 3
=========


.. _document_part-heading_4:

Heading 4
---------


.. _document_part-heading_5:

Heading 5
~~~~~~~~~


.. _document_part-heading_6:

Heading 6
:::::::::

Documentation Setup

For building the documentation, you have to install Sphinx. If you have already installed Conda, all you need to do is to create a Snakemake development environment via

$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake
$ conda env create -f environment.yml -n snakemake

Then, the docs can be built with

$ source activate snakemake
$ cd docs
$ make html
$ make clean && make html  # force rebuild

Alternatively, you can use virtualenv. The following assumes you have a working Python 3 setup.

$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake/docs
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install --upgrade -r requirements.txt

Afterwards, the docs can be built with

$ source .venv/bin/activate
$ make html  # rebuild for changed files only
$ make clean && make html  # force rebuild

Credits

Development Lead

  • Johannes Köster

Development Team

  • Christopher Tomkins-Tinch
  • David Koppstein
  • Tim Booth
  • Manuel Holtgrewe
  • Christian Arnold
  • Wibowo Arindrarto

Contributors

In alphabetical order

  • Andreas Wilm
  • Anthony Underwood
  • Ryan Dale
  • David Alexander
  • Elias Kuthe
  • Elmar Pruesse
  • Hyeshik Chang
  • Jay Hesselberth
  • Jesper Foldager
  • John Huddleston
  • Joona Lehtomäki
  • Karel Brinda
  • Karl Gutwin
  • Kemal Eren
  • Kostis Anagnostopoulos
  • Kyle A. Beauchamp
  • Kyle Meyer
  • Lance Parsons
  • Manuel Holtgrewe
  • Marcel Martin
  • Matthew Shirley
  • Mattias Franberg
  • Matt Shirley
  • Paul Moore
  • percyfal
  • Per Unneberg
  • Ryan C. Thompson
  • Sean Davis
  • Simon Ye
  • Tobias Marschall
  • Willem Ligtenberg

Change Log


## [3.13.3] - 2017-06-23
### Changed
- Fix a follow-up bug in Namedlist where a single item was not returned as a string.

## [3.13.2] - 2017-06-20
### Changed
- The --wrapper-prefix flag now also affects where the corresponding environment definition is fetched from.
- Fix bug where empty output file list was recognized as containing duplicates (issue #574).


## [3.13.1] - 2017-06-20
### Changed
- Fix --conda-prefix to be passed to all jobs.
- Fix cleanup issue with scripts that fail to download.

## [3.13.0] - 2017-06-12
### Added
- An NCBI remote provider. By this, you can seamlessly integrate any NCBI resource (reference genome, gene/protein sequences, ...) as an input file.
### Changed
- Snakemake now detects if automatically generated conda environments have to be recreated because the workflow has been moved to a new path.
- Remote functionality has been made more robust, in particular to avoid race conditions.
- `--config` parameter evaluation has been fixed for non-string types.
- The Snakemake docker container is now based on the official debian image.

## [3.12.0] - 2017-05-09
### Added
- Support for RMarkdown (.Rmd) in script directives.
- New option --debug-dag that prints all decisions while building the DAG of jobs. This helps to debug problems like cycles or unexpected MissingInputExceptions.
- New option --conda-prefix to specify the place where conda environments are stored.

### Changed
- Benchmark files now also include the maximal RSS and VMS size of the Snakemake process and all sub processes.
- Speedup conda environment creation.
- Allow specification of the DRMAA log dir.
- Pass cluster config to subworkflow.


## [3.11.2] - 2017-03-15
### Changed
- Follow-up fix for the handling of local URIs with the wrapper directive.


## [3.11.1] - 2017-03-14
### Changed
- --touch ignores missing files
- Fixed handling of local URIs with the wrapper directive.


## [3.11.0] - 2017-03-08
### Added
- Param functions can now also refer to threads.
### Changed
- Improved tutorial and docs.
- Made conda integration more robust.
- None is converted to NULL in R scripts.


## [3.10.2] - 2017-02-28
### Changed
- Improved config file handling and merging.
- Output files can be referred to in params functions (e.g., lambda wildcards, output: ...).
- Improved conda-environment creation.
- Jobs are cached, leading to reduced memory footprint.
- Fixed subworkflow handling in input functions.

## [3.10.0] - 2017-01-18
### Added
- Workflows can now be archived to a tarball with `snakemake --archive my-workflow.tar.gz`. The archive contains all input files, the source code versioned with git, and all software packages that are defined via conda environments. Hence, the archive allows a workflow to be fully reproduced on a different machine. Such an archive can be uploaded to Zenodo, so that your workflow is preserved in a self-contained, executable way for the future.
### Changed
- Improved logging.
- Reduced memory footprint.
- Added a flag to automatically unpack the output of input functions.
- Improved handling of HTTP redirects with remote files.
- Improved exception handling with DRMAA.
- Scripts referred by the script directive can now use locally defined external python modules.


## [3.9.1] - 2016-12-23
### Added
- Jobs can be restarted upon failure (--restart-times).
### Changed
- The docs have been restructured and improved. Now available under snakemake.readthedocs.org.
- Changes in scripts show up with --list-code-changes.
- Duplicate output files now cause an error.
- Various bug fixes.


## [3.9.0] - 2016-11-15
### Added
- Ability to define isolated conda software environments (YAML) per rule. Environments will be deployed by Snakemake upon workflow execution.
- Command line argument --wrapper-prefix in order to overwrite the default URL for looking up wrapper scripts.
### Changed
- --summary now displays the log files corresponding to each output file.
- Fixed hangups when using the run directive and a large number of jobs.
- Fixed pickling errors with anonymous rules and the run directive.
- Various small bug fixes.

## [3.8.2] - 2016-09-23
### Changed
- Add missing import in rules.py.
- Use threading only in cluster jobs.

## [3.8.1] - 2016-09-14
### Changed
- Snakemake now warns when using relative paths starting with "./".
- The option -R now also accepts an empty list of arguments.
- Bug fix when handling benchmark directive.
- Jobscripts exit with code 1 in case of failure. This should improve the error messages of cluster systems.
- Fixed a bug in SFTP remote provider.


## [3.8.0] - 2016-08-26
### Added
- Wildcards can now be constrained by rule and globally via the new `wildcard_constraints` directive (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wildcards)).
- Subworkflows now allow overwriting their config file via the configfile directive in the calling Snakefile.
- A method `log_fmt_shell` in the snakemake proxy object that is available in scripts and wrappers allows obtaining a formatted string to redirect logging output from STDOUT or STDERR.
- Functions given to resources can now optionally contain an additional argument `input` that refers to the input files.
- Functions given to params can now optionally contain additional arguments `input` (see above) and `resources`. The latter refers to the resources.
- It is now possible to let items in shell commands be automatically quoted (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-rules)). This is useful when dealing with filenames that contain whitespace.
### Changed
- Snakemake now deletes output files before job execution. Further, it touches output files after job execution. This solves various problems with slow NFS filesystems.
- A bug was fixed that caused dynamic output rules to be executed multiple times when forcing their execution with -R.
- A bug causing double uploads with remote files was fixed. Various additional bug fixes related to remote files.
- Various minor bug fixes.

## [3.7.1] - 2016-05-16
### Changed
- Fixed a missing import of the multiprocessing module.

## [3.7.0] - 2016-05-05
### Added
- The entries in `resources` and the `threads` job attribute can now be callables that must return `int` values.
- Multiple `--cluster-config` arguments can be given to the Snakemake command line. Later ones override earlier ones.
- In the API, multiple `cluster_config` paths can be given as a list, alternatively to the previous behaviour of expecting one string for this parameter.
- When submitting cluster jobs (either through `--cluster` or `--drmaa`), you can now use `--max-jobs-per-second` to limit the number of jobs being submitted (also available through Snakemake API). Some cluster installations have problems with too many jobs per second.
- Wildcard values are now printed upon job execution in addition to input and output files.
### Changed
- Fixed a bug with HTTP remote providers.

## [3.6.1] - 2016-04-08
### Changed
- Work around missing RecursionError in Python < 3.5
- Improved conversion of numpy and pandas data structures to R scripts.
- Fixed locking of working directory.

## [3.6.0] - 2016-03-10
### Added
- onstart handler, which allows adding code that is executed only before the actual workflow execution (not on dry runs).
- Parameters defined in the cluster config file are now accessible in the job properties under the key "cluster".
- The wrapper directive can be considered stable.
### Changed
- Allow using rule/job parameters with braces notation in the cluster config.
- Show a proper error message in case of recursion errors.
- Remove non-empty temp dirs.
- Don't set the process group of Snakemake in order to allow kill signals from parent processes to be propagated.
- Fixed various corner case bugs.
- The params directive no longer converts a list ``l`` implicitly to ``" ".join(l)``.

## [3.5.5] - 2016-01-23
### Added
- New experimental wrapper directive, which allows referring to reusable [wrapper scripts](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wrappers). Wrappers are provided in the [Snakemake Wrapper Repository](https://bitbucket.org/snakemake/snakemake-wrappers).
- David Koppstein implemented two new command line options to constrain the execution of the DAG of jobs to sub-DAGs (--until and --omit-from).
### Changed
- Fixed various bugs, e.g. with shadow jobs and --latency-wait.

## [3.5.4] - 2015-12-04
### Changed
- The params directive now fully supports non-string parameters. Several bugs in the remote support were fixed.

## [3.5.3] - 2015-11-24
### Changed
- The missing remote module was added to the package.

## [3.5.2] - 2015-11-24
### Added
- Support for easy integration of external R and Python scripts via the new [script directive](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-external-scripts).
- Chris Tomkins-Tinch has implemented support for remote files: Snakemake can now handle input and output files from Amazon S3, Google Storage, FTP, SFTP, HTTP and Dropbox.
- Simon Ye has implemented support for sandboxing jobs with [shadow rules](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-shadow-rules).
### Changed
- Manuel Holtgrewe has fixed dynamic output files in combination with multiple wildcards.
- It is now possible to add suffixes to all shell commands with shell.suffix("mysuffix").
- Job execution has been refactored to spawn processes only when necessary, resolving several problems in combination with huge workflows consisting of thousands of jobs and reducing the memory footprint.
- In order to reflect the new collaborative development model, Snakemake has moved from my personal bitbucket account to http://snakemake.bitbucket.org.

## [3.4.2] - 2015-09-12
### Changed
- Willem Ligtenberg has reduced the memory usage of Snakemake.
- Per Unneberg has improved config file handling to provide a more intuitive overwrite behavior.
- Simon Ye has improved the test suite of Snakemake and helped with setting up continuous integration via Codeship.
- The cluster implementation has been rewritten to use only a single thread to wait for jobs. This avoids failures with large numbers of jobs.
- Benchmarks are now writing tab-delimited text files instead of JSON.
- Snakemake now always requires the number of jobs to be set with -j when in cluster mode. Set this to a high value if your cluster does not have restrictions.
- The Snakemake Conda package has been moved to the bioconda channel.
- The handling of symlinks was improved, which made a switch to Python 3.3 as the minimum required Python version necessary.

## [3.4.1] - 2015-08-05
### Changed
- This release fixes a bug that caused named input or output files to always be returned as lists instead of single files.

## [3.4] - 2015-07-18
### Added
- This release adds support for executing jobs on clusters in synchronous mode (e.g. qsub -sync). Thanks to David Alexander for implementing this.
- There is now vim syntax highlighting support (thanks to Jay Hesselberth).
- Snakemake is now available as Conda package.
### Changed
- Lots of bugs have been fixed. Thanks go to e.g. David Koppstein, Marcel Martin, John Huddleston and Tao Wen for helping with useful reports and debugging.

See [here](https://bitbucket.org/snakemake/snakemake/wiki/News-Archive) for older changes.

License

Snakemake is licensed under the MIT License:

Copyright (c) 2016 Johannes Köster <johannes.koester@tu-dortmund.de>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.