Welcome to Snakemake’s documentation!¶
Snakemake is an MIT-licensed workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.
Quick Example¶
rule targets:
input:
"plots/dataset1.pdf",
"plots/dataset2.pdf"
rule plot:
input:
"raw/{dataset}.csv"
output:
"plots/{dataset}.pdf"
shell:
"somecommand {input} {output}"
- Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.
- For each target and intermediate file, you create rules that define how they are created from input files.
- Snakemake determines the rule dependencies by matching file names.
- Input and output files can contain multiple named wildcards.
- Rules can either use shell commands, plain Python code or external Python or R scripts to create output files from input files.
- Snakemake workflows can be executed on workstations and clusters without modification. The job scheduling can be constrained by arbitrary resources such as available CPU cores, memory, or GPUs.
- Snakemake can use Amazon S3, Google Storage, Dropbox, FTP and SFTP to access input or output files and further access input files via HTTP and HTTPS.
Getting started¶
To get started, consider the Snakemake Tutorial, the introductory slides, and the FAQ.
Support¶
- In case of questions, please post on Stack Overflow.
- To discuss with other Snakemake users, you can use the mailing list.
- For bugs and feature requests, please use the issue tracker.
- For contributions, visit Snakemake on Bitbucket and read the guidelines.
Publications using Snakemake¶
The following is an incomplete list of publications that use Snakemake for their analyses. Please consider adding your own.
- Etournay et al. 2016. TissueMiner: a multiscale analysis toolkit to quantify how cellular processes create tissue dynamics. eLife Sciences.
- Townsend et al. 2016. The Public Repository of Xenografts Enables Discovery and Randomized Phase II-like Trials in Mice. Cancer Cell.
- Burrows et al. 2016. Genetic Variation, Not Cell Type of Origin, Underlies the Majority of Identifiable Regulatory Differences in iPSCs. PLOS Genetics.
- Ziller et al. 2015. Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing. Nature Methods.
- Li et al. 2015. Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR. Genome Biology.
- Schmied et al. 2015. An automated workflow for parallel processing of large multiview SPIM recordings. Bioinformatics.
- Chung et al. 2015. Whole-Genome Sequencing and Integrative Genomic Analysis Approach on Two 22q11.2 Deletion Syndrome Family Trios for Genotype to Phenotype Correlations. Human Mutation.
- Kim et al. 2015. TUT7 controls the fate of precursor microRNAs by using three different uridylation mechanisms. The EMBO Journal.
- Park et al. 2015. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell.
- Břinda et al. 2015. RNF: a general framework to evaluate NGS read mappers. Bioinformatics.
- Břinda et al. 2015. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics.
- Spjuth et al. 2015. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct.
- Schramm et al. 2015. Mutational dynamics between primary and relapse neuroblastomas. Nature Genetics.
- Bray et al. 2015. Near-optimal RNA-Seq quantification. Arxiv preprint.
- Berulava et al. 2015. N6-Adenosine Methylation in MiRNAs. PLOS ONE.
- The Genome of the Netherlands Consortium 2014. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics.
- Patterson et al. 2014. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. Journal of Computational Biology.
- Fernández et al. 2014. H3K4me1 marks DNA regions hypomethylated during aging in human stem and differentiated cells. Genome Research.
- Köster et al. 2014. Massively parallel read mapping on GPUs with the q-group index and PEANUT. PeerJ.
- Chang et al. 2014. TAIL-seq: Genome-wide Determination of Poly(A) Tail Length and 3′ End Modifications. Molecular Cell.
- Althoff et al. 2013. MiR-137 functions as a tumor suppressor in neuroblastoma by downregulating KDM1A. International Journal of Cancer.
- Marschall et al. 2013. MATE-CLEVER: Mendelian-Inheritance-Aware Discovery and Genotyping of Midsize and Long Indels. Bioinformatics.
- Rahmann et al. 2013. Identifying transcriptional miRNA biomarkers by integrating high-throughput sequencing and real-time PCR data. Methods.
- Martin et al. 2013. Exome sequencing identifies recurrent somatic mutations in EIF1AX and SF3B1 in uveal melanoma with disomy 3. Nature Genetics.
- Czeschik et al. 2013. Clinical and mutation data in 12 patients with the clinical diagnosis of Nager syndrome. Human Genetics.
- Marschall et al. 2012. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics.
Installation¶
Snakemake is available on PyPI, through Bioconda, and from source code. You can use any of the following methods to install Snakemake.
Installation via Conda¶
On Linux and MacOSX, this is the recommended way to install Snakemake, because it also enables Snakemake to handle software dependencies of your workflow.
First, you have to install the Miniconda Python 3 distribution. See here for installation instructions. Make sure to ...
- Install the Python 3 version of Miniconda.
- Answer yes to the question whether conda shall be put into your PATH.
Then, you can install Snakemake with
$ conda install -c bioconda snakemake
from the Bioconda channel.
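Alternatively, if you prefer to keep Snakemake separate from your base installation, you can create a dedicated environment for it (the environment name snakemake below is just an example, not part of the official instructions):
$ conda create -n snakemake -c bioconda snakemake
$ source activate snakemake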
Global Installation¶
With a working Python 3 setup, installation of Snakemake can be performed by issuing
$ easy_install3 snakemake
or
$ pip3 install snakemake
in your terminal.
Installing in Virtualenv¶
To create an installation in a virtual environment, use the following commands:
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install snakemake
Installing from Source¶
We recommend installing Snakemake into a virtualenv instead of globally. Use the following commands to create a virtualenv and install Snakemake. Note that this will install the development version; as you are installing from the source code, we trust that you know what you are doing and how to check out individual versions/tags.
$ git clone https://bitbucket.org/snakemake/snakemake.git
$ cd snakemake
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ python setup.py install
You can also use python setup.py develop to create a “development installation”, in which no files are copied but a link is created, so that changes in the source code are immediately visible in your snakemake commands.
Examples¶
Most of the examples below assume that Snakemake is executed in a project-specific root directory.
The paths in the Snakefiles below are relative to this directory.
We follow the convention of using different subdirectories for different intermediate results, e.g., mapped/ for mapped sequence reads in .bam files.
Cufflinks¶
Cufflinks is a tool to assemble transcripts, calculate their abundance, and conduct differential expression analysis on RNA-Seq data. This example shows how to create a typical Cufflinks workflow with Snakemake. It assumes that mapped RNA-Seq data for the four samples 101-104 is given as BAM files.
- For each sample, transcripts are assembled with cufflinks (rule assembly).
- Assemblies are merged into one GTF file with cuffmerge (rule merge_assemblies).
- A comparison to the hg19 GTF track is conducted (rule compare_assemblies).
- Finally, differential expression is calculated on the found transcripts (rule diffexp).
# path to track and reference
TRACK = 'hg19.gtf'
REF = 'hg19.fa'
# sample names and classes
CLASS1 = '101 102'.split()
CLASS2 = '103 104'.split()
SAMPLES = CLASS1 + CLASS2
# path to bam files
CLASS1_BAM = expand('mapped/{sample}.bam', sample=CLASS1)
CLASS2_BAM = expand('mapped/{sample}.bam', sample=CLASS2)
rule all:
input:
'diffexp/isoform_exp.diff',
'assembly/comparison'
rule assembly:
input:
'mapped/{sample}.bam'
output:
'assembly/{sample}/transcripts.gtf',
dir='assembly/{sample}'
threads: 4
shell:
'cufflinks --num-threads {threads} -o {output.dir} '
'--frag-bias-correct {REF} {input}'
rule compose_merge:
input:
expand('assembly/{sample}/transcripts.gtf', sample=SAMPLES)
output:
txt='assembly/assemblies.txt'
run:
with open(output.txt, 'w') as out:
print(*input, sep="\n", file=out)
rule merge_assemblies:
input:
'assembly/assemblies.txt'
output:
'assembly/merged/merged.gtf', dir='assembly/merged'
shell:
'cuffmerge -o {output.dir} -s {REF} {input}'
rule compare_assemblies:
input:
'assembly/merged/merged.gtf'
output:
'assembly/comparison/all.stats',
dir='assembly/comparison'
shell:
'cuffcompare -o {output.dir}all -s {REF} -r {TRACK} {input}'
rule diffexp:
input:
class1=CLASS1_BAM,
class2=CLASS2_BAM,
gtf='assembly/merged/merged.gtf'
output:
'diffexp/gene_exp.diff', 'diffexp/isoform_exp.diff'
params:
class1=",".join(CLASS1_BAM),
class2=",".join(CLASS2_BAM)
threads: 8
shell:
        'cuffdiff --num-threads {threads} {input.gtf} {params.class1} {params.class2}'
The execution plan of Snakemake for this workflow can be visualized with the following DAG.

Building a C Program¶
GNU Make is primarily used to build C/C++ code. Snakemake can do the same, while providing superior readability thanks to less obscure variables inside the rules.
The following example Makefile was adapted from http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/.
IDIR=../include
ODIR=obj
LDIR=../lib
LIBS=-lm
CC=gcc
CFLAGS=-I$(IDIR)
_HEADERS = hello.h
HEADERS = $(patsubst %,$(IDIR)/%,$(_HEADERS))
_OBJS = hello.o hellofunc.o
OBJS = $(patsubst %,$(ODIR)/%,$(_OBJS))
# build the executable from the object files
hello: $(OBJS)
$(CC) -o $@ $^ $(CFLAGS)
# compile a single .c file to an .o file
$(ODIR)/%.o: %.c $(HEADERS)
$(CC) -c -o $@ $< $(CFLAGS)
# clean up temporary files
.PHONY: clean
clean:
rm -f $(ODIR)/*.o *~ core $(IDIR)/*~
A Snakefile can be easily written as
from os.path import join
IDIR = '../include'
ODIR = 'obj'
LDIR = '../lib'
LIBS = '-lm'
CC = 'gcc'
CFLAGS = '-I' + IDIR
_HEADERS = ['hello.h']
HEADERS = [join(IDIR, hfile) for hfile in _HEADERS]
_OBJS = ['hello.o', 'hellofunc.o']
OBJS = [join(ODIR, ofile) for ofile in _OBJS]
rule hello:
"""build the executable from the object files"""
output:
'hello'
input:
OBJS
shell:
"{CC} -o {output} {input} {CFLAGS} {LIBS}"
rule c_to_o:
"""compile a single .c file to an .o file"""
output:
temp('{ODIR}/{name}.o')
input:
'{name}.c', HEADERS
shell:
"{CC} -c -o {output} {input} {CFLAGS}"
rule clean:
"""clean up temporary files"""
shell:
"rm -f *~ core {IDIR}/*~"
As can be seen, the shell calls become more readable, e.g. "{CC} -c -o {output} {input} {CFLAGS}" instead of $(CC) -c -o $@ $< $(CFLAGS). Further, Snakemake automatically deletes the .o files when they are not needed anymore, since they are marked as temp.

Building a Paper with LaTeX¶
Building a scientific paper can be automated by Snakemake as well. Apart from compiling LaTeX code and invoking BibTeX, we provide a special rule to zip the needed files for online submission.
We first provide a Snakefile tex.rules that contains rules that can be shared for any LaTeX build task:
ruleorder: tex2pdf_with_bib > tex2pdf_without_bib
rule tex2pdf_with_bib:
input:
'{name}.tex',
'{name}.bib'
output:
'{name}.pdf'
shell:
"""
pdflatex {wildcards.name}
bibtex {wildcards.name}
pdflatex {wildcards.name}
pdflatex {wildcards.name}
"""
rule tex2pdf_without_bib:
input:
'{name}.tex'
output:
'{name}.pdf'
shell:
"""
pdflatex {wildcards.name}
pdflatex {wildcards.name}
"""
rule texclean:
shell:
"rm -f *.log *.aux *.bbl *.blg *.synctex.gz"
Note how we distinguish between a .tex file with and without a corresponding .bib file of the same name.
Assuming that both paper.tex and paper.bib exist, an ambiguity arises: both rules are, in principle, applicable.
This would lead to an AmbiguousRuleException, but since we have specified an explicit rule order in the file, it is clear that in this case the rule tex2pdf_with_bib is to be preferred.
If the paper.bib file does not exist, that rule is not even applicable, and the only option is to execute the rule tex2pdf_without_bib.
Assuming that the above file is saved as tex.rules, the actual documents are then built from a specific Snakefile that includes these common rules:
DOCUMENTS = ['document', 'response-to-editor']
TEXS = [doc+".tex" for doc in DOCUMENTS]
PDFS = [doc+".pdf" for doc in DOCUMENTS]
FIGURES = ['fig1.pdf']
include:
    'tex.rules'
rule all:
input:
PDFS
rule zipit:
output:
'upload.zip'
input:
TEXS, FIGURES, PDFS
shell:
'zip -T {output} {input}'
rule pdfclean:
shell:
"rm -f {PDFS}"
Hence, the user can perform four different tasks. Build all PDFs:
$ snakemake
Create a zip-file for online submissions:
$ snakemake zipit
Clean up all PDFs:
$ snakemake pdfclean
Clean up latex temporary files:
$ snakemake texclean
The following DAG of jobs would be executed upon a full run:

Snakemake Tutorial¶
This tutorial introduces the text-based workflow system Snakemake. Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
Snakemake sets itself apart from existing text-based workflow systems in the following way. Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python with syntax to define rules and workflow-specific properties. This allows combining the flexibility of a plain scripting language with a pythonic workflow definition. The Python language is known to be concise yet readable and can appear almost like pseudo-code. The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake's scheduling algorithm can be constrained by priorities, provided cores, and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single-core workstations and multi-core servers to cluster or batch systems.
The examples presented in this tutorial come from Bioinformatics. However, Snakemake is a general-purpose workflow management system for any discipline. We ensured that no bioinformatics knowledge is needed to understand the tutorial.
Also have a look at the corresponding slides.
Setup¶
Requirements¶
To go through this tutorial, you need the following software installed:
- Python ≥3.3
- Snakemake 3.11.0
- BWA 0.7.12
- SAMtools 1.3.1
- BCFtools 1.3.1
- Graphviz 2.38.0
- PyYAML 3.11
- Docutils 0.12
The easiest way to set up these prerequisites is to use the Miniconda Python 3 distribution. The tutorial assumes that you are using either Linux or MacOS X. Both Snakemake and Miniconda also work under Windows, but the Windows shell is too different to be able to provide generic examples.
Setup a Linux VM with Vagrant under Windows¶
If you already use Linux or MacOS X, go on with Step 1.
If you use Windows, you can setup a Linux virtual machine (VM) with Vagrant.
First, install Vagrant following the installation instructions in the Vagrant Documentation.
Then, create a new directory you want to share with your Linux VM, e.g., create a folder vagrant-linux somewhere.
Open a command line prompt, and change into that directory.
Here, you create a 64-bit Ubuntu Linux environment with
> vagrant init hashicorp/precise64
> vagrant up
If you decide to use a 32-bit image, you will need to download the 32-bit version of Miniconda in the next step.
The contents of the vagrant-linux folder will be shared with the virtual machine that is set up by Vagrant.
You can log into the virtual machine via
> vagrant ssh
If this command tells you to install an SSH client, you can follow the instructions in this Blogpost. Now, you can follow the steps of our tutorial from within your Linux VM.
Step 1: Installing Miniconda 3¶
First, please open a terminal or make sure you are logged into your Vagrant Linux VM. Assuming that you have a 64-bit system, on Linux, download and install Miniconda 3 with
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
On MacOS X, download and install with
$ curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh
For a 32-bit system, URLs and file names are analogous but without the _64.
When you are asked the question
Do you wish the installer to prepend the Miniconda3 install location to PATH ...? [yes|no]
answer with yes.
Along with a minimal Python 3 environment, Miniconda contains the package manager Conda.
After opening a new terminal, you can use the new conda command to install software packages and create isolated environments to, e.g., use different versions of the same package.
We will later use Conda to create an isolated environment with all required software for this tutorial.
Step 2: Preparing a working directory¶
First, create a new directory snakemake-tutorial at a reasonable place and change into that directory in your terminal.
If you use a Vagrant Linux VM from Windows as described above, create that directory under /vagrant/, so that the contents are shared with your host system (you can then edit all files from within Windows with an editor that supports Unix line breaks).
Then, change to the newly created directory.
In this directory, we will later create an example workflow that illustrates the Snakemake syntax and execution environment.
First, we download some example data on which the workflow shall be executed:
$ wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
$ tar -xf v3.11.0.tar.bz2 --strip 1
This will create a folder data and a file environment.yaml in the working directory.
Step 3: Creating an environment with the required software¶
The environment.yaml file can be used to install all required software into an isolated Conda environment with the name snakemake-tutorial via
$ conda env create --name snakemake-tutorial --file environment.yaml
Step 4: Activating the environment¶
To activate the snakemake-tutorial environment, execute
$ source activate snakemake-tutorial
Now you can use the installed tools. Execute
$ snakemake --help
to test this and get information about the command-line interface of Snakemake. To exit the environment, you can execute
$ source deactivate
but don’t do that now, since we finally want to start working with Snakemake :-).
Basics: An example workflow¶
Please make sure that you have activated the environment we created before, and that you have an open terminal in the working directory you have created.
A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.
The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented and spanning multiple lines. The resulting syntax resembles that of original Python constructs.
In the following, we will introduce the Snakemake syntax by creating an example workflow. The workflow comes from the domain of genome analysis. It maps sequencing reads to a reference genome and calls variants on the mapped reads. The tutorial does not require you to know what this is about. Nevertheless, we provide some background in the following.
Background¶
The genome of a living organism encodes its hereditary information. It serves as a blueprint for proteins, which form living cells, carry information and drive chemical reactions. Differences between populations, species, cancer cells and healthy tissue, as well as syndromes or diseases, can be reflected and sometimes caused by changes in the genome. This makes the genome a major target of biological and medical research. Today, it is often analyzed with DNA sequencing, producing gigabytes of data from a single biological sample (e.g. a biopsy of some tissue). For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called reads. In order to recover the genome of the sample, one has to map these reads against a known reference genome (e.g., the human one obtained during the famous Human Genome Project). This task is called read mapping. Often, it is of interest where an individual genome is different from the species-wide consensus represented by the reference genome. Such differences are called variants. They are responsible for harmless individual differences (like eye color), but can also cause diseases like cancer. By investigating the differences between all mapped reads and the reference sequence at one position, variants can be detected. This is a statistical challenge, because they have to be distinguished from artifacts generated by the sequencing process.
Step 1: Mapping reads¶
Our first Snakemake rule maps reads of a given sample to a given reference genome (see Background).
For this, we will use the tool bwa, specifically the subcommand bwa mem.
In the working directory, create a new file called Snakefile with an editor of your choice.
We propose to use the Atom editor, since it provides out-of-the-box syntax highlighting for Snakemake.
In the Snakefile, define the following rule:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
A Snakemake rule has a name (here bwa_map) and a number of directives, here input, output and shell.
The input and output directives are followed by lists of files that are expected to be used or created by the rule.
In the simplest case, these are just explicit Python strings.
The shell directive is followed by a Python string containing the shell command to execute.
In the shell command string, we can refer to elements of the rule via braces notation (similar to the Python format function).
Here, we refer to the output file by specifying {output} and to the input files by specifying {input}.
Since the rule has multiple input files, Snakemake will concatenate them, separated by whitespace.
In other words, Snakemake will replace {input} with data/genome.fa data/samples/A.fastq before executing the command.
The shell command invokes bwa mem with the reference genome and the reads, and pipes the output into samtools, which creates a compressed BAM file containing the alignments.
The output of samtools is redirected into the output file defined by the rule.
When a workflow is executed, Snakemake tries to generate given target files. Target files can be specified via the command line. By executing
$ snakemake -np mapped_reads/A.bam
in the working directory containing the Snakefile, we tell Snakemake to generate the target file mapped_reads/A.bam.
Since we used the -n (or --dryrun) flag, Snakemake will only show the execution plan instead of actually performing the steps.
The -p flag instructs Snakemake to also print the resulting shell command for illustration.
To generate the target files, Snakemake applies the rules given in the Snakefile in a top-down way.
The application of a rule to generate a set of output files is called a job.
For each input file of a job, Snakemake again (i.e. recursively) determines rules that can be applied to generate it.
This yields a directed acyclic graph (DAG) of jobs where the edges represent dependencies.
So far, we only have a single rule, and the DAG of jobs consists of a single node.
Nevertheless, we can execute our workflow with
$ snakemake mapped_reads/A.bam
Note that, after completion of the above command, Snakemake will not try to create mapped_reads/A.bam again, because it is already present in the file system.
Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
Step 2: Generalizing the read mapping rule¶
Obviously, the rule will only work for a single sample with reads in the file data/samples/A.fastq.
However, Snakemake allows generalizing rules by using named wildcards.
Simply replace the A in the second input file and in the output file with the wildcard {sample}, leading to
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
When Snakemake determines that this rule can be applied to generate a target file by replacing the wildcard {sample} in the output file with an appropriate value, it will propagate that value to all occurrences of {sample} in the input files and thereby determine the necessary input for the resulting job.
Note that you can have multiple wildcards in your file paths; however, to avoid conflicts with other jobs of the same rule, all output files of a rule have to contain exactly the same wildcards.
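For illustration only, a hypothetical rule with two wildcards (sample and lane; the corresponding per-lane FASTQ files are not part of the tutorial data) could look like this; note that both wildcards appear in the output file:
rule bwa_map_lane:
    input:
        "data/genome.fa",
        "data/samples/{sample}.{lane}.fastq"
    output:
        "mapped_reads/{sample}_{lane}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"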
When executing
$ snakemake -np mapped_reads/B.bam
Snakemake will determine that the rule bwa_map can be applied to generate the target file by replacing the wildcard {sample} with the value B.
In the output of the dry-run, you will see how the wildcard value is propagated to the input files and all filenames in the shell command.
You can also specify multiple targets, e.g.:
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Some Bash magic can make this particularly handy. For example, you can alternatively compose our multiple targets in a single pass via
$ snakemake -np mapped_reads/{A,B}.bam
Note that this is not a special Snakemake syntax. Bash is just expanding the given path into two paths, one for each element of the set {A,B}.
In both cases, you will see that Snakemake only proposes to create the output file mapped_reads/B.bam.
This is because you already executed the workflow before (see the previous step) and no input file is newer than the output file mapped_reads/A.bam.
You can update the file modification date of the input file data/samples/A.fastq via
$ touch data/samples/A.fastq
and see how Snakemake wants to re-run the job to create the file mapped_reads/A.bam by executing
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Step 3: Sorting read alignments¶
For later steps, we need the read alignments in the BAM files to be sorted.
This can be achieved with the samtools command.
We add the following rule beneath the bwa_map rule:
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
This rule will take the input file from the mapped_reads directory and store a sorted version in the sorted_reads directory.
Note that Snakemake automatically creates missing directories before jobs are executed.
For sorting, samtools requires a prefix specified with the flag -T.
Here, we need the value of the wildcard sample.
Snakemake allows accessing wildcards in the shell command via the wildcards object, which has an attribute with the value of each wildcard.
When issuing
$ snakemake -np sorted_reads/B.bam
you will see how Snakemake wants to run first the rule bwa_map and then the rule samtools_sort to create the desired target file:
as mentioned before, the dependencies are resolved automatically by matching file names.
Step 4: Indexing read alignments and visualizing the DAG of jobs¶
Next, we need to use samtools again to index the sorted read alignments for random access. This can be done with the following rule:
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
Having three steps already, it is a good time to take a closer look at the resulting DAG of jobs. By executing
$ snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
we create a visualization of the DAG using the dot command provided by Graphviz.
For the given target files, Snakemake specifies the DAG in the dot language and pipes it into the dot command, which renders the definition into SVG format.
The rendered DAG is piped into the file dag.svg and will look similar to this:

The DAG contains a node for each job and edges representing the dependencies. Jobs that don’t need to be run because their output is up-to-date are dashed. For rules with wildcards, the value of the wildcard for the particular job is displayed in the job node.
Exercise¶
- Run parts of the workflow using different targets. Recreate the DAG and see how different rules become dashed because their output is present and up-to-date.
Step 5: Calling genomic variants¶
The next step in our workflow will aggregate the mapped reads from all samples and jointly call genomic variants on them (see Background). For the variant calling, we will combine the two utilities samtools and bcftools. Snakemake provides a helper function for collecting input files that helps us to describe the aggregation in this step. With
expand("sorted_reads/{sample}.bam", sample=SAMPLES)
we obtain a list of files where the given pattern "sorted_reads/{sample}.bam" was formatted with the values in the given list of samples SAMPLES, i.e.
["sorted_reads/A.bam", "sorted_reads/B.bam"]
The function is particularly useful when the pattern contains multiple wildcards. For example,
expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
would create the product of all elements of SAMPLES and the list [0, 1], yielding
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]
Here, we use only the simple case of expand.
We first let Snakemake know which samples we want to consider.
Remember that Snakemake works top-down; it does not automatically infer this from, e.g., the FASTQ files in the data folder.
Also remember that Snakefiles are in principle Python code enhanced by some declarative statements to define workflows.
Hence, we can define the list of samples ad-hoc in plain Python at the top of the Snakefile:
SAMPLES = ["A", "B"]
Later, we will learn about more sophisticated ways like config files. Now, we can add the following rule to our Snakefile:
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
With multiple input or output files, it is sometimes handy to refer to them separately in the shell command.
This can be done by specifying names for input or output files (here, e.g., fa=...).
The files can then be referred to in the shell command via, e.g., {input.fa}.
For long shell commands like this one, it is advisable to split the string over multiple indented lines; Python will automatically merge them into one.
Further, you will notice that the input or output file lists can contain arbitrary Python expressions, as long as each evaluates to a string or a list of strings.
Here, we invoke our expand function to aggregate over the aligned reads of all samples.
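As a side note, the automatic merging of the split shell command string relies on Python's implicit concatenation of adjacent string literals, which a minimal sketch illustrates:
# adjacent string literals are joined by Python at parse time
cmd = ("samtools mpileup -g -f {input.fa} {input.bam} | "
       "bcftools call -mv - > {output}")
assert cmd == "samtools mpileup -g -f {input.fa} {input.bam} | bcftools call -mv - > {output}"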
Exercise¶
- Obtain the updated DAG of jobs for the target file calls/all.vcf; it should look like this:

Step 6: Writing a report¶
Although Snakemake workflows are already self-documenting to a certain degree, it is often useful to summarize the obtained results and performed steps in a comprehensive report.
With Snakemake, such reports can be composed easily with the built-in report function.
It is best practice to create reports in a separate rule that takes all desired results as input files and provides a single HTML file as output.
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
First, we notice that this rule does not entail a shell command.
Instead, we use the run directive, which is followed by plain Python code.
Similar to the shell case, we have access to the input and output files, which we can handle as plain Python objects.
We go through the run block line by line.
First, we import the report function from snakemake.utils.
Second, we open the VCF file by accessing it via its index in the input files (i.e. input[0]), and count the number of non-header lines (which is equivalent to the number of variant calls).
Of course, this is only a silly example of what to do with variant calls.
Third, we create the report using the report function.
The function takes a string that contains RestructuredText markup.
In addition, we can use the familiar braces notation to access any Python variables (here the n_calls variable we have defined before).
The second argument of the report function is the path where the report will be stored (the function creates a single HTML file).
Then, report expects any number of keyword arguments referring to files that shall be embedded into the report.
Technically, this means that the files will be stored as Base64 encoded data URIs within the HTML file, making reports entirely self-contained.
Importantly, you can refer to the files from within the report via the given keywords followed by an underscore (here T1_).
Hence, reports can be used to semantically connect and explain the obtained results.
When having many result files, it is sometimes handy to define the names already in the list of input files and unpack these into keyword arguments as follows:
report("""...""", output[0], **input)
Further, you can add metadata in the form of any string that will be displayed in the footer of the report, e.g.
report("""...""", output[0], metadata="Author: Johannes Köster (koester@jimmy.harvard.edu)", **input)
Step 7: Adding a target rule¶
So far, we always executed the workflow by specifying a target file at the command line.
Apart from filenames, Snakemake also accepts rule names as targets if the referred rule does not have wildcards.
Hence, it is possible to write target rules collecting particular subsets of the desired results or all results.
Moreover, if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target.
Hence, it is best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.
Here, this means that we add a rule
rule all:
input:
"report.html"
to the top of our workflow. When executing Snakemake with
$ snakemake -n
the execution plan for creating the file report.html, which contains and summarizes all our results, will be shown.
Note that, apart from Snakemake considering the first rule of the workflow as the default target, the order in which rules appear in the Snakefile is arbitrary and does not influence the DAG of jobs.
Exercise¶
- Create the DAG of jobs for the complete workflow.
- Execute the complete workflow and have a look at the resulting report.html in your browser.
- Snakemake provides handy flags for forcing re-execution of parts of the workflow. Have a look at the command line help with snakemake --help and search for the flag --forcerun. Then, use this flag to re-execute the rule samtools_sort and see what happens.
- With --reason it is possible to display the execution reason for each job. Try this flag together with a dry-run and the --forcerun flag to understand the decisions of Snakemake (an example invocation is shown below).
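For the last exercise item, a possible dry-run invocation combining the flags discussed above is
$ snakemake -n --reason --forcerun samtools_sort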
Summary¶
In total, the resulting workflow looks like this:
SAMPLES = ["A", "B"]
rule all:
input:
"report.html"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
Advanced: Decorating the example workflow¶
Now that the basic concepts of Snakemake have been illustrated, we can introduce advanced topics.
Step 1: Specifying the number of used threads¶
For some tools, it is advisable to use more than one thread in order to speed up the computation.
Snakemake can be made aware of the threads a rule needs with the threads directive.
In our example workflow, it makes sense to use multiple threads for the rule bwa_map:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
The number of threads can be propagated to the shell command with the familiar braces notation (i.e. {threads}).
If no threads directive is given, a rule is assumed to need 1 thread.
When a workflow is executed, the number of threads the jobs need is considered by the Snakemake scheduler.
In particular, the scheduler ensures that the sum of the threads of all running jobs does not exceed a given number of available CPU cores.
This number can be given with the --cores command line argument (per default, Snakemake uses only 1 CPU core).
For example
$ snakemake --cores 10
would execute the workflow with 10 cores.
Since the rule bwa_map needs 8 threads, only one job of the rule can run at a time, and the Snakemake scheduler will try to saturate the remaining cores with other jobs like, e.g., samtools_sort.
The threads directive in a rule is interpreted as a maximum: when fewer cores than threads are provided, the number of threads a rule uses will be reduced to the number of given cores.
Exercise¶
- With the flag --forceall you can enforce a complete re-execution of the workflow. Combine this flag with different values for --cores and examine how the scheduler selects jobs to run in parallel.
Step 2: Config files¶
So far, we specified the samples to consider in a Python list within the Snakefile.
However, often you want your workflow to be customizable, so that it can be easily adapted to new data.
For this purpose, Snakemake provides a config file mechanism.
Config files can be written in JSON or YAML, and are loaded with the configfile directive.
In our example workflow, we add the line
configfile: "config.yaml"
to the top of the Snakefile.
Snakemake will load the config file and store its contents in a globally available dictionary named config.
In our case, it makes sense to specify the samples in config.yaml as
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Now, we can remove the statement defining SAMPLES from the Snakefile and change the rule bcftools_call to
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
Step 3: Input functions¶
Since we have stored the paths to the FASTQ files in the config file, we can also generalize the rule bwa_map to use these paths.
This case is different from the rule bcftools_call we modified above.
To understand this, it is important to know that Snakemake workflows are executed in three phases.
- In the initialization phase, the workflow is parsed and all rules are instantiated.
- In the DAG phase, the DAG of jobs is built by filling wildcards and matching input files to output files.
- In the scheduling phase, the DAG of jobs is executed.
The expand functions in the list of input files of the rule bcftools_call are executed during the initialization phase.
In this phase, we don't know about jobs, wildcard values and rule dependencies.
Hence, we cannot determine the FASTQ paths for rule bwa_map from the config file in this phase, because we don't even know which jobs will be generated from that rule.
Instead, we need to defer the determination of input files to the DAG phase.
This can be achieved by specifying an input function instead of a string inside the input directive.
For the rule bwa_map, this works as follows:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
Here, we use an anonymous function, also called a lambda expression.
Any normal function would work as well.
Input functions take as their single argument a wildcards object that allows accessing the wildcard values via attributes (here wildcards.sample).
They have to return a string or a list of strings that are interpreted as paths to input files (here, we return the path that is stored for the sample in the config file).
Input functions are evaluated once the wildcard values of a job are determined.
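Equivalently, a named function can be used instead of the lambda expression. A sketch (the function name bwa_map_input is our own choice, not part of the tutorial files):
def bwa_map_input(wildcards):
    # look up the FASTQ path for the requested sample in the config file
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        bwa_map_input
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"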
Exercise¶
- In the data/samples folder, there is an additional sample C.fastq. Add that sample to the config file and see how Snakemake wants to recompute the part of the workflow belonging to the new sample, when invoking with snakemake -n --reason --forcerun bcftools_call.
Step 4: Rule parameters¶
Sometimes, shell commands are not only composed of input and output files and some static flags.
In particular, it can happen that additional parameters need to be set depending on the wildcard values of the job.
For this, Snakemake allows defining arbitrary parameters for rules with the params directive.
In our workflow, it is reasonable to annotate aligned reads with so-called read groups, which contain metadata like the sample name.
We modify the rule bwa_map accordingly:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
threads: 8
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
Similar to input and output files, params can be accessed from the shell command or the Python based run block (see Step 6: Writing a report).
Exercise¶
- Variant calling can consider a lot of parameters. A particularly important one is the prior mutation rate (1e-3 per default). It is set via the flag -P of the bcftools call command. Consider making this flag configurable by adding a new key to the config file and using the params directive in the rule bcftools_call to propagate it to the shell command. A possible solution is sketched below.
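One possible solution sketch for this exercise (the config key prior_mutation_rate is a name we choose here; it is not part of the provided tutorial files):
# add to config.yaml:
# prior_mutation_rate: 0.001

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        rate=config["prior_mutation_rate"]
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.rate} - > {output}"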
Step 5: Logging¶
When executing a large workflow, it is usually desirable to store the output of each job persistently in files instead of just printing it to the terminal.
For this purpose, Snakemake allows specifying log files for rules.
Log files are defined via the log directive and handled similarly to output files, but they are not subject to rule matching and are not cleaned up when a job fails.
We modify our rule bwa_map as follows:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
The shell command is modified to collect the STDERR output of both bwa and samtools and pipe it into the file referred to by {log}.
Log files must contain exactly the same wildcards as the output files to avoid clashes.
Exercise¶
- Add a log directive to the bcftools_call rule as well (a possible solution is sketched below).
- Time to re-run the whole workflow (remember the command line flags to force re-execution). See how log files are created for variant calling and read mapping.
- The ability to track the provenance of each generated result is an important step towards reproducible analyses. Apart from the report functionality discussed before, Snakemake can summarize various provenance information for all output files of the workflow. The flag --summary prints a table associating each output file with the rule used to generate it, the creation date and, optionally, the version of the tool used for creation. Further, the table informs about updated input files and changes to the source code of the rule after creation of the output file. Invoke Snakemake with --summary to examine the information for our example.
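A possible solution for the first exercise item (the log path logs/bcftools_call/all.log is just an example choice; like the output file, it contains no wildcards):
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    log:
        "logs/bcftools_call/all.log"
    shell:
        "(samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}) 2> {log}"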
Step 6: Temporary and protected files¶
In our workflow, we create two BAM files for each sample, namely the output of the rules bwa_map and samtools_sort.
When not dealing with examples, the underlying data is usually huge.
Hence, the resulting BAM files need a lot of disk space and their creation takes some time.
Snakemake allows marking output files as temporary, such that they are deleted once every consuming job has been executed, in order to save disk space.
We use this mechanism for the output file of the rule bwa_map:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
This results in the deletion of the BAM file once the corresponding samtools_sort job has been executed.
Since the creation of BAM files via read mapping and sorting is computationally expensive, it is reasonable to protect the final BAM file from accidental deletion or modification.
We modify the rule samtools_sort by marking its output file as protected:
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
After execution of the job, Snakemake will write-protect the output file in the filesystem, so that it can’t be overwritten or deleted accidentally.
Exercise¶
- Re-execute the whole workflow and observe how Snakemake handles the temporary and protected files.
- Run Snakemake with the target mapped_reads/A.bam. Although the file is marked as temporary, you will see that Snakemake does not delete it because it is specified as a target file.
- Try to re-execute the whole workflow again with the dry-run option. You will see that it fails (as intended) because Snakemake cannot overwrite the protected output files.
Summary¶
The final version of our workflow looks like this:
configfile: "config.yaml"
rule all:
input:
"report.html"
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
Additional features¶
In the following, we introduce some features that are beyond the scope of the above example workflow.
For details and even more features, see Writing Workflows, Frequently Asked Questions and the command line help (snakemake --help).
Benchmarking¶
With the benchmark directive, Snakemake can be instructed to measure the wall clock time of a job.
We activate benchmarking for the rule bwa_map:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
benchmark:
"benchmarks/{sample}.bwa.benchmark.txt"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
The benchmark directive takes a string that points to the file where benchmarking results shall be stored.
Similar to output files, the path can contain wildcards (it must be the same wildcards as in the output files).
When a job derived from the rule is executed, Snakemake will measure the wall clock time and store it in the file in tab-delimited format.
With the command line flag --benchmark-repeats, Snakemake can be instructed to perform repeated measurements by executing benchmark jobs multiple times.
The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
We can include the benchmark results into our report:
rule report:
input:
T1="calls/all.vcf",
T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
output:
"report.html"
run:
from snakemake.utils import report
with open(input.T1) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
Benchmark results for BWA can be found in the tables T2_.
""", output[0], **input)
We use the expand function to collect the benchmark files for all samples.
Here, we directly provide names for the input files.
In particular, we can also name the whole list of benchmark files returned by the expand function as T2.
When invoking the report function, we just unpack input into keyword arguments (resulting in T1 and T2).
In the text, we refer with T2_ to the list of benchmark files.
Exercise¶
- Re-execute the workflow and benchmark bwa_map with 3 repeats (one possible invocation is shown below). Open the report and see how the list of benchmark files is presented in the HTML report.
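One way to do this, assuming you want to force a complete re-run, is
$ snakemake --forceall --benchmark-repeats 3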
Modularization¶
In order to re-use building blocks or simply to structure large workflows, it is sometimes reasonable to split a workflow into modules.
For this, Snakemake provides the include directive to include another Snakefile into the current one, e.g.:
include: "path/to/other.snakefile"
Alternatively, Snakemake allows defining sub-workflows. A sub-workflow refers to a working directory with a complete Snakemake workflow. Output files of that sub-workflow can be used in the current Snakefile. When executing, Snakemake ensures that the output files of the sub-workflow are up-to-date before executing the current workflow. This mechanism is particularly useful when you want to extend a previous analysis without modifying it. For details about sub-workflows, see the documentation.
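A minimal sketch of the sub-workflow syntax (the directory paths and the consuming rule below are placeholders, not part of our example workflow):
subworkflow mapping_workflow:
    workdir: "../mapping"
    snakefile: "../mapping/Snakefile"

rule use_mapped:
    input:
        mapping_workflow("sorted_reads/A.bam")
    output:
        "copied/A.bam"
    shell:
        "cp {input} {output}"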
Exercise¶
- Put the read-mapping-related rules into a separate Snakefile and use the include directive to make them available in our example workflow again.
Using custom scripts¶
Using the run directive as above is only reasonable for short Python scripts.
As soon as your script becomes larger, it is reasonable to separate it from the workflow definition.
For this purpose, Snakemake offers the script directive.
Using this, the report rule from above could instead look like this:
rule report:
input:
T1="calls/all.vcf",
T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
output:
"report.html"
script:
"scripts/report.py"
The actual Python code to generate the report is now hidden in the script scripts/report.py.
Script paths are always relative to the referring Snakefile.
In the script, all properties of the rule like input, output, wildcards, params, threads etc. are available as attributes of a global snakemake object:
from snakemake.utils import report
with open(snakemake.input.T1) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
Benchmark results for BWA can be found in the tables T2_.
""", snakemake.output[0], **snakemake.input)
Although there are other strategies to invoke separate scripts from your workflow (e.g., invoking them via shell commands), the benefit of this approach is obvious: the script logic is separated from the workflow logic (and can even be shared between workflows), but boilerplate code like the parsing of command line arguments is unnecessary.
Apart from Python scripts, it is also possible to use R scripts.
In R scripts, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters.
Here the syntax follows that of S4 classes with attributes that are R lists; e.g., we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1).
Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].
For details and examples, see the External scripts section in the Documentation.
Automatic deployment of software dependencies¶
In order to get a fully reproducible data analysis, it is not sufficient to be able to execute each step and document all used parameters. The used software tools and libraries have to be documented as well. In this tutorial, you have already seen how Conda can be used to specify an isolated software environment for a whole workflow. With Snakemake, you can go one step further and specify Conda environments per rule. This way, you can even make use of conflicting software versions (e.g. combine Python 2 with Python 3).
In our example, instead of using an external environment we can specify environments per rule, e.g.:
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
conda:
"envs/samtools.yaml"
shell:
"samtools index {input}"
with envs/samtools.yaml defined as
channels:
- bioconda
dependencies:
- samtools =1.3
When Snakemake is executed with
snakemake --use-conda
it will automatically create required environments and activate them before a job is executed. It is best practice to specify at least the major and minor version of any packages in the environment definition. Specifying environments per rule in this way has two advantages. First, the workflow definition also documents all used software versions. Second, a workflow can be re-executed (without admin rights) on a vanilla system, without installing any prerequisites apart from Snakemake and Miniconda.
Tool wrappers¶
In order to simplify the utilization of popular tools, Snakemake provides a repository of so-called wrappers (the Snakemake wrapper repository).
A wrapper is a short script that wraps (typically) a command line application and makes it directly addressable from within Snakemake.
For this, Snakemake provides the wrapper directive that can be used instead of shell, script, or run.
For example, the rule bwa_map could alternatively look like this:
rule bwa_mem:
input:
ref="data/genome.fa",
sample=lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
log:
"logs/bwa_mem/{sample}.log"
params:
"-R '@RG\tID:{sample}\tSM:{sample}'"
threads: 8
wrapper:
"0.15.3/bio/bwa/mem"
The wrapper directive expects a (partial) URL that points to a wrapper in the repository.
These can be looked up in the corresponding database.
The first part of the URL is a Git version tag. Upon invocation, Snakemake
will automatically download the requested version of the wrapper.
Furthermore, in combination with --use-conda
(see Automatic deployment of software dependencies),
the required software will be automatically deployed before execution.
Cluster execution¶
By default, Snakemake executes jobs on the local machine it is invoked on. Alternatively, it can execute jobs in distributed environments, e.g., compute clusters or batch systems. If the nodes share a common file system, Snakemake supports three alternative execution modes.
In cluster environments, compute jobs are usually submitted as shell scripts via commands like qsub. Snakemake provides a generic mode to execute on such clusters. By invoking Snakemake with
$ snakemake --cluster qsub --jobs 100
each job will be compiled into a shell script that is submitted with the given command (here qsub). The --jobs flag limits the number of concurrently submitted jobs to 100.
This basic mode assumes that the submission command returns immediately after submitting the job.
Some clusters allow running the submission command in synchronous mode, such that it waits until the job has been executed. In such cases, we can invoke e.g.
$ snakemake --cluster-sync "qsub -sync yes" --jobs 100
The specified submission command can also be decorated with additional parameters taken from the submitted job. For example, the number of used threads can be accessed in braces similarly to the formatting of shell commands, e.g.
$ snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
Alternatively, Snakemake can use the Distributed Resource Management Application API (DRMAA). This API provides a common interface to control various resource management systems. The DRMAA support can be activated by invoking Snakemake as follows:
$ snakemake --drmaa --jobs 100
If available, DRMAA is preferable over the generic cluster modes because it provides better control and error handling. To support additional cluster specific parametrization, a Snakefile can be complemented by a Cluster Configuration file.
Constraining wildcards¶
Snakemake uses regular expressions to match output files to input files and determine dependencies between the jobs.
Sometimes it is useful to constrain the values a wildcard can have.
This can be achieved by adding a regular expression that describes the set of allowed wildcard values.
For example, the wildcard sample in the output file "sorted_reads/{sample}.bam" can be constrained to only allow alphanumeric sample names as "sorted_reads/{sample,[A-Za-z0-9]+}.bam". Constraints may be defined per rule or globally using the wildcard_constraints keyword, as demonstrated in Wildcards.
This mechanism helps to solve two kinds of ambiguity.
- It can help to avoid ambiguous rules, i.e. two or more rules that can be applied to generate the same output file. Other ways of handling ambiguous rules are described in the Section Handling Ambiguous Rules.
- It can help to guide the regular expression based matching so that wildcards are assigned to the right parts of a file name. Consider the output file {sample}.{group}.txt and assume that the target file is A.1.normal.txt. It is not clear whether sample="A.1" and group="normal" or sample="A" and group="1.normal" is the right assignment. Here, constraining the sample wildcard by {sample,[A-Z]+}.{group} solves the problem.
When dealing with ambiguous rules, it is best practice to first try to solve the ambiguity by using a proper file structure, for example, by separating the output files of different steps in different directories.
The Snakemake Executable¶
This part of the documentation describes the snakemake executable. Snakemake is primarily a command-line tool, so the snakemake executable is the primary way to execute, debug, and visualize workflows.
Useful Command Line Arguments¶
If called without parameters, i.e.
$ snakemake
Snakemake tries to execute the workflow specified in a file called Snakefile in the same directory (alternatively, the Snakefile can be given via the parameter -s).
By issuing
$ snakemake -n
a dry-run can be performed. This is useful to test if the workflow is defined properly and to estimate the amount of needed computation. Further, the reason for each rule execution can be printed via
$ snakemake -n -r
Importantly, Snakemake can automatically determine which parts of the workflow can be run in parallel. By specifying the number of available cores, i.e.
$ snakemake -j 4
one can tell Snakemake to use up to 4 cores and solve a binary knapsack problem to optimize the scheduling of jobs.
If the number is omitted (i.e., only -j
is given), the number of used cores is determined as the number of available CPU cores in the machine.
Cluster Execution¶
Snakemake can make use of cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine). In this case, Snakemake simply needs to be given a submit command that accepts a shell script as first positional argument:
$ snakemake --cluster qsub -j 32
Here, -j denotes the maximum number of jobs being submitted to the cluster at the same time (here 32).
The cluster command can be decorated with job specific information, e.g.
$ snakemake --cluster "qsub {threads}"
Thereby, all keywords of a rule are allowed (e.g. params, input, output, threads, priority, ...). For example, you could encode the expected running time into params:
rule:
input: ...
output: ...
params: runtime="4h"
shell: ...
and forward it to the cluster scheduler:
$ snakemake --cluster "qsub --runtime {params.runtime}"
If your cluster system supports DRMAA, Snakemake can make use of that to increase the control over jobs. E.g. jobs can be cancelled upon pressing Ctrl+C, which is not possible with the generic --cluster support. With DRMAA, no qsub command needs to be provided, but system specific arguments can still be given as a string, e.g.
$ snakemake --drmaa " -q username" -j 32
Note that the string has to contain a leading whitespace. Else, the arguments will be interpreted as part of the normal Snakemake arguments, and execution will fail.
Job Properties¶
When executing a workflow on a cluster using the --cluster
parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub
). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:
#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

# the job script is handed over as the first positional argument
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# do something useful with the threads
threads = job_properties["threads"]

# access a property defined in the cluster configuration file (Snakemake >=3.6.0)
time = job_properties["cluster"]["time"]

os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
Visualization¶
To visualize the workflow, one can use the option --dag
.
This creates a representation of the DAG in the graphviz dot language which has to be postprocessed by the graphviz tool dot
.
E.g. to visualize the DAG that would be executed, you can issue:
$ snakemake --dag | dot | display
For saving this to a file, you can specify the desired format:
$ snakemake --dag | dot -Tpdf > dag.pdf
To visualize the whole DAG regardless of the eventual presence of files, the --forceall option can be used:
$ snakemake --forceall --dag | dot -Tpdf > dag.pdf
Of course the visual appearance can be modified by providing further command line arguments to dot
.
All Options¶
All command line options can be printed by calling snakemake -h
.
Bash Completion¶
Snakemake supports bash completion for filenames, rule names and arguments. To enable it globally, just append
`snakemake --bash-completion`
including the backticks to your .bashrc. This only works if the snakemake command is in your path.
Writing Workflows¶
In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
Grammar¶
The Snakefile syntax obeys the following grammar, given in extended Backus-Naur form (EBNF)
snakemake = statement | rule | include | workdir
rule = "rule" (identifier | "") ":" ruleparams
include = "include:" stringliteral
workdir = "workdir:" stringliteral
ni = NEWLINE INDENT
ruleparams = [ni input] [ni output] [ni params] [ni message] [ni threads] [ni (run | shell)] NEWLINE snakemake
input = "input" ":" parameter_list
output = "output" ":" parameter_list
params = "params" ":" parameter_list
log = "log" ":" parameter_list
benchmark = "benchmark" ":" statement
message = "message" ":" stringliteral
threads = "threads" ":" integer
resources = "resources" ":" parameter_list
version = "version" ":" statement
run = "run" ":" ni statement
shell = "shell" ":" stringliteral
where all non-terminals not defined above map to their Python equivalents.
Depend on a Minimum Snakemake Version¶
From Snakemake 3.2 on, if your workflow depends on a minimum Snakemake version, you can easily ensure that at least this version is installed via
from snakemake.utils import min_version
min_version("3.2")
given that your minimum required version of Snakemake is 3.2. The statement will raise a WorkflowError (and therefore abort the workflow execution) if the version is not met.
Rules¶
Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
shell: "somecommand {input} {output}"
Inside the shell command, all local and global variables, especially input and output files, can be accessed via their names in the Python format minilanguage. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile). From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") will let Snakemake quote each of the list or tuple elements that contains whitespace.
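As a minimal sketch (the rule and file names are hypothetical), quoting is useful when paths contain whitespace:
rule concat:
    input:
        "raw data/sample one.csv",
        "raw data/sample two.csv"
    output:
        "merged/all samples.csv"
    shell:
        "cat {input:q} > {output:q}"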
Instead of a shell command, a rule can run some python code to generate the output:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", somename = "path/to/another/outputfile"
run:
for f in input:
...
with open(output[0], "w") as out:
out.write(...)
with open(output.somename, "w") as out:
out.write(...)
As can be seen, instead of accessing input and output as a whole, we can also access them by index (output[0]) or by keyword (output.somename). Note that, when adding keywords or names for input or output files, their order won't be preserved when accessing them as a whole via e.g. {output} in a shell command.
Shell commands like the one above can also be invoked inside a Python based rule, via the function shell that takes a string with the command and allows the same formatting as in the rule above, e.g.:
shell("somecommand {output.somename}")
Further, this combination of Python and shell commands allows iterating over the output of the shell command, e.g.:
for line in shell("somecommand {output.somename}", iterable=True):
... # do something in python
Note that shell commands in Snakemake use the bash shell in strict mode by default.
Wildcards¶
Usually, it is useful to generalize a rule to be applicable to a number of e.g. datasets. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Here, we define two wildcards, dataset and group. By this, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. the wildcards are replaced by the regular expression .+. If the rule’s output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, that is here also used in the shell command. The wildcards object can be accessed in the same way as input and output, which is described above.
For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A. Thus, it requests the file 101/inputfile as input and executes the command somecommand --group A < 101/inputfile > 101/file.A.txt.
Of course, the input file might have to be generated by another rule with different wildcards.
Importantly, the wildcard names in input and output must be named identically. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not the input section.
Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case. Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only, using \d+ as the corresponding regular expression.
With Snakemake 3.8.0, there are three ways to constrain wildcards.
First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints
:
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
wildcard_constraints:
dataset="\d+"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Finally, you can also define global wildcard constraints that apply for all rules:
wildcard_constraints:
dataset="\d+"
rule a:
...
rule b:
...
See the Python documentation on regular expressions for detailed information on regular expression syntax.
Targets¶
By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:
rule all:
input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]
Here, for each dataset in a python list DATASETS
defined before, the file {dataset}/file.A.txt
is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion
shown above.
The above expression can be simplified to the following:
rule all:
input: expand("{dataset}/file.A.txt", dataset=DATASETS)
This may be used for “aggregation” rules for which files from multiple or all datasets are needed to produce a specific output (say, allSamplesSummary.pdf).
Note that dataset is NOT a wildcard here because it is resolved by Snakemake due to the expand statement (see below also for more information). The expand function thereby also allows combining different variables, e.g.
rule all:
input: expand("{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS)
If now PLOTFORMATS=["pdf", "png"]
contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.
Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.
expand(["{dataset}/plot1.{ext}", "{dataset}/plot2.{ext}"], dataset=DATASETS, ext=PLOTFORMATS)
leads to
["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.pdf", "ds2/plot2.pdf", "ds1/plot1.png", "ds1/plot2.png", "ds2/plot1.png", "ds2/plot2.png"]
Per default, expand uses the Python itertools function product that yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. zip:
expand("{dataset}/plot1.{ext} {dataset}/plot2.{ext}".split(), zip, dataset=DATASETS, ext=PLOTFORMATS)
leads to
["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.png", "ds2/plot2.png"]
You can also mask a wildcard expression in expand such that it will be kept, e.g.
expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)
will create strings with all values for ext but starting with "{dataset}"
.
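For example (a sketch; the rule and command are hypothetical), such a masked expression can be used as the input of a rule whose output still carries the dataset wildcard, so that {dataset} is resolved per requested file while all plot formats are required at once:
rule combine_plots:
    input:
        expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)
    output:
        "{dataset}/plots_combined.pdf"
    shell:
        "somecommand {input} {output}"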
Threads¶
Further, a rule can be given a number of threads to use, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
shell: "somecommand --threads {threads} {input} {output}"
Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built-in variable threads rather than hardcoding it into the shell command. In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores), with cores being the number of cores specified at the command line (option --cores). On a cluster node, Snakemake uses as many cores as available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by --local-cores, which only applies to jobs running on the master node.
Starting from version 3.7, threads can also be a callable that returns an int value. It is also possible to refer to a predefined variable (e.g., threads: threads_max) so that the number of cores for a set of rules can be changed with one change only, by altering the value of the variable threads_max.
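A minimal sketch of the variable-based approach (the rule, paths, and command line are illustrative):
threads_max = 16

rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    threads: threads_max
    shell:
        "samtools sort -@ {threads} -o {output} {input}"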
Resources¶
In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:
rule:
input: ...
output: ...
resources: gpu=1
shell: "..."
If limits for the resources are given via the command line, e.g.
$ snakemake --resources gpu=2
the scheduler will ensure that the given resources are not exceeded by running jobs.
If no limits are given, the resources are ignored.
Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs), this allows controlling scheduling in various ways, e.g. limiting IO-heavy jobs by assigning an artificial IO-resource to them and restricting it via the --resources flag.
Resources must be int values. Resources can also be callables that return int values. The signature of the callable should be callable(wildcards, [input]) (input is an optional parameter).
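A sketch of a callable resource, assuming a helper that derives the GPU count from the number of input files (the rule, paths, helper name, and command are hypothetical):
def gpus_needed(wildcards, input):
    # one GPU per input file, capped at two
    return min(len(input), 2)

rule train:
    input:
        "data/{sample}.train.txt",
        "data/{sample}.validate.txt"
    output:
        "models/{sample}.model"
    resources:
        gpu=gpus_needed
    shell:
        "somecommand {input} {output}"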
Messages¶
When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
message: "Executing somecommand with {threads} threads on the following files {input}."
shell: "somecommand --threads {threads} {input} {output}"
Note that access to wildcards is also possible via the variable wildcards
(e.g, {wildcards.sample}
), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.
Priorities¶
Snakemake allows rules to specify numeric priorities:
rule:
input: ...
output: ...
priority: 50
shell: ...
Per default, each rule has a priority of 0. Any rule that specifies a higher priority will be preferred by the scheduler over all other rules that are ready to execute at the same time but do not have at least the same priority.
Furthermore, the --prioritize or -P command line flag allows specifying files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible. The --dryrun or -n option allows you to see the scheduling plan including the assigned priorities.
Log-Files¶
Each rule can specify a log file where information about the execution is written to:
rule abc:
input: "input.txt"
output: "output.txt"
log: "logs/abc.log"
shell: "somecommand --log {log} {input} {output}"
The variable log can be used inside a shell command to tell the used tool to which file to write the logging information. Of course, the log file can use the same wildcards as input and output files, e.g.
log: "logs/abc.{dataset}.log"
For programs that do not have an explicit log parameter, you may always use 2> {log} to redirect standard error to a file (here, the log file) on Linux-based systems.
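For example (somecommand is a placeholder), redirecting standard error to the log file looks like this:
rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand {input} > {output} 2> {log}"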
Note that it is also supported to have multiple (named) log files being specified:
rule abc:
input: "input.txt"
output: "output.txt"
log: log1="logs/abc.log", log2="logs/xyz.log"
shell: "somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}"
Non-file parameters for rules¶
Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the params
keyword for this purpose:
rule:
input:
...
params:
prefix="somedir/{sample}"
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
The params keyword allows you to specify additional parameters depending on the wildcards values. This allows you to circumvent the need to use run: and Python code for non-standard commands like in the above case. Here, the command somecommand expects the prefix of the output file instead of the actual filename. The params keyword helps here, since you cannot simply add the prefix as an output file (as that file won't be created, Snakemake would throw an error after execution of the rule).
Furthermore, for enhanced readability and clarity, the params
section is also an excellent place to name and assign parameters and variables for your subsequent command.
Similar to input
, params
can take functions as well (see Functions as Input Files), e.g. you can write
rule:
input:
...
params:
prefix=lambda wildcards, output: output[0][:-4]
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
to get the same effect as above. Note that in contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. Here, this allows you to derive the prefix name from the output file.
External scripts¶
A rule can also point to an external script instead of a shell command or inline Python code, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"path/to/script.py"
The script path is always relative to the Snakefile (in contrast to the input and output file paths, which are relative to the working directory).
Inside the script, you have access to an object snakemake
that provides access to the same objects that are available in the run
and shell
directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0]
to access the first input file of above rule.
Apart from Python scripts, this mechanism also allows you to integrate R scripts with Snakemake, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"path/to/script.R"
In the R script, an S4 object named snakemake
analog to the Python case above is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]]
(note that the first file does not have index 0
here, because R starts counting from 1
). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]]
.
An example external Python script could look like this:
def do_something(data_path, out_path, threads, myparam):
    # python code
    ...

do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])
You can use the Python debugger from within the script if you invoke Snakemake with --debug
.
An equivalent script written in R would look like this:
do_something <- function(data_path, out_path, threads, myparam) {
# R code
}
do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])
To debug R scripts, you can save the workspace with save.image()
, and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake
variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.
Protected and Temporary Files¶
A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected
:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: protected("path/to/outputfile"), "path/to/another/outputfile"
shell: "somecommand --threads {threads} {input} {output}"
A protected file will be write-protected after the rule that produces it is completed.
Further, an output file marked as temp
is deleted after all rules that use it as an input are completed:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: temp("path/to/outputfile"), "path/to/another/outputfile"
shell: "somecommand --threads {threads} {input} {output}"
Shadow rules¶
Shadow rules result in each execution of the rule being run in an isolated temporary directory. This “shadow” directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don’t want to manually clean up in your Snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or to simplify your workflow by not having to worry about unique filenames for all outputs of all rules.
By setting shadow: "shallow"
, the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting shadow: "full"
fully shadows the entire subdirectory structure of the current workdir. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by output
.
Shadow directories are stored one per rule execution in .snakemake/shadow/
, and are cleared on subsequent snakemake invocations unless the --keep-shadow
command line argument is used.
Typically, you will not need to modify your rule for compatibility with shadow
, unless you reference parent directories relative to your workdir in a rule.
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
shadow: "shallow"
shell: "somecommand --other_outputs other.txt {input} {output}"
Flag files¶
Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by “touching” empty files that denote that a certain task was completed. Snakemake supports this via the touch flag:
rule all:
input: "mytask.done"
rule mytask:
output: touch("mytask.done")
shell: "mycommand ..."
With the touch
flag, Snakemake touches (i.e. creates or updates) the file mytask.done
after mycommand
has finished successfully.
Dynamic Files¶
Snakemake provides experimental support for dynamic files. Dynamic files can be used whenever one has a rule for which the number of output files is unknown before the rule was executed. This is useful for example with certain clustering algorithms:
rule cluster:
input: "afile.csv"
output: dynamic("{clusterid}.cluster.csv")
run: ...
Now the results of the rule can be used in Snakemake although it does not know how many files will be present before executing the rule cluster, e.g. by:
rule all:
input: dynamic("{clusterid}.cluster.plot.pdf")
rule plot:
input: "{clusterid}.cluster.csv"
output: "{clusterid}.cluster.plot.pdf"
run: ...
Here, Snakemake determines the input files for the rule all after the rule cluster was executed, and then dynamically inserts jobs of the rule plot into the DAG to create the desired plots.
Functions as Input Files¶
Instead of specifying strings or lists of strings as input files, Snakemake can also make use of functions that return single input files or lists of input files:
def myfunc(wildcards):
return [... a list of input files depending on given wildcards ...]
rule:
input: myfunc
output: "someoutput.{somewildcard}.txt"
shell: "..."
The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files. Note that you can also use lambda expressions instead of full function definitions. By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.
Note that the function will be executed when the rule is evaluated and before the workflow actually starts to execute. Further note that using a function as input overrides the default mechanism of replacing wildcards with their values inferred from the output files. You have to take care of that yourself with the given wildcards object.
Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have. In particular, input functions should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.
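A concrete sketch, assuming a config file that maps sample names to file paths (the rule, paths, and command are hypothetical):
def input_for_sample(wildcards):
    # look up the path(s) recorded for this sample in the config file
    return config["samples"][wildcards.sample]

rule map_reads:
    input: input_for_sample
    output: "mapped/{sample}.bam"
    shell: "somecommand {input} > {output}"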
Input Functions and unpack()¶
In some cases, you might want to have your input functions return named input files.
This can be done by having them return dict()
objects with the names as the dict keys and the file names as the dict values and using the unpack()
keyword.
def myfunc(wildcards):
    return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}
rule:
input: unpack(myfunc)
output: "someoutput.{token}.txt"
shell: "..."
Note that unpack() is only necessary for input functions returning dict. While it also works for list, remember that lists (and nested lists) of strings are automatically flattened. Also note that if you do not pass a function into the input list but directly call a function, then you don’t use unpack() either. Here, you can simply use Python’s double-star (**) operator for unpacking the parameters. Note that as Snakefiles are translated into Python for execution, the same rules as for using the star and double-star unpacking Python operators apply. These restrictions do not apply when using unpack().
def myfunc1():
return ['foo.txt']
def myfunc2():
return {'foo': 'nowildcards.txt'}
rule:
input:
*myfunc1(),
**myfunc2(),
output: "..."
shell: "..."
Version Tracking¶
Rules can specify a version that is tracked by Snakemake together with the output files. When the version changes, Snakemake informs you when using the flag --summary or --list-version-changes.
The version can be specified by the version directive, which takes a string:
rule:
input: ...
output: ...
version: "1.0"
shell: ...
The version can of course also be filled with the output of a shell command, e.g.:
import subprocess

SOMECOMMAND_VERSION = subprocess.check_output("somecommand --version", shell=True)
rule:
version: SOMECOMMAND_VERSION
Alternatively, you might want to use file modification times in case of local scripts:
import os

SOMECOMMAND_VERSION = str(os.path.getmtime("path/to/somescript"))
rule:
version: SOMECOMMAND_VERSION
A re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-version-changes`
With the availability of the conda directive (see Integrated Package Management), the version directive has become obsolete in favor of defining isolated software environments that can be automatically deployed via the conda package manager.
Code Tracking¶
Snakemake tracks the code that was used to create your files.
In combination with --summary
or --list-code-changes
this can be used to see what files may need a re-run because the implementation changed.
Re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-code-changes`
Onstart, onsuccess and onerror handlers¶
Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user). With Snakemake 3.2.1, this is possible via the onsuccess and onerror keywords:
onsuccess:
print("Workflow finished, no error")
onerror:
print("An error occurred")
shell("mail -s "an error occurred" youremail@provider.com < {log}")
The onsuccess
handler is executed if the workflow finished without error. Else, the onerror
handler is executed.
In both handlers, you have access to the variable log
, which contains the path to a logfile with the complete Snakemake output.
Snakemake 3.6.0 adds an onstart handler that will be executed before the workflow starts.
Note that dry-runs do not trigger any of the handlers.
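A minimal sketch of the onstart handler (the message is arbitrary):
onstart:
    print("Workflow started")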
Rule dependencies¶
From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:
rule a:
input: "path/to/input"
output: "path/to/output"
shell: ...
rule b:
input: rules.a.output
output: "path/to/output/of/b"
shell: ...
Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows resolving dependencies that are ambiguous when using filenames.
Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:
rule a:
input: "path/to/input"
output: a = "path/to/output", b = "path/to/output2"
shell: ...
rule b:
input: rules.a.output.a
output: "path/to/output/of/b"
shell: ...
Handling Ambiguous Rules¶
When two rules can produce the same output file, snakemake cannot decide per default which one to use. Hence an AmbiguousRuleException
is thrown.
Note: ruleorder is not intended to bring rules in the correct execution order (this is solely guided by the names of input and output files you use), it only helps snakemake to decide which rule to use when multiple ones can create the same output file!
The proposed strategy to deal with such ambiguity is to provide a ruleorder
for the conflicting rules, e.g.
ruleorder: rule1 > rule2 > rule3
Here, rule1 is preferred over rule2 and rule3, and rule2 is preferred over rule3. Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file.
Alternatively, rule dependencies (see above) can also resolve ambiguities.
Another (quick and dirty) possibility is to tell Snakemake to allow ambiguity via a command line option:
$ snakemake --allow-ambiguity
such that, similar to GNU Make, the first matching rule is always used. Here, a warning that summarizes the decision of Snakemake is provided in the terminal.
Local Rules¶
When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target-rule like all, see Targets). The keyword localrules allows marking a rule as local, so that it is not submitted to the cluster and instead executed on the host node:
localrules: all, foo
rule all:
input: ...
rule foo:
...
rule bar:
...
Here, only jobs from the rule bar
will be submitted to the cluster, whereas all and foo will be run locally.
Note that you can use the localrules directive multiple times. The result will be the union of all declarations.
Benchmark Rules¶
Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule
rule benchmark_command:
input:
"path/to/input.{sample}.txt"
output:
"path/to/output.{sample}.txt"
benchmark:
"benchmarks/somecommand/{sample}.txt"
shell:
"somecommand {input} {output}"
benchmarks the CPU and wall clock time of the command somecommand
for the given output and input files.
For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark txt file (which will contain a tab-separated table of run times). Per default, Snakemake executes the job once, generating one run time.
With snakemake --benchmark-repeats
, this number can be changed to e.g. generate timings for two or three runs.
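For example, to record three timings per benchmarked job, one might invoke:
$ snakemake --benchmark-repeats 3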
The resulting txt file can be used as input for other rules, just like any other output file.
Configuration¶
Snakemake allows you to use configuration files for making your workflows more flexible and also for abstracting away direct dependencies to a fixed HPC cluster scheduler.
Standard Configuration¶
Snakemake directly supports the configuration of your workflow. A configuration is provided as a JSON or YAML file and can be loaded with:
configfile: "path/to/config.json"
The config file can be used to define a dictionary of configuration parameters and their values. In the workflow, the configuration is accessible via the global variable config, e.g.
rule all:
    input:
        expand("{sample}.{yourparam}.output.pdf", sample=config["samples"], yourparam=config["yourparam"])
If the configfile statement is not used, the config variable provides an empty dictionary. In addition to the configfile statement, config values can be overwritten via the command line or the Snakemake API, e.g.:
$ snakemake --config yourparam=1.5
Further, you can manually alter the config dictionary using any Python code outside of your rules. Changes made from within a rule won’t be seen from other rules. Finally, you can use the --configfile command line argument to overwrite values from the configfile statement. Note that any values parsed into the config dictionary with any of the above mechanisms are merged, i.e., all keys defined via a configfile statement, or the --configfile and --config command line arguments, will end up in the final config dictionary; but if two methods define the same key, the command line overwrites the configfile statement.
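For example (the file name is hypothetical), both mechanisms can be combined in one invocation, with the command line value taking precedence for identical keys:
$ snakemake --configfile config.custom.yaml --config yourparam=1.5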
For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:
shell:
"mycommand {config[foo]} ..."
Cluster Configuration¶
Snakemake supports a separate configuration file for execution on a cluster.
A cluster config file allows you to specify cluster submission parameters outside the Snakefile.
The cluster config is a JSON- or YAML-formatted file that contains objects that match names of rules in the Snakefile.
The parameters in the cluster config are then accessed by the cluster.*
wildcard when you are submitting jobs.
For example, say that you have the following Snakefile:
rule all:
input: "input1.txt", "input2.txt"
rule compute1:
output: "input1.txt"
shell: "touch input1.txt"
rule compute2:
output: "input2.txt"
shell: "touch input2.txt"
This Snakefile can then be configured by a corresponding cluster config, say “cluster.json”:
{
"__default__" :
{
"account" : "my account",
"time" : "00:15:00",
"n" : 1,
"partition" : "core"
},
"compute1" :
{
"time" : "00:20:00"
}
}
Any string in the cluster configuration can be formatted in the same way as shell commands, e.g. {rule}.{wildcards.sample}
is formatted to a.xy
if the rulename is a
and the wildcard value is xy
.
Here __default__ is a special object that specifies default parameters; these will be inherited by the other configuration objects. The compute1 object here changes the time parameter, but keeps the other parameters from __default__. The rule compute2 does not have any configuration, and will therefore use the default configuration. You can then run the Snakefile with the following command on a SLURM system.
$ snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}"
For cluster systems using LSF/BSUB, a cluster config may look like this:
{
"__default__" :
{
"queue" : "medium_priority",
"nCPUs" : "16",
"memory" : 20000,
"resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\"",
"name" : "JOBNAME.{rule}.{wildcards}",
"output" : "logs/cluster/{rule}.{wildcards}.out",
"error" : "logs/cluster/{rule}.{wildcards}.err"
},
"trimming_PE" :
{
"memory" : 30000,
"resources" : "\"select[mem>30000] rusage[mem=30000] span[hosts=1]\"",
}
}
The advantage of this setup is that it is already pretty general by exploiting the wildcard possibilities that Snakemake provides via {rule} and {wildcards}. So job names, output and error files all have reasonable and trackable default names; only the directories (logs/cluster) and job names (JOBNAME) have to be adjusted accordingly.
If a rule named bamCoverage
is executed with the wildcard basename = sample1
, for example, the output and error files will be bamCoverage.basename=sample1.out
and bamCoverage.basename=sample1.err
, respectively.
Configure Working Directory¶
All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behaviour can be overridden by specifying a workdir in the snakefile:
workdir: "path/to/workdir"
Usually, it is preferred to only set the working directory via the command line, because the above directive limits the portability of Snakemake workflows.
Modularization¶
Modularization in Snakemake comes at different levels.
- The most fine-grained level are wrappers. They are available and can be published at the Snakemake Wrapper Repository. These wrappers can then be composed and customized according to your needs, by copying skeleton rules into your workflow. In combination with conda integration, wrappers also automatically deploy the needed software dependencies into isolated environments.
- For larger, reusable parts that shall be integrated into a common workflow, it is recommended to write small Snakefiles and include them into a master Snakefile via the include statement. In such a setup, all rules share a common config file.
- The third level of separation are subworkflows. Importantly, these are rather meant as links between otherwise separate data analyses.
Wrappers¶
With Snakemake 3.5.5, the wrapper directive was introduced (experimental). This directive allows having re-usable wrapper scripts around e.g. command line tools. In contrast to modularization strategies like include or subworkflows, the wrapper directive allows re-wiring the DAG of jobs.
For example
rule samtools_sort:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.sorted.bam"
params:
"-m 4G"
threads: 8
wrapper:
"0.0.8/bio/samtools_sort"
This refers to the wrapper "0.0.8/bio/samtools_sort" to create the output from the input. Snakemake will automatically download the wrapper from the Snakemake Wrapper Repository. Thereby, 0.0.8 can be replaced with the git version tag you want to use, or a commit id (see here). This ensures reproducibility, since changes in the wrapper implementation won’t be propagated automatically to your workflow. Alternatively, e.g. for development, the wrapper directive can also point to full URLs, including URLs to local files with absolute paths (file://) or relative paths (file:).
Examples for each wrapper can be found in the READMEs located in the wrapper subdirectories at the Snakemake Wrapper Repository.
The Snakemake Wrapper Repository is meant as a collaborative project and pull requests are very welcome.
Includes¶
Another Snakefile with all its rules can be included into the current:
include: "path/to/other/snakefile"
The default target rule (often called the all rule) won’t be affected by the include. I.e. it will always be the first rule in your Snakefile, no matter how many includes you have above your first rule. Includes are relative to the directory of the Snakefile in which they occur. For example, if the above Snakefile resides in the directory my/dir, then Snakemake will search for the include at my/dir/path/to/other/snakefile, regardless of the working directory.
Sub-Workflows¶
In addition to including rules of another workflow, Snakemake allows depending on the output of other workflows as sub-workflows. A sub-workflow is executed independently before the current workflow is executed. Thereby, Snakemake ensures that all files the current workflow depends on are created or updated if necessary. This allows creating links between otherwise separate data analyses.
subworkflow otherworkflow:
workdir: "../path/to/otherworkflow"
snakefile: "../path/to/otherworkflow/Snakefile"
rule a:
input: otherworkflow("test.txt")
output: ...
shell: ...
Here, the subworkflow is named “otherworkflow” and it is located in the working directory ../path/to/otherworkflow. The snakefile is in the same directory and called Snakefile. If snakefile is not defined for the subworkflow, it is assumed to be located in the workdir location and called Snakefile; hence, above we could have left the snakefile keyword out as well. If workdir is not specified, it is assumed to be the same as the current one. Files that are output from the subworkflow that we depend on are marked with the otherworkflow function (see the input of rule a). This function automatically determines the absolute path to the file (here ../path/to/otherworkflow/test.txt). When executing, Snakemake first tries to create (or update, if necessary) test.txt (and all other possibly mentioned dependencies) by executing the subworkflow. Then the current workflow is executed. This can also happen recursively, since the subworkflow may have its own subworkflows as well.
Remote files¶
Available in Snakemake 3.5 and later.
The Snakefile supports a wrapper function, remote(), indicating a file is on a remote storage provider (this is similar to temp() or protected()). In order to use all types of remote files, the Python packages boto, moto, filechunkio, pysftp, dropbox, requests, and ftputil must be installed.
During rule execution, a remote file (or object) specified is downloaded to the local cwd
, within a sub-directory bearing the same name as the remote provider. This sub-directory naming lets you have multiple remote origins with reduced likelihood of name collisions, and allows Snakemake to easily translate remote objects to local file paths. You can think of each local remote sub-directory as a local mirror of the remote system. The remote()
wrapper is mutually-exclusive with the temp()
and protected()
wrappers.
Snakemake includes the following remote providers, supported by the corresponding classes:
- Amazon Simple Storage Service (AWS S3):
snakemake.remote.S3
- Google Cloud Storage (GS):
snakemake.remote.GS
- File transfer over SSH (SFTP):
snakemake.remote.SFTP
- Read-only web (HTTP[S]):
snakemake.remote.HTTP
- File transfer protocol (FTP):
snakemake.remote.FTP
- Dropbox:
snakemake.remote.dropbox
Amazon Simple Storage Service (S3)¶
This section describes usage of the S3 RemoteProvider, and also provides an intro to remote files and their usage.
It is important to note that you must have credentials (access_key_id and secret_access_key) which permit read/write access. If a file only serves as input to a Snakemake rule, read access is sufficient. You may specify credentials as environment variables or in the file ~/.aws/credentials, prefixed with AWS_*, as with a standard boto config. Credentials may also be explicitly listed in the Snakefile, as shown below:
For the Amazon S3 and Google Cloud Storage providers, the sub-directory used must be the bucket name.
Using remote files is easy (AWS S3 shown):
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input:
S3.remote("bucket-name/file.txt")
Expand still works as expected, just wrap the expansion:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider()
rule all:
input:
S3.remote(expand("bucket-name/{letter}-2.txt", letter=["A", "B", "C"]))
It is possible to use S3-compatible storage by specifying a different endpoint address as the host kwarg in the provider, as the kwargs used in instantiating the provider are passed in to boto:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET", host="mystorage.example.com")
rule all:
input:
S3.remote("bucket-name/file.txt")
Only remote files needed to satisfy the DAG build are downloaded for the workflow. By default, remote files are downloaded prior to rule execution and are removed locally as soon as no rules depend on them. Remote files can be explicitly kept by setting the keep_local=True
keyword argument:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input: S3.remote('bucket-name/prefix{split_id}.txt', keep_local=True)
If you wish to have a rule that simply downloads a file to a local copy, you can do so by declaring the same file path locally as is used by the remote file:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input:
S3.remote("bucket-name/out.txt")
output:
"bucket-name/out.txt"
run:
shell("cp {output[0]} ./")
The remote provider also supports a new glob_wildcards()
(see How do I run my rule on all files of a certain directory?) which acts the same as the local version of glob_wildcards()
, but for remote files:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
S3.glob_wildcards("bucket-name/{file_prefix}.txt")
# (the result looks just as if the local glob_wildcards() function were used on a local directory called "bucket-name")
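The resulting wildcard values can then be fed back into expand() to request all matching remote files, e.g. (a sketch using the same hypothetical bucket):
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

# collect all file prefixes currently present in the bucket
prefixes = S3.glob_wildcards("bucket-name/{file_prefix}.txt").file_prefix

rule all:
    input:
        S3.remote(expand("bucket-name/{file_prefix}.txt", file_prefix=prefixes))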
Google Cloud Storage (GS)¶
Using Google Cloud Storage (GS) is a simple import change, though since GS support is based on boto, GS must be accessed via Google’s “interoperable” credentials. Usage of the GS provider is the same as the S3 provider. You may specify credentials as environment variables, in the file ~/.aws/credentials prefixed with AWS_*, as with a standard boto config, or explicitly in the Snakefile.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input:
GS.remote("bucket-name/file.txt")
File transfer over SSH (SFTP)¶
Snakemake can use files on remote servers accessible via SFTP (i.e. most *nix servers). It uses pysftp for the underlying SFTP support, so the same connection options exist. Assuming you have SSH keys already set up for the server you are using in the Snakefile, usage is simple:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
The remote file addresses used must be specified with the host (domain or IP address) and the absolute path to the file on the remote server. A port may be specified if the SSH daemon on the server is listening on a port other than 22, in either the RemoteProvider
or in each instance of remote()
:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(port=4040)
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
rule all:
input:
SFTP.remote("example.com:4040/path/to/file.bam")
The standard keyword arguments used by pysftp may be provided to the RemoteProvider to specify credentials (either password or private key):
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", private_key="/Users/myusername/.ssh/particular_id_rsa")
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
If you share credentials between servers but connect to one on a different port, the alternate port may be specified in the remote()
wrapper:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")
rule all:
input:
SFTP.remote("some-example-server-1.com/path/to/file.bam"),
SFTP.remote("some-example-server-2.com:2222/path/to/file.bam")
There is a glob_wildcards()
function:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
SFTP.glob_wildcards("example.com/path/to/{sample}.bam")
Read-only web (HTTP[s])¶
Snakemake can access web resources via a read-only HTTP(S) provider. This provider can be helpful for including public web data in a workflow.
Web addresses must be specified without protocol, so if your URI looks like this:
http://server3.example.com/path/to/myfile.tar.gz
The URI used in the Snakefile
must look like this:
server3.example.com/path/to/myfile.tar.gz
It is straightforward to use the HTTP provider to download a file to the cwd:
import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com/path/to/document.pdf", keep_local=True)
run:
outputName = os.path.basename(input[0])
shell("mv {input} {outputName}")
To connect on a different port, specify the port as part of the URI string:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com:8080/path/to/document.pdf", keep_local=True)
By default, the HTTP provider always uses HTTPS (TLS). If you need to connect to a resource with regular HTTP (no TLS), you must explicitly include insecure
as a kwarg
to remote()
:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com/path/to/document.pdf", insecure=True, keep_local=True)
If the URI used includes characters not permitted in a local file path, you may include them as part of the additional_request_string
in the kwargs
for remote()
. This may also be useful for including additional parameters you do not want to be part of the local filename (since the URI string becomes the local file name).
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/query.php", additional_request_string="?range=2;3")
If the file requires authentication, you can specify a username and password for HTTP Basic Auth with the RemoteProvider, or with each instance of remote().
For other types of authentication, you can pass a Python requests.auth object (see here) via the auth kwarg.
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
HTTP.remote("example.com/interactive.php", keep_local=True)
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/interactive.php", username="myusername", password="mypassword", keep_local=True)
import requests
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/interactive.php", auth=requests.auth.HTTPDigestAuth("myusername", "mypassword"), keep_local=True)
Since remote servers do not present directory contents uniformly, glob_wildcards()
is not supported by the HTTP provider.
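If you nevertheless need many remote HTTP files that follow a common pattern, one workaround is to enumerate the wildcard values yourself and combine them with expand(). This is only a sketch; the host, path, and sample names are made up:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

# list the sample names explicitly, since they cannot be globbed over HTTP
SAMPLES = ["sampleA", "sampleB", "sampleC"]

rule all:
    input:
        HTTP.remote(expand("www.example.com/data/{sample}.csv", sample=SAMPLES), keep_local=True)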
File Transfer Protocol (FTP)¶
Snakemake can work with files stored on regular FTP. Currently supported are authenticated FTP and anonymous FTP, excluding FTP via TLS.
Usage is similar to the SFTP provider; however, the paths specified are relative to the FTP home directory (since this is typically a chroot):
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
FTP.remote("example.com/rel/path/to/file.tar.gz")
The port may be specified in either the provider, or in each instance of remote():
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword", port=2121)
rule all:
input:
FTP.remote("example.com/rel/path/to/file.tar.gz")
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
FTP.remote("example.com:2121/rel/path/to/file.tar.gz")
Anonymous download of FTP resources is possible:
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()
rule all:
input:
# only keeping the file so we can move it out to the cwd
FTP.remote("example.com/rel/path/to/file.tar.gz", keep_local=True)
run:
shell("mv {input} ./")
glob_wildcards() is supported as well:
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
print(FTP.glob_wildcards("example.com/somedir/{file}.txt"))
Dropbox¶
The Dropbox remote provider allows you to upload and download from your Dropbox account without having the client installed on your machine. In order to use the provider you first need to register an “app” on the Dropbox developer website, with access to the Full Dropbox. After registering, generate an OAuth2 access token. You will need the token to use the Snakemake Dropbox remote provider.
Using the Dropbox provider is straightforward:
from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")
rule all:
input:
DBox.remote("path/to/input.txt")
glob_wildcards()
is supported:
from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")
DBox.glob_wildcards("path/to/{title}.txt")
Note that Dropbox paths are case-insensitive.
Remote cross-provider transfers¶
It is possible to use Snakemake to transfer files between remote providers (using the local machine as an intermediary), as long as the sub-directory (bucket) names differ:
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
GS = GSRemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")
S3 = S3RemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")
fileList, = S3.glob_wildcards("source-bucket/{file}.bam")
rule all:
input:
GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
rule transfer_S3_to_GS:
input:
S3.remote( expand("source-bucket/{file}.bam", file=fileList) )
output:
GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
run:
shell("cp -R source-bucket/ destination-bucket/")
Utils¶
The module snakemake.utils
provides a collection of helper functions for common tasks in Snakemake workflows. Details can be found in Additional utils.
Reports¶
The report
function provides an easy mechanism to write reports containing your results. A report is written in reStructuredText and compiled to HTML. The function allows you to embed your generated tables and plots into the HTML file. By referencing the files from your text, you can easily provide a semantic connection between them. To use this function, you need to have the docutils package installed.
from snakemake.utils import report
SOMECONSTANT = 42
rule report:
input: F1="someplot.pdf",
T1="sometable.txt"
output: html="report.html"
run:
report("""
=======================
The title of the report
=======================
Write your report here, explaining your results. Don't fear to use math;
it will be rendered correctly in any browser using MathJAX,
e.g. inline :math:`\sum_{{j \in E}} t_j \leq I`,
or even properly separated:
.. math::
|cq_{{0ctrl}}^i - cq_{{nt}}^i| > 0.5
Include your files using their keyword name and an underscore: F1_, T1_.
Access your global and local variables like within shell commands, e.g. {SOMECONSTANT}.
""", output.html, metadata="Johannes Köster (johannes.koester@uni-due.de)", **input)
The optional metadata argument allows you to provide arbitrary additional information to the report, e.g. the author name.
Unpacking the input files (**input
) in the report function generates keyword args that can be referenced inside the document with the mentioned underscore notation. The files will be embedded into the HTML file using data URLs, thus making the report fully portable and not dependent on your local filesystem structure.
Scripting with R¶
The R
function allows you to use R code in your rules. It relies on rpy2:
from snakemake.utils import R
SOMECONSTANT = 42
rule:
input: ...
output: ...
run:
R("""
# write your R code here
# Access any global or local variables from the Snakefile with the braces notation
sqrt({SOMECONSTANT});
# be sure to mask braces used in R control flow by doubling them:
if(TRUE) {{
# do something
}}
""")
If you compiled your Python installation from source, make sure that Python was built with sqlite support, which is needed for rpy2.
Workflow Distribution and Deployment¶
It is recommended to store each workflow in a dedicated git repository of the following structure:
├── .gitignore
├── README.md
├── LICENSE.md
├── config.yaml
├── environment.yaml
├── scripts
│ ├── __init__.py
│ ├── script1.py
│ └── script2.R
└── Snakefile
Then, a workflow can be deployed to a new system via the following steps:
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# install dependencies into isolated environment
conda env create -n myworkflow --file environment.yaml
# activate environment
source activate myworkflow
# execute workflow
snakemake -n
Importantly, git branching and pull requests can be used to modify and possibly re-integrate workflows.
Integrated Package Management¶
With Snakemake 3.9.0 it is possible to define isolated software environments per rule.
Upon execution of a workflow, the Conda package manager is used to obtain and deploy the defined software packages in the specified versions. Packages will be installed into your working directory, without requiring any admin/root privileges.
Given that conda is available on your system (see Miniconda), to use the Conda integration, add the --use-conda
flag to your workflow execution command, e.g. snakemake --cores 8 --use-conda
.
When --use-conda
is activated, Snakemake will automatically create software environments for any used wrapper (see above).
Further, you can manually define environments via the conda
directive, e.g.:
rule NAME:
input:
"table.txt"
output:
"plots/myplot.pdf"
conda:
"envs/ggplot.yaml"
script:
"scripts/plot-stuff.R"
with the following environment definition:
channels:
- r
dependencies:
- r=3.3.1
- r-ggplot2=2.1.0
Snakemake will store the environment persistently in .snakemake/conda/$hash
with $hash
being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected.
Note that you need to clean up environments manually for now. However, they are lightweight and consist only of symlinks to your central conda installation.
Sustainable and reproducible archiving¶
With Snakemake 3.10.0 it is possible to archive a workflow into a tarball (.tar, .tar.gz, .tar.bz2, .tar.xz), via
snakemake --archive my-workflow.tar.gz
If the above layout is followed, this will archive any code and config files that are under git version control. Further, all input files will be included in the archive. Finally, the software packages of each defined conda environment are included. This results in a self-contained workflow archive that can be re-executed on a vanilla machine that only has Conda and Snakemake installed via
tar -xf my-workflow.tar.gz
snakemake -n
Note that the archive is platform specific. For example, if created on Linux, it will run on any Linux newer than the minimum version that has been supported by the used Conda packages at the time of archiving (e.g. CentOS 6).
A useful pattern when publishing data analyses is to create such an archive, upload it to Zenodo and thereby obtain a DOI. Then, the DOI can be cited in manuscripts, and readers are able to download and reproduce the data analysis at any time in the future.
The Snakemake API¶
-
snakemake.
snakemake
(snakefile, listrules=False, list_target_rules=False, cores=1, nodes=1, local_cores=1, resources={}, config={}, configfile=None, config_args=None, workdir=None, targets=None, dryrun=False, touch=False, forcetargets=False, forceall=False, forcerun=[], until=[], omit_from=[], prioritytargets=[], stats=None, printreason=False, printshellcmds=False, printdag=False, printrulegraph=False, printd3dag=False, nocolor=False, quiet=False, keepgoing=False, cluster=None, cluster_config=None, cluster_sync=None, drmaa=None, jobname='snakejob.{rulename}.{jobid}.sh', immediate_submit=False, standalone=False, ignore_ambiguity=False, snakemakepath=None, lock=True, unlock=False, cleanup_metadata=None, force_incomplete=False, ignore_incomplete=False, list_version_changes=False, list_code_changes=False, list_input_changes=False, list_params_changes=False, list_resources=False, summary=False, archive=None, detailed_summary=False, latency_wait=3, benchmark_repeats=1, wait_for_files=None, print_compilation=False, debug=False, notemp=False, keep_remote_local=False, nodeps=False, keep_target_files=False, keep_shadow=False, allowed_rules=None, jobscript=None, timestamp=False, greediness=None, no_hooks=False, overwrite_shellcmd=None, updated_files=None, log_handler=None, keep_logger=False, max_jobs_per_second=None, restart_times=0, verbose=False, force_use_threads=False, use_conda=False, mode=0, wrapper_prefix=None)[source]¶ Run snakemake on a given snakefile.
This function provides access to the whole snakemake functionality. It is not thread-safe.
Parameters: - snakefile (str) – the path to the snakefile
- listrules (bool) – list rules (default False)
- list_target_rules (bool) – list target rules (default False)
- cores (int) – the number of provided cores (ignored when using cluster support) (default 1)
- nodes (int) – the number of provided cluster nodes (ignored without cluster support) (default 1)
- local_cores (int) – the number of provided local cores if in cluster mode (ignored without cluster support) (default 1)
- resources (dict) – provided resources, a dictionary assigning integers to resource names, e.g. {gpu=1, io=5} (default {})
- config (dict) – override values for workflow config
- workdir (str) – path to working directory (default None)
- targets (list) – list of targets, e.g. rule or file names (default None)
- dryrun (bool) – only dry-run the workflow (default False)
- touch (bool) – only touch all output files if present (default False)
- forcetargets (bool) – force given targets to be re-created (default False)
- forceall (bool) – force all output files to be re-created (default False)
- forcerun (list) – list of files and rules that shall be re-created/re-executed (default [])
- prioritytargets (list) – list of targets that shall be run with maximum priority (default [])
- stats (str) – path to file that shall contain stats about the workflow execution (default None)
- printreason (bool) – print the reason for the execution of each job (default False)
- printshellcmds (bool) – print the shell command of each job (default False)
- printdag (bool) – print the dag in the graphviz dot language (default False)
- printrulegraph (bool) – print the graph of rules in the graphviz dot language (default False)
- printd3dag (bool) – print a D3.js compatible JSON representation of the DAG (default False)
- nocolor (bool) – do not print colored output (default False)
- quiet (bool) – do not print any default job information (default False)
- keepgoing (bool) – keep going upon errors (default False)
- cluster (str) – submission command of a cluster or batch system to use, e.g. qsub (default None)
- cluster_config (str,list) – configuration file for cluster options, or list thereof (default None)
- cluster_sync (str) – blocking cluster submission command (like SGE ‘qsub -sync y’) (default None)
- drmaa (str) – if not None use DRMAA for cluster support, str specifies native args passed to the cluster when submitting a job
- jobname (str) – naming scheme for cluster job scripts (default “snakejob.{rulename}.{jobid}.sh”)
- immediate_submit (bool) – immediately submit all cluster jobs, regardless of dependencies (default False)
- standalone (bool) – kill all processes very rudely in case of failure (do not use this if you use this API) (default False) (deprecated)
- ignore_ambiguity (bool) – ignore ambiguous rules and always take the first possible one (default False)
- snakemakepath (str) – Deprecated parameter whose value is ignored. Do not use.
- lock (bool) – lock the working directory when executing the workflow (default True)
- unlock (bool) – just unlock the working directory (default False)
- cleanup_metadata (bool) – just cleanup metadata of output files (default False)
- force_incomplete (bool) – force the re-creation of incomplete files (default False)
- ignore_incomplete (bool) – ignore incomplete files (default False)
- list_version_changes (bool) – list output files with changed rule version (default False)
- list_code_changes (bool) – list output files with changed rule code (default False)
- list_input_changes (bool) – list output files with changed input files (default False)
- list_params_changes (bool) – list output files with changed params (default False)
- summary (bool) – list summary of all output files and their status (default False)
- archive (str) – archive workflow into the given tarball
- latency_wait (int) – how many seconds to wait for an output file to appear after the execution of a job, e.g. to handle filesystem latency (default 3)
- benchmark_repeats (int) – number of repeated runs of a job if declared for benchmarking (default 1)
- wait_for_files (list) – wait for given files to be present before executing the workflow
- list_resources (bool) – list resources used in the workflow (default False)
- summary – list summary of all output files and their status (default False). If no option is specified, a basic summary will be output. If ‘detailed’ is added as an option, e.g. --summary detailed, extra info about the input and shell commands will be included
- detailed_summary (bool) – list summary of all input and output files and their status (default False)
- print_compilation (bool) – print the compilation of the snakefile (default False)
- debug (bool) – allow to use the debugger within rules
- notemp (bool) – ignore temp file flags, e.g. do not delete output files marked as temp after use (default False)
- keep_remote_local (bool) – keep local copies of remote files (default False)
- nodeps (bool) – ignore dependencies (default False)
- keep_target_files (bool) – Do not adjust the paths of given target files relative to the working directory.
- keep_shadow (bool) – Do not delete the shadow directory on snakemake startup.
- allowed_rules (set) – Restrict allowed rules to the given set. If None or empty, all rules are used.
- jobscript (str) – path to a custom shell script template for cluster jobs (default None)
- timestamp (bool) – print time stamps in front of any output (default False)
- greediness (float) – set the greediness of scheduling. This value between 0 and 1 determines how careful jobs are selected for execution. The default value (0.5 if prioritytargets are used, 1.0 else) provides the best speed and still acceptable scheduling quality.
- overwrite_shellcmd (str) – a shell command that shall be executed instead of those given in the workflow. This is for debugging purposes only.
- updated_files (list) – a list that will be filled with the files that are updated or created during the workflow execution
- verbose (bool) – show additional debug output (default False)
- max_jobs_per_second (int) – maximal number of cluster/drmaa jobs per second, None to impose no limit (default None)
- restart_times (int) – number of times to restart failing jobs (default 0)
- force_use_threads – whether to force use of threads over processes. helpful if shared memory is full or unavailable (default False)
- use_conda (bool) – create conda environments for each job (defined with conda directive of rules)
- mode (snakemake.common.Mode) – Execution mode
- wrapper_prefix (str) – Prefix for wrapper script URLs (default None)
- log_handler (function) –
redirect snakemake output to this custom log handler, a function that takes a log message dictionary (see below) as its only argument (default None). A sketch of such a handler follows this parameter list. The log message dictionary for the log handler has the following entries:
level: the log level (“info”, “error”, “debug”, “progress”, “job_info”)
level=“info”, “error” or “debug”:
msg: the log message
level=“progress”:
done: number of already executed jobs
total: number of total jobs
level=“job_info”:
input: list of input files of a job
output: list of output files of a job
log: path to log file of a job
local: whether a job is executed locally (i.e. ignoring cluster)
msg: the job message
reason: the job reason
priority: the job priority
threads: the threads of the job
Returns: True if workflow execution was successful.
Return type: bool
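As announced above, here is a minimal sketch of a custom log handler passed to the API. The Snakefile path is hypothetical and a dry run is used so nothing is executed:
from snakemake import snakemake

def my_log_handler(msg):
    # msg is the log message dictionary described above; "level" is always present
    if msg["level"] == "progress":
        print("{done} of {total} jobs done".format(**msg))
    elif msg["level"] in ("info", "error", "debug"):
        print(msg["msg"])

success = snakemake("Snakefile", dryrun=True, log_handler=my_log_handler)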
Additional utils¶
-
class
snakemake.utils.
AlwaysQuotedFormatter
(quote_func=<function quote>, *args, **kwargs)[source]¶ Subclass of QuotedFormatter that always quotes.
Usage is identical to QuotedFormatter, except that it always acts like “q” was appended to the format spec.
-
class
snakemake.utils.
QuotedFormatter
(quote_func=<function quote>, *args, **kwargs)[source]¶ Subclass of string.Formatter that supports quoting.
Using this formatter, any field can be quoted after formatting by appending “q” to its format string. By default, shell quoting is performed using “shlex.quote”, but you can pass a different quote_func to the constructor. The quote_func simply has to take a string argument and return a new string representing the quoted form of the input string.
Note that if an element after formatting is the empty string, it will not be quoted.
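In a Snakefile, this kind of quoting is available by appending q to a field in a shell command (cf. the 3.8.0 changelog entry on automatic quoting below). A minimal sketch with made-up file names:
# the :q spec shell-quotes each value, so whitespace in paths is safe
rule concat:
    input:
        "raw/file with spaces.txt"
    output:
        "results/combined.txt"
    shell:
        "cat {input:q} > {output:q}"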
-
snakemake.utils.
R
(code)[source]¶ Execute R code
This function executes the R code given as a string. The function requires rpy2 to be installed.
Parameters: code (str) – R code to be executed
-
class
snakemake.utils.
SequenceFormatter
(separator=' ', element_formatter=<string.Formatter object>, *args, **kwargs)[source]¶ string.Formatter subclass with special behavior for sequences.
This class delegates formatting of individual elements to another formatter object. Non-list objects are formatted by calling the delegate formatter’s “format_field” method. List-like objects (list, tuple, set, frozenset) are formatted by formatting each element of the list according to the specified format spec using the delegate formatter and then joining the resulting strings with a separator (space by default).
-
snakemake.utils.
available_cpu_count
()[source]¶ Return the number of available virtual or physical CPUs on this system. The number of available CPUs can be smaller than the total number of CPUs when the cpuset(7) mechanism is in use, as is the case on some cluster systems.
Adapted from http://stackoverflow.com/a/1006301/715090
-
snakemake.utils.
format
(_pattern, *args, stepout=1, _quote_all=False, **kwargs)[source]¶ Format a pattern in Snakemake style.
This means that keywords embedded in braces are replaced by any variable values that are available in the current namespace.
-
snakemake.utils.
linecount
(filename)[source]¶ Return the number of lines of given file.
Parameters: filename (str) – the path to the file
-
snakemake.utils.
listfiles
(pattern, restriction=None, omit_value=None)[source]¶ Yield a tuple of existing filepaths for the given pattern.
Wildcard values are yielded as the second tuple item.
Parameters: - pattern (str) – a filepattern. Wildcards are specified in snakemake syntax, e.g. “{id}.txt”
- restriction (dict) – restrict to wildcard values given in this dictionary
- omit_value (str) – wildcard value to omit
Yields: tuple – The next file matching the pattern, and the corresponding wildcards object
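A small sketch of iterating over existing files and their wildcard values; the pattern is a placeholder and it is assumed that the yielded wildcards object exposes values by name:
from snakemake.utils import listfiles

# print each existing fastq file together with its inferred sample name
for path, wildcards in listfiles("thedir/{sample}.fastq"):
    print(path, wildcards.sample)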
-
snakemake.utils.
makedirs
(dirnames)[source]¶ Recursively create the given directory or directories without reporting errors if they are present.
-
snakemake.utils.
min_version
(version)[source]¶ Require minimum snakemake version, raise workflow error if not met.
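A typical use is to guard a workflow against being run with an older Snakemake; the version string below is just an example:
from snakemake.utils import min_version

# abort with a workflow error if the installed Snakemake is older than 3.9.0,
# e.g. because the workflow relies on the conda directive
min_version("3.9.0")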
-
snakemake.utils.
read_job_properties
(jobscript, prefix='# properties', pattern=re.compile('# properties = (.*)'))[source]¶ Read the job properties defined in a snakemake jobscript.
This function is a helper for writing custom wrappers for the snakemake --cluster functionality. Applying this function to a jobscript will return a dict containing information about the job.
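A minimal sketch of such a cluster submission wrapper; the qsub options, the threads lookup, and the file name are assumptions to adapt to your own scheduler:
#!/usr/bin/env python3
import sys
from subprocess import call

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
props = read_job_properties(jobscript)

# request as many slots as the job declares threads (default 1)
threads = props.get("threads", 1)
call(["qsub", "-cwd", "-pe", "smp", str(threads), jobscript])
It could then be used via snakemake --cluster ./qsub-wrapper.py (a made-up file name).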
-
snakemake.utils.
report
(text, path, stylesheet='/home/docs/checkouts/readthedocs.org/user_builds/snakemake/checkouts/v3.11.1/snakemake/report.css', defaultenc='utf8', template=None, metadata=None, **files)[source]¶ Create an HTML report using python docutils.
Attention: This function needs Python docutils to be installed for the python installation you use with Snakemake.
All keywords not listed below are interpreted as paths to files that shall be embedded into the document. The keywords will be available as link targets in the text. E.g. append a file as keyword arg via F1=input[0] and put a download link in the text like this:
report('''
==============
Report for ...
==============
Some text. A link to an embedded file: F1_.
Further text.
''', outputpath, F1=input[0])
Instead of specifying each file as a keyword arg, you can also expand the input of your rule if it is completely named, e.g.:
report('''
Some text...
''', outputpath, **input)
Parameters: - text (str) – The “restructured text” as it is expected by python docutils.
- path (str) – The path to the desired output file
- stylesheet (str) – An optional path to a css file that defines the style of the document. This defaults to <your snakemake install>/report.css. Use the default to get a hint how to create your own.
- defaultenc (str) – The encoding that is reported to the browser for embedded text files, defaults to utf8.
- template (str) – An optional path to a docutils HTML template.
- metadata (str) – E.g. an optional author name or email address.
-
snakemake.utils.
update_config
(config, overwrite_config)[source]¶ Recursively update dictionary config with overwrite_config.
See http://stackoverflow.com/questions/3232943/update-value-of-a-nested-dictionary-of-varying-depth for details.
Parameters: - config (dict) – dictionary to update
- overwrite_config (dict) – dictionary whose items will overwrite those in config
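A short sketch of the recursive merge; the keys and values are made up:
from snakemake.utils import update_config

defaults = {"threads": 4, "params": {"quality": 20, "adapter": "AGATCG"}}
overrides = {"params": {"quality": 30}}

# only params["quality"] is replaced; sibling keys are kept
update_config(defaults, overrides)
assert defaults == {"threads": 4, "params": {"quality": 30, "adapter": "AGATCG"}}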
Citing and Citations¶
This section gives instructions on how to cite Snakemake and lists citing articles.
Citing Snakemake¶
When using Snakemake for a publication, please cite the following article in your paper:
Cite This¶
More References¶
Another publication describing more of Snakemake internals:
And my PhD thesis which describes all algorithmic details:
Project Pages¶
If you publish a Snakemake workflow, consider adding this badge to your project page:
The markdown syntax is
[](http://snakemake.bitbucket.org)
Replace the 3.5.2
with the minimum required Snakemake version.
You can also change the style.
More Resources¶
Talks and Posters¶
- Poster at ECCB 2016, The Hague, Netherlands.
- Invited talk by Johannes Köster at the Broad Institute, Boston 2015.
- Introduction to Snakemake. Tutorial Slides presented by Johannes Köster at the GCB 2015, Dortmund, Germany.
- Invited talk by Johannes Köster at the DTL Focus Meeting: “NGS Production Pipelines”, Dutch Techcentre for Life Sciences, Utrecht 2014.
- Taming Snakemake by Jeremy Leipzig, Bioinformatics software developer at Children’s Hospital of Philadelphia, 2014.
- “Snakemake makes ... snakes?” - An Introduction by Marcel Martin from SciLifeLab, Stockholm 2015
- “Workflow Management with Snakemake” by Johannes Köster, 2015. Held at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
External Resources¶
These resources are not part of the official documentation.
- Snakemake workflow used for the Kallisto paper
- An alternative tutorial for Snakemake
- An Emacs mode for Snakemake
- Flexible bioinformatics pipelines with Snakemake
- Sandwiches with Snakemake
- A visualization of the past years of Snakemake development
- Japanese version of the Snakemake tutorial
- Basic and advanced French Snakemake tutorial.
- Mini tutorial on Snakemake and Bioconda
Frequently Asked Questions¶
Contents
- Frequently Asked Questions
- What is the key idea of Snakemake workflows?
- What is the recommended way to distribute a Snakemake workflow?
- My shell command fails with errors about an “unbound variable”, what’s wrong?
- How do I run my rule on all files of a certain directory?
- Snakemake complains about a cyclic dependency or a PeriodicWildcardError. What can I do?
- Is it possible to pass variable values to the workflow via the command line?
- I get a NameError with my shell command. Are braces unsupported?
- How do I incorporate files that do not follow a consistent naming scheme?
- How do I force Snakemake to rerun all jobs from the rule I just edited?
- How do I enable syntax highlighting in Vim for Snakefiles?
- I want to import some helper functions from another python file. Is that possible?
- How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?
- Can the output of a rule be a symlink?
- Can the input of a rule be a symlink?
- I would like to receive a mail upon snakemake exit. How can this be achieved?
- I want to pass variables between rules. Is that possible?
- Why do my global variables behave strangely when I run my job on a cluster?
- I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?
- Some command line arguments like --config cannot be followed by rule or file targets. Is that intended behavior?
- How do I make my rule fail if an output file is empty?
- How does Snakemake lock the working directory?
- Snakemake does not trigger re-runs if I add additional input files. What can I do?
- How do I trigger re-runs for rules with updated code or parameters?
- How do I remove all files created by snakemake, i.e. like
make clean
What is the key idea of Snakemake workflows?¶
The key idea is very similar to GNU Make. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake.
When you start using Snakemake, please make sure to walk through the official tutorial. It is crucial to understand how to properly use the system.
What is the recommended way to distribute a Snakemake workflow?¶
It is recommended that a Snakemake workflow is structured in the following way:
├── config.yaml
├── environment.yaml
├── scripts
│ ├── script1.py
│ └── script2.R
└── Snakefile
This structure can be put into a git repository, allowing to setup the workflow with the following steps:
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config as needed
vim config.yaml
# install dependencies into an isolated conda environment
conda env create -n myworkflow --file environment.yaml
# activate environment
source activate myworkflow
# execute workflow
snakemake -n
In certain cases, it might be necessary to extend or modify a given workflow (the Snakefile). Here, git provides the ideal mechanisms to track such changes. Any modifications can happen in either a separate branch or a fork. When the changes are general enough, they can be reintegrated later into the master branch using pull requests.
My shell command fails with errors about an “unbound variable”, what’s wrong?¶
This often happens when calling virtual environments from within Snakemake. Snakemake uses bash strict mode to ensure e.g. proper error behavior of shell scripts. Unfortunately, virtualenv and some other tools violate bash strict mode. The quick fix for virtualenv is to temporarily deactivate the check for unbound variables:
set +u; source /path/to/venv/bin/activate; set -u
For more details on bash strict mode, see here.
How do I run my rule on all files of a certain directory?¶
In Snakemake, similar to GNU Make, the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, ...
, and you want to produce files 1.bam, 2.bam, 3.bam, ...
you should specify these as target files, using the ids 1,2,3,...
. You could end up with at least two rules like this (or any number of intermediate steps):
IDS = "1 2 3 ...".split() # the list of desired ids
# a pseudo-rule that collects the target files
rule all:
input: expand("otherdir/{id}.bam", id=IDS)
# a general rule using wildcards that does the work
rule:
input: "thedir/{id}.fastq"
output: "otherdir/{id}.bam"
shell: "..."
Snakemake will then go down the line and determine which files it needs from your initial directory.
In order to infer the IDs from present files, Snakemake provides the glob_wildcards
function, e.g.
IDS, = glob_wildcards("thedir/{id}.fastq")
The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, that is the list of values for the wildcard {id}
.
Snakemake complains about a cyclic dependency or a PeriodicWildcardError. What can I do?¶
One limitation of Snakemake is that graphs of jobs have to be acyclic (similar to GNU Make). This means, that no path in the graph may be a cycle. Although you might have considered this when designing your workflow, Snakemake sometimes runs into situations where a cyclic dependency cannot be avoided without further information, although the solution seems obvious for the developer. Consider the following example:
rule all:
input:
"a"
rule unzip:
input:
"{sample}.tar.gz"
output:
"{sample}"
shell:
"tar -xf {input}"
If this workflow is executed with
snakemake -n
two things may happen.
If the file a.tar.gz is present in the filesystem, Snakemake will propose the following (expected and correct) plan:
rule unzip:
    input: a.tar.gz
    output: a
    wildcards: sample=a
localrule all:
    input: a
Job counts:
    count jobs
    1 unzip
    1 all
    2
If the file a.tar.gz is not present and cannot be created by any other rule than rule unzip, Snakemake will try to run rule unzip again, with {sample}=a.tar.gz. This would go on infinitely and recursively. Snakemake detects this case and produces a PeriodicWildcardError.
In summary, PeriodicWildcardErrors
hint at a problem where a rule or a set of rules can be applied to create its own input. If you are lucky, Snakemake can be smart and avoid the error by stopping the recursion if a file exists in the filesystem. Importantly, however, bugs upstream of that rule can manifest as a PeriodicWildcardError
, although in reality just a file is missing or named differently.
In such cases, it is best to restrict the wildcard of the output file(s), or follow the general rule of putting output files of different rules into unique subfolders of your working directory. This way, you can discover the true source of your error.
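For the example above, one possible fix (only a sketch) is to constrain the wildcard so that it can never match the rule's own input:
rule unzip:
    input:
        "{sample}.tar.gz"
    # the regular expression forbids dots in the wildcard value, so the rule
    # can no longer be applied to names like "a.tar.gz"
    output:
        "{sample,[^.]+}"
    shell:
        "tar -xf {input}"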
Is it possible to pass variable values to the workflow via the command line?¶
Yes, this is possible. Have a look at Configuration. Previously, it was necessary to use environment variables for this, e.g. write
$ SAMPLES="1 2 3 4 5" snakemake
and have in the Snakefile some Python code that reads this environment variable, i.e.
SAMPLES = os.environ.get("SAMPLES", "10 20").split()
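With the configuration mechanism, the same can now be achieved without environment variables by passing values via --config. A sketch; the samples key and file paths are made up:
# invoked e.g. as: snakemake --config samples="1 2 3 4 5"
SAMPLES = config.get("samples", "10 20").split()

rule all:
    input:
        expand("results/{sample}.txt", sample=SAMPLES)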
I get a NameError with my shell command. Are braces unsupported?¶
You can use the entire Python format minilanguage in shell commands. Braces in shell commands that are not intended to insert variable values thus have to be escaped by doubling them:
...
shell: "awk '{{print $1}}' {input}"
Here the double braces are escapes, i.e. there will remain single braces in the final command. In contrast, {input}
is replaced with an input filename.
How do I incorporate files that do not follow a consistent naming scheme?¶
The best solution is to have a dictionary that translates a sample id to the inconsistently named files and use a function (see Functions as Input Files) to provide an input file like this:
FILENAME = dict(...) # map sample ids to the irregular filenames here
rule:
# use a function as input to delegate to the correct filename
input: lambda wildcards: FILENAME[wildcards.sample]
output: "somefolder/{sample}.csv"
shell: ...
How do I force Snakemake to rerun all jobs from the rule I just edited?¶
This can be done by invoking Snakemake with the --forcerun
or -R
flag, followed by the rules that should be re-executed:
$ snakemake -R somerule
This will cause Snakemake to re-run all jobs of that rule and everything downstream (i.e. directly or indirectly depending on the rule's output).
How do I enable syntax highlighting in Vim for Snakefiles?¶
A vim syntax highlighting definition for Snakemake is available here.
You can copy that file to $HOME/.vim/syntax
directory and add
au BufNewFile,BufRead Snakefile set syntax=snakemake
au BufNewFile,BufRead *.rules set syntax=snakemake
au BufNewFile,BufRead *.snakefile set syntax=snakemake
au BufNewFile,BufRead *.snake set syntax=snakemake
to your $HOME/.vimrc
file. Highlighting can be forced in a vim session with :set syntax=snakemake
.
I want to import some helper functions from another python file. Is that possible?¶
Yes, from version 2.4.8 on, Snakemake allows you to import python modules (and also simple python files) from the same directory where the Snakefile resides.
How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?¶
This can be achieved by submitting the main Snakemake invocation as a job to the cluster. If it is not allowed to submit a job from a non-head cluster node, you can provide a submit command that goes back to the head node before submitting:
qsub -N PIPE -cwd -j yes python snakemake --cluster "ssh user@headnode_address 'qsub -N pipe_task -j yes -cwd -S /bin/sh ' " -j
This hint was provided by Inti Pedroso.
Can the output of a rule be a symlink?¶
Yes. As of Snakemake 3.8, output files are removed before running a rule and then touched after the rule completes to ensure they are newer than the input. Symlinks are treated just the same as normal files in this regard, and Snakemake ensures that it only modifies the link and not the target when doing this.
Here is an example where you want to merge N files together, but if N == 1 a symlink will do. This is easier than attempting to implement workflow logic that skips the step entirely. Note the -r flag, supported by modern versions of ln, is useful to achieve correct linking between files in subdirectories.
rule merge_files:
output: "{foo}/all_merged.txt"
input: my_input_func # some function that yields 1 or more files to merge
run:
if len(input) > 1:
shell("cat {input} | sort > {output}")
else:
shell("ln -sr {input} {output}")
Do be careful with symlinks in combination with Step 6: Temporary and protected files. When the original file is deleted, this can cause various errors once the symlink does not point to a valid file any more.
If you get a message like “Unable to set utime on symlink .... Your Python build does not support it.”,
this means that Snakemake is unable to properly adjust the modification time of the symlink.
In this case, a workaround is to add the shell command touch -h {output} to the end of the rule.
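Put together, a minimal sketch of a rule whose output is a symlink could look like this; the paths are placeholders:
rule link_result:
    input:
        "data/original.txt"
    output:
        "results/link.txt"
    # ln -r creates a relative link; touch -h updates the link itself rather
    # than its target, so the output ends up newer than the input
    shell:
        "ln -sr {input} {output} && touch -h {output}"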
Can the input of a rule be a symlink?¶
Yes. In this case, since Snakemake 3.8, one extra consideration is applied. If either the link itself or the target of the link is newer than the output files for the rule then it will trigger the rule to be re-run.
I would like to receive a mail upon snakemake exit. How can this be achieved?¶
On unix, you can make use of the commonly pre-installed mail command:
snakemake 2> snakemake.log
mail -s "snakemake finished" youremail@provider.com < snakemake.log
In case your administrator does not provide you with a proper configuration of the sendmail framework, you can configure mail to work e.g. via Gmail (see here).
I want to pass variables between rules. Is that possible?¶
Because of the cluster support and the ability to resume a workflow where you stopped last time, Snakemake in general should be used in a way that information is stored in the output files of your jobs. Sometimes, though, it might be handy to have a kind of persistent storage for simple values between jobs and rules. Using plain python objects like a global dict for this will not work as each job is run in a separate process by snakemake. What helps here is the PersistentDict from the pytools package. Here is an example of a Snakemake workflow using this facility:
from pytools.persistent_dict import PersistentDict
storage = PersistentDict("mystorage")
rule a:
input: "test.in"
output: "test.out"
run:
myvar = storage.fetch("myvar")
# do stuff
rule b:
output: temp("test.in")
run:
storage.store("myvar", 3.14)
Here, the output of rule b has to be marked as temp in order to ensure that myvar
is stored in each run of the workflow as rule a relies on it. In other words, the PersistentDict is persistent between the job processes, but not between different runs of this workflow. If you need to conserve information between different runs, use output files for them.
Why do my global variables behave strangely when I run my job on a cluster?¶
This is closely related to the question above. Any Python code you put outside of a rule definition is normally run once before Snakemake starts to process rules, but on a cluster it is re-run again for each submitted job, because Snakemake implements jobs by re-running itself.
Consider the following...
from mydatabase import get_connection
dbh = get_connection()
latest_parameters = dbh.get_params().latest()
rule a:
input: "{foo}.in"
output: "{foo}.out"
shell: "do_op -params {latest_parameters} {input} {output}"
When run on a single machine, you will see a single connection to your database and get a single value for latest_parameters for the duration of the run. On a cluster you will see a connection attempt from the cluster node for each job submitted, regardless of whether it happens to involve rule a or not, and the parameters will be recalculated for each job.
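One way to limit such lookups (a sketch, reusing the hypothetical mydatabase module from above) is to move the query into a params function, so the expensive part only runs for jobs of rule a instead of on every Snakefile parse:
from mydatabase import get_connection

def latest_parameters(wildcards):
    # evaluated per job of rule a, not once per Snakefile parse on every node
    dbh = get_connection()
    return dbh.get_params().latest()

rule a:
    input: "{foo}.in"
    output: "{foo}.out"
    params:
        p=latest_parameters
    shell: "do_op -params {params.p} {input} {output}"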
I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?¶
You can set a prefix that will be prepended to all shell commands by adding e.g.
shell.prefix("set -o pipefail; ")
to the top of your Snakefile. Make sure that the prefix ends with a semicolon, such that it will not interfere with the subsequent commands. To simulate a bash login shell, you can do the following:
shell.executable("/bin/bash")
shell.prefix("source ~/.bashrc; ")
Some command line arguments like --config cannot be followed by rule or file targets. Is that intended behavior?¶
This is a limitation of the argparse module, which cannot distinguish between the perhaps next arg of --config
and a target.
As a solution, you can put the --config at the end of your invocation, or prepend the target with a single --
, i.e.
$ snakemake --config foo=bar -- mytarget
$ snakemake mytarget --config foo=bar
How do I make my rule fail if an output file is empty?¶
Snakemake expects shell commands to behave properly, meaning that failures should cause an exit status other than zero. If a command does not exit with a status other than zero, Snakemake assumes everything worked fine, even if output files are empty. This is because empty output files are also a reasonable tool to indicate progress where no real output was produced. However, sometimes you will have to deal with tools that do not properly report their failure with an exit status. Here, the recommended way is to use bash to check for non-empty output files, e.g.:
rule:
input: ...
output: "my/output/file.txt"
shell: "somecommand {input} {output} && [[ -s {output} ]]"
How does Snakemake lock the working directory?¶
Per default, Snakemake will lock a working directory by output and input files. Two Snakemake instances that want to create the same output files cannot run at the same time. Two instances creating disjoint sets of output files are possible.
With the command line option --nolock
, you can disable this mechanism at your own risk. With --unlock
, you can remove a stale lock. Stale locks can appear if your machine is powered off with a running Snakemake instance.
Snakemake does not trigger re-runs if I add additional input files. What can I do?¶
Snakemake has a kind of “lazy” policy about added input files if their modification date is older than that of the output files. One reason is that the information about what to do cannot be inferred just from the input and output files. You need additional information about the last run to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:
$ snakemake -n -R `snakemake --list-input-changes`
Here, snakemake --list-input-changes
returns the list of output files with changed input files, which is fed into -R
to trigger a re-run.
How do I trigger re-runs for rules with updated code or parameters?¶
Similar to the solution above, you can use
$ snakemake -n -R `snakemake --list-params-changes`
and
$ snakemake -n -R `snakemake --list-code-changes`
Again, the list commands in backticks return the list of output files with changes, which are fed into -R
to trigger a re-run.
How do I remove all files created by snakemake, i.e. like make clean
¶
To remove all files created by snakemake as output files to start from scratch, you can use
rm $(snakemake --summary | tail -n+2 | cut -f1)
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://bitbucket.org/snakemake/snakemake/issues
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the Bitbucket issues for bugs. If you want to start working on a bug then please write a short message on the issue tracker to prevent duplicate work.
Implement Features¶
Look through the Bitbucket issues for features. If you want to start working on an issue then please write a short message on the issue tracker to prevent duplicate work.
Write Documentation¶
Snakemake could always use more documentation, whether as part of the official Snakemake docs, in docstrings, or even on the web in blog posts, articles, and such.
Snakemake uses Sphinx for the user manual (that you are currently reading). See project_info-doc_guidelines on how the documentation reStructuredText is used.
Submit Feedback¶
The best way to send feedback is to file an issue at https://bitbucket.org/snakemake/snakemake/issues
If you are proposing a feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions are welcome :)
Pull Request Guidelines¶
To update the documentation, fix bugs or add new features you need to create a Pull Request. A PR is a change you make to your local copy of the code for us to review and potentially integrate into the code base.
To create a Pull Request you need to do these steps:
- Create a Bitbucket account.
- Fork the repository (see the left sidebar on the main Bitbucket Snakemake page).
- Clone your fork (go to your copy of the repository at https://bitbucket.org/<your_username>/snakemake and click clone. This gives you the command you need to paste into your shell).
- Go to the snakemake folder with cd snakemake.
- Create a new branch with git checkout -b <descriptive_branch_name>.
- Make your changes to the code or documentation.
- Run git add . to add all the changed files to the commit (to see what files will be added you can run git add . --dry-run).
- To commit the added files use git commit. (This will open a command line editor to write a commit message. These should have a descriptive header of at most 80 characters, followed by an empty line, and then a description of what you did and why. To use your command line text editor of choice use (for example) export GIT_EDITOR=vim before running git commit).
- Now you can push your changes to your Bitbucket copy of Snakemake by running git push origin <descriptive_branch_name>.
- If you now go to the webpage for your Bitbucket copy of Snakemake you should see a link in the sidebar called “Create Pull Request”.
- Now you need to choose your PR from the menu and click the “Create pull request” button. Be sure to change the pull request target branch to <descriptive_branch_name>!
If you want to create more pull requests, first run git checkout master
and then start at step 5. with a new branch name.
Feel free to ask questions about this if you want to contribute to Snakemake :)
Testing Guidelines¶
To ensure that you do not introduce bugs into Snakemake, you should test your code thoroughly.
To have integration tests run automatically when committing code changes to bitbucket, you need to sign up on wercker.com and register a user.
The easiest way to run your development version of Snakemake is perhaps to go to the folder containing your local copy of Snakemake and call
conda env create -f environment.yml -n snakemake-testing
source activate snakemake-testing
pip install -e .
This will make your development version of Snakemake the one called when running snakemake. You do not need to re-run these commands after each code change.
From the base snakemake folder you call python setup.py nosetests
to run all the tests. (If it complains that you do not have nose installed, which is the testing framework we use, you can simply install it by running pip install nose
.)
If you introduce a new feature you should add a new test to the tests directory. See the folder for examples.
Documentation Guidelines¶
For the documentation, please adhere to the following guidelines:
- Put each sentence on its own line; this makes tracking changes through Git SCM easier.
- Provide hyperlink targets, at least for the first two section levels.
For this, use the format
<document_part>-<section_name>
, e.g.,project_info-doc_guidelines
. - Use the section structure from below.
.. _document_part-heading_1:
=========
Heading 1
=========
.. _document_part-heading_2:
---------
Heading 2
---------
.. _document_part-heading_3:
Heading 3
=========
.. _document_part-heading_4:
Heading 4
---------
.. _document_part-heading_5:
Heading 5
~~~~~~~~~
.. _document_part-heading_6:
Heading 6
:::::::::
Documentation Setup¶
For building the documentation, you have to install Sphinx. If you have already installed Conda, all you need to do is to create a Snakemake development environment via
$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake
$ conda env create -f environment.yml -n snakemake
Then, the docs can be built with
$ source activate snakemake
$ cd docs
$ make html
$ make clean && make html # force rebuild
Alternatively, you can use virtualenv. The following assumes you have a working Python 3 setup.
$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake/docs
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install --upgrade -r requirements.txt
Afterwards, the docs can be built with
$ source .venv/bin/activate
$ make html # rebuild for changed files only
$ make clean && make html # force rebuild
Credits¶
Development Lead¶
- Johannes Köster
Development Team¶
- Christopher Tomkins-Tinch
- David Koppstein
- Tim Booth
- Manuel Holtgrewe
- Christian Arnold
Contributors¶
In alphabetical order
- Andreas Wilm
- Anthony Underwood
- Ryan Dale
- David Alexander
- Elias Kuthe
- Elmar Pruesse
- Hyeshik Chang
- Jay Hesselberth
- Jesper Foldager
- John Huddleston
- Joona Lehtomäki
- Karel Brinda
- Karl Gutwin
- Kemal Eren
- Kostis Anagnostopoulos
- Kyle A. Beauchamp
- Kyle Meyer
- Lance Parsons
- Manuel Holtgrewe
- Marcel Martin
- Matthew Shirley
- Mattias Franberg
- Matt Shirley
- Paul Moore
- percyfal
- Per Unneberg
- Ryan C. Thompson
- Ryan Dale
- Sean Davis
- Simon Ye
- Tobias Marschall
- Willem Ligtenberg
Change Log¶
# Change Log
## [3.11.1] - 2017-03-14
### Changed
- --touch ignores missing files
- Fixed handling of local URIs with the wrapper directive.
## [3.11.0] - 2017-03-08
### Added
- Param functions can now also refer to threads.
### Changed
- Improved tutorial and docs.
- Made conda integration more robust.
- None is converted to NULL in R scripts.
## [3.10.2] - 2017-02-28
### Changed
- Improved config file handling and merging.
- Output files can be referred in params functions (i.e. lambda wildcards, output: ...)
- Improved conda-environment creation.
- Jobs are cached, leading to reduced memory footprint.
- Fixed subworkflow handling in input functions.
## [3.10.0] - 2017-01-18
### Added
- Workflows can now be archived to a tarball with `snakemake --archive my-workflow.tar.gz`. The archive contains all input files, source code versioned with git and all software packages that are defined via conda environments. Hence, the archive allows to fully reproduce a workflow on a different machine. Such an archive can be uploaded to Zenodo, such that your workflow is secured in a self-contained, executable way for the future.
### Changed
- Improved logging.
- Reduced memory footprint.
- Added a flag to automatically unpack the output of input functions.
- Improved handling of HTTP redirects with remote files.
- Improved exception handling with DRMAA.
- Scripts referred by the script directive can now use locally defined external python modules.
## [3.9.1] - 2016-12-23
### Added
- Jobs can be restarted upon failure (--restart-times).
### Changed
- The docs have been restructured and improved. Now available under snakemake.readthedocs.org.
- Changes in scripts show up with --list-code-changes.
- Duplicate output files now cause an error.
- Various bug fixes.
## [3.9.0] - 2016-11-15
### Added
- Ability to define isolated conda software environments (YAML) per rule. Environment will be deployed by Snakemake upon workflow execution.
- Command line argument --wrapper-prefix in order to overwrite the default URL for looking up wrapper scripts.
### Changed
- --summary now displays the log files corresponding to each output file.
- Fixed hangups when using run directive and a large number of jobs
- Fixed pickling errors with anonymous rules and run directive.
- Various small bug fixes
## [3.8.2] - 2016-09-23
### Changed
- Add missing import in rules.py.
- Use threading only in cluster jobs.
## [3.8.1] - 2016-09-14
### Changed
- Snakemake now warns when using relative paths starting with "./".
- The option -R now also accepts an empty list of arguments.
- Bug fix when handling benchmark directive.
- Jobscripts exit with code 1 in case of failure. This should improve the error messages of cluster system.
- Fixed a bug in SFTP remote provider.
## [3.8.0] - 2016-08-26
### Added
- Wildcards can now be constrained by rule and globally via the new `wildcard_constraints` directive (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wildcards)).
- Subworkflows now allow to overwrite their config file via the configfile directive in the calling Snakefile.
- A method `log_fmt_shell` in the snakemake proxy object that is available in scripts and wrappers allows to obtain a formatted string to redirect logging output from STDOUT or STDERR.
- Functions given to resources can now optionally contain an additional argument `input` that refers to the input files.
- Functions given to params can now optionally contain additional arguments `input` (see above) and `resources`. The latter refers to the resources.
- It is now possible to let items in shell commands be automatically quoted (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-rules)). This is useful when dealing with filenames that contain whitespace.
### Changed
- Snakemake now deletes output files before job execution. Further, it touches output files after job execution. This solves various problems with slow NFS filesystems.
- A bug was fixed that caused dynamic output rules to be executed multiple times when forcing their execution with -R.
- A bug causing double uploads with remote files was fixed. Various additional bug fixes related to remote files.
- Various minor bug fixes.
## [3.7.1] - 2016-05-16
### Changed
- Fixed a missing import of the multiprocessing module.
## [3.7.0] - 2016-05-05
### Added
- The entries in `resources` and the `threads` job attribute can now be callables that must return `int` values.
- Multiple `--cluster-config` arguments can be given to the Snakemake command line. Later ones override earlier ones.
- In the API, multiple `cluster_config` paths can be given as a list, alternatively to the previous behaviour of expecting one string for this parameter.
- When submitting cluster jobs (either through `--cluster` or `--drmaa`), you can now use `--max-jobs-per-second` to limit the number of jobs being submitted (also available through Snakemake API). Some cluster installations have problems with too many jobs per second.
- Wildcard values are now printed upon job execution in addition to input and output files.
### Changed
- Fixed a bug with HTTP remote providers.
## [3.6.1] - 2016-04-08
### Changed
- Work around missing RecursionError in Python < 3.5
- Improved conversion of numpy and pandas data structures to R scripts.
- Fixed locking of working directory.
## [3.6.0] - 2016-03-10
### Added
- onstart handler, which allows adding code that is executed only before the actual workflow execution (not on dry runs); see the sketch after this list.
- Parameters defined in the cluster config file are now accessible in the job properties under the key "cluster".
- The wrapper directive can be considered stable.
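A minimal sketch of the handler:

```python
onstart:
    print("Workflow is starting")  # runs once before execution, but not on dry runs
```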
### Changed
- Allow rule/job parameters to be used with braces notation in the cluster config.
- Show a proper error message in case of recursion errors.
- Remove non-empty temp dirs.
- Snakemake no longer sets its own process group, so that kill signals from parent processes are propagated.
- Fixed various corner case bugs.
- The params directive no longer converts a list ``l`` implicitly to ``" ".join(l)``.
## [3.5.5] - 2016-01-23
### Added
- New experimental wrapper directive, which allows referring to re-usable [wrapper scripts](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wrappers). Wrappers are provided in the [Snakemake Wrapper Repository](https://bitbucket.org/snakemake/snakemake-wrappers); see the sketch after this list.
- David Koppstein implemented two new command line options to constrain the execution of the DAG of jobs to sub-DAGs (--until and --omit-from).
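A sketch of a rule delegating its work to a wrapper (the wrapper path and version are illustrative and must correspond to an entry in the wrapper repository):

```python
rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    wrapper:
        "0.2.0/bio/samtools/sort"  # illustrative <version>/<path> within the wrapper repository
```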
### Changed
- Fixed various bugs, e.g. with shadow jobs and --latency-wait.
## [3.5.4] - 2015-12-04
### Changed
- The params directive now fully supports non-string parameters. Several bugs in the remote support were fixed.
## [3.5.3] - 2015-11-24
### Changed
- The missing remote module was added to the package.
## [3.5.2] - 2015-11-24
### Added
- Support for easy integration of external R and Python scripts via the new [script directive](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-external-scripts); see the sketch after this list.
- Chris Tomkins-Tinch has implemented support for remote files: Snakemake can now handle input and output files from Amazon S3, Google Storage, FTP, SFTP, HTTP and Dropbox.
- Simon Ye has implemented support for sandboxing jobs with [shadow rules](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-shadow-rules).
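A minimal sketch of the script directive (the script path is illustrative; inside the script, a snakemake object provides access to input, output, params and wildcards):

```python
rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    script:
        "scripts/plot.py"  # illustrative; plot.py reads snakemake.input[0] and writes snakemake.output[0]
```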
### Changed
- Manuel Holtgrewe has fixed dynamic output files in combination with multiple wildcards.
- It is now possible to add suffixes to all shell commands with shell.suffix("mysuffix"); see the sketch after this list.
- Job execution has been refactored to spawn processes only when necessary, resolving several problems in combination with huge workflows consisting of thousands of jobs and reducing the memory footprint.
- In order to reflect the new collaborative development model, Snakemake has moved from my personal bitbucket account to http://snakemake.bitbucket.org.
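A sketch of such a suffix in a Snakefile (the suffix itself is illustrative; shell.prefix can be used analogously for prefixes):

```python
shell.suffix("; echo done >> shell.log")  # illustrative: appended to every shell command of the workflow
```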
## [3.4.2] - 2015-09-12
### Changed
- Willem Ligtenberg has reduced the memory usage of Snakemake.
- Per Unneberg has improved config file handling to provide a more intuitive overwrite behavior.
- Simon Ye has improved the test suite of Snakemake and helped with setting up continuous integration via Codeship.
- The cluster implementation has been rewritten to use only a single thread to wait for jobs. This avoids failures with large numbers of jobs.
- Benchmarks are now writing tab-delimited text files instead of JSON.
- Snakemake now always requires the number of jobs to be set with -j when in cluster mode. Set this to a high value if your cluster does not have restrictions.
- The Snakemake Conda package has been moved to the bioconda channel.
- The handling of symlinks was improved, which made a switch to Python 3.3 as the minimum required Python version necessary.
## [3.4.1] - 2015-08-05
### Changed
- This release fixes a bug that caused named input or output files to always be returned as lists instead of single files.
## [3.4] - 2015-07-18
### Added
- This release adds support for executing jobs on clusters in synchronous mode (e.g. qsub -sync). Thanks to David Alexander for implementing this.
- There is now vim syntax highlighting support (thanks to Jay Hesselberth).
- Snakemake is now available as a Conda package.
### Changed
- Lots of bugs have been fixed. Thanks go to e.g. David Koppstein, Marcel Martin, John Huddleston and Tao Wen for helping with useful reports and debugging.
See [here](https://bitbucket.org/snakemake/snakemake/wiki/News-Archive) for older changes.
License¶
Snakemake is licensed under the MIT License:
Copyright (c) 2016 Johannes Köster <johannes.koester@tu-dortmund.de>
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.