Snakemake¶
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
Quick Example¶
Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.
rule targets:
    input:
        "plots/dataset1.pdf",
        "plots/dataset2.pdf"

rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    shell:
        "somecommand {input} {output}"
- Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.
- For each target and intermediate file, you create rules that define how they are created from input files.
- Snakemake determines the rule dependencies by matching file names.
- Input and output files can contain multiple named wildcards (see the sketch after this list).
- Rules can either use shell commands, plain Python code, or external Python or R scripts to create output files from input files.
- Snakemake workflows can be easily executed on workstations, clusters, the grid, and in the cloud without modification. The job scheduling can be constrained by arbitrary resources, e.g., available CPU cores, memory, or GPUs.
- Snakemake can automatically deploy required software dependencies of a workflow using Conda or Singularity.
- Snakemake can use Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP and iRODS to access input or output files and further access input files via HTTP and HTTPS.
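For instance, a single rule can constrain an output pattern with more than one named wildcard. The following sketch is hypothetical (the region wildcard and somecommand are made up for illustration); Snakemake would infer both wildcard values from a requested file such as plots/dataset1.chr1.pdf:
rule plot_region:
    input:
        "raw/{dataset}.{region}.csv"
    output:
        "plots/{dataset}.{region}.pdf"
    shell:
        "somecommand {input} {output}"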
Getting started¶
News about Snakemake is published via Twitter. To get started, consider the Snakemake Tutorial, the introductory slides, and the FAQ.
Support¶
- First, check the FAQ.
- In case of questions, please post on Stack Overflow.
- To discuss with other Snakemake users, you can use the mailing list. Please do not post questions there; use Stack Overflow for questions.
- For bugs and feature requests, please use the issue tracker.
- For contributions, visit Snakemake on Bitbucket and read the guidelines.
Resources¶
- Snakemake Wrappers Repository
- The Snakemake Wrapper Repository is a collection of reusable wrappers that make it possible to quickly use popular tools from Snakemake rules and workflows.
- Snakemake Workflows Project
- This project provides a collection of high-quality, modularized, and reusable workflows. The provided code should also serve as a best-practice example of how to build production-ready workflows with Snakemake. Everybody is invited to contribute.
- Snakemake Profiles Project
- This project provides Snakemake configuration profiles for various execution environments. Please consider contributing your own if it is still missing.
- Bioconda
- Bioconda can be used from Snakemake for creating completely reproducible workflows by defining the used software versions and providing binaries.
Publications using Snakemake¶
The following is an incomplete list of publications that use Snakemake for their analyses. Please consider adding your own.
- Uhlitz et al. 2017. An immediate–late gene expression module decodes ERK signal duration. Molecular Systems Biology.
- Akkouche et al. 2017. Piwi Is Required during Drosophila Embryogenesis to License Dual-Strand piRNA Clusters for Transposon Repression in Adult Ovaries. Molecular Cell.
- Beatty et al. 2017. Giardia duodenalis induces pathogenic dysbiosis of human intestinal microbiota biofilms. International Journal for Parasitology.
- Meyer et al. 2017. Differential Gene Expression in the Human Brain Is Associated with Conserved, but Not Accelerated, Noncoding Sequences. Molecular Biology and Evolution.
- Lonardo et al. 2017. Priming of soil organic matter: Chemical structure of added compounds is more important than the energy content. Soil Biology and Biochemistry.
- Beisser et al. 2017. Comprehensive transcriptome analysis provides new insights into nutritional strategies and phylogenetic relationships of chrysophytes. PeerJ.
- Dimitrov et al 2017. Successive DNA extractions improve characterization of soil microbial communities. PeerJ.
- de Bourcy et al. 2016. Phylogenetic analysis of the human antibody repertoire reveals quantitative signatures of immune senescence and aging. PNAS.
- Bray et al. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology.
- Etournay et al. 2016. TissueMiner: a multiscale analysis toolkit to quantify how cellular processes create tissue dynamics. eLife Sciences.
- Townsend et al. 2016. The Public Repository of Xenografts Enables Discovery and Randomized Phase II-like Trials in Mice. Cancer Cell.
- Burrows et al. 2016. Genetic Variation, Not Cell Type of Origin, Underlies the Majority of Identifiable Regulatory Differences in iPSCs. PLOS Genetics.
- Ziller et al. 2015. Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing. Nature Methods.
- Li et al. 2015. Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR. Genome Biology.
- Schmied et al. 2015. An automated workflow for parallel processing of large multiview SPIM recordings. Bioinformatics.
- Chung et al. 2015. Whole-Genome Sequencing and Integrative Genomic Analysis Approach on Two 22q11.2 Deletion Syndrome Family Trios for Genotype to Phenotype Correlations. Human Mutation.
- Kim et al. 2015. TUT7 controls the fate of precursor microRNAs by using three different uridylation mechanisms. The EMBO Journal.
- Park et al. 2015. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell.
- Břinda et al. 2015. RNF: a general framework to evaluate NGS read mappers. Bioinformatics.
- Břinda et al. 2015. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics.
- Spjuth et al. 2015. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct.
- Schramm et al. 2015. Mutational dynamics between primary and relapse neuroblastomas. Nature Genetics.
- Berulava et al. 2015. N6-Adenosine Methylation in MiRNAs. PLOS ONE.
- The Genome of the Netherlands Consortium 2014. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics.
- Patterson et al. 2014. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. Journal of Computational Biology.
- Fernández et al. 2014. H3K4me1 marks DNA regions hypomethylated during aging in human stem and differentiated cells. Genome Research.
- Köster et al. 2014. Massively parallel read mapping on GPUs with the q-group index and PEANUT. PeerJ.
- Chang et al. 2014. TAIL-seq: Genome-wide Determination of Poly(A) Tail Length and 3′ End Modifications. Molecular Cell.
- Althoff et al. 2013. MiR-137 functions as a tumor suppressor in neuroblastoma by downregulating KDM1A. International Journal of Cancer.
- Marschall et al. 2013. MATE-CLEVER: Mendelian-Inheritance-Aware Discovery and Genotyping of Midsize and Long Indels. Bioinformatics.
- Rahmann et al. 2013. Identifying transcriptional miRNA biomarkers by integrating high-throughput sequencing and real-time PCR data. Methods.
- Martin et al. 2013. Exome sequencing identifies recurrent somatic mutations in EIF1AX and SF3B1 in uveal melanoma with disomy 3. Nature Genetics.
- Czeschik et al. 2013. Clinical and mutation data in 12 patients with the clinical diagnosis of Nager syndrome. Human Genetics.
- Marschall et al. 2012. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics.
Installation¶
Snakemake is available on PyPI, through Bioconda, and as source code. You can use any of the following methods to install Snakemake.
Installation via Conda¶
This is the recommended way to install Snakemake, because it also enables Snakemake to handle software dependencies of your workflow.
First, you have to install the Miniconda Python 3 distribution. See here for installation instructions. Make sure to …
- Install the Python 3 version of Miniconda.
- Answer yes to the question whether conda shall be put into your PATH.
Then, you can install Snakemake with
$ conda install -c bioconda -c conda-forge snakemake
from the Bioconda channel.
Global Installation¶
With a working Python >=3.5
setup, installation of Snakemake can be performed by issuing
$ easy_install3 snakemake
or
$ pip3 install snakemake
in your terminal.
Installing in Virtualenv¶
To create an installation in a virtual environment, use the following commands:
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install snakemake
Installing from Source¶
We recommend installing Snakemake into a virtualenv instead of globally. Use the following commands to create a virtualenv and install Snakemake. Note that this will install the development version and, as you are installing from the source code, we trust that you know what you are doing and how to check out individual versions/tags.
$ git clone https://bitbucket.org/snakemake/snakemake.git
$ cd snakemake
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ python setup.py install
You can also use python setup.py develop to create a “development installation” in which no files are copied but a link is created, so that changes in the source code are immediately visible in your snakemake commands.
Examples¶
Most of the examples below assume that Snakemake is executed in a project-specific root directory.
The paths in the Snakefiles below are relative to this directory.
We follow the convention of using different subdirectories for different intermediate results, e.g., mapped/ for mapped sequence reads in .bam files, etc.
Cufflinks¶
Cufflinks is a tool to assemble transcripts, calculate abundance and conduct a differential expression analysis on RNA-Seq data. This example shows how to create a typical Cufflinks workflow with Snakemake. It assumes that mapped RNA-Seq data for four samples (101–104) is given as BAM files.
- For each sample, transcripts are assembled with cufflinks (rule assembly).
- Assemblies are merged into one GTF file with cuffmerge (rule merge_assemblies).
- A comparison to the hg19 GTF track is conducted (rule compare_assemblies).
- Finally, differential expression is calculated on the found transcripts (rule diffexp).
# path to track and reference
TRACK = 'hg19.gtf'
REF = 'hg19.fa'

# sample names and classes
CLASS1 = '101 102'.split()
CLASS2 = '103 104'.split()
SAMPLES = CLASS1 + CLASS2

# path to bam files
CLASS1_BAM = expand('mapped/{sample}.bam', sample=CLASS1)
CLASS2_BAM = expand('mapped/{sample}.bam', sample=CLASS2)

rule all:
    input:
        'diffexp/isoform_exp.diff',
        'assembly/comparison'

rule assembly:
    input:
        'mapped/{sample}.bam'
    output:
        'assembly/{sample}/transcripts.gtf',
        dir='assembly/{sample}'
    threads: 4
    shell:
        'cufflinks --num-threads {threads} -o {output.dir} '
        '--frag-bias-correct {REF} {input}'

rule compose_merge:
    input:
        expand('assembly/{sample}/transcripts.gtf', sample=SAMPLES)
    output:
        txt='assembly/assemblies.txt'
    run:
        with open(output.txt, 'w') as out:
            print(*input, sep="\n", file=out)

rule merge_assemblies:
    input:
        'assembly/assemblies.txt'
    output:
        'assembly/merged/merged.gtf', dir='assembly/merged'
    shell:
        'cuffmerge -o {output.dir} -s {REF} {input}'

rule compare_assemblies:
    input:
        'assembly/merged/merged.gtf'
    output:
        'assembly/comparison/all.stats',
        dir='assembly/comparison'
    shell:
        'cuffcompare -o {output.dir}all -s {REF} -r {TRACK} {input}'

rule diffexp:
    input:
        class1=CLASS1_BAM,
        class2=CLASS2_BAM,
        gtf='assembly/merged/merged.gtf'
    output:
        'diffexp/gene_exp.diff', 'diffexp/isoform_exp.diff'
    params:
        class1=",".join(CLASS1_BAM),
        class2=",".join(CLASS2_BAM)
    threads: 8
    shell:
        'cuffdiff --num-threads {threads} {input.gtf} {params.class1} {params.class2}'
The execution plan of Snakemake for this workflow can be visualized with the following DAG.

Building a C Program¶
GNU Make is primarily used to build C/C++ code. Snakemake can do the same, while providing superior readability thanks to fewer obscure variables inside the rules.
The following example Makefile was adapted from http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/.
IDIR=../include
ODIR=obj
LDIR=../lib

LIBS=-lm

CC=gcc
CFLAGS=-I$(IDIR)

_HEADERS = hello.h
HEADERS = $(patsubst %,$(IDIR)/%,$(_HEADERS))

_OBJS = hello.o hellofunc.o
OBJS = $(patsubst %,$(ODIR)/%,$(_OBJS))

# build the executable from the object files
hello: $(OBJS)
	$(CC) -o $@ $^ $(CFLAGS)

# compile a single .c file to an .o file
$(ODIR)/%.o: %.c $(HEADERS)
	$(CC) -c -o $@ $< $(CFLAGS)

# clean up temporary files
.PHONY: clean
clean:
	rm -f $(ODIR)/*.o *~ core $(IDIR)/*~
A Snakefile can be easily written as
from os.path import join

IDIR = '../include'
ODIR = 'obj'
LDIR = '../lib'

LIBS = '-lm'

CC = 'gcc'
CFLAGS = '-I' + IDIR

_HEADERS = ['hello.h']
HEADERS = [join(IDIR, hfile) for hfile in _HEADERS]

_OBJS = ['hello.o', 'hellofunc.o']
OBJS = [join(ODIR, ofile) for ofile in _OBJS]

rule hello:
    """build the executable from the object files"""
    output:
        'hello'
    input:
        OBJS
    shell:
        "{CC} -o {output} {input} {CFLAGS} {LIBS}"

rule c_to_o:
    """compile a single .c file to an .o file"""
    output:
        temp('{ODIR}/{name}.o')
    input:
        src='{name}.c',
        headers=HEADERS
    shell:
        "{CC} -c -o {output} {input.src} {CFLAGS}"

rule clean:
    """clean up temporary files"""
    shell:
        "rm -f *~ core {IDIR}/*~"
As can be seen, the shell calls become more readable, e.g. "{CC} -c -o {output} {input.src} {CFLAGS}" instead of $(CC) -c -o $@ $< $(CFLAGS). Further, Snakemake automatically deletes .o files when they are not needed anymore, since they are marked as temp.

Building a Paper with LaTeX¶
Building a scientific paper can be automated by Snakemake as well. Apart from compiling LaTeX code and invoking BibTeX, we provide a special rule to zip the needed files for online submission.
We first provide a Snakefile tex.rules that contains rules that can be shared for any LaTeX build task:
ruleorder: tex2pdf_with_bib > tex2pdf_without_bib

rule tex2pdf_with_bib:
    input:
        '{name}.tex',
        '{name}.bib'
    output:
        '{name}.pdf'
    shell:
        """
        pdflatex {wildcards.name}
        bibtex {wildcards.name}
        pdflatex {wildcards.name}
        pdflatex {wildcards.name}
        """

rule tex2pdf_without_bib:
    input:
        '{name}.tex'
    output:
        '{name}.pdf'
    shell:
        """
        pdflatex {wildcards.name}
        pdflatex {wildcards.name}
        """

rule texclean:
    shell:
        "rm -f *.log *.aux *.bbl *.blg *.synctex.gz"
Note how we distinguish between a .tex file with and without a corresponding .bib file of the same name. Assuming that both paper.tex and paper.bib exist, an ambiguity arises: both rules are, in principle, applicable. This would lead to an AmbiguousRuleException, but since we have specified an explicit rule order in the file, it is clear that in this case the rule tex2pdf_with_bib is to be preferred. If the paper.bib file does not exist, that rule is not even applicable, and the only option is to execute rule tex2pdf_without_bib.
Assuming that the above file is saved as tex.rules, the actual documents are then built from a specific Snakefile that includes these common rules:
DOCUMENTS = ['document', 'response-to-editor']
TEXS = [doc + ".tex" for doc in DOCUMENTS]
PDFS = [doc + ".pdf" for doc in DOCUMENTS]
FIGURES = ['fig1.pdf']

include: 'tex.rules'

rule all:
    input:
        PDFS

rule zipit:
    output:
        'upload.zip'
    input:
        TEXS, FIGURES, PDFS
    shell:
        'zip -T {output} {input}'

rule pdfclean:
    shell:
        "rm -f {PDFS}"
Hence, the user can perform four different tasks. Build all PDFs:
$ snakemake
Create a zip-file for online submissions:
$ snakemake zipit
Clean up all PDFs:
$ snakemake pdfclean
Clean up latex temporary files:
$ snakemake texclean
The following DAG of jobs would be executed upon a full run:

Snakemake Tutorial¶
This tutorial introduces the text-based workflow system Snakemake. Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
Snakemake sets itself apart from existing text-based workflow systems in the following way. Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python with syntax to define rules and workflow-specific properties. This allows combining the flexibility of a plain scripting language with a pythonic workflow definition. The Python language is known to be concise yet readable and can appear almost like pseudo-code. The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake’s scheduling algorithm can be constrained by priorities, provided cores, and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single-core workstations and multi-core servers to cluster or batch systems.
The examples presented in this tutorial come from bioinformatics. However, Snakemake is a general-purpose workflow management system for any discipline. We ensured that no bioinformatics knowledge is needed to understand the tutorial.
Also have a look at the corresponding slides.
Setup¶
Requirements¶
To go through this tutorial, you need the following software installed:
- Python ≥3.3
- Snakemake 3.11.0
- BWA 0.7.12
- SAMtools 1.3.1
- BCFtools 1.3.1
- Graphviz 2.38.0
- PyYAML 3.11
- Docutils 0.12
The easiest way to set up these prerequisites is to use the Miniconda Python 3 distribution. The tutorial assumes that you are using either Linux or MacOS X. Both Snakemake and Miniconda also work under Windows, but the Windows shell is too different to be able to provide generic examples.
Setup a Linux VM with Vagrant under Windows¶
If you already use Linux or MacOS X, go on with Step 1.
If you use Windows, you can set up a Linux virtual machine (VM) with Vagrant.
First, install Vagrant following the installation instructions in the Vagrant Documentation.
Then, create a new directory that you want to share with your Linux VM, e.g., a folder named vagrant-linux.
Open a command line prompt and change into that directory.
Here, you create a 64-bit Ubuntu Linux environment with
> vagrant init hashicorp/precise64
> vagrant up
If you decide to use a 32-bit image, you will need to download the 32-bit version of Miniconda in the next step.
The contents of the vagrant-linux
folder will be shared with the virtual machine that is set up by vagrant.
You can log into the virtual machine via
> vagrant ssh
If this command tells you to install an SSH client, you can follow the instructions in this blog post. Now, you can follow the steps of our tutorial from within your Linux VM.
Step 1: Installing Miniconda 3¶
First, please open a terminal or make sure you are logged into your Vagrant Linux VM. Assuming that you have a 64-bit system, on Linux, download and install Miniconda 3 with
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
On MacOS X, download and install with
$ curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh
For a 32-bit system, URLs and file names are analogous but without the _64
.
When you are asked the question
Do you wish the installer to prepend the Miniconda3 install location to PATH ...? [yes|no]
answer with yes.
Along with a minimal Python 3 environment, Miniconda contains the package manager Conda.
After opening a new terminal, you can use the new conda
command to install software packages and create isolated environments to, e.g., use different versions of the same package.
We will later use Conda to create an isolated environment with all required software for this tutorial.
Step 2: Preparing a working directory¶
First, create a new directory snakemake-tutorial at a reasonable place.
If you use a Vagrant Linux VM from Windows as described above, create that directory under /vagrant/, so that the contents are shared with your host system (you can then edit all files from within Windows with an editor that supports Unix line breaks).
Then, change into the newly created directory in your terminal.
In this directory, we will later create an example workflow that illustrates the Snakemake syntax and execution environment.
First, we download some example data on which the workflow shall be executed:
$ wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
$ tar -xf v3.11.0.tar.bz2 --strip 1
This will create a folder data
and a file environment.yaml
in the working directory.
Step 3: Creating an environment with the required software¶
The environment.yaml
file can be used to install all required software into an isolated Conda environment with the name snakemake-tutorial
via
$ conda env create --name snakemake-tutorial --file environment.yaml
Step 4: Activating the environment¶
To activate the snakemake-tutorial
environment, execute
$ source activate snakemake-tutorial
Now you can use the installed tools. Execute
$ snakemake --help
to test this and get information about the command-line interface of Snakemake. To exit the environment, you can execute
$ source deactivate
but don’t do that now, since we finally want to start working with Snakemake :-).
Basics: An example workflow¶
Please make sure that you have activated the environment we created before, and that you have an open terminal in the working directory you have created.
A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.
The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented and consists of multiple lines. The resulting syntax resembles that of original Python constructs.
In the following, we will introduce the Snakemake syntax by creating an example workflow. The workflow comes from the domain of genome analysis. It maps sequencing reads to a reference genome and calls variants on the mapped reads. The tutorial does not require you to know what this is about. Nevertheless, we provide some background in the following.
Background¶
The genome of a living organism encodes its hereditary information. It serves as a blueprint for proteins, which form living cells, carry information and drive chemical reactions. Differences between populations, species, cancer cells and healthy tissue, as well as syndromes or diseases can be reflected and sometimes caused by changes in the genome. This makes the genome a major target of biological and medical research. Today, it is often analyzed with DNA sequencing, producing gigabytes of data from a single biological sample (e.g. a biopsy of some tissue). For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called reads. In order to recover the genome of the sample, one has to map these reads against a known reference genome (e.g., the human one obtained during the famous human genome project). This task is called read mapping. Often, it is of interest where an individual genome is different from the species-wide consensus represented with the reference genome. Such differences are called variants. They are responsible for harmless individual differences (like eye color), but can also cause diseases like cancer. By investigating the differences between all mapped reads and the reference sequence at a particular position, variants can be detected. This is a statistical challenge, because they have to be distinguished from artifacts generated by the sequencing process.
Step 1: Mapping reads¶
Our first Snakemake rule maps reads of a given sample to a given reference genome (see Background).
For this, we will use the tool bwa, specifically the subcommand bwa mem
.
In the working directory, create a new file called Snakefile
with an editor of your choice.
We propose to use the Atom editor, since it provides out-of-the-box syntax highlighting for Snakemake.
In the Snakefile, define the following rule:
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
A Snakemake rule has a name (here bwa_map
) and a number of directives, here input
, output
and shell
.
The input
and output
directives are followed by lists of files that are expected to be used or created by the rule.
In the simplest case, these are just explicit Python strings.
The shell
directive is followed by a Python string containing the shell command to execute.
In the shell command string, we can refer to elements of the rule via braces notation (similar to the Python format function).
Here, we refer to the output file by specifying {output}
and to the input files by specifying {input}
.
Since the rule has multiple input files, Snakemake will concatenate them, separated by whitespace.
In other words, Snakemake will replace {input}
with data/genome.fa data/samples/A.fastq
before executing the command.
The shell command invokes bwa mem
with reference genome and reads, and pipes the output into samtools
which creates a compressed BAM file containing the alignments.
The output of samtools
is piped into the output file defined by the rule.
When a workflow is executed, Snakemake tries to generate given target files. Target files can be specified via the command line. By executing
$ snakemake -np mapped_reads/A.bam
in the working directory containing the Snakefile, we tell Snakemake to generate the target file mapped_reads/A.bam
.
Since we used the -n (or --dryrun) flag, Snakemake will only show the execution plan instead of actually performing the steps.
The -p
flag instructs Snakemake to also print the resulting shell command for illustration.
To generate the target files, Snakemake applies the rules given in the Snakefile in a top-down way.
The application of a rule to generate a set of output files is called a job.
For each input file of a job, Snakemake again (i.e. recursively) determines rules that can be applied to generate it.
This yields a directed acyclic graph (DAG) of jobs where the edges represent dependencies.
So far, we only have a single rule, and the DAG of jobs consists of a single node.
Nevertheless, we can execute our workflow with
$ snakemake mapped_reads/A.bam
Note that, after completion of the above command, Snakemake will not try to create mapped_reads/A.bam
again, because it is already present in the file system.
Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
Step 2: Generalizing the read mapping rule¶
Obviously, the rule will only work for a single sample with reads in the file data/samples/A.fastq
.
However, Snakemake allows generalizing rules by using named wildcards.
Simply replace the A
in the second input file and in the output file with the wildcard {sample}
, leading to
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
When Snakemake determines that this rule can be applied to generate a target file by replacing the wildcard {sample}
in the output file with an appropriate value, it will propagate that value to all occurrences of {sample}
in the input files and thereby determine the necessary input for the resulting job.
Note that you can have multiple wildcards in your file paths; however, to avoid conflicts with other jobs of the same rule, all output files of a rule have to contain exactly the same wildcards.
When executing
$ snakemake -np mapped_reads/B.bam
Snakemake will determine that the rule bwa_map
can be applied to generate the target file by replacing the wildcard {sample}
with the value B
.
In the output of the dry-run, you will see how the wildcard value is propagated to the input files and all filenames in the shell command.
You can also specify multiple targets, e.g.:
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Some Bash magic can make this particularly handy. For example, you can alternatively compose the two targets above in a single pattern via
$ snakemake -np mapped_reads/{A,B}.bam
Note that this is not a special Snakemake syntax. Bash is just expanding the given path into two, one for each element of the set {A,B}
.
In both cases, you will see that Snakemake only proposes to create the output file mapped_reads/B.bam
.
This is because you already executed the workflow before (see the previous step) and no input file is newer than the output file mapped_reads/A.bam
.
You can update the file modification date of the input file
data/samples/A.fastq
via
$ touch data/samples/A.fastq
and see how Snakemake wants to re-run the job to create the file mapped_reads/A.bam
by executing
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Step 3: Sorting read alignments¶
For later steps, we need the read alignments in the BAM files to be sorted.
This can be achieved with the samtools command.
We add the following rule beneath the bwa_map
rule:
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
This rule will take the input file from the mapped_reads
directory and store a sorted version in the sorted_reads
directory.
Note that Snakemake automatically creates missing directories before jobs are executed.
For sorting, samtools
requires a prefix specified with the flag -T
.
Here, we need the value of the wildcard sample
.
Snakemake allows accessing wildcards in the shell command via the wildcards object, which has an attribute with the value of each wildcard.
When issuing
$ snakemake -np sorted_reads/B.bam
you will see how Snakemake wants to run first the rule bwa_map
and then the rule samtools_sort
to create the desired target file:
as mentioned before, the dependencies are resolved automatically by matching file names.
Step 4: Indexing read alignments and visualizing the DAG of jobs¶
Next, we need to use samtools again to index the sorted read alignments for random access. This can be done with the following rule:
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
Having three steps already, it is a good time to take a closer look at the resulting DAG of jobs. By executing
$ snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
we create a visualization of the DAG using the dot
command provided by Graphviz.
For the given target files, Snakemake specifies the DAG in the dot language and pipes it into the dot
command, which renders the definition into SVG format.
The rendered DAG is piped into the file dag.svg
and will look similar to this:

The DAG contains a node for each job and edges representing the dependencies. Jobs that don’t need to be run because their output is up-to-date are dashed. For rules with wildcards, the value of the wildcard for the particular job is displayed in the job node.
Exercise¶
- Run parts of the workflow using different targets. Recreate the DAG and see how different rules become dashed because their output is present and up-to-date.
Step 5: Calling genomic variants¶
The next step in our workflow will aggregate the mapped reads from all samples and jointly call genomic variants on them (see Background). For the variant calling, we will combine the two utilities samtools and bcftools. Snakemake provides a helper function for collecting input files, which we can use to describe the aggregation in this step. With
expand("sorted_reads/{sample}.bam", sample=SAMPLES)
we obtain a list of files where the given pattern "sorted_reads/{sample}.bam"
was formatted with the values in a given list of samples SAMPLES
, i.e.
["sorted_reads/A.bam", "sorted_reads/B.bam"]
The function is particularly useful when the pattern contains multiple wildcards. For example,
expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
would create the product of all elements of SAMPLES
and the list [0, 1]
, yielding
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]
Here, we use only the simple case of expand
.
We first let Snakemake know which samples we want to consider.
Remember that Snakemake works top-down; it does not automatically infer this from, e.g., the FASTQ files in the data folder.
Also remember that Snakefiles are in principle Python code enhanced by some declarative statements to define workflows.
Hence, we can define the list of samples ad-hoc in plain Python at the top of the Snakefile:
SAMPLES = ["A", "B"]
Later, we will learn about more sophisticated ways like config files. Now, we can add the following rule to our Snakefile:
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
With multiple input or output files, it is sometimes handy to refer to them separately in the shell command.
This can be done by specifying names for input or output files (here, e.g., fa=...
).
The files can then be referred to in the shell command via, e.g., {input.fa}
.
For long shell commands like this one, it is advisable to split the string over multiple indented lines.
Python will automatically merge it into one.
Further, you will notice that the input or output file lists can contain arbitrary Python expressions, as long as they evaluate to a string or a list of strings.
Here, we invoke our expand
function to aggregate over the aligned reads of all samples.
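As a brief illustration of this point, a plain Python expression such as a list comprehension could be used instead of expand. The following sketch is not part of the tutorial workflow (the rule name bcftools_call_alt is made up) and is shown only for comparison:
SAMPLES = ["A", "B"]

rule bcftools_call_alt:
    input:
        fa="data/genome.fa",
        # plain list comprehension, equivalent to expand("sorted_reads/{sample}.bam", sample=SAMPLES)
        bam=["sorted_reads/{}.bam".format(s) for s in SAMPLES],
        bai=["sorted_reads/{}.bam.bai".format(s) for s in SAMPLES]
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"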
Exercise¶
- Obtain the updated DAG of jobs for the target file calls/all.vcf. It should look like this:

Step 6: Writing a report¶
Although Snakemake workflows are already self-documenting to a certain degree, it is often useful to summarize the obtained results and performed steps in a comprehensive report.
With Snakemake, such reports can be composed easily with the built-in report
function.
It is best practice to create reports in a separate rule that takes all desired results as input files and provides a single HTML file as output.
rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
First, we notice that this rule does not entail a shell command.
Instead, we use the run
directive, which is followed by plain Python code.
Similar to the shell case, we have access to input
and output
files, which we can handle as plain Python objects.
We go through the run
block line by line.
First, we import the report
function from snakemake.utils
.
Second, we open the VCF file by accessing it via its index in the input files (i.e. input[0]
), and count the number of non-header lines (which is equivalent to the number of variant calls).
Of course, this is only a silly example of what to do with variant calls.
Third, we create the report using the report
function.
The function takes a string that contains RestructuredText markup.
In addition, we can use the familiar braces notation to access any Python variables (here the samples
and n_calls
variables we have defined before).
The second argument of the report
function is the path where the report will be stored (the function creates a single HTML file).
Then, report expects any number of keyword arguments referring to files that shall be embedded into the report.
Technically, this means that the file will be stored as a Base64 encoded data URI within the HTML file, making reports entirely self-contained.
Importantly, you can refer to the files from within the report via the given keywords followed by an underscore (here T1_
).
Hence, reports can be used to semantically connect and explain the obtained results.
When there are many result files, it is sometimes handy to define the names already in the list of input files and unpack these into keyword arguments as follows:
report("""...""", output[0], **input)
Further, you can add meta data in the form of any string that will be displayed in the footer of the report, e.g.
report("""...""", output[0], metadata="Author: Johannes Köster (koester@jimmy.harvard.edu)", **input)
Step 7: Adding a target rule¶
So far, we always executed the workflow by specifying a target file at the command line.
Apart from filenames, Snakemake also accepts rule names as targets if the referenced rule does not have wildcards.
Hence, it is possible to write target rules collecting particular subsets of the desired results or all results.
Moreover, if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target.
Hence, it is best practice to have a rule all
at the top of the workflow which has all typically desired target files as input files.
Here, this means that we add a rule
rule all:
    input:
        "report.html"
to the top of our workflow. When executing Snakemake with
$ snakemake -n
the execution plan for creating the file report.html, which contains and summarizes all our results, will be shown.
Note that, apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile is arbitrary and does not influence the DAG of jobs.
Exercise¶
- Create the DAG of jobs for the complete workflow.
- Execute the complete workflow and have a look at the resulting report.html in your browser.
- Snakemake provides handy flags for forcing re-execution of parts of the workflow. Have a look at the command line help with snakemake --help and search for the flag --forcerun. Then, use this flag to re-execute the rule samtools_sort and see what happens.
- With --reason it is possible to display the execution reason for each job. Try this flag together with a dry-run and the --forcerun flag to understand the decisions of Snakemake.
Summary¶
In total, the resulting workflow looks like this:
SAMPLES = ["A", "B"]

rule all:
    input:
        "report.html"

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
Advanced: Decorating the example workflow¶
Now that the basic concepts of Snakemake have been illustrated, we can introduce advanced topics.
Step 1: Specifying the number of used threads¶
For some tools, it is advisable to use more than one thread in order to speed up the computation.
Snakemake can be made aware of the threads a rule needs with the threads
directive.
In our example workflow, it makes sense to use multiple threads for the rule bwa_map
:
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
The number of threads can be propagated to the shell command with the familiar braces notation (i.e. {threads}
).
If no threads
directive is given, a rule is assumed to need 1 thread.
When a workflow is executed, the number of threads the jobs need is considered by the Snakemake scheduler.
In particular, the scheduler ensures that the sum of the threads of all running jobs does not exceed a given number of available CPU cores.
This number can be given with the --cores
command line argument (per default, Snakemake uses only 1 CPU core).
For example
$ snakemake --cores 10
would execute the workflow with 10 cores.
Since the rule bwa_map
needs 8 threads, only one job of the rule can run at a time, and the Snakemake scheduler will try to saturate the remaining cores with other jobs like, e.g., samtools_sort
.
The threads directive in a rule is interpreted as a maximum: when fewer cores than threads are provided, the number of threads a rule uses will be reduced to the number of given cores.
Exercise¶
- With the flag --forceall you can enforce a complete re-execution of the workflow. Combine this flag with different values for --cores and examine how the scheduler selects jobs to run in parallel.
Step 2: Config files¶
So far, we specified the samples to consider in a Python list within the Snakefile.
However, often you want your workflow to be customizable, so that it can be easily adapted to new data.
For this purpose, Snakemake provides a config file mechanism.
Config files can be written in JSON or YAML, and loaded with the configfile
directive.
In our example workflow, we add the line
configfile: "config.yaml"
to the top of the Snakefile.
Snakemake will load the config file and store its contents into a globally available dictionary named config
.
In our case, it makes sense to specify the samples in config.yaml
as
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
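Since config files can also be written in JSON, a roughly equivalent config.json would look as follows (shown only for comparison; the tutorial itself uses the YAML file above):
{
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq"
    }
}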
Now, we can remove the statement defining SAMPLES
from the Snakefile and change the rule bcftools_call
to
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
Step 3: Input functions¶
Since we have stored the path to the FASTQ files in the config file, we can also generalize the rule bwa_map
to use these paths.
This case is different from the rule bcftools_call we modified above.
To understand this, it is important to know that Snakemake workflows are executed in three phases.
- In the initialization phase, the workflow is parsed and all rules are instantiated.
- In the DAG phase, the DAG of jobs is built by filling wildcards and matching input files to output files.
- In the scheduling phase, the DAG of jobs is executed.
The expand functions in the list of input files of the rule bcftools_call
are executed during the initialization phase.
In this phase, we don’t know about jobs, wildcard values and rule dependencies.
Hence, we cannot determine the FASTQ paths for rule bwa_map
from the config file in this phase, because we don’t even know which jobs will be generated from that rule.
Instead, we need to defer the determination of input files to the DAG phase.
This can be achieved by specifying an input function instead of a string inside the input directive.
For the rule bwa_map
this works as follows:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
Here, we use an anonymous function, also called a lambda expression.
Any normal function would work as well (see the sketch below).
Input functions take a single argument, a wildcards object, which allows access to the wildcard values via attributes (here wildcards.sample).
They have to return a string or a list of strings, which are interpreted as paths to input files (here, we return the path that is stored for the sample in the config file).
Input functions are evaluated once the wildcard values of a job are determined.
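For illustration, the same input function could be written as a named function instead of a lambda expression. The function name bwa_map_input below is made up for this sketch; the behavior is assumed to be identical to the lambda above:
def bwa_map_input(wildcards):
    # look up the FASTQ path for the requested sample in the config file
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        bwa_map_input
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"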
Exercise¶
- In the data/samples folder, there is an additional sample C.fastq. Add that sample to the config file and see how Snakemake wants to recompute the part of the workflow belonging to the new sample, when invoking with snakemake -n --reason --forcerun bcftools_call.
Step 4: Rule parameters¶
Sometimes, shell commands are not only composed of input and output files and some static flags.
In particular, it can happen that additional parameters need to be set depending on the wildcard values of the job.
For this, Snakemake allows defining arbitrary parameters for rules with the params directive.
In our workflow, it is reasonable to annotate aligned reads with so-called read groups, which contain metadata such as the sample name.
We modify the rule bwa_map
accordingly:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    threads: 8
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
Similar to input and output files, params can be accessed from the shell command or the Python-based run block (see Step 6: Writing a report).
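As a minimal, hypothetical sketch of the latter (the rule write_read_group and its output path are made up and not part of the tutorial workflow), params are available as a plain Python object inside a run block:
rule write_read_group:
    output:
        "read_groups/{sample}.txt"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    run:
        # params.rg already has the wildcard value filled in for this job
        with open(output[0], "w") as f:
            f.write(params.rg + "\n")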
Exercise¶
- Variant calling can consider a lot of parameters. A particularly important one is the prior mutation rate (1e-3 per default). It is set via the flag -P of the bcftools call command. Consider making this flag configurable by adding a new key to the config file and using the params directive in the rule bcftools_call to propagate it to the shell command (one possible sketch follows below).
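One possible solution sketch, assuming a hypothetical config key prior_mutation_rate and the -P flag of bcftools call as described above:
# added to config.yaml (hypothetical key name)
prior_mutation_rate: 0.001

# adjusted rule in the Snakefile
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        prior=config["prior_mutation_rate"]
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.prior} - > {output}"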
Step 5: Logging¶
When executing a large workflow, it is usually desirable to store the output of each job persistently in files instead of just printing it to the terminal.
For this purpose, Snakemake allows specifying log files for rules.
Log files are defined via the log directive and handled similarly to output files, but they are not subject to rule matching and are not cleaned up when a job fails.
We modify our rule bwa_map
as follows:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
The shell command is modified to collect the STDERR output of both bwa and samtools and pipe it into the file referred to by {log}.
Log files must contain exactly the same wildcards as the output files to avoid clashes.
Exercise¶
- Add a log directive to the
bcftools_call
rule as well. - Time to re-run the whole workflow (remember the command line flags to force re-execution). See how log files are created for variant calling and read mapping.
- The ability to track the provenance of each generated result is an important step towards reproducible analyses. Apart from the
report
functionality discussed before, Snakemake can summarize various provenance information for all output files of the workflow. The flag--summary
prints a table associating each output file with the rule used to generate it, the creation date and optionally the version of the tool used for creation is provided. Further, the table informs about updated input files and changes to the source code of the rule after creation of the output file. Invoke Snakemake with--summary
to examine the information for our example.
Step 6: Temporary and protected files¶
In our workflow, we create two BAM files for each sample, namely
the output of the rules bwa_map
and samtools_sort
.
When not dealing with examples, the underlying data is usually huge.
Hence, the resulting BAM files need a lot of disk space and their creation takes some time.
Snakemake allows marking output files as temporary, such that they are deleted once every consuming job has been executed, in order to save disk space.
We use this mechanism for the output file of the rule bwa_map
:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
This results in the deletion of the BAM file once the corresponding samtools_sort
job has been executed.
Since the creation of BAM files via read mapping and sorting is computationally expensive, it is reasonable to protect the final BAM file from accidental deletion or modification.
We modify the rule samtools_sort by marking its output file as protected:
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
After execution of the job, Snakemake will write-protect the output file in the filesystem, so that it can’t be overwritten or deleted accidentally.
Exercise¶
- Re-execute the whole workflow and observe how Snakemake handles the temporary and protected files.
- Run Snakemake with the target mapped_reads/A.bam. Although the file is marked as temporary, you will see that Snakemake does not delete it because it is specified as a target file.
- Try to re-execute the whole workflow again with the dry-run option. You will see that it fails (as intended) because Snakemake cannot overwrite the protected output files.
Summary¶
The final version of our workflow looks like this:
configfile: "config.yaml"

rule all:
    input:
        "report.html"

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
Additional features¶
In the following, we introduce some features that are beyond the scope of the above example workflow.
For details and even more features, see Writing Workflows, Frequently Asked Questions and the command line help (snakemake --help
).
Benchmarking¶
With the benchmark
directive, Snakemake can be instructed to measure the wall clock time of a job.
We activate benchmarking for the rule bwa_map
:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        "benchmarks/{sample}.bwa.benchmark.txt"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
The benchmark
directive takes a string that points to the file where benchmarking results shall be stored.
Similar to output files, the path can contain wildcards (it must be the same wildcards as in the output files).
When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store it in the file in tab-delimited format.
With the command line flag --benchmark-repeats
, Snakemake can be instructed to perform repetitive measurements by executing benchmark jobs multiple times.
The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
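For example, the whole workflow could be re-executed with three repeated benchmark measurements per benchmark job via a command along these lines (a hedged example; combine the flag with whatever targets you like):
$ snakemake --forceall --benchmark-repeats 3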
We can include the benchmark results into our report:
rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input.T1) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        Benchmark results for BWA can be found in the tables T2_.
        """, output[0], **input)
We use the expand
function to collect the benchmark files for all samples.
Here, we directly provide names for the input files.
In particular, we can also name the whole list of benchmark files returned by the expand
function as T2
.
When invoking the report
function, we just unpack input
into keyword arguments (resulting in T1
and T2
).
In the text, we refer with T2_
to the list of benchmark files.
Exercise¶
- Re-execute the workflow and benchmark bwa_map with 3 repeats. Open the report and see how the list of benchmark files is presented in the HTML report.
Modularization¶
In order to re-use building blocks or simply to structure large workflows, it is sometimes reasonable to split a workflow into modules.
For this, Snakemake provides the include
directive to include another Snakefile into the current one, e.g.:
include: "path/to/other.snakefile"
Alternatively, Snakemake allows defining sub-workflows. A sub-workflow refers to a working directory with a complete Snakemake workflow. Output files of that sub-workflow can be used in the current Snakefile. When executing, Snakemake ensures that the output files of the sub-workflow are up-to-date before executing the current workflow. This mechanism is particularly useful when you want to extend a previous analysis without modifying it. For details about sub-workflows, see the documentation.
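A minimal sketch of how a sub-workflow can be declared and used might look as follows; the paths, the sub-workflow name otherworkflow, and rule a are placeholders, so see the sub-workflow documentation for the authoritative details:
subworkflow otherworkflow:
    workdir: "../path/to/otherworkflow"
    snakefile: "../path/to/otherworkflow/Snakefile"

rule a:
    input:
        # request test.txt from the sub-workflow; it is created there if necessary
        otherworkflow("test.txt")
    output:
        "results/from_subworkflow.txt"
    shell:
        "cp {input} {output}"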
Exercise¶
- Put the read-mapping-related rules into a separate Snakefile and use the include directive to make them available in our example workflow again.
Using custom scripts¶
Using the run
directive as above is only reasonable for short Python scripts.
As soon as your script becomes larger, it is reasonable to separate it from the
workflow definition.
For this purpose, Snakemake offers the script
directive.
Using this, the report
rule from above could instead look like this:
rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    script:
        "scripts/report.py"
The actual Python code to generate the report is now hidden in the script scripts/report.py
.
Script paths are always relative to the referring Snakefile.
In the script, all properties of the rule like input
, output
, wildcards
,
params
, threads
etc. are available as attributes of a global snakemake
object:
from snakemake.utils import report

with open(snakemake.input.T1) as vcf:
    n_calls = sum(1 for l in vcf if not l.startswith("#"))

report("""
An example variant calling workflow
===================================

Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.

This resulted in {n_calls} variants (see Table T1_).
Benchmark results for BWA can be found in the tables T2_.
""", snakemake.output[0], **snakemake.input)
Although there are other strategies to invoke separate scripts from your workflow (e.g., invoking them via shell commands), the benefit of this is obvious: the script logic is separated from the workflow logic (and can even be shared between workflows), but boilerplate code like the parsing of command line arguments is unnecessary.
Apart from Python scripts, it is also possible to use R scripts. In R scripts, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].
For details and examples, see the External scripts section in the Documentation.
Automatic deployment of software dependencies¶
In order to get a fully reproducible data analysis, it is not sufficient to be able to execute each step and document all used parameters. The used software tools and libraries have to be documented as well. In this tutorial, you have already seen how Conda can be used to specify an isolated software environment for a whole workflow. With Snakemake, you can go one step further and specify Conda environments per rule. This way, you can even make use of conflicting software versions (e.g. combine Python 2 with Python 3).
In our example, instead of using an external environment we can specify environments per rule, e.g.:
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
conda:
"envs/samtools.yaml"
shell:
"samtools index {input}"
with envs/samtools.yaml
defined as
channels:
- bioconda
dependencies:
- samtools =1.3
When Snakemake is executed with snakemake --use-conda, it will automatically create required environments and activate them before a job is executed. It is best practice to specify at least the major and minor version of any packages in the environment definition. Specifying environments per rule in this way has two advantages. First, the workflow definition also documents all used software versions. Second, a workflow can be re-executed (without admin rights) on a vanilla system, without installing any prerequisites apart from Snakemake and Miniconda.
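For example, the per-rule environments can also be built ahead of time (e.g. on a machine with internet access) without running any jobs:
$ snakemake --use-conda --create-envs-only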
Tool wrappers¶
In order to simplify the utilization of popular tools, Snakemake provides a repository of so-called wrappers (the Snakemake wrapper repository). A wrapper is a short script that wraps (typically) a command line application and makes it directly addressable from within Snakemake. For this, Snakemake provides the wrapper directive that can be used instead of shell, script, or run. For example, the rule bwa_map could alternatively look like this:
rule bwa_mem:
input:
ref="data/genome.fa",
sample=lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
log:
"logs/bwa_mem/{sample}.log"
params:
"-R '@RG\tID:{sample}\tSM:{sample}'"
threads: 8
wrapper:
"0.15.3/bio/bwa/mem"
The wrapper directive expects a (partial) URL that points to a wrapper in the repository.
These can be looked up in the corresponding database.
The first part of the URL is a Git version tag. Upon invocation, Snakemake
will automatically download the requested version of the wrapper.
Furthermore, in combination with --use-conda
(see Automatic deployment of software dependencies),
the required software will be automatically deployed before execution.
Cluster execution¶
By default, Snakemake executes jobs on the local machine it is invoked on. Alternatively, it can execute jobs in distributed environments, e.g., compute clusters or batch systems. If the nodes share a common file system, Snakemake supports three alternative execution modes.
In cluster environments, compute jobs are usually submitted as shell scripts via commands like qsub. Snakemake provides a generic mode to execute on such clusters.
By invoking Snakemake with
$ snakemake --cluster qsub --jobs 100
each job will be compiled into a shell script that is submitted with the given command (here qsub). The --jobs flag limits the number of concurrently submitted jobs to 100.
This basic mode assumes that the submission command returns immediately after submitting the job.
Some clusters allow to run the submission command in synchronous mode, such that it waits until the job has been executed.
In such cases, we can invoke e.g.
$ snakemake --cluster-sync "qsub -sync yes" --jobs 100
The specified submission command can also be decorated with additional parameters taken from the submitted job. For example, the number of used threads can be accessed in braces similarly to the formatting of shell commands, e.g.
$ snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
Alternatively, Snakemake can use the Distributed Resource Management Application API (DRMAA). This API provides a common interface to control various resource management systems. The DRMAA support can be activated by invoking Snakemake as follows:
$ snakemake --drmaa --jobs 100
If available, DRMAA is preferable over the generic cluster modes because it provides better control and error handling. To support additional cluster specific parametrization, a Snakefile can be complemented by a Cluster Configuration file.
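As a sketch of how such a cluster configuration can look (the file name cluster.yaml and the resource keys time and mem are illustrative and depend on your scheduler), a YAML file can provide per-rule defaults under the rule names, with __default__ as fallback:
__default__:
    time: "01:00:00"
    mem: "4G"

bwa_map:
    time: "04:00:00"
    mem: "16G"
The values are then available in the submission command via the cluster variable, e.g.:
$ snakemake --cluster "qsub -l walltime={cluster.time} -l mem={cluster.mem}" --cluster-config cluster.yaml --jobs 100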
Constraining wildcards¶
Snakemake uses regular expressions to match output files to input files and determine dependencies between the jobs.
Sometimes it is useful to constrain the values a wildcard can have.
This can be achieved by adding a regular expression that describes the set of allowed wildcard values.
For example, the wildcard sample in the output file "sorted_reads/{sample}.bam" can be constrained to only allow alphanumeric sample names as "sorted_reads/{sample,[A-Za-z0-9]+}.bam". Constraints may be defined per rule or globally using the wildcard_constraints keyword, as demonstrated in Wildcards.
This mechanism helps to solve two kinds of ambiguity.
- It can help to avoid ambiguous rules, i.e. two or more rules that can be applied to generate the same output file. Other ways of handling ambiguous rules are described in the Section Handling Ambiguous Rules.
- It can help to guide the regular expression based matching so that wildcards are assigned to the right parts of a file name. Consider the output file {sample}.{group}.txt and assume that the target file is A.1.normal.txt. It is not clear whether sample="A.1" and group="normal" or sample="A" and group="1.normal" is the right assignment. Here, constraining the sample wildcard with {sample,[A-Z]+}.{group} solves the problem.
When dealing with ambiguous rules, it is best practice to first try to solve the ambiguity by using a proper file structure, for example, by separating the output files of different steps in different directories.
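As an illustration of this best practice (the rule names, directories, and commands are made up), giving each step its own output directory ensures that no two rules can match the same file path:
rule step1:
    input: "raw/{sample}.txt"
    output: "step1/{sample}.txt"
    shell: "command1 {input} > {output}"

rule step2:
    input: "step1/{sample}.txt"
    output: "step2/{sample}.txt"
    shell: "command2 {input} > {output}"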
Executing Snakemake¶
This part of the documentation describes the snakemake executable. Snakemake is primarily a command-line tool, so the snakemake executable is the primary way to execute, debug, and visualize workflows.
Useful Command Line Arguments¶
If called without parameters, i.e.
$ snakemake
Snakemake tries to execute the workflow specified in a file called Snakefile in the same directory (alternatively, the Snakefile can be given via the parameter -s).
By issuing
$ snakemake -n
a dry-run can be performed. This is useful to test if the workflow is defined properly and to estimate the amount of needed computation. Further, the reason for each rule execution can be printed via
$ snakemake -n -r
Importantly, Snakemake can automatically determine which parts of the workflow can be run in parallel. By specifying the number of available cores, i.e.
$ snakemake -j 4
one can tell Snakemake to use up to 4 cores and solve a binary knapsack problem to optimize the scheduling of jobs.
If the number is omitted (i.e., only -j
is given), the number of used cores is determined as the number of available CPU cores in the machine.
Cloud Support¶
Snakemake 4.0 and later supports experimental execution in the cloud via Kubernetes. This is independent of the cloud provider, but we provide the setup steps for GCE below.
Google cloud engine¶
First, install the Google Cloud SDK. Then, run
$ gcloud init
to set up your access. Then, you can create a new kubernetes cluster via
$ gcloud container clusters create $CLUSTER_NAME --num-nodes=$NODES --scopes storage-rw
with $CLUSTER_NAME being the cluster name and $NODES being the number of cluster nodes. If you intend to use Google storage, make sure that --scopes storage-rw is set. This enables Snakemake to write to the Google storage from within the cloud nodes.
Next, you configure Kubernetes to use the new cluster via
$ gcloud container clusters get-credentials $CLUSTER_NAME
Now, Snakemake is ready to use your cluster.
Important: After finishing your work, do not forget to delete the cluster with
$ gcloud container clusters delete $CLUSTER_NAME
in order to avoid unnecessary charges.
Executing a Snakemake workflow via kubernetes¶
Assuming that kubernetes has been properly configured (see above), you can execute a workflow via:
snakemake --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX
In this mode, Snakemake will assume all input and output files to be stored in a given remote location, configured by setting $REMOTE to your provider of choice (e.g. GS for Google cloud storage or S3 for Amazon S3) and $PREFIX to a bucket name or subfolder within that remote storage.
After successful execution, you find your results in the specified remote storage.
Of course, if any input or output already defines a different remote location, the latter will be used instead.
Importantly, this means that Snakemake does not require a shared network
filesystem to work in the cloud.
Currently, this mode requires that the Snakemake workflow is stored in a git repository. Snakemake uses git to query necessary source files (the Snakefile, scripts, config, …) for workflow execution and encodes them into the kubernetes job.
It is further possible to forward arbitrary environment variables to the kubernetes jobs via the flag --kubernetes-env (see snakemake --help).
When executing, Snakemake will make use of the defined resources and threads to schedule jobs to the correct nodes. In particular, it will forward memory requirements defined as mem_mb to kubernetes. Further, it will propagate the number of threads a job intends to use, such that kubernetes can allocate it to the correct cloud computing node.
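For example, a rule that declares both threads and a mem_mb resource (the rule and file names below are illustrative) gives Snakemake the information it needs to place the job on an appropriately sized node:
rule heavy_step:
    input: "data/{sample}.bam"
    output: "results/{sample}.txt"
    threads: 4
    resources:
        mem_mb=8000
    shell: "somecommand --threads {threads} {input} > {output}"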
Cluster Execution¶
Snakemake can make use of cluster engines that support shell scripts and have access to a common filesystem (e.g. the Sun Grid Engine). In this case, Snakemake simply needs to be given a submit command that accepts a shell script as first positional argument:
$ snakemake --cluster qsub -j 32
Here, -j denotes the number of jobs being submitted to the cluster at the same time (here 32).
The cluster command can be decorated with job specific information, e.g.
$ snakemake --cluster "qsub {threads}"
Thereby, all keywords of a rule are allowed (e.g. params, input, output, threads, priority, …). For example, you could encode the expected running time into params:
rule:
input: ...
output: ...
params: runtime="4h"
shell: ...
and forward it to the cluster scheduler:
$ snakemake --cluster "qsub --runtime {params.runtime}"
If your cluster system supports DRMAA, Snakemake can make use of that to increase the control over jobs.
E.g. jobs can be cancelled upon pressing Ctrl+C, which is not possible with the generic --cluster support. With DRMAA, no qsub command needs to be provided, but system specific arguments can still be given as a string, e.g.
$ snakemake --drmaa " -q username" -j 32
Note that the string has to contain a leading whitespace. Else, the arguments will be interpreted as part of the normal Snakemake arguments, and execution will fail.
Job Properties¶
When executing a workflow on a cluster using the --cluster
parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub
). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties["threads"]
# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
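Assuming the wrapper above is saved as, say, qsub-wrapper.py and made executable (the file name is arbitrary), it can be passed to Snakemake in place of the plain submission command; Snakemake then appends the generated job script as the first positional argument, which is exactly what the wrapper reads from sys.argv[1]:
$ chmod +x qsub-wrapper.py
$ snakemake --cluster ./qsub-wrapper.py --jobs 100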
Profiles¶
Adapting Snakemake to a particular environment can entail many flags and options. Therefore, since Snakemake 4.1, it is possible to specify a configuration profile to be used to obtain default options:
$ snakemake --profile myprofile
Here, a folder myprofile is searched for in per-user and global configuration directories (on Linux, this will be $HOME/.config/snakemake and /etc/xdg/snakemake; you can find the answer for your system via snakemake --help). Alternatively, an absolute or relative path to the folder can be given. The profile folder is expected to contain a file config.yaml that defines default values for the Snakemake command line arguments.
For example, the file
cluster: qsub
jobs: 100
would set up Snakemake to always submit to the cluster via the qsub command, and never use more than 100 parallel jobs in total.
Under https://github.com/snakemake-profiles/doc, you can find publicly available profiles.
Feel free to contribute your own.
The profile folder can additionally contain auxiliary files, e.g., jobscripts, or any kind of wrappers. See https://github.com/snakemake-profiles/doc for examples.
Visualization¶
To visualize the workflow, one can use the option --dag. This creates a representation of the DAG in the graphviz dot language, which has to be postprocessed by the graphviz tool dot.
E.g. to visualize the DAG that would be executed, you can issue:
$ snakemake --dag | dot | display
For saving this to a file, you can specify the desired format:
$ snakemake --dag | dot -Tpdf > dag.pdf
To visualize the whole DAG regardless of the eventual presence of files, the --forceall option can be used:
$ snakemake --forceall --dag | dot -Tpdf > dag.pdf
Of course, the visual appearance can be modified by providing further command line arguments to dot.
All Options¶
All command line options can be printed by calling snakemake -h.
Snakemake is a Python based language and execution environment for GNU Make-like workflows.
usage: snakemake [-h] [--dryrun] [--profile PROFILE] [--snakefile FILE]
[--cores [N]] [--local-cores N]
[--resources [NAME=INT [NAME=INT ...]]]
[--config [KEY=VALUE [KEY=VALUE ...]]] [--configfile FILE]
[--directory DIR] [--touch] [--keep-going] [--force]
[--forceall] [--forcerun [TARGET [TARGET ...]]]
[--prioritize TARGET [TARGET ...]]
[--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--rerun-incomplete]
[--report HTMLFILE] [--list] [--list-target-rules] [--dag]
[--rulegraph] [--d3dag] [--summary] [--detailed-summary]
[--archive FILE] [--cleanup-metadata FILE [FILE ...]]
[--cleanup-shadow] [--unlock] [--list-version-changes]
[--list-code-changes] [--list-input-changes]
[--list-params-changes] [--list-untracked]
[--delete-all-output] [--delete-temp-output]
[--bash-completion] [--version] [--reason] [--gui [PORT]]
[--printshellcmds] [--debug-dag] [--stats FILE] [--nocolor]
[--quiet] [--timestamp] [--print-compilation] [--verbose]
[--force-use-threads] [--allow-ambiguity] [--nolock]
[--ignore-incomplete] [--latency-wait SECONDS]
[--wait-for-files [FILE [FILE ...]]] [--notemp]
[--keep-remote] [--keep-target-files]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
[--restart-times RESTART_TIMES] [--attempt ATTEMPT]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX]
[--no-shared-fs] [--greediness GREEDINESS] [--no-hooks]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--debug]
[--runtime-profile FILE] [--mode {0,1,2}]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--cluster-config FILE] [--immediate-submit]
[--jobscript SCRIPT] [--jobname NAME]
[--cluster-status CLUSTER_STATUS] [--drmaa-log-dir DIR]
[--kubernetes [NAMESPACE]]
[--kubernetes-env ENVVAR [ENVVAR ...]]
[--container-image IMAGE] [--use-conda] [--list-conda-envs]
[--cleanup-conda] [--conda-prefix DIR] [--create-envs-only]
[--use-singularity] [--singularity-prefix DIR]
[--singularity-args ARGS]
[target [target ...]]
EXECUTION¶
target | Targets to build. May be rules or files. |
--dryrun, -n | Do not execute anything, and display what would be done. If you have a very large workflow, use –dryrun –quiet to just print a summary of the DAG of jobs. Default: False |
--profile |
|
--snakefile, -s | |
The workflow definition in a snakefile. Default: “Snakefile” | |
--cores, --jobs, -j | |
Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores. | |
--local-cores | In cluster mode, use at most N cores of the host machine in parallel (default: number of CPU cores of the host). The cores are used to execute local rules. This option is ignored when not in cluster mode. Default: 4 |
--resources, --res | |
Define additional resources that shall constrain the scheduling analogously to threads (see above). A resource is defined as a name and an integer value. E.g. –resources gpu=1. Rules can use resources by defining the resource keyword, e.g. resources: gpu=1. If now two rules require 1 of the resource ‘gpu’ they won’t be run in parallel by the scheduler. | |
--config, -C | Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file (see Documentation). |
--configfile | Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. |
--directory, -d | |
Specify working directory (relative paths in the snakefile will use this as their origin). | |
--touch, -t | Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of snakemake. Fails if a file does not yet exist. Default: False |
--keep-going, -k | |
Go on with independent jobs if a job fails. Default: False | |
--force, -f | Force the execution of the selected target or the first rule regardless of already created output. Default: False |
--forceall, -F | Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output. Default: False |
--forcerun, -R | Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated. |
--prioritize, -P | |
Tell the scheduler to assign creation of given targets (and all their dependencies) highest priority. (EXPERIMENTAL) | |
--until, -U | Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs. |
--omit-from, -O | |
Prevent the execution or creation of the given rules or files as well as any rules or files that are downstream of these targets in the DAG. Also runs jobs in sibling DAGs that are independent of the rules or files specified here. | |
--rerun-incomplete, --ri | |
Re-run all jobs the output of which is recognized as incomplete. Default: False |
UTILITIES¶
--report | Create an HTML report with results and statistics. |
--list, -l | Show available rules in given Snakefile. Default: False |
--list-target-rules, --lt | |
Show available target rules in given Snakefile. Default: False | |
--dag | Do not execute anything and print the directed acyclic graph of jobs in the dot language. Recommended use on Unix systems: snakemake –dag | dot | display Default: False |
--rulegraph | Do not execute anything and print the dependency graph of rules in the dot language. This will be less crowded than above DAG of jobs, but also show less information. Note that each rule is displayed once, hence the displayed graph will be cyclic if a rule appears in several steps of the workflow. Use this if above option leads to a DAG that is too large. Recommended use on Unix systems: snakemake –rulegraph | dot | display Default: False |
--d3dag | Print the DAG in D3.js compatible JSON format. Default: False |
--summary, -S | Print a summary of all files created by the workflow. The summary has the following columns: filename, modification time, rule version, status, plan. Thereby rule version contains the version the file was created with (see the version keyword of rules), and status denotes whether the file is missing, its input files are newer or if version or implementation of the rule changed since file creation. Finally the last column denotes whether the file will be updated or created during the next workflow execution. Default: False |
--detailed-summary, -D | |
Print a summary of all files created by the workflow. The summary has the following columns: filename, modification time, rule version, input file(s), shell command, status, plan. Thereby rule version contains the version the file was created with (see the version keyword of rules), and status denotes whether the file is missing, its input files are newer or if version or implementation of the rule changed since file creation. The input file and shell command columns are self-explanatory. Finally the last column denotes whether the file will be updated or created during the next workflow execution. Default: False | |
--archive | Archive the workflow into the given tar archive FILE. The archive will be created such that the workflow can be re-executed on a vanilla system. The function needs conda and git to be installed. It will archive every file that is under git version control. Note that it is best practice to have the Snakefile, config files, and scripts under version control. Hence, they will be included in the archive. Further, it will add input files that are not generated by the workflow itself and conda environments. Note that symlinks are dereferenced. Supported formats are .tar, .tar.gz, .tar.bz2 and .tar.xz. |
--cleanup-metadata, --cm | |
Cleanup the metadata of given files. That means that snakemake removes any tracked version info, and any marks that files are incomplete. | |
--cleanup-shadow | |
Cleanup old shadow directories which have not been deleted due to failures or power loss. Default: False | |
--unlock | Remove a lock on the working directory. Default: False |
--list-version-changes, --lv | |
List all output files that have been created with a different version (as determined by the version keyword). Default: False | |
--list-code-changes, --lc | |
List all output files for which the rule body (run or shell) have changed in the Snakefile. Default: False | |
--list-input-changes, --li | |
List all output files for which the defined input files have changed in the Snakefile (e.g. new input files were added in the rule definition or files were renamed). For listing input file modification in the filesystem, use –summary. Default: False | |
--list-params-changes, --lp | |
List all output files for which the defined params have changed in the Snakefile. Default: False | |
--list-untracked, --lu | |
List all files in the working directory that are not used in the workflow. This can be used e.g. for identifying leftover files. Hidden files and directories are ignored. Default: False | |
--delete-all-output | |
Remove all files generated by the workflow. Use together with –dryrun to list files without actually deleting anything. Note that this will not recurse into subworkflows. It will also remove files flagged as protected. Use with care! Default: False | |
--delete-temp-output | |
Remove all temporary files generated by the workflow. Use together with –dryrun to list files without actually deleting anything. Note that this will not recurse into subworkflows. It will also remove files flagged as protected. Use with care! Default: False | |
--bash-completion | |
Output code to register bash completion for snakemake. Put the following in your .bashrc (including the accents): snakemake –bash-completion or issue it in an open terminal session. Default: False | |
--version, -v | show program’s version number and exit |
OUTPUT¶
--reason, -r | Print the reason for each executed rule. Default: False |
--gui | Serve an HTML based user interface to the given network and port e.g. 168.129.10.15:8000. By default Snakemake is only available in the local network (default port: 8000). To make Snakemake listen to all ip addresses add the special host address 0.0.0.0 to the url (0.0.0.0:8000). This is important if Snakemake is used in a virtualised environment like Docker. If possible, a browser window is opened. |
--printshellcmds, -p | |
Print out the shell commands that will be executed. Default: False | |
--debug-dag | Print candidate and selected jobs (including their wildcards) while inferring DAG. This can help to debug unexpected DAG topology or errors. Default: False |
--stats | Write stats about Snakefile execution in JSON format to the given file. |
--nocolor | Do not use a colored output. Default: False |
--quiet, -q | Do not output any progress or rule information. Default: False |
--timestamp, -T | |
Add a timestamp to all logging output Default: False | |
--print-compilation | |
Print the python representation of the workflow. Default: False | |
--verbose | Print debugging output. Default: False |
BEHAVIOR¶
--force-use-threads | |
Force threads rather than processes. Helpful if shared memory (/dev/shm) is full or unavailable. Default: False | |
--allow-ambiguity, -a | |
Don’t check for ambiguous rules and simply use the first if several can produce the same file. This allows the user to prioritize rules by their order in the snakefile. Default: False | |
--nolock | Do not lock the working directory Default: False |
--ignore-incomplete, --ii | |
Do not check for incomplete output files. Default: False | |
--latency-wait, --output-wait, -w | |
Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 5). Default: 5 | |
--wait-for-files | |
Wait –latency-wait seconds for these files to be present before executing the workflow. This option is used internally to handle filesystem latency in cluster environments. | |
--notemp, --nt | Ignore temp() declarations. This is useful when running only a part of the workflow, since temp() would lead to deletion of probably needed files by other parts of the workflow. Default: False |
--keep-remote | Keep local copies of remote input files. Default: False |
--keep-target-files | |
Do not adjust the paths of given target files relative to the working directory. Default: False | |
--allowed-rules | |
Only consider given rules. If omitted, all rules in Snakefile are used. Note that this is intended primarily for internal use and may lead to unexpected results otherwise. | |
--max-jobs-per-second | |
Maximal number of cluster/drmaa jobs per second, default is 10, fractions allowed. Default: 10 | |
--max-status-checks-per-second | |
Maximal number of job status checks per second, default is 10, fractions allowed. Default: 10 | |
--restart-times | |
Number of times to restart failing jobs (defaults to 0). Default: 0 | |
--attempt | Internal use only: define the initial value of the attempt parameter (default: 1). Default: 1 |
--wrapper-prefix | |
Prefix for URL created from wrapper directive (default: https://bitbucket.org/snakemake/snakemake-wrappers/raw/). Set this to a different URL to use your fork or a local clone of the repository. Default: “https://bitbucket.org/snakemake/snakemake-wrappers/raw/” | |
--default-remote-provider | |
Possible choices: S3, GS, FTP, SFTP, S3Mocked, gfal, gridftp, iRODS Specify default remote provider to be used for all input and output files that don’t yet specify one. | |
--default-remote-prefix | |
Specify prefix for default remote provider. E.g. a bucket name. Default: “” | |
--no-shared-fs | Do not assume that jobs share a common file system. When this flag is activated, Snakemake will assume that the filesystem on a cluster node is not shared with other nodes. For example, this will lead to downloading remote files on each cluster node separately. Further, it won’t take special measures to deal with filesystem latency issues. This option will in most cases only make sense in combination with –default-remote-provider. Further, when using –cluster you will have to also provide –cluster-status. Only activate this if you know what you are doing. Default: False |
--greediness | Set the greediness of scheduling. This value between 0 and 1 determines how careful jobs are selected for execution. The default value (1.0) provides the best speed and still acceptable scheduling quality. |
--no-hooks | Do not invoke onstart, onsuccess or onerror hooks after execution. Default: False |
--overwrite-shellcmd | |
Provide a shell command that shall be executed instead of those given in the workflow. This is for debugging purposes only. | |
--debug | Allow to debug rules with e.g. PDB. This flag allows to set breakpoints in run blocks. Default: False |
--runtime-profile | |
Profile Snakemake and write the output to FILE. This requires yappi to be installed. | |
--mode | Possible choices: 0, 1, 2 Set execution mode of Snakemake (internal use only). Default: 0 |
CLUSTER¶
--cluster, -c | Execute snakemake rules with the given submit command, e.g. qsub. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. The submit command can be decorated to make it aware of certain job properties (input, output, params, wildcards, log, threads and dependencies (see the argument below)), e.g.: $ snakemake –cluster ‘qsub -pe threaded {threads}’. |
--cluster-sync | cluster submission command will block, returning the remote exit status upon remote termination (for example, this should be used if the cluster command is ‘qsub -sync y’ (SGE)) |
--drmaa | Execute snakemake on a cluster accessed via DRMAA, Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. ARGS can be used to specify options of the underlying cluster system, thereby using the job properties input, output, params, wildcards, log, threads and dependencies, e.g.: –drmaa ‘ -pe threaded {threads}’. Note that ARGS must be given in quotes and with a leading whitespace. |
--cluster-config, -u | |
A JSON or YAML file that defines the wildcards used in ‘cluster’ for specific rules, instead of having them specified in the Snakefile. For example, for rule ‘job’ you may define: { ‘job’ : { ‘time’ : ‘24:00:00’ } } to specify the time for rule ‘job’. You can specify more than one file. The configuration files are merged with later values overriding earlier ones. Default: [] | |
--immediate-submit, --is | |
Immediately submit all jobs to the cluster instead of waiting for present input files. This will fail, unless you make the cluster aware of job dependencies, e.g. via: $ snakemake –cluster ‘sbatch –dependency {dependencies}. Assuming that your submit script (here sbatch) outputs the generated job id to the first stdout line, {dependencies} will be filled with space separated job ids this job depends on. Default: False | |
--jobscript, --js | |
Provide a custom job script for submission to the cluster. The default script resides as ‘jobscript.sh’ in the installation directory. | |
--jobname, --jn | |
Provide a custom name for the jobscript that is submitted to the cluster (see –cluster). NAME is “snakejob.{name}.{jobid}.sh” per default. The wildcard {jobid} has to be present in the name. Default: “snakejob.{name}.{jobid}.sh” | |
--cluster-status | |
Status command for cluster execution. This is only considered in combination with the –cluster flag. If provided, Snakemake will use the status command to determine if a job has finished successfully or failed. For this it is necessary that the submit command provided to –cluster returns the cluster job id. Then, the status command will be invoked with the job id. Snakemake expects it to return ‘success’ if the job was successful, ‘failed’ if the job failed and ‘running’ if the job still runs. | |
--drmaa-log-dir | |
Specify a directory in which stdout and stderr files of DRMAA jobs will be written. The value may be given as a relative path, in which case Snakemake will use the current invocation directory as the origin. If given, this will override any given ‘-o’ and/or ‘-e’ native specification. If not given, all DRMAA stdout and stderr files are written to the current working directory. |
CLOUD¶
--kubernetes | Execute workflow in a kubernetes cluster (in the cloud). NAMESPACE is the namespace you want to use for your job (if nothing specified: ‘default’). Usually, this requires –default-remote-provider and –default-remote-prefix to be set to a S3 or GS bucket where your data shall be stored. It is further advisable to activate conda integration via –use-conda. |
--kubernetes-env | |
Specify environment variables to pass to the kubernetes job. Default: [] | |
--container-image | |
Docker image to use, e.g., when submitting jobs to kubernetes. By default, this is ‘quay.io/snakemake/snakemake’, tagged with the same version as the currently running Snakemake instance. Note that overwriting this value is up to your responsibility. Any used image has to contain a working snakemake installation that is compatible with (or ideally the same as) the currently running version. |
CONDA¶
--use-conda | If defined in the rule, run job in a conda environment. If this flag is not set, the conda directive is ignored. Default: False |
--list-conda-envs | |
List all conda environments and their location on disk. Default: False | |
--cleanup-conda | |
Cleanup unused conda environments. Default: False | |
--conda-prefix | Specify a directory in which the ‘conda’ and ‘conda-archive’ directories are created. These are used to store conda environments and their archives, respectively. If not supplied, the value is set to the ‘.snakemake’ directory relative to the invocation directory. If supplied, the –use-conda flag must also be set. The value may be given as a relative path, which will be extrapolated to the invocation directory, or as an absolute path. |
--create-envs-only | |
If specified, only creates the job-specific conda environments then exits. The –use-conda flag must also be set. Default: False |
SINGULARITY¶
--use-singularity | |
If defined in the rule, run job within a singularity container. If this flag is not set, the singularity directive is ignored. Default: False | |
--singularity-prefix | |
Specify a directory in which singularity images will be stored. If not supplied, the value is set to the ‘.snakemake’ directory relative to the invocation directory. If supplied, the –use-singularity flag must also be set. The value may be given as a relative path, which will be extrapolated to the invocation directory, or as an absolute path. | |
--singularity-args | |
Pass additional args to singularity. Default: “” |
Bash Completion¶
Snakemake supports bash completion for filenames, rule names and arguments. To enable it globally, just append
`snakemake --bash-completion`
including the accents to your .bashrc. This only works if the snakemake command is in your path.
Writing Workflows¶
In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
Grammar¶
The Snakefile syntax obeys the following grammar, given in extended Backus-Naur form (EBNF)
snakemake = statement | rule | include | workdir
rule = "rule" (identifier | "") ":" ruleparams
include = "include:" stringliteral
workdir = "workdir:" stringliteral
ni = NEWLINE INDENT
ruleparams = [ni input] [ni output] [ni params] [ni message] [ni threads] [ni (run | shell)] NEWLINE snakemake
input = "input" ":" parameter_list
output = "output" ":" parameter_list
params = "params" ":" parameter_list
log = "log" ":" parameter_list
benchmark = "benchmark" ":" statement
message = "message" ":" stringliteral
threads = "threads" ":" integer
resources = "resources" ":" parameter_list
version = "version" ":" statement
run = "run" ":" ni statement
shell = "shell" ":" stringliteral
where all non-terminals not defined above map to their Python equivalents.
Depend on a Minimum Snakemake Version¶
From Snakemake 3.2 on, if your workflow depends on a minimum Snakemake version, you can easily ensure that at least this version is installed via
from snakemake.utils import min_version
min_version("3.2")
given that your minimum required version of Snakemake is 3.2. The statement will raise a WorkflowError (and therefore abort the workflow execution) if the version is not met.
Rules¶
Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
shell: "somecommand {input} {output}"
Inside the shell command, all local and global variables, especially input and output files, can be accessed via their names in the Python format mini-language. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. path/to/inputfile path/to/other/inputfile). From Snakemake 3.8.0 on, adding the special formatting instruction :q (e.g. "somecommand {input:q} {output:q}") will let Snakemake quote each of the list or tuple elements that contains whitespace.
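A minimal sketch (the file names are made up) where quoting protects paths that contain spaces:
rule NAME:
    input: "raw data/input file.txt"
    output: "results/output.txt"
    shell: "somecommand {input:q} {output:q}"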
Instead of a shell command, a rule can run some python code to generate the output:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", somename = "path/to/another/outputfile"
run:
for f in input:
...
with open(output[0], "w") as out:
out.write(...)
with open(output.somename, "w") as out:
out.write(...)
As can be seen, instead of accessing input and output as a whole, we can also access them by index (output[0]) or by keyword (output.somename). Note that, when adding keywords or names for input or output files, their order won't be preserved when accessing them as a whole via e.g. {output} in a shell command.
Shell commands like the one above can also be invoked inside a Python based rule, via the function shell, which takes a string with the command and allows the same formatting as in the rule above, e.g.:
shell("somecommand {output.somename}")
Further, this combination of Python and shell commands allows iterating over the output of the shell command, e.g.:
for line in shell("somecommand {output.somename}", iterable=True):
... # do something in python
Note that shell commands in Snakemake use the bash shell in strict mode by default.
Wildcards¶
Usually, it is useful to generalize a rule so that it is applicable to, for example, a number of datasets. For this purpose, wildcards can be used. Automatically resolved, multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. Consider the following example.
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Here, we define two wildcards, dataset and group. By this, the rule can produce all files that follow the regular expression pattern .+/file\..+\.txt, i.e. the wildcards are replaced by the regular expression .+. If the rule's output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, which is also used here in the shell command. The wildcards object can be accessed in the same way as input and output, as described above.
For example, if another rule in the workflow requires the file 101/file.A.txt, Snakemake recognizes that this rule is able to produce it by setting dataset=101 and group=A. Thus, it requests the file 101/inputfile as input and executes the command somecommand --group A < 101/inputfile > 101/file.A.txt.
Of course, the input file might have to be generated by another rule with different wildcards.
Of course, the input file might have to be generated by another rule with different wildcards.
Importantly, the wildcard names in input and output must be named identically. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not the input section.
Multiple wildcards in one filename can cause ambiguity.
Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case. Hence, wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only, using \d+ as the corresponding regular expression.
With Snakemake 3.8.0, there are three ways to constrain wildcards.
First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
Second, a wildcard can be constrained within the rule via the keyword wildcard_constraints
:
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
wildcard_constraints:
dataset="\d+"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
Finally, you can also define global wildcard constraints that apply for all rules:
wildcard_constraints:
dataset="\d+"
rule a:
...
rule b:
...
See the Python documentation on regular expressions for detailed information on regular expression syntax.
Targets¶
By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:
rule all:
input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]
Here, for each dataset in a Python list DATASETS defined before, the file {dataset}/file.A.txt is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule complex_conversion shown above.
The above expression can be simplified to the following:
rule all:
input: expand("{dataset}/file.A.txt", dataset=DATASETS)
This may be used for “aggregation” rules for which files from multiple or all datasets are needed to produce a specific output (say, allSamplesSummary.pdf).
Note that dataset is NOT a wildcard here because it is resolved by Snakemake due to the expand statement (see below for more information). The expand function also allows combining different variables, e.g.
rule all:
input: expand("{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS)
If now PLOTFORMATS=["pdf", "png"]
contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.
Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.
expand(["{dataset}/plot1.{ext}", "{dataset}/plot2.{ext}"], dataset=DATASETS, ext=PLOTFORMATS)
leads to
["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.pdf", "ds2/plot2.pdf", "ds1/plot1.png", "ds1/plot2.png", "ds2/plot1.png", "ds2/plot2.png"]
By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. zip:
expand("{dataset}/plot1.{ext} {dataset}/plot2.{ext}".split(), zip, dataset=DATASETS, ext=PLOTFORMATS)
leads to
["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.png", "ds2/plot2.png"]
You can also mask a wildcard expression in expand such that it will be kept, e.g.
expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)
will create strings with all values for ext but starting with "{dataset}".
Threads¶
Further, a rule can be given a number of threads to use, i.e.
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
shell: "somecommand --threads {threads} {input} {output}"
Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built-in variable threads rather than hardcoding it into the shell command. In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. threads = min(threads, cores), with cores being the number of cores specified at the command line (option --cores). On a cluster node, Snakemake uses as many cores as available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by --local-cores, which only applies to jobs running on the master node.
Starting from version 3.7, threads can also be a callable that returns an int value. The signature of the callable should be callable(wildcards[, input]) (input is an optional parameter). It is also possible to refer to a predefined variable (e.g., threads: threads_max) so that the number of cores for a set of rules can be changed with one change only, by altering the value of the variable threads_max.
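A minimal sketch of both variants (the variable threads_max, the rule names, and the condition on the sample wildcard are made up): a callable receives the wildcards (and optionally the input files), while a predefined variable lets several rules be adjusted in one place:
threads_max = 8

rule fixed_threads:
    input: "path/to/{sample}.txt"
    output: "path/to/{sample}.out"
    threads: threads_max
    shell: "somecommand --threads {threads} {input} {output}"

rule dynamic_threads:
    input: "path/to/{sample}.txt"
    output: "other/{sample}.out"
    threads: lambda wildcards: 4 if wildcards.sample == "big" else 1
    shell: "somecommand --threads {threads} {input} {output}"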
Resources¶
In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:
rule:
input: ...
output: ...
resources:
mem_mb=100
shell:
"..."
If limits for the resources are given via the command line, e.g.
$ snakemake --resources mem_mb=100
the scheduler will ensure that the given resources are not exceeded by running jobs.
If no limits are given, the resources are ignored.
Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs), this allows controlling scheduling in various ways, e.g. limiting IO-heavy jobs by assigning an artificial IO resource to them and limiting it via the --resources flag.
Resources must be int values. Note that you are free to choose any names for the given resources. When defining memory constraints, it is however advised to use mem_mb, because there are Snakemake execution modes that make use of this information (e.g., when using Executing a Snakemake workflow via kubernetes).
Resources can also be callables that return int values. The signature of the callable has to be callable(wildcards [, input] [, threads] [, attempt]) (input, threads, and attempt are optional parameters). The parameter attempt allows adjusting resources based on how often the job has been restarted (see All Options, option --restart-times). This is handy when executing a Snakemake workflow in a cluster environment, where jobs can e.g. fail because of too limited resources. When Snakemake is executed with --restart-times 3, it will try to restart a failed job 3 times before it gives up. Thereby, the parameter attempt will contain the current attempt number (starting from 1). This can be used to adjust the required memory as follows:
rule:
input: ...
output: ...
resources:
mem_mb=lambda wildcards, attempt: attempt * 100
shell:
"..."
Here, the first attempt will require 100 MB memory, the second attempt will require 200 MB memory, and so on. When passing memory requirements to the cluster engine in this way, you can automatically try out larger nodes if it turns out to be necessary.
Messages¶
When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:
rule NAME:
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", "path/to/another/outputfile"
threads: 8
message: "Executing somecommand with {threads} threads on the following files {input}."
shell: "somecommand --threads {threads} {input} {output}"
Note that access to wildcards is also possible via the variable wildcards (e.g., {wildcards.sample}), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.
Priorities¶
Snakemake allows rules to specify numeric priorities:
rule:
input: ...
output: ...
priority: 50
shell: ...
Per default, each rule has a priority of 0. Any rule that specifies a higher priority, will be preferred by the scheduler over all rules that are ready to execute at the same time without having at least the same priority.
Furthermore, the --prioritize or -P command line flag allows specifying files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible. The --dryrun or -n option allows you to see the scheduling plan including the assigned priorities.
Log-Files¶
Each rule can specify a log file where information about the execution is written to:
rule abc:
input: "input.txt"
output: "output.txt"
log: "logs/abc.log"
shell: "somecommand --log {log} {input} {output}"
Log files can be used as input for other rules, just like any other output file. However, unlike output files, log files are not deleted upon error. This is obviously necessary in order to discover causes of errors which might become visible in the log file.
The variable log can be used inside a shell command to tell the used tool to which file to write the logging information. The log file has to use the same wildcards as output files, e.g.
log: "logs/abc.{dataset}.log"
For programs that do not have an explicit log parameter, you may always use 2> {log} to redirect standard error to a file (here, the log file) on Linux-based systems.
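For example (a minimal sketch), a tool that writes its diagnostics to standard error can be logged like this:
rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand {input} > {output} 2> {log}"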
Note that it is also supported to have multiple (named) log files being specified:
rule abc:
input: "input.txt"
output: "output.txt"
log: log1="logs/abc.log", log2="logs/xyz.log"
shell: "somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}"
Non-file parameters for rules¶
Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the params
keyword for this purpose:
rule:
input:
...
params:
prefix="somedir/{sample}"
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
The params keyword allows you to specify additional parameters depending on the wildcards values. This allows you to circumvent the need to use run: and Python code for non-standard commands like in the above case. Here, the command somecommand expects the prefix of the output file instead of the actual one. The params keyword helps here, since you cannot simply add the prefix as an output file (as the file won't be created, Snakemake would throw an error after execution of the rule). Furthermore, for enhanced readability and clarity, the params section is also an excellent place to name and assign parameters and variables for your subsequent command.
Similar to input, params can take functions as well (see Functions as Input Files), e.g. you can write
rule:
input:
...
params:
prefix=lambda wildcards, output: output[0][:-4]
output:
"somedir/{sample}.csv"
shell:
"somecommand -o {params.prefix}"
to get the same effect as above. Note that in contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. From the Python perspective, they can be seen as optional keyword arguments without a default value. Their order does not matter, apart from the fact that wildcards has to be the first argument. In the example above, this allows you to derive the prefix name from the output file.
External scripts¶
A rule can also point to an external script instead of a shell command or inline Python code, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"path/to/script.py"
The script path is always relative to the Snakefile (in contrast to the input and output file paths, which are relative to the working directory).
Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0] to access the first input file of the above rule.
Apart from Python scripts, this mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/outputfile",
"path/to/another/outputfile"
script:
"path/to/script.R"
In the R script, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].
An example external Python script could look like this:
def do_something(data_path, out_path, threads, myparam):
# python code
do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])
You can use the Python debugger from within the script if you invoke Snakemake with --debug
.
An equivalent script written in R would look like this:
do_something <- function(data_path, out_path, threads, myparam) {
# R code
}
do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])
To debug R scripts, you can save the workspace with save.image(), and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.
An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:
rule NAME:
input:
"path/to/inputfile",
"path/to/other/inputfile"
output:
"path/to/report.html",
script:
"path/to/report.Rmd"
In the R Markdown file you can insert output from an R command, and access variables stored in the S4 object named snakemake:
---
title: "Test Report"
author:
- "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
rmd: "report.Rmd"
output:
html_document:
highlight: tango
number_sections: no
theme: default
toc: yes
toc_depth: 3
toc_float:
collapsed: no
smooth_scroll: yes
---
## R Markdown
This is an R Markdown document.
Test include from snakemake `r snakemake@input`.
## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>
A link to the R Markdown document with the snakemake object can be inserted. Therefore, a variable called rmd needs to be added to the params section in the header of the report.Rmd file. The generated R Markdown file with the snakemake object will be saved in the file specified in this rmd variable. This file can be embedded into the HTML document using base64 encoding, and a link can be inserted as shown in the example above.
Also other input and output files can be embedded in this way to make a portable report. Note that the above method with a data URI only works for small files. An experimental technology to embed larger files is using Javascript Blob object.
Protected and Temporary Files¶
A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected
:
rule NAME:
input:
"path/to/inputfile"
output:
protected("path/to/outputfile")
shell:
"somecommand {input} {output}"
A protected file will be write-protected after the rule that produces it is completed.
Further, an output file marked as temp
is deleted after all rules that use it as an input are completed:
rule NAME:
input:
"path/to/inputfile"
output:
temp("path/to/outputfile")
shell:
"somecommand {input} {output}"
Ignoring timestamps¶
For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file.
This behavior can be overridden by marking an input file as ancient
.
The timestamp of such files is ignored and always assumed to be older than any of the output files:
rule NAME:
input:
ancient("path/to/inputfile")
output:
"path/to/outputfile"
shell:
"somecommand {input} {output}"
Here, this means that the file path/to/outputfile
will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future.
Note that any flag that forces re-creation of files still also applies to files marked as ancient
.
Shadow rules¶
Shadow rules cause each execution of the rule to be run in an isolated temporary directory. This “shadow” directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don’t want to manually clean up in your Snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or to simplify your workflow by not having to worry about unique filenames for all outputs of all rules.
By setting shadow: "shallow"
, the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting shadow: "full"
fully shadows the entire subdirectory structure of the current workdir. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by output
.
Shadow directories are stored one per rule execution in .snakemake/shadow/
, and are cleared on subsequent snakemake invocations unless the --keep-shadow
command line argument is used.
Typically, you will not need to modify your rule for compatibility with shadow
, unless you reference parent directories relative to your workdir in a rule.
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
shadow: "shallow"
shell: "somecommand --other_outputs other.txt {input} {output}"
Flag files¶
Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by “touching” empty files that denote that a certain task was completed. Snakemake supports this via the touch flag:
rule all:
input: "mytask.done"
rule mytask:
output: touch("mytask.done")
shell: "mycommand ..."
With the touch
flag, Snakemake touches (i.e. creates or updates) the file mytask.done
after mycommand
has finished successfully.
Job Properties¶
When executing a workflow on a cluster using the --cluster
parameter (see below), Snakemake creates a job script for each job to execute.
This script is then invoked using the provided cluster submission command (e.g. qsub
).
Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters.
As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script.
For convenience, there exists a parser function snakemake.utils.read_job_properties
that can be used to access the properties.
The following shows an example job submission wrapper:
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties["threads"]
# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
Dynamic Files¶
Snakemake provides experimental support for dynamic files. Dynamic files can be used whenever one has a rule for which the number of output files is unknown before the rule is executed. This is useful for example with certain clustering algorithms:
rule cluster:
input: "afile.csv"
output: dynamic("{clusterid}.cluster.csv")
run: ...
Now the results of the rule can be used in Snakemake although it does not know how many files will be present before executing the rule cluster, e.g. by:
rule all:
input: dynamic("{clusterid}.cluster.plot.pdf")
rule plot:
input: "{clusterid}.cluster.csv"
output: "{clusterid}.cluster.plot.pdf"
run: ...
Here, Snakemake determines the input files for the rule all after the rule cluster was executed, and then dynamically inserts jobs of the rule plot into the DAG to create the desired plots.
Functions as Input Files¶
Instead of specifying strings or lists of strings as input files, Snakemake can also make use of functions that return single input files or lists of input files:
def myfunc(wildcards):
return [... a list of input files depending on given wildcards ...]
rule:
input: myfunc
output: "someoutput.{somewildcard}.txt"
shell: "..."
The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files. Note that you can also use lambda expressions instead of full function definitions. By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.
Note that the function will be executed when the rule is evaluated and before the workflow actually starts to execute. Further note that using a function as input overrides the default mechanism of replacing wildcards with their values inferred from the output files. You have to take care of that yourself with the given wildcards object.
Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have. In particular, input files should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.
For a practical example, see the Snakemake Tutorial (Step 3: Input functions).
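As an illustration, the following sketch uses a hypothetical dictionary that maps sample names to files in different locations (the paths and rule name are made up):
# hypothetical mapping of a wildcard value to paths scattered over the filesystem
SAMPLE_TO_PATH = {
    "a": "/data/project1/a.fastq",
    "b": "/archive/old_runs/b.fastq"
}

def sample_input(wildcards):
    # the wildcards object carries the values inferred from the output file
    return SAMPLE_TO_PATH[wildcards.sample]

# equivalently, an inline lambda could be used:
#   input: lambda wildcards: SAMPLE_TO_PATH[wildcards.sample]
rule process:
    input: sample_input
    output: "results/{sample}.txt"
    shell: "somecommand {input} > {output}"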
Input Functions and unpack()
¶
In some cases, you might want to have your input functions return named input files.
This can be done by having them return dict()
objects with the names as the dict keys and the file names as the dict values and using the unpack()
keyword.
def myfunc(wildcards):
return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}
rule:
input: unpack(myfunc)
output: "someoutput.{token}.txt"
shell: "..."
Note that unpack() is only necessary for input functions returning a dict.
While it also works for list
, remember that lists (and nested lists) of strings are automatically flattened.
Also note that if you do not pass a function into the input list but instead call a function directly, you don’t use unpack() either.
Here, you can simply use Python’s double-star (**
) operator for unpacking the parameters.
Note that as Snakefiles are translated into Python for execution, the same rules as for using the star and double-star unpacking Python operators apply.
These restrictions do not apply when using unpack()
.
def myfunc1():
return ['foo.txt']
def myfunc2():
return {'foo': 'nowildcards.txt'}
rule:
input:
*myfunc1(),
**myfunc2(),
output: "..."
shell: "..."
Version Tracking¶
Rules can specify a version that is tracked by Snakemake together with the output files. When the version changes, Snakemake informs you when using the flag --summary
or --list-version-changes
.
The version can be specified by the version directive, which takes a string:
rule:
input: ...
output: ...
version: "1.0"
shell: ...
The version can of course also be filled with the output of a shell command, e.g.:
import subprocess
SOMECOMMAND_VERSION = subprocess.check_output("somecommand --version", shell=True).decode().strip()
rule:
version: SOMECOMMAND_VERSION
Alternatively, you might want to use file modification times in case of local scripts:
import os
SOMECOMMAND_VERSION = str(os.path.getmtime("path/to/somescript"))
rule:
version: SOMECOMMAND_VERSION
A re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-version-changes`
With the availability of the conda
directive (see Integrated Package Management)
the version
directive has become obsolete in favor of defining isolated
software environments that can be automatically deployed via the conda package
manager.
Code Tracking¶
Snakemake tracks the code that was used to create your files.
In combination with --summary
or --list-code-changes
this can be used to see what files may need a re-run because the implementation changed.
Re-run can be automated by invoking Snakemake as follows:
$ snakemake -R `snakemake --list-code-changes`
Onstart, onsuccess and onerror handlers¶
Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user).
With Snakemake 3.2.1, this is possible via the onsuccess
and onerror
keywords:
onsuccess:
print("Workflow finished, no error")
onerror:
print("An error occurred")
shell("mail -s "an error occurred" youremail@provider.com < {log}")
The onsuccess
handler is executed if the workflow finished without error. Else, the onerror
handler is executed.
In both handlers, you have access to the variable log
, which contains the path to a logfile with the complete Snakemake output.
Snakemake 3.6.0 adds an onstart handler that will be executed before the workflow starts.
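A minimal sketch:
onstart:
    print("Workflow is starting")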
Note that dry-runs do not trigger any of the handlers.
Rule dependencies¶
From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:
rule a:
input: "path/to/input"
output: "path/to/output"
shell: ...
rule b:
input: rules.a.output
output: "path/to/output/of/b"
shell: ...
Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also makes it possible to resolve dependencies that are ambiguous when using filenames.
Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:
rule a:
input: "path/to/input"
output: a = "path/to/output", b = "path/to/output2"
shell: ...
rule b:
input: rules.a.output.a
output: "path/to/output/of/b"
shell: ...
Handling Ambiguous Rules¶
When two rules can produce the same output file, Snakemake cannot decide by default which one to use. Hence an AmbiguousRuleException
is thrown.
The proposed strategy to deal with such ambiguity is to provide a ruleorder
for the conflicting rules, e.g.
ruleorder: rule1 > rule2 > rule3
Here, rule1
is preferred over rule2
and rule3
, and rule2
is preferred over rule3
.
Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file.
Note: ruleorder is not intended to bring rules into the correct execution order (this is solely guided by the names of input and output files you use); it only helps Snakemake decide which rule to use when multiple ones can create the same output file!
Alternatively, rule dependencies (see above) can also resolve ambiguities.
Another (quick and dirty) possibility is to tell Snakemake to allow ambiguity via a command line option:
$ snakemake --allow-ambiguity
so that, similar to GNU Make, the first matching rule is always used. In this case, a warning that summarizes the decision of Snakemake is printed to the terminal.
Local Rules¶
When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target-rule like all, see Targets). The keyword localrules allows you to mark a rule as local, so that it is not submitted to the cluster but instead executed on the host node:
localrules: all, foo
rule all:
input: ...
rule foo:
...
rule bar:
...
Here, only jobs from the rule bar
will be submitted to the cluster, whereas all and foo will be run locally.
Note that you can use the localrules directive multiple times. The result will be the union of all declarations.
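For example, the following two declarations together are equivalent to the single declaration above:
localrules: all
localrules: foo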
Benchmark Rules¶
Since version 3.1, Snakemake provides support for benchmarking the run times of rules. This can be used to create complex performance analysis pipelines. With the benchmark keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule
rule benchmark_command:
input:
"path/to/input.{sample}.txt"
output:
"path/to/output.{sample}.txt"
benchmark:
"benchmarks/somecommand/{sample}.tsv"
shell:
"somecommand {input} {output}"
benchmarks the CPU and wall clock time of the command somecommand
for the given output and input files.
For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark tsv file (which will contain a tab-separated table of run times and memory usage in MiB).
Per default, Snakemake executes the job once, generating one run time.
However, the benchmark file can be annotated with the desired number of repeats, e.g.,
rule benchmark_command:
input:
"path/to/input.{sample}.txt"
output:
"path/to/output.{sample}.txt"
benchmark:
repeat("benchmarks/somecommand/{sample}.tsv", 3)
shell:
"somecommand {input} {output}"
will instruct Snakemake to run each job of this rule three times and store all measurements in the benchmark file. The resulting tsv file can be used as input for other rules, just like any other output file.
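For instance, a downstream rule could aggregate the per-sample benchmark tables (a sketch; the rule name and sample list are hypothetical):
# a sketch: concatenate per-sample benchmark tables into one file
rule aggregate_benchmarks:
    input:
        expand("benchmarks/somecommand/{sample}.tsv", sample=["a", "b", "c"])
    output:
        "benchmarks/summary.tsv"
    shell:
        "cat {input} > {output}"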
Note
Note that benchmarking is only possible in a reliable fashion for subprocesses (thus for tasks run through the shell
, script
, and wrapper
directive).
In the run
block, the variable bench_record
is available that you can pass to shell()
as bench_record=bench_record
.
When using shell(..., bench_record=bench_record), the maximum across the measurements of all shell() calls will be recorded, while the running time covers the entire rule execution, including any Python code.
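A sketch of such a rule (the rule name is hypothetical):
rule benchmark_from_run:
    input:
        "path/to/input.{sample}.txt"
    output:
        "path/to/output.{sample}.txt"
    benchmark:
        "benchmarks/somecommand/{sample}.tsv"
    run:
        # Python preprocessing could happen here; it counts towards the running
        # time but not towards the shell() measurements
        shell("somecommand {input} {output}", bench_record=bench_record)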
Defining groups for execution¶
From Snakemake 5.0 on, it is possible to assign rules to groups. Such groups will be executed together in cluster or cloud mode, as a so-called group job, i.e., all jobs of a particular group will be submitted at once, to the same computing node. By this, queueing and execution time can be saved, in particular if one or several short-running rules are involved.
Groups can be defined via the group
keyword, e.g.,
samples = [1,2,3,4,5]
rule all:
input:
"test.out"
rule a:
output:
"a/{sample}.out"
group: "mygroup"
shell:
"touch {output}"
rule b:
input:
"a/{sample}.out"
output:
"b/{sample}.out"
group: "mygroup"
shell:
"touch {output}"
rule c:
input:
expand("b/{sample}.out", sample=samples)
output:
"test.out"
shell:
"touch {output}"
Here, jobs from rule a
and b
end up in one group mygroup
, whereas jobs from rule c
are executed separately.
Note that Snakemake always determines a connected subgraph with the same group id to be a group job.
Here, this means that, e.g., the jobs creating a/1.out
and b/1.out
will be in one group, and the jobs creating a/2.out
and b/2.out
will be in a separate group.
However, if we added group: "mygroup"
to rule c
, all jobs would end up in a single group, including the one spawned from rule c
, because c
connects all the other jobs.
Piped output¶
From Snakemake 5.0 on, it is possible to mark output files as pipes, via the pipe
flag, e.g.:
rule all:
input:
expand("test.{i}.out", i=range(2))
rule a:
output:
pipe("test.{i}.txt")
shell:
"for i in {{0..2}}; do echo {wildcards.i} >> {output}; done"
rule b:
input:
"test.{i}.txt"
output:
"test.{i}.out"
shell:
"grep {wildcards.i} < {input} > {output}"
If an output file is marked to be a pipe, then Snakemake will first create a named pipe with the given name and then execute the creating job simultaneously with the consuming job, inside a group job (see above). Naturally, a pipe output may only have a single consumer. It is possible to combine explicit group definition as above with pipe outputs. Thereby, pipe jobs can live within, or (automatically) extend existing groups. However, the two jobs connected by a pipe may not exist in conflicting groups.
Configuration¶
Snakemake allows you to use configuration files for making your workflows more flexible and also for abstracting away direct dependencies to a fixed HPC cluster scheduler.
Standard Configuration¶
Snakemake directly supports the configuration of your workflow. A configuration is provided as a JSON or YAML file and can be loaded with:
configfile: "path/to/config.json"
The config file can be used to define a dictionary of configuration parameters and their values. In the workflow, the configuration is accessible via the global variable config, e.g.
rule all:
input:
expand("{sample}.{param}.output.pdf", sample=config["samples"], param=config["yourparam"])
If the configfile statement is not used, the config variable provides an empty dictionary. In addition to the configfile statement, config values can be overwritten via the command line or the Snakemake API, e.g.:
$ snakemake --config yourparam=1.5
Further, you can manually alter the config dictionary using any Python code outside of your rules. Changes made from within a rule won’t be seen from other rules. Finally, you can use the --configfile command line argument to overwrite values from the configfile statement. Note that any values parsed into the config dictionary with any of the above mechanisms are merged, i.e., all keys defined via a configfile statement, or the --configfile and --config command line arguments, will end up in the final config dictionary, but if two methods define the same key, the command line overwrites the configfile statement.
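For example, a fallback value for a key can be injected with plain Python before the rules are defined (the key name is illustrative):
configfile: "config.yaml"

# config is an ordinary Python dictionary and can be altered outside of rules,
# e.g. to provide a fallback value for a key
config.setdefault("yourparam", 1.5)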
For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:
shell:
"mycommand {config[foo]} ..."
Tabular configuration¶
It is usually advisable to complement YAML based configuration (see above) by a sheet based approach for meta-data that is of tabular form. For example, such a sheet can contain per-sample information. With the Pandas library such data can be read and used with minimal overhead, e.g.,
import pandas as pd
samples = pd.read_table("samples.tsv").set_index("samples", drop=False)
reads in a table samples.tsv
in TSV format and makes every record accessible by the sample name.
For details, see the Pandas documentation.
A fully working real-world example containing both types of configuration can be found here.
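Such a table can then be combined with input functions, e.g. to look up per-sample file paths (a sketch assuming a hypothetical fq1 column in samples.tsv):
import pandas as pd

samples = pd.read_table("samples.tsv").set_index("samples", drop=False)

# look up the path stored in the (hypothetical) fq1 column for the current sample
rule trim:
    input:
        lambda wildcards: samples.loc[wildcards.sample, "fq1"]
    output:
        "trimmed/{sample}.fastq"
    shell:
        "somecommand {input} > {output}"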
Validation¶
With Snakemake 5.1, it is possible to validate both types of configuration via JSON schemas.
The function snakemake.utils.validate
takes a loaded configuration (a config dictionary or a Pandas data frame) and validates it with a given JSON schema.
Thereby, the schema can be provided in JSON or YAML format.
In case of the data frame, the schema should model the record that is expected in each row of the data frame.
In the following example,
import pandas as pd
from snakemake.utils import validate
configfile: "config.yaml"
validate(config, "config.schema.yaml")
samples = pd.read_table(config["samples"]).set_index("sample", drop=False)
validate(samples, "samples.schema.yaml")
rule all:
input:
expand("test.{sample}.txt", sample=samples.index)
rule a:
output:
"test.{sample}.txt"
shell:
"touch {output}"
the schema for validating the samples data frame looks like this:
$schema: "http://json-schema.org/draft-06/schema#"
description: an entry in the sample sheet
properties:
sample:
type: string
description: sample name/identifier
condition:
type: string
description: sample condition that will be compared during differential expression analysis (e.g. a treatment, a tissue time, a disease)
required:
- sample
- condition
Cluster Configuration¶
Snakemake supports a separate configuration file for execution on a cluster.
A cluster config file allows you to specify cluster submission parameters outside the Snakefile.
The cluster config is a JSON- or YAML-formatted file that contains objects that match names of rules in the Snakefile.
The parameters in the cluster config are then accessed by the cluster.*
wildcard when you are submitting jobs.
Note that a workflow shall never depend on a cluster configuration, because this would limit its portability.
Therefore, it is also not intended to access the cluster configuration from within the workflow.
For example, say that you have the following Snakefile:
rule all:
input: "input1.txt", "input2.txt"
rule compute1:
output: "input1.txt"
shell: "touch input1.txt"
rule compute2:
output: "input2.txt"
shell: "touch input2.txt"
This Snakefile can then be configured by a corresponding cluster config, say “cluster.json”:
{
"__default__" :
{
"account" : "my account",
"time" : "00:15:00",
"n" : 1,
"partition" : "core"
},
"compute1" :
{
"time" : "00:20:00"
}
}
Any string in the cluster configuration can be formatted in the same way as shell commands, e.g. {rule}.{wildcards.sample}
is formatted to a.xy
if the rulename is a
and the wildcard value is xy
.
Here __default__
is a special object that specifies default parameters, these will be inherited by the other configuration objects. The compute1
object here changes the time
parameter, but keeps the other parameters from __default__
. The rule compute2
does not have any configuration, and will therefore use the default configuration. You can then run the Snakefile with the following command on a SLURM system.
$ snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}"
For cluster systems using LSF/BSUB, a cluster config may look like this:
{
"__default__" :
{
"queue" : "medium_priority",
"nCPUs" : "16",
"memory" : 20000,
"resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\"",
"name" : "JOBNAME.{rule}.{wildcards}",
"output" : "logs/cluster/{rule}.{wildcards}.out",
"error" : "logs/cluster/{rule}.{wildcards}.err"
},
"trimming_PE" :
{
"memory" : 30000,
"resources" : "\"select[mem>30000] rusage[mem=30000] span[hosts=1]\"",
}
}
The advantage of this setup is that it is already pretty general by exploiting the wildcard possibilities that Snakemake provides via {rule}
and {wildcards}
. So job names, output and error files all have reasonable and trackable default names; only the directories (logs/cluster) and job names (JOBNAME) have to be adjusted accordingly.
If a rule named bamCoverage
is executed with the wildcard basename = sample1
, for example, the output and error files will be bamCoverage.basename=sample1.out
and bamCoverage.basename=sample1.err
, respectively.
Configure Working Directory¶
All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behaviour can be overridden by specifying a workdir in the snakefile:
workdir: "path/to/workdir"
Usually, it is preferred to only set the working directory via the command line, because the above directive limits the portability of Snakemake workflows.
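For example, assuming the --directory flag of your Snakemake version:
$ snakemake --directory path/to/workdir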
Modularization¶
Modularization in Snakemake comes at different levels.
- The most fine-grained level is wrappers. They are available and can be published at the Snakemake Wrapper Repository. These wrappers can then be composed and customized according to your needs, by copying skeleton rules into your workflow. In combination with conda integration, wrappers also automatically deploy the needed software dependencies into isolated environments.
- For larger, reusable parts that shall be integrated into a common workflow, it is recommended to write small Snakefiles and include them into a master Snakefile via the include statement. In such a setup, all rules share a common config file.
- The third level of separation is subworkflows. Importantly, these are rather meant as links between otherwise separate data analyses.
Wrappers¶
The wrapper directive allows to have re-usable wrapper scripts around e.g. command line tools.
In contrast to modularization strategies like include
or subworkflows, the wrapper directive allows to re-wire the DAG of jobs.
For example
rule samtools_sort:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.sorted.bam"
params:
"-m 4G"
threads: 8
wrapper:
"0.0.8/bio/samtools_sort"
Refers to the wrapper "0.0.8/bio/samtools_sort"
to create the output from the input.
Snakemake will automatically download the wrapper from the Snakemake Wrapper Repository.
Thereby, 0.0.8 can be replaced with the git version tag you want to use, or a commit id (see here).
This ensures reproducibility since changes in the wrapper implementation won’t be propagated automatically to your workflow.
Alternatively, e.g., for development, the wrapper directive can also point to full URLs, including URLs to local files with absolute paths file://
or relative paths file:
.
Examples for each wrapper can be found in the READMEs located in the wrapper subdirectories at the Snakemake Wrapper Repository.
The Snakemake Wrapper Repository is meant as a collaborative project and pull requests are very welcome.
Common-Workflow-Language (CWL) support¶
With Snakemake 4.8.0, it is possible to refer to CWL tool definitions in rules instead of specifying a wrapper or a plain shell command. A CWL tool definition can be used as follows.
rule samtools_sort:
input:
input="mapped/{sample}.bam"
output:
output_name="mapped/{sample}.sorted.bam"
params:
threads=lambda wildcards, threads: threads,
memory="4G"
threads: 8
cwl:
"https://github.com/common-workflow-language/workflows/blob/"
"fb406c95/tools/samtools-sort.cwl"
It is advisable to use a github URL that includes the commit as above instead of a branch name, in order to ensure reproducible results. Snakemake will execute the rule by invoking cwltool, which has to be available via your $PATH variable, and can be, e.g., installed via conda or pip. When used in combination with --use-singularity, Snakemake will instruct cwltool to execute the command via Singularity in user space. Otherwise, cwltool will in most cases use a Docker container, which requires Docker to be set up properly.
The advantage is that predefined tools available via any repository of CWL tool definitions can be used in any supporting workflow management system. In contrast to a Snakemake wrapper, CWL tool definitions are in general not suited to alter the behavior of a tool, e.g., by normalizing output names or special input handling. As you can see in comparison to the analogous wrapper declaration above, the rule becomes slightly more verbose, because input, output, and params have to be dispatched to the specific expectations of the CWL tool definition.
Includes¶
Another Snakefile with all its rules can be included into the current:
include: "path/to/other/snakefile"
The default target rule (often called the all
-rule), won’t be affected by the include.
I.e. it will always be the first rule in your Snakefile, no matter how many includes you have above your first rule.
Includes are relative to the directory of the Snakefile in which they occur.
For example, if above Snakefile resides in the directory my/dir
, then Snakemake will search for the include at my/dir/path/to/other/snakefile
, regardless of the working directory.
Sub-Workflows¶
In addition to including rules of another workflow, Snakemake allows to depend on the output of other workflows as sub-workflows. A sub-workflow is executed independently before the current workflow is executed. Thereby, Snakemake ensures that all files the current workflow depends on are created or updated if necessary. This allows to create links between otherwise separate data analyses.
subworkflow otherworkflow:
workdir: "../path/to/otherworkflow"
snakefile: "../path/to/otherworkflow/Snakefile"
rule a:
input: otherworkflow("test.txt")
output: ...
shell: ...
Here, the subworkflow is named “otherworkflow” and it is located in the working directory ../path/to/otherworkflow
.
The snakefile is in the same directory and called Snakefile
.
If snakefile
is not defined for the subworkflow, it is assumed to be located in the workdir location and called Snakefile
, hence, above we could have left the snakefile
keyword out as well.
If workdir
is not specified, it is assumed to be the same as the current one.
Files that are output from the subworkflow that we depend on are marked with the otherworkflow
function (see the input of rule a).
This function automatically determines the absolute path to the file (here ../path/to/otherworkflow/test.txt
).
When executing, snakemake first tries to create (or update, if necessary) test.txt
(and all other possibly mentioned dependencies) by executing the subworkflow.
Then the current workflow is executed.
This can also happen recursively, since the subworkflow may have its own subworkflows as well.
Remote files¶
Remote files are supported in Snakemake versions >=3.5.
The Snakefile
supports a wrapper function, remote()
, indicating a file is on a remote storage provider (this is similar to temp()
or protected()
). In order to use all types of remote files, the Python packages boto
, moto
, filechunkio
, pysftp
, dropbox
, requests
, ftputil
, XRootD
, and biopython
must be installed.
During rule execution, a remote file (or object) specified is downloaded to the local cwd
, within a sub-directory bearing the same name as the remote provider. This sub-directory naming lets you have multiple remote origins with reduced likelihood of name collisions, and allows Snakemake to easily translate remote objects to local file paths. You can think of each local remote sub-directory as a local mirror of the remote system. The remote()
wrapper is mutually-exclusive with the temp()
and protected()
wrappers.
Snakemake includes the following remote providers, supported by the corresponding classes:
- Amazon Simple Storage Service (AWS S3):
snakemake.remote.S3
- Google Cloud Storage (GS):
snakemake.remote.GS
- File transfer over SSH (SFTP):
snakemake.remote.SFTP
- Read-only web (HTTP[S]):
snakemake.remote.HTTP
- File transfer protocol (FTP):
snakemake.remote.FTP
- Dropbox:
snakemake.remote.dropbox
- XRootD:
snakemake.remote.XRootD
- GenBank / NCBI Entrez:
snakemake.remote.NCBI
- WebDAV:
snakemake.remote.webdav
- GFAL:
snakemake.remote.gfal
- GridFTP:
snakemake.remote.gridftp
- iRODS:
snakemake.remote.iRODS
Amazon Simple Storage Service (S3)¶
This section describes usage of the S3 RemoteProvider, and also provides an intro to remote files and their usage.
It is important to note that you must have credentials (access_key_id
and secret_access_key
) which permit read/write access. If a file only serves as input to a Snakemake rule, read access is sufficient. You may specify credentials as environment variables or in the file ~/.aws/credentials
, prefixed with AWS_*
, as with a standard boto config. Credentials may also be explicitly listed in the Snakefile
, as shown below:
For the Amazon S3 and Google Cloud Storage providers, the sub-directory used must be the bucket name.
Using remote files is easy (AWS S3 shown):
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input:
S3.remote("bucket-name/file.txt")
Expand still works as expected, just wrap the expansion:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider()
rule all:
input:
S3.remote(expand("bucket-name/{letter}-2.txt", letter=["A", "B", "C"]))
It is possible to use S3-compatible storage by specifying a different endpoint address as the host kwarg in the provider, as the kwargs used in instantiating the provider are passed in to boto:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET", host="mystorage.example.com")
rule all:
input:
S3.remote("bucket-name/file.txt")
Only remote files needed to satisfy the DAG build are downloaded for the workflow. By default, remote files are downloaded prior to rule execution and are removed locally as soon as no rules depend on them. Remote files can be explicitly kept by setting the keep_local=True
keyword argument:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input: S3.remote('bucket-name/prefix{split_id}.txt', keep_local=True)
If you wish to have a rule to simply download a file to a local copy, you can do so by declaring the same file path locally as is used by the remote file:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
rule all:
input:
S3.remote("bucket-name/out.txt")
output:
"bucket-name/out.txt"
run:
shell("cp {output[0]} ./")
In some cases the rule can use the data directly on the remote provider; in these cases, stay_on_remote=True
can be set to avoid downloading/uploading data unnecessarily. Additionally, if the backend supports it, any potentially corrupt output files will be removed from the remote. The default for stay_on_remote
and keep_local
can be configured by setting these properties on the remote provider object:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET", keep_local=True, stay_on_remote=True)
The remote provider also supports a new glob_wildcards()
(see How do I run my rule on all files of a certain directory?) which acts the same as the local version of glob_wildcards()
, but for remote files:
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")
S3.glob_wildcards("bucket-name/{file_prefix}.txt")
# (the result looks just as if the local glob_wildcards() function were used on a local folder called "bucket-name")
If the AWS CLI is installed it is possible to configure your keys globally. This removes the necessity of hardcoding the keys in the Snakefile. The interactive AWS credentials setup can be done using the following command:
aws configure
S3 then can be used without the keys.
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider()
Google Cloud Storage (GS)¶
Usage of the GS provider is the same as the S3 provider.
For authentication, one simply needs to login via the gcloud
tool before
executing Snakemake, i.e.:
$ gcloud auth application-default login
In the Snakefile, no additional authentication information has to be provided:
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
rule all:
input:
GS.remote("bucket-name/file.txt")
File transfer over SSH (SFTP)¶
Snakemake can use files on remote servers accessible via SFTP (i.e. most *nix servers).
It uses pysftp for the underlying support of SFTP, so the same connection options exist.
Assuming you have SSH keys already set up for the server you are using in the Snakefile
, usage is simple:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
The remote file addresses used must be specified with the host (domain or IP address) and the absolute path to the file on the remote server. A port may be specified if the SSH daemon on the server is listening on a port other than 22, in either the RemoteProvider
or in each instance of remote()
:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(port=4040)
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
rule all:
input:
SFTP.remote("example.com:4040/path/to/file.bam")
The standard keyword arguments used by pysftp may be provided to the RemoteProvider to specify credentials (either password or private key):
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", private_key="/Users/myusername/.ssh/particular_id_rsa")
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")
rule all:
input:
SFTP.remote("example.com/path/to/file.bam")
If you share credentials between servers but connect to one on a different port, the alternate port may be specified in the remote()
wrapper:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider(username="myusername", password="mypassword")
rule all:
input:
SFTP.remote("some-example-server-1.com/path/to/file.bam"),
SFTP.remote("some-example-server-2.com:2222/path/to/file.bam")
There is a glob_wildcards()
function:
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()
SFTP.glob_wildcards("example.com/path/to/{sample}.bam")
Read-only web (HTTP[s])¶
Snakemake can access web resources via a read-only HTTP(S) provider. This provider can be helpful for including public web data in a workflow.
Web addresses must be specified without protocol, so if your URI looks like this:
http://server3.example.com/path/to/myfile.tar.gz
The URI used in the Snakefile
must look like this:
server3.example.com/path/to/myfile.tar.gz
It is straightforward to use the HTTP provider to download a file to the cwd:
import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com/path/to/document.pdf", keep_local=True)
run:
outputName = os.path.basename(input[0])
shell("mv {input} {outputName}")
To connect on a different port, specify the port as part of the URI string:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com:8080/path/to/document.pdf", keep_local=True)
By default, the HTTP provider always uses HTTPS (TLS). If you need to connect to a resource with regular HTTP (no TLS), you must explicitly include insecure
as a kwarg
to remote()
:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("www.example.com/path/to/document.pdf", insecure=True, keep_local=True)
If the URI used includes characters not permitted in a local file path, you may include them as part of the additional_request_string
in the kwargs
for remote()
. This may also be useful for including additional parameters you do not want to be part of the local filename (since the URI string becomes the local file name).
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/query.php", additional_request_string="?range=2;3")
If the file requires authentication, you can specify a username and password for HTTP Basic Auth with the Remote Provider, or with each instance of remote().
For different types of authentication, you can pass in a Python requests.auth object (see here) via the auth kwarg.
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
HTTP.remote("example.com/interactive.php", keep_local=True)
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/interactive.php", username="myusername", password="mypassword", keep_local=True)
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
input:
HTTP.remote("example.com/interactive.php", auth=requests.auth.HTTPDigestAuth("myusername", "mypassword"), keep_local=True)
Since remote servers do not present directory contents uniformly, glob_wildcards()
is not supported by the HTTP provider.
File Transfer Protocol (FTP)¶
Snakemake can work with files stored on regular FTP. Currently supported are authenticated FTP and anonymous FTP, excluding FTP via TLS.
Usage is similar to the SFTP provider; however, the paths specified are relative to the FTP home directory (since this is typically a chroot):
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
FTP.remote("example.com/rel/path/to/file.tar.gz")
The port may be specified in either the provider, or in each instance of remote():
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword", port=2121)
rule all:
input:
FTP.remote("example.com/rel/path/to/file.tar.gz")
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
rule all:
input:
FTP.remote("example.com:2121/rel/path/to/file.tar.gz")
Anonymous download of FTP resources is possible:
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()
rule all:
input:
# only keeping the file so we can move it out to the cwd
FTP.remote("example.com/rel/path/to/file.tar.gz", keep_local=True)
run:
shell("mv {input} ./")
glob_wildcards() is also supported:
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider(username="myusername", password="mypassword")
print(FTP.glob_wildcards("example.com/somedir/{file}.txt"))
Setting immediate_close=True allows the use of a large number of remote FTP input files in a job where the endpoint server limits the number of concurrent connections. When immediate_close=True, Snakemake will terminate FTP connections after each remote file action (exists(), size(), download(), mtime(), etc.). This is in contrast to the default behavior, which caches FTP details and leaves the connection open across actions to improve performance (closing the connection upon job termination):
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()
rule all:
input:
# only keep the file so we can move it out to the cwd
# This server limits the number of concurrent connections so we need to have Snakemake close each after each FTP action.
FTP.remote(expand("ftp.example.com/rel/path/to/{file}", file=large_list), keep_local=True, immediate_close=True)
run:
shell("mv {input} ./")
Dropbox¶
The Dropbox remote provider allows you to upload and download from your Dropbox account without having the client installed on your machine. In order to use the provider you first need to register an “app” on the Dropbox developer website, with access to the Full Dropbox. After registering, generate an OAuth2 access token. You will need the token to use the Snakemake Dropbox remote provider.
Using the Dropbox provider is straightforward:
from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")
rule all:
input:
DBox.remote("path/to/input.txt")
glob_wildcards()
is supported:
from snakemake.remote.dropbox import RemoteProvider as DropboxRemoteProvider
DBox = DropboxRemoteProvider(oauth2_access_token="mytoken")
DBox.glob_wildcards("path/to/{title}.txt")
Note that Dropbox paths are case-insensitive.
XRootD¶
Snakemake can be used with XRootD backed storage provided the python bindings are installed.
This is typically most useful when combined with the stay_on_remote
flag to minimise local storage requirements.
This flag can be overridden on a file by file basis as described in the S3 remote. Additionally glob_wildcards()
is supported:
from snakemake.remote.XRootD import RemoteProvider as XRootDRemoteProvider
XRootD = XRootDRemoteProvider(stay_on_remote=True)
file_numbers = XRootD.glob_wildcards("root://eospublic.cern.ch//eos/opendata/lhcb/MasterclassDatasets/D0lifetime/2014/mclasseventv2_D0_{n}.root")
rule all:
input:
XRootD.remote(expand("local_data/mclasseventv2_D0_{n}.root", n=file_numbers))
rule make_data:
input:
XRootD.remote("root://eospublic.cern.ch//eos/opendata/lhcb/MasterclassDatasets/D0lifetime/2014/mclasseventv2_D0_{n}.root")
output:
'local_data/mclasseventv2_D0_{n}.root'
shell:
'xrdcp {input[0]} {output[0]}'
GenBank / NCBI Entrez¶
Snakemake can directly source input files from GenBank and other NCBI Entrez databases if the Biopython library is installed.
from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
NCBI = NCBIRemoteProvider(email="someone@example.com") # email required by NCBI to prevent abuse
rule all:
input:
"size.txt"
rule download_and_count:
input:
NCBI.remote("KY785484.1.fasta", db="nuccore")
output:
"size.txt"
run:
shell("wc -c {input} > {output}")
The output format and source database of a record retrieved from GenBank are inferred from the file extension specified. For example, NCBI.RemoteProvider().remote("KY785484.1.fasta", db="nuccore")
will download a FASTA file while NCBI.RemoteProvider().remote("KY785484.1.gb", db="nuccore")
will download a GenBank-format file. If the options are ambiguous, Snakemake will raise an exception and inform the user of possible format choices. To see available formats, consult the Entrez EFetch documentation. To view the valid file extensions for these formats, access NCBI.RemoteProvider()._gb.valid_extensions
, or instantiate an NCBI.NCBIHelper()
and access NCBI.NCBIHelper().valid_extensions
(this is a property).
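Following the attribute names given above, the available extensions could be inspected like this (a sketch):
from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider

NCBI = NCBIRemoteProvider(email="someone@example.com")
# print the file extensions (and hence formats) the provider understands
print(NCBI._gb.valid_extensions)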
When used in conjunction with NCBI.RemoteProvider().search()
, Snakemake and NCBI.RemoteProvider().remote()
can be used to find accessions by query and download them:
from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
NCBI = NCBIRemoteProvider(email="someone@example.com") # email required by NCBI to prevent abuse
# get accessions for the first 3 results in a search for full-length Zika virus genomes
# the query parameter accepts standard GenBank search syntax
query = '"Zika virus"[Organism] AND (("9000"[SLEN] : "20000"[SLEN]) AND ("2017/03/20"[PDAT] : "2017/03/24"[PDAT])) '
accessions = NCBI.search(query, retmax=3)
# give the accessions a file extension to help the RemoteProvider determine the
# proper output type.
input_files = expand("{acc}.fasta", acc=accessions)
rule all:
input:
"sizes.txt"
rule download_and_count:
input:
# Since *.fasta files could come from several different databases, specify the database here.
# if the input files are ambiguous, the provider will alert the user with possible options
# standard options like "seq_start" are supported
NCBI.remote(input_files, db="nuccore", seq_start=5000)
output:
"sizes.txt"
run:
shell("wc -c {input} > sizes.txt")
Normally, all accessions for a query are returned from NCBI.RemoteProvider.search()
. To truncate the results, specify retmax=<desired_number>
. Standard Entrez fetch query options are supported as kwargs, and may be passed in to NCBI.RemoteProvider.remote()
and NCBI.RemoteProvider.search()
.
WebDAV¶
WebDAV support is currently experimental
and available in Snakemake 4.0 and later.
Snakemake supports reading and writing WebDAV remote files. The protocol defaults to https://
, but insecure connections
can be used by specifying protocol="http://"
. Similarly, the port defaults to 443, and can be overridden by specifying port=##
or by including the port as part of the file address.
from snakemake.remote import webdav
webdav = webdav.RemoteProvider(username="test", password="test", protocol="http://")
rule a:
input:
webdav.remote("example.com:8888/path/to/input_file.csv"),
shell:
# do something
GFAL¶
GFAL support is available in Snakemake 4.1 and later.
Snakemake supports reading and writing remote files via the GFAL command line client (gfal-* commands). By this, it supports various grid storage protocols like GridFTP. In general, if you are able to use the gfal-* commands directly, Snakemake support for GFAL will work as well.
from snakemake.remote import gfal
gfal = gfal.RemoteProvider(retry=5)
rule a:
input:
gfal.remote("gridftp.grid.sara.nl:2811/path/to/infile.txt")
output:
gfal.remote("gridftp.grid.sara.nl:2811/path/to/outfile.txt")
shell:
# do something
Authentication has to be setup in the system, e.g. via certificates in the .globus
directory.
Usually, this is already the case and no action has to be taken.
The retry keyword argument to the remote provider allows you to set the number of retries (10 by default) in case of failed commands (the GRID is usually relatively unreliable).
The latter may be unsupported depending on the system configuration.
Note that GFAL support used together with the flags --no-shared-fs
and --default-remote-provider
enables you
to transparently use Snakemake in a grid computing environment without a shared network filesystem.
For an example see the surfsara-grid configuration profile.
GridFTP¶
GridFTP support is available in Snakemake 4.3.0 and later.
As a more specialized alternative to the GFAL remote provider, Snakemake provides a GridFTP remote provider. This provider only supports the GridFTP protocol. Internally, it uses the globus-url-copy command for downloads and uploads, while all other tasks are delegated to the GFAL remote provider.
from snakemake.remote import gridftp
gridftp = gridftp.RemoteProvider(retry=5)
rule a:
input:
gridftp.remote("gridftp.grid.sara.nl:2811/path/to/infile.txt")
output:
gridftp.remote("gridftp.grid.sara.nl:2811/path/to/outfile.txt")
shell:
# do something
Authentication has to be setup in the system, e.g. via certificates in the .globus
directory.
Usually, this is already the case and no action has to be taken.
The retry keyword argument to the remote provider allows you to set the number of retries (10 by default) in case of failed commands (the GRID is usually relatively unreliable).
The latter may be unsupported depending on the system configuration.
Note that GridFTP support used together with the flags --no-shared-fs
and --default-remote-provider
enables you
to transparently use Snakemake in a grid computing environment without a shared network filesystem.
For an example see the surfsara-grid configuration profile.
Remote cross-provider transfers¶
It is possible to use Snakemake to transfer files between remote providers (using the local machine as an intermediary), as long as the sub-directory (bucket) names differ:
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
GS = GSRemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")
S3 = S3RemoteProvider(access_key_id="MYACCESSKEYID", secret_access_key="MYSECRETACCESSKEY")
fileList, = S3.glob_wildcards("source-bucket/{file}.bam")
rule all:
input:
GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
rule transfer_S3_to_GS:
input:
S3.remote( expand("source-bucket/{file}.bam", file=fileList) )
output:
GS.remote( expand("destination-bucket/{file}.bam", file=fileList) )
run:
shell("cp {input} {output}")
iRODS¶
You can access an iRODS server to retrieve data from and upload data to it.
If your iRODS server is not set to a specific timezone, it uses UTC. In that case, it is advised to shift the modification time provided by iRODS (modify_time) to your timezone by providing the timezone parameter, so that timestamps coming from iRODS are converted to the correct time.
iRODS actually does not save the timestamp from your original file but creates
its own timestamp of the upload time. When iRODS downloads the file for
processing, it does not take the timestamp from the remote file. Instead,
the file will have the timestamp when it was downloaded. To get around this,
we create a metadata entry to store the original file timestamp from your system
and alter the timestamp of the downloaded file accordingly. While uploading,
the metadata entries atime
, ctime
and mtime
are added. When this
entry does not exist (because this module didn’t upload the file), we fall back
to the timestamp provided by iRODS with the above mentioned strategy.
To access the iRODS server, you need to have an iRODS environment configuration file available, and in this file the authentication needs to be configured. The iRODS configuration file can be created by following the official instructions.
The default location for the configuration file is
~/.irods/irods_environment.json
. The RemoteProvider()
class accepts
the parameter irods_env_file
where an alternative path to the
irods_environment.json
file can be specified. Another way is to export the
environment variable IRODS_ENVIRONMENT_FILE
in your shell to specify the
location.
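For example, the alternative location can be exported in the shell before invoking Snakemake:
$ export IRODS_ENVIRONMENT_FILE=/path/to/irods_environment.json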
There are several ways to configure the authentication against the iRODS
server, depending on what your iRODS server offers. If you are using the
authentication via password, the default location of the authentication file is
~/.irods/.irodsA
. Usually this file is generated with the iinit
command
from the iCommands
program suite. Inside the irods_environment.json
file, the parameter "irods_authentication_file"
can be set to specify an
alternative location for the .irodsA
file. Another possibility to change
the location is to export the environment variable
IRODS_AUTHENTICATION_FILE
.
The glob_wildcards()
function is supported.
from snakemake.remote.iRODS import RemoteProvider
irods = RemoteProvider(irods_env_file='setup-data/irods_environment.json',
timezone="Europe/Berlin") # all parameters are optional
# please note the comma after the variable name!
# access: irods.remote(expand('home/rods/{f}', f=files))
files, = irods.glob_wildcards('home/rods/{files}')
rule all:
input:
irods.remote('home/rods/testfile.out'),
rule gen:
input:
irods.remote('home/rods/testfile.in')
output:
irods.remote('home/rods/testfile.out')
shell:
r"""
touch {output}
"""
An example for the iRODS configuration file (irods_environment.json
):
{
"irods_host": "localhost",
"irods_port": 1247,
"irods_user_name": "rods",
"irods_zone_name": "tempZone",
"irods_authentication_file": "setup-data/.irodsA"
}
Please note that the zone
folder is not included in the path as it will be
taken from the configuration file. The path also must not start with a /
.
By default, temporarily stored local files are removed. You can nevertheless specify the parameter overwrite to tell iRODS to overwrite existing files when downloading, because iRODS complains if a local file already exists when a download attempt is issued (uploading is not a problem, though).
In the Snakemake source directory in snakemake/tests/test_remote_irods
you
can find a working example.
Utils¶
The module snakemake.utils
provides a collection of helper functions for common tasks in Snakemake workflows. Details can be found in Additional utils.
Distribution and Reproducibility¶
It is recommended to store each workflow in a dedicated git repository of the following structure:
├── .gitignore
├── README.md
├── LICENSE.md
├── config.yaml
├── scripts
│ ├── script1.py
│ └── script2.R
├── envs
│ └── myenv.yaml
└── Snakefile
Then, a workflow can be deployed to a new system via the following steps:
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# execute workflow, deploy software dependencies via conda
snakemake -n --use-conda
Importantly, git branching and pull requests can be used to modify and possibly re-integrate workflows. A cookiecutter template for creating this structure can be found here. Given that cookiecutter is installed, you can use it via:
cookiecutter gh:snakemake-workflows/cookiecutter-snakemake-workflow
Visit the Snakemake Workflows Project for best-practice workflows.
Integrated Package Management¶
With Snakemake 3.9.0 it is possible to define isolated software environments per rule.
Upon execution of a workflow, the Conda package manager is used to obtain and deploy the defined software packages in the specified versions. Packages will be installed into your working directory, without requiring any admin/root privileges.
Given that conda is available on your system (see Miniconda), to use the Conda integration, add the --use-conda
flag to your workflow execution command, e.g. snakemake --cores 8 --use-conda
.
When --use-conda
is activated, Snakemake will automatically create software environments for any used wrapper (see Wrappers).
Further, you can manually define environments via the conda
directive, e.g.:
rule NAME:
input:
"table.txt"
output:
"plots/myplot.pdf"
conda:
"envs/ggplot.yaml"
script:
"scripts/plot-stuff.R"
with the following environment definition:
channels:
- r
dependencies:
- r=3.3.1
- r-ggplot2=2.1.0
Snakemake will store the environment persistently in .snakemake/conda/$hash
with $hash
being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected.
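The naming scheme can be illustrated with a simplified sketch (this only conveys the idea, not Snakemake's exact implementation):
import hashlib
def env_hash(env_file):
    # MD5 over the environment definition file content; editing
    # envs/ggplot.yaml therefore yields a new hash and a fresh environment
    with open(env_file, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
# environments end up under .snakemake/conda/<hash>
print(env_hash("envs/ggplot.yaml"))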
Note that you need to clean up environments manually for now. However, in many cases they are lightweight and consist of symlinks to your central conda installation.
Running jobs in containers¶
As an alternative to using Conda (see above), it is possible to define, for each rule, a docker or singularity container to use, e.g.,
rule NAME:
input:
"table.txt"
output:
"plots/myplot.pdf"
singularity:
"docker://joseespinosa/docker-r-ggplot2"
script:
"scripts/plot-stuff.R"
When executing Snakemake with
snakemake --use-singularity
it will execute the job within a singularity container that is spawned from the given image.
Allowed image URLs include everything supported by singularity (e.g., shub://
and docker://
).
When --use-singularity
is combined with --kubernetes
(see Executing a Snakemake workflow via kubernetes), cloud jobs will be automatically configured to run in privileged mode, because this is a current requirement of the singularity executable.
Importantly, those privileges won’t be shared by the actual code that is executed in the singularity container though.
Combining Conda package management with containers¶
While Integrated Package Management provides control over the used software in exactly the desired versions, it does not control the underlying operating system. Here, it comes in handy that Snakemake >=4.8.0 allows combining Conda-based package management with Running jobs in containers. For example, you can write
singularity: "docker://continuumio/miniconda3:4.4.10"
rule NAME:
input:
"table.txt"
output:
"plots/myplot.pdf"
conda:
"envs/ggplot.yaml"
script:
"scripts/plot-stuff.R"
In other words, a global definition of a container image can be combined with a per-rule conda directive. Then, upon invocation with
snakemake --use-conda --use-singularity
Snakemake will first pull the defined container image, and then create the requested conda environment from within the container. The conda environments will still be stored in your working environment, such that they don’t have to be recreated unless they have changed. The hash under which the environments are stored includes the used container image url, such that changes to the container image also lead to new environments to be created. When a job is executed, Snakemake will first enter the container and then activate the conda environment.
By this, both packages and OS can be easily controlled without the overhead of creating and distributing specialized container images. Of course, it is also possible (though less common) to define a container image per rule in this scenario.
The user can, upon execution, freely choose the desired level of reproducibility:
- no package management (use whatever is on the system)
- Conda based package management (use versions defined by the workflow developer)
- Conda based package management in containerized OS (use versions and OS defined by the workflow developer)
Sustainable and reproducible archiving¶
With Snakemake 3.10.0 it is possible to archive a workflow into a tarball (.tar, .tar.gz, .tar.bz2, .tar.xz), via
snakemake --archive my-workflow.tar.gz
If the above layout is followed, this will archive any code and config files that are under git version control. Further, all input files will be included in the archive. Finally, the software packages of each defined conda environment are included. This results in a self-contained workflow archive that can be re-executed on a vanilla machine that only has Conda and Snakemake installed via
tar -xf my-workflow.tar.gz
snakemake -n
Note that the archive is platform specific. For example, if created on Linux, it will run on any Linux newer than the minimum version that has been supported by the used Conda packages at the time of archiving (e.g. CentOS 6).
A useful pattern when publishing data analyses is to create such an archive, upload it to Zenodo and thereby obtain a DOI. Then, the DOI can be cited in manuscripts, and readers are able to download and reproduce the data analysis at any time in the future.
Reports¶
From Snakemake 5.1 on, it is possible to automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.
For the latter, the Snakefile has to be annotated with additional information.
Each output file that shall be part of the report has to be marked with the report
flag, which optionally points to a caption in restructured text format.
Moreover, a global workflow description can be defined via the report
directive.
Consider the following example:
report: "report/workflow.rst"
rule all:
input:
["fig1.svg", "fig2.png"]
rule c:
output:
"test.{i}.out"
singularity:
"docker://continuumio/miniconda3:4.4.10"
conda:
"envs/test.yaml"
shell:
"sleep `shuf -i 1-3 -n 1`; touch {output}"
rule a:
input:
expand("test.{i}.out", i=range(10))
output:
report("fig1.svg", caption="report/fig1.rst")
shell:
"sleep `shuf -i 1-3 -n 1`; cp data/fig1.svg {output}"
rule b:
input:
expand("test.{i}.out", i=range(10))
output:
report("fig2.png", caption="report/fig2.rst")
shell:
"sleep `shuf -i 1-3 -n 1`; cp data/fig2.png {output}"
As can be seen, we define a global description which is contained in the file report/workflow.rst
.
In addition, we mark fig1.svg
and fig2.png
for inclusion into the report, while in both cases specifying a caption text via again referring to a restructured text file.
Note the paths to the .rst
-files are interpreted relative to the current Snakefile.
Inside the .rst
-files you can use Jinja2 templating to access context information.
In case of the global description, you can access the config dictionary via {{ snakemake.config }}
, (e.g., use {{ snakemake.config["mykey"] }}
to access the key mykey
).
In case of output files, you can access the same values as available with the script directive.
To create the report simply run
snakemake --report report.html
after your workflow has finished.
All other information contained in the report (e.g. runtime statistics) is automatically collected during creation.
These statistics are obtained from the metadata that is stored in the .snakemake
directory inside your working directory.
The report for the above example can be found here
.
Note that the report can be restricted to particular jobs and results by specifying targets at the command line, analogous to normal Snakemake execution. For example, with
snakemake fig1.svg --report report-short.html
the report contains only fig1.svg
.
The Snakemake API¶
-
snakemake.
snakemake
(snakefile, report=None, listrules=False, list_target_rules=False, cores=1, nodes=1, local_cores=1, resources={}, config={}, configfile=None, config_args=None, workdir=None, targets=None, dryrun=False, touch=False, forcetargets=False, forceall=False, forcerun=[], until=[], omit_from=[], prioritytargets=[], stats=None, printreason=False, printshellcmds=False, debug_dag=False, printdag=False, printrulegraph=False, printd3dag=False, nocolor=False, quiet=False, keepgoing=False, cluster=None, cluster_config=None, cluster_sync=None, drmaa=None, drmaa_log_dir=None, jobname='snakejob.{rulename}.{jobid}.sh', immediate_submit=False, standalone=False, ignore_ambiguity=False, snakemakepath=None, lock=True, unlock=False, cleanup_metadata=None, cleanup_conda=False, cleanup_shadow=False, force_incomplete=False, ignore_incomplete=False, list_version_changes=False, list_code_changes=False, list_input_changes=False, list_params_changes=False, list_untracked=False, list_resources=False, summary=False, archive=None, delete_all_output=False, delete_temp_output=False, detailed_summary=False, latency_wait=3, wait_for_files=None, print_compilation=False, debug=False, notemp=False, keep_remote_local=False, nodeps=False, keep_target_files=False, allowed_rules=None, jobscript=None, timestamp=False, greediness=None, no_hooks=False, overwrite_shellcmd=None, updated_files=None, log_handler=None, keep_logger=False, max_jobs_per_second=None, max_status_checks_per_second=100, restart_times=0, attempt=1, verbose=False, force_use_threads=False, use_conda=False, use_singularity=False, singularity_args='', conda_prefix=None, list_conda_envs=False, singularity_prefix=None, create_envs_only=False, mode=0, wrapper_prefix=None, kubernetes=None, kubernetes_envvars=None, container_image=None, default_remote_provider=None, default_remote_prefix='', assume_shared_fs=True, cluster_status=None)[source]¶ Run snakemake on a given snakefile.
This function provides access to the whole snakemake functionality. It is not thread-safe.
Parameters: - snakefile (str) – the path to the snakefile
- report (str) – create an HTML report for a previous run at the given path
- listrules (bool) – list rules (default False)
- list_target_rules (bool) – list target rules (default False)
- cores (int) – the number of provided cores (ignored when using cluster support) (default 1)
- nodes (int) – the number of provided cluster nodes (ignored without cluster support) (default 1)
- local_cores (int) – the number of provided local cores if in cluster mode (ignored without cluster support) (default 1)
- resources (dict) – provided resources, a dictionary assigning integers to resource names, e.g. {gpu=1, io=5} (default {})
- config (dict) – override values for workflow config
- workdir (str) – path to working directory (default None)
- targets (list) – list of targets, e.g. rule or file names (default None)
- dryrun (bool) – only dry-run the workflow (default False)
- touch (bool) – only touch all output files if present (default False)
- forcetargets (bool) – force given targets to be re-created (default False)
- forceall (bool) – force all output files to be re-created (default False)
- forcerun (list) – list of files and rules that shall be re-created/re-executed (default [])
- prioritytargets (list) – list of targets that shall be run with maximum priority (default [])
- stats (str) – path to file that shall contain stats about the workflow execution (default None)
- printreason (bool) – print the reason for the execution of each job (default false)
- printshellcmds (bool) – print the shell command of each job (default False)
- printdag (bool) – print the dag in the graphviz dot language (default False)
- printrulegraph (bool) – print the graph of rules in the graphviz dot language (default False)
- printd3dag (bool) – print a D3.js compatible JSON representation of the DAG (default False)
- nocolor (bool) – do not print colored output (default False)
- quiet (bool) – do not print any default job information (default False)
- keepgoing (bool) – keep going upon errors (default False)
- cluster (str) – submission command of a cluster or batch system to use, e.g. qsub (default None)
- cluster_config (str,list) – configuration file for cluster options, or list thereof (default None)
- cluster_sync (str) – blocking cluster submission command (like SGE ‘qsub -sync y’) (default None)
- drmaa (str) – if not None use DRMAA for cluster support, str specifies native args passed to the cluster when submitting a job
- drmaa_log_dir (str) – the path to stdout and stderr output of DRMAA jobs (default None)
- jobname (str) – naming scheme for cluster job scripts (default “snakejob.{rulename}.{jobid}.sh”)
- immediate_submit (bool) – immediately submit all cluster jobs, regardless of dependencies (default False)
- standalone (bool) – kill all processes very rudely in case of failure (do not use this if you use this API) (default False) (deprecated)
- ignore_ambiguity (bool) – ignore ambiguous rules and always take the first possible one (default False)
- snakemakepath (str) – Deprecated parameter whose value is ignored. Do not use.
- lock (bool) – lock the working directory when executing the workflow (default True)
- unlock (bool) – just unlock the working directory (default False)
- cleanup_metadata (list) – just cleanup metadata of given list of output files (default None)
- cleanup_conda (bool) – just cleanup unused conda environments (default False)
- cleanup_shadow (bool) – just cleanup old shadow directories (default False)
- force_incomplete (bool) – force the re-creation of incomplete files (default False)
- ignore_incomplete (bool) – ignore incomplete files (default False)
- list_version_changes (bool) – list output files with changed rule version (default False)
- list_code_changes (bool) – list output files with changed rule code (default False)
- list_input_changes (bool) – list output files with changed input files (default False)
- list_params_changes (bool) – list output files with changed params (default False)
- list_untracked (bool) – list files in the workdir that are not used in the workflow (default False)
- summary (bool) – list summary of all output files and their status (default False)
- archive (str) – archive workflow into the given tarball
- delete_all_output (bool) – remove all files generated by the workflow (default False)
- delete_temp_output (bool) – remove all temporary files generated by the workflow (default False)
- latency_wait (int) – how many seconds to wait for an output file to appear after the execution of a job, e.g. to handle filesystem latency (default 3)
- wait_for_files (list) – wait for given files to be present before executing the workflow
- list_resources (bool) – list resources used in the workflow (default False)
- summary – list summary of all output files and their status (default False). If no option is specified, a basic summary will be output. If ‘detailed’ is added as an option, e.g. --summary detailed, extra info about the input and shell commands will be included
- detailed_summary (bool) – list summary of all input and output files and their status (default False)
- print_compilation (bool) – print the compilation of the snakefile (default False)
- debug (bool) – allow to use the debugger within rules
- notemp (bool) – ignore temp file flags, e.g. do not delete output files marked as temp after use (default False)
- keep_remote_local (bool) – keep local copies of remote files (default False)
- nodeps (bool) – ignore dependencies (default False)
- keep_target_files (bool) – Do not adjust the paths of given target files relative to the working directory.
- allowed_rules (set) – Restrict allowed rules to the given set. If None or empty, all rules are used.
- jobscript (str) – path to a custom shell script template for cluster jobs (default None)
- timestamp (bool) – print time stamps in front of any output (default False)
- greediness (float) – set the greediness of scheduling. This value between 0 and 1 determines how carefully jobs are selected for execution. The default value (0.5 if prioritytargets are used, 1.0 otherwise) provides the best speed and still acceptable scheduling quality.
- overwrite_shellcmd (str) – a shell command that shall be executed instead of those given in the workflow. This is for debugging purposes only.
- updated_files (list) – a list that will be filled with the files that are updated or created during the workflow execution
- verbose (bool) – show additional debug output (default False)
- max_jobs_per_second (int) – maximal number of cluster/drmaa jobs per second, None to impose no limit (default None)
- restart_times (int) – number of times to restart failing jobs (default 0)
- attempt (int) – initial value of Job.attempt. This is intended for internal use only (default 1).
- force_use_threads – whether to force use of threads over processes. Helpful if shared memory is full or unavailable (default False)
- use_conda (bool) – create conda environments for each job (defined with conda directive of rules)
- use_singularity (bool) – run jobs in singularity containers (if defined with singularity directive)
- singularity_args (str) – additional arguments to pass to singularity
- conda_prefix (str) – the directory in which conda environments will be created (default None)
- singularity_prefix (str) – the directory to which singularity images will be pulled (default None)
- create_envs_only (bool) – If specified, only builds the conda environments specified for each job, then exits.
- list_conda_envs (bool) – List conda environments and their location on disk.
- mode (snakemake.common.Mode) – Execution mode
- wrapper_prefix (str) – Prefix for wrapper script URLs (default None)
- kubernetes (str) – Submit jobs to kubernetes, using the given namespace.
- kubernetes_env (list) – Environment variables that shall be passed to kubernetes jobs.
- container_image (str) – Docker image to use, e.g., for kubernetes.
- default_remote_provider (str) – Default remote provider to use instead of local files (e.g. S3, GS)
- default_remote_prefix (str) – Prefix for default remote provider (e.g. name of the bucket).
- assume_shared_fs (bool) – Assume that cluster nodes share a common filesystem (default true).
- cluster_status (str) – Status command for cluster execution. If None, Snakemake will rely on flag files. Otherwise, it expects the command to return “success”, “failure” or “running” when executing with a cluster jobid as single argument.
- log_handler (function) –
redirect snakemake output to this custom log handler, a function that takes a log message dictionary (see below) as its only argument (default None). A usage sketch is given below, after the parameter list. The log message dictionary for the log handler has the following entries:
level: the log level (“info”, “error”, “debug”, “progress”, “job_info”)
level=“info”, “error” or “debug”: msg: the log message
level=“progress”: done: number of already executed jobs; total: number of total jobs
level=“job_info”: input: list of input files of a job; output: list of output files of a job; log: path to log file of a job; local: whether a job is executed locally (i.e. ignoring cluster); msg: the job message; reason: the job reason; priority: the job priority; threads: the threads of the job
Returns: True if workflow execution was successful.
Return type: bool
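As a rough usage sketch (the parameter values are illustrative; only snakefile is required), the function can be driven from a plain Python script, optionally with a custom log handler as described above:
import snakemake
def progress_handler(msg):
    # msg is a log message dictionary as described above; "level" is always present
    if msg["level"] == "progress":
        print("{done}/{total} jobs done".format(**msg))
success = snakemake.snakemake(
    "Snakefile",              # path to the workflow definition
    cores=4,
    use_conda=True,
    log_handler=progress_handler,
)
print("workflow succeeded:", success)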
Additional utils¶
-
class
snakemake.utils.
AlwaysQuotedFormatter
(quote_func=<function quote>, *args, **kwargs)[source]¶ Subclass of QuotedFormatter that always quotes.
Usage is identical to QuotedFormatter, except that it always acts like “q” was appended to the format spec.
-
class
snakemake.utils.
QuotedFormatter
(quote_func=<function quote>, *args, **kwargs)[source]¶ Subclass of string.Formatter that supports quoting.
Using this formatter, any field can be quoted after formatting by appending “q” to its format string. By default, shell quoting is performed using “shlex.quote”, but you can pass a different quote_func to the constructor. The quote_func simply has to take a string argument and return a new string representing the quoted form of the input string.
Note that if an element after formatting is the empty string, it will not be quoted.
-
snakemake.utils.
R
(code)[source]¶ Execute R code.
This is deprecated in favor of the
script
directive. This function executes the R code given as a string. The function requires rpy2 to be installed.
Parameters: code (str) – R code to be executed
-
class
snakemake.utils.
SequenceFormatter
(separator=' ', element_formatter=<string.Formatter object>, *args, **kwargs)[source]¶ string.Formatter subclass with special behavior for sequences.
This class delegates formatting of individual elements to another formatter object. Non-list objects are formatted by calling the delegate formatter’s “format_field” method. List-like objects (list, tuple, set, frozenset) are formatted by formatting each element of the list according to the specified format spec using the delegate formatter and then joining the resulting strings with a separator (space by default).
-
snakemake.utils.
available_cpu_count
()[source]¶ Return the number of available virtual or physical CPUs on this system. The number of available CPUs can be smaller than the total number of CPUs when the cpuset(7) mechanism is in use, as is the case on some cluster systems.
Adapted from http://stackoverflow.com/a/1006301/715090
-
snakemake.utils.
format
(_pattern, *args, stepout=1, _quote_all=False, **kwargs)[source]¶ Format a pattern in Snakemake style.
This means that keywords embedded in braces are replaced by any variable values that are available in the current namespace.
-
snakemake.utils.
linecount
(filename)[source]¶ Return the number of lines of given file.
Parameters: filename (str) – the path to the file
-
snakemake.utils.
listfiles
(pattern, restriction=None, omit_value=None)[source]¶ Yield a tuple of existing filepaths for the given pattern.
Wildcard values are yielded as the second tuple item.
Parameters: - pattern (str) – a filepattern. Wildcards are specified in snakemake syntax, e.g. “{id}.txt”
- restriction (dict) – restrict to wildcard values given in this dictionary
- omit_value (str) – wildcard value to omit
Yields: tuple – The next file matching the pattern, and the corresponding wildcards object
-
snakemake.utils.
makedirs
(dirnames)[source]¶ Recursively create the given directory or directories without reporting errors if they are present.
-
snakemake.utils.
min_version
(version)[source]¶ Require minimum snakemake version, raise workflow error if not met.
-
snakemake.utils.
read_job_properties
(jobscript, prefix='# properties', pattern=re.compile('# properties = (.*)'))[source]¶ Read the job properties defined in a snakemake jobscript.
This function is a helper for writing custom wrappers for the snakemake –cluster functionality. Applying this function to a jobscript will return a dict containing information about the job.
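A minimal custom submission wrapper might use it as sketched below (the qsub resource flags are placeholders; adapt them to your scheduler):
#!/usr/bin/env python3
# hypothetical wrapper, used e.g. as: snakemake --cluster qsub-wrapper.py
import sys
from subprocess import call
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
props = read_job_properties(jobscript)
# request as many slots as the job declares threads
threads = props.get("threads", 1)
call(["qsub", "-pe", "smp", str(threads), jobscript])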
-
snakemake.utils.
report
(text, path, stylesheet='/home/docs/checkouts/readthedocs.org/user_builds/snakemake/checkouts/v5.1.3/snakemake/report.css', defaultenc='utf8', template=None, metadata=None, **files)[source]¶ Create an HTML report using python docutils.
This is deprecated in favor of the –report flag.
Attention: This function needs Python docutils to be installed for the python installation you use with Snakemake.
All keywords not listed below are interpreted as paths to files that shall be embedded into the document. The keywords will be available as link targets in the text. E.g. append a file as keyword arg via F1=input[0] and put a download link in the text like this:
report('''
==============
Report for ...
==============

Some text. A link to an embedded file: F1_.

Further text.
''', outputpath, F1=input[0])
Instead of specifying each file as a keyword arg, you can also expand the input of your rule if it is completely named, e.g.:
report('''
Some text...
''', outputpath, **input)
Parameters: - text (str) – The “restructured text” as it is expected by python docutils.
- path (str) – The path to the desired output file
- stylesheet (str) – An optional path to a css file that defines the style of the document. This defaults to <your snakemake install>/report.css. Use the default to get a hint how to create your own.
- defaultenc (str) – The encoding that is reported to the browser for embedded text files, defaults to utf8.
- template (str) – An optional path to a docutils HTML template.
- metadata (str) – E.g. an optional author name or email address.
-
snakemake.utils.
update_config
(config, overwrite_config)[source]¶ Recursively update dictionary config with overwrite_config.
See http://stackoverflow.com/questions/3232943/update-value-of-a-nested-dictionary-of-varying-depth for details.
Parameters: - config (dict) – dictionary to update
- overwrite_config (dict) – dictionary whose items will overwrite those in config
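A minimal sketch with made-up keys:
from snakemake.utils import update_config
defaults = {"params": {"threads": 4, "mem": "4G"}}
overrides = {"params": {"mem": "16G"}}
# nested keys are merged recursively and config is updated in place;
# only "mem" is overwritten here
update_config(defaults, overrides)
assert defaults == {"params": {"threads": 4, "mem": "16G"}}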
-
snakemake.utils.
validate
(data, schema)[source]¶ Validate data with JSON schema at given path.
Arguments:
- data – data to validate. Can be a config dict or a pandas data frame.
- schema – Path to JSON schema used for validation. The schema can also be in YAML format. If validating a pandas data frame, the schema has to describe a row record (i.e., a dict with column names as keys pointing to row values). See http://json-schema.org. The path is interpreted relative to the Snakefile when this function is called.
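For example, validating the workflow config against a schema file inside a Snakefile could look like this (the schema path and its contents are hypothetical):
from snakemake.utils import validate
configfile: "config.yaml"
# schemas/config.schema.yaml would describe the expected structure of the config;
# for a pandas data frame, the schema instead describes a single row record
validate(config, "schemas/config.schema.yaml")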
Citing and Citations¶
This section gives instructions on how to cite Snakemake and lists citing articles.
Citing Snakemake¶
When using Snakemake for a publication, please cite the following article in your paper:
Cite This¶
More References¶
Another publication describing more of Snakemake internals:
And my PhD thesis which describes all algorithmic details:
Project Pages¶
If you publish a Snakemake workflow, consider adding this badge to your project page:
The markdown syntax is
[](https://snakemake.bitbucket.io)
Replace the 3.5.2
with the minimum required Snakemake version.
You can also change the style.
More Resources¶
Talks and Posters¶
- Poster at ECCB 2016, The Hague, Netherlands.
- Invited talk by Johannes Köster at the Broad Institute, Boston 2015.
- Introduction to Snakemake. Tutorial Slides presented by Johannes Köster at the GCB 2015, Dortmund, Germany.
- Invited talk by Johannes Köster at the DTL Focus Meeting: “NGS Production Pipelines”, Dutch Techcentre for Life Sciences, Utrecht 2014.
- Taming Snakemake by Jeremy Leipzig, Bioinformatics software developer at Children’s Hospital of Philadelphia, 2014.
- “Snakemake makes … snakes?” - An Introduction by Marcel Martin from SciLifeLab, Stockholm 2015
- “Workflow Management with Snakemake” by Johannes Köster, 2015. Held at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
External Resources¶
These resources are not part of the official documentation.
- A number of tutorials on the subject “Tools for reproducible research”
- Snakemake workflow used for the Kallisto paper
- An alternative tutorial for Snakemake
- An Emacs mode for Snakemake
- Flexible bioinformatics pipelines with Snakemake
- Sandwiches with Snakemake
- A visualization of the past years of Snakemake development
- Japanese version of the Snakemake tutorial
- Basic and advanced french Snakemake tutorial.
- Mini tutorial on Snakemake and Bioconda
- Snakeparse: a utility to expose Snakemake workflow configuration via a command line interface
Frequently Asked Questions¶
Contents
- Frequently Asked Questions
- What is the key idea of Snakemake workflows?
- What is the recommended way to distribute a Snakemake workflow?
- My shell command fails with errors about an “unbound variable”, what’s wrong?
- My shell command fails with exit code != 0 from within a pipe, what’s wrong?
- How do I run my rule on all files of a certain directory?
- Snakemake complains about a cyclic dependency or a PeriodicWildcardError. What can I do?
- Is it possible to pass variable values to the workflow via the command line?
- I get a NameError with my shell command. Are braces unsupported?
- How do I incorporate files that do not follow a consistent naming scheme?
- How do I force Snakemake to rerun all jobs from the rule I just edited?
- How do I enable syntax highlighting in Vim for Snakefiles?
- I want to import some helper functions from another python file. Is that possible?
- How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?
- Can the output of a rule be a symlink?
- Can the input of a rule be a symlink?
- I would like to receive a mail upon snakemake exit. How can this be achieved?
- I want to pass variables between rules. Is that possible?
- Why do my global variables behave strangely when I run my job on a cluster?
- I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?
- Some command line arguments like –config cannot be followed by rule or file targets. Is that intended behavior?
- How do I make my rule fail if an output file is empty?
- How does Snakemake lock the working directory?
- Snakemake does not trigger re-runs if I add additional input files. What can I do?
- How do I trigger re-runs for rules with updated code or parameters?
- How do I remove all files created by snakemake, i.e. like
make clean
- Why can’t I use the conda directive with a run block?
- My workflow is very large, how do I stop Snakemake from printing all this rule/job information in a dry-run?
- Git is messing up the modification times of my input files, what can I do?
- How do I exit a running Snakemake workflow?
- How do I access elements of input or output by a variable index?
What is the key idea of Snakemake workflows?¶
The key idea is very similar to GNU Make. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake:

When you start using Snakemake, please make sure to walk through the official tutorial. It is crucial to understand how to properly use the system.
What is the recommended way to distribute a Snakemake workflow?¶
It is recommended that a Snakemake workflow is structured in the following way:
├── config.yaml
├── environment.yaml
├── scripts
│ ├── script1.py
│ └── script2.R
└── Snakefile
This structure can be put into a git repository, allowing the workflow to be set up with the following steps:
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config as needed
vim config.yaml
# install dependencies into an isolated conda environment
conda env create -n myworkflow --file environment.yaml
# activate environment
source activate myworkflow
# execute workflow
snakemake -n
In certain cases, it might be necessary to extend or modify a given workflow (the Snakefile). Here, git provides the ideal mechanisms to track such changes. Any modifications can happen in either a separate branch or a fork. When the changes are general enough, they can be reintegrated later into the master branch using pull requests.
My shell command fails with errors about an “unbound variable”, what’s wrong?¶
This happens often when calling virtual environments from within Snakemake. Snakemake uses bash strict mode to ensure e.g. proper error behavior of shell scripts. Unfortunately, virtualenv and some other tools violate bash strict mode. The quick fix for virtualenv is to temporarily deactivate the check for unbound variables:
set +u; source /path/to/venv/bin/activate; set -u
For more details on bash strict mode, see here.
My shell command fails with exit code != 0 from within a pipe, what’s wrong?¶
Snakemake is using bash strict mode to ensure best practice error reporting in shell commands. This entails the pipefail option, which reports errors from within a pipe to outside. If you don’t want this, e.g., to handle empty output in the pipe, you can disable pipefail via prepending
set +o pipefail;
to your shell command in the problematic rule.
How do I run my rule on all files of a certain directory?¶
In Snakemake, similar to GNU Make, the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, ...
, and you want to produce files 1.bam, 2.bam, 3.bam, ...
you should specify these as target files, using the ids 1,2,3,...
. You could end up with at least two rules like this (or any number of intermediate steps):
IDS = "1 2 3 ...".split() # the list of desired ids
# a pseudo-rule that collects the target files
rule all:
input: expand("otherdir/{id}.bam", id=IDS)
# a general rule using wildcards that does the work
rule:
input: "thedir/{id}.fastq"
output: "otherdir/{id}.bam"
shell: "..."
Snakemake will then go down the line and determine which files it needs from your initial directory.
In order to infer the IDs from present files, Snakemake provides the glob_wildcards
function, e.g.
IDS, = glob_wildcards("thedir/{id}.fastq")
The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, which is the list of values for the wildcard {id}
.
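With several wildcards, each field of the returned tuple holds one list. For example (the file layout is hypothetical):
# matches e.g. thedir/sampleA.1.fastq and thedir/sampleB.2.fastq
SAMPLES, LANES = glob_wildcards("thedir/{sample}.{lane}.fastq")
rule all:
    input: expand("otherdir/{sample}.{lane}.bam", zip, sample=SAMPLES, lane=LANES)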
Snakemake complains about a cyclic dependency or a PeriodicWildcardError. What can I do?¶
One limitation of Snakemake is that graphs of jobs have to be acyclic (similar to GNU Make). This means, that no path in the graph may be a cycle. Although you might have considered this when designing your workflow, Snakemake sometimes runs into situations where a cyclic dependency cannot be avoided without further information, although the solution seems obvious for the developer. Consider the following example:
rule all:
input:
"a"
rule unzip:
input:
"{sample}.tar.gz"
output:
"{sample}"
shell:
"tar -xf {input}"
If this workflow is executed with
snakemake -n
two things may happen.
If the file
a.tar.gz
is present in the filesystem, Snakemake will propose the following (expected and correct) plan:
rule unzip:
    input: a.tar.gz
    output: a
    wildcards: sample=a
localrule all:
    input: a
Job counts:
    count jobs
    1     all
    1     unzip
    2
If the file a.tar.gz is not present and cannot be created by any other rule than rule unzip, Snakemake will try to run rule unzip again, with {sample}=a.tar.gz. This would go on recursively forever. Snakemake detects this case and produces a PeriodicWildcardError.
In summary, a PeriodicWildcardError hints at a problem where a rule or a set of rules can be applied to create its own input. If you are lucky, Snakemake can avoid the error by stopping the recursion if a file exists in the filesystem. Importantly, however, bugs upstream of that rule can also manifest as a PeriodicWildcardError, although in reality just a file is missing or named differently.
In such cases, it is best to restrict the wildcard of the output file(s), or follow the general rule of putting output files of different rules into unique subfolders of your working directory. This way, you can discover the true source of your error.
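For the example above, a regular-expression constraint on the output wildcard is one way to break the cycle (a sketch; the constraint simply forbids dots in the matched value, so a.tar.gz can no longer match {sample}):
rule unzip:
    input:
        "{sample}.tar.gz"
    output:
        "{sample,[^.]+}"
    shell:
        "tar -xf {input}"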
Is it possible to pass variable values to the workflow via the command line?¶
Yes, this is possible; have a look at Configuration. Previously, it was necessary to use environment variables for this, e.g. by writing
$ SAMPLES="1 2 3 4 5" snakemake
and have in the Snakefile some Python code that reads this environment variable, i.e.
SAMPLES = os.environ.get("SAMPLES", "10 20").split()
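With the config mechanism, the same effect can be achieved without environment variables (a sketch; the key name samples is arbitrary). Invoking the workflow with snakemake --config samples="1 2 3 4 5", the Snakefile can read:
SAMPLES = config.get("samples", "10 20").split()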
I get a NameError with my shell command. Are braces unsupported?¶
You can use the entire Python format mini-language in shell commands. Braces in shell commands that are not intended to insert variable values thus have to be escaped by doubling them:
This:
...
shell: "awk '{print $1}' {input}"
becomes:
...
shell: "awk '{{print $1}}' {input}"
Here the double braces are escapes, i.e. there will remain single braces in the final command. In contrast, {input}
is replaced with an input filename.
In addition, if your shell command has literal backslashes, \\
, you must escape them with a backslash, \\\\
. For example:
This:
shell: """printf \">%s\"" {{input}}"""
becomes:
shell: """printf \\">%s\\"" {{input}}"""
How do I incorporate files that do not follow a consistent naming scheme?¶
The best solution is to have a dictionary that translates a sample id to the inconsistently named files and use a function (see Functions as Input Files) to provide an input file like this:
FILENAME = dict(...) # map sample ids to the irregular filenames here
rule:
# use a function as input to delegate to the correct filename
input: lambda wildcards: FILENAME[wildcards.sample]
output: "somefolder/{sample}.csv"
shell: ...
How do I force Snakemake to rerun all jobs from the rule I just edited?¶
This can be done by invoking Snakemake with the --forcerun
or -R
flag, followed by the rules that should be re-executed:
$ snakemake -R somerule
This will cause Snakemake to re-run all jobs of that rule and everything downstream (i.e. directly or indirectly depending on the rules output).
How do I enable syntax highlighting in Vim for Snakefiles?¶
A vim syntax highlighting definition for Snakemake is available here.
You can copy that file to $HOME/.vim/syntax
directory and add
au BufNewFile,BufRead Snakefile set syntax=snakemake
au BufNewFile,BufRead *.smk set syntax=snakemake
to your $HOME/.vimrc
file. Highlighting can be forced in a vim session with :set syntax=snakemake
.
I want to import some helper functions from another python file. Is that possible?¶
Yes, from version 2.4.8 on, Snakemake allows importing python modules (and also simple python files) from the same directory where the Snakefile resides.
How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?¶
This can be achieved by submitting the main Snakemake invocation as a job to the cluster. If it is not allowed to submit a job from a non-head cluster node, you can provide a submit command that goes back to the head node before submitting:
qsub -N PIPE -cwd -j yes python snakemake --cluster "ssh user@headnode_address 'qsub -N pipe_task -j yes -cwd -S /bin/sh ' " -j
This hint was provided by Inti Pedroso.
Can the output of a rule be a symlink?¶
Yes. As of Snakemake 3.8, output files are removed before running a rule and then touched after the rule completes to ensure they are newer than the input. Symlinks are treated just the same as normal files in this regard, and Snakemake ensures that it only modifies the link and not the target when doing this.
Here is an example where you want to merge N files together, but if N == 1 a symlink will do. This is easier than attempting to implement workflow logic that skips the step entirely. Note that the -r flag, supported by modern versions of ln, is useful to achieve correct linking between files in subdirectories.
rule merge_files:
output: "{foo}/all_merged.txt"
input: my_input_func # some function that yields 1 or more files to merge
run:
if len(input) > 1:
shell("cat {input} | sort > {output}")
else:
shell("ln -sr {input} {output}")
Do be careful with symlinks in combination with Step 6: Temporary and protected files. When the original file is deleted, this can cause various errors once the symlink does not point to a valid file any more.
If you get a message like Unable to set utime on symlink .... Your Python build does not support it., this means that Snakemake is unable to properly adjust the modification time of the symlink.
In this case, a workaround is to add the shell command touch -h {output} to the end of the rule.
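Applied to a simple linking rule, the workaround could look like this (a sketch with hypothetical file names; touch -h refreshes the timestamp of the link itself rather than its target):
rule link:
    input:
        "data/source.txt"
    output:
        "results/link.txt"
    shell:
        "ln -sr {input} {output} && touch -h {output}"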
Can the input of a rule be a symlink?¶
Yes. In this case, since Snakemake 3.8, one extra consideration is applied. If either the link itself or the target of the link is newer than the output files for the rule then it will trigger the rule to be re-run.
I would like to receive a mail upon snakemake exit. How can this be achieved?¶
On unix, you can make use of the commonly pre-installed mail command:
snakemake 2> snakemake.log
mail -s "snakemake finished" youremail@provider.com < snakemake.log
In case your administrator does not provide you with a proper configuration of the sendmail framework, you can configure mail to work e.g. via Gmail (see here).
I want to pass variables between rules. Is that possible?¶
Because of the cluster support and the ability to resume a workflow where you stopped last time, Snakemake should in general be used in a way that stores information in the output files of your jobs. Sometimes it might nevertheless be handy to have a kind of persistent storage for simple values between jobs and rules. Using plain Python objects like a global dict for this will not work, as each job is run in a separate process by Snakemake. What helps here is the PersistentDict from the pytools package. Here is an example of a Snakemake workflow using this facility:
from pytools.persistent_dict import PersistentDict
storage = PersistentDict("mystorage")
rule a:
input: "test.in"
output: "test.out"
run:
myvar = storage.fetch("myvar")
# do stuff
rule b:
output: temp("test.in")
run:
storage.store("myvar", 3.14)
Here, the output of rule b has to be marked as temp in order to ensure that myvar
is stored in each run of the workflow as rule a relies on it. In other words, the PersistentDict is persistent between the job processes, but not between different runs of this workflow. If you need to conserve information between different runs, use output files for them.
Why do my global variables behave strangely when I run my job on a cluster?¶
This is closely related to the question above. Any Python code you put outside of a rule definition is normally run once before Snakemake starts to process rules, but on a cluster it is re-run for each submitted job, because Snakemake implements jobs by re-running itself.
Consider the following…
from mydatabase import get_connection
dbh = get_connection()
latest_parameters = dbh.get_params().latest()
rule a:
input: "{foo}.in"
output: "{foo}.out"
shell: "do_op -params {latest_parameters} {input} {output}"
When run on a single machine, you will see a single connection to your database and get a single value for latest_parameters for the duration of the run. On a cluster you will see a connection attempt from the cluster node for each submitted job, regardless of whether it happens to involve rule a or not, and the parameters will be recalculated for each job.
I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?¶
You can set a prefix that will be prepended to all shell commands by adding e.g.
shell.prefix("set -o pipefail; ")
to the top of your Snakefile. Make sure that the prefix ends with a semicolon, such that it will not interfere with the subsequent commands. To simulate a bash login shell, you can do the following:
shell.executable("/bin/bash")
shell.prefix("source ~/.bashrc; ")
Some command line arguments like –config cannot be followed by rule or file targets. Is that intended behavior?¶
This is a limitation of the argparse module, which cannot tell whether the next argument still belongs to --config or is meant as a target.
As a solution, you can put the --config at the end of your invocation, or prepend the target with a single --
, i.e.
$ snakemake --config foo=bar -- mytarget
$ snakemake mytarget --config foo=bar
How do I make my rule fail if an output file is empty?¶
Snakemake expects shell commands to behave properly, meaning that failures should cause an exit status other than zero. If a command does not exit with a status other than zero, Snakemake assumes everything worked fine, even if output files are empty. This is because empty output files are also a reasonable tool to indicate progress where no real output was produced. However, sometimes you will have to deal with tools that do not properly report their failure with an exit status. Here, the recommended way is to use bash to check for non-empty output files, e.g.:
rule:
input: ...
output: "my/output/file.txt"
shell: "somecommand {input} {output} && [[ -s {output} ]]"
How does Snakemake lock the working directory?¶
Per default, Snakemake will lock a working directory by output and input files. Two Snakemake instances that want to create the same output file cannot run at the same time. Two instances creating disjoint sets of output files can run in parallel.
With the command line option --nolock
, you can disable this mechanism at your own risk. With --unlock
, you can remove a stale lock. Stale locks can appear if your machine is powered off with a running Snakemake instance.
Snakemake does not trigger re-runs if I add additional input files. What can I do?¶
Snakemake has a kind of “lazy” policy about added input files if their modification date is older than that of the output files. One reason is that the necessary action cannot be inferred from the input and output files alone; additional information about the last run would have to be stored. Since the behaviour would be inconsistent between cases where that information is available and cases where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:
$ snakemake -n -R `snakemake --list-input-changes`
Here, snakemake --list-input-changes
returns the list of output files with changed input files, which is fed into -R
to trigger a re-run.
How do I trigger re-runs for rules with updated code or parameters?¶
Similar to the solution above, you can use
$ snakemake -n -R `snakemake --list-params-changes`
and
$ snakemake -n -R `snakemake --list-code-changes`
Again, the list commands in backticks return the list of output files with changes, which are fed into -R
to trigger a re-run.
How do I remove all files created by snakemake, i.e. like make clean
¶
To remove all files created by snakemake as output files to start from scratch, you can use
$ snakemake some_target --delete-all-output
Only files that are output of snakemake rules will be removed, not those that serve as primary inputs to the workflow.
Note that this will only affect the files involved in reaching the specified target(s).
It is strongly advised to first run together with --dryrun
to list the files that would be removed without actually deleting anything.
The flag --delete-temp-output
can be used in a similar manner to only delete files flagged as temporary.
Why can’t I use the conda directive with a run block?¶
The run block of a rule (see Rules) has access to anything defined in the Snakefile, outside of the rule. Hence, it has to share the conda environment with the main Snakemake process. To avoid confusion we therefore disallow the conda directive together with the run block. It is recommended to use the script directive instead (see External scripts).
My workflow is very large, how do I stop Snakemake from printing all this rule/job information in a dry-run?¶
Indeed, the information for each individual job can slow down a dryrun if there are tens of thousands of jobs.
If you are just interested in the final summary, you can use the --quiet
flag to suppress this.
$ snakemake -n --quiet
Git is messing up the modification times of my input files, what can I do?¶
When you checkout a git repository, the modification times of updated files are set to the time of the checkout. If you rely on these files as input and output files in your workflow, this can cause trouble. For example, Snakemake could think that a certain (git-tracked) output has to be re-executed, just because its input has been checked out a bit later. In such cases, it is advisable to set the file modification dates to the last commit date after an update has been pulled. See here for a solution to achieve this.
How do I exit a running Snakemake workflow?¶
There are two ways to exit a currently running workflow.
If you want to kill all running jobs, hit Ctrl+C. Note that when using --cluster, this will only cancel the main Snakemake process.
If you want to stop the scheduling of new jobs and wait for all running jobs to be finished, you can send a TERM signal, e.g., via
killall -TERM snakemake
How do I access elements of input or output by a variable index?¶
Assuming you have something like the following rule
rule a:
    output: expand("test.{i}.out", i=range(20))
    run:
        for i in range(20):
            shell("echo test > {output[i]}")
Snakemake will fail upon execution with the error 'OutputFiles' object has no attribute 'i'
. The reason is that the shell command is using the Python format mini-language, which only allows indexing via constants, e.g., output[1]
, but not via variables. Variables are treated as attribute names instead. The solution is to write
rule a:
    output: expand("test.{i}.out", i=range(20))
    run:
        for i in range(20):
            f = output[i]
            shell("echo test > {f}")
or, more concisely in this special case:
rule a:
    output: expand("test.{i}.out", i=range(20))
    run:
        for f in output:
            shell("echo test > {f}")
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://bitbucket.org/snakemake/snakemake/issues
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the Bitbucket issues for bugs. If you want to start working on a bug, please write a short message on the issue tracker to prevent duplicate work.
Implement Features¶
Look through the Bitbucket issues for features. If you want to start working on an issue, please write a short message on the issue tracker to prevent duplicate work.
Write Documentation¶
Snakemake could always use more documentation, whether as part of the official Snakemake docs, in docstrings, or even on the web in blog posts, articles, and such.
Snakemake uses Sphinx for the user manual (that you are currently reading). See project_info-doc_guidelines on how the documentation reStructuredText is used.
Submit Feedback¶
The best way to send feedback is to file an issue at https://bitbucket.org/snakemake/snakemake/issues
If you are proposing a feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions are welcome :)
Pull Request Guidelines¶
To update the documentation, fix bugs or add new features you need to create a Pull Request. A PR is a change you make to your local copy of the code for us to review and potentially integrate into the code base.
To create a Pull Request you need to do these steps:
- Create a Bitbucket account.
- Fork the repository (see the left sidebar on the main Bitbucket Snakemake page).
- Clone your fork (go to your copy of the repository at
https://bitbucket.org/<your_username>/snakemake
and click clone. This gives you the command you need to paste into your shell). - Go to the snakemake folder with
cd snakemake
. - Create a new branch with
git checkout -b <descriptive_branch_name>
. - Make your changes to the code or documentation.
- Run
git add .
to add all the changed files to the commit (to see what files will be added you can rungit add . --dry-run
). - To commit the added files use
git commit
. (This will open a command line editor to write a commit message. These should have a descriptive header of at most 80 characters, followed by an empty line, and then a description of what you did and why. To use your command line text editor of choice use (for example) export GIT_EDITOR=vim
before runninggit commit
). - Now you can push your changes to your Bitbucket copy of Snakemake by running
git push origin <descriptive_branch_name>
. - If you now go to the webpage for your Bitbucket copy of Snakemake you should see a link in the sidebar called “Create Pull Request”.
- Now you need to choose your PR from the menu and click the “Create pull request” button. Be sure to change the pull request target branch to <descriptive_branch_name>!
If you want to create more pull requests, first run git checkout master
and then start at step 5. with a new branch name.
Feel free to ask questions about this if you want to contribute to Snakemake :)
Testing Guidelines¶
To ensure that you do not introduce bugs into Snakemake, you should test your code thoroughly.
To have integration tests run automatically when committing code changes to bitbucket, you need to sign up on wercker.com and register a user.
The easiest way to run your development version of Snakemake is perhaps to go to the folder containing your local copy of Snakemake and call
conda env create -f environment.yml -n snakemake-testing
source activate snakemake-testing
pip install -e .
This will make your development version of Snakemake the one called when running snakemake. You do not need to re-run this command after each code change.
From the base snakemake folder you call python setup.py nosetests
to run all the tests. (If it complains that you do not have nose installed, which is the testing framework we use, you can simply install it by running pip install nose
.)
If you introduce a new feature you should add a new test to the tests directory. See the folder for examples.
Documentation Guidelines¶
For the documentation, please adhere to the following guidelines:
- Put each sentence on its own line; this makes tracking changes through Git SCM easier.
- Provide hyperlink targets, at least for the first two section levels.
For this, use the format
<document_part>-<section_name>
, e.g.,project_info-doc_guidelines
. - Use the section structure from below.
.. document_part-heading_1:
=========
Heading 1
=========
.. document_part-heading_2:
---------
Heading 2
---------
.. document_part-heading_3:
Heading 3
=========
.. document_part-heading_4:
Heading 4
---------
.. document_part-heading_5:
Heading 5
~~~~~~~~~
.. document_part-heading_6:
Heading 6
:::::::::
Documentation Setup¶
For building the documentation, you have to install Sphinx. If you have already installed Conda, all you need to do is to create a Snakemake development environment via
$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake
$ conda env create -f environment.yml -n snakemake
Then, the docs can be built with
$ source activate snakemake
$ cd docs
$ make html
$ make clean && make html # force rebuild
Alternatively, you can use virtualenv. The following assumes you have a working Python 3 setup.
$ git clone git@bitbucket.org:snakemake/snakemake.git
$ cd snakemake/docs
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ pip install --upgrade -r requirements.txt
Afterwards, the docs can be built with
$ source .venv/bin/activate
$ make html # rebuild for changed files only
$ make clean && make html # force rebuild
Credits¶
Development Lead¶
- Johannes Köster
Development Team¶
- Christopher Tomkins-Tinch
- David Koppstein
- Tim Booth
- Manuel Holtgrewe
- Christian Arnold
- Wibowo Arindrarto
- Rasmus Ågren
Contributors¶
In alphabetical order
- Andreas Wilm
- Anthony Underwood
- Ryan Dale
- David Alexander
- Elias Kuthe
- Elmar Pruesse
- Hyeshik Chang
- Jay Hesselberth
- Jesper Foldager
- John Huddleston
- Joona Lehtomäki
- Karel Brinda
- Karl Gutwin
- Kemal Eren
- Kostis Anagnostopoulos
- Kyle A. Beauchamp
- Kyle Meyer
- Lance Parsons
- Manuel Holtgrewe
- Marcel Martin
- Matthew Shirley
- Mattias Franberg
- Matt Shirley
- Paul Moore
- percyfal
- Per Unneberg
- Ryan C. Thompson
- Ryan Dale
- Sean Davis
- Simon Ye
- Tobias Marschall
- Willem Ligtenberg
Change Log¶
# Change Log
# [5.1.3] - 2018-05-22
## Changed
- Fixed various bugs in job groups, shadow directive, singularity directive, and more.
# [5.1.2] - 2018-05-18
## Changed
- Fixed a bug in the report stylesheet.
# [5.1.0] - 2018-05-17
## Added
- A new framework for self-contained HTML reports, including results, statistics and topology information. In future releases this will be further extended.
- A new utility snakemake.utils.validate() which allows to validate config and pandas data frames using JSON schemas.
- Two new flags --cleanup-shadow and --cleanup-conda to clean up old unused conda and shadow data.
## Changed
- Benchmark repeats are now specified inside the workflow via a new flag repeat().
- Command line interface help has been refactored into groups for better readability.
# [5.0.0] - 2018-05-11
# Added
- Group jobs for reduced queuing and network overhead, in particular with short running jobs.
- Output files can be marked as pipes, such that producing and consuming jobs are executed simultaneously and information is transferred directly without using disk.
- Command line flags to clean output files.
- Command line flag to list files in working directory that are not tracked by Snakemake.
# Changes
- Fix of --default-remote-prefix in case of input functions returning lists or dicts.
- Scheduler no longer prefers jobs with many downstream jobs.
## [4.8.1] - 2018-04-25
### Added
- Allow URLs for the conda directive.
### Changed
- Various minor updates in the docs.
- Several bug fixes with remote file handling.
- Fix ImportError occurring with script directive.
- Use latest singularity.
- Improved caching for file existence checks. We first check the existence of parent directories and cache these results. This way, large parts of the generated FS tree can be pruned if files are not yet present. If files are present, the overhead is minimal, since the checks for the parents are cached.
- Various minor bug fixes.
## [4.8.0] - 2018-03-13
### Added
- Integration with CWL: the `cwl` directive allows using CWL tool definitions in addition to shell commands or Snakemake wrappers.
- A global `singularity` directive allows defining a singularity container to be used for all rules that don't specify their own.
- Singularity and Conda can now be combined. This can be used to specify the operating system (via singularity), and the software stack (via conda), without the overhead of creating specialized container images for workflows or tasks.
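A minimal sketch of the combination (container image, environment file and command are illustrative):
# Global container that provides the operating system layer for all rules.
singularity: "docker://continuumio/miniconda3"

rule call_variants:
    input:
        "mapped/{sample}.bam"
    output:
        "calls/{sample}.vcf"
    # Per-rule conda environment that provides the software stack inside the container.
    conda:
        "envs/calling.yaml"
    shell:
        "somecommand {input} > {output}"
Running with both --use-conda and --use-singularity activates the two layers together.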
## [4.7.0] - 2018-02-19
### Changed
- Speedups when calculating dry-runs.
- Speedups for workflows with many rules when calculating the DAG.
- Accept SIGTERM to gracefully finish all running jobs and exit.
- Various minor bug fixes.
## [4.6.0] - 2018-02-06
### Changed
- Log files can now be used as input files for other rules.
- Adapted to changes in Kubernetes client API.
- Fixed minor issues in --archive option.
- Search path order in scripts was changed to fix a bug with leaked packages from root env when using script directive together with conda.
## [4.5.1] - 2018-02-01
### Added
- Input and output files can now be pathlib objects.
### Changed
- Various minor bug fixes.
## [4.5.0] - 2018-01-18
### Added
- iRODS remote provider
### Changed
- Bug fix in shell usage of scripts and wrappers.
- Bug fixes for cluster execution, --immediate-submit and subworkflows.
## [4.4.0] - 2017-12-21
### Added
- A new shadow mode (minimal) that only symlinks input files has been added.
### Changed
- The default shell is now bash on Linux and macOS. If bash is not installed, we fall back to sh. Previously, Snakemake used the default shell of the user, which defeats the purpose of portability. If the developer decides so, the shell can always be overwritten using shell.executable().
- Snakemake now requires at least Singularity 2.4.1 (only when running with --use-singularity).
- HTTP remote provider no longer automatically unpacks gzipped files.
- Fixed various smaller bugs.
## [4.3.1] - 2017-11-16
### Added
- List all conda environments with their location on disk via --list-conda-envs.
### Changed
- Do not clean up shadow on dry-run.
- Allow R wrappers.
## [4.3.0] - 2017-10-27
### Added
- GridFTP remote provider. This is a specialization of the GFAL remote provider that uses globus-url-copy to download or upload files.
### Changed
- Scheduling and execution mechanisms have undergone a major revision that removes several potential (but rare) deadlocks.
- Several bugs and corner cases of the singularity support have been fixed.
- Snakemake now requires at least Singularity 2.4.
## [4.2.0] - 2017-10-10
### Added
- Support for executing jobs in per-rule singularity images. This is meant as an alternative to the conda directive (see docs), providing even more guarantees for reproducibility.
### Changed
- In cluster mode, jobs that are still running after Snakemake has been killed are automatically resumed.
- Various fixes to GFAL remote provider.
- Fixed --summary and --list-code-changes.
- Many other small bug fixes.
## [4.1.0] - 2017-09-26
### Added
- Support for configuration profiles. Profiles allow specifying default options, e.g., a cluster submission command. They can be used via 'snakemake --profile myprofile'. See the docs for details.
- GFAL remote provider. This allows using GridFTP, SRM and any other protocol supported by GFAL for remote input and output files.
- Added --cluster-status flag that allows specifying a command that returns job status.
### Changed
- The scheduler now tries to get rid of the largest temp files first.
- The Docker image used for kubernetes support can now be configured at the command line.
- Rate-limiting for cluster interaction has been unified.
- S3 remote provider uses boto3.
- Resource functions can now use an additional `attempt` parameter that contains the number of times this job has already been tried (see the sketch after this list).
- Various minor fixes.
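A minimal sketch of an attempt-aware resource function (rule name and numbers are illustrative):
rule align:
    input:
        "reads/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    # Request more memory on every retry, e.g. in combination with --restart-times.
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    shell:
        "somecommand {input} {output}"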
## [4.0.0] - 2017-07-24
### Added
- Cloud computing support via Kubernetes. Snakemake workflows can be executed transparently in the cloud, while storing input and output files within the cloud storage (e.g. S3 or Google Storage). I.e., this feature does not need a shared filesystem between the cloud nodes, which makes the setup really simple (see the command sketch after this list).
- WebDAV remote file support: Snakemake can now read and write from WebDAV. Hence,
it can now, e.g., interact with Nextcloud or Owncloud.
- Support for default remote providers: define a remote provider to implicitly
use for all input and output files.
- Added an option to only create conda environments instead of executing the workflow.
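A hedged example invocation (the bucket name and job count are illustrative):
$ snakemake --kubernetes --use-conda --default-remote-provider S3 --default-remote-prefix my-bucket --jobs 8
Input and output files are then read from and written to the given S3 bucket instead of a shared filesystem.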
### Changed
- The number of files used for the metadata tracking of Snakemake (e.g., code, params, input changes) in the .snakemake directory has been reduced by a factor of 10, which should help with NFS and IO bottlenecks. This is a breaking change in the sense that Snakemake 4.x won't see the metadata of workflows executed with Snakemake 3.x. However, old metadata won't be overwritten, so that you can always go back and check things by installing an older version of Snakemake again.
- The Google Storage (GS) remote provider has been changed to use the Google SDK. This is a breaking change, since the remote provider invocation has been simplified (see docs).
- Due to WebDAV support (which uses asyncio), Snakemake now requires Python 3.5 at least.
- Various minor bug fixes (e.g. for dynamic output files).
## [3.13.3] - 2017-06-23
### Changed
- Fix a follow-up bug in Namedlist where a single item was not returned as a string.
## [3.13.2] - 2017-06-20
### Changed
- The --wrapper-prefix flag now also affects where the corresponding environment definition is fetched from.
- Fix bug where empty output file list was recognized as containing duplicates (issue #574).
## [3.13.1] - 2017-06-20
### Changed
- Fix --conda-prefix to be passed to all jobs.
- Fix cleanup issue with scripts that fail to download.
## [3.13.0] - 2017-06-12
### Added
- An NCBI remote provider. By this, you can seamlessly integrate any NCBI resource (reference genome, gene/protein sequences, ...) as an input file.
### Changed
- Snakemake now detects if automatically generated conda environments have to be recreated because the workflow has been moved to a new path.
- Remote functionality has been made more robust, in particular to avoid race conditions.
- `--config` parameter evaluation has been fixed for non-string types.
- The Snakemake Docker container is now based on the official Debian image.
## [3.12.0] - 2017-05-09
### Added
- Support for RMarkdown (.Rmd) in script directives.
- New option --debug-dag that prints all decisions while building the DAG of jobs. This helps to debug problems like cycles or unexpected MissingInputExceptions.
- New option --conda-prefix to specify the place where conda environments are stored.
### Changed
- Benchmark files now also include the maximal RSS and VMS size of the Snakemake process and all subprocesses.
- Sped up conda environment creation.
- Allow specification of DRMAA log dir.
- Pass cluster config to subworkflow.
## [3.11.2] - 2017-03-15
### Changed
- Follow-up fix for handling of local URIs with the wrapper directive.
## [3.11.1] - 2017-03-14
### Changed
- --touch now ignores missing files.
- Fixed handling of local URIs with the wrapper directive.
## [3.11.0] - 2017-03-08
### Added
- Param functions can now also refer to threads.
### Changed
- Improved tutorial and docs.
- Made conda integration more robust.
- None is converted to NULL in R scripts.
## [3.10.2] - 2017-02-28
### Changed
- Improved config file handling and merging.
- Output files can be referred to in params functions (i.e. lambda wildcards, output: ...).
- Improved conda-environment creation.
- Jobs are cached, leading to reduced memory footprint.
- Fixed subworkflow handling in input functions.
## [3.10.0] - 2017-01-18
### Added
- Workflows can now be archived to a tarball with `snakemake --archive my-workflow.tar.gz`. The archive contains all input files, source code versioned with git and all software packages that are defined via conda environments. Hence, the archive makes it possible to fully reproduce a workflow on a different machine. Such an archive can be uploaded to Zenodo, such that your workflow is secured in a self-contained, executable way for the future.
### Changed
- Improved logging.
- Reduced memory footprint.
- Added a flag to automatically unpack the output of input functions.
- Improved handling of HTTP redirects with remote files.
- Improved exception handling with DRMAA.
- Scripts referred by the script directive can now use locally defined external python modules.
## [3.9.1] - 2016-12-23
### Added
- Jobs can be restarted upon failure (--restart-times).
### Changed
- The docs have been restructured and improved. Now available under snakemake.readthedocs.org.
- Changes in scripts show up with --list-code-changes.
- Duplicate output files now cause an error.
- Various bug fixes.
## [3.9.0] - 2016-11-15
### Added
- Ability to define isolated conda software environments (YAML) per rule. Environments will be deployed by Snakemake upon workflow execution (see the sketch after this list).
- Command line argument --wrapper-prefix in order to overwrite the default URL for looking up wrapper scripts.
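A minimal sketch of a rule with its own environment (tool, paths and the environment file are illustrative):
rule fastqc:
    input:
        "reads/{sample}.fastq"
    output:
        "qc/{sample}_fastqc.html"
    # envs/fastqc.yaml is a hypothetical conda environment definition listing
    # channels and pinned package versions.
    conda:
        "envs/fastqc.yaml"
    shell:
        "fastqc {input} --outdir qc"
The environments are created and activated automatically when running Snakemake with --use-conda.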
### Changed
- --summary now displays the log files corresponding to each output file.
- Fixed hangups when using the run directive and a large number of jobs.
- Fixed pickling errors with anonymous rules and the run directive.
- Various small bug fixes.
## [3.8.2] - 2016-09-23
### Changed
- Add missing import in rules.py.
- Use threading only in cluster jobs.
## [3.8.1] - 2016-09-14
### Changed
- Snakemake now warns when using relative paths starting with "./".
- The option -R now also accepts an empty list of arguments.
- Bug fix when handling benchmark directive.
- Jobscripts exit with code 1 in case of failure. This should improve the error messages of cluster systems.
- Fixed a bug in SFTP remote provider.
## [3.8.0] - 2016-08-26
### Added
- Wildcards can now be constrained by rule and globally via the new `wildcard_constraints` directive (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wildcards)).
- Subworkflows now allow overwriting their config file via the configfile directive in the calling Snakefile.
- A method `log_fmt_shell` in the snakemake proxy object that is available in scripts and wrappers allows obtaining a formatted string to redirect logging output from STDOUT or STDERR.
- Functions given to resources can now optionally contain an additional argument `input` that refers to the input files.
- Functions given to params can now optionally contain additional arguments `input` (see above) and `resources`. The latter refers to the resources.
- It is now possible to let items in shell commands be automatically quoted (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-rules)). This is useful when dealing with filenames that contain whitespace.
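A minimal sketch of the automatic quoting (file names are illustrative):
rule copy:
    input:
        "raw/{sample}.csv"
    output:
        "work/{sample}.csv"
    # The :q format specifier shell-quotes the substituted paths, so names
    # containing whitespace are passed safely to the command.
    shell:
        "cp {input:q} {output:q}"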
### Changed
- Snakemake now deletes output files before job execution. Further, it touches output files after job execution. This solves various problems with slow NFS filesystems.
- A bug was fixed that caused dynamic output rules to be executed multiple times when forcing their execution with -R.
- A bug causing double uploads with remote files was fixed. Various additional bug fixes related to remote files.
- Various minor bug fixes.
## [3.7.1] - 2016-05-16
### Changed
- Fixed a missing import of the multiprocessing module.
## [3.7.0] - 2016-05-05
### Added
- The entries in `resources` and the `threads` job attribute can now be callables that must return `int` values.
- Multiple `--cluster-config` arguments can be given to the Snakemake command line. Later ones override earlier ones.
- In the API, multiple `cluster_config` paths can be given as a list, alternatively to the previous behaviour of expecting one string for this parameter.
- When submitting cluster jobs (either through `--cluster` or `--drmaa`), you can now use `--max-jobs-per-second` to limit the number of jobs being submitted (also available through Snakemake API). Some cluster installations have problems with too many jobs per second.
- Wildcard values are now printed upon job execution in addition to input and output files.
### Changed
- Fixed a bug with HTTP remote providers.
## [3.6.1] - 2016-04-08
### Changed
- Work around missing RecursionError in Python < 3.5
- Improved conversion of numpy and pandas data structures to R scripts.
- Fixed locking of working directory.
## [3.6.0] - 2016-03-10
### Added
- onstart handler, which allows adding code that is only executed before the actual workflow execution, not on dry-runs (see the sketch after this list).
- Parameters defined in the cluster config file are now accessible in the job properties under the key "cluster".
- The wrapper directive can be considered stable.
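A minimal sketch of the handler, placed at the top level of a Snakefile (the message is illustrative):
onstart:
    # Runs once before the first job is executed, but not on dry-runs.
    print("Workflow started")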
### Changed
- Allow using rule/job parameters with braces notation in cluster config.
- Show a proper error message in case of recursion errors.
- Remove non-empty temp dirs.
- Don't set the process group of Snakemake in order to allow kill signals from parent processes to be propagated.
- Fixed various corner case bugs.
- The params directive no longer converts a list ``l`` implicitly to ``" ".join(l)``.
## [3.5.5] - 2016-01-23
### Added
- New experimental wrapper directive, which allows referring to re-usable [wrapper scripts](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wrappers). Wrappers are provided in the [Snakemake Wrapper Repository](https://bitbucket.org/snakemake/snakemake-wrappers) (see the sketch after this list).
- David Koppstein implemented two new command line options to constrain the execution of the DAG of jobs to sub-DAGs (--until and --omit-from).
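A minimal sketch of the wrapper directive (the version tag and wrapper path are illustrative, not taken from the repository):
rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    # The rule body is replaced by a reusable script fetched from the wrapper repository.
    wrapper:
        "0.2.0/bio/samtools/sort"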
### Changed
- Fixed various bugs, e.g. with shadow jobs and --latency-wait.
## [3.5.4] - 2015-12-04
### Changed
- The params directive now fully supports non-string parameters. Several bugs in the remote support were fixed.
## [3.5.3] - 2015-11-24
### Changed
- The missing remote module was added to the package.
## [3.5.2] - 2015-11-24
### Added
- Support for easy integration of external R and Python scripts via the new [script directive](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-external-scripts) (see the sketch after this list).
- Chris Tomkins-Tinch has implemented support for remote files: Snakemake can now handle input and output files from Amazon S3, Google Storage, FTP, SFTP, HTTP and Dropbox.
- Simon Ye has implemented support for sandboxing jobs with [shadow rules](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-shadow-rules).
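A minimal sketch of the script directive (paths are illustrative); inside scripts/plot.py, a global snakemake object exposes snakemake.input and snakemake.output:
rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    script:
        "scripts/plot.py"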
### Changed
- Manuel Holtgrewe has fixed dynamic output files in combination with multiple wildcards.
- It is now possible to add suffixes to all shell commands with shell.suffix("mysuffix").
- Job execution has been refactored to spawn processes only when necessary, resolving several problems in combination with huge workflows consisting of thousands of jobs and reducing the memory footprint.
- In order to reflect the new collaborative development model, Snakemake has moved from my personal bitbucket account to http://snakemake.bitbucket.org.
## [3.4.2] - 2015-09-12
### Changed
- Willem Ligtenberg has reduced the memory usage of Snakemake.
- Per Unneberg has improved config file handling to provide a more intuitive overwrite behavior.
- Simon Ye has improved the test suite of Snakemake and helped with setting up continuous integration via Codeship.
- The cluster implementation has been rewritten to use only a single thread to wait for jobs. This avoids failures with large numbers of jobs.
- Benchmarks are now writing tab-delimited text files instead of JSON.
- Snakemake now always requires setting the number of jobs with -j when in cluster mode. Set this to a high value if your cluster does not have restrictions.
- The Snakemake Conda package has been moved to the bioconda channel.
- The handling of symlinks was improved, which made it necessary to raise the minimum required Python version to 3.3.
## [3.4.1] - 2015-08-05
### Changed
- This release fixes a bug that caused named input or output files to always be returned as lists instead of single files.
## [3.4] - 2015-07-18
### Added
- This release adds support for executing jobs on clusters in synchronous mode (e.g. qsub -sync). Thanks to David Alexander for implementing this.
- There is now vim syntax highlighting support (thanks to Jay Hesselberth).
- Snakemake is now available as Conda package.
### Changed
- Lots of bugs have been fixed. Thanks go to e.g. David Koppstein, Marcel Martin, John Huddleston and Tao Wen for helping with useful reports and debugging.
See [here](https://bitbucket.org/snakemake/snakemake/wiki/News-Archive) for older changes.
License¶
Snakemake is licensed under the MIT License:
Copyright (c) 2016 Johannes Köster <johannes.koester@tu-dortmund.de>
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.