Tutorial: General use

This tutorial introduces the text-based workflow system Snakemake. Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that specify how to create output files from input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.

Snakemake sets itself apart from other text-based workflow systems in several ways. By hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python with syntax to define rules and workflow-specific properties. This allows Snakemake to combine the flexibility of a plain scripting language with a pythonic workflow definition. Python is known to be concise yet readable and can appear almost like pseudocode, and the syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake’s scheduling algorithm can be constrained by priorities, provided cores, and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single-core workstations and multi-core servers to cluster or batch systems. Finally, Snakemake integrates with package managers and container runtimes, so that defining the software stack becomes part of the workflow itself.
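
To make the rule-based paradigm described above more concrete, the following plain-Python sketch shows how dependencies between rules could be inferred purely from their input and output files, and how an execution order follows from the resulting DAG. This is a toy model and not Snakemake's actual implementation or syntax: the rule names, the file names, and the helper functions `build_dag` and `execution_order` are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each rule lists the files it reads and the files it creates.
RULES = {
    "map_reads": {"input": ["genome.fa", "reads.fastq"], "output": ["mapped.bam"]},
    "sort_reads": {"input": ["mapped.bam"], "output": ["sorted.bam"]},
    "call_variants": {"input": ["genome.fa", "sorted.bam"], "output": ["calls.vcf"]},
}


def build_dag(rules):
    """Map each rule to the set of rules that produce one of its input files."""
    # Record which rule creates each output file.
    producer = {
        out: name for name, rule in rules.items() for out in rule["output"]
    }
    # A rule depends on every rule that produces one of its inputs.
    # Inputs without a producer (e.g. raw data) are assumed to exist already.
    return {
        name: {producer[f] for f in rule["input"] if f in producer}
        for name, rule in rules.items()
    }


def execution_order(rules):
    """Return the rule names in an order that respects all dependencies."""
    return list(TopologicalSorter(build_dag(rules)).static_order())


if __name__ == "__main__":
    for rule_name in execution_order(RULES):
        print(rule_name)
```

In this sketch no rule names another rule as a dependency. The ordering follows from matching file names alone, which is the idea behind the Make paradigm. Rules that do not depend on each other could be run in parallel.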

In this tutorial, we will design and execute our first workflow with Snakemake, going over the basic concepts and building blocks of the system.

The examples presented in this tutorial come from bioinformatics. However, Snakemake is a general-purpose workflow management system for any discipline. We have tried to ensure that no bioinformatics knowledge is needed to understand the tutorial.

There are also slides available for this tutorial. As a follow-up to this tutorial, we recommend having a look at the interaction, visualization and reporting tutorial, which focuses on Snakemake’s ability to cover the last mile of data analysis, i.e., the generation of publication-ready reports and interactive visualizations.