Executing Snakemake¶

This part of the documentation describes the snakemake executable. Snakemake is primarily a command-line tool, so the snakemake executable is the primary way to execute, debug, and visualize workflows.

Useful Command Line Arguments¶

If called without parameters, i.e.

$ snakemake

Snakemake tries to execute the workflow specified in a file called Snakefile in the same directory (instead, the Snakefile can be given via the parameter -s).

By issuing

$ snakemake -n

a dry-run can be performed. This is useful to test if the workflow is defined properly and to estimate the amount of needed computation. Further, the reason for each rule execution can be printed via

$ snakemake -n -r

Importantly, Snakemake can automatically determine which parts of the workflow can be run in parallel. By specifying the number of available cores, i.e.

$ snakemake -j 4

one can tell Snakemake to use up to 4 cores and solve a binary knapsack problem to optimize the scheduling of jobs. If the number is omitted (i.e., only -j is given), the number of used cores is determined as the number of available CPU cores in the machine.

Cloud Support¶

Snakemake 4.0 and later supports experimental execution in the cloud via Kubernetes. This is independent of the cloud provider, but we provide the setup steps for GCE below.

Google cloud engine¶

First, install the Google Cloud SDK. Then, run

$ gcloud init

to setup your access. Then, you can create a new kubernetes cluster via

$ gcloud container clusters create $CLUSTER_NAME --num-nodes=$NODES --scopes storage-rw

with $CLUSTER_NAME being the cluster name and $NODES being the number of cluster nodes. If you intent to use google storage, make sure that –scopes storage-rw is set. This enables Snakemake to write to the google storage from within the cloud nodes. Next, you configure Kubernetes to use the new cluster via

$ gcloud container clusters get-credentials $CLUSTER_NAME

Now, Snakemake is ready to use your cluster.

Important: After finishing your work, do not forget to delete the cluster with

$ gcloud container clusters delete $CLUSTER_NAME

in order to avoid unnecessary charges.

Executing a Snakemake workflow via kubernetes¶

Assuming that kubernetes has been properly configured (see above), you can execute a workflow via:

snakemake --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX

In this mode, Snakemake will assume all input and output files to be stored in a given remote location, configured by setting $REMOTE to your provider of choice (e.g. GS for Google cloud storage or S3 for Amazon S3) and $PREFIX to a bucket name or subfolder within that remote storage. After successful execution, you find your results in the specified remote storage. Of course, if any input or output already defines a different remote location, the latter will be used instead. Importantly, this means that Snakemake does not require a shared network filesystem to work in the cloud.

It is further possible to forward arbitrary environment variables to the kubernetes jobs via the flag --kubernetes-env (see snakemake --help).

When executing, Snakemake will make use of the defined resources and threads to schedule jobs to the correct nodes. In particular, it will forward memory requirements defined as mem_mb to kubernetes. Further, it will propagate the number of threads a job intends to use, such that kubernetes can allocate it to the correct cloud computing node.

Cluster Execution¶

Snakemake can make use of cluster engines that support shell scripts and have access to a common filesystem, (e.g. the Sun Grid Engine). In this case, Snakemake simply needs to be given a submit command that accepts a shell script as first positional argument:

$ snakemake --cluster qsub -j 32

Here, -j denotes the number of jobs submitted being submitted to the cluster at the same time (here 32). The cluster command can be decorated with job specific information, e.g.

$ snakemake --cluster "qsub {threads}"

Thereby, all keywords of a rule are allowed (e.g. params, input, output, threads, priority, ...). For example, you could encode the expected running time into params:

rule:
    input:  ...
    output: ...
    params: runtime="4h"
    shell: ...

and forward it to the cluster scheduler:

$ snakemake --cluster "qsub --runtime {params.runtime}"

If your cluster system supports DRMAA, Snakemake can make use of that to increase the control over jobs. E.g. jobs can be cancelled upon pressing Ctrl+C, which is not possible with the generic --cluster support. With DRMAA, no qsub command needs to be provided, but system specific arguments can still be given as a string, e.g.

$ snakemake --drmaa " -q username" -j 32

Note that the string has to contain a leading whitespace. Else, the arguments will be interpreted as part of the normal Snakemake arguments, and execution will fail.

When executing a workflow on a cluster using the --cluster parameter (see below), Snakemake creates a job script for each job to execute. This script is then invoked using the provided cluster submission command (e.g. qsub). Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters. As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script. For convenience, there exists a parser function snakemake.utils.read_job_properties that can be used to access the properties. The following shows an example job submission wrapper:

#!python

#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# do something useful with the threads
threads = job_properties[threads]

# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]

os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))

Profiles¶

Adapting Snakemake to a particular environment can entail many flags and options. Therefore, since Snakemake 4.1, it is possible to specify a configuration profile to be used to obtain default options:

$ snakemake --profile myprofile

Here, a folder myprofile is searched in per-user and global configuration directories (on Linux, this will be $HOME/.config/snakemake and /etc/xdg/snakemake, you can find the answer for your system via snakemake --help). Alternatively, an absolute or relative path to the folder can be given. The profile folder is expected to contain a file config.yaml that defines default values for the Snakemake command line arguments. For example, the file

cluster: qsub
jobs: 100

would setup Snakemake to always submit to the cluster via the qsub command, and never use more than 100 parallel jobs in total. Under https://github.com/snakemake-profiles/doc, you can find publicly available profiles. Feel free to contribute your own.

The profile folder can additionally contain auxilliary files, e.g., jobscripts, or any kind of wrappers. See https://github.com/snakemake-profiles/doc for examples.

Visualization¶

To visualize the workflow, one can use the option --dag. This creates a representation of the DAG in the graphviz dot language which has to be postprocessed by the graphviz tool dot. E.g. to visualize the DAG that would be executed, you can issue:

$ snakemake --dag | dot | display

For saving this to a file, you can specify the desired format:

$ snakemake --dag | dot -Tpdf > dag.pdf

To visualize the whole DAG regardless of the eventual presence of files, the forceall option can be used:

$ snakemake --forceall --dag | dot -Tpdf > dag.pdf

Of course the visual appearance can be modified by providing further command line arguments to dot.

All Options¶

All command line options can be printed by calling snakemake -h.

Snakemake is a Python based language and execution environment for GNU Make-like workflows.

usage: snakemake [-h] [--profile PROFILE] [--snakefile FILE] [--gui [PORT]]
                 [--cores [N]] [--local-cores N]
                 [--resources [NAME=INT [NAME=INT ...]]]
                 [--config [KEY=VALUE [KEY=VALUE ...]]] [--configfile FILE]
                 [--list] [--list-target-rules] [--directory DIR] [--dryrun]
                 [--printshellcmds] [--debug-dag] [--dag]
                 [--force-use-threads] [--rulegraph] [--d3dag] [--summary]
                 [--detailed-summary] [--archive FILE] [--touch]
                 [--keep-going] [--force] [--forceall]
                 [--forcerun [TARGET [TARGET ...]]]
                 [--prioritize TARGET [TARGET ...]]
                 [--until TARGET [TARGET ...]]
                 [--omit-from TARGET [TARGET ...]] [--allow-ambiguity]
                 [--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
                 [--drmaa-log-dir DIR] [--cluster-config FILE]
                 [--immediate-submit] [--jobscript SCRIPT] [--jobname NAME]
                 [--cluster-status CLUSTER_STATUS] [--kubernetes [NAMESPACE]]
                 [--kubernetes-env ENVVAR [ENVVAR ...]]
                 [--container-image IMAGE] [--reason] [--stats FILE]
                 [--nocolor] [--quiet] [--nolock] [--unlock]
                 [--cleanup-metadata FILE [FILE ...]] [--rerun-incomplete]
                 [--ignore-incomplete] [--list-version-changes]
                 [--list-code-changes] [--list-input-changes]
                 [--list-params-changes] [--latency-wait SECONDS]
                 [--wait-for-files [FILE [FILE ...]]] [--benchmark-repeats N]
                 [--notemp] [--keep-remote] [--keep-target-files]
                 [--keep-shadow]
                 [--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
                 [--max-jobs-per-second MAX_JOBS_PER_SECOND]
                 [--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
                 [--restart-times RESTART_TIMES] [--attempt ATTEMPT]
                 [--timestamp] [--greediness GREEDINESS] [--no-hooks]
                 [--print-compilation]
                 [--overwrite-shellcmd OVERWRITE_SHELLCMD] [--verbose]
                 [--debug] [--runtime-profile FILE] [--mode {0,1,2}]
                 [--bash-completion] [--use-conda] [--conda-prefix DIR]
                 [--create-envs-only] [--wrapper-prefix WRAPPER_PREFIX]
                 [--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal}]
                 [--default-remote-prefix DEFAULT_REMOTE_PREFIX]
                 [--no-shared-fs] [--version]
                 [target [target ...]]

Positional Arguments¶

target

Targets to build. May be rules or files.

Named Arguments¶

`–profile`	Name of profile to use for configuring Snakemake. Snakemake will search for a corresponding folder in /etc/xdg/snakemake and /home/docs/.config/snakemake. Alternatively, this can be an absolute or relative path. The profile folder has to contain a file ‘config.yaml’. This file can be used to set default values for command line options in YAML format. For example, ‘–cluster qsub’ becomes ‘cluster: qsub’ in the YAML file. Profiles can be obtained from https://github.com/snakemake-profiles.
`–snakefile, -s`
	The workflow definition in a snakefile. Default: “Snakefile”
`–gui`	Serve an HTML based user interface to the given network and port e.g. 168.129.10.15:8000. By default Snakemake is only available in the local network (default port: 8000). To make Snakemake listen to all ip addresses add the special host address 0.0.0.0 to the url (0.0.0.0:8000). This is important if Snakemake is used in a virtualised environment like Docker. If possible, a browser window is opened.
`–cores, –jobs, -j`
	Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.
`–local-cores`	In cluster mode, use at most N cores of the host machine in parallel (default: number of CPU cores of the host). The cores are used to execute local rules. This option is ignored when not in cluster mode. Default: 4
`–resources, –res`
	Define additional resources that shall constrain the scheduling analogously to threads (see above). A resource is defined as a name and an integer value. E.g. –resources gpu=1. Rules can use resources by defining the resource keyword, e.g. resources: gpu=1. If now two rules require 1 of the resource ‘gpu’ they won’t be run in parallel by the scheduler.
`–config, -C`	Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file (see Documentation).
`–configfile`	Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow.
`–list, -l`	Show availiable rules in given Snakefile. Default: False
`–list-target-rules, –lt`
	Show available target rules in given Snakefile. Default: False
`–directory, -d`
	Specify working directory (relative paths in the snakefile will use this as their origin).
`–dryrun, -n`	Do not execute anything. Default: False
`–printshellcmds, -p`
	Print out the shell commands that will be executed. Default: False
`–debug-dag`	Print candidate and selected jobs (including their wildcards) while inferring DAG. This can help to debug unexpected DAG topology or errors. Default: False
`–dag`	Do not execute anything and print the directed acyclic graph of jobs in the dot language. Recommended use on Unix systems: snakemake –dag \| dot \| display Default: False
`–force-use-threads`
	Force threads rather than processes. Helpful if shared memory (/dev/shm) is full or unavailable. Default: False
`–rulegraph`	Do not execute anything and print the dependency graph of rules in the dot language. This will be less crowded than above DAG of jobs, but also show less information. Note that each rule is displayed once, hence the displayed graph will be cyclic if a rule appears in several steps of the workflow. Use this if above option leads to a DAG that is too large. Recommended use on Unix systems: snakemake –rulegraph \| dot \| display Default: False
`–d3dag`	Print the DAG in D3.js compatible JSON format. Default: False
`–summary, -S`	Print a summary of all files created by the workflow. The has the following columns: filename, modification time, rule version, status, plan. Thereby rule version contains the versionthe file was created with (see the version keyword of rules), and status denotes whether the file is missing, its input files are newer or if version or implementation of the rule changed since file creation. Finally the last column denotes whether the file will be updated or created during the next workflow execution. Default: False
`–detailed-summary, -D`
	Print a summary of all files created by the workflow. The has the following columns: filename, modification time, rule version, input file(s), shell command, status, plan. Thereby rule version contains the versionthe file was created with (see the version keyword of rules), and status denotes whether the file is missing, its input files are newer or if version or implementation of the rule changed since file creation. The input file and shell command columns are selfexplanatory. Finally the last column denotes whether the file will be updated or created during the next workflow execution. Default: False
`–archive`	Archive the workflow into the given tar archive FILE. The archive will be created such that the workflow can be re-executed on a vanilla system. The function needs conda and git to be installed. It will archive every file that is under git version control. Note that it is best practice to have the Snakefile, config files, and scripts under version control. Hence, they will be included in the archive. Further, it will add input files that are not generated by by the workflow itself and conda environments. Note that symlinks are dereferenced. Supported formats are .tar, .tar.gz, .tar.bz2 and .tar.xz.
`–touch, -t`	Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of snakemake. Fails if a file does not yet exist. Default: False
`–keep-going, -k`
	Go on with independent jobs if a job fails. Default: False
`–force, -f`	Force the execution of the selected target or the first rule regardless of already created output. Default: False
`–forceall, -F`	Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output. Default: False
`–forcerun, -R`	Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.
`–prioritize, -P`
	Tell the scheduler to assign creation of given targets (and all their dependencies) highest priority. (EXPERIMENTAL)
`–until, -U`	Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs.
`–omit-from, -O`
	Prevent the execution or creation of the given rules or files as well as any rules or files that are downstream of these targets in the DAG. Also runs jobs in sibling DAGs that are independent of the rules or files specified here.
`–allow-ambiguity, -a`
	Don’t check for ambiguous rules and simply use the first if several can produce the same file. This allows the user to prioritize rules by their order in the snakefile. Default: False
`–cluster, -c`	Execute snakemake rules with the given submit command, e.g. qsub. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. The submit command can be decorated to make it aware of certain job properties (input, output, params, wildcards, log, threads and dependencies (see the argument below)), e.g.: $ snakemake –cluster ‘qsub -pe threaded {threads}’.
`–cluster-sync`	cluster submission command will block, returning the remote exitstatus upon remote termination (for example, this should be usedif the cluster command is ‘qsub -sync y’ (SGE)
`–drmaa`	Execute snakemake on a cluster accessed via DRMAA, Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. ARGS can be used to specify options of the underlying cluster system, thereby using the job properties input, output, params, wildcards, log, threads and dependencies, e.g.: –drmaa ‘ -pe threaded {threads}’. Note that ARGS must be given in quotes and with a leading whitespace.
`–drmaa-log-dir`
	Specify a directory in which stdout and stderr files of DRMAA jobs will be written. The value may be given as a relative path, in which case Snakemake will use the current invocation directory as the origin. If given, this will override any given ‘-o’ and/or ‘-e’ native specification. If not given, all DRMAA stdout and stderr files are written to the current working directory.
`–cluster-config, -u`
	A JSON or YAML file that defines the wildcards used in ‘cluster’for specific rules, instead of having them specified in the Snakefile. For example, for rule ‘job’ you may define: { ‘job’ : { ‘time’ : ‘24:00:00’ } } to specify the time for rule ‘job’. You can specify more than one file. The configuration files are merged with later values overriding earlier ones. Default: []
`–immediate-submit, –is`
	Immediately submit all jobs to the cluster instead of waiting for present input files. This will fail, unless you make the cluster aware of job dependencies, e.g. via: $ snakemake –cluster ‘sbatch –dependency {dependencies}. Assuming that your submit script (here sbatch) outputs the generated job id to the first stdout line, {dependencies} will be filled with space separated job ids this job depends on. Default: False
`–jobscript, –js`
	Provide a custom job script for submission to the cluster. The default script resides as ‘jobscript.sh’ in the installation directory.
`–jobname, –jn`
	Provide a custom name for the jobscript that is submitted to the cluster (see –cluster). NAME is “snakejob.{rulename}.{jobid}.sh” per default. The wildcard {jobid} has to be present in the name. Default: “snakejob.{rulename}.{jobid}.sh”
`–cluster-status`
	Status command for cluster execution. This is only considered in combination with the –cluster flag. If provided, Snakemake will use the status command to determine if a job has finished successfully or failed. For this it is necessary that the submit command provided to –cluster returns the cluster job id. Then, the status command will be invoked with the job id. Snakemake expects it to return ‘success’ if the job was successfull, ‘failed’ if the job failed and ‘running’ if the job still runs.
`–kubernetes`	Execute workflow in a kubernetes cluster (in the cloud). NAMESPACE is the namespace you want to use for your job (if nothing specified: ‘default’). Usually, this requires –default-remote-provider and –default-remote-prefix to be set to a S3 or GS bucket where your . data shall be stored. It is further advisable to activate conda integration via –use-conda.
`–kubernetes-env`
	Specify environment variables to pass to the kubernetes job. Default: []
`–container-image`
	Docker image to use, e.g., when submitting jobs to kubernetes. By default, this is ‘quay.io/snakemake/snakemake’, tagged with the same version as the currently running Snakemake instance. Note that overwriting this value is up to your responsibility. Any used image has to contain a working snakemake installation that is compatible with (or ideally the same as) the currently running version.
`–reason, -r`	Print the reason for each executed rule. Default: False
`–stats`	Write stats about Snakefile execution in JSON format to the given file.
`–nocolor`	Do not use a colored output. Default: False
`–quiet, -q`	Do not output any progress or rule information. Default: False
`–nolock`	Do not lock the working directory Default: False
`–unlock`	Remove a lock on the working directory. Default: False
`–cleanup-metadata, –cm`
	Cleanup the metadata of given files. That means that snakemake removes any tracked version info, and any marks that files are incomplete.
`–rerun-incomplete, –ri`
	Re-run all jobs the output of which is recognized as incomplete. Default: False
`–ignore-incomplete, –ii`
	Do not check for incomplete output files. Default: False
`–list-version-changes, –lv`
	List all output files that have been created with a different version (as determined by the version keyword). Default: False
`–list-code-changes, –lc`
	List all output files for which the rule body (run or shell) have changed in the Snakefile. Default: False
`–list-input-changes, –li`
	List all output files for which the defined input files have changed in the Snakefile (e.g. new input files were added in the rule definition or files were renamed). For listing input file modification in the filesystem, use –summary. Default: False
`–list-params-changes, –lp`
	List all output files for which the defined params have changed in the Snakefile. Default: False
`–latency-wait, –output-wait, -w`
	Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 5). Default: 5
`–wait-for-files`
	Wait –latency-wait seconds for these files to be present before executing the workflow. This option is used internally to handle filesystem latency in cluster environments.
`–benchmark-repeats`
	Repeat a job N times if marked for benchmarking (default 1). Default: 1
`–notemp, –nt`	Ignore temp() declarations. This is useful when running only a part of the workflow, since temp() would lead to deletion of probably needed files by other parts of the workflow. Default: False
`–keep-remote`	Keep local copies of remote input files. Default: False
`–keep-target-files`
	Do not adjust the paths of given target files relative to the working directory. Default: False
`–keep-shadow`	Do not delete the shadow directory on snakemake startup. Default: False
`–allowed-rules`
	Only consider given rules. If omitted, all rules in Snakefile are used. Note that this is intended primarily for internal use and may lead to unexpected results otherwise.
`–max-jobs-per-second`
	Maximal number of cluster/drmaa jobs per second, default is 10, fractions allowed. Default: 10
`–max-status-checks-per-second`
	Maximal number of job status checks per second, default is 10, fractions allowed. Default: 10
`–restart-times`
	Number of times to restart failing jobs (defaults to 0). Default: 0
`–attempt`	Internal use only: define the initial value of the attempt parameter (default: 1). Default: 1
`–timestamp, -T`
	Add a timestamp to all logging output Default: False
`–greediness`	Set the greediness of scheduling. This value between 0 and 1 determines how careful jobs are selected for execution. The default value (1.0) provides the best speed and still acceptable scheduling quality.
`–no-hooks`	Do not invoke onstart, onsuccess or onerror hooks after execution. Default: False
`–print-compilation`
	Print the python representation of the workflow. Default: False
`–overwrite-shellcmd`
	Provide a shell command that shall be executed instead of those given in the workflow. This is for debugging purposes only.
`–verbose`	Print debugging output. Default: False
`–debug`	Allow to debug rules with e.g. PDB. This flag allows to set breakpoints in run blocks. Default: False
`–runtime-profile`
	Profile Snakemake and write the output to FILE. This requires yappi to be installed.
`–mode`	Possible choices: 0, 1, 2 Set execution mode of Snakemake (internal use only). Default: 0
`–bash-completion`
	Output code to register bash completion for snakemake. Put the following in your .bashrc (including the accents): snakemake –bash-completion or issue it in an open terminal session. Default: False
`–use-conda`	If defined in the rule, create job specific conda environments. If this flag is not set, the conda directive is ignored. Default: False
`–conda-prefix`	Specify a directory in which the ‘conda’ and ‘conda-archive’ directories are created. These are used to store conda environments and their archives, respectively. If not supplied, the value is set to the ‘.snakemake’ directory relative to the invocation directory. If supplied, the –use-conda flag must also be set. The value may be given as a relative path, which will be extrapolated to the invocation directory, or as an absolute path.
`–create-envs-only`
	If specified, only creates the job-specific conda environments then exits. The –use-conda flag must also be set. Default: False
`–wrapper-prefix`
	Prefix for URL created from wrapper directive (default: https://bitbucket.org/snakemake/snakemake-wrappers/raw/). Set this to a different URL to use your fork or a local clone of the repository. Default: “https://bitbucket.org/snakemake/snakemake-wrappers/raw/“
`–default-remote-provider`
	Possible choices: S3, GS, FTP, SFTP, S3Mocked, gfal Specify default remote provider to be used for all input and output files that don’t yet specify one.
`–default-remote-prefix`
	Specify prefix for default remote provider. E.g. a bucket name. Default: “”
`–no-shared-fs`	Do not assume that jobs share a common file system. When this flag is activated, Snakemake will assume that the filesystem on a cluster node is not shared with other nodes. For example, this will lead to downloading remote files on each cluster node separately. Further, it won’t take special measures to deal with filesystem latency issues. This option will in most cases only make sense in combination with –default-remote-provider. Further, when using –cluster you will have to also provide –cluster-status. Only activate this if you know what you are doing. Default: False
`–version, -v`	show program’s version number and exit

Bash Completion¶

Snakemake supports bash completion for filenames, rulenames and arguments. To enable it globally, just append

`snakemake --bash-completion`

including the accents to your .bashrc. This only works if the snakemake command is in your path.