Snakemake
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments without modifying the workflow definition. Finally, Snakemake workflows can include a description of the required software, which is automatically deployed to any execution environment.
Quick Example
Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.
    rule targets:
        input:
            "plots/dataset1.pdf",
            "plots/dataset2.pdf"

    rule plot:
        input:
            "raw/{dataset}.csv"
        output:
            "plots/{dataset}.pdf"
        shell:
            "somecommand {input} {output}"
- Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.
- For each target and intermediate file, you create rules that define how they are created from input files.
- Snakemake determines the rule dependencies by matching file names.
- Input and output files can contain multiple named wildcards.
- Rules can use shell commands, plain Python code, or external Python or R scripts to create output files from input files.
- Snakemake workflows can be executed on workstations, clusters, the grid, and in the cloud without modification. Job scheduling can be constrained by arbitrary resources such as available CPU cores, memory, or GPUs.
- Snakemake can automatically deploy required software dependencies of a workflow using Conda or Singularity.
- Snakemake can access input and output files on Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP and iRODS, and can additionally read input files via HTTP and HTTPS.
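To illustrate how dependencies can be derived by matching file names, here is a minimal Python sketch. It is not Snakemake's actual implementation: the function names `pattern_to_regex` and `resolve_input` are hypothetical, each wildcard is assumed to match the regular expression `.+`, and a wildcard may appear only once per pattern. The sketch matches a requested target against the output pattern of the `plot` rule and fills the captured wildcard value into the input pattern.

```python
import re

# Patterns copied from the "plot" rule in the example above.
RULE_INPUT = "raw/{dataset}.csv"
RULE_OUTPUT = "plots/{dataset}.pdf"


def pattern_to_regex(pattern):
    """Turn 'plots/{dataset}.pdf' into a regex with one named group per wildcard."""
    # Splitting on a single capture group yields: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text between wildcards
        else:
            regex += f"(?P<{part}>.+)"      # assumption: a wildcard matches ".+"
    return re.compile(regex)


def resolve_input(target):
    """Return the input file needed for target, or None if the rule does not apply."""
    match = pattern_to_regex(RULE_OUTPUT).fullmatch(target)
    if match is None:
        return None
    # Fill the captured wildcard values into the input pattern.
    return RULE_INPUT.format(**match.groupdict())


print(resolve_input("plots/dataset1.pdf"))  # raw/dataset1.csv
```

Requesting `plots/dataset1.pdf` therefore implies that `raw/dataset1.csv` must exist or be produced by another rule, while a target that matches no output pattern, such as `notes.txt`, yields `None` in this sketch.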
Getting started
To get a first impression, see our introductory slides or watch the live demo video. News about Snakemake is published via Twitter. To learn Snakemake, please work through the Snakemake Tutorial and see the FAQ.
Support
- For releases, see the Changelog.
- Check the frequently asked questions (FAQ).
- If you have questions, please post them on Stack Overflow.
- To discuss Snakemake with other users, use the mailing list. Please do not post questions there; use Stack Overflow instead.
- For bugs and feature requests, please use the issue tracker.
- For contributions, visit Snakemake on Bitbucket and read the guidelines.
Resources
- Snakemake Wrappers Repository: a collection of reusable wrappers that let you quickly use popular tools from Snakemake rules and workflows.
- Snakemake Workflows Project: a collection of high-quality, modularized and reusable workflows. The provided code should also serve as a best-practice example of how to build production-ready workflows with Snakemake. Everybody is invited to contribute.
- Snakemake Profiles Project: Snakemake configuration profiles for various execution environments. Please consider contributing your own if your environment is not yet covered.
- Bioconda: can be used from Snakemake to create completely reproducible workflows by defining the software versions used and providing binaries.
Publications using Snakemake
The following is an incomplete list of publications that use Snakemake for their analyses. Please consider adding your own.
- Doris et al. 2018. Spt6 is required for the fidelity of promoter selection. Molecular Cell.
- Karlsson et al. 2018. Four evolutionary trajectories underlie genetic intratumoral variation in childhood cancer. Nature Genetics.
- Planchard et al. 2018. The translational landscape of Arabidopsis mitochondria. Nucleic Acids Research.
- Schult et al. 2018. Effect of UV irradiation on Sulfolobus acidocaldarius and involvement of the general transcription factor TFB3 in the early UV response. Nucleic Acids Research.
- Goormaghtigh et al. 2018. Reassessing the Role of Type II Toxin-Antitoxin Systems in Formation of Escherichia coli Type II Persister Cells. mBio.
- Ramirez et al. 2018. Detecting macroecological patterns in bacterial communities across independent studies of global soils. Nature Microbiology.
- Amato et al. 2018. Evolutionary trends in host physiology outweigh dietary niche in structuring primate gut microbiomes. The ISME Journal.
- Uhlitz et al. 2017. An immediate–late gene expression module decodes ERK signal duration. Molecular Systems Biology.
- Akkouche et al. 2017. Piwi Is Required during Drosophila Embryogenesis to License Dual-Strand piRNA Clusters for Transposon Repression in Adult Ovaries. Molecular Cell.
- Beatty et al. 2017. Giardia duodenalis induces pathogenic dysbiosis of human intestinal microbiota biofilms. International Journal for Parasitology.
- Meyer et al. 2017. Differential Gene Expression in the Human Brain Is Associated with Conserved, but Not Accelerated, Noncoding Sequences. Molecular Biology and Evolution.
- Lonardo et al. 2017. Priming of soil organic matter: Chemical structure of added compounds is more important than the energy content. Soil Biology and Biochemistry.
- Beisser et al. 2017. Comprehensive transcriptome analysis provides new insights into nutritional strategies and phylogenetic relationships of chrysophytes. PeerJ.
- Dimitrov et al. 2017. Successive DNA extractions improve characterization of soil microbial communities. PeerJ.
- de Bourcy et al. 2016. Phylogenetic analysis of the human antibody repertoire reveals quantitative signatures of immune senescence and aging. PNAS.
- Bray et al. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology.
- Etournay et al. 2016. TissueMiner: a multiscale analysis toolkit to quantify how cellular processes create tissue dynamics. eLife Sciences.
- Townsend et al. 2016. The Public Repository of Xenografts Enables Discovery and Randomized Phase II-like Trials in Mice. Cancer Cell.
- Burrows et al. 2016. Genetic Variation, Not Cell Type of Origin, Underlies the Majority of Identifiable Regulatory Differences in iPSCs. PLOS Genetics.
- Ziller et al. 2015. Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing. Nature Methods.
- Li et al. 2015. Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR. Genome Biology.
- Schmied et al. 2015. An automated workflow for parallel processing of large multiview SPIM recordings. Bioinformatics.
- Chung et al. 2015. Whole-Genome Sequencing and Integrative Genomic Analysis Approach on Two 22q11.2 Deletion Syndrome Family Trios for Genotype to Phenotype Correlations. Human Mutation.
- Kim et al. 2015. TUT7 controls the fate of precursor microRNAs by using three different uridylation mechanisms. The EMBO Journal.
- Park et al. 2015. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell.
- Břinda et al. 2015. RNF: a general framework to evaluate NGS read mappers. Bioinformatics.
- Břinda et al. 2015. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics.
- Spjuth et al. 2015. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct.
- Schramm et al. 2015. Mutational dynamics between primary and relapse neuroblastomas. Nature Genetics.
- Berulava et al. 2015. N6-Adenosine Methylation in MiRNAs. PLOS ONE.
- The Genome of the Netherlands Consortium 2014. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics.
- Patterson et al. 2014. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. Journal of Computational Biology.
- Fernández et al. 2014. H3K4me1 marks DNA regions hypomethylated during aging in human stem and differentiated cells. Genome Research.
- Köster et al. 2014. Massively parallel read mapping on GPUs with the q-group index and PEANUT. PeerJ.
- Chang et al. 2014. TAIL-seq: Genome-wide Determination of Poly(A) Tail Length and 3′ End Modifications. Molecular Cell.
- Althoff et al. 2013. MiR-137 functions as a tumor suppressor in neuroblastoma by downregulating KDM1A. International Journal of Cancer.
- Marschall et al. 2013. MATE-CLEVER: Mendelian-Inheritance-Aware Discovery and Genotyping of Midsize and Long Indels. Bioinformatics.
- Rahmann et al. 2013. Identifying transcriptional miRNA biomarkers by integrating high-throughput sequencing and real-time PCR data. Methods.
- Martin et al. 2013. Exome sequencing identifies recurrent somatic mutations in EIF1AX and SF3B1 in uveal melanoma with disomy 3. Nature Genetics.
- Czeschik et al. 2013. Clinical and mutation data in 12 patients with the clinical diagnosis of Nager syndrome. Human Genetics.
- Marschall et al. 2012. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics.