Provenance
By default, snakemake stores provenance information (metadata) in the .snakemake/metadata directory.
However, for workflows with large numbers of inputs/outputs, this can lead to issues with the underlying filesystem,
especially when using networked filesystems or filesystems with limitations on the number of files in a directory.
To address this, an experimental database-backed provenance system can be enabled with:
$ snakemake --persistence-backend db [--persistence-backend-db-url URL]
By default, this will store provenance information in a SQLite database located at .snakemake/metadata.db.
However, users can specify a different database URL using the --persistence-backend-db-url option, which supports any database backend supported by SQLAlchemy;
note that the database backend must support JSON columns (e.g. PostgreSQL, MySQL, SQLite 3).
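As a rough illustration of the JSON requirement, the following sketch checks whether the SQLite library bundled with your Python interpreter exposes the JSON functions. It uses only the standard-library sqlite3 module; the helper name sqlite_supports_json is hypothetical and not part of snakemake, and the check says nothing about how snakemake itself validates the backend.

```python
import sqlite3


def sqlite_supports_json() -> bool:
    """Return True if the linked SQLite library provides json_extract().

    Hypothetical helper for illustration only; not part of snakemake.
    """
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            """SELECT json_extract('{"a": 1}', '$.a')"""
        ).fetchone()
        return row[0] == 1
    except sqlite3.OperationalError:
        # Raised when the JSON functions are not compiled into SQLite.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("SQLite version:", sqlite3.sqlite_version)
    print("JSON functions available:", sqlite_supports_json())
```

If this reports that the JSON functions are unavailable, a dedicated database server is probably the safer choice for the database URL.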
Note on SQLite databases on networked filesystems: such a setup can experience lock contention and latency issues during highly parallel cluster execution. By default, snakemake automatically detects whether the database is located on a network filesystem and, if so, applies the following SQLite optimizations (see the SQLite documentation):
PRAGMA journal_mode=PERSIST: Overwrites the journal header with zeros instead of deleting the file.
PRAGMA synchronous=OFF: Hands data off to the OS immediately without waiting for disk syncs.
PRAGMA temp_store=MEMORY: Prevents the creation of temporary lock files over the network.
PRAGMA cache_size=-64000: Allocates up to ~64MB of RAM for the page cache to minimize network reads.
If a non-network filesystem is detected, snakemake instead uses PRAGMA synchronous=NORMAL and PRAGMA journal_mode=TRUNCATE.
To mitigate locking issues with SQLite, snakemake also sets PRAGMA busy_timeout to max(10s, latency_wait): 10 seconds by default, or the value of latency_wait if that is higher.
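The PRAGMA settings described above can be reproduced by hand, for example to test how a given filesystem behaves before running a large workflow. The following is a minimal sketch using only the standard-library sqlite3 module, not snakemake's actual implementation. The function names, the on_network_fs flag, and the default latency_wait of 5 seconds are assumptions made for illustration. SQLite expects busy_timeout in milliseconds, so the sketch converts from seconds.

```python
import sqlite3


def sqlite_pragmas(on_network_fs: bool, latency_wait: int = 5) -> list[str]:
    """Build the PRAGMA statements described in the text.

    Hypothetical helper; latency_wait is given in seconds.
    """
    # SQLite's busy_timeout is expressed in milliseconds.
    busy_timeout_ms = max(10, latency_wait) * 1000
    pragmas = [f"PRAGMA busy_timeout={busy_timeout_ms}"]
    if on_network_fs:
        pragmas += [
            "PRAGMA journal_mode=PERSIST",
            "PRAGMA synchronous=OFF",
            "PRAGMA temp_store=MEMORY",
            "PRAGMA cache_size=-64000",  # negative value = size in KiB
        ]
    else:
        pragmas += [
            "PRAGMA journal_mode=TRUNCATE",
            "PRAGMA synchronous=NORMAL",
        ]
    return pragmas


def apply_pragmas(conn: sqlite3.Connection, pragmas: list[str]) -> None:
    """Execute each PRAGMA statement on an open connection."""
    for statement in pragmas:
        conn.execute(statement)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    apply_pragmas(conn, sqlite_pragmas(on_network_fs=True))
    print(conn.execute("PRAGMA busy_timeout").fetchone()[0])
    conn.close()
```

Note that an in-memory database, as used in the usage lines, ignores the journal mode settings; point sqlite3.connect at a file on the filesystem in question to observe them.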
Whether lock contention becomes a problem depends on your specific infrastructure. If you encounter issues with SQLite, consider using a dedicated database server (such as PostgreSQL or MySQL) instead. Since this backend is experimental, finding the optimal setup for your cluster might require some experimentation. Use at your own risk.