
This course includes our updated coding exercises so you can practice your skills as you learn.
See a demo
snakemake is a language to build pipelines. It can be used for any type of industry and it is especially popular in bioinformatics.
snakemake helps you save time and headaches when building pipelines.
You have two options to find instructions on how to install conda:
1. Go to the Anaconda website and follow the instructions to install Anaconda (which contains conda)
2. Follow Section 2 of my course "Software package management with conda"
mkdir /Users/biomagician/OneDrive/Documents/udemy/snakemake/snakemake-tutorial
curl -L "INSERT URL OF snakemake tutorial compressed tarball" -o snakemake-tutorial-data.tar.gz
tar -xf snakemake-tutorial-data.tar.gz --strip 1 "*/data" "*/environment.yaml"
(For GNU/Linux)
tar --wildcards -xf snakemake-tutorial-data.tar.gz --strip 1 "*/data" "*/environment.yaml"
conda create --name snakemake-tutorial --channel bioconda --channel conda-forge --yes snakemake=7.14.0 jinja2=3.1.2 networkx=2.8.6 matplotlib=3.5.3 graphviz=5.0.1 bcftools=1.15.1 samtools=1.15.1 bwa=0.7.17 pysam=0.19.1
conda activate snakemake-tutorial
snakemake --help
Least efficient way
mkdir mapped_reads
bwa mem data/genome.fa data/samples/A.fastq > mapped_reads/A.sam
bwa mem data/genome.fa data/samples/B.fastq > mapped_reads/B.sam
bwa mem data/genome.fa data/samples/C.fastq > mapped_reads/C.sam
samtools view -Sb mapped_reads/A.sam > mapped_reads/A.bam
samtools view -Sb mapped_reads/B.sam > mapped_reads/B.bam
samtools view -Sb mapped_reads/C.sam > mapped_reads/C.bam
Inefficient way
bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam
bwa mem data/genome.fa data/samples/C.fastq | samtools view -Sb - > mapped_reads/C.bam
Snakemake efficiency
snakemake --cores 1 mapped_reads/{A,B,C}.bam
A requested target file is generated when it does not exist.
A requested target file is generated if its timestamp is older than the timestamp of an input file.
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
snakemake --cores 1 mapped_reads/A.bam mapped_reads/B.bam mapped_reads/C.bam -n
snakemake --cores 1 mapped_reads/{A,B,C}.bam -n
2 situations when the snakemake workflow is executed:
- the requested output files do no exist on disk (trivial situation)
- the timestamp of the requested output file is older than the timestamp of the corresponding input file
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
snakemake --cores 3 mapped_reads/{A,B,C}.bam -n
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule samtools_sort:
input:
'mapped_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam'
shell:
'samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}'
snakemake --cores 6 sorted_reads/{A,B,C}.bam -np
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule samtools_sort:
input:
'mapped_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam'
shell:
'samtools sort -T sorted_reads/{wildcards.sample} '
'-O bam {input} > {output}'
rule samtools_index:
input:
'sorted_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam.bai'
shell:
'samtools index {input}'
snakemake --dag sorted_reads/{A,B,C}.bam.bai | dot -Tsvg > dag.svg
snakemake --dag sorted_reads/{A,B,C}.bam.bai | dot -Tpdf > dag.pdf
snakemake --cores 6 sorted_reads/{A,B,C}.bam.bai -np
snakemake --dag sorted_reads/{A,B,C}.bam.bai | dot -Tsvg > dag.svg
snakemake --dag sorted_reads/{A,B,C}.bam.bai | dot -Tpdf > dag.pdf
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule samtools_sort:
input:
'mapped_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam'
shell:
'samtools sort -T sorted_reads/{wildcards.sample} '
'-O bam {input} > {output}'
rule samtools_index:
input:
bam = 'sorted_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam.bai'
shell:
'samtools index {input.bam}'
snakemake --cores 6 sorted_reads/{A,B,C}.bam.bai -np
SAMPLES = ['A', 'B', 'C']
REPLICATES = ['1', '2']
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule samtools_sort:
input:
'mapped_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam'
shell:
'samtools sort -T sorted_reads/{wildcards.sample} '
'-O bam {input} > {output}'
rule samtools_index:
input:
bam = 'sorted_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam.bai'
shell:
'samtools index {input.bam}'
rule bcftools_call:
input:
fa = 'data/genome.fa',
bam = expand('sorted_reads/{sample}_{replicate}.bam', sample = SAMPLES, replicate = REPLICATES),
bai = expand('sorted_reads/{sample}_{replicate}.bam.bai', sample = SAMPLES, replicate = REPLICATES),
output:
'calls/all.vcf'
shell:
'bcftools mpileup -f {input.fa} {input.bam} | '
'bcftools call -mv - > {output}'
SAMPLES = ['A', 'B', 'C']
rule bwa_map:
input:
'data/genome.fa',
'data/samples/{sample}.fastq'
output:
'mapped_reads/{sample}.bam'
shell:
'bwa mem {input} | samtools view -Sb - > {output}'
rule samtools_sort:
input:
'mapped_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam'
shell:
'samtools sort -T sorted_reads/{wildcards.sample} '
'-O bam {input} > {output}'
rule samtools_index:
input:
bam = 'sorted_reads/{sample}.bam'
output:
'sorted_reads/{sample}.bam.bai'
shell:
'samtools index {input.bam}'
rule bcftools_call:
input:
fa = 'data/genome.fa',
bam = expand('sorted_reads/{sample}.bam', sample = SAMPLES),
bai = expand('sorted_reads/{sample}.bam.bai', sample = SAMPLES),
output:
'calls/all.vcf'
shell:
'bcftools mpileup -f {input.fa} {input.bam} | '
'bcftools call -mv - > {output}'
rule plot_quals:
input:
'calls/all.vcf'
output:
'plots/quals.pdf'
script:
'scripts/plot-quals.py'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
alias snakemake='snakemake --cores 20000 --resources downloads=1 irods=10 --use-conda --printshellcmds --keep-going --reason --rerun-incomplete --restart-times 1'
R script job uses conda environment but R_LIBS environment variable is set. This is likely not intended, as R_LIBS can interfere with R packages deployed via conda. Consider running `unset R_LIBS` or remove it entirely before executing Snakemake.
snakemake --profile slurm -n
Error: profile given but no config.yaml found. Profile has to be given as either absolute path, relative path or name of a directory available in either /etc/xdg/snakemake or /mnt/home3/miska/cr517/.config/snakemake.
Course on snakemake. snakemake is a modern workflow language that is widely used in academic and industrial circles to build reproducible, legible, portable, interoperable and efficient pipelines in bioinformatics and beyond. The course closely follows the basic bioinformatics workflow described in the official snakemake tutorial but takes a step-by-step approach and delves deeply into each feature of the snakemake language. It covers:
- installation
- Snakefile
- rules
- directives: input, output, shell, script
- target files
- creation of a directed acyclic graph
This course does not cover:
- benchmarking
- conda directive
- snakemake profiles for cluster computers
- temporary files
- parameters
- resources
At the end of this course, you will be able to build a basic bioinformatics pipeline. This knowledge will be sufficient to make a positive difference in your day-to-day life as a bioinformatician. It will also prepare you for my advanced course on snakemake.
The course is primarily intended for bioinformaticians but it can also be useful for people from other fields who want to build pipelines.
The course can also be used as an introduction to the field of bioinformatics. In it, I use the concepts of "reads", "alignment", "BAM" files, "VCF" files, variant calls. However, note that I do not spend much time explaining those concepts and focus primarily on the snakemake language.