Snakemake example

Since Snakemake direclty supports python, pyrpipe libraries can be driectly imorted into snakemake. Advantage of using a workflow manager like Snakemake is that it can handle parallel job-scheduling and scale jobs on clusters.

Basic RNA-Seq example

A basic example of directly using pyrpipe with Snakemake is provided here. This example uses yeast RNA-Seq samples from the GEO accession GSE132425.

First run the following bash script to download the yeast reference genome from Ensembl. The last command also generated a star index with the downloaded genome under the refdatayeast_index directory.

1
2
3
4
5
6
7
8
9
#!/bin/bash
mkdir -p refdata
wget ftp://ftp.ensemblgenomes.org/pub/fungi/release-49/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O refdata/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/fungi/release-49/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz -O refdata/Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz
cd refdata
gunzip -f *.gz
#run star index
mkdir -p yeast_index
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir yeast_index --genomeFastaFiles Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa --genomeSAindexNbases 10

Next, we create the following snakefile.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
import yaml
import sys
import os
from pyrpipe import sra,qc,mapping,assembly

####Read config#####
configfile: "config.yaml"
DIR = config['DIR']
THREADS=config['THREADS']
##check required files
GENOME= config['genome']
GTF=config['gtf']
#####Read SRR ids######
with open ("runids.txt") as f:
    SRR=f.read().splitlines()

#Create pyrpipe objects
#parameters defined in ./params will be automatically loaded, threads will be replaced with the supplied value
tg=qc.Trimgalore(threads=THREADS)
star=mapping.Star(threads=THREADS)
st=assembly.Stringtie(threads=THREADS)

rule all:
    input:
            expand("{wd}/{sample}/Aligned.sortedByCoord.out_star_stringtie.gtf",sample=SRR,wd=DIR),

rule process:
    output:
            gtf="{wd}/{sample}/Aligned.sortedByCoord.out_star_stringtie.gtf"
    run:
            gtffile=str(output.gtf)
            srrid=gtffile.split("/")[1]
            sra.SRA(srrid,directory=DIR).trim(tg).align(star).assemble(st)

The above snakefile requires some additional files. A config.yaml file contains path to data and threads information.

DIR: "results"
THREADS: 5
genome: "refdata/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa"
gtf: "refdata/Saccharomyces_cerevisiae.R64-1-1.49.gff3"

Next the snakefile reads a file, runids.txt containing SRR accessions.

SRR9257163
SRR9257164
SRR9257165

Finally, we need to provide tool parameters for pyrpipe. Create a file ./params/star.yaml and specify the index in it

--genomeDir: ./refdata/yeast_index/

Now the snakefile could be run using the snakemake command e.g snakemake -j 8

pyrpipe_conf.yaml

Users can create a yaml file, pyrpipe_conf.yaml, to specify pyrpipe parameters, instead of directly passing them as command line arguments. When the pyrpipe_conf.yaml is found the pyrpipe specific arguments passed via the command-line are ignored. An example of pyrpipe_conf.yaml is shown below with the pyrpipe default values

dry: False          # Only print pyrpipe's commands and not execute anything through pyrpipe_engine module
threads: None       # Set the number of threads to use
force: False        # Force execution of commands if their target files already exist
params_dir: ./params # Directory containing parameters
logs: True          # Enable or disable pyrpipe logs
logs_dir: ./pyrpipe_logs    # Directory to save logs
verbose: False      # Display pyrpipe messages
memory: None        # Set memory to use (in GB)
safe: False         # Disable file deletion via pyrpipe commands
multiqc: False      # Automatically run multiqc after analysis