pyrpipe: A python package for RNA-Seq workflows¶
Author: Urminder Singh
Date: Jun 04, 2021
Version: 0.0.5
Introduction¶
pyrpipe (pronounced "pyre-pipe") is a Python package for easily developing computational pipelines in pure Python, in an object-oriented manner. pyrpipe provides an easy-to-use, object-oriented framework to import any UNIX executable command or tool into Python. pyrpipe also provides flexible and easy handling of tool options and parameters, and can automatically load tool parameters from .yaml files. This framework minimizes the commands a user has to write; instead, the tools are available as objects and are fully reusable. All commands executed via pyrpipe are extensively logged, and the logs can be compiled into reports using the pyrpipe_diagnostic tool.
To enable easy processing of RNA-Seq data, we have implemented specialized classes built on top of the pyrpipe framework. These classes provide high-level APIs to many popular RNA-Seq tools for easier and faster development of RNA-Seq pipelines. These pipelines are fully customizable, and users can easily add or replace tools using the pyrpipe framework. Finally, pyrpipe can be used on local computers or in HPC environments, and pyrpipe scripts can be easily integrated into workflow management systems such as Snakemake and Nextflow.
Get started with the tutorial: Tutorial or look at real data examples.
Key Features¶
- Import any UNIX executable command/tool in python
- Dry-runnable pipelines to check dependencies and commands before execution
- Flexible and robust handling of tool arguments and parameters
- Auto load parameters from .yaml files
- Easily override threads and memory options using global values
- Extensive logging, with MultiQC report generation for bioinformatics pipelines
- Specify GNU-make-like targets and verify the integrity of the targets
- Automatically resume pipelines/jobs from the point of interruption
- Easily integrated into workflow managers like Snakemake and Nextflow
Installation¶
To install the latest stable release via conda:
conda install -c bioconda pyrpipe
To install the latest stable release via pip:
pip install pyrpipe --upgrade
Install latest development version :
git clone https://github.com/urmi-21/pyrpipe.git
pip install -r pyrpipe/requirements.txt
pip install -e pyrpipe
See the Installation notes for details.
Examples and Case-Studies¶
Example usage and case studies with real data are provided on GitHub.
Installation¶
Note: See Create a new conda environment in the tutorial to learn how to install pyrpipe and the required tools in a conda environment.
Before installing pyrpipe, make sure the conda channels are added in the right order:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
pyrpipe is available through Bioconda and can be installed via:
conda install -c bioconda pyrpipe
pyrpipe is available on PyPI and can be installed via pip:
pip install pyrpipe
To install from source, clone the git repo:
git clone https://github.com/urmi-21/pyrpipe
Then install using pip:
pip install -r pyrpipe/requirements.txt
pip install -e pyrpipe
To run tests using pytest, execute the following from the pyrpipe root directory:
py.test
#or
py.test tests/<specific test file>
Tutorial¶
This tutorial provides an introduction to pyrpipe. New users can start here to get an idea of how to use pyrpipe APIs for RNA-Seq or other analysis. This tutorial is divided into smaller sections.
Section 1 describes the recommended way of installing pyrpipe and setting up conda environments.
Section 2 introduces the basic RNA-Seq API via a simple RNA-Seq processing pipeline example.
Section 3 provides information on how to extend pipelines using third-party tools and integrate pyrpipe into snakemake for more scalable workflows.
Subsequent sections detail pyrpipe_engine, pyrpipe_utils and pyrpipe_diagnostic, each of which provides helpful functionality.
Setting up the environment¶
The first step is to set up the environment in a way that ensures or maximizes reproducibility. In this tutorial we will use the conda environment manager to install Python, pyrpipe and the required tools and dependencies into a single environment. Note: conda must be installed on the system. For help with setting up conda, please see miniconda.
Create a new conda environment¶
To create a new conda environment with Python 3.8, execute the following command. We recommend sharing conda environment files with pipeline scripts to allow for reproducible analysis.
conda create -n pyrpipe python=3.8
Activate the newly created conda environment and install the required tools:
conda activate pyrpipe
conda install -c bioconda pyrpipe star=2.7.7a sra-tools=2.10.9 stringtie=2.1.4 trim-galore=0.6.6 orfipy=0.0.3 salmon=1.4.0
If the above command fails, please add the conda channels (see commands below) in the right order and then try again.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Using conda environment in yaml files¶
We have provided a yaml file listing the conda packages required to reproduce the pyrpipe environment. Users can use this file to create a conda environment and run pyrpipe. To create the conda environment, use pyrpipe_environment.yml:
conda env create -f pyrpipe_environment.yml
Users can easily export and share their own conda environment yaml files describing the conda environment. To export any conda environment as yaml, run the following command:
conda env export | grep -v "^prefix: " > environment.yml
To recreate the conda environment from environment.yml, use:
conda env create -f environment.yml
Automated installation of required tools¶
We have also provided a utility to install required RNA-Seq tools via a single command:
pyrpipe_diagnostic build-tools
Note: Users must verify the versions of the tools installed in the conda environment.
Setting up NCBI SRA-Tools¶
After installing sra-tools, please configure prefetch to save downloads to the public user-repository. This ensures that the prefetch command downloads data to the user-defined directory. To do this:
- Type vdb-config -i command in terminal to open the NCBI SRA-Tools configuration editor.
- Under the TOOLS tab, set prefetch downloads option to public user-repository
Users can easily test whether SRA-Tools has been set up properly by invoking the following command:
pyrpipe_diagnostic test
Basic RNA-Seq processing¶
After setting up the environment, one can import pyrpipe modules in Python and start using them. In this example we will use the A. thaliana paired-end RNA-Seq run SRR976159.
Many other examples are available on github
Required files¶
We need to download the reference genome and annotation for A. thaliana. This can also be done inside a Python script; for simplicity we just download these using the wget command from the terminal.
Code for the A. thaliana transcript assembly case study is available on GitHub.
wget ftp://ftp.ensemblgenomes.org/pub/release-46/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-46/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.46.gtf.gz
gunzip Arabidopsis_thaliana.TAIR10.46.gtf.gz
Simple pipeline¶
RNA-Seq processing is as easy as creating required objects and executing required functions. The following python script provides a basic example of using pyrpipe on publicly available RNA-Seq data.
from pyrpipe import sra,qc,mapping,assembly
#define some variables
run_id='SRR976159'
working_dir='example_output'
gen='Arabidopsis_thaliana.TAIR10.dna.toplevel.fa'
ann='Arabidopsis_thaliana.TAIR10.46.gtf'
star_index='star_index/athaliana'
#initialize objects
#creates a star object to use with threads
star=mapping.Star(index=star_index,genome=gen,threads=4)
#use trim_galore for trimming
trim_galore=qc.Trimgalore()
#Stringtie for assembly
stringtie=assembly.Stringtie(guide=ann)
#create SRA object; this will download fastq if it doesn't exist
srr_object=sra.SRA(run_id,directory=working_dir)
#create a pipeline using the objects
srr_object.trim(trim_galore).align(star).assemble(stringtie)
#The assembled transcripts are in srr_object.gtf
print('Final result',srr_object.gtf)
The above code defines a simple pipeline (in a single line) that:
- Downloads fastq files from NCBI-SRA
- Uses Trim Galore for trimming
- Uses STAR for alignment to the reference genome
- Uses Stringtie for assembly
A line by line explanation:
- Imports the required pyrpipe modules
- Defines the variables for the reference files, output directory, and STAR index. The output directory is used to store the downloaded RNA-Seq data and is the default directory for all results.
- Creates a Star object. It takes index and genome as parameters. It automatically verifies the index; if an index is not found, it uses the genome to build one and saves it to the index path provided.
- Creates a Trimgalore object
- Creates a Stringtie object
- Creates an SRA object. This represents the RNA-Seq data. If the raw data is not available on disk, it is auto-downloaded via fasterq-dump.
- Defines the pipeline, a series of chained operations. The SRA class implements the trim(), align() and assemble() methods:
- trim() takes a RNASeqQC type object and performs trimming via its perform_qc method; the trimmed fastq paths are updated in the SRA object
- align() takes an Aligner type object and performs alignment via its perform_alignment method. The resulting bam file path is stored in SRA.bam_path
- assemble() takes an Assembly type object and performs assembly via its perform_assembly method. The resulting gtf file path is stored in SRA.gtf
Executing the pipeline¶
To execute the pipeline defined above, save the python code in a file pipeline.py. The code can be executed as any python script using the python command:
python pipeline.py
Or it can be executed using the pyrpipe command, specifying the input script with the --in option:
pyrpipe --in pipeline.py
One can also specify pyrpipe-specific options:
python pipeline.py --threads 10 --dry-run
#OR
pyrpipe --in pipeline.py --threads 10 --dry-run
The above two commands are equivalent and instruct pyrpipe to use 10 threads. Thus 10 threads will be used for each tool except STAR, for which we explicitly specified 4 threads during object creation.
The other option provided here is --dry-run. This option turns off the pyrpipe_engine: any command passed to the pyrpipe_engine is not actually executed but only displayed on screen and logged. During a dry run, the Runnable class also verifies the file dependencies (if any). More details are provided in later chapters of the tutorial.
We recommend using the --dry-run option before actually starting a job, to make sure all parameters and dependencies are correct.
Specifying tool parameters¶
pyrpipe supports auto-loading of tool parameters specified in .yaml files. The .yaml files must be stored in a directory, which can be specified using the --param-dir option; the default is ./params. The files must be named <tool>.yaml, for example star.yaml for STAR parameters. These parameters are loaded during object creation, and the user can easily override them during execution.
Create a directory params in the current directory and make a file star.yaml inside params. Add the following to star.yaml and rerun pipeline.py using the --dry-run option.
--outSAMtype: BAM Unsorted SortedByCoordinate
--outSAMunmapped: Within
--genomeLoad: NoSharedMemory
--chimSegmentMin: 15
--outSAMattributes: NH HI AS nM NM ch
--outSAMattrRGline: ID:rg1 SM:sm1
If you did everything correctly, you will notice that the STAR commands now contain the specified parameters.
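The precedence described above (values passed at object creation override values loaded from `<tool>.yaml`) can be sketched with plain dictionaries. This is an illustrative stand-in, not pyrpipe's actual loader; `load_tool_params` and `merge_params` are hypothetical names, and the parser below handles only simple `--flag: value` lines:

```python
def load_tool_params(yaml_text):
    """Parse simple '--flag: value' lines (a tiny stand-in for a YAML loader)."""
    params = {}
    for line in yaml_text.strip().splitlines():
        key, _, value = line.partition(':')
        params[key.strip()] = value.strip()
    return params

def merge_params(yaml_params, user_params):
    """Parameters given at object creation override those loaded from <tool>.yaml."""
    merged = dict(yaml_params)
    merged.update(user_params)
    return merged

star_yaml = """
--outSAMtype: BAM Unsorted SortedByCoordinate
--runThreadN: 20
"""
final = merge_params(load_tool_params(star_yaml), {'--runThreadN': '5'})
print(final['--runThreadN'])  # 5 -- the user-supplied value wins
```

Parameters absent from both sources simply stay unset; the yaml file only provides defaults for whatever the user does not specify.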
Updating parameters dynamically¶
The parameters specified in the yaml file are overridden by any parameters provided during object creation. For example, consider a star.yaml specifying --runThreadN as 20:
--outSAMtype: BAM Unsorted SortedByCoordinate
--outSAMunmapped: Within
--genomeLoad: NoSharedMemory
--runThreadN: 20
Now consider creating a star object in the following scenarios:
star1=Star(index='index') #will use 20 threads as specified in the yaml file
star2=Star(index='index',threads=5) #will use 5 threads
star3=Star(index='index',**{'--runThreadN':'10'}) #will use 10 threads
star4=Star(index='index') #initialized with 20 threads
star4.run(...,**{'--runThreadN':'10'}) #will use 10 threads for this particular run; '--runThreadN':'20' remains in the star4 object
Importing tools into python¶
pyrpipe's Runnable class can be used to import any Unix command into Python. The Runnable class implements the run() method, which checks required dependencies, monitors execution, executes commands via pyrpipe_engine, and verifies target files. We will first download the E. coli genome file to use as input to orfipy.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz
Basic example¶
In this tutorial we will consider orfipy, a tool for fast and flexible ORF finding, and import it into Python.
from pyrpipe.runnable import Runnable
infile='GCF_000005845.2_ASM584v2_genomic.fna'
#create a Runnable object
orfipy=Runnable(command='orfipy')
#specify orfipy options; these can be specified in orfipy.yaml too
param={'--outdir':'orfipy_out','--procs':'3','--dna':'orfs.fa'}
orfipy.run(infile,**param)
Save the above script in example.py and execute it using python example.py --dry-run. pyrpipe should generate the orfipy command and display it on screen during the dry run. Running it without the --dry-run flag will generate the orfipy_out/orfs.fa file.
Requirements and targets¶
One can specify required dependencies and expected target files in the run() method. Replacing the call to run() with the following will verify both the required files and the target files. If the command is interrupted, pyrpipe will scan for locked target files and resume from where the pipeline was interrupted.
orfipy.run(infile,requires=infile,target='orfipy_out/orfs.fa',**param)
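The skip-if-target-exists behaviour that makes resuming possible can be sketched in a few lines of plain Python. This is a simplified illustration; `run_if_needed` is a hypothetical helper, and pyrpipe's own target verification is more thorough (for example, it also detects locked/partial target files):

```python
import os
import tempfile

def run_if_needed(target, command, force=False):
    """Execute command only if the target file doesn't already exist (or force is set)."""
    if not force and os.path.exists(target):
        return 'skipped'
    command()  # execute the step that produces the target
    return 'executed'

# usage: the second call is skipped because the first created the target
tmpdir = tempfile.mkdtemp()
target = os.path.join(tmpdir, 'orfs.fa')
make_target = lambda: open(target, 'w').close()
print(run_if_needed(target, make_target))  # executed
print(run_if_needed(target, make_target))  # skipped
```

If a pipeline is interrupted and rerun, every step whose target already exists falls through the "skipped" branch, so execution effectively resumes at the first incomplete step.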
Building APIs¶
Users can extend the Runnable class to create classes dedicated to specific tools. Extra functionality can be added by defining custom class behaviour. The RNA-Seq API is itself built on the pyrpipe framework in this way.
A small example is presented here, building a class for the orfipy tool.
from pyrpipe.runnable import Runnable
from pyrpipe import pyrpipe_utils as pu
from pyrpipe import sra
from pyrpipe import _threads,_mem,_force
import os

class Orfipy(Runnable):
    """
    Extends the Runnable class

    Attributes
    ----------
    """
    def __init__(self,*args,threads=None,mem=None,**kwargs):
        """
        Init an Orfipy object

        Parameters
        ----------
        *args : tuple
            Positional arguments to orfipy
        threads : int, optional
            Threads to use for orfipy. This will override the global --threads parameter supplied to pyrpipe. The default is None.
        mem : int, optional
            Maximum memory to use in MB. The default is None.
        **kwargs : dict
            Options for orfipy

        Returns
        -------
        None.
        """
        super().__init__(*args,command='orfipy',**kwargs)
        self._deps=[self._command]
        self._param_yaml='orfipy.yaml'
        #valid arguments for orfipy
        self._valid_args=['--min','--between-stops','--include-stop','--dna','--pep','--bed','--bed12','--procs','--chunk-size','--outdir']
        #resolve threads to use
        """
        The orfipy parameter for threads is --procs.
        If threads is passed to __init__() it is used;
        else if --procs is found in orfipy.yaml that is used;
        else if --procs is found in the **kwargs passed to __init__() it is used;
        else the default value, _threads, is used.
        If the default value is None, nothing is done.
        Afterwards, --procs and its value are stored in self._kwargs, and the _threads variable is stored in the Orfipy object.
        """
        self.resolve_parameter("--procs",threads,_threads,'_threads')
        #resolve memory to use
        """
        Default value is None --> if mem is not supplied, don't create the self._mem variable
        """
        self.resolve_parameter("--chunk-size",mem,None,'_mem')

    #now we write a custom function that can be used with an SRA object
    def find_orfs(self,sra_object):
        out_dir=sra_object.directory
        out_file=os.path.join(out_dir,sra_object.srr_accession+"_ORFs.bed")
        if not _force and pu.check_files_exist(out_file):
            pu.print_green('Target files {} already exist.'.format(out_file))
            return out_file
        #in this example run orfipy on only the first fastq file
        internal_args=(sra_object.fastq_path,)
        internal_kwargs={"--bed":sra_object.srr_accession+"_ORFs.bed","--outdir":out_dir}
        #call run
        status=self.run(*internal_args,objectid=sra_object.srr_accession,target=out_file,**internal_kwargs)
        if status:
            return out_file
        return ""
The Orfipy class we created above can be used directly with SRA type objects via the find_orfs function. We can also still use the run() method and provide any input to orfipy.
#create object
orfipy=Orfipy()
#use run()
orfipy.run('test.fa',**{'--dna':'d.fa','--outdir':'of_out'},requires='test.fa',target='of_out/d.fa')
#use the api function to work with SRA
srr=sra.SRA('SRR9257212')
orfipy.find_orfs(srr)
Now try passing orfipy parameters from an orfipy.yaml file. Create params/orfipy.yaml and add the following options to it.
--min: 36
--between-stops: True
--include-stop: True
Now re-run the python code and it will automatically read orfipy options from ./params/orfipy.yaml.
Snakemake example¶
Since Snakemake directly supports Python, pyrpipe libraries can be directly imported into Snakemake. The advantage of using a workflow manager like Snakemake is that it can handle parallel job scheduling and scale jobs on clusters.
Basic RNA-Seq example¶
A basic example of directly using pyrpipe with Snakemake is provided here. This example uses yeast RNA-Seq samples from the GEO accession GSE132425.
First, run the following bash script to download the yeast reference genome from Ensembl. The last command also generates a STAR index from the downloaded genome under the refdata/yeast_index directory.
#!/bin/bash
mkdir -p refdata
wget ftp://ftp.ensemblgenomes.org/pub/fungi/release-49/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O refdata/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/fungi/release-49/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz -O refdata/Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz
cd refdata
gunzip -f *.gz
#run star index
mkdir -p yeast_index
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir yeast_index --genomeFastaFiles Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa --genomeSAindexNbases 10
Next, we create the following snakefile.
import yaml
import sys
import os
from pyrpipe import sra,qc,mapping,assembly

####Read config#####
configfile: "config.yaml"
DIR = config['DIR']
THREADS=config['THREADS']
##check required files
GENOME= config['genome']
GTF=config['gtf']
#####Read SRR ids######
with open ("runids.txt") as f:
    SRR=f.read().splitlines()
#Create pyrpipe objects
#parameters defined in ./params will be automatically loaded; threads will be replaced with the supplied value
tg=qc.Trimgalore(threads=THREADS)
star=mapping.Star(threads=THREADS)
st=assembly.Stringtie(threads=THREADS)

rule all:
    input:
        expand("{wd}/{sample}/Aligned.sortedByCoord.out_star_stringtie.gtf",sample=SRR,wd=DIR),

rule process:
    output:
        gtf="{wd}/{sample}/Aligned.sortedByCoord.out_star_stringtie.gtf"
    run:
        gtffile=str(output.gtf)
        srrid=gtffile.split("/")[1]
        sra.SRA(srrid,directory=DIR).trim(tg).align(star).assemble(st)
The above snakefile requires some additional files. A config.yaml file contains paths to the data and the number of threads:
DIR: "results"
THREADS: 5
genome: "refdata/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa"
gtf: "refdata/Saccharomyces_cerevisiae.R64-1-1.49.gff3"
Next, the snakefile reads a file, runids.txt, containing SRR accessions:
SRR9257163
SRR9257164
SRR9257165
Finally, we need to provide tool parameters for pyrpipe. Create a file ./params/star.yaml and specify the index in it
--genomeDir: ./refdata/yeast_index/
Now the snakefile can be run using the snakemake command, e.g. snakemake -j 8.
pyrpipe_conf.yaml¶
Users can create a yaml file, pyrpipe_conf.yaml, to specify pyrpipe parameters instead of passing them directly as command-line arguments. When pyrpipe_conf.yaml is found, the pyrpipe-specific arguments passed via the command line are ignored. An example pyrpipe_conf.yaml with the pyrpipe default values is shown below.
dry: False # Only print pyrpipe's commands; do not execute anything through the pyrpipe_engine module
threads: None # Set the number of threads to use
force: False # Force execution of commands if their target files already exist
params_dir: ./params # Directory containing parameters
logs: True # Enable or disable pyrpipe logs
logs_dir: ./pyrpipe_logs # Directory to save logs
verbose: False # Display pyrpipe messages
memory: None # Set memory to use (in GB)
safe: False # Disable file deletion via pyrpipe commands
multiqc: False # Automatically run multiqc after analysis
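The resulting precedence (built-in defaults, overridden by command-line arguments, with pyrpipe_conf.yaml taking over when present) can be sketched as a small merge function. This is illustrative only; `resolve_options` is a hypothetical name, not pyrpipe's actual implementation:

```python
# A few of the pyrpipe defaults listed above
DEFAULTS = {'dry': False, 'threads': None, 'params_dir': './params'}

def resolve_options(cli_args, conf=None):
    """Merge option sources: when a conf file was found its values are used
    and command-line values are ignored; otherwise CLI values override defaults."""
    options = dict(DEFAULTS)
    if conf is not None:
        options.update(conf)   # pyrpipe_conf.yaml found: CLI args ignored
    else:
        options.update(cli_args)
    return options

print(resolve_options({'threads': 10}, conf={'threads': 4})['threads'])  # 4
print(resolve_options({'threads': 10})['threads'])  # 10
```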
RNA-Seq API¶
pyrpipe implements specialized classes for RNA-Seq processing. These classes are organized into different modules, each designed to capture steps integral to the analysis of RNA-Seq data, from downloading raw data to trimming, alignment, and assembly or quantification. Each of these modules implements classes corresponding to an RNA-Seq tool. These classes extend the Runnable class. Specialized functions are implemented such that analysis of RNA-Seq data is intuitive and easy to code. The following table provides details about pyrpipe's RNA-Seq related modules.
Module | Class | Purpose |
assembly | Assembly | Abstract class to represent Assembler type |
assembly | Stringtie | API to Stringtie |
assembly | Cufflinks | API to Cufflinks |
mapping | Aligner | Abstract class for Aligner type |
mapping | Star | API to Star |
mapping | Bowtie2 | API to Bowtie2 |
mapping | Hisat2 | API to Hisat2 |
qc | RNASeqQC | Abstract class for RNASeqQC type (quality control and trimming) |
qc | Trimgalore | API to Trim Galore |
qc | BBmap | API to bbduk.sh |
quant | Quant | Abstract class for Quantification type |
quant | Salmon | API to Salmon |
quant | Kallisto | API to Kallisto |
sra | SRA | Class to represent RNA-Seq data and API to NCBI SRA-Tools |
tools | Samtools | API to Samtools and other commonly used tools |
The SRA class¶
The SRA class, contained in the sra module, captures RNA-Seq data. It can automatically download RNA-Seq data from the NCBI SRA servers via the prefetch command. The SRA constructor can take an SRR accession, or paths to fastq or sra files, as arguments.
The main attributes and functions are defined in the following tables.
Attribute | Description |
fastq_path | Path to the fastq file. If single end this is the only fastq file. |
fastq2_path | Path to the second fastq file for paired end data. |
sra_path | Path to the sra file |
srr_accession | The SRR accession for RNA-Seq run |
layout | RNA-Seq layout, auto determined by SRA class. |
bam_path | Path to bam file after running the align() function |
gtf | Path to the gtf file after running assemble() function |
Function | Description |
__init__() | This is the constructor. It can take an SRR accession, paths to fastq files, or an sra file as input. If an accession is provided as input, the files are downloaded via prefetch if they aren't present on disk. It automatically handles single-end and paired-end data. |
download_sra() | This function downloads the sra file via prefetch. |
download_fastq() | This function runs fasterq-dump on the sra file downloaded via prefetch. |
sra_exists() | Check if the sra file is present |
fastq_exists() | Check if fastq file exists |
delete_sra() | Delete the sra file |
delete_fastq() | Delete the fastq files |
trim() | This function takes a RNASeqQC type object and performs trimming. The trimmed fastq files are then stored in fastq_path and fastq2_path. |
align() | This function takes an Aligner type object and performs read alignment. The BAM file returned is stored in the bam_path attribute. |
assemble() | This function takes an Assembly type object and performs transcript assembly. The result is stored on the SRA object as the gtf attribute |
quant() | This function takes a Quant type object and performs quantification. |
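The chained style used throughout the tutorial (srr_object.trim(...).align(...).assemble(...)) works because each of these methods stores its result on the object and returns the object itself. A minimal, self-contained mock of that pattern (illustrative only; this is not pyrpipe code, and the class and attribute names below are invented for the sketch):

```python
class Sample:
    """Minimal mock of SRA's fluent trim/align/assemble interface."""
    def __init__(self, accession):
        self.accession = accession
        self.steps = []
    def trim(self, tool):
        # a real implementation would update fastq_path/fastq2_path here
        self.steps.append('trim:' + tool)
        return self
    def align(self, tool):
        self.bam_path = self.accession + '.bam'   # result stored on the object
        self.steps.append('align:' + tool)
        return self
    def assemble(self, tool):
        self.gtf = self.accession + '.gtf'        # result stored on the object
        self.steps.append('assemble:' + tool)
        return self

s = Sample('SRR976159').trim('trim_galore').align('star').assemble('stringtie')
print(s.gtf)  # SRR976159.gtf
```

Because every step returns `self`, a whole pipeline reads as one expression while the intermediate results (bam_path, gtf) remain accessible on the object afterwards.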
The RNASeqQC class¶
The RNASeqQC is an abstract class defined in the qc module. The RNASeqQC class extends the Runnable class and thus has all the attributes of the Runnable class. The classes Trimgalore and BBmap extend the RNASeqQC class and share the following attributes and functions.
Attribute | Description |
_category | Represents the type: “RNASeqQC” |
Function | Description |
__init__() | The constructor function |
perform_qc() | Takes an SRA object, performs qc and returns paths to the resultant fastq files |
The Aligner class¶
The Aligner is an abstract class defined in the mapping module. The Aligner class extends the Runnable class and thus has all the attributes of the Runnable class. The classes Star, Hisat2 and Bowtie2 extend the Aligner class and share the following attributes and functions.
Attribute | Description |
_category | Represents the type: “Aligner” |
index | Index used by the aligner tool |
genome | Reference genome used by the tool |
Function | Description |
__init__() | The constructor function |
build_index() | Build an index for the aligner tool using the genome. |
check_index() | Checks if the index is valid |
perform_alignment() | Takes an SRA object, performs alignment and returns the path to the bam file |
The Assembly class¶
The Assembly is an abstract class defined in the assembly module. The Assembly class extends the Runnable class and thus has all the attributes of the Runnable class. The classes Stringtie and Cufflinks extend the Assembly class and share the following attributes and functions.
Attribute | Description |
_category | Represents the type: “Assembler” |
Function | Description |
__init__() | The constructor function |
perform_assembly() | Takes an SRA object, performs transcript assembly and returns paths to the resultant gtf/gff files |
The Quant class¶
The Quant is an abstract class defined in the quant module. The Quant class extends the Runnable class and thus has all the attributes of the Runnable class. The classes Salmon and Kallisto extend the Quant class and share the following attributes and functions.
Attribute | Description |
_category | Represents the type: “Quantification” |
index | Index used by the quantification tool |
transcriptome | Reference transcriptome used by the tool |
Function | Description |
__init__() | The constructor function |
build_index() | Build an index for the quantification tool using the transcriptome. |
check_index() | Checks if the index is valid |
perform_quant() | Takes an SRA object, performs quantification and returns the path to the quantification results file |
pyrpipe_engine module¶
The pyrpipe_engine module is the key module responsible for executing all Unix commands. The Runnable class calls the execute_command() function in the pyrpipe_engine module. All commands executed via the pyrpipe_engine module are automatically logged. All functions responsible for executing Unix commands are "decorated" with the dryable method, which enables the --dry-run flag for any function using the pyrpipe_engine module.
The pyrpipe_engine can be used directly to execute a Unix command or to import the output of a Unix command into Python. The functions defined in the pyrpipe_engine module are described below.
Function | Description |
dryable() | A decorator that makes Unix-executing functions dry when --dry-run is specified |
parse_cmd() | Parse a Unix command and return command as string |
get_shell_output() | Execute a command and return its return code and output. These commands are not logged |
get_return_status() | Execute command and return status of the command |
execute_commandRealtime() | Execute a command and print output in realtime |
execute_command() | Execute a command and log the status. This is the function used by the Runnable class. |
is_paired() | Check if an sra file is paired or single end. Uses fastq-dump |
get_program_path() | Return path to a Unix command |
check_dependencies() | Check a list of dependencies are present in Unix path |
delete_files() | Delete files, rm command |
move_file() | Moves files, mv command |
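The dry-run mechanism can be sketched as a simple decorator: when the dry flag is set, the wrapped function prints the command instead of executing it. This is a simplified illustration, not pyrpipe's actual implementation; `DRY_RUN` here is a module-level stand-in for the --dry-run flag:

```python
import functools

DRY_RUN = True  # stand-in for the --dry-run flag

def dryable(func):
    """When DRY_RUN is set, print the command instead of executing it."""
    @functools.wraps(func)
    def wrapper(cmd, *args, **kwargs):
        if DRY_RUN:
            print('DRY RUN:', ' '.join(cmd))
            return True
        return func(cmd, *args, **kwargs)
    return wrapper

@dryable
def execute_command(cmd):
    """Really execute a command; only reached when DRY_RUN is False."""
    import subprocess
    return subprocess.run(cmd).returncode == 0

execute_command(['orfipy', 'genome.fna', '--dna', 'orfs.fa'])  # printed, not executed
```

Because the decorator wraps every executing function in one place, a single flag dries out the whole pipeline without touching the pipeline script itself.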
pyrpipe_utils module¶
The pyrpipe_utils module defines multiple helpful functions. Functions defined in the pyrpipe_utils module are used extensively throughout pyrpipe's modules. Users can directly utilize these functions in their code and expedite development. A description of these functions is provided below.
Function | Description |
pyrpipe_print() | Prints in color |
get_timestamp() | Return current timestamp |
check_paths_exist() | Return True if paths are valid |
check_files_exist() | Return True if files are valid |
check_hisatindex() | Verify valid Hisat2 index |
check_kallistoindex() | Verify kallisto index |
check_salmonindex() | Verify salmon index |
check_starindex() | Verify STAR index |
check_bowtie2index() | Verify Bowtie2 index |
get_file_size() | Return file size in human readable format |
parse_java_args() | Parse tool options in JAVA style format |
parse_unix_args() | Parse tool options in Unix style format |
get_file_directory() | Return a file’s directory |
get_filename() | Return filename with extension |
get_fileext() | Return file extension |
get_file_basename() | Return filename, without extension |
mkdir() | Create a directory |
get_union() | Return union of lists |
find_files() | Search files using regex patterns |
get_mdf() | Compute and return MD5 checksum of a file |
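As an illustration of what an option parser like parse_unix_args does, the following sketch flattens an options dict into a Unix-style argument list. `build_unix_args` is a hypothetical name; pyrpipe's actual implementation differs:

```python
def build_unix_args(valid_args, kwargs):
    """Flatten an options dict into a Unix-style argument list; unknown flags
    are dropped and boolean True values become bare flags."""
    argv = []
    for flag, value in kwargs.items():
        if flag not in valid_args:
            continue  # this sketch silently drops unrecognized options
        argv.append(flag)
        if value is True or value in (None, ''):
            continue  # boolean flag: emit the flag only
        argv.extend(str(value).split())  # multi-token values become separate args
    return argv

print(build_unix_args(['--procs', '--outdir', '--between-stops'],
                      {'--procs': '3', '--between-stops': True, '--bogus': 'x'}))
# ['--procs', '3', '--between-stops']
```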
pyrpipe_diagnostic¶
The pyrpipe_diagnostic command allows users to easily examine pyrpipe logs. It provides several options.
- report: This command can generate a summary or detailed report of the analysis.
- shell: This command creates a bash file containing all the commands executed via pyrpipe
- benchmark: This command can generate benchmarks to compare walltimes of the different commands in the pipeline.
- multiqc: This command uses the MultiQC tool to generate a report from pyrpipe logs and other logs generated by the pipeline.
Cookbook¶
Contents
Using SRA objects¶
from pyrpipe.sra import SRA #imports the SRA class
#create an SRA object using a valid run accession
"""
This checks if fastq files already exist in the directory;
otherwise it downloads the fastq files and stores their paths in the object
"""
myob=SRA('SRR1168424',directory='./results')
#create an SRA object using fastq paths
myob=SRA('SRR1168424',fastq='./fastq/SRR1168424_1.fastq',fastq2='./fastq/SRR1168424_2.fastq')
#create an SRA object using an sra path
myob=SRA('SRR1168424',sra='./sra/SRR1168424.sra')
#accessing fastq files
print(myob.fastq_path,myob.fastq2_path)
#check if fastq files are present
print(myob.fastq_exists())
#check sra file
print(myob.sra_exists())
#delete fastq
myob.delete_fastq()
#delete sra
myob.delete_sra()
#download fastq
myob.download_fastq()
#trim fastq files
myob.trim(qcobject)
Using RNASeqQC objects¶
RNASeqQC objects can be used for quality control and trimming. These are defined in the qc module. The following example uses the Trimgalore class but is applicable to any class extending the RNASeqQC class.
from pyrpipe.qc import Trimgalore
#create trimgalore object
tgalore=Trimgalore()
#print category
print(tgalore._category) #prints the tool category
#use with SRA object
"""
Following will trim fastq and update fastq paths in the sraobject
"""
sraobject.trim(tgalore)
#following trims and returns qc
fq1,fq2=tgalore.perform_qc(sraobject)
#run trimgalore using user arguments; provide any arguments that Runnable.run() can take
tgalore.run(*args,**kwargs)
Using Aligner objects¶
Aligner objects from the mapping module can be used to perform alignment tasks.
from pyrpipe.mapping import Star
#create a star object
star=Star(index='path_to_star_index')
#print category
print(star._category) #should print Aligner
#perform alignment using SRA object
bam=star.perform_alignment(sraobject)
#or
sraobject.align(star)
bam=sraobject.bam_path
#execute STAR with any arguments and parameters
kwargs={'--outFilterType' : 'BySJout',
'--runThreadN': '6',
'--outSAMtype': 'BAM SortedByCoordinate',
'--readFilesIn': 'SRR3098744_1.fastq SRR3098744_2.fastq'
}
star.run(**kwargs)
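The kwargs dict above maps flag names to string values, which are flattened into the tool's argument list before execution. A simplified, purely illustrative sketch of that translation (not pyrpipe's actual implementation):

```python
def build_command(tool, options):
    """Flatten an options dict into a command argument list
    (illustrative sketch of dict-to-command-line translation)."""
    cmd = [tool]
    for flag, value in options.items():
        cmd.append(flag)
        # Multi-token values such as 'BAM SortedByCoordinate'
        # become separate command-line tokens
        cmd.extend(str(value).split())
    return cmd

print(build_command('STAR', {'--runThreadN': '6'}))
# → ['STAR', '--runThreadN', '6']
```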
Using Assembler objects¶
Assembler objects are defined in the assembly module and can be used for transcript assembly.
from pyrpipe.assembly import Stringtie
#create a stringtie object
stringtie=Stringtie(guide='path_to_ref_gtf')
#perform assembly using SRA object
"""
Note: the following first runs STAR to perform the alignment. The sorted
BAM file is then stored in the sraobject.bam_path attribute and the modified
sraobject is returned. The assemble function requires a valid bam_path attribute to work.
"""
sraobject.align(star).assemble(stringtie)
#Or manually set bam_path
sraobject.bam_path='/path/to/sorted.bam'
sraobject.assemble(stringtie)
#use perform_assembly function
result_gtf=stringtie.perform_assembly('/path/to/sorted.bam')
#run stringtie with user arguments
stringtie.run(verbose=True, **kwargs)
Using Quantification objects¶
Quantification type objects can perform quantification and are defined inside the quant module.
from pyrpipe.quant import Salmon
#create salmon object
"""
A valid salmon index is required. If the index is not found, it is built using the provided transcriptome
"""
salmon=Salmon(index='path/to/index',transcriptome='path/to/tr')
#directly quantify using SRA object
sraobject.quant(salmon)
#or trim reads before quant
sraobject.trim(tgalore).quant(salmon)
print('Result file',sraobject.abundance)
#use perform quant function
abundance_file=salmon.perform_quant(sraobject)
#use salmon with user defined arguments
salmon.run(**kwargs)
Using RNASeqTools objects¶
The RNASeqTools type is defined in the tools module. It contains various tools routinely used for RNA-Seq data processing and analysis.
from pyrpipe.tools import Samtools
#create samtools object
samtools=Samtools(threads=6)
#convert sam to sorted bam
bam=samtools.sam_sorted_bam('sam_file')
#merge bam files
mergedbam=samtools.merge_bam(bamfiles_list)
#run samtools with user-defined arguments
"""
NOTE: the Runnable.run() method accepts a subcommand argument that allows the user to provide a subcommand such as samtools index or samtools merge
"""
samtools.run(*args,subcommand='index',**kwargs)
Using Runnable objects¶
The Runnable class, defined inside the runnable module, is the main parent class in pyrpipe, i.e., all other tool classes derive their functionality from it. Users can directly create Runnable objects to define their own tools/commands.
A full example to build APIs is here: Building APIs
from pyrpipe.runnable import Runnable
#say you want to use the tool orfipy
orfipy=Runnable(command='orfipy')
#execute orfipy as
orfipy.run(*args,**kwargs)
#another example using Unix grep
grep=Runnable(command='grep')
grep.run('query1','file1.txt',verbose=True)
grep.run('query2','file2.txt',verbose=True)
#extend Runnable to build more complex APIs that fit with each other
"""
One can create classes extending the Runnable class.
Full example is given in the tutorial
"""
class Orfipy(Runnable):
    def __init__(self,*args,threads=None,mem=None,**kwargs):
        super().__init__(*args,command='orfipy',**kwargs)
        self._deps=[self._command]
        self._param_yaml='orfipy.yaml'
    #create special API functions that can work with other objects
    def find_orfs(self,sra_object):
        #define logic here and gather command options and parameters
        #call the self.run() function and check values
        #return a useful value
        pass
Using pyrpipe_engine module¶
The pyrpipe_engine module contains functions that create new processes and execute commands. Users can import the pyrpipe_engine module directly and start using these functions, which is very useful for quickly executing commands without having to create a Runnable object. A table describing the functions implemented in pyrpipe_engine is provided in the tutorial: pyrpipe_engine module
from pyrpipe import pyrpipe_engine as pe
#execute_command: Runs a command, logs the status and returns the status (True or False)
pe.execute_command(['ls', '-l'],logs=False,verbose=True)
#get_shell_output: Runs a command and returns a tuple (returncode, stdout, stderr)
"""
NOTE: only this function supports shell=True
"""
result=pe.get_shell_output(['head','sample_file'])
#result contains return code, stdout, stderr
print(result)
#make a function dry-run compatible
"""
when --dry-run flag is used this function will be skipped and the first parameter 'cmd' will be printed to screen
"""
@pe.dryable
def func(cmd,*args,**kwargs):
    #function logic here
    pass
"""
Another way to work with the dry-run flag is to import the _dryrun flag directly
"""
from pyrpipe import _threads,_force,_dryrun
def myfunction(*args,**kwargs):
    if _dryrun:
        print('This is a dry run')
        return
    #real code here...
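To make the dry-run mechanism concrete, here is a minimal, self-contained reimplementation of the idea behind a decorator like pe.dryable. This is illustrative only (in pyrpipe the flag is set by the --dry-run command-line option, not hard-coded):

```python
import functools

_dryrun = True  # illustrative; in pyrpipe this flag comes from the --dry-run option

def dryable(func):
    """Skip the wrapped function during a dry run; print the command instead."""
    @functools.wraps(func)
    def wrapper(cmd, *args, **kwargs):
        if _dryrun:
            print(cmd)   # show what would have been executed
            return True  # pretend the command succeeded
        return func(cmd, *args, **kwargs)
    return wrapper

@dryable
def run_tool(cmd):
    raise RuntimeError('would execute: ' + cmd)

run_tool('ls -l')  # prints 'ls -l' and returns True without executing anything
```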
Using pyrpipe_utils module¶
The pyrpipe_utils module contains several helper functions that are frequently needed in typical computational pipelines. A table describing the functions implemented in pyrpipe_utils is provided in the tutorial: pyrpipe_utils module
from pyrpipe import pyrpipe_utils as pu
#check if files exist
pu.check_files_exist('path/f1','path/f2','path/f3') #returns bool
#get filename without extension
pu.get_file_basename('path/to/file.ext')
#get file directory
pu.get_file_directory('path/to/file.ext')
#create a directory
pu.mkdir('path/to/dir')
#Search for .txt files in a directory
pu.find_files('path/to/directory',r'.*\.txt$',recursive=False,verbose=False)
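The pattern passed to find_files() is a standard Python regular expression (note the raw string, which keeps the backslash literal). A pattern can be sanity-checked with the re module before handing it to find_files():

```python
import re

# The same pattern used with find_files() above
pattern = re.compile(r'.*\.txt$')

print(bool(pattern.match('sample.txt')))    # True: ends in .txt
print(bool(pattern.match('notes.txt.gz')))  # False: ends in .gz
```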