Pipelining in Python with Snakemake with Biological Applications


During this tutorial I will show you how to use Snakemake to create a scalable and reproducible data analysis pipeline. Snakemake is a workflow management system that defines each step of an analysis as a rule, and it integrates smoothly with server, cluster, and cloud environments so that pipelines scale easily. Each rule specifies its input files, its output files, and the steps that turn input into output (inline Python code, Python or R scripts, or shell commands). Wildcards let a single rule apply to many datasets at once. Snakemake is a very general framework for creating pipelines; I have used it for bioinformatics applications.
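
To make the rule and wildcard idea concrete, here is a minimal sketch in plain Python. It is a toy model of the concept, not Snakemake's own code or API: the rule is stored as a dictionary, and the file paths, the `sample` wildcard, the `some_aligner` command, and the helper functions are all hypothetical names chosen for illustration. The comment at the top shows roughly how the same rule would be written in a Snakefile.

```python
import re

# Hypothetical rule, modeled as a plain dict. In a Snakefile the same rule
# would read roughly:
#
#   rule align:
#       input:  "data/{sample}.fastq"
#       output: "aligned/{sample}.bam"
#       shell:  "some_aligner {input} > {output}"
#
# "some_aligner" is a placeholder, not a real tool.
RULE = {
    "input": "data/{sample}.fastq",
    "output": "aligned/{sample}.bam",
    "shell": "some_aligner {input} > {output}",
}


def pattern_to_regex(pattern):
    """Turn 'aligned/{sample}.bam' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def resolve(rule, target):
    """Given a requested output file, work out the input file and shell command.

    Returns None when the rule's output pattern cannot produce the target.
    """
    match = pattern_to_regex(rule["output"]).fullmatch(target)
    if match is None:
        return None
    wildcards = match.groupdict()  # e.g. {"sample": "sampleA"}
    input_file = rule["input"].format(**wildcards)
    command = rule["shell"].format(input=input_file, output=target)
    return input_file, command


if __name__ == "__main__":
    # One rule, many datasets: each requested target fills in the wildcard.
    for sample in ["sampleA", "sampleB", "sampleC"]:
        print(resolve(RULE, f"aligned/{sample}.bam"))
```

The sketch works backward from the requested output file to the input it needs, which mirrors how the pipeline is described here: you ask for results, and each rule's patterns determine which files and commands are required. A real workflow engine also handles chains of rules, file timestamps, and job scheduling, none of which this toy attempts.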

I will cover what Snakemake is and how to frame your problem as this type of pipeline, walk through example pipelines that perform RNA-sequencing analysis, and discuss FELIX, a program on which I have used Snakemake. FELIX aims to identify genetically engineered organisms in complex samples. For FELIX, I use Snakemake to create pipelines that have been deployed on clusters to process 100 samples (hundreds of Gb of data), completing over a thousand hours of compute time in about two days. The pipeline aligns DNA sequences, assembles sequences that show signs of engineering into larger constructs, identifies the makeup of those constructs, and creates visualizations so that subject matter experts can quickly tell whether a sample has been genetically engineered.


Laura Seaman is a Senior Machine Intelligence Scientist at Draper, where she applies machine learning and bioinformatics algorithms to a variety of applications. Dr. Seaman’s graduate work focused on using genetic data to identify alterations to the genomic structure in cancer. She currently applies data science to problems including the analysis of financial networks and the identification of genetically engineered organisms. Dr. Seaman has a Bachelor of Science in Biological Engineering from the Massachusetts Institute of Technology, a Master of Arts in Statistics from the University of Michigan, and a Doctor of Philosophy in Bioinformatics from the University of Michigan.