Pomegranate: Flexible Probabilistic Modeling for Big Data

Abstract: Pomegranate is a Python package for probabilistic modeling with a primary emphasis on ease of use and a secondary emphasis on speed. In keeping with the first emphasis, it has a consistent sklearn-like API for training models and making predictions with them, and a convenient “lego API” that allows complex models to be built out of simple components without needing to think about how the math might work. In keeping with the second emphasis, the computationally intensive parts are written in efficient Cython code, and all models support both multithreaded parallelism and out-of-core computation for training on massive datasets. Currently, pomegranate allows you to use basic probability distributions to build general mixture models, naive Bayes classifiers, Markov chains, hidden Markov models, factor graphs, and Bayesian networks. In this talk I will show how to build models of increasing complexity, with code examples and a description of the type of phenomena each models well, drawing examples from “popular culture” and inadvertently proving how out of touch I am with today’s youth. At each step I will showcase both the speed and the flexibility of pomegranate with comparisons to other well-known packages such as NumPy, SciPy, and scikit-learn. Finally, I will show the simplicity of training a mixture of hidden Markov models in parallel using pomegranate.
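To make the “lego API” idea concrete, here is a minimal sketch in plain Python of how simple distribution objects might be snapped together into a mixture model that exposes sklearn-style fit and predict methods. It is an illustration only, not pomegranate's actual API: the class names Normal and Mixture, their method signatures, and the synthetic data are all assumptions made for this example, and the sketch uses only the standard library rather than Cython or parallelism.

```python
import math
import random


class Normal:
    """A univariate Gaussian 'brick' (hypothetical class, not pomegranate's)."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def log_probability(self, x):
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std) - 0.5 * math.log(2 * math.pi)

    def fit(self, xs, weights):
        # Weighted maximum-likelihood update of the mean and standard deviation.
        total = sum(weights)
        if total <= 0.0:
            return self  # no points assigned; leave parameters unchanged
        self.mean = sum(w * x for x, w in zip(xs, weights)) / total
        var = sum(w * (x - self.mean) ** 2 for x, w in zip(xs, weights)) / total
        self.std = max(math.sqrt(var), 1e-6)  # floor to avoid a collapsed component
        return self


class Mixture:
    """Combines any bricks offering log_probability() and a weighted fit()."""

    def __init__(self, components):
        self.components = components
        self.weights = [1.0 / len(components)] * len(components)

    def _responsibilities(self, x):
        # Posterior probability of each component for x, computed in log space.
        logs = [math.log(max(w, 1e-12)) + c.log_probability(x)
                for w, c in zip(self.weights, self.components)]
        top = max(logs)
        exps = [math.exp(value - top) for value in logs]
        norm = sum(exps)
        return [e / norm for e in exps]

    def fit(self, xs, n_iter=50):
        # Expectation-maximization: soft-assign points, then refit each brick.
        for _ in range(n_iter):
            resp = [self._responsibilities(x) for x in xs]
            for k, component in enumerate(self.components):
                column = [r[k] for r in resp]
                component.fit(xs, column)
                self.weights[k] = sum(column) / len(xs)
        return self

    def predict(self, xs):
        # Index of the most probable component for each point.
        result = []
        for x in xs:
            r = self._responsibilities(x)
            result.append(max(range(len(r)), key=lambda k: r[k]))
        return result


# Synthetic data: two well-separated clusters (an assumption for illustration).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(10.0, 1.0) for _ in range(200)]

model = Mixture([Normal(1.0, 1.0), Normal(9.0, 1.0)]).fit(data)
labels = model.predict([0.0, 10.0])
```

The point of the design is that Mixture never inspects what kind of brick it holds. Any object providing log_probability and a weighted fit could be dropped in, which is the sense in which more complex models are assembled from simple components.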

Bio: Jacob is a fourth-year graduate student and IGERT big data fellow currently pursuing a Ph.D. in computer science at the University of Washington. His research involves connecting the three-dimensional structure of the genome with a variety of cellular phenomena. Recently, Jacob’s work has involved predicting the three-dimensional structure of the genome at high resolution given simple, cheap data sources, and investigating a deep tensor factorization approach to impute missing epigenetics experiments for the human genome. As a separate track of research, he has also recently worked on techniques for merging expert knowledge with data to efficiently learn the structure of Bayesian networks, sometimes making the problem tractable even for large networks. When not doing research, Jacob actively contributes to the open source community. He is a core developer of the popular scikit-learn machine learning package for Python, where he focuses on maintaining the tree-based models and ensembles and the probabilistic models. Jacob has also written pomegranate, a Python package that extends scikit-learn to flexible probabilistic modeling and some probabilistic graphical models, with a focus on scalability through parallelized and out-of-core learning.

Open Data Science Conference