apricot: Taming Big Data by Removing Redundancy


The past decade of data science has been defined by massive data sets and the tools for processing and analyzing them. It is now commonplace to see even toy data sets with thousands or millions of examples. This scale is frequently necessary to observe all the modalities, even the rare ones, within a data manifold. Consequently, these data have enabled subtle associations to be detected and sophisticated machine learning methods to be trained.

However, not every additional example is equally informative. Indeed, many examples within massive data sets are redundant with each other because they cover the same common modalities. When training machine learning models, one can waste a significant amount of time performing unhelpful updates on redundant examples.

In this talk, I'll describe apricot, a fast and flexible Python implementation of submodular optimization. Submodular optimization is similar to convex optimization, but deals with sets of items rather than continuous values. At a high level, most submodular optimization problems can be thought of as selecting K elements from a bag of N elements, where the K elements are minimally redundant. I'll show how this notion of redundancy can be derived from a similarity (or distance) graph in a process similar to fast k-medoids clustering, or from the feature values themselves in order to scale to millions of examples. Most importantly, I'll show how the chosen examples can be used to train a machine learning model in a fraction of the time while still achieving accuracy comparable to training on the full data set.
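
As a rough illustration of the "select K minimally redundant elements out of N" idea, here is a minimal sketch of naive greedy selection under a facility-location objective, which is one common way to score a subset from a similarity graph. It is not apricot's implementation or API: the function names, the similarity measure, and the toy points are all assumptions made for this example, and the quadratic-time loop is written for clarity rather than speed.

```python
# Illustrative sketch only: naive greedy maximization of a facility-location
# objective. Not apricot's code; every name here is hypothetical.

def similarity(a, b):
    """Similarity that decays with squared Euclidean distance (an assumed choice)."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + dist_sq)


def greedy_facility_location(points, k):
    """Greedily pick k indices whose members best cover every point."""
    n = len(points)
    sims = [[similarity(p, q) for q in points] for p in points]
    best_cover = [0.0] * n  # best similarity of each point to any chosen item
    chosen = []
    for _ in range(min(k, n)):
        best_gain, best_idx = -1.0, None
        for i in range(n):
            if i in chosen:
                continue
            # Marginal gain: how much adding i improves coverage of all points.
            gain = sum(max(sims[i][j] - best_cover[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_gain, best_idx = gain, i
        chosen.append(best_idx)
        # Update each point's coverage to account for the new selection.
        best_cover = [max(best_cover[j], sims[best_idx][j]) for j in range(n)]
    return chosen


# Two tight clusters of three points each, plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
selected = greedy_facility_location(points, 3)
```

With K = 3, the greedy pass picks one representative from each cluster plus the isolated point. Once a cluster has a representative, its near-duplicates add almost no marginal gain, which is the sense in which the selected subset is minimally redundant.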


Jacob Schreiber is a post-doctoral researcher at the Stanford School of Medicine. As a researcher, he has developed machine learning approaches to integrate thousands of genomics data sets and to design biological sequences with desired characteristics, and he has described how statistical pitfalls can be encountered and accounted for in genomics data sets. As an engineer, he has contributed to the community as a core contributor to scikit-learn and as the developer of several machine learning toolkits, including pomegranate for probabilistic modeling and apricot for submodular optimization.

Open Data Science



