Building Provenance and Reproducibility into ML Systems


Machine learning increasingly affects people's lives. Models are now integrated into software that makes credit decisions, sifts through resumes, and translates between languages. As these systems take on more consequential tasks, being able to track and reproduce ML models becomes ever more important.

Knowing how deployed models were trained, and on what data, usually requires layering a tracking system on top of the ML training library. Once captured, this provenance metadata can be used for reporting, experiment tracking, and failure analysis. However, the layering approach can lead to issues if the tracking is misconfigured or not integrated into all runs. The resulting lack of provenance information makes these analysis and tracking tasks difficult. It also makes it hard to reproduce models, whether for regulatory reasons (to verify that a model was trained on the right data) or to retrain the model on new or modified data while preserving the hyperparameters and data pipeline.

In this talk, we'll discuss our approach to provenance tracking and reproducibility: engineering a machine learning library from the ground up with first-class notions of both, so that provenance is captured automatically for all ML computations.
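To make the contrast with a layered tracker concrete, here is a minimal, language-agnostic sketch written in Python. It is not the library's API: the names `Provenance`, `Model`, `MeanTrainer`, and `retrain` are hypothetical, and the "model" is a toy column mean. The point it illustrates is that the training call itself builds the provenance record, so a run cannot skip tracking.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical immutable record of what went into training a model."""
    trainer: str
    hyperparameters: dict
    data_sha256: str
    num_examples: int


@dataclass(frozen=True)
class Model:
    """A trained model always carries the provenance of its training run."""
    weights: tuple
    provenance: Provenance


class MeanTrainer:
    """Toy trainer (an assumption for illustration): scaled column means."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def train(self, rows):
        # Hash the exact training data so it can be verified later.
        data_hash = hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()
        n = len(rows)
        weights = tuple(self.scale * sum(col) / n for col in zip(*rows))
        # Provenance is built inside train(), not by an external tracker.
        prov = Provenance(
            trainer=type(self).__name__,
            hyperparameters={"scale": self.scale},
            data_sha256=data_hash,
            num_examples=n,
        )
        return Model(weights, prov)


def retrain(model, rows):
    """Rebuild the trainer from recorded provenance and train on `rows`."""
    trainers = {"MeanTrainer": MeanTrainer}  # hypothetical trainer registry
    p = model.provenance
    return trainers[p.trainer](**p.hyperparameters).train(rows)


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0]]
    model = MeanTrainer(scale=2.0).train(data)
    print(model.provenance)
    # Retraining on new data reuses the recorded hyperparameters.
    print(retrain(model, [[5.0, 6.0]]).provenance)
```

Because the record holds the trainer name, hyperparameters, and a hash of the data, it supports both use cases described above: checking that a deployed model was trained on the expected data, and retraining on new data with the same configuration.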

Session Outline
Attend this session to learn about:
- Considerations for building provenance into concurrent environments like the JVM
- How to ensure accurate capture of all the information that flows into training a model
- Use cases for provenance information and how to use it to track deployed models


Adam Pocock is a machine learning researcher at Oracle Labs. He's the lead developer of the Tribuo machine learning library, and maintains several other machine learning libraries on the JVM, including TensorFlow-Java and ONNX Runtime's Java API. Adam's research has covered several areas of ML and its applications, from scaling up and parallelizing Bayesian inference to building multilingual NLP systems. He holds a PhD in Computer Science from the University of Manchester, where his research focused on the theoretical underpinnings of feature selection algorithms.

Open Data Science




