Putting the science back in data science


Despite the many impressive applications of statistics, machine learning, and visualization in industry, many attempts at doing "data science" are anything but scientific. Specifically, data science processes often lack reproducibility, a key tenet of science in general and a prerequisite for true collaboration in a scientific community. In this session, I will discuss the importance of reproducibility and data provenance in any data science organization, and I will provide practical steps to help such organizations produce reproducible analyses and maintain integrity in their data science applications. I will also demo a reproducible data science workflow with complete provenance that explains the entire process behind a specific result.


Daniel (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world, teaches data science and engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and actively helps organize contributions to various open source data science projects.

Open Data Science




One Broadway
Cambridge, MA 02142
