Data Science Best Practices: Continuous Delivery for Machine Learning

Abstract: 

Machine learning is usually taught through tutorials that load small, clean datasets into data frames and orchestrate everything from Jupyter notebooks, all within a single local, in-memory environment. This style works well for presenting a new topic and teaching the main ideas, but these patterns are not conducive to delivering real production applications at scale. Real industrial projects involve multiple environments and draw data from databases or other data stores rather than file-based input. They interact with live production systems and must be coordinated with software delivery teams and product owners. They must be production quality: well designed, well tested, and maintainable. As a result, data scientists often have to choose between the environment they are used to and one that is suitable for delivery to production, followed by an awkward migration from one to the other. In this workshop, we show how to maintain data science productivity while collaborating effectively and delivering value continuously and seamlessly. We demonstrate and guide participants through CI/CD practices for machine learning and a new pattern of working that avoids most of the pitfalls of the typical approach.

Participants will learn how to use new patterns of repeatable, continuous model development in industrial data science projects, applying Continuous Integration (CI) and Continuous Delivery (CD) practices to collaborate effectively and deliver value continuously.

● GitHub;
● Docker;
● Jenkins;
● Jupyter;
● Python;
● DVC;
● MLflow;
● Kibana;
● Elasticsearch.

https://github.com/thoughtworksInc/CD4ML-Scenarios
https://drive.google.com/open?id=1QtJljTqRqR5E-GfgpGPZTaWyroXNcXFZ

Bio: 

David Johnston is a Principal Data Scientist and founding data scientist of the ThoughtWorks Data Science & Engineering practice. David has over 25 years of experience working with data, data processing pipelines, algorithms, optimization and statistical and machine learning models. David has a Ph.D. in physics and worked previously as a researcher at top universities, NASA and US government labs in the field of cosmology. Since leaving academia he has specialized in helping clients apply these techniques in their business environments with a focus on end-to-end delivery of valuable data-driven products and creating working, maintainable production systems. David is a frequent writer and speaker on data science, artificial intelligence and the importance of applying quality software development best practices toward data science-driven applications.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com