Abstract: A key question in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we introduce the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows.
We’ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We’ll present a schematic showing which layers data scientists need to think about and work with, and then introduce attendees to the tooling and workflow landscape. In doing so, we’ll present a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.
- Lesson 1: Laptop Machine Learning (the refresher)
This lesson will be a refresher on laptop machine learning, that is, when you’re using local compute resources, not working on the cloud: using the PyData stack (packages such as NumPy, pandas, and scikit-learn) to do basic forms of prediction and inference locally. We will also cover common pitfalls and gotchas, which motivate the next lessons.
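As a rough illustration of what "laptop machine learning" with the PyData stack can look like, here is a minimal sketch using scikit-learn. The dataset (iris), the model (logistic regression), and the split parameters are assumptions chosen for illustration, not necessarily what the workshop uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (illustrative choice, not from the workshop).
X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is scored on data it has not seen.
# Scoring on the training data is one of the classic local-ML pitfalls.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier entirely on local compute.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

Everything here lives in one process and one script, which is convenient for exploration but gives no record of which data, code, or parameters produced a given model.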
- Lesson 2: Machine learning workflows and DAGs
This lesson will focus on building local machine learning workflows using Metaflow, although the high-level concepts taught will be applicable to any workflow orchestrator. Attendees will get a feel for writing flows and DAGs to define the steps in their workflows. We’ll also use DAG cards to visualize our ML workflows. This lesson runs entirely on local compute; in the next lesson, we’ll burst to the cloud.
We'll introduce the framework Metaflow, which allows data scientists to focus on the top layers of the ML stack, while having access to the infrastructural layers.
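To make the idea of a workflow as a DAG of steps concrete, here is an orchestrator-agnostic sketch that uses only the Python standard library. It is not Metaflow's API: the step names, the shared `state` dictionary, and the `run_workflow` helper are all hypothetical, and the "models" are trivial stand-ins.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each one reads from and writes to a shared state dict.
def load_data(state):
    state["data"] = [1.0, 2.0, 3.0, 4.0]

def featurize(state):
    state["features"] = [x * 2 for x in state["data"]]

def train_mean(state):
    # Stand-in "model": the mean of the features.
    state["mean_model"] = sum(state["features"]) / len(state["features"])

def train_max(state):
    # Second stand-in "model" on a parallel branch: the maximum feature.
    state["max_model"] = max(state["features"])

def evaluate(state):
    # Join step: runs only after both training branches have finished.
    state["report"] = f"mean={state['mean_model']:.2f}, max={state['max_model']:.2f}"

# The DAG maps each step name to the set of steps it depends on.
DAG = {
    "load_data": set(),
    "featurize": {"load_data"},
    "train_mean": {"featurize"},
    "train_max": {"featurize"},
    "evaluate": {"train_mean", "train_max"},
}
STEPS = {f.__name__: f for f in (load_data, featurize, train_mean, train_max, evaluate)}

def run_workflow(dag, steps):
    """Run every step in an order that respects the DAG's dependencies."""
    state = {}
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        steps[name](state)
    return order, state

order, state = run_workflow(DAG, STEPS)
print(order)
print(state["report"])
```

A real orchestrator adds what this sketch leaves out, such as persisting each step's outputs, running independent branches in parallel, and resuming after failures.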
- Lesson 3: Bursting to the Cloud
In this lesson, we’ll see how we can move ML steps or entire workflows to the cloud from the comfort of our own IDE. In this case, we’ll be using AWS Batch compute resources, but the techniques are generalizable.
- Lesson 4 (optional and time permitting): Integrating other tools into your ML pipelines
We’ll also see how to begin integrating other tools into our pipelines, such as dbt for data transformation, Great Expectations for data validation, Weights & Biases for experiment tracking, and Amazon SageMaker for model deployment. Once again, the intention is not to tie us to any of these tools, but to use them to illustrate various aspects of the ML stack and to develop a workflow in which they can easily be switched out for other tools, depending on where you work and who you’re collaborating with.
Prerequisites. You should know:
* programming fundamentals and the basics of the Python programming language (e.g., variables, for loops);
* a bit about the PyData stack: `numpy`, `pandas`, `scikit-learn`, for example;
* a bit about Jupyter Notebooks and Jupyter Lab;
* your way around the terminal/shell.
Please also find here a few instructions for the workshop, which also cover any browser compatibility issues that may arise with the Full Stack ML Sandbox: https://docs.google.com/document/d/1WF6u_RZbnFjmdpXK6MaKcRsfiGuEK_ajPEvnPReDq58/edit?usp=sharing
Bio: Hugo Bowne-Anderson is a data scientist, writer, educator & podcaster. His interests include promoting data & AI literacy/fluency, helping to spread data skills through organizations and society, and doing amateur stand-up comedy in NYC. He does many of these at DataCamp, a data science training company educating over 3 million learners worldwide through interactive courses on the use of Python, R, SQL, Git, Bash and Spreadsheets in a data science context. He has spearheaded the development of over 25 courses in DataCamp’s Python curriculum, impacting over 170,000 learners worldwide through his own courses. He hosts and produces the data science podcast DataFramed, in which he uses long-format interviews with working data scientists to delve into what actually happens in the space and what impact it can and does have. He earned a PhD in Mathematics from the University of New South Wales, Australia, and has conducted biomedical research at the Max Planck Institute in Germany and Yale University, New Haven.