It’s standard industry practice to prototype Machine Learning pipelines in Jupyter notebooks, refactor them into Python modules and then deploy using production tools such as Airflow or Kubernetes. However, this process slows down development as it requires significant changes to the code.

Ploomber enables a leaner approach where data scientists can use Jupyter but still adhere to software development best practices such as code reviews or continuous integration. To show that this approach is a better alternative to the current workflow of prototyping in a notebook and then refactoring, this presentation develops and deploys a Machine Learning pipeline in 45 minutes.

The rest of this post describes how Ploomber achieves such a lean workflow.

Break down logic into multiple files

One of the main issues with notebook-based pipelines is that they often live in a single notebook. Debugging large notebooks is a nightmare, making pipelines hard to maintain. In contrast, Ploomber allows us to break down the logic into multiple, smaller steps that we declare in a pipeline.yaml file. For example, assume we’re working on a model to predict user activity using demographics and past activity. Our training pipeline would look like this:

Figure 1. Example pipeline

To create such a pipeline, we create a pipeline.yaml file and list our tasks (source) with their corresponding outputs (product):

# pipeline.yaml
tasks:
    # get user demographics
    - source: get-demographics.py
      product:
        nb: output/demographics.ipynb
        data: output/demographics.csv

    # get user activity
    - source: get-activity.py
      product:
        nb: output/activity.ipynb
        data: output/activity.csv

    # features from user demographics
    - source: fts-demographics.py
      product:
        nb: output/fts-demographics.ipynb
        data: output/fts-demographics.csv

    # features from user activity
    - source: fts-activity.py
      product:
        nb: output/fts-activity.ipynb
        data: output/fts-activity.csv

    # train model
    - source: train.py
      product:
        nb: output/train.ipynb
        data: output/model.pickle

Since each .py file has a clearly defined objective, it is easier to maintain and test than a single notebook.

Write code in .py and interact with it using Jupyter

Jupyter is a fantastic tool to develop data pipelines. It allows us to get quick feedback such as metrics or visualizations, essential for understanding our data. However, traditional .ipynb files have a lot of problems. For example, they make code reviews difficult because comparing versions yields illegible results. The following image shows the diff view of a notebook whose only change is a new cell with a comment:

Figure 2. Illegible notebook diff on GitHub

To fix those problems, Ploomber allows users to open .py files as notebooks, which enables code reviews while still providing the power of interactive development with Jupyter. The following image shows the same .py file rendered as a notebook in Jupyter and as a script in VS Code:

Figure 3. Same .py file rendered as a notebook in Jupyter and script in VS Code
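
As a rough illustration of what such a script can look like, the sketch below assumes the "percent" cell format, where `# %%` comments mark cell boundaries and `# %% [markdown]` marks a text cell. The file contents and the sample values are made up for this example; they are not taken from the pipeline in the presentation.

```python
# %% [markdown]
# # Get user demographics (hypothetical example)

# %%
import csv
import io

# %%
# Stand-in for a real data source: a tiny in-memory CSV with made-up values
raw = io.StringIO("user_id,age,country\n1,34,MX\n2,27,US\n3,45,DE\n")
rows = list(csv.DictReader(raw))

# %%
# A quick summary; in Jupyter, this cell's output would render right below it
mean_age = sum(int(r["age"]) for r in rows) / len(rows)
print(f"{len(rows)} users, mean age {mean_age:.1f}")
```

Because the file is plain Python, adding a comment or a cell shows up as a small, readable diff, and the same file still runs as a regular script.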

Ploomber still leverages the .ipynb format, but as an output. Each .py executes as a notebook, generating a .ipynb file that we can use during a code review to check visual results such as tables or charts. Note that in the pipeline.yaml file, each task has a .ipynb file in the product section. See the fragment below:

# pipeline.yaml (fragment)
tasks:
    # the source script...
    - source: get-demographics.py
      product:
        # ...generates a notebook as output
        nb: output/demographics.ipynb
        data: output/demographics.csv

# pipeline.yaml continues...

Retrieve results from previous tasks

Another essential feature is how we establish execution order. For example, to generate features from activity data, we need the raw data:

Figure 4. Declaring upstream dependencies

To establish this dependency, we edit fts-activity.py and add a special upstream variable at the top of the file:

upstream = ['activity']

We are stating that the activity task (get-activity.py) must execute before fts-activity.py. Once we provide such information, Ploomber adds a new cell to give us the location of our input files; we will see something like this:

# what we write
upstream = ['activity']


# what Ploomber adds in a new cell
upstream = {
    'activity': {
        # extracted from pipeline.yaml
        'nb': 'output/activity.ipynb',
        'data': 'output/activity.csv'
    }
}

No need to hardcode paths to files!
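
To make this concrete, here is a minimal sketch of how the body of a task like fts-activity.py could use those dictionaries. It is an illustration under assumptions, not the code from the presentation: the function name, the event-count feature, and the sample rows are hypothetical, and the `upstream` and `product` values are built by hand here (pointing at a temporary folder) to stand in for what Ploomber would inject at runtime.

```python
import csv
import tempfile
from pathlib import Path


def build_activity_features(upstream, product):
    """Body of a hypothetical fts-activity.py: count events per user."""
    # The input path comes from the injected dictionary, not a hardcoded string
    with open(upstream["activity"]["data"], newline="") as f:
        rows = list(csv.DictReader(f))

    counts = {}
    for row in rows:
        counts[row["user_id"]] = counts.get(row["user_id"], 0) + 1

    # The output path comes from this task's own product declaration
    out = Path(product["data"])
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "n_events"])
        writer.writerows(sorted(counts.items()))
    return counts


# Hand-built stand-ins for the injected values, pointed at a temp folder
tmp = Path(tempfile.mkdtemp())
upstream = {
    "activity": {
        "nb": str(tmp / "activity.ipynb"),
        "data": str(tmp / "activity.csv"),
    }
}
product = {
    "nb": str(tmp / "fts-activity.ipynb"),
    "data": str(tmp / "fts-activity.csv"),
}

# Fake raw activity so the sketch runs on its own (made-up values)
Path(upstream["activity"]["data"]).write_text(
    "user_id,event\n1,login\n1,click\n2,login\n"
)

counts = build_activity_features(upstream, product)
print(Path(product["data"]).read_text())
```

If the output location of the activity task changes in pipeline.yaml, this code does not need to change, since it only ever reads the path from `upstream`.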

Pipeline composition

A training pipeline and its serving counterpart have a lot of overlap. The only difference is that the training pipeline gets historical records, processes them, and trains a model, while the serving version gets new observations, processes them, and makes predictions.

Figure 5. The training and serving pipelines are mostly the same

All the data processing steps must be the same to prevent discrepancies at serving time. Once we have the training pipeline, we can easily create the serving version. The first step is to create a new file with our processing tasks:

# features.yaml - extracted from the original pipeline.yaml

# features from user demographics
- source: fts-demographics.py
  product:
    nb: output/fts-demographics.ipynb
    data: output/fts-demographics.csv

# features from user activity
- source: fts-activity.py
  product:
    nb: output/fts-activity.ipynb
    data: output/fts-activity.csv

Then we compose the training and serving pipelines by importing those tasks into each one and adding the remaining ones.
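
In a Ploomber spec, this composition is declared in YAML (the `meta` section has an `import_tasks_from` key for this purpose). The Python sketch below only illustrates the idea of reusing one list of processing tasks in two pipelines; it is not how Ploomber implements it. The `compose` helper and the serving file names are hypothetical, and products are omitted for the pipeline-specific tasks to keep the example short.

```python
# Shared processing steps, mirroring features.yaml
FEATURE_TASKS = [
    {
        "source": "fts-demographics.py",
        "product": {
            "nb": "output/fts-demographics.ipynb",
            "data": "output/fts-demographics.csv",
        },
    },
    {
        "source": "fts-activity.py",
        "product": {
            "nb": "output/fts-activity.ipynb",
            "data": "output/fts-activity.csv",
        },
    },
]


def compose(before, shared, after):
    """Return one task list: pipeline-specific steps around the shared ones."""
    return [*before, *shared, *after]


# Training: sources taken from the pipeline.yaml shown earlier
training = compose(
    before=[{"source": "get-demographics.py"}, {"source": "get-activity.py"}],
    shared=FEATURE_TASKS,
    after=[{"source": "train.py"}],
)

# Serving: hypothetical file names for fetching new observations and predicting
serving = compose(
    before=[
        {"source": "get-new-demographics.py"},
        {"source": "get-new-activity.py"},
    ],
    shared=FEATURE_TASKS,
    after=[{"source": "predict.py"}],
)

print([task["source"] for task in training])
print([task["source"] for task in serving])
```

Both pipelines point at the same feature definitions, so a change to a processing step reaches training and serving at the same time.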

We can now deploy our serving pipeline!

Deployment using Ploomber

Once we have our serving pipeline, we can deploy it to any of the available production backends, Kubernetes (via Argo Workflows), Airflow, or AWS Batch, using our second command-line tool, Soopervisor. Soopervisor requires a few additional configuration settings to create a Docker image and push our pipeline to production.

That’s it! Ploomber allows us to move back and forth between Jupyter and a production environment without any compromise on software engineering best practices.

If you are looking forward to our presentation, show your support with a star on GitHub, or join our community. See you in November during my session at ODSC West 2021, “Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber.”


About the author/ODSC West 2021 speaker on Ploomber:

Eduardo Blancas is interested in developing tools to deliver reliable Machine Learning products. Towards that end, he developed Ploomber, an open-source Python library for reproducible Data Science, first introduced at JupyterCon 2020. He holds an M.S. in Data Science from Columbia University, where he took part in Computational Neuroscience research. He started his Data Science career in 2015 at the Center for Data Science and Public Policy at The University of Chicago.