Visually Inspecting Data Profiles for Data Distribution Shifts

Abstract: 

The real world is a constant source of ever-changing, non-stationary data, which ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML or data practitioner. As organizations increasingly rely on ML to perform as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world also increases. That becomes especially challenging once common production requirements, such as scalability, privacy, security, and real-time constraints, are taken into account.

Distribution shift issues, if left unaddressed, can mean significant performance degradation over time and can even render a model downright unusable. How can teams proactively assess these issues in their production environment before their models degrade significantly? To answer this question, traditional statistical methods and efficient data logging techniques must be combined into practical tools that enable distribution shift inspection and detection under the strict requirements a production environment can entail.

In this talk, Data Scientist Felipe Adachi will discuss different types of data distribution shifts in ML applications, such as covariate shift, label shift, and concept drift, and how these issues can affect your ML application. Furthermore, the speaker will discuss the challenges of enabling distribution shift detection in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, the speaker will walk through steps that data scientists and ML engineers can take to surface data distribution shift issues proactively, rather than reacting to the performance degradation reported by their customers.

Session Outline:
For this 90-minute hands-on workshop, the following session outline is planned.
Session 1 - Data Distribution Shift
In this session, we’ll introduce the concept of data distribution shift and explain why it is a problem for ML practitioners. We will cover different types of distribution shifts and how to measure them.
In this session, we will cover:

1. Data Distribution Shift
a. What is Data Distribution Shift?
b. Why is it a problem?

2. Types of Distribution Shift (With Definitions and Examples)
a. Covariate Shift / Concept Drift / Label Shift

3. How to Measure Drift
a. Visual Inspection / Validation / Statistical Tests

4. Notebook Hands-on: Detecting distribution shift with popular statistical packages (scipy/alibi-detect); see the sketch below
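As a preview of this hands-on segment, here is a minimal sketch of detecting a covariate shift between a reference batch and a "production" batch, first with scipy's two-sample Kolmogorov-Smirnov test and then with alibi-detect's KSDrift detector. The data is synthetic and the 0.05 threshold is only illustrative.

```python
import numpy as np
from scipy import stats
from alibi_detect.cd import KSDrift

rng = np.random.default_rng(42)

# Reference data (e.g., the training distribution) and a shifted "production" batch
x_ref = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
x_prod = rng.normal(loc=0.5, scale=1.0, size=(1000, 1))  # simulated covariate shift

# Univariate two-sample Kolmogorov-Smirnov test with scipy
ks_stat, p_value = stats.ks_2samp(x_ref[:, 0], x_prod[:, 0])
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.4f}")

# The same test wrapped in alibi-detect, which handles multiple features
# and multiple-testing correction for us
detector = KSDrift(x_ref, p_val=0.05)
prediction = detector.predict(x_prod)
print("Drift detected:", bool(prediction["data"]["is_drift"]))
```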

Session 2 - Facing the Real World
In the real world, data is not always readily available in the form we would like. In this session, we’ll cover several challenges presented by the real world and how we can leverage data logging to overcome them.

In this session, we will cover:
1. Challenges of the real world
a. Big Data/Privacy/Streaming & Distributed Systems

2. Data Logging
a. Principles of whylogs
i. Efficient / Customizable / Mergeable

3. Notebook Hands-on: Profiling data and inspecting results with whylogs (a short sketch follows)
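To give a flavor of this notebook, below is a minimal profiling sketch assuming the whylogs v1 API and two small illustrative pandas DataFrames standing in for data batches; the column names are placeholders.

```python
import pandas as pd
import whylogs as why

# Two illustrative batches of data; in the workshop this would be a real dataset
batch_1 = pd.DataFrame({"alcohol": [9.4, 10.2, 11.1], "pH": [3.51, 3.20, 3.26]})
batch_2 = pd.DataFrame({"alcohol": [12.8, 13.0, 12.5], "pH": [3.00, 3.05, 2.98]})

# Log each batch into a lightweight statistical profile (the raw rows are not retained)
profile_view_1 = why.log(batch_1).view()
profile_view_2 = why.log(batch_2).view()

# Profiles are mergeable, which is what makes them practical for
# streaming and distributed settings
merged_view = profile_view_1.merge(profile_view_2)

# Inspect the approximate statistics collected per column
print(merged_view.to_pandas())
```

Because each profile contains only approximate statistics, batches can be profiled wherever the data lives and merged later for inspection, which addresses the big data, privacy, and streaming challenges listed above.
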
Session 3 - Inspecting and Comparing Distributions with whylogs
In this session, we will explore whylogs’ Visualizer Module and its capabilities, using the Wine Quality dataset as a use case to demonstrate distribution shifts. We will first generate statistical summaries with whylogs and then visualize the profiles with the visualization module; a short sketch of the visualization API follows the outline below.

This is a Hands-on Notebook Session.

In this session, we will cover:
1. Notebook Hands-on with whylogs’ Visualizer Module
a. Introduction to the Visualizer module
b. Profiling data with whylogs
c. Generating Summary Drift Reports
d. Inspecting Distribution Charts between distributions
e. Inspecting Histograms between distributions
f. Inspecting Feature Statistics
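As a preview of this notebook, here is a minimal sketch of the comparisons listed above using whylogs’ NotebookProfileVisualizer. It assumes the whylogs v1 visualization API, with small illustrative DataFrames standing in for a reference and a target split of the Wine Quality data; the feature names are placeholders, and in a notebook each call renders an interactive report inline.

```python
import pandas as pd
import whylogs as why
from whylogs.viz import NotebookProfileVisualizer

# Stand-ins for a reference batch (e.g., training data) and a target batch (e.g., production data)
reference_df = pd.DataFrame(
    {"alcohol": [9.4, 10.2, 11.1, 9.8], "pH": [3.51, 3.20, 3.26, 3.16], "quality": [5, 6, 5, 6]}
)
target_df = pd.DataFrame(
    {"alcohol": [12.8, 13.0, 12.5, 12.9], "pH": [3.00, 3.05, 2.98, 3.02], "quality": [7, 7, 6, 7]}
)

# Profile both batches with whylogs
reference_view = why.log(reference_df).view()
target_view = why.log(target_df).view()

# Point the visualizer at the two profiles to compare
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=target_view, reference_profile_view=reference_view)

# Each call below renders a report when run inside a notebook
visualization.summary_drift_report()                      # drift summary across all features
visualization.distribution_chart(feature_name="quality")  # compare value distributions (discrete feature)
visualization.double_histogram(feature_name="alcohol")    # compare histograms (continuous feature)
visualization.feature_statistics(feature_name="alcohol", profile="target")
```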

Session 4 - Data Validation
As discussed in previous sessions, data validation plays a critical role in detecting changes in your data. In this session, we will introduce the concept of constraints - a way to express your expectations about your data - and how to apply them to ensure data quality; a short sketch follows the outline below.
This is a Hands-on Notebook Session.

In this session, we will cover:
1. Introduction to Constraints
2. Defining Data Constraints
3. Applying defined constraints to data
4. Generating Data Validation Reports
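As a preview of this notebook, the sketch below shows the general shape of defining and applying constraints against a whylogs profile. The specific factory functions (greater_than_number, no_missing_values), their parameters, and the column names are assumptions based on the whylogs constraints API and should be checked against the current documentation.

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, no_missing_values

# Constraints are evaluated against a profile, not against the raw rows
df = pd.DataFrame({"alcohol": [9.4, 10.2, 11.1, 12.8], "pH": [3.51, 3.20, 3.26, 3.00]})
profile_view = why.log(df).view()

# Express expectations about the data as constraints
builder = ConstraintsBuilder(profile_view)
builder.add_constraint(greater_than_number(column_name="alcohol", number=5.0))
builder.add_constraint(no_missing_values(column_name="pH"))
constraints = builder.build()

# Validate the profile against the constraints and generate a report
print("All constraints passed:", constraints.validate())
for report in constraints.generate_constraints_report():
    print(report)  # each entry records the constraint name and its pass/fail counts
```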

Bio: 

Bernease Herman is a senior data scientist at WhyLabs, the AI Observability company, and a research scientist at the University of Washington eScience Institute. At WhyLabs, she is building model and data monitoring solutions using approximate statistics techniques. Earlier in her career, Bernease built ML-driven solutions for inventory planning at Amazon and conducted quantitative research at Morgan Stanley. Her academic research focuses on evaluation metrics and interpretable ML, with a specialty in synthetic data and societal implications. She has published work in top machine learning conferences and workshops such as NeurIPS, ICLR, and FAccT.
