How to Practice Data-Centric AI and Have AI Improve Its Own Dataset

Abstract: 

In Machine Learning projects, one typically starts by exploring the data and training an initial baseline model. While it is tempting to jump straight into experimenting with different modeling techniques, the emerging science of data-centric AI offers systematic techniques that use the baseline model to find and fix dataset issues. By improving the dataset in this manner, one can drastically improve the initial model’s performance without changing the modeling code at all! These techniques work with any ML model, and the improved dataset can be used to train any type of model (so modeling improvements can be stacked on top of dataset improvements). Such automated data curation has been instrumental to the success of AI organizations like OpenAI and Tesla.

While data scientists have long improved data through manual labor, data-centric AI studies algorithms that do this automatically. This tutorial will teach you how to operationalize fundamental ideas from data-centric AI across a wide variety of datasets (image, text, tabular, etc.). We will cover recent algorithms to automatically identify common issues in real-world data (label errors, bad data annotators, outliers, low-quality examples, and other dataset problems that, once identified, can be easily addressed to significantly improve trained models). Open-source code to easily run these algorithms within end-to-end data science projects will also be demonstrated. After this tutorial, you will know how to use models to improve your data and immediately retrain better models, iterating this data/model improvement in a virtuous cycle.
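To make the core idea concrete, here is a minimal sketch of such a workflow, assuming the open-source cleanlab library (2.x API) and using a synthetic dataset with a logistic-regression baseline as placeholders for your own data and model:

# Minimal sketch (not the tutorial's exact code): out-of-sample predicted probabilities
# from any baseline classifier can be fed to cleanlab to flag likely label errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X, labels = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)
labels[:30] = np.random.default_rng(0).integers(0, 3, size=30)  # simulate some label noise

# Out-of-sample predicted class probabilities via cross-validation (any classifier works here).
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                               cv=5, method="predict_proba")

# Indices of examples most likely mislabeled, ranked by estimated severity.
issue_indices = find_label_issues(labels=labels, pred_probs=pred_probs,
                                  return_indices_ranked_by="self_confidence")
print(issue_indices[:20])  # review/relabel these examples first, then retrain on the cleaned data

The same two inputs (given labels and out-of-sample predicted probabilities) drive many of the data-quality checks covered in the tutorial, which is why no change to the modeling code is needed.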

Session Outline:

Lesson 1: Fundamentals of Data-Centric AI. At the end of this lesson, you will understand how it is possible to use your baseline ML model to improve your dataset.

Lesson 2: Algorithms to automatically detect issues in data. At the end of this lesson, you will understand the concepts behind algorithms that can estimate which data are mislabeled, outliers, near duplicates, or suffering from other common types of issues.
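As one illustration of such a detector, here is a sketch of outlier detection, assuming cleanlab's OutOfDistribution scorer; the random "features" below stand in for real embeddings produced by your trained model:

# Minimal sketch: score how atypical each example's feature representation is.
import numpy as np
from cleanlab.outlier import OutOfDistribution

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))     # e.g. per-example embeddings from your model
features[:5] += 8.0                        # inject a few obvious outliers for illustration

ood = OutOfDistribution()
scores = ood.fit_score(features=features)  # lower score => less typical example
outlier_indices = np.argsort(scores)[:10]  # examples worth manual inspection
print(outlier_indices)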

Lesson 3: Methods to improve datasets. At the end of this lesson, you will understand methods to best deal with data labeled by multiple annotators, and how to efficiently annotate datasets (active learning) — as well as open-source code to easily apply these ideas in your projects.
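To give a flavor of the active-learning idea, here is a generic uncertainty-sampling sketch (a simple heuristic for illustration only; the tutorial covers more principled methods, including ones that account for multiple annotators):

# Generic illustration: use the current model's confidence to pick which unlabeled
# examples are most valuable to send to annotators next.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:100] = True                        # pretend only 100 examples are labeled so far

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
probs = model.predict_proba(X[~labeled])
uncertainty = 1.0 - probs.max(axis=1)       # least-confidence heuristic
query = np.where(~labeled)[0][np.argsort(-uncertainty)[:20]]
print(query)                                # examples to have annotators label next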

Background Knowledge:

After this tutorial, you will know how to use ML models to improve your data, and how to immediately retrain better models on the improved dataset (iterating this data/model improvement in a virtuous cycle). You will be able to apply these ideas in practice via the open-source cleanlab library (https://github.com/cleanlab/cleanlab).
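For example, this data/model iteration can be run in a few lines (a minimal sketch assuming cleanlab 2.x's CleanLearning wrapper, with synthetic placeholder data and model):

# CleanLearning wraps any scikit-learn-compatible classifier, detects likely label
# issues, and retrains on the cleaned data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

X, labels = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)

cl = CleanLearning(LogisticRegression(max_iter=1000))
cl.fit(X, labels)                    # internally finds label issues, then retrains without them
print(cl.get_label_issues().head())  # inspect which examples were flagged
preds = cl.predict(X)                # use like any scikit-learn model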

The intended audience is folks with experience in supervised learning who want to develop the most effective ML for messy, real-world applications. Some of the content will be technical, but it will not require a deep understanding of how particular ML algorithms/models work (having completed one previous ML course or project should suffice). The topics should be of interest to anybody working in computer vision, natural language processing (including LLMs), audio/speech, tabular data, or other standard supervised learning applications.

Bio: 

Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a software company providing data-centric AI tools to efficiently improve ML datasets. Previously, he was a senior scientist at Amazon Web Services developing AutoML and Deep Learning algorithms which now power ML applications at hundreds of the world's largest companies. In 2018, he completed his PhD in Machine Learning at MIT, also doing research in NLP, Statistics, and Computational Biology.

Jonas has published over 30 papers in top ML and Data Science venues (NeurIPS, ICML, ICLR, AAAI, JASA, Annals of Statistics, etc.). This research has been featured in Wired, VentureBeat, Technology Review, World Economic Forum, and other media. He has also contributed open-source software, including the fastest-growing open-source libraries for AutoML (https://github.com/awslabs/autogluon) and Data-Centric AI (https://github.com/cleanlab/cleanlab).

