Improving ML Datasets with Cleanlab, a Standard Framework for Data-Centric AI


In practical ML projects, experienced data scientists know that improving data brings higher ROI than tinkering with models. However the process of finding and fixing problems in a dataset is highly manual (ad hoc ideas explored in Jupyter notebooks). Cleanlab develops software to help make this process more: efficient (via novel algorithms that automatically detect certain issues in data) and systematic (with better coverage to detect different types of issues).

This talk will describe how high-level ideas from data-centric AI can be operationalized across a wide variety of datasets (image, text, tabular, etc), and introduce new algorithmic strategies to improve data that we have researched and published papers on with extensive benchmarks. I will conclude with a discussion of connections to AutoML, where the data-centric AI movement is headed next, and key obstacles that deserve more attention.

The intended audience is folks with experience in supervised learning who want to develop the most effective ML for messy, real-world applications. Some of the content will be technical, but not require a deep understanding of how particular ML algorithms/model work (having completed one previous ML course/project should suffice). The topics should be of interest to anybody working in: computer vision, natural language processing, audio/speech or tabular data, and other standard supervised learning applications.


Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a software company providing data-centric AI tools to efficiently improve ML datasets. Previously, he was a senior scientist at Amazon Web Services developing AutoML and Deep Learning algorithms which now power ML applications at hundreds of the world's largest companies. In 2018, he completed his PhD in Machine Learning at MIT, also doing research in NLP, Statistics, and Computational Biology.

Jonas has published over 30 papers in top ML and Data Science venues (NeurIPS, ICML, ICLR, AAAI, JASA, Annals of Statistics, etc). This research has been featured in Wired, VentureBeat, Technology Review, World Economic Forum, and other media. He has also contributed open-source software, including the fastest-growing open-source libraries for AutoML ( and Data-Centric AI (

Open Data Science




Open Data Science
One Broadway
Cambridge, MA 02142

Privacy Settings
We use cookies to enhance your experience while using our website. If you are using our Services via a browser you can restrict, block or remove cookies through your web browser settings. We also use content and scripts from third parties that may use tracking technologies. You can selectively provide your consent below to allow such third party embeds. For complete information about the cookies we use, data we collect and how we process them, please check our Privacy Policy
Consent to display content from - Youtube
Consent to display content from - Vimeo
Google Maps
Consent to display content from - Google