Hands-on Data-Centric AI: Data Preparation Tuning – Why and How?


Data challenges of accessibility, volume, and quality are some of the top barriers that organizations and data science teams face. Data-Centric AI workflows provide organizations with the ability to overcome these challenges while accelerating and increasing the value of AI applications.

By completing this workshop, you will learn about data-centric AI and the reasons to adopt this approach when developing AI solutions. The objective of the session is to provide a basic understanding of the impact of data quality on a model’s performance.

You’ll learn how to find the most critical challenges through data profiling, how to fix them (mislabels or inconsistencies, missing data, imbalanced classes, etc.) with approaches such as synthetic data, and how to optimize data preparation through an iterative, scalable, and versionable process.

YData Fabric will be used to demo the development of a use case, and all the code and examples shared will be made available.

Session Outline:

- Lesson 1: Data-Centric AI and the importance of data quality
Familiarize yourself with the concept of Data-Centric AI and the impact of data-quality while developing a Machine-Learning based approach. Get to know some of the open-source packages and solutions that can help you.

By the end of this lesson, you will understand how to explore and assess the quality of a dataset for a given use case, which lays the groundwork for defining the strategies to use when processing the data.
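
As a rough illustration of what assessing the quality of a dataset can look like, the sketch below computes a few basic signals with pandas: missing values per column, duplicated rows, and class balance. The function name `quality_report`, the column names, and the toy data are all invented for this example; this is not the workshop's own code, and dedicated profiling packages report far more than this.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, target: str) -> dict:
    """Summarize a few basic data-quality signals for a labeled dataset."""
    # Share of rows belonging to each class of the target column.
    class_share = df[target].value_counts(normalize=True)
    return {
        "n_rows": len(df),
        # Fraction of missing values in each column.
        "missing_share": df.isna().mean().to_dict(),
        # Rows that exactly repeat an earlier row.
        "n_duplicate_rows": int(df.duplicated().sum()),
        "class_share": class_share.to_dict(),
        # Size of the largest class relative to the smallest one.
        "imbalance_ratio": float(class_share.max() / class_share.min()),
    }


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "age": [25, 32, None, 41, 25, 58],
        "income": [30_000, 52_000, 47_000, None, 30_000, 91_000],
        "churned": [0, 0, 0, 1, 0, 0],
    }
)

report = quality_report(toy, target="churned")
for name, value in report.items():
    print(name, value)
```

A report like this only flags candidate problems. Whether a given share of missing values or a given imbalance ratio matters depends on the use case.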

- Lesson 2: Data preparation & the role of synthetic data
A practical guide on data-preparation steps and how to mitigate some of the identified challenges, such as mislabels, inconsistencies, missing data, imbalanced classes, bias, etc.

An introduction to the concept of synthetic data and how users can leverage it to deal with low variability within the dataset population.
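
To make two of these preparation steps concrete, here is a minimal sketch using scikit-learn: median imputation for missing values, followed by random oversampling of the minority class. Note that random oversampling only duplicates existing rows. It is used here as a simple stand-in and is not a synthetic data generator, which would produce new records instead. The arrays are toy values invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

# Toy features with one missing value, and an imbalanced binary target.
X = np.array(
    [[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]
)
y = np.array([0, 0, 0, 0, 1, 1])

# Step 1: fill missing values with the per-column median.
X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Step 2: oversample the minority class (with replacement) until both
# classes have the same number of rows.
minority = y == 1
n_extra = int((~minority).sum() - minority.sum())
X_extra, y_extra = resample(
    X_filled[minority],
    y[minority],
    replace=True,
    n_samples=n_extra,
    random_state=0,
)

X_balanced = np.vstack([X_filled, X_extra])
y_balanced = np.concatenate([y, y_extra])
print(np.bincount(y_balanced))
```

In a real project both steps would be fitted on the training split only, so that no information from the evaluation data leaks into the preparation.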

- Lesson 3: How to iterate & version the process of data preparation
Get to know why fast iteration and versioning of your data preparation process are so relevant to the success of data-centric AI adoption. Learn how to integrate and align data preparation with business expectations to achieve higher performance and better generalization of your results in production.
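
One simple way to picture versioning a data preparation process is to derive an identifier from the preparation settings, so that every iteration can be traced back to the exact choices that produced it. The sketch below does this with the Python standard library only. The helper `prep_version` and the configuration keys are hypothetical, and this is an illustration of the idea, not how YData Fabric or any particular tool implements versioning.

```python
import hashlib
import json


def prep_version(config: dict) -> str:
    """Return a short, deterministic identifier for a preparation config."""
    # Sorting the keys makes the identifier independent of key order.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical preparation settings for two iterations of the same project.
v1 = prep_version({"imputer": "median", "oversample": False})
v2 = prep_version({"oversample": False, "imputer": "median"})
v3 = prep_version({"imputer": "median", "oversample": True})

print(v1 == v2)  # same settings, different key order
print(v1 == v3)  # different settings
```

Storing such an identifier next to each trained model and its evaluation results makes it possible to compare iterations and to reproduce the preparation behind a given result.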

Background Knowledge:

Supervised Machine Learning, Python, scikit-learn


Fabiana Clemente is the co-founder and CDO of YData, combining Data Understanding, Causality, and Privacy as her main fields of work and research, with the mission to make data actionable for organizations. Passionate about data, Fabiana has vast experience leading data science teams in startups and multinational companies. She hosts the “When Machine Learning meets privacy” podcast, has been a guest speaker at Datacast and Privacy Please, is a previous WebSummit speaker, and was recently awarded “Founder of the Year” by the South Europe Startup Awards.

Open Data Science




One Broadway
Cambridge, MA 02142
