Data Validation at Scale – Detecting and Responding to Data Misbehavior


In today's data-driven world, companies rely heavily on data to make informed decisions and gain a competitive edge. This also means that the presence of low-quality data can have serious negative consequences for businesses. Incorrect or incomplete data can lead to poor decision-making, missed opportunities, and ultimately, financial losses. Data validation can be particularly challenging as the amount of data involved continues to grow steadily. As businesses generate and collect more data than ever before, the task of ensuring that all of this information is accurate and consistent becomes increasingly complex.

As the complexity of systems increases, data errors may occur and spread throughout various stages of the pipeline. In order to pinpoint the root cause of these errors, it is crucial for the data validation process to facilitate debugging and integrate closely with alerting systems. To achieve that goal, it is essential to develop practical tools that enable accurate and efficient data validation under the strict requirements of modern data-driven systems.

In this hands-on workshop, Data Scientist Felipe Adachi will introduce the concept of data logging and discuss how to validate data at scale by creating metric constraints and generating reports based on the data's statistical profiles using the whylogs open-source package. He will also walk through steps that data scientists and ML engineers can take to tailor their set of validations to fit the specific needs of their business or project, take actions when their rules fail to be met, and debug and troubleshoot cases where data fails to behave as expected.


Felipe is a Data Scientist at WhyLabs. He is a core contributor to whylogs, an open-source data logging library, and focuses on writing technical content and expanding the whylogs library in order to make AI more accessible, robust, and responsible. Previously, Felipe was an AI Researcher at WEG, where he researched and deployed Natural Language Processing approaches to extract knowledge from textual information about electric machinery. He is also a Master in Electronic Systems Engineering from UFSC (Universidade Federal de Santa Catarina), with research focused on developing and deploying fault detection strategies based on machine learning for unmanned underwater vehicles. Felipe has published a series of blog articles about MLOps, Monitoring, and Natural Language Processing in publications such as Towards Data Science, Analytics Vidhya, and Google Cloud Community.

Open Data Science




Open Data Science
One Broadway
Cambridge, MA 02142

Privacy Settings
We use cookies to enhance your experience while using our website. If you are using our Services via a browser you can restrict, block or remove cookies through your web browser settings. We also use content and scripts from third parties that may use tracking technologies. You can selectively provide your consent below to allow such third party embeds. For complete information about the cookies we use, data we collect and how we process them, please check our Privacy Policy
Consent to display content from - Youtube
Consent to display content from - Vimeo
Google Maps
Consent to display content from - Google