Running Data Quality Checks in Your Data Pipelines

Abstract: 

Data quality is critical to the effective implementation of data pipelines for ML, data science, geospatial analysis, or general analytics.

Most engineering teams address data quality and pipeline orchestration as two separate tasks. In this presentation, Sandy Ryza will explain the benefits of a model in which arbitrary checks are included in the data orchestration logic, resulting in better control and integration of data quality checks at various steps in the pipeline.

To remain versatile, the orchestrator should not determine what “data quality” means to an organization, but rather facilitate the implementation and observability of data quality checks, no matter how data quality is defined. Checks should be intuitive to implement and the outcome of the checks should inform the pipeline logic.
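As a rough illustration of the idea that check outcomes should inform pipeline logic, here is a minimal sketch in plain Python. It is not Dagster's API: `CheckResult`, `run_step`, and the two example checks are hypothetical names made up for this sketch, and the notion of a "blocking" check is an assumption about one way a check could gate downstream work.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Row = dict[str, object]


@dataclass
class CheckResult:
    """Outcome of one data quality check (hypothetical structure)."""
    name: str
    passed: bool
    blocking: bool = True  # assumption: a failed blocking check stops downstream work
    metadata: dict[str, object] = field(default_factory=dict)


def no_null_ids(rows: list[Row]) -> CheckResult:
    # The organization decides what "quality" means; this is only one example rule.
    null_count = sum(1 for row in rows if row.get("id") is None)
    return CheckResult("no_null_ids", null_count == 0,
                       metadata={"null_count": null_count})


def min_row_count(rows: list[Row]) -> CheckResult:
    # Non-blocking: a failure is recorded for observability but does not stop the step.
    return CheckResult("min_row_count", len(rows) >= 1, blocking=False,
                       metadata={"row_count": len(rows)})


def run_step(
    rows: list[Row],
    checks: list[Callable[[list[Row]], CheckResult]],
    downstream: Callable[[list[Row]], object],
) -> tuple[list[CheckResult], Optional[object]]:
    """Run every check, then run the downstream step only if no blocking check failed."""
    results = [check(rows) for check in checks]
    blocked = any(r.blocking and not r.passed for r in results)
    output = None if blocked else downstream(rows)
    return results, output
```

Calling `run_step(rows, [no_null_ids, min_row_count], downstream)` returns every check result for reporting, plus the downstream output, or `None` when a blocking check failed. The orchestration code never inspects what a check measures, only whether it passed and whether it is blocking, which keeps the definition of "data quality" in the hands of whoever writes the checks.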

Achieving this degree of flexibility without impacting performance requires careful design, and Sandy will share best practices and lessons learned on creating data quality checks that provide actionable insights for data engineering and ML teams.

Session Outline:

Sandy's presentation will share how we designed and implemented data quality checks in the Python-based open-source platform Dagster, bringing data quality capabilities to the orchestration layer.

Background Knowledge:

Participants will benefit most if they have a working understanding of data orchestration, ML and data science pipelines, and a general grasp of data quality techniques.

Bio: 

Sandy is a lead engineer, author, and thought leader in the domain of data engineering. Sandy co-wrote “Advanced Analytics with PySpark” and “Advanced Analytics with Spark”. He led ML and data science teams at Cloudera, Remix, Clover Health, and KeepTruckin.

Sandy is currently the lead engineer on the Dagster project, an open-source data orchestration platform used in MLOps, data science, IoT, and analytics. Sandy is a regular speaker at data engineering and ML conferences.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
