The Importance of Early Data Observability on the Quality of Machine Learning


Deferring data observability until use-case-driven feature engineering carries the risk of a biased analysis. When observing, modifying and inferring all happen at the same stage, it is too easy to make the data support a specific hypothesis. In this talk, we argue that the antidote is to incorporate data observability at the earliest possible phase of an ML pipeline.

Observing data as it moves from the source into the analysis or application environment helps us dissociate data quality from data function. Observability features at ingestion should include:

Data flow rate: historical and time series understanding of volume, spikes and outages of data emitted by a source.

Data continuity and freshness: field-wise first and last arrived timestamps and any gaps.

Data consistency: cardinality, density and uniqueness statistics by field.

Schema evolution: time-stamped schema change events and record of how each change was handled.

Field summary statistics: running mean, median, mode, etc. of numeric fields.
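
As an illustration only, several of the features above can be tracked in a single pass as records arrive. The sketch below is not Upsolver's implementation: the `IngestionObserver` and `FieldStats` names, the per-minute bucketing, and the gap threshold are all assumptions made for the example. It also assumes field values are hashable and records gaps for the stream as a whole rather than per field.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FieldStats:
    """Per-field counters (hypothetical structure for this sketch)."""
    first_seen: datetime
    last_seen: datetime
    non_null: int = 0
    values: Counter = field(default_factory=Counter)  # exact counts; unbounded memory
    total: float = 0.0
    numeric_count: int = 0


class IngestionObserver:
    """Hypothetical observer that updates statistics one record at a time."""

    def __init__(self, gap_threshold: timedelta):
        self.gap_threshold = gap_threshold
        self.records = 0
        self.per_minute = Counter()   # flow rate: records per minute bucket
        self.fields = {}              # field name -> FieldStats
        self.gaps = []                # continuity: (previous arrival, next arrival)
        self.schema_events = []       # schema evolution: (timestamp, field names)
        self._last_arrival = None
        self._schema = None

    def observe(self, record: dict, arrived: datetime) -> None:
        self.records += 1
        self.per_minute[arrived.replace(second=0, microsecond=0)] += 1

        # Record a gap when the source was silent longer than the threshold.
        if self._last_arrival is not None:
            if arrived - self._last_arrival > self.gap_threshold:
                self.gaps.append((self._last_arrival, arrived))
        self._last_arrival = arrived

        # Log a time-stamped event whenever the set of field names changes
        # (the first record logs the initial schema).
        schema = frozenset(record)
        if schema != self._schema:
            self.schema_events.append((arrived, schema))
            self._schema = schema

        for name, value in record.items():
            stats = self.fields.setdefault(name, FieldStats(arrived, arrived))
            stats.last_seen = arrived
            if value is None:
                continue
            stats.non_null += 1
            stats.values[value] += 1
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                stats.total += value
                stats.numeric_count += 1

    def summary(self, name: str) -> dict:
        s = self.fields[name]
        return {
            "density": s.non_null / self.records,  # share of records with a value
            "cardinality": len(s.values),          # number of distinct values
            "first_seen": s.first_seen,
            "last_seen": s.last_seen,
            "mean": s.total / s.numeric_count if s.numeric_count else None,
        }


# Usage with three made-up records.
t0 = datetime(2024, 1, 1, 12, 0, 0)
obs = IngestionObserver(gap_threshold=timedelta(minutes=5))
obs.observe({"user": "a", "amount": 10.0}, t0)
obs.observe({"user": "b", "amount": None}, t0 + timedelta(seconds=30))
obs.observe({"user": "a", "amount": 20.0, "country": "US"}, t0 + timedelta(minutes=10))
amount_summary = obs.summary("amount")
```

Because every statistic here is computed from the raw stream before any use-case-specific transformation, it describes the data rather than a particular hypothesis about it. Exact value counts are used for simplicity; at high volume, an approximate structure would likely be needed for cardinality and median.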

Additionally, a good data observability framework at ingestion should include a methodology for proactively correcting and preventing any issues it discovers.

Data quality is so integral to data and ML applications that data observability and quality enforcement must become fundamental components of every pipeline, and should appear early enough to avoid analyses that bias results.


Ori Rafael is co-founder and CEO of Upsolver, the only no-code data lake engineering platform. He has more than 15 years of experience in databases, data integration and big data.

Open Data Science



