
Abstract: Deferring data observability until use-case-driven feature engineering invites biased analysis. When observing, modifying, and inferring all happen at the same stage, it is too easy to make the data support a specific hypothesis. In this talk, we argue that the antidote is incorporating data observability at the earliest possible phase of an ML pipeline.
Observing data as it moves from the source into the analysis or application environment helps us dissociate data quality from data function. Observability features at ingestion should include the following (a sketch of these statistics follows the list):
Data flow rate: a historical and time-series view of the volume, spikes, and outages of data emitted by a source.
Data continuity and freshness: field-wise first- and last-arrival timestamps, and any gaps.
Data consistency: cardinality, density, and uniqueness statistics by field.
Schema evolution: time-stamped schema-change events and a record of how each change was handled.
Field summary statistics: running mean, median, mode, etc. for numeric fields.
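To make the list above concrete, here is a minimal Python sketch of per-field ingestion statistics covering freshness, consistency, and numeric summaries; flow rate would be a windowed count over the same event stream. The names here are illustrative, not from the talk or any product, and a production system would use streaming sketches such as HyperLogLog or t-digest rather than exact in-memory counters.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
import statistics

class FieldStats:
    """Per-field ingestion statistics: freshness, consistency, summaries.

    Minimal illustration only; fields absent from a record are not
    counted, and exact counters stand in for streaming sketches."""

    def __init__(self):
        self.first_seen = None   # freshness: first-arrival timestamp
        self.last_seen = None    # freshness: last-arrival timestamp
        self.total = 0           # observations of this field
        self.non_null = 0        # density numerator
        self.values = Counter()  # distinct values, for cardinality/uniqueness
        self.numerics = []       # numeric samples for mean/median/mode

    def observe(self, value, ts):
        self.total += 1
        if self.first_seen is None:
            self.first_seen = ts
        self.last_seen = ts
        if value is not None:
            self.non_null += 1
            self.values[value] += 1
            if isinstance(value, (int, float)):
                self.numerics.append(value)

    def summary(self):
        s = {
            "density": self.non_null / self.total if self.total else 0.0,
            "cardinality": len(self.values),
            "uniqueness": len(self.values) / self.non_null if self.non_null else 0.0,
            "first_seen": self.first_seen,
            "last_seen": self.last_seen,
        }
        if self.numerics:
            s["mean"] = statistics.fmean(self.numerics)
            s["median"] = statistics.median(self.numerics)
            s["mode"] = statistics.mode(self.numerics)
        return s

def observe_batch(stats, records):
    """Update per-field stats from a batch of ingested records (dicts)."""
    for record in records:
        ts = datetime.now(timezone.utc)
        for field, value in record.items():
            stats[field].observe(value, ts)

stats = defaultdict(FieldStats)
observe_batch(stats, [{"user_id": 1, "amount": 9.5},
                      {"user_id": 2, "amount": None}])
print({f: s.summary() for f, s in stats.items()})
```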
Additionally, a good data observability framework at ingestion should include a methodology for proactively correcting and preventing the issues it surfaces, as sketched below.
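As one hedged illustration of proactive handling, the sketch below registers and logs new fields (schema evolution) and quarantines type-inconsistent records instead of silently ingesting them. The dict-based schema registry and the policies are assumptions made for the example, not a description of any specific product.

```python
from datetime import datetime, timezone

# Illustrative schema registry and policies; not a real product's API.
schema = {"user_id": int, "amount": float}
schema_log = []   # time-stamped schema-change events
quarantine = []   # records held back instead of silently ingested

def ingest(record):
    """Admit a record, evolving the schema and quarantining bad rows."""
    for field, value in record.items():
        if field not in schema:
            # Schema evolution: register the new field and log the event.
            # If the first value is null, accept anything until typed.
            schema[field] = type(value) if value is not None else object
            schema_log.append((datetime.now(timezone.utc), "added", field))
        elif value is not None and not isinstance(value, schema[field]):
            # Prevention: hold back type-inconsistent records for review
            # rather than letting them bias downstream analysis.
            quarantine.append(record)
            return False
    return True  # record is safe to pass downstream

ingest({"user_id": 1, "amount": 9.5, "coupon": "SPRING"})  # logs new field
ingest({"user_id": "oops", "amount": 3.0})                 # quarantined
print(schema_log, quarantine)
```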
Data quality is so integral to data and ML applications that data observability and quality enforcement must become a fundamental component of every pipeline, and they should appear early so as to avoid result-biasing analyses.
Bio: Ori Rafael is co-founder and CEO of Upsolver, the only no-code data lake engineering platform. He has more than 15 years of experience in databases, data integration, and big data.