Metrics & Visualizations for Evaluating Synthetic Data Quality

Abstract: 

Synthetic data has shown great promise for solving a variety of problems, such as addressing data scarcity for AI and overcoming barriers to data access. But the field of synthetic data generation is still young, and it has not yet converged on a common set of benchmarks for evaluating the quality of synthetic data.

Our team originally came from MIT’s Data-to-AI Lab, and we’ve spent years researching and collecting the best metrics for evaluating synthetic data quality, such as CategoricalCAP and Boundary Adherence, among others.
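To give a feel for what a metric like Boundary Adherence measures, here is a minimal sketch that assumes the usual definition: the share of synthetic values that fall inside the minimum and maximum seen in the original column. The function name `boundary_adherence` and the sample ages are hypothetical, and this is a hand-rolled illustration rather than the SDMetrics implementation.

```python
def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside the [min, max] range of the real values."""
    # Ignore missing entries (represented here as None) in both columns.
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("both columns need at least one non-missing value")

    low, high = min(real), max(real)
    in_range = sum(1 for v in synthetic if low <= v <= high)
    return in_range / len(synthetic)


# Hypothetical example data: two synthetic ages (19 and 70) fall outside 23..64.
real_ages = [23, 35, 41, 58, 64]
synthetic_ages = [25, 19, 44, 70, 60]
score = boundary_adherence(real_ages, synthetic_ages)
print(f"Boundary adherence: {score:.2f}")
```

A score of 1.0 means every synthetic value respects the observed range; lower scores flag values a domain expert may want to inspect.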

Learning Objectives
Learn the basic approach of evaluating synthetic data by comparing columns with your original data.
Most data in organizations and businesses is structured, relational, and tabular. Learn about the unique problems that synthetic data generation can solve for this kind of data, based on our experience helping thousands of individuals work with synthetic data.
Choosing the right synthetic data quality metrics isn’t easy and is tied closely to the goal of your project. We’ll showcase our recommended framework, which combines the context and expertise of domain experts with specific, interpretable statistical measures.
Learn which metrics and visualizations you should use for each data type.
Learn the most common pitfalls and mistakes people make when generating synthetic data.
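The column-by-column comparison described in these objectives can be sketched as follows. This is an illustration under stated assumptions, not the framework from the talk: it scores numeric columns with the complement of the Kolmogorov-Smirnov statistic and categorical columns with the complement of the total variation distance, two common choices for this purpose. The function names, column names, and data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs (numeric columns)."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    # The largest gap between two step functions occurs at an observed value.
    points = sorted(set(real_sorted) | set(syn_sorted))
    max_gap = max(
        abs(bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(syn_sorted, x) / len(syn_sorted))
        for x in points
    )
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    real_freq, syn_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(syn_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Hypothetical original and synthetic tables, stored as column -> values.
real_table = {"age": [1, 2, 3, 4], "plan": ["a", "a", "b", "b"]}
synthetic_table = {"age": [3, 4, 5, 6], "plan": ["a", "b", "b", "b"]}

# Pick the metric by data type: numeric vs. categorical.
scores = {
    "age": ks_complement(real_table["age"], synthetic_table["age"]),
    "plan": tv_complement(real_table["plan"], synthetic_table["plan"]),
}
for column, value in scores.items():
    print(f"{column}: {value:.2f}")
```

Both scores range from 0 to 1, with 1 meaning the synthetic column's distribution matches the original, which makes them easy to plot side by side per column.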

Takeaways
Statistical measures are necessary but insufficient for evaluating synthetic data. Domain expertise is important for defining the business rules that your data should follow, which a quality score alone does not capture.
Using side-by-side visualizations of quality scores can help communicate synthetic data quality to your stakeholders and collaborators.
The goals of a project play a big role in how you evaluate the quality of synthetic data.
When evaluating synthetic data, avoid common statistical pitfalls. For example, it’s tempting to compare the correlations between columns in the original data with those in the synthetic data, but correlation coefficients assume a linear relationship, and that assumption is often violated.
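The correlation pitfall, and the value of a domain rule alongside a statistical score, can be seen in a small constructed example. The sketch below assumes Pearson correlation as the measure and uses hypothetical data in which the original column `y` is exactly `x` squared. The synthetic column reuses the same `y` values but attaches them to the wrong rows, so the rule `y == x * x` mostly fails while the correlation is unchanged.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rule_adherence(xs, ys):
    """Share of rows that satisfy the (hypothetical) business rule y == x * x."""
    return sum(1 for x, y in zip(xs, ys) if y == x * x) / len(xs)


x = list(range(-5, 6))                 # -5, -4, ..., 5
real_y = [value * value for value in x]  # fully determined by x, but not linearly
# Same y values as the original, reassigned so the relationship is broken.
synthetic_y = [1, 4, 9, 16, 25, 0, 25, 16, 9, 4, 1]

real_r = pearson(x, real_y)
synthetic_r = pearson(x, synthetic_y)
real_rule = rule_adherence(x, real_y)
synthetic_rule = rule_adherence(x, synthetic_y)

print(f"Pearson r, original:  {real_r:.3f}")
print(f"Pearson r, synthetic: {synthetic_r:.3f}")
print(f"Rule adherence, original:  {real_rule:.2f}")
print(f"Rule adherence, synthetic: {synthetic_rule:.2f}")
```

Both correlations come out at zero, so a correlation-only comparison would report a perfect match, while the rule check shows that most synthetic rows violate the relationship present in the original data.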

Tools
Plotly and SDMetrics, both completely open source (MIT licensed)
Examples of visualizations we’ll showcase are here and here.

Bio: 

Bio Coming Soon!

Open Data Science

 

 

 

One Broadway
Cambridge, MA 02142
info@odsc.com
