
Abstract: Finding good datasets for data products, or good web assets for websites, can be time-consuming. For instance, data professionals might need data from heavily regulated industries such as healthcare and finance, while software developers might want to skip the tedious task of collecting images, text, and videos for a website. Luckily, both scenarios can now benefit from the same solution: synthetic data.
Synthetic data is artificially generated data created with machine learning models, algorithms, and simulations. This workshop shows you how to enter the world of synthetic data by teaching you how to create a full-stack digital product made up of five interrelated projects. These projects include reproducible data pipelines, a dashboard, machine learning models, a web interface, and a documentation site. So, if you want to enhance your data projects or find great assets to build a website with, come and spend 3 fun and knowledge-rich hours in this workshop.
Session Outline:
1. Introduction and Setup (~20 minutes)
   - In this part of the workshop, we'll walk through the agenda and set up our development environment (optional browser-based environments will be provided via GitHub Codespaces and other websites).
2. Section I - Building Blocks (~50 minutes)
   1. Introduction to Synthetic Data
      - What is it and why use it?
      - How to generate synthetic data with plain Python code.
      - Introduction to the different frameworks available.
      - Creating a synthetic data generator module.
      - Exercise (7-min)
   2. Analytics
      - Analysing and comparing real data vs synthetic data
      - Creating an analytical proof of concept with synthetic data
      - Exercise (5-min)
3. 10-minute break
4. Section II - Engineering (~50 minutes)
   1. Data Engineering
      - Task - Create synthetic datasets and build Extract, Transform and Load pipelines for different use cases
      - Synthetic Data Use Case - Generate data with errors to simulate how data professionals receive data in the real world
      - Exercise (5-min)
   2. Software Engineering
      - Task - Develop a minimalist website using Python tools such as FastAPI and Jinja templates
      - Synthetic Data Use Case - Generation of a website's assets including images, videos, and text
      - Exercise (5-min)
5. 10-minute break
6. Section III - Machine Learning (~50 minutes)
   - Quick intro to Machine Learning
   - Task - Create and evaluate different models, and develop an inference pipeline
   - Synthetic Data Use Cases:
     1. Data Augmentation
     2. Increase of Privacy
     3. Evaluation of Machine Learning Models
   - Exercise (5-min)
7. Concluding Thoughts (~5 minutes)
In this workshop, you will gain the practical knowledge needed to create a variety of synthetic datasets and web assets (no more [Lorem Ipsum](https://loremipsum.io/)) using different open source tools and machine learning models. You will first build an intuition about (1) why synthetic data can help you solve some of the same problems real data can. Next, you will learn (2) when to use synthetic data and, lastly, (3) how to use it as you build a full-stack digital product contained in a monorepo with five interrelated projects coded in Python. These projects include a data pipeline, some data analysis, machine learning models, a web interface, and a documentation section. This session is designed to be a fun and knowledge-rich 3-hour workshop.
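To give a flavor of two outline items, generating synthetic data with plain Python and producing data with errors, here is a minimal sketch that uses only the standard library. The field names, value ranges, and the `make_record`, `add_errors`, and `generate` helpers are assumptions made up for illustration; they are not taken from the workshop materials.

```python
import random

# Hypothetical vocabularies, chosen only for illustration.
FIRST_NAMES = ["Ana", "Bo", "Chen", "Dara", "Eli"]
PLANS = ["free", "basic", "premium"]


def make_record(rng, record_id):
    """Build one clean synthetic customer record."""
    return {
        "id": record_id,
        "name": rng.choice(FIRST_NAMES),
        "age": rng.randint(18, 90),
        "plan": rng.choice(PLANS),
        "monthly_spend": round(rng.uniform(0, 200), 2),
    }


def add_errors(record, rng, error_rate):
    """Return a copy of the record with some fields deliberately corrupted."""
    dirty = dict(record)
    if rng.random() < error_rate:
        dirty["age"] = None  # simulate a missing value
    if rng.random() < error_rate:
        dirty["plan"] = dirty["plan"].upper() + " "  # inconsistent casing and stray whitespace
    if rng.random() < error_rate:
        dirty["monthly_spend"] = str(dirty["monthly_spend"])  # number stored as text
    return dirty


def generate(n, seed=42, error_rate=0.0):
    """Generate n records; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    records = [make_record(rng, i) for i in range(n)]
    return [add_errors(r, rng, error_rate) for r in records]


if __name__ == "__main__":
    # Print a few records with roughly half of the corruptible fields damaged.
    for row in generate(5, error_rate=0.5):
        print(row)
```

Because the clean records are drawn before any errors are applied, calling `generate` with the same seed and different `error_rate` values yields the same underlying records, which makes it easy to compare a messy dataset against its clean counterpart.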
Background Knowledge:
Prerequisites (P) and Good-to-Haves (GTH)
- (P) Attendees are expected to be familiar with Python (one year or more of coding experience would be ideal).
- (P) Participants should be comfortable with loops, functions, list comprehensions, and if-else statements.
- (P) Participants should have at least 10 GB of free space on their computers (not applicable if using the browser-based IDE option).
- (GTH) While knowledge of data analytics or machine learning libraries is not necessary, some experience with pandas, NumPy, Altair, Metaflow, and scikit-learn would be very beneficial.
- (GTH) While experience with Jupyter Notebooks in an integrated development environment such as VS Code or JupyterLab is not required, having either one installed, plus Miniconda (https://docs.conda.io/en/latest/miniconda.html), would be very beneficial for the session.
Bio: Ramon is a data scientist, researcher, and educator currently working in the Developer Relations team at Seldon in London. Prior to joining Seldon, he worked as a freelance data professional and as a Senior Product Developer at Decoded, where he created custom data science tools, workshops, and training programs for clients in various industries. Before freelancing, Ramon wore different research hats in the areas of entrepreneurship, strategy, consumer behavior, and development economics in industry and academia. Outside of work, he enjoys giving talks and technical workshops and has participated in several conferences and meetup events. In his free time, you will most likely find him traveling to new places, mountain biking, or both.

Ramon Perez
Title
Developer Advocate | Instructor | Seldon | Decoded
