
Abstract: A session where Large Language Models, Generative AI and synthetic data converge to address one of the most pressing challenges in data science: missing data imputation for time series. Missing data imputation is one of the hardest tasks in machine learning. It is particularly difficult with incomplete or messy data and, depending on the context, it can introduce biases into the training set. Several strategies can address this issue and improve model accuracy. Time-series datasets add further complexity, because missing values disrupt the continuity of the data and can lead to inaccurate results. Various techniques exist to impute missing values in time series.
In this talk, we will cover the use of Generative models, such as LLMs and GANs, for the generation of smart synthetic data that can be leveraged to impute missing data. By using a generative model to impute missing data, we can generate new samples that are representative of the underlying data distribution, which can help to reduce the impact of missing data on our models. In addition, these models can be fine-tuned to specific datasets, allowing us to generate synthetic data that is tailored to our particular use case.
The talk is split into four main sections:
Introduction to LLMs and Generative models: This module briefly introduces the concept of Generative models and what changed with LLMs, along with a few examples and use cases showing how generative models can be used.
Introduction to time-series missing data and profiling: In this module, attendees will learn how to profile their time-series datasets to gain a deeper understanding of the data. This includes techniques for identifying missing data and gaps in the dataset. Attendees will also explore statistical measures and visualizations that reveal the underlying patterns and trends in the data. By the end of this module, attendees will have a solid foundation in analyzing time-series data and making data-driven decisions.
The different methods for missing data imputation: In this module, the audience will learn about the methods available for imputing missing data in time series, ranging from traditional approaches to more advanced generative models. By understanding the differences, challenges, and impacts of these methods, researchers and practitioners can ensure that their imputed datasets are of high quality and fit for purpose. Traditional approaches include simple fills such as back-filling, forward-filling, and mean imputation, as well as interpolation methods such as linear interpolation. While these methods are quick and easy to implement, they may not always produce the most accurate results, particularly with complex or noisy datasets. Generative models, on the other hand, offer a more advanced approach to missing data imputation in time series.
Evaluate the different imputation methods: Finally, the audience will be able to visualize and understand how to choose the best imputation solution based on a set of different metrics. These metrics will help to determine the quality of the imputation and reduce the potential for bias and other errors. For example, we might look at the accuracy of the imputed values, the speed of the imputation algorithm, or the ability of the imputation to handle missing data in different forms. By carefully evaluating these metrics, we can choose the best imputation solution for our specific needs and ensure that our analysis is as accurate and reliable as possible.
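As a rough illustration of the profiling, imputation, and evaluation steps outlined above, here is a minimal Python sketch. It is not the material used in the talk: the toy series, the hidden positions, and all function names are hypothetical, only the traditional baselines are covered (no generative model), and mean absolute error stands in for whichever metrics the session actually uses. The idea is to hide values that are known, impute them, and score each method against the truth.

```python
def gap_lengths(values):
    """Profile the series: lengths of consecutive runs of missing values."""
    gaps, run = [], 0
    for v in values:
        if v is None:
            run += 1
        elif run:
            gaps.append(run)
            run = 0
    if run:
        gaps.append(run)
    return gaps


def forward_fill(values):
    """Replace each gap with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def linear_interpolate(values):
    """Fill interior gaps on a straight line between the nearest observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def mean_impute(values):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def mae(truth, imputed, hidden):
    """Mean absolute error, computed only on the positions that were hidden."""
    return sum(abs(truth[i] - imputed[i]) for i in hidden) / len(hidden)


# Hypothetical toy data: a perfectly linear series with three values hidden.
truth = [float(2 * t) for t in range(10)]
hidden = [3, 4, 7]
observed = [None if i in hidden else v for i, v in enumerate(truth)]

methods = {
    "forward_fill": forward_fill,
    "linear": linear_interpolate,
    "mean": mean_impute,
}
scores = {name: mae(truth, fn(observed), hidden) for name, fn in methods.items()}

print("gap lengths:", gap_lengths(observed))
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {score:.3f}")
```

Because the toy series is exactly linear, linear interpolation recovers the hidden values perfectly here; that says nothing about real data, where the ranking can differ. With pandas, `Series.ffill()`, `Series.bfill()`, and `Series.interpolate()` provide equivalent baselines.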
Bio: Fabiana Clemente is the co-founder and CDO of YData, combining Data Understanding, Causality, and Privacy as her main fields of work and research, with the mission to make data actionable for organizations. Passionate about data, Fabiana has vast experience leading data science teams in startups and multinational companies. She hosts the “When Machine Learning meets privacy” podcast, has been a guest speaker at Datacast and Privacy Please, is a previous WebSummit speaker, and was recently awarded “Founder of the Year” by the South Europe Startup Awards.