Troubleshooting Large Language Models in Production with Embeddings and Evals


A standing-room-only (with many on the floor!) version of this talk was delivered at ODSC East. In this updated version of the talk, Amber will also explore some applications to foundation models and emerging use cases around LLMs.

According to a recent survey, over half (53%) of data science and machine learning teams say they plan to deploy large language model (LLM) applications into production in the next 12 months or “as soon as possible.” However, nearly as many (43%) cite issues such as response accuracy and hallucinations as a main barrier to implementation.

Clearly, better debugging is needed! It’s often said that debugging machine learning is 10 times harder than debugging software, since it combines many of the problems of software and data engineering with challenges specific to data science and MLOps.

This is particularly true in deep learning, where labeled data is expensive to obtain yet remains one of the few ways to get feedback on model performance. It’s no wonder that even the most advanced and best-funded large language models, like GPT-4 and Google’s Bard, still sometimes hallucinate and fail in the real world.

Here’s the truth: troubleshooting models based on unstructured data is notoriously difficult. The measures typically used for drift in tabular data – such as population stability index, Kullback-Leibler divergence, and Jensen-Shannon divergence – allow for statistical analysis on structured labels, but do not extend to unstructured data. The general challenge with measuring unstructured data drift is that you need to understand the change in relationships inside the unstructured data itself. In short, you need to understand the data in a deeper way before you can understand drift and performance degradation.
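To see why these tabular measures presuppose structure, note that they all operate on binned probability distributions of a single feature. The following is a minimal, illustrative sketch of the population stability index (the feature values, bin count, and thresholds are assumptions for demonstration, not details from the talk):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one numeric
    feature, using bins fit on the expected (reference) sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; a small epsilon avoids log(0).
    eps = 1e-6
    e_pct = e_counts / e_counts.sum() + eps
    a_pct = a_counts / a_counts.sum() + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)  # reference window
shifted = rng.normal(0.5, 1.0, 5_000)   # production window, mean shifted

print(psi(baseline, baseline[:2500]))   # near zero: no drift
print(psi(baseline, shifted))           # noticeably larger: drift
```

The whole computation hinges on being able to bin a scalar feature into a histogram – exactly the structure that a cloud of high-dimensional embeddings from free-form text or images does not provide.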

In this presentation, Amber Roberts, Machine Learning Engineer at Arize AI, will present findings from research on ways to measure vector/embedding drift for image and language models. With lessons learned from testing different approaches (including Euclidean and cosine distance) across billions of streams and use cases, Roberts will dive into how to detect whether two unstructured language datasets are different — and, if so, how to understand that difference using techniques such as UMAP.
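One simple member of this family of approaches is to compare summary statistics of the two embedding clouds directly – for example, the Euclidean or cosine distance between the centroids of a baseline and a production embedding set. The sketch below illustrates the idea only; the random Gaussian vectors, dimensionality, and shift direction are stand-ins for real model embeddings, not the methodology from the research itself:

```python
import numpy as np

def centroid_drift(baseline, production):
    """Euclidean and cosine distance between the mean (centroid)
    embedding vectors of two embedding sets (one row per vector)."""
    b = baseline.mean(axis=0)
    p = production.mean(axis=0)
    euclidean = float(np.linalg.norm(b - p))
    cosine = 1.0 - float(b @ p / (np.linalg.norm(b) * np.linalg.norm(p)))
    return euclidean, cosine

rng = np.random.default_rng(42)
dim = 768
center = np.ones(dim)  # hypothetical shared "topic" direction
base = center + rng.normal(0, 1, (1000, dim))  # reference embeddings
same = center + rng.normal(0, 1, (1000, dim))  # same distribution
# Drifted set: half the dimensions shift, changing the centroid's direction.
shift = np.concatenate([0.6 * np.ones(dim // 2), np.zeros(dim // 2)])
drifted = center + shift + rng.normal(0, 1, (1000, dim))

print(centroid_drift(base, same))     # both distances near zero
print(centroid_drift(base, drifted))  # both clearly larger
```

Centroid distances are cheap but coarse – they can miss drift that reshapes the cloud without moving its mean – which is one reason visualization techniques such as UMAP are useful for understanding *how* two embedding sets differ, not just whether they do.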

In the coming years, more ML teams will likely look to embedding drift to help detect and understand differences in their data. Drawing on real-world examples, this presentation will be both useful and fascinating to advanced data scientists and learners alike!


Amber Roberts is an ML Growth Lead at Arize AI, an ML observability company built for maintaining models in production. Previously, Amber was a product manager of AI at Splunk and the Head of Artificial Intelligence at Insight Data Science. A Carnegie Fellow, Amber holds an MS in Astrophysics from the Universidad de Chile.

Open Data Science
One Broadway
Cambridge, MA 02142