Model Evaluation in LLM-enhanced Products


Evaluation in machine learning (ML) product development is a rich topic with a long history. However, large language models (LLMs) represent a significant deviation from the known path and introduce many unknowns. Since the same LLM can be flexibly applied in a wide range of contexts, both with and without additional tuning, its evaluation must reflect this increased scope. Moreover, since LLMs output natural language instead of discrete classes, we must shift our evaluation focus from classic metrics like accuracy and F1 scores to complex concepts like usefulness, attribution, factuality, and safety.

Given this new paradigm, how can we build on long-standing best practices of evaluation, learn from academic research, and build solid evaluation pipelines for LLMs? Furthermore, we must consider the important role that humans play in model evaluations and determine what can be automated -- and whether it should be.

In this talk, I will discuss these questions alongside common pitfalls, opportunities, and best practices related to including large language models as an additional ingredient in product development.


Sebastian Gehrmann is the Head of NLP in the Office of the CTO at Bloomberg, where he contributes to and guides the strategy for the development of language technology across the company. His research interests range from natural language generation to model evaluation. He has worked on large language models such as BloombergGPT and BLOOM, as well as PaLM and PaLM 2.

Before joining Bloomberg, Sebastian was a senior researcher at Google. He holds a Ph.D. in computer science from Harvard University.

Open Data Science




Open Data Science
One Broadway
Cambridge, MA 02142