Demystifying LLM Evaluation


According to a recent survey, 61.7% of enterprise engineering teams now have or are planning to have a generative AI application this year – and 14.1% are already in production.

As companies race to deploy generative AI into their businesses, the need to ensure that LLMs are deployed reliably and responsibly is paramount. LLM evaluation is a key part of this process.

Unfortunately, LLM evaluation and model performance analysis are areas where confusion reigns, and the key distinction between LLM model evaluations and LLM system evaluations often gets lost in practice. Even sophisticated teams stare at a sea of leaderboards and libraries and scratch their heads. Nothing is more important to get right, however, than understanding where you can apply an LLM and how well it performs at a specific task.

Session Outline:

The session will cover:
-> The difference between LLM model evals (leaderboards) and LLM system (task) evals
-> Research on how major foundation models (from OpenAI’s GPT-4 to Mistral’s Mixtral 8x7B and Anthropic’s Claude 3) stack up against each other on important tasks and emerging LLM use cases, from numeric evals to time-series analysis
-> Techniques for building an LLM task eval from scratch (open-source tools referenced may include LlamaIndex, Ragas, and Phoenix)

Background Knowledge:

A basic understanding of agentic workflows and LLM tools is helpful but not essential.


Jason Lopatecki is co-founder and CEO of Arize AI, an AI observability and LLM evaluation company. He is a garage-to-IPO executive with an extensive background in building market-leading products and businesses that heavily leverage analytics. Prior to Arize, Jason was co-founder and chief innovation officer at TubeMogul, where he scaled the business into a public company and through its eventual acquisition by Adobe. Jason has hands-on knowledge of complex LLM systems, big data architectures, programmatic advertising systems, distributed systems, and machine learning and data processing architectures. In his free time, Jason tinkers with personal LLMOps projects as a hobby. He holds an electrical engineering and computer science degree from UC Berkeley. Go Bears!

Open Data Science
One Broadway
Cambridge, MA 02142
