Editor’s note: Ivan Lee is a speaker for ODSC East this April 23-25. Be sure to check out his talk, “LLM-native Products: Industry Best Practices and What’s Ahead,” there!

2023 was a year of experimentation and exploration with LLMs. ChatGPT captured everyone’s imagination because it seemed able to do everything: suggest international travel plans, generate ideas for meal plans, and analyze lengthy legal contracts. However, it lacked access to proprietary data, and hallucinations and non-deterministic answers limited its use in production. AI engineers around the world scrambled to address these issues.

In 2024, business execs expect tangible results. Data science teams are now tasked with demonstrating ROI on their work. The mistakes made by ChatGPT and its successors pose much greater business risks for the enterprise. As an NLP practitioner for over a decade, I have worked with Fortune 100 companies and am actively helping them deploy high-impact, performant LLM/GenAI solutions.

The best-performing teams start with results-oriented development. They work with the intended users to identify workflows that can be optimized and automated. A key early question: will the model results be treated as definitive answers, or will a human review the completions before they are used? The data scientists set clear guidelines on what will and will not be supported at each phase of development, and set clear business-centric success criteria for leadership. Once the model is trained and deployed, they set up training sessions with users to ensure they understand how to use the new tool properly. And as with any software engineering project, they measure usage, observe the strengths and weaknesses of the model, and then iterate to improve its performance as needed.

Many teams are understandably focused on model accuracy. Many of the most popular metrics measure how a model performs on specific use cases in relation to the current gold standard, GPT-4. However, for models in production, it is also important to measure and weigh tradeoffs in cost and inference time. Some companies are seeing their OpenAI bills pile up and are looking for more affordable long-term solutions. Meanwhile, ChatGPT employs a nifty UX trick in “typing out” its answers. This can mask the fact that a single prompt can often take 30 seconds or longer; by comparison, a Google search for “OpenAI” returns 303,000,000 results in 0.35 seconds. As internet users, we have become accustomed to very fast results. So teams have to make critical decisions for production environments: is it better to receive an answer in 5 seconds, or an answer that is 10% more accurate in 30 seconds? In my upcoming talk at ODSC, I will dive further into techniques that help assess a model’s success and bring down costs and inference time:

  • Prompt-based unit testing
  • Establishing ground truth datasets
  • LLM distillation – a proliferation of smaller, ad hoc LLMs and NLP models
  • Prompt caching
  • Pros and cons of working with open-source LLMs
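
To make two of the items above more concrete, here is a minimal Python sketch of prompt-based unit testing against a small ground truth dataset, combined with a simple prompt cache. It is an illustration under stated assumptions, not the approach presented in the talk: `fake_llm`, `CachedLLM`, `GROUND_TRUTH`, and `run_prompt_tests` are hypothetical names, the model call is a canned stand-in rather than a real API, and "prompt caching" is interpreted here as exact-match caching of completions.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    if "capital of france" in prompt.lower():
        return "The capital of France is Paris."
    return "I am not sure."


class CachedLLM:
    """Wraps a model function and caches completions by exact prompt text."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self._model = model
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        # Hash the prompt so the cache key has a fixed size.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay for one model call and remember the result.
        self.misses += 1
        answer = self._model(prompt)
        self._cache[key] = answer
        return answer


# Assumed ground truth: (prompt, substring the completion must contain).
GROUND_TRUTH: List[Tuple[str, str]] = [
    ("What is the capital of France?", "paris"),
]


def run_prompt_tests(llm: CachedLLM, cases: List[Tuple[str, str]]) -> List[str]:
    """Return the prompts whose completion lacks the expected substring."""
    failures = []
    for prompt, expected in cases:
        if expected not in llm.complete(prompt).lower():
            failures.append(prompt)
    return failures


llm = CachedLLM(fake_llm)
failures = run_prompt_tests(llm, GROUND_TRUTH)  # first run calls the model
run_prompt_tests(llm, GROUND_TRUTH)             # second run is served from cache
print(f"failures={failures}, hits={llm.hits}, misses={llm.misses}")
```

In practice the substring check would likely be replaced by whatever success criterion the team has agreed on, and an exact-match cache only helps when identical prompts recur, so the hit rate is worth measuring before relying on it to reduce cost or latency.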

Please join me to learn more about aligning LLM research with business goals. I look forward to seeing you there!

About the author:

Ivan Lee graduated with a Computer Science B.S. from Stanford University, then dropped out of his master’s program to found his first mobile gaming company, Loki Studios. After raising institutional funding and building a profitable game, Loki was acquired by Yahoo.

Lee spent the next 10 years building AI products at Yahoo and Apple and discovered there was a gap in serving the rapid evolution of Natural Language Processing (NLP) technologies. He built Datasaur to focus on democratizing access to NLP and LLMs. Datasaur raised $8M in venture funding from top-tier investors such as Initialized Capital, Greg Brockman (President, OpenAI), and Calvin French-Owen (CTO, Segment), and serves companies including Google, Netflix, Qualtrics, and Spotify.