Oct 29th – Oct 31st, 2024
Hyatt Regency San Francisco, Burlingame, CA
Register now & Save 40%


Register now & Save 40%

100+ sessions coming soon!

Please note that we will not be live-streaming in-person sessions – only virtual session recordings will be recorded.              Conference Time: ET (UTC – 5)

12:00 pm - 12:00 pm Machine Learning Beginner

Who Wants to Live Forever? Reliability Engineering and Mortality

Reliability engineering is the study of survival and failure in engineered systems, but its methods can be applied as well in natural and social sciences, and business. It reveals surprising patterns in the world, including many examples where used is better than new -- that is, we expect a used part to last longer than a new one. In this talk, I'll present tools of reliability engineering including survival curves, hazard functions, and expected remaining lifetimes. And we'll consider examples from a variety of domains, including light bulbs, computer systems, and life expectancy for humans and institutions. Intuitively, we expect things to wear out over time: a new car is expected to last longer than a used one, and a young person is expected to live longer than an old person. But many natural and engineered systems defy this intuition. For example, in the last weeks of pregnancy, the process becomes almost memoryless: the expected remaining duration levels off at four days, and stays there for almost four weeks. Other examples entirely invert our expectations, so the longer something has survived, the longer we expect it to survive. Until recently, nearly every baby born had this property, due to high rates of infant mortality. Computer programs, data transfers, and freight trains have it, too. Understanding this behavior is important for designing computer systems, interpreting a medical prognosis, and maybe finding the key to immortality.

Allen Downey, PhD Allen Downey, PhD Curriculum Designer at | Professor Emeritus at Olin College
Deep Learning | Beginner

Data Morph: A Cautionary Tale of Summary Statistics

Statistics do not come intuitively to humans; they always try to find simple ways to describe complex things. Given a complex dataset, they may feel tempted to use simple summary statistics like the mean, median, or standard deviation to describe it. However, these numbers are not a replacement for visualizing the distribution. To illustrate this fact, researchers have generated many datasets that are very different visually, but share the same summary statistics. In this talk, I will discuss """"""""Data Morph"""""""" (, an open source package that builds on previous research from Autodesk (the """"""""Datasaurus Dozen"""""""" ( using simulated annealing to perturb an arbitrary input dataset into a variety of shapes, while preserving the mean, standard deviation, and correlation to multiple decimal points. I will showcase how it works, discuss the challenges faced during development, and explore the limitations of this approach.

Stefanie Molin Stefanie Molin Data Scientist, Software Engineer at Bloomberg | Author of Hands-On Data Analysis with Pandas
Generative AI | Beginner-Intermediate

Tutorial: Intro to the ChatGPT API

Conversational AI, especially ChatGPT, has become extremely popular over the past year. By January 2023, ChatGPT was the fastest-growing consumer software application in history. While many are familiar with and frequently use its web interface, we will explore its API (Application Programming Interface). The API access allows you to interact with ChatGPT in a Jupyter Notebook or any other coding environment and use it as a developer tool. It radically speeds up the development and deployment of many natural language processing tasks such as text summarization, sentiment analysis, topic modeling, text transformations (such as translation, grammar correction, and style adjustments), and chatbot development. I will show how to perform these tasks during the tutorial. I hope that by the end, you will be well-equipped to start innovating with ChatGPT and develop your own applications. I assume attendees have standard Python knowledge and know how to work with container types (such as lists and dictionaries), control flow (like for loops and if statements), and functions.

Andras Zsom, PhD Andras Zsom, PhD Assistant Professor of the Practice, Director of Graduate Studies at Data Science Institute, Brown University
Responsible AI | All Levels

How AI Impacts the Online Information Ecosystem

Through concrete examples and a high-level conceptual overview, I'll discuss the various ways---both good and bad---that AI is impacting our online information ecosystem. This includes creation of mis/disinfo (LLMs, deepfake video/audio), propagation of mis/disinfo (search rankings, social media algs), the funding of disinfo (the targeted advertising industry), and AI-assisted fact-checking and bot detection/deletion.

Noah Giansiracusa, PhD Noah Giansiracusa, PhD Associate Professor of Mathematics and Data Science at Bentley University
LLMs&NLP| Intermediate

LangChain on Kubernetes: Cloud-Native LLM Deployment Made Easy & Efficient

Deploying large language model (LLM) architectures with billions of parameters can pose significant challenges. Creating generative AI interfaces is difficult enough on its own but add to that the complexity of managing a complex architecture while juggling computational requirements and ensuring efficient resource utilization, and you’ve got a potential recipe for disaster when transitioning your training models to a real-world scenario. LangChain, an open source framework for developing applications powered by LLMs, aims to simplify creating these interfaces by streamlining the use of several neuro-linguistic programming (NLP) components into easily deployable chains. At the same time, Kubernetes can help manage the underlying infrastructure. This talk walks you through how to smoothly and efficiently transition your trained models to working applications by deploying an end-to-end LLM containerized application built with LangChain in a cloud-native environment using open-source tools like Kubernetes, LangServe, and FastAPI. You'll learn how to deploy a trained model quickly and easily that's designed for scalability, flexibility, and seamless orchestration.

Ezequiel Lanza Ezequiel Lanza AI Open source Evangelist at Intel
LLMs | All Levels

Tutorial: Operationalizing Local LLMs Responsibly for MLOps

I. Introduction to LLMs Defining foundation of large language models Use cases like search, content generation, programming II. Architecting High-Performance LLM Pipelines Storing training data efficiently at scale Leveraging specialized hardware accelerators Optimizing hyperparameters for cost/accuracy Serving inferences with low latency III. Monitoring and Maintaining LLMs Tracking model accuracy and performance Retraining triggers to stay performant Evaluating inferences for bias indicators Adding human oversight loops IV. Building Ethical Guardrails for Local LLMs Auditing training data composition Establishing process transparency Benchmarking rigorously on safety Implementing accountability for production systems V. The Future of Responsible Local LLMs Advances that build trust and mitigate harms Policy considerations around generative models Promoting democratization through education

Machine Learning | Beginner-Intermediate

Tutorial: No-Code and Low-Code AI: A Practical Project Driven Approach to ML

No-code machine learning (ML) is a way to build and deploy ML models without having to write any code. Low-code ML is a way to build and deploy ML models with minimal coding. Both methods can be valuable for businesses and individuals who do not have the skills or resources to develop ML models themselves. By completing this workshop, you will develop an understanding of no-code and low-code frameworks, how they are used in the ML workflow, how they can be used for data ingestion and analysis, and for building, training, and deploying ML models. You will become familiar with Google’s Vertex AI for both no-code and low-code ML model training, and Google’s Colab, a free Jupyter Notebook service for running Python and the Keras Sequential API, a simple and easy-to-use API that is well-suited for beginners. You will also become familiar with how to assess when to use low-code, no-code, and custom ML training frameworks. The primary audience for this workshop are aspiring citizen data scientists, business analysts, data analysts, students, and data scientists who seek to learn how to very quickly experiment, build, train, and deploy ML models.

Gwendolyn D. Stripling, PhD Gwendolyn D. Stripling, PhD Lead AI & ML Content Developer at Google Cloud
Generative AI | All Levels

Tutorial: Deploying Trustworthy Generative AI

Generative AI models and applications are being rapidly deployed across several industries, but there are several ethical and social considerations that need to be addressed. These concerns include lack of interpretability, bias and discrimination, privacy, lack of model robustness, fake and misleading content, copyright implications, plagiarism, and environmental impact associated with training and inference of generative AI models. In this talk, we first motivate the need for adopting responsible AI principles when developing and deploying large language models (LLMs) and other generative AI models, and provide a roadmap for thinking about responsible AI for generative AI in practice. Focusing on real-world LLM use cases (e.g. evaluating LLMs for robustness, security, etc. using, we present practical solution approaches / guidelines for applying responsible AI techniques effectively and discuss lessons learned from deploying responsible AI approaches for generative AI applications in practice. By providing real-world generative AI use cases, lessons learned, and best practices, this talk will enable researchers & practitioners to build more reliable and trustworthy generative AI applications. Please take a look at our recent ICML/KDD/FAccT tutorial ( for an expanded version of this talk.

Krishnaram Kenthapadi Krishnaram Kenthapadi Chief AI Officer & Chief Scientist at Fiddler AI
Deep Learning | Intermediate

Tutorial: How to Practice Data-Centric AI and Have AI Improve its Own Dataset

In Machine Learning projects, one starts by exploring the data and training an initial baseline model. While it’s tempting to experiment with different modeling techniques right after that, an emerging science of data-centric AI introduces systematic techniques to utilize the baseline model to find and fix dataset issues. Improving the dataset in this manner, one can drastically improve the initial model’s performance without any change to the modeling code at all! These techniques work with any ML model and the improved dataset can be used to train any type of model (allowing modeling improvements to be stacked on top of dataset improvements). Such automated data curation has been instrumental to the success of AI organizations like OpenAI and Tesla. While data scientists have long been improving data through manual labor, data-centric AI studies algorithms to do this automatically. This tutorial will teach you how to operationalize fundamental ideas from data-centric AI across a wide variety of datasets (image, text, tabular, etc). We will cover recent algorithms to automatically identify common issues in real-world data (label errors, bad data annotators, outliers, low-quality examples, and other dataset problems that once identified can be easily addressed to significantly improve trained models). Open-source code to easily run these algorithms within end-to-end Data Science projects will also be demonstrated. After this tutorial, you will know how to use models to improve your data, in order to immediately retrain better models (and iterate this data/model improvement in a virtuous cycle).

Jonas Mueller Jonas Mueller Chief Scientist and Co-Founder at Cleanlab
Machine Learning | Intermediate

Tutorial: Introduction to Apache Arrow and Apache Parquet, using Python and Pyarrow

This workshop will cover the basics of Apache Arrow and Apache Parquet, how to load data to/from pyarrow arrays, csv and parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining and sorting. In addition, you will also experience the benefits of the open Arrow ecosystem and see how Arrow allows fast and efficient interoperability with pandas,, DataFusion, DuckDB and other technologies that support the Arrow memory format.

Andrew Lamb Andrew Lamb Chair of the Apache Arrow Program Management Committee | Staff Software Engineer at InfluxData
Data Engineering | Intermediate-Advanced

Data Engineering in the Age of Data Regulations

Continuous data regulations like GDPR, CCPA, DMA and many others are giving control to users over how their data is used and imposing restrictions on what companies can do with user data. This talk will focus on LinkedIn's approach to converting these regulations into policies and integrating policy enforcement in data engineering practices using our Policy Based Access Control (PBAC) system. It will cover how to annotate data, features, pipelines and models; how to integrate model training and inferences with the PBAC system; and how to enforce policies. It will describe the architecture and components of LinkedIn's governance system and various tools used to automate the annotation and enforcement process.

Alex Gorelik Alex Gorelik Distinguished Engineer at LinkedIn
LLMs | Intermediate

Model Evaluation in LLM-enhanced Products

Evaluation in machine learning (ML) product development is a rich topic with a long history. However, Large language models (LLMs) represent a significant deviation from the known path and introduce a lot of unknowns. Since the same LLM can be flexibly applied in a wide range of contexts both with and without additional tuning, its evaluation must reflect this increased scope. Moreover, since LLMs output natural language instead of discrete classes, we must shift our evaluation focus from classic metrics like accuracy and F1 scores to complex concepts like usefulness, attribution, factuality, and safety. Given this new paradigm, how can we build on long-standing best practices of evaluation, learn from academic research, and build solid evaluation pipelines for LLMs? Furthermore, we must consider the important role that humans play in model evaluations and determine what can be automated -- and whether it should be. In this talk, I will discuss these questions alongside common pitfalls, opportunities, and best practices related to including large language models as an additional ingredient in product development.

Sebastian Gehrmann, PhD Sebastian Gehrmann, PhD Head of NLP, Office of the CTO at Bloomberg
LLMs | Intermediate-Advanced

Training an OpenAI Quality Text Embedding Model from Scratch

Text embeddings are an integral component of modern NLP applications powering retrieval-augmented-generation (RAG) for LLMs and semantic search. High quality text embeddings models are closed source and access to them is gated via the API's of leading AI companies. This talk describes how Nomic AI trained nomic-embed-text-v1 - the first fully auditable open-data, open-weights and open-training code text embedding model that outperforms the performance of OpenAI Ada-002. You will learn how text embedding models are trained, the various training decisions that impact model capabilities and tips for successfully using them in your production applications.

Andriy Mulyar Andriy Mulyar Founder & CTO at Nomic AI
LLMs | Intermediate

Tracing In LLM Applications

According to a recent survey, 61.7% of enterprise engineering teams now have or are planning to have an LLM app in production within a year – and over one in ten (14.7%) are already in production, compared to 8.3% in April. With a record pace of adoption, the practice of troubleshooting and observing LLM apps takes on elevated importance. For software engineers that work with distributed systems, terms like “spans,” “traces,” and “calls” are well known. But what might these terms mean in a world where foundation models dominate? Since LLM observability isn’t just about tracking API calls, but about evaluating the LLM’s performance on specific tasks, there are a variety of span kinds and attributes that can be filtered on, in order to troubleshoot a LLM’s performance. Hosted by Amber Roberts – a data scientist, ML engineer and astrophysicist and former Carnegie Fellow – this session will focus on best practices for tracing calls in a given LLM application by providing the terminology, skills and knowledge needed to dissect various span kinds. Informed by work with dozens of enterprises with LLM apps in production and research on what works, attendees can learn span types and how to view traces from a LLM callback system and establish troubleshooting workflows to break down each call an application is making to an LLM. The session will explain and dive into both top-down workflows (starting with the big picture of the LLM use case and then getting into specifics of the execution if the performance is not satisfactory) and bottom-up workflows (discovery workflow where you are at the local level to filter on individual spans).

Amber Roberts Amber Roberts ML Growth Lead at Arize AI

Beyond Theory: Effective Strategies for Bringing Generative AI into Production

In the rapidly evolving and constantly advancing landscape of artificial intelligence, foundation models like GPT-4 and DALL-E 3 and the broader world of generative AI have emerged as potential game-changers, offering unprecedented and previously unimagined capabilities across a wide variety of domains and use cases. However, while these theoretical models showcase promising capabilities, the practical challenge of transitioning from conceptual research to full-scale production-level applications remains a major obstacle that many organizations and teams continue to face. This keynote presentation aims to help bridge this gap by taking a deep dive into exploring pragmatic and actionable strategies and best practices for successfully integrating these cutting-edge AI technologies into real-world business environments. We will closely examine the critical concepts surrounding Foundation Model Operations and Large Language Model Operations (FMOps/LLMOps), delving into the practical intricacies and challenges involved in deploying, monitoring, maintaining and scaling generative AI models in enterprise production systems. The discussion will comprehensively cover several critical topics such as optimal model selection, rigorous testing and evaluation, efficient training and fine-tuning techniques, retrieval augmented generation (RAG) architectures, and effective deployment strategies required for operationalization. Attendees will gain crucial and applicable insights into overcoming common obstacles frequently faced when attempting to deploy AI in live systems, including recommendations around managing resource-intensive models, ensuring ongoing model fairness and transparency, and strategically adapting to the continuously fast evolving AI landscape. To provide full perspective, the talk will also highlight relevant real-world examples and case studies, providing a comprehensive end-to-end view of the demanding practical requirements for true AI deployment. This presentation has been tailored for a wide audience encompassing AI and machine learning professionals, technology leaders, IT and DevOps teams, and anyone generally interested in better understanding the operational side of taking AI technology live. Whether you're looking to implement generative AI capabilities in your own organisation or working to enhance existing AI operations, this discussion will equip you with directly actionable knowledge and tools to successfully meet the challenges in navigating the world of FMOps/LLMOps.

Heiko Hotz Heiko Hotz Generative AI Global Blackbelt | Google
AI X | All Levels

LLM-native Products: Industry Best Practices and What's Ahead

The next generation of LLM-powered products will not look and feel like ChatGPT. ChatGPT captured everyone's imagination because it could do everything - suggest international travel plans, generate ideas for dates, and analyze lengthy legal contracts. Now companies are building AI models that are more narrowly focused and must avoid the recurring issues of hallucination and non-deterministic answers. The mistakes made by ChatGPT and its successors pose much greater business risks for the enterprise. An incorrect answer to a simple math problem or directions to a cafe that doesn't exist might make the news as a silly, laughable mistake. But a customer socially engineering a model to ask about confidential pricing documents, or an employee poking around HR performance reviews could result in a PR fiasco. In August 2023, a generative AI for recipes mistakenly produced a recipe that was poisonous to humans. As an NLP practitioner of over a decade, I have experience working with Fortune 100 companies and actively observe trends in how they are seeking to deploy LLM/GenAI technologies. This talk will explore the business objectives LLMs are most suited to address across all industries, and how organizations ranging from healthcare to eCommerce are leveraging this new technology to increase revenue or cut costs. We will also discuss how to avoid common pitfalls organizations make in scoping and building their projects, how to set reasonable goals and milestones for executive sponsors, and look ahead to a future where LLMs ubiquitously power all our day-to-day applications behind the scenes.

Ivan Lee Ivan Lee CEO / Founder at

Moving Beyond Statistical Parrots - Large Language Models and their Tooling

Large language models like GPT-4 and Codex have demonstrated immense capabilities in generating fluent text. However, simply scaling up data and compute results in statistical parroting without true intelligence. This talk explores frameworks and techniques to move beyond statistical mimicry. We discuss leveraging tools to retrieve knowledge, prompt engineering to steer models, monitoring systems to detect biases, and cloud offerings to deploy conversational agents. This talk explores the emerging ecosystem of frameworks, services, and tooling that propel large language models and enable developers to build impactful applications powered by large language models. Complex mechanisms like function calling and Retrieval Augmented Generation, navigating towards meaningful outputs and applications requires an overarching focus on strong model governance frameworks that can ensure that biases and harmful ideologies embedded in the training data are duly mitigated, paving the way towards beneficial application development. Developers play a crucial role in this process and should be empowered with tools and knowledge to steer these models appropriately. Intentional use of these elements not only optimizes model governance but also enriches the experience for developers, allowing them to dig deeper and create substantial applications that are not mere parroting, but stockholders of genuine value. From deploying conversational agents to crafting impactful applications across a swath of industries, such as healthcare and education, the comprehensive understanding and utilization of the vast array of LLM mechanisms can truly push the boundaries of NLP and AI, helping to usher in the age of AI in everyday life.

Ben Auffarth, PhD Ben Auffarth, PhD Author: Generative AI with LangChain | Lead Data Scientist at Hastings Direct
LLMs | All Levels

Reasoning in Large Language Models

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark, with large language models (LLMs) performing impressively as few-shot learners. Despite these achievements, even the largest of these models still struggle with tasks that require reasoning. Recent work has shown that prompting or fine-tuning LLMs to generate step-by-step rationales, or asking them to verify their final answer can lead to improvements on reasoning tasks. While these methods have proven successful in specific domains, there is still no general framework for LLMs to be capable of reasoning in a wide range of situations. In this talk, I will give an overview of some of the existing methods used for improving and eliciting reasoning in large language models, methods for evaluating reasoning in these models, and discuss limitations and challenges.

Maryam Fazel-Zarandi, PhD Maryam Fazel-Zarandi, PhD Researcher Engineering Manager, FAIR at Meta
All Levels

Generative AI for Social Good

This talk will focus on current generative AI methods, including image and text generation, with a focus on social good applications, including medical imaging applications, diversity training applications, public health initiatives, and underrepresented language applications. We'll start with an overview of common generative AI algorithms for image and text generation before launching into a series of case studies with more specific algorithm overviews and their successes on social good projects. We'll explore an algorithm called TopoGAN that is being used to augment medical image samples. We'll look at GPT-4 and open-source large language models (LLMs) that can generate cases of bias and fairness. We'll consider how language translation and image generators such as stable diffusion can quickly produce public health campaign material. Finally, we'll explore language generation with low-resource languages like Hausa and Swahili, highlighting the potential for language applications in the developing world to aid businesses, governments, and non-profits communicating with local populations. We'll end the talk with a discussion of ethical generative AI and potential for misuse. Learning outcomes include familiarity with common generative AI algorithms and sources, their uses in a variety of settings, and ethical considerations when developing generative AI algorithms. This will equip programming-oriented data scientists with a background to implement algorithms themselves and business-focused analytics professionals with a background to consider strategic initiatives that might benefit from generative AI.

Colleen Molloy Farrelly Colleen Molloy Farrelly Chief Mathematician at Post Urban Ventures
All Levels

CodeLlama: Open Foundation Models for Code

In this session, we will present the methods used to train Code Llama, the performance we obtained, and show how you could use Code Llama in practice for many software development use cases. Code Llama is a family of open large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B, and now 70B parameters each. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other open model on MultiPL-E. Code Llama was released under a permissive license that allows for both research and commercial use.

Baptiste Roziere Baptiste Roziere Research Scientist at Meta

Generative Modeling in Quantitative Finance

Recent advances in modeling of financial markets have focused on understanding deep statistical dependencies among a large number of financial assets and their characteristics. We model joint dynamics of thousands of companies along with hundreds of their financial data fields, e.g. market prices, fundamentals and technical indicators; and when we don’t have sufficient historical data we use generative models using machine learning to produce synthetic data. We show that generative methods have a broad range of applications in finance, including generating realistic financial time-series, volatility and correlation estimation and portfolio optimization. We will demonstrate applications in Data Imputation and Now-casting particularly for deducing geographical, climate and ESG exposures of companies that fail to report on these metrics. We also apply generative modeling for general asset pricing and hedging use cases.

Arun Verma, PhD Arun Verma, PhD Head of Quant Research Solutions Team, CTO Office at Bloomberg LP
Data Engineering | Beginner-Intermediate

The 12 Factor App for Data

Data is everywhere, and so too are data-centric applications. As the world becomes increasingly data-centric, and the volumes of that data increase over time, data engineering will become more and more important. If we're going to be dealing with petabytes of data it will be better to get the fundamentals in place before you start, rather than trying to retrofit best practices onto mountains of data. This only makes a difficult job harder. The 12-factor app helped to define how we think about and design cloud native applications. In this presentation, I will discuss 12 principles of designing data-centric applications that have helped me over the years across 4 categories : Architecture & Design, Quality & Validation (Observability), Audit & Explainability, Consumption. This has ultimately led to our teams delivering data platforms that are both testable and well-tested. The 12 factors also enable them to be upgraded in a safe and controlled manner and will help them get deployed quickly, safely and repeatedly. This talk will be filled with examples and counter examples from the course of my career and the projects that my teams have seen over the years. It will incorporate software engineering best practices and how these apply to data-centric engineering. We hope that you can benefit from some of our experience to create higher quality data-centric applications that scale better and get into production quicker.

James Bowkett James Bowkett Technical Delivery Director at OpenCredo
Data Engineering | Intermediate-Advanced

Engineering Knowledge Graph Data for a Semantic Recommendation AI System

Semantic recommendation systems are a type of AI system that can help surface content in vast repositories by representing the data as a knowledge graph and implementing graph traversal algorithms that return relevant content to end users. These systems can be very useful for clients across industries, and plenty of fun for the data engineers on-board, requiring skills such as auto-tagging, ETL pipeline construction and orchestration, and graph algorithm design and implementation. Learn how to design such a system in this in-depth tutorial.

Ethan Hamilton Ethan Hamilton Data Engineer at Enterprise Knowledge
MLOps & Data Engineering | Intermediate

Highly Scalable Inference Platform for Models of Any Size

In recent years, advances in ML/AI have made tremendous progress yet designing large-scale data science and machine learning applications still remain challenging. The variety of machine learning frameworks, hardware accelerators, cloud vendors as well as the complexity of data science workflows brings new challenges to MLOps. One particular challenge is that it’s non-trivial to build an inference system that’s suitable for models of different sizes, especially for LLMs or large models in general. This talk presents various best practices and challenges on building large, efficient, scalable, and reliable AI/ML model inference platforms using cloud-native technologies such as Kubernetes and KServe that are production-ready for models at any size.

Yuan Tang Yuan Tang Principal Software Engineer at Red Hat
LLMs | Intermediate-Advanced

Setting Up Text Processing Models for Success: Formal Representations versus Large Language Models

With increasingly vast storehouses of textual data readily available, the field of Natural Language Processing offers the potential to extract, organize, and repackage knowledge revealed either directly or indirectly. Though for decades one of the holy grails of the field has been the vision of accomplishing these tasks with minimal human knowledge engineering through machine learning, with each new wave of machine learning research, the same tensions are experienced between investment in knowledge engineering and integration know-how on the one hand and production of knowledge/insight on the other hand. This talk explores techniques for injecting insight into data representations to increase effectiveness in model performance, especially in a cross-domain setting. Recent work in neural-symbolic approaches to NLP is one such approach, in some cases reporting advances from incorporation of formal representations of language and knowledge and in other cases revealing challenges in identifying high utility abstractions and strategic exceptions that frequently require exogenous data sources and the interplay between these formal representations and bottom-up generalities that are apparent from endogenous sources. More recently, Large Language Models (LLMs) have been used to produce textual augmentations to data representations, with more success. Couched within these tensions, this talk reports on recent work towards increased availability of both formal and informal representations of language and knowledge as well as explorations within the space of tensions to use this knowledge in effective ways.

Carolyn Rosé, PhD Carolyn Rosé, PhD Professor, Program Director for the Masters of Computational Data Science at Carnegie Mellon University

Building Knowledge Graphs

Knowledge graphs are all around us and we are using them everyday. Lot of the emerging Data management products like Data Catalogs/Fabric, MDM products are leveraging Knowledge Graphs as their engines. A knowledge graph is not a one-off engineering project. Building a KG requires collaboration between functional domain experts, data engineers, data modelers and key sponsors. It also combines technology, strategy and organizational aspects; focusing only on technology leads to a high risk of a KG’s failure. KGs are effective tools for capturing and structuring a large amount of structured, unstructured and semistructured data. As such, KGs are becoming the backbone of different systems, including semantic search engines, recommendation systems, and conversational bots and data fabric. This session guides data and analytics professionals to show the value of Knowledge Graphs and how to build build semantic applications.

Sumit Pal Sumit Pal Strategic Technology Director at Ontotext
Data Engineering | Intermediate

Data Pipeline Architecture - Stop Building Monoliths

In modern software development we have fully embraced microservice architecture, for good reason, but in data monoliths are accepted despite their pitfalls. Even when using the latest tooling associated with the “modern data stack” we very often end up creating monoliths, and almost always live to regret it. In small organizations, with small central teams we can get away with this architecture with limited discomfort for some time. In fact, like when developing any small software project, the monolith seems to save time, and gives the impression of higher productivity. But as complexity increases developer experience and productivity drops, and our system begins to get more brittle, frustrating both our engineering teams and stakeholders. Monolithic architecture is even more cumbersome in larger teams, especially in organizations that allow for federated data product development. So what’s the answer? How can we take inspiration from what’s been done in Microservices and Event Based Architecture? How can we apply some of the concepts of Data Mesh architecture? In this talk we will review how these patterns, and to what extent technologies can apply, starting from first principles and then working through the implementation patterns to common open source frameworks. This will include multi-Airflow infrastructure, micro-DAG packing and deployment, DBT multi-project implementation, rational use of containers, and data sharing/publication strategies. We will review some approaches for decomposing existing data monoliths, using a real world scenario.

Elliott Cordo Elliott Cordo Founder, Architect, Builder at Datafutures
Women Ignite

Language Modeling, Ethical Considerations of Generative AI, and Responsible AI

My session will focus on: • Technological evolution in Artificial Intelligence and Natural Language Processing which led to Generative AI • What is Generative AI • How Generative AI differs from Machine Learning and Deep Learning • How Large Language Models are built • The difference between Statistical Language Models vs Neural Language Models • Ethical considerations for Generative AI such as Bias, Privacy, Copyright, Intellectual Property Rights, Misinformation, and Environmental Impact • Responsible AI and how we can play our part to ensure that Large Language Models are developed and used responsibly

Madiha Shakil Mirza Madiha Shakil Mirza NLP Engineer at Avanade
Data Engineering | Intermediate

Dive into Data: The Future of the Single Source of Truth is an Open Data Lake

In this talk, I will take you on a journey building a centralized data repository that ingests from a wide variety of sources - e.g. service databases, SAAS applications, unstructured files, and conversational data. I give real life examples on how migrating from proprietary data warehouses to an open data lake dramatically reduced cloud costs and vendor lock-in, how cloud file targets decoupled compute from storage and improved data pipeline efficiency. I focus on the EL of extract, load, transform and the scalability of open table formats. I conclude with UniForm, the hope to end the “battle of metastores”, and the future state of data lakes. My insights can help you choose the most appropriate technology to accommodate diverse analytics, machine learning, and product use cases.

Christina Taylor Christina Taylor Senior Staff Engineer | Catalyst Software
Responsible AI | All Levels

Resisting AI

This session will introduce the arguments set out in the book 'Resisting AI'. The objective is to reframe the operations of AI as social and political as well as technical, to highlight their potential for amplifying social harms, and to equip participants with alternative ways to assess the social purpose of their work. Starting from the specific technical operations of deep learning and generative AI, the talk will explore AI's direct social impacts, its immersion in institutional and bureaucratic structures, and its resonance with key dynamics at a political, environmental and global level. The talk will challenge the sense that AI represents a sudden acceleration into a sci-fi future, and will draw out the different ways in which abstract computational operations are entangled with the same messy histories and politics as everything else in society. As such, it is an opportunity for participants to acknowledge the more uncomfortable aspects of the AI industry. The aim is to empower participants to call out solutionism, thoughtlessness or epistemic injustice and to steer their work away from instances of algorithmic violence or social and political exclusion. However, the talk will also set out proposals for positive futures in which complex computation is embedded in feminist and decolonial relationality and constitutes a technical practice for the common good.

Dr. Dan McQuillan Dr. Dan McQuillan Lecturer in Creative and Social Computing at Goldsmiths, University of London
Deep Learning | Intermediate

Trial, Error, Triumph: Lessons Learned using LLMs for Creating Machine Learning Training Data

We've all been in situations where we'd like to build a model but lack the labeled training data to do so. I plan to discuss how the advent of Large Language Models (LLMs) like GPT-4 has opened new avenues for generating training data. Traditionally, the creation of NLP datasets relied heavily on manual, crowdsourced handlabeling, often resorting to platforms like Mechanical Turk. This approach, while effective, presented significant challenges in terms of cost, time, and scalability. In this talk, I will share a comprehensive narrative of our journey from initial trials and errors to eventual triumphs in using LLMs for NLP data generation. The shift from manual to AI-assisted data creation marks a pivotal change in how we approach NLP model training. My team and I navigated through various challenges, experimenting with different strategies and learning valuable lessons along the way. I will discuss how we harnessed the power of LLMs to generate vast amounts of diverse, nuanced data, significantly reducing the time and cost compared to traditional methods. The talk will cover practical insights into fine-tuning these models for specific domains, ensuring data quality, and avoiding common pitfalls such as biases and overfitting. Moreover, I will highlight how LLMs can be creatively used to simulate real-world scenarios, providing richer and more contextually relevant training data. This not only improves the performance of traditional NLP models but also opens up possibilities for exploring new problem spaces within NLP. Attendees will leave with a deeper understanding of the potential and limitations of using LLMs in NLP data generation. They will gain actionable insights and strategies that can be applied in their own NLP projects, accelerating their journey from trial to triumph in the realm of AI-powered data science.

Matt Dzugan Matt Dzugan Director of Data at Muck Rack
Generative AI | Intermediate

Leveraging RAG and Multi-Agent LLM Systems for Automation of Knowledge Synthesis

This presentation focuses on the advanced utilization of Retrieval-Augmented Generation (RAG) and multi-agent Large Language Models (LLMs) for the synthesis of scientific knowledge. Using the development of WikiCrow as a case study, we explore how these technologies can efficiently curate and synthesize vast amounts of scientific literature, a task traditionally bottlenecked by information retrieval and summarization of 100s of millions of source documents. We will discuss the architecture and mechanics of multi-agent LLM systems like PaperQA, which underpin WikiCrow. This includes their ability to perform complex tasks such as identifying relevant scientific papers, parsing and summarizing text, and synthesizing this information into concise, accurate summaries. The focus will be on the technical strategies employed to reduce common issues like hallucinations in LLM outputs and the methods used to improve citation accuracy and relevance. The session will also cover the challenges and strategies in evaluating the performance of such systems, highlighting the importance of information retrieval, and the hurdles in assessing the veracity of AI-generated content. We aim to provide attendees with practical insights into how RAG and multi-agent LLMs can be integrated into their own systems for more effective data processing and knowledge synthesis. Attendees will leave with a deeper understanding of the potential and limitations of current AI technologies for knowledge synthesis. This talk is particularly suited for data scientists, AI researchers, and professionals interested in the application of LLMs and RAG systems for improving retrieval and knowledge management for AI agents and humans in scientific and research-oriented domains.

Matthew Rubashkin, PhD Matthew Rubashkin, PhD Head of Engineering at Future House
Machine Learning | Beginner-Intermediate

Workflow-based GeoAI Analysis with No/Low-Code Visual Programming

In this training session, we will explore the utilization of low-code/no-code visual programming platforms to effectively integrate geospatial analysis with a variety of AI algorithms, including machine learning, deep learning, and Explainable AI. Designed primarily for data science novices, this training enables participants to easily embark on their journey without needing extensive programming expertise. They will learn to harness the platform for advanced spatial analysis and the development of sophisticated AI models. The training is structured into four comprehensive sections: Introduction to the Visual Programming Platform: We will begin by introducing the open-source KNIME Analytics Platform (AP), detailing its basic features and user interface. Participants will become familiar with its intuitive visual programming environment. AI Functions in KNIME AP: This segment will cover the platform's advanced AI functionalities, providing insights into the range and capabilities of its AI tools. Extension on Geospatial Analysis for KNIME AP: Participants will delve into specific geospatial analysis applications, learning how to manage spatial data and execute spatial analyses within KNIME. Case Demonstration: The final part will focus on constructing AI models using the KNIME platform, with a special emphasis on deep learning and explainable AI models. A practical case study will be presented to demonstrate these models' application in geospatial analysis. Through this training, participants, irrespective of their data science background, will gain essential skills to employ the KNIME platform for both geospatial analysis and AI model applications. This will lay a solid foundation for their continued learning and practice in this evolving field. Time Schedule: Hands-off Training (1 hour): 10 minutes: Introduction to the KNIME platform 10 minutes: AI functionalities in KNIME 10 minutes: Introduction to the Geospatial Analysis Module 30 minutes: Introduction to AI Models and Case Demonstration

Lingbo Liu, PhD Lingbo Liu, PhD Postdoctoral Research Fellow at Harvard University
ML Safety & Security | Beginner

Overcoming the Limitations of LLM Safety Parameters with Human Testing and Monitoring

Ensuring safety, fairness, and responsibility has become a critical challenge in the rapidly evolving landscape of Large Language Models (LLMs). This talk delves into a new approach to address these concerns by leveraging the power of human testing and monitoring from a diverse global population. We present a comprehensive strategy employing a combination of crowd-sourced and professional testers from various locations, countries, cultures, and life experiences. Our approach thoroughly scrutinizes LLM and LLM application input and output spaces. It ensures responsible and safe product delivery. The presentation centers on functional performance, usability, accessibility, and bug testing. We share our research into these approaches and include recommendations for building test plans, adversarial testing approaches, and real-world usage scenarios. This diverse, global, human-based testing approach is a direct solution to the issues raised in recent papers highlighting the limited effectiveness of RLHF-created safety parameters against fine-tuning and prompt injection. Experts are calling for LLMs that inject safety parameters at the base parameter level, but, to date, this has resulted in a significant drop in LLM efficacy. Additionally, building safety directly into the pre-trained model is prohibitively expensive. Our approach overcomes these technical and financial limitations and is applicable now. Results point to a paradigm shift in LLM safety practices, yielding models and applications that remain helpful and harmless throughout their lifecycle.

Peter Pham Peter Pham Senior Program Manager at Applause
Josh Poduska Josh Poduska AI Advisor at Applause
ML for Biotech and Pharma | All Levels

Harnessing Machine Learning to Understand SARS-CoV-2 Variants and Hospitalization Risk

In this session, we will delve deep into the transformative potential of Machine Learning (ML) in the BioTech and Pharma industry. This talk will provide a comprehensive overview of how ML can be harnessed to accelerate drug discovery, enhance personalized medicine, improve patient outcomes, and drive innovation. We will explore real-world applications, focusing on a case study that involves the analysis of SARS-CoV-2 genetic variants and their association with hospitalization risk. This will provide attendees with a practical understanding of how ML can be applied to complex biological and medical data to derive actionable insights. The session will provide a detailed walkthrough of the use of ML models like XGBoost and analytical techniques like SHapley Additive exPlanations (SHAP) analysis. In addition to exploring these tools and techniques, we will also discuss the challenges that come with integrating ML into existing bioinformatics workflows.

Tomasz Adamusiak, MD, PhD Tomasz Adamusiak, MD, PhD Chief Scientist, Clinical Insights & Innovation Cell at MITRE
LLMs | Intermediate

Prompt Engineering: From Few Shot to Chain of Thought

The popularization of large language models (LLMs) has completely shifted how we solve problems as humans. In prior years, solving any task (e.g., reformatting a document or classifying a sentence) with a computer would require a program (i.e., a set of commands precisely written according to some programming language) to be created. With LLMs, solving such problems requires no more than a textual prompt. In this session, we will provide a basic primer on the topic of prompt engineering, as well as cover examples of notable prompt engineering techniques ranging from basic strategies like few-shot learning to more advanced approaches like chain of thought prompting.

Cameron Wolfe, PhD Cameron Wolfe, PhD Director of AI | Rebuy Engine
Data Engineering | All Levels

Is Gen AI A Data Engineering or Software Engineering Problem?

If you pulled aside a data engineer and asked him to create a chatbot program using an API to take an unstructured data user generated input, they would look at you with a bit of confusion and ask, “where’s the data?” These early days of Gen AI have made it appear to be mostly a software engineering and API integration project just like any other, and while software engineers have an important role to play, data teams need to roll up their sleeves and proactively find ways to use this new technology to unlock opportunities. Large companies with large ambitions are creating unique models, but as the production and talent democratizes this will become a competitive advantage for most companies.

Barr Moses Barr Moses Co-Founder & CEO at Monte Carlo

AI-powered Search

In a data-driven world where information is growing exponentially, the ability to effectively search and make sense of unstructured data is crucial. Vector Search is a method of information retrieval in which unstructured data is represented as vectors, and Machine Learning models allow a meaningful vector representation of the data. In this talk, we delve into the world of AI-powered search, covering topics including Natural Language Processing (NLP), Large Language Models (LLM), Semantic Search and Image Similarity Search. By the end of this talk, you will not only recognize the importance of working with unstructured data but also learn how to effectively utilize Machine Learning models and retrieval methods to take advantage of it.

Priscilla Parodi Priscilla Parodi Principal Developer Advocate at Elastic
Deep Learning | Intermediate

AI Resilience: Upskilling in an AI Dominant Environment

The boom of generative AI and LLMs have taken the world by storm. This development has already disrupted various industries and roles, and data science is no exception to that rule. In a word of embeddings and transfer learning, one might beg to question "What should I learn next?" and "Where should I spend my time and energy for deep dives?". This talk aims to guide existing AI practitioners on how to maintain relevant skills in an increasingly automated world, and how to stand out in an oversaturated job market.

Leondra Gonzalez Leondra Gonzalez Senior Data & Applied Scientist at Microsoft
Data Engineering Summit | Intermediate

Building Data Contracts with Open Source Tools

It's less complicated than it seems. In this session, you will build your first data contracts. I will first set the decorum: * What is a data contract? * What's its purpose? * Why it simplifies data engineers' lives? Then we will jump into the hands-on part, which you will be able to run in your environment. I will use some (as of now, experimental) open-source tools to generate a skeleton of a data contract, and we will add information to it. Once you created a data contract, you will learn more about their life cycle. Join me for this fun and fast-paced session, filled with extremely relevant information.

Jean-Georges Perrin Jean-Georges Perrin CIO at AbeaData
All Levels

10 Quick Wins To Expedite Your Job Search

In today's competitive job market, efficiency is paramount in finding your next opportunity. This high-speed fast paced talk provides attendees with concise, actionable strategies that can help job-hunters press fast forward. This presentation includes distillation of 10 specific action items. We begin by understanding the power of reconnecting with former bosses, co-workers, and supervisees. We explore the value of former classmates and schoolmates. We look at the power of updating friends and family about your career goals which can uncover hidden opportunities. A strategic approach to job searching includes strategically sharing job opportunities for others. The goal is to attract offers. If you’re tired of “personal branding” and “networking” this talk is for you. Each strategy is designed to be a quick win; simple to implement but with the potential for substantial impact. Job seekers from all backgrounds will leave this talk with a toolkit of techniques to not only expedite their job search but to do so with a targeted and effective approach. Whether you're a recent graduate or in the midst of a career transition, these insights are tailored to help you navigate the complexities of the job market and emerge successfully. Join us to transform your job search into a dynamic and results-driven journey.

Adam Ross Nelson Adam Ross Nelson Data Scientist + Career Coach at Up Level Data, LLC
LLMs | Beginner

Data Automation with LLM

In today's business environment, data plays a crucial role in decision-making. However, obtaining the required data can be challenging due to data engineering or data science resource constraints, leading to delays, inefficiency, and potential losses. This talk will focus on creating a self-serve bot (e.g., Slack bot) that can serve data requests and support ad-hoc requests by leveraging LLM applications. This involves building a natural language to SQL engine using tools such as OpenAI API or open-source models that leverage the Hugging Face API.

Rami Krispin Rami Krispin Senior Manager - Data Science and Engineering at Apple
Data Engineering | Intermediate

Experimentation Platform at DoorDash

The experimentation platform at DoorDash leverages multiple big data tools to help with thousands of decision making everyday. In this talk we will cover how company leverage the platform to make decisions in business strategies, machine learning models, optimization algorithms and infrastructure changes. We will also cover how the platform leverage Dagster to do metrics and analysis jobs orchestration; how the data storage and data fetching is done with datalake; how we enable exploratory analysis with Databrick notebook; and how we integrate with machine learning platform to make automated decisions.

Yixin Tang Yixin Tang Engineer Manager at DoorDash
Big Data Analytics | Intermediate

Conversational Data Intelligence: Transforming Data Interaction and Analysis

In our data-rich world, the capacity for efficient and intuitive interaction with massive data sets is more crucial than ever. Kevin Rohling, Head of AI at Presence Product Group, presents an exploration into Conversational Data Intelligence (CDI), a fusion of advanced AI technologies and data analytics that is redefining our engagement with data. CDI emerges as a pivotal innovation, leveraging Large Language Models and Semantic Search to facilitate natural, conversational interactions with complex data. This novel approach simplifies data navigation, breaking down barriers to data literacy and enabling professionals across various disciplines to access and interpret data without specialized data science expertise. The presentation will venture into the practical applications of CDI, showcasing its transformative impact across multiple sectors, including healthcare, legal, finance, and government. By illustrating real-world scenarios, the talk will demonstrate how CDI empowers professionals to make data-driven decisions more efficiently and accurately. Key to this discussion are the core technologies that fuel CDI. We will delve into the integration of Large Language Models, Semantic Search, and Natural Language to Query mechanisms, offering insights into their functionality and role in enhancing data interaction. However, the journey with CDI is not without its challenges. The talk will also address critical concerns such as data privacy, security, and the potential for bias. These aspects are integral to the responsible adoption and evolution of CDI technologies. Attendees will leave with a comprehensive understanding of CDI's capabilities and applications, equipped with insights into how CDI can be effectively integrated into their own industries. This talk is more than a presentation; it's an invitation to envision a future where data interaction is more accessible, insightful, and influential in driving innovation and efficiency across various professional landscapes.

Kevin Rohling Kevin Rohling Head of AI Engineering at Presence Product Group
Data Engineering | All Levels

Data Infrastructure through the Lens of Scale, Performance and Usability

Silicon Valley engineers and engineering challenges have ruled the data world for the last 20 years. The net result is data infrastructure companies focusing on being the highest scale, fastest systems to process enormous amounts of data– usability be damned. We don’t all have movie libraries the size of Netflix, search indexes the size of Google or social graphs the size of Meta. This talk explores the changes in hardware and mindsets enabling a new breed of software that is optimized for the 95% of us who do not have petabytes to process daily. I worked on Google BigQuery in 2012. At the time, the max size of memory on an EC2 machine was 60.5GB. Today, we have EC2 machines with 25TB of RAM. Our software design for data services, focused on distributed architectures, hasn’t taken into account that massive 400x change in the amount of memory available. At the same time, our laptops have gotten so much more powerful - with 16x the amount of RAM available in today’s Macbook Pro vs the ones offered in 2012. Shouldn’t our data infrastructure be adapted to take advantage of this local compute? What does this change in hardware and software mean for the user experience? Instead of focusing on consensus algorithms for large-scale distributed compute, can our engineers instead focus on making data more accessible, more usable and reduce the time between “problem statement” and “answer?” That’s the dream that I’m exploring and where I want to push our industry over the next 5 years.

Ryan Boyd Ryan Boyd Co-founder at MotherDuck
Machine Learning | Beginner-Intermediate

Fallacy of Scale

The leaps in AI made over the last few years – particularly LLMs – have been achieved with scale, i.e., training models on increasingly large datasets. Whether LLMs, like GPT4, can be improved with another run is a question of scale. The current approach – train models on huge datasets – has reliably delivered impressive results. But the quality of any LLM is determined by the size of the dataset it’s trained on and there’s a limit to those datasets, even if they are the size of the internet. The wider industry has been calibrated to believe that more data equals improved models, so we chase bigger and bigger runs and will likely see the first $1bn run within a year. But there’s a limitation to this approach – the data itself. Many communities lack datasets of a comparable size to those in English, and even that has its limitations. The day is approaching when scale alone won’t be enough to deliver meaningful advances. Efficient learning is a key component to true intelligence. Therefore, a focus on efficiency – learning deeper understanding from smaller data – is becoming increasingly important as scale reaches its limits to growth. With increased efficiency, it will be possible to continue the rapid advancement of AI, and potentially even more capable and intelligent models as higher levels of abstraction and representations can be learned. To create the next generation of intelligent algorithms that can deliver for less well-represented communities besides those that speak in English, we will need continued progress in efficient learning mechanisms and methods. Learning outcomes: · Why a focus on algorithmic sample efficiency is required to enable further advancement in AI · The advantages that efficient learning models can provide, from removing blockers to the development of theory of mind to increasing less well represented communities in speech tech. · Why LLMs are just the beginning, and the applications for speech technology once models can truly understand intent. · What are the next generation of intelligent systems and what it’ll take to get us there.

Trevor Back Trevor Back Chief Product Officer at Speechmatics
Data Engineering | Intermediate

From Research to the Enterprise: Leveraging Foundation Models for Enhanced ETL, Analytics, and Deployment

As Foundation Models (FMs) continue to grow in size and capability, data is often left behind in the rush toward solving problems involving documents, images, and videos. This talk will describe our research at Stanford University and Numbers Station AI on applying FMs to structured data and their applications in the modern data stack. Starting with ETL/ELT, we'll discuss our 2022 VLDB paper ""Can Foundation Models wrangle your data?"", the first line of work to use FMs to accelerate tasks like data extraction, cleaning and integration. We'll then move up the stack and discuss our work at Numbers Station to use FMs to accelerate data analytics workflows, by automating tasks like text-to-SQL generation, semantic catalog curation and data visualizations. We will then conclude this talk by discussing challenges and solutions for production deployment in the modern data stack.

Ines Chami Ines Chami Co-founder and Chief Scientist at NUMBERS STATION AI
Women Ignite | All Levels

My AI: Awareness in Evolving AI Technologies

AI permeates and has transformed daily routines – whether it's having a conversation with a chatbot for customer support or reviewing a likely recommended diagnosis for a patient. As technology grows, so does the need to educate professionals outside of the traditional technical fields. Building an AI-literate community means developing people to ask the right questions– How do I approach or communicate about AI? What are the strengths and limitations of AI? How is the public impacted by AI implementations? What tools are needed to evaluate AI implementations? Similar to data literacy, AI literacy requires the appropriate mindset, language, and skills to develop a framework that can assess data outputs. Beyond the technical challenges, recommended regulations, such as Executive Order 14110, and program stakeholders impose increasing demand and requirements for AI explainability, fairness/equity, and formal governance, while also introducing confusion around what this actually means for people. Fundamental to this approach is a governance framework combining people, processes, tools, and automation to ensure performance & trust in AI solutions, minimize risk, and maximize impacts. Join Leidos Chief AI Architect Roopa Vasan and Data Society CEO Merav Yuravlivker to discuss the implications and importance of having an AI-literate community that questions data responsibly and aligns with AI governance. By the end of the presentation, you will walk away with a comprehensive understanding of today's regulations, key frameworks for AI literacy and governance , and first steps to implement them.

Roopa Vasan Roopa Vasan Chief AI Architect at Leidos
Data Engineering | Beginner

Why the Hype Around dbt is Justified

The hype for dbt is everywhere you look, but is it really justified? Why do you need a tool to just run SQL when most data stores support SQL themselves? In this 30 minute session I am going to break down what dbt really is, what makes it unique, and show you why it is so much more then just SQL. We will look at what makes it so popular (and unpopular) as a data transformation tool and the driving factors behind those opinions, dispelling some mistruths along the way. If you are new to dbt and trying to wrap your head around this tool then this is the session for you! Come find out if the hype is real!

Cameron Cyr Cameron Cyr Staff Data Engineer at Breezeway
Dustin Dorsey Dustin Dorsey Sr. Cloud Data Architect at Onix
Machine Learning | Beginner-Intermediate

Optimizing Workplace with AI and Generative Bots

This study investigates the interplay between artificial intelligence, human skills, and task characteristics, and their impact on organizational performance. Applying the Resource-Based View and Task Technology Fit theories, we explored how generative AI designed for collaboration, as both a firm resource and a capability, can enhance task execution across different dimensions - routine/creative tasks and easy/complex tasks. We conducted an experimental study involving the development of a marketing campaign with distinct subtasks reflecting these dimensions. Our findings show that firms can gain substantial benefits from integrating AI and that AI improves task outputs in automation, support, creation, and innovation. Our study also suggests a nuanced relationship between humans and AI in creative tasks with humans outperforming AI. The study highlights the value of upskilling and reskilling in AI, and proposes a strategic blend of AI and human creativity for optimal results. These findings have implications for understanding the role of AI in organizational tasks and formulating effective strategies for AI integration in business and beyond. Our exploration includes the innovative use of GPT models as decision-support tools, integrating diverse theoretical perspectives and a clear task division between humans and AI, to enhance both the efficiency and effectiveness of AI-human interactions in various decision-making contexts.

Aleksandra Przegalinska Aleksandra Przegalinska Vice President at Kozminski University
Tamilla Triantoro, PhD Tamilla Triantoro, PhD Associate Professor, Business Analytics at Quinnipiac University
All Levels

How to Preserve Exact Attribution through Inference in AI: Get the Correct Explanations and Preserve Privacy via Instance-Based Learning

Most forms of machine learning explainability are ex-post; they attempt to create an approximate model of a model in order to try to understand why a prediction was made. For data scientists working with AI models today, that won’t cut it. There is an increasing need for full data transparency and explainability to mitigate against bias, incorrect information, and hallucinations — as well as increasing demands for privacy. In this session, hear from noted computer scientist and AI expert, and founder of a leading explainable AI company, Dr. Chris Hazard. He will show data practitioners how to leverage cutting-edge instance-based learning (IBL) to solve these problems. Most AI today is black box. IBL offers a fully explainable AI alternative, having a precise on/off switch for data provenance and lineage through inference. With IBL, the derivation of each inference can be easily understood from the data. Having worked with IBL for over a decade, Chris will explain how modern IBL techniques, built around information theory, have modern model performance characteristics. He will also show how IBL techniques have extremely strong robustness to adversarial attacks and are automatically calibrated. Attendees will learn how the same mechanisms that yield this performance are closely related to differentially private mechanisms, and how to deploy them to generate strongly private synthetic data at scale. Hearing practical examples, attendees will learn why attribution through inference is vitally important for data-centric AI, how to debug data and understand outcomes, and how to protect privacy and anonymity when it matters.

Chris Hazard, PhD Chris Hazard, PhD CTO and Co-founder at Howso
LLMs | Beginner-Intermediate

RAG, the bad parts (and the good!): building a deeper understanding of this hot LLM paradigm’s weaknesses, strengths, and limitations

Off-the-shelf Large Language Models (LLMs) such as GPT-4 have already proven their versatility in numerous tasks and are revolutionizing entire industries. However, achieving exceptional performance in highly specific domains can be challenging, and traditional fine-tuning is often not accessible, due to its extensive demands in terms of data, finances, and expertise, exceeding the means of most organizations. Retrieval-Augmented Generation (RAG) is a widely adopted technique to augment the knowledge of LLMs within very specific domains while mitigating hallucinations. RAG achieves this by shifting the burden of information retrieval from the LLM's internal knowledge to external retrieval systems, often more specialized in this task due to their focused scope. However, RAG is not a silver bullet. Getting it to perform effectively can be far from trivial, and for some use cases it’s not applicable entirely. In this talk we will first understand what RAG is, where it shines and why it works so well in these applications. Then we are going to see the most common failure modes and walk through a few of them to evaluate whether RAG is a suitable solution at all, how to fix it or alternatively what approaches could be a better fit for the specific use case.

Sara Zanzottera Sara Zanzottera NLP Engineer at deepset
Ai X | All Levels

AI and Society

AI is changing the way governments, companies, and individuals behave. By chosing the right sort of architectures for AI and for data we can make this transition much more safer and more likely to produce a healthy society. See for code, detail.

Alex Pentland, PhD Alex Pentland, PhD Professor at MIT | Founder and Director at MIT Connection Science
Data Engineering Summit | All Levels

Clean as You Go: Basic Hygiene in the Modern Data Stack

When my children walk around the house, they generally leave a trail of mess behind them. They sometimes realize that they shouldn't be doing this, but they’re so excited to move on to the next thing that catches their eye that they’ll say “Oh, I’ll clean it up later.” As grown adults with wisdom gained from experience, my wife and I know that this means either: They’ve just signed themselves up for a massive future cleaning job, or … … that someone else will have to clean up after them. We know that this is not good behavior for a child, so why do we so often do this as Data Engineers? The culture of “Move Fast and Break Things” has pressured us into closing tickets as quickly as possible, frequently pushing us towards the “Oh, I’ll clean it up later” mindset. While this may save us a few minutes in the short-term, we are creating long term headaches such as: Piles of small cleanup tasks for later Confusion among peers who try to use incomplete data assets Lack of metadata to activate throughout the Modern Data Stack In this session, we’ll discuss best practices for keeping your data clean and show how a few minutes of extra time up front can save us a bunch of time down the road.

Machine Learning | Intermediate

Making AI recommendations Human-centric

AI based recommendation systems are prevalent in almost all domains, ranging from online retail, media and navigation to healthcare decision support and more. Most real-world deployments require interacting with and learning from humans, yet AI algorithms for generating recommendations don't fully leverage human behavior. This talk will provide an overview of current methods for incorporating human factors in AI decisions including but not limited to human feedback (in the form of preferences, votes, comments, demonstrations, etc.), human behavior, human-AI complementarity, and developing trust. The talk will also highlight via examples the importance of convergent solutions and collaboration between AI and human decision science researchers, and chart the path forward in by highlighting open directions in developing human-centric AI. The talk will be aimed at machine learning researchers and developers working on recommendation systems with very basic knowledge of AI algorithms such as collaborative filtering, bandits and reinforcement learning, but will also provide insights for research and product managers in improving the user-experience of their products. Attendees will walk away with solution ideas and challenges in modeling and leveraging human factors in AI recommendation systems.

Data Engineering | Beginner-Intermediate

Unlocking the Unstructured with Generative AI: Trends, Models, and Future Directions.

The exponential growth in computational power, alongside the advent of powerful GPUs and advancements in cloud computing, has ushered in a new era of generative artificial intelligence (AI), transforming the landscape of unstructured data extraction. Traditional methods such as text pattern matching, optical character recognition (OCR), and named entity recognition (NER) have been plagued by challenges related to data quality, process inefficiency, and scalability. However, the emergence of large language models (LLMs) has provided a groundbreaking solution, enabling the automated, intelligent, and context-aware extraction of structured information from the vast oceans of unstructured data that dominate the digital world. This talk delves into the innovative applications of generative AI in natural language processing and computer vision, highlighting the technologies driving this evolution, including transformer architectures, attention mechanisms, and the integration of OCR for processing scanned documents. We will also talk about future of generative AI in handling complex datasets. Participants will gain insights into: The fundamental challenges and solutions in unstructured data extraction. The operational dynamics of Generative AI in extracting structured information. Future of generative AI in unstructured data extraction Practical insights into leveraging these technologies for real-world applications. Designed for data scientists, AI researchers, and industry professionals, the talk aims to equip attendees with the knowledge to harness the power of Generative AI in transforming unstructured data into actionable insights, thereby driving innovation and efficiency across industries.

Jay Mishra Jay Mishra Chief Operating Officer at Astera
Responsible AI | All Levels

Advancing Ethical Natural Language Processing: Towards Culture-Sensitive Language Models

Natural Language Processing (NLP) systems play a pivotal role in various applications, from virtual assistants to content generation. However, the potential for biases and insensitivity in language models has raised concerns about equitable representation and cultural understanding. This talk explores the development of Culture-Sensitive Language Models (LLMs) as a progressive step towards addressing these issues. The core principles involve diversifying training data to encompass a wide range of cultures, implementing bias detection and mitigation strategies, and fostering collaboration with cultural experts to enhance contextual understanding. Our approach emphasizes the importance of ethical guidelines that guide the development and deployment of LLMs, focusing on principles such as avoiding stereotypes, respecting cultural diversity, and handling sensitive topics responsibly. The models are designed to be customizable, allowing users to fine-tune them according to specific cultural requirements, fostering inclusivity and adaptability. The incorporation of multilingual capabilities ensures that the models cater to global linguistic diversity, acknowledging the richness of different languages and cultural expressions. Moreover, we propose a feedback mechanism where users can report instances of cultural insensitivity, establishing a continuous improvement loop. Transparency and explainability are prioritized to enable users to comprehend the decision-making process of the models, promoting accountability. Through this multidimensional approach, we aim to advance the field of NLP by developing culture-sensitive LLMs that not only understand and respect diverse cultural nuances but also contribute to a more inclusive and ethical use of language technology.

Gopalan Oppiliappan Gopalan Oppiliappan Head, AI Centre of Excellence at Intel
Demo Talk | All Levels

Low-Code, High Impact: Kickstart Your Data Analytics Journey with KNIME

The success of data science teams heavily relies on their chosen tools. While algorithmic expertise and domain wisdom are vital, the success of a data science project depends on additional contingent factors linked to the tool, such as costs, ease and time of learning, rapid prototyping, robust debugging and testing, flexibility, effective support, automation and security. In this talk, we’ll introduce you to KNIME Analytics Platform, the free and open-source data analytics software that relies on a low-code/no-code visual interface to enable professionals from any field to make sense of data. It features extensive data access & blending, wrangling, modeling and visualization capabilities, making it comprehensive and versatile for all stages of the data science life cycle. KNIME’s free and open-source nature eliminates licensing and budget concerns, and favors smooth integration with other technologies and scripting languages. The platform’s visual GUI allows quick prototyping and easy implementation of analytics pipelines via drag-and-drop data operation blocks. Together, we’ll experience KNIME Analytics Platform in action and build a simple AI-driven application, leveraging the generative power of API-based and local LLMs. Designed for simplicity, scalability and to address data science needs of any level of complexity, KNIME aims to drive open innovation and empower users with cutting-edge technologies in the evolving data tools landscape.

Roberto Cadili Roberto Cadili Data Scientist on the Evangelism Team at KNIME
Data Engineering | All Levels

Clean as You Go: Basic Hygiene in the Modern Data Stack

When my children walk around the house, they generally leave a trail of mess behind them. They sometimes realize that they shouldn't be doing this, but they’re so excited to move on to the next thing that catches their eye that they’ll say “Oh, I’ll clean it up later.” As grown adults with wisdom gained from experience, my wife and I know that this means either: They’ve just signed themselves up for a massive future cleaning job, or … … that someone else will have to clean up after them. We know that this is not good behavior for a child, so why do we so often do this as Data Engineers? The culture of “Move Fast and Break Things” has pressured us into closing tickets as quickly as possible, frequently pushing us towards the “Oh, I’ll clean it up later” mindset. While this may save us a few minutes in the short-term, we are creating long term headaches such as: Piles of small cleanup tasks for later Confusion among peers who try to use incomplete data assets Lack of metadata to activate throughout the Modern Data Stack

Eric Callahan Eric Callahan Principal, Data Solutions at Pickaxe Foundry
Generative AI | Beginner-Intermediate

Tutorial: Harnessing GPT Assistants for Superior Model Ensembles: A Beginner's Guide to AI Stacked Classifiers

OpenAI’s API allows users to programmatically create custom GPTs, referred to as Assistants, which can be instructed to write and execute code on provided data. This opens many exciting possibilities in data science, in particular the use of multiple Assistants to help build large scale, powerful machine learning ensemble methods that might otherwise be unfeasible. Model stacking is an advanced machine learning technique where multiple base models, typically of different types, are trained on the same data and their predictions used as input for a final ""meta-model"". While it is a powerful technique, stacking is generally impractical for most data scientists due to its heavy resource requirements and time-consuming architecture. However, by creating multiple AI Assistants through the API, these types of multi-model ensembles can be easily and quickly created. In this presentation, I will show how a single user with a beginner level knowledge of python can create a “swarm” of AI Assistants that train a series of models for use in a model-stacking ensemble classifier that outperforms traditional ML models on the same data. We will go over each step from getting set up with the API to orchestrating an AI swarm, to collecting their output for the final Meta model predictions.

Jason Merwin, PhD Jason Merwin, PhD Data Scientist II | Western Governors University
ML for Biotech and Pharma | All Levels

Tutorial: Data Science in the Biotech/Pharma Research Organization

In this hands-off tutorial, I will provide a framework for thinking about, and hence organizing, data science in biotech and pharmaceutical research organizations. Together, we will cover: (1) what the core mission of a data science team should be, (2) the ways a data science team can deliver value to the research organization, (3) major classes of problems and methods, and (4) challenges that are unique to a data science organization in the _research_ space, as contrasted to clinical development, manufacturing, and commercial organizations. By the end of this session, data science leaders at biotech and pharma companies who attend this session will be equipped with frameworks for for thinking about data science problems in the biotech and pharma research space. Executives who are unfamiliar with the research space of data science problems will walk away with a broad, high-level overview of data science problems in research and how to frame and understand their value.

Eric Ma, PhD Eric Ma, PhD Author of nxviz Package | Principal Data Scientist at Moderna
Data Engineering Summit | Beginner-Intermediate

Deciphering Data Architectures (choosing between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh)

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. In this presentation I will give you a guided tour of each architecture to help you understand its pros and cons. I will also examine common data architecture concepts, including data warehouses and data lakes. You’ll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs. The content is derived from my book Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh.

James Serra James Serra Data & AI architect at Microsoft
Machine Learning | Intermediate

Harmony in Complexity: Unveiling Mathematical Unity Across Logistic Regression, Artificial Neural Networks, and Computer Vision

This presentation embarks on an exploration of the intricate interconnections that bind logistic regression, neural networks, and computer vision, unveiling their shared foundational principles through the lens of linear algebra. The main focus of this exploration is to highlight how abstract mathematical concepts play a crucial role in shaping and bringing together these different methodologies. By drawing meaningful parallels between the construction of logistic regression functions and their mathematical representations, we create a path to understanding the intrinsic relationship between these two entities. In logistic regression, the linear function, dynamically molded by a combination of various features, emerges as a visual metaphor—a plane in the mathematical fabric. This illustration sets the stage for the intricate processes happening in neural networks. In the realm of neural networks, the combination of weights and nodes takes center stage as space surrounded by multi-dimensional planes. The alignment of these planes with linear algebra principles becomes apparent, highlighting the basic math that shapes how neural networks work. Despite their outward dissimilarity, an underlying mathematical structure binds these models together, with the singular differentiator residing in the activation function. Logistic regression leans on the sigmoid function, while neural networks embrace the ReLU function, showcasing the versatile adaptability of these mathematical tools. The widespread use of the ReLU activation function in neural networks and convolutional neural networks (CNNs) reveals a shared common mathematical foundation. CNNs are widely employed in computer vision algorithms. This consistency across architectures underscores the universality of the principles derived from linear algebra. Transitioning into the realm of computer vision, we explore the application of filters as weighted combinations of pixel features. This extends the linear algebraic concept to image processing, demonstrating the versatility and applicability of these mathematical principles across diverse domains. In essence, this presentation seeks to illuminate the profound harmony and shared essence of mathematical principles that transcend traditional disciplinary boundaries. It underscores the unifying influence of linear algebra in unraveling the core relationships defining the evolution of machine learning and computer vision paradigms, providing a holistic perspective for researchers.

Dr. Liliang Chen Dr. Liliang Chen Financial Analytics Manager at Freddie Mac

Generative AI | Beginner-Intermediate

Multimodal Retrieval Augmented Generation

Retrieval augmented generation (RAG) soon became established as the reference architecture whenever we want to inject custom knowledge into our LLM-powered applications. Insofar, RAG has been applied to text data. Nevertheless, with the launch of GPT-4-turbo vision, we can extend the same concept also data different from texts, such as images. In this workshop, we are going to cover the architecture behind a typical RAG application and how to incorporate images within this architecture, leveraging GPT-4-turbo with vision. To do so, we will see a practical implementation with Python and LangChain, consuming the model API from Azure OpenAI service.

Valentina Alto Valentina Alto Azure Specialist - Data and Artificial Intelligence at Microsoft
LLMS | Intermediate

Enabling Complex Reasoning and Action with ReAct, LLMs, and LangChain

ReAct is an approach that uses human reasoning traces to create action plans which determine the best action to take from a selection of available tools that are external to the LLM. This methodology mimics human chain of thought processes combined with the ability to engage with an external environment to solve problems and reduce the likelihood of hallucinations and reasoning errors. In this workshop, you will learn how to employ the ReAct technique to allow an LLM to determine where to find information to service different types of user queries, using LangChain to orchestrate the process. You’ll see how to it uses Retrieval Augmented Generation (RAG) to answer questions based on external data, as well as other tools for performing more specialized tasks to enrich the output of your LLM. All demo code and presentation material will be provided, as well as a temporary Amazon SageMaker Studio environment to build and deploy in.

Shelbee Eigenbrode Shelbee Eigenbrode Principal Machine Learning Specialist Solutions Architect at AWS
Giuseppe Zappia Giuseppe Zappia Principal Solutions Architect at AWS
Machine Learning | Intermediate-Advanced

Using Graphs for Large Feature Engineering Pipelines

Graph data structures provide a versatile and extensible data structure to represent arbitrary data. Data entities and their associated relations fit nicely into graph data structures. We will discuss GraphReduce, an abstraction layer for computing features over large graphs of data entities. This talk will outline the complexity of feature engineering from raw entity-level data, the reduction in complexity that comes with composable compute graphs, and an example of the working solution. We will also discuss a case study of the impact on a logistics & supply chain machine learning problem. If you work on large scale MLOps projects, this talk may be of interest.

Wes Madrigal Wes Madrigal ML Engineer at Mad Consulting
LLMs | All Levels

Data Synthesis, Augmentation, and NLP Insights with LLMs

Data synthesis, augmentation, and NLP insights with LLMs offer a foundational approach to understanding and utilizing artificial intelligence in data science. This workshop is designed to guide participants through the process of creating synthetic data, enhancing datasets through augmentation, and applying NLP techniques to extract valuable insights. These skills are essential in various fields such as social media analysis, customer behavior studies, content generation, and more. By participating in this workshop, you will learn how to generate realistic and functional synthetic data using LLMs. You will also explore methods to enrich this data and make it more applicable for real-world scenarios. Additionally, you will apply NLP techniques to synthesized and augmented data to uncover patterns, sentiments, and trends.

Tamilla Triantoro, PhD Tamilla Triantoro, PhD Associate Professor, Business Analytics at Quinnipiac University
Machine Learning | Intermediate

Machine Learning with XGBoost

This workshop will show how to use XGBoost. It will demonstrate model creation, model tuning, model evaluation, and model interpretation.

Matt Harrison Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake
Machine Learning | Beginner

Idiomatic Pandas

Pandas can be tricky, and there is a lot of bad advice floating around. This tutorial will cut through some of the biggest issues I've seen with Pandas code after working with the library for a while and writing three books on it. We will discuss: * Proper types * Chaining * Aggregation * Debugging

Matt Harrison Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake
Machine Learning | Intermediate

Causal AI: from Data to Action

In this talk, we will explore and demystify th world of Causal AI for data science practitioners, with a focus on understand cause-and-effect relationships within data to drive optimal decisions. In this talk, we will focus on: * from shapley to DAGs: the dangers of using post-hoc explainability methods as tools for decision making, and how tranditional ML isn't suited in situations where want to perform interventions on the system. * discovering causality: how do we figure out what is causal and what isn't, with a brief introduction to methods of structure learning and causal discovery * optimal decision making: by understanding causality, we now can accurately estimate the impact we can make on our system - how to use this knowledge to derive the best possible actions to make? This talk is aimed at both data scientists and industry practitioners who have a working knowledge of traditional statistics and basic ML. This talk will also be practical: we will provide you with guidance to immediately start implementing some of these concepts in your daily work.

Dr. Andre Franca Dr. Andre Franca CTO at connectedFlow
Generative AI | All Levels

Everything About Large Language Models: Pre-training, Fine-tuning, RLHF & State of the Art

Generative Large Language Models like GPT4 have revolutionized the entire tech ecosystem. But what makes them so powerful? What are the secret components which make them generalize to a variety of tasks? In this talk, I will present how these foundation models are trained. What are the steps and core-components behind these LLMs? I will also cover how smaller, domain-specific models can outperform general purpose foundation models like ChatGPT on target use cases

Chandra Khatri Chandra Khatri VP, Head of AI at Krutrim
Deep Learning | All Levels

Topological Deep Learning: Going Beyond Graph Data

Over the past decade, deep learning has been remarkably successful at solving a massive set of problems on datatypes including images and sequential data. This success drove the extension of deep learning to other discrete domains such as sets, point clouds, graphs, 3D shapes, and discrete manifolds. While many of the extended schemes have successfully tackled notable challenges in each domain, the plethora of fragmented frameworks have created or resurfaced many long-standing problems in deep learning such as explainability, expressiveness and generalizability. Moreover, theoretical development proven over one discrete domain does not naturally apply to the other domains. Finally, the lack of a cohesive mathematical framework has created many ad hoc and inorganic implementations and ultimately limited the set of practitioners that can potentially benefit from deep learning technologies. This talk introduces the foundation of topological deep learning, a rapidly growing field that is concerned with the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations including images and sequence data. It introduces the main notions while maintaining intuitive conceptualization, implementation and relevance to a wide range of practical applications. It also demonstrates the practical relevance of this framework with practical applications ranging from drug discovery to mesh and image segmentation.

Dr. Mustafa Hajij Dr. Mustafa Hajij Assistant Professor at University of San Francisco
LLMs | Intermediate

Building Using Llama 2

This session aims to provide hands-on, engaging content that gives developers a basic understanding of Llama 2 models, how to access and use them, build core components of the AI chatbot using LangChain and Tools. The audience will also learn core concepts around Prompt Engineering and Fine-Tuning and programmatically implement them using Responsible AI principles. Lastly, we will conclude the talk explaining how they can leverage this powerful tech, different use-cases and what the future looks like.

Amit Sangani Amit Sangani Director of Partner Engineering at Meta
Generative AI | Intermediate

Graphs: The Next Frontier of GenAI Explainability

In a world obsessed with making predictions and generative AI, we often overlook the crucial task of making sense of these predictions and understanding results. If we have no understanding of how and why recommendations are made, if we can’t explain predictions – we can’t trust our resulting decisions and policies. In the realm of predictions, explainability, and causality, graphs have emerged as a powerful model that has recently yielded remarkable breakthroughs. Graphs are purposefully designed to capture and represent the intricate connections between entities, offering a comprehensive framework for understanding complex systems. Leading teams use this framework today to surface directional patterns, compute complex logic, and as a basis for causal inference. This talk will examine the implications of incorporating graphs into the realm of generative AI, exploring the potential for even greater advancements. Learn about foundational concepts such as directed acrylic graphs (DAGs), Jedeau Pearl’s “do” operator, and keeping domain expertise in the loop. You’ll hear how the explainability landscape is evolving, comparisons of graph-based models to other methods, and how we can evaluate the different fairness models available. We’ll look into the open source PyWhy project for causal inference and the DoWhy method for modeling a problem as a causal graph with industry examples. By identifying the assumptions and constraints up front as a graph and applying that through each phase of modeling mechanisms, identifying targets, estimating causal effects, and refuting these with each inference – we can improve the validity of our predictions. We’ll also explore other open source packages that use graphs for counterfactual approaches, such as GeCo and Omega. Join us as we unravel the transformative potential of graphs and their impact on predictive modeling, explainability, and causality in the era of generative AI.

Michelle Yi Michelle Yi Board Member at Women In Data
Amy Hodler Amy Hodler Founder, Consultant at
LLMs | Beginner-Intermediate

LLM Best Practises: Training, Fine-Tuning and Cutting Edge Tricks from Research

Large Language Models (LLMs) are still relatively new compared to ""Traditional ML"" techniques and have many new ideas as best practises that differ from training ML models.Fine-Tuning models can be really powerful to unlock use-cases based on your domain and AI Agents can be really powerful to unlock previously impossible ideas. In this workshop, you will learn the tips and tricks of creating and fine-tuning LLMs along with implementing cutting edge ideas of building these systems from the best research papers. We will start by learning the foundations behind what makes a LLM, quickly moving into fine-tuning our own GPT and finally implementing some of the cutting edge tricks of building these models. There is a lot of noise and signal in this domain right now, we will focus on understanding the ideas that have been tried and tested. The workshop will also cover case studies spanning ideas that have worked in practise we will dive deep into the art and science of working with LLMs.

Sanyam Bhutani Sanyam Bhutani Sr. Data Scientist and Kaggle Grandmaster
Machine Learning | Intermediate

Feature Stores in Practice: Build and Deploy a Model with Featureform, Redis, Databricks, and Sagemaker

The term ""Feature Store"" often conjures a simplistic idea of a storage place for features. However, in reality, they serve as robust frameworks and orchestrators for defining, managing, and deploying feature pipelines. The veneer of simplicity often masks the significant operational gains organizations can achieve by integrating the right feature store into their ML platform. This session is designed to peel back the layers of ambiguity surrounding feature stores, delineating the three distinct types and their alignment within a broader ML ecosystem. Diving into a hands-on section, we will walk through the process of training and deploying an end-to-end fraud detection model utilizing Featureform, Redis, Databricks, and Sagemaker. The emphasis will be on real-world, applicable examples, moving beyond concepts and marketing talk. This session aims to do more than just explain the mechanics of feature stores. It provides a practical blueprint to efficiently harness feature stores within ML workflows, effectively bridging the chasm between theoretical understanding and actionable implementation. Participants will walk away with a solid grasp of feature stores, equipped with the knowledge to drive meaningful insights and enhancements in their real-world ML platforms and projects.

Simba Khadder Simba Khadder Founder & CEO at Featureform
Generative AI | Intermediate

Stable Diffusion: Advancing the Text-to-Image Paradigm

This session will introduce attendees to Stable Diffusion, a new text-to-image generation model that is more stable and efficient than previous models. Stable Diffusion is able to generate high-quality images from text descriptions, and it is well-suited for a variety of applications, such as creative content generation, product design, and marketing. Learning Outcomes: By the end of this session, attendees will be able to: - Understand the basics of Stable Diffusion and how it works. - Know whole landscape of tools and libraries for Stable Diffusion domain. - Generate images from text descriptions using Stable Diffusion. - Apply Stable Diffusion to their own projects and workflows. - Understand the process of fine-tuing open source models to achieve tasks at hand. This session is relevant to practitioners in a variety of industries, including: Creative industries: Stable Diffusion can be used to generate images for marketing materials, product designs, and other creative projects. Technology industries: Stable Diffusion can be used to develop new applications for text-to-image generation, such as chatbots and virtual assistants. Research industries: Stable Diffusion can be used to conduct research on text-to-image generation and its applications.

Sandeep Singh Sandeep Singh Head of Applied AI/Computer Vision at
Machine Learning | All Levels

Better Features for Real-time Decisions Using Feature Engines

Transforming raw data into features to power machine learning models is one of the biggest challenges in production ML. Tecton CEO, Mike Del Balso, will explain how leading ML teams use feature platforms to develop, operate, and manage features for production ML. He'll walk through a sample use case, demonstrate how feature engines can make it easy to build & productionize powerful feature pipelines, and explain how an optimized feature engineering framework can enable: ☑️ Quick data processing, ☑️ Low-latency data serving, ☑️ Significantly reduced storage and computation costs, ☑️ Consistency between offline and online data for enhanced model accuracy. Following Mike's talk, Tecton Developer Advocate, Nick Acosta, will take attendees through a hands-on workshop of Tecton where they'll walk through the concepts and code that will help you build a modern technical architecture that simplifies the process of managing real-time ML models and features.

Nick Acosta Nick Acosta Developer Advocate at Tecton
Mike Del Balso Mike Del Balso Co-founder and CEO at Tecton
NLP | Beginner

Build Conversational AI and Integrate into Product Page Using Watsonx Assistant

IBM watsonx Assistant is an AI-powered virtual agent that provides customers with fast, consistent, and accurate answers across any messaging platform, application, device, or channel. Using AI and natural language processing, watsonx Assistant learns from customer conversations, improving its ability to resolve issues the first time while removing the frustration of long wait times, tedious searches, and unhelpful chatbots. This workshop provides an easy-to-follow guide on how to launch Watsonx Assistant, configure the settings for your first AI Assistant, and integrate AI Assistant into product page.

James Busche James Busche Senior Software Developer at IBM
Tommy Chaoping Li Tommy Chaoping Li Senior Software Developer at IBM
NLP | Intermediate-Advanced

Machine Learning using PySpark for Text Data Analysis

In this session, unsupervised Machine Learning algorithms like Cluster Analysis and recommendation System and supervised Machine Learning algorithms like Random Forest, Decision Tree, Bagging and Boosting will be discussed for doing analysis using PySpark. The main feature of this workshop will be the implementation of these algorithms using the Text Data. Considering the importance of reviews and text data available on social media platforms, the availability and importance of text data analysis has grown multifold. The session will be particularly helpful for startups and existing business who wanted to use AI for improving performance.

Bharti Motwani Bharti Motwani Clinical Associate Professor at University of Maryland, USA
Data Visualization & Data Analysis | Intermediate

Unlocking Insights in Home Values: A Multimillion-Row Journey with Polars

Join us for a hands-on data adventure exploring hidden insights and nuances across all home and building values in Massachusetts. With a dataset containing 2.5 million rows, this workshop will showcase the incredible capabilities of Polars, a data manipulation library which partners well with Pandas, in handling extensive data with a clean API, high performance and a low memory footprint, all on your local machine. Throughout the session, we'll demonstrate how Polars empowers users to perform nuanced analyses, such as pinpointing the most expensive homes in every town and on every street in Massachusetts, or unraveling the factors influencing home prices such as style, location, acreage, year built, square footage, etc. Whether you're a data enthusiast, analyst, or someone intrigued by the power of data analysis, this interactive workshop will leave you equipped to harness Polars' full potential for your own data exploration endeavors. Plus you’ll have a fun dataset of all home and building values (per tax assessment) at your fingertips. Time permitting, we’ll also do some GIS analysis on the dataset. Don't miss this opportunity to discover the stories hidden within the numbers and elevate your data analysis skills to new heights. Come prepared to write code in a jupyter notebook/jupyter lab, and leave with a working model and the full dataset.

Mike Dezube Mike Dezube Founder and CEO at Charles River Data
ML for Biotech and Pharma | Beginner

Introduction to Protein Language Models for Synthetic Biology

Protein Language Model are Transformer-like models that are trained on massive sets of protein sequences (represented as text) in an attempt to learn the biological 'grammar' of proteins.These models have a broad range of application, thanks to their generative and embedding abilities. In this workshop, we will get more familiar with this type of model, how they differ from their NLP counterparts and the tasks they can address. we will also get a short overview of the existing open-source models and datasets. During the hands-on session, we will start from a pre-trained language model and develop a basic example of protein function multi label classifier. We will then develop compare and benchmark different classification approaches, including a simple retrieval-augmented enhancement, and fine tuning.

Etienne Goffinet, PhD Etienne Goffinet, PhD Senior Researcher at Technology Innovation Institute
Machine Learning | Beginner

Introduction to Linear Regression using Spreadsheets with Real Estate Data

Over the course of this session, we'll embark on a deep dive into the foundational principles of linear regression, a statistical machine learning model that aids in unraveling the intricate relationships between two or more variables. Our unique focus centers on the practical application of linear regression using real-world real estate data, offering a concrete context that will undoubtedly resonate with participants. The workshop kicks off with a thorough overview of linear regression concepts, ensuring a collective understanding of the fundamentals. As we progress, we transition into the practical realm, employing popular spreadsheet tools like Excel or Google Sheets to conduct insightful real estate data analyses. Participants will master the art of data input, application of regression formulas, model building, and interpretation of results, enriching their analytical toolkit. The workshop's core revolves around a hands-on exploration of a real-world scenario. Together, we'll dissect a data set featuring crucial real estate variables such as property prices, square footage, number of bedrooms and bathrooms, and location. This pragmatic approach empowers participants to directly apply linear regression concepts to authentic situations commonly encountered in the dynamic field of real estate. Engagement is key throughout our workshop, featuring interactive exercises, group discussions, and dedicated Q&A sessions to reinforce comprehension. By the workshop's conclusion, participants will wield the skills to adeptly leverage the fundamental machine learning model of linear regression for making informed and predictive decisions in the realm of real estate. Whether you're a novice seeking an introduction to regression analysis or a seasoned analyst aiming to refine your skills, this workshop guarantees a stimulating and enlightening experience.

Roberto Reif Roberto Reif CEO and Founder at ScholarU

Deep Learning | Beginner-Intermediate

Deep Learning with PyTorch and TensorFlow

Obscure until recently, Deep Learning is ubiquitous today across data-driven applications as diverse as machine vision, natural language processing, generative A.I., and superhuman game-playing. This workshop is an introduction to Deep Learning that brings high-level theory to life with interactive examples featuring PyTorch, TensorFlow 2, and Keras — all three of the principal Python libraries for Deep Learning. Essential theory will be covered in a manner that provides students with a complete intuitive understanding of Deep Learning’s underlying foundations. Paired with hands-on code demos in Jupyter notebooks as well as strategic advice for overcoming common pitfalls, this foundational knowledge will empower individuals with no previous understanding of artificial neural networks to train Deep Learning models following all of the latest best-practices.

Dr. Jon Krohn Dr. Jon Krohn Chief Data Scientist at
LLMs | Intermediate-Advanced

Ben Needs a Friend - An intro to building Large Language Model applications

People say it’s difficult to make friends after college, impossible after grad school and just generally to give up after 30. Approaching 40 - I’ve decided to take matters into my own hands. Rather than go outside and meet people, I’ve decided, like many top-tier companies, to replace all that manual work with AI. In this tutorial, I’ll show you how to make your own AI friend, powered by Large Language Models (LLM). Along the way, we’ll cover some of the essential topics in LLM development. Our first step will be adjusting our new friend to our preferences based on prompt engineering and fine-tuning. Then, we will develop a “history” of our friendship using document embeddings and enable our friend to discuss that history (Retrieval-Augmented Generation). Finally, we will provide our friend with the tools it needs to be able to invite us to interesting local events. We’ll use the LangChain and transformers libraries to explore the pros and cons of different open and closed-source implementations in terms of cost and performance. The methods we’ll be using can be hosted locally and are either free or have minimal cost (e.g. OpenAI APIs). By the end of the tutorial, participants will have a basic familiarity with how to use the latest tools for LLM development and, for anything they’re not clear on, they can always ask their new AI friend for advice. We’ll conclude with a discussion of what our friend can and cannot do and why it may be better to just go outside more.

Benjamin Batorsky, PhD Benjamin Batorsky, PhD Data Science Consultant at d3lve
Generative AI | Intermediate-Advanced

Generative AI with Open-Source LLMs

Large Language Models like GPT-4 are transforming the world in general and the field of data science in particular at an unprecedented pace. This training introduces deep learning transformer architectures including LLMs. Critically, it also demonstrates the breadth of capabilities of state-of-the-art LLMs like GPT-4 can deliver, including for dramatically revolutionizing the development of machine learning models and commercially successful data-driven products, accelerating the creative capacities of data scientists and pushing them in the direction of being data product managers. Brought to life via hands-on code demos that leverage the Hugging Face and PyTorch Lightning Python libraries, this training covers the full lifecycle of LLM development, from training to production deployment.

Dr. Jon Krohn Dr. Jon Krohn Chief Data Scientist at
Machine Learning | Intermediate

Developing Credit Scoring Models for Banking and Beyond

Classification scorecards are a great way to predict outcomes because the techniques used in the banking industry specialize in interpretability, predictive power, and ease of deployment. The banking industry has long used credit scoring to determine credit risk—the likelihood a particular loan will be paid back. However, the main aspect of credit score modeling is the strategic binning of variables that make up a credit scorecard. This strategic and analytical binning of variables provides benefits to any modeling in any industry that needs interpretable models. These scorecards are a common way of displaying the patterns found in a machine learning classification model—typically a logistic regression model, but any classification model will benefit from a scorecard layer. However, to be useful the results of the scorecard must be easy to interpret. The main goal of a credit score and scorecard is to provide a clear and intuitive way of presenting classification model results. This training will help the audience work through how to build successful credit scoring models in both R and Python. It will also teach the audience to layer the interpretable scorecard on top of these models for ease of implementation, interpretation, and decision making. After this training, the audience will have the knowledge to be able to build more complete models that are ready to be deployed and used for better decisions by executives.

Aric LaBarr, PhD Aric LaBarr, PhD Associate Professor of Analytics at Institute for Advanced Analytics, NC State University
Generative AI | Intermediate-Advanced

Aligning Open-source LLMs Using Reinforcement Learning from Feedback

Unlock the full potential of open-source Large Language Models (LLMs) in our alignment workshop focused on using reinforcement learning (RL) to optimize LLM performance. With LLMs like ChatGPT and Llama-2 revolutionizing the field of AI, mastering the art of fine-tuning these models for optimal human interaction has become crucial. Throughout the session, we will focus on the core concepts of LLM fine-tuning, with a particular emphasis on reinforcement learning mechanisms. Engaging in hands-on exercises, attendees will gain practical experience in data preprocessing, quality assessment, and implementing reinforcement learning techniques for manual alignment. This skill set is especially valuable for achieving instruction-following capabilities and much more. The workshop will provide a comprehensive understanding of the challenges and intricacies involved in aligning LLMs. By learning to navigate through data preprocessing and quality assessment, participants will gain insights into identifying the most relevant data for fine-tuning LLMs effectively. Moreover, the practical application of reinforcement learning techniques will empower attendees to tailor LLMs for specific tasks, ensuring enhanced performance and precision in real-world applications. By the workshop's conclusion, attendees will be well-equipped to harness the power of open-source LLMs effectively, tailoring their models to meet the specific demands of their industries or domains. Don't miss out on this opportunity to learn how to create your very own instruction-aligned LLM and enhance your AI applications like never before!

Sinan Ozdemir Sinan Ozdemir AI & LLM Expert | Author | Founder + CTO at LoopGenius
Data engineering&Big Data | Beginner-Intermediate

Productionizing AI and LLM Apps with Ray Serve

Once we've designed an AI/ML application and selected or trained the models, that's really just the beginning. We need our AI-powered services to be resilient and efficient, scalable to demand and adaptable to heterogeneous environments (like using GPUs or TPUs as effectively as possible). Moreover, when we build applications around online inference, we often need to integrate different services: multiple models, data sources, business logic, and more. Ray Serve was built so that we can easily overcome all of those challenges. In this class we'll learn to use Ray Serve to compose online inference applications meeting all of these requirements and more. We'll build services that integrate with each other while autoscaling individually, even supporting individual hardware and software requirements -- all using regular Python and often with just one new line of code.

Kamil Kaczmarek Kamil Kaczmarek Technical Training Lead at Anyscale
Adam Breindel Adam Breindel Technical Instructor at Anyscale
Machine Learning | Beginner

Introduction to scikit-learn: Machine Learning in Python

Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We start this training by learning about scikit-learn's API for supervised machine learning. scikit-learn's API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface helps to abstract away the algorithm, thus allowing us to focus on our domain-specific problems. First, we learn the importance of splitting your data into train and test sets for model evaluation. Then, we explore the preprocessing techniques on numerical, categorical, and missing data. We see how different machine learning models are impacted by preprocessing. For example, linear and distance-based models require standardization, but tree-based models do not. We explore how to use the Pandas output API, which allows scikit-learn's transformers to output Pandas DataFrames! The Pandas output API enables us to connect the feature names with the state of a machine learning model. Next, we learn about the Pipeline, which connects transformers with a classifier or regressor to build a data flow where the output of one model is the input of another. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle numerical and categorical data with missing values. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.

Thomas J. Fan Thomas J. Fan Senior Machine Learning Engineer at
LLMs | Intermediate

LLMs Meet Google Cloud: A New Frontier in Big Data Analytics

Dive into the world of cloud computing and big data analytics with Google Cloud's advanced tools and big data capabilities. Designed for industry professionals eager to master cloud-based big data tools, this workshop offers hands-on experience with various big data analytics tools, such as Dataproc, BigQuery, Cloud Storage, and Compute Engine. We will also dive into the new LLM capabilities of Google Cloud. Explore how these innovative AI models can extract deeper insights, generate creative text, and automate large-scale tasks, taking your big data analysis to the next level. Ideal for those new to cloud computing, the workshop includes lab offerings that provide practical experience in utilizing Google Cloud services. Participants must bring a laptop to access content and partake in hands-on labs. Upon completion, attendees will have a comprehensive understanding of fundamental cloud computing concepts, key Google Cloud services, and widely used big data tools such as Spark. Agenda: - Getting set up with GCP - Overview of educational resources - Hands-on labs - Cloud Compute and storage - BigQuery for data access and pre-processing - BigQuery for ML and LLM capabilities - Dataproc for distributed computing with Apache Spark - Conclusion

Rohan Johar Rohan Johar Customer Engineer (AI)
Mohammad Soltanieh-ha, PhD Mohammad Soltanieh-ha, PhD Clinical Assistant Professor at Boston University
Data Visualization | Intermediate

Visualization in Bayesian Workflow Using Python or R

Visualization can be a powerful tool to help you build better statistical models. In this tutorial, you will learn how to create and interpret visualizations that are useful in each step of a Bayesian regression workflow. A Bayesian workflow includes the three steps of (1) model building, (2) model interpretation, and (3) model checking/improvement, along with model comparison. Visualization is helpful in each of these steps – generating graphical representations of the model and plotting prior distributions aid model building, visualizing MCMC diagnostics and plotting posterior distributions aid interpretation, and plotting posterior predictive, counterfactual, and model comparisons aid model checking/improvement.

Clinton Brownley, PhD Clinton Brownley, PhD Lead Data Scientist at Tala
Machine Learning | Beginner

Introduction to Math for Data Science

Session abstract coming soon!

Thomas Nield Thomas Nield Instructor at University of Southern California | Founder at Nield Consulting Group and Yawman Flight
Generative AI | All Levels

Generative AI, AI Agents, and AGI - How New Advancements in AI Will Improve the Products We Build

This session is tailored for professionals seeking to master the fundamentals of generative AI. Our training covers a comprehensive range of topics, from the basics of text generation using advanced language models to the intricacies of image and 3-D object generation. Attendees will gain hands-on experience with cutting-edge tools, empowering them to become ten times more productive in their roles. A key component of our training is the exploration of autonomous agents. Participants will learn not only how these agents perform various tasks autonomously but also how to build one from the ground up. This segment paves the way to understanding the trajectory towards artificial general intelligence (AGI), a frontier in AI research. This session does not require prior experience in AI, making it accessible to a broad audience. However, it promises maximum knowledge gain, equipping attendees with practical skills and theoretical knowledge. By the end of the session, participants will be able to apply these insights directly to their roles, enhancing their contribution to the AI domain and their respective industries. It will be a comprehensive learning experience, ensuring attendees leave with a profound understanding of generative AI and its applications.

Martin Musiol Martin Musiol Co-Founder and Instructor at Generative | Principal Data Science Manager at Infosys Consulting
Machine Learning | Beginner

Programming with Python

The Python language is one of the most popular programming languages in data science and machine learning as it offers a number of powerful and accessible libraries and frameworks specifically designed for these fields. This programming course is designed to give participants a quick introduction to the basics of coding using the Python language. It covers topics such as data structures, control structures, functions, modules, and file handling. This course aims to provide a basic foundation in Python and help participants develop the skills needed to progress in the field of data science and machine learning.

ODSC Instructor ODSC Instructor
Machine Learning | Beginner

Introduction to AI

This AI literacy course is designed to introduce participants to the basics of artificial intelligence (AI) and machine learning. We will first explore the various types of AI and then progress to understand fundamental concepts such as algorithms, features, and models. We will study the machine learning workflow and how it is used to design, build, and deploy models that can learn from data to make predictions. This will cover model training and types of machine learning including supervised, and unsupervised learning, as well as some of the most common models such as regression and k-means clustering. Upon completion, individuals will have a foundational understanding of machine learning and its capabilities and be well-positioned to take advantage of introductory-level hands-on training in machine learning and data science such as ODSC East’s Mini-Bootcamp.

ODSC Instructor ODSC Instructor
Machine Learning | Beginner

Data Wrangling with Python

Data wrangling is the cornerstone of any data-driven project, and Python stands as one of the most powerful tools in this domain. In preparation for the ODSC conference, our specially designed course on “Data Wrangling with Python” offers attendees a hands-on experience to master the essential techniques. From cleaning and transforming raw data to making it ready for analysis, this course will equip you with the skills needed to handle real-world data challenges. As part of a comprehensive series leading up to the conference, this course not only lays the foundation for more advanced AI topics but also aligns with the industry’s most popular coding language . Upon completion of this short course attendees will be fully equipped with the knowledge and skills to manage the data lifecycle and turn raw data into actionable insights, setting the stage for advanced data analysis and AI applications.

ODSC Instructor ODSC Instructor
Machine Learning | Beginner

Introduction to Machine Learning

In an introductory machine learning live training, key topics include defining machine learning, distinguishing supervised and unsupervised learning, covering basic concepts, exploring common algorithms (e.g., decision trees, neural networks)

ODSC Instructor ODSC Instructor
LLMs | Prompt Engineering | Beginner - Intermediate

Introduction to Large Langue Models and Prompt Engineering

In the rapidly evolving field of AI, the “LLMs, Prompt Engineering, and Generative AI” course stands as a cutting-edge offering, designed to equip learners with the latest advancements in Large Language Models (LLMs), prompt engineering, and generative AI techniques.

Generative AI | Beginner

Data and Generative AI Literacy

Data is the essential building block of Data Science, Machine Learning, and AI. This course is the first in the series and is designed to teach you the foundational skills and knowledge required to understand, work with, and analyze data.

ODSC Instructor ODSC Instructor
Machine Learning | Beginner

Data Wrangling with SQL

This SQL coding course teaches students the basics of Structured Query Language, which is a standard programming language used for managing and manipulating data and an essential tool in AI. The course covers topics such as database design and normalization, data wrangling, aggregate functions, subqueries, and join operations and students will learn how to design and write SQL code to solve real-world problems.

ODSC Instructor ODSC Instructor
LLMs | Beginner

Introduction to Large Language Models

This hands-on course serves as a comprehensive introduction to Large Language Models (LLMs), covering a spectrum of topics from their differentiation from other language models to their underlying architecture and practical applications. It delves into the technical aspects, such as the transformer architecture and the attention mechanism, which are the cornerstones of modern language models. The course also explores the applications of LLMs, focusing on zero-shot learning, few-shot learning, and fine-tuning, which showcase the models' ability to adapt and perform tasks with limited to no examples. Furthermore, it introduces the concept of flow chaining as a method to generate coherent and extended text, demonstrating its usefulness in tackling token limitations in real-world scenarios such as Q&A bots. Through practical examples and code snippets, participants are given a hands-on experience on how to utilize and harness the power of LLMs across various domains.

ODSC Instructor ODSC Instructor
Generative AI | Beginner

Prompt Engineering Fundamentals

This workshop on Prompt Engineering explores the pivotal role of prompts in guiding Large Language Models (LLMs) like ChatGPT to generate desired responses. It emphasizes how prompts provide context, control output style and tone, aid in precise information retrieval, offer task-specific guidance, and ensure ethical AI usage. Through practical examples, participants learn how varying prompts can yield diverse responses, highlighting the importance of well-crafted prompts in achieving relevant and accurate text generation. Additionally, the workshop introduces temperature control to balance creativity and coherence in model outputs, and showcases LangChain, a Python library, to simplify prompt construction. Participants are equipped with practical tools and techniques to harness the potential of prompt engineering effectively, enhancing their interaction with LLMs across various contexts and tasks.

ODSC Instructor ODSC Instructor
Generative AI | Beginner

Prompt Engineering with OpenAI

This workshop on prompt engineering with OpenAI discussed best practices for utilizing OpenAI models. We will review how to separate instructions and context using special characters to help improve instruction clarity, context isolation, and enhances control over the generation process.  The workshop also included code for installing the langchain library and demonstrated how to create prompts effectively, emphasizing the importance of clarity, specificity, and precision in prompts. Additionally, the workshop showed how to craft prompts for specific tasks, such as extracting entities from text. It provided templates for prompts and highlighted the significance of specifying the desired output format through examples for improved consistency and customization. Lastly, the workshop addressed the importance of using prompts as safety guardrails. It introduced prompts to mitigate hallucination and jailbreaking risks by instructing the model to generate well-supported and verifiable information, thereby promoting responsible and ethical use of language models.

ODSC Instructor ODSC Instructor
Generative AI | Beginner

Build a Question & Answering Bot

The workshop notebook delves into building a Question and Answering Bot based on a fixed knowledge base, covering the integration of concepts discussed in earlier notebooks about LLMs (Large Language Models) and prompting. Initially, it introduces a high-level architecture focusing on vector search—a method to retrieve similar items based on vector representations. The notebook explains the steps involved in vector search including vector representation, indexing, querying, similarity measurement, and retrieval, detailing various technologies used for vector search such as vector libraries, vector databases, and vector plugins. The example utilizes an Open Source vector database, Chroma, to index data and uses state-of-the-union text data for the exercise. The notebook then transitions into the practical implementation, illustrating how text data is loaded, chunked into smaller pieces for effective vector search, and mapped into numeric vectors using the MPNetModel from the SentenceTransformer library via HuggingFace. Following this, the focus shifts to text generation where Langchain Chains are introduced. Chains, as described, allow for more complex applications by chaining several steps and models together into pipelines. A RetrievalQA chain is used to build a Q&A Bot application which utilizes an OpenAI chat model for text generation.  

ODSC Instructor ODSC Instructor
LLMs | Beginner

Fine Tuning Embedding Models

This workshop explores the importance of fine-tuning Language and Embedding Models (LLMs). It highlights how embedding models are used to map natural language to vectors, crucial for pipelines with multiple models to adapt to specific data nuances. An example demonstrates fine-tuning an embedding model for legal text. The notebook discusses existing solutions and hardware considerations, emphasizing GPU usage for large data. The practical part of the notebook shows the fine-tuning process of the "distilroberta-base" model from the SentenceTransformer library. It utilizes the QQP_triplets dataset from Quora for training, designed around semantic meaning. The notebook prepares the data, sets up a DataLoader, and employs Triplet Loss to encourage the model to map similar data points closely while distancing dissimilar ones. It concludes by mentioning the training duration and resources needed for further improvements.

ODSC Instructor ODSC Instructor
LLMs | Beginner

Fine Tuning an Existing LLM

The workshop explores the process of fine-tuning Large Language Models (LLMs) for Natural Language Processing (NLP) tasks. It highlights the motivations for fine-tuning, such as task adaptation, transfer learning, and handling low-data scenarios, using a Yelp Review dataset. The notebook employs the HuggingFace Transformers library, including tokenization with AutoTokenizer, data subset selection, and model choice (BERT-based model). Hyperparameter tuning, evaluation strategy, and metrics are introduced. It also briefly mentions DeepSpeed for optimization and Parameter Efficient Fine-Tuning (PEFT) for resource-efficient fine-tuning, providing a comprehensive introduction to fine-tuning LLMs for NLP tasks.

ODSC Instructor ODSC Instructor
LLMs | Beginner

LangChain Agents

The "LangChain Agents" workshop delves into the "Agents" component of the LangChain library, offering a deeper understanding of how LangChain integrates Large Language Models (LLMs) with external systems and tools to execute actions. This workshop builds on the concept of "chains," which can link multiple LLMs to tackle various tasks like classification, text generation, code generation, and more. "Agents" enable LLMs to interact with external systems and tools, making informed decisions based on available options. The workshop explores the different types of agents, such as "Zero-shot ReAct," "Structured input ReAct," "OpenAI Functions," "Conversational," "Self ask with search," "ReAct document store," and "Plan-and-execute agents." It provides practical code examples, including initializing LLMs, defining tools, creating agents, and demonstrates how these agents can answer questions using external APIs, offering participants a comprehensive overview of LangChain's agent capabilities.

ODSC Instructor ODSC Instructor
LLMs | Beginner

Parameter Efficient Fine tuning

For the next workshop, our focus will be on parameter-efficient fine-tuning (PEFT) techniques in the field of machine learning, specifically within the context of large neural language models like GPT or BERT. PEFT is a powerful approach that allows us to adapt these pre-trained models to specific tasks while minimizing additional parameter overhead. Instead of fine-tuning the entire massive model, PEFT introduces compact, task-specific parameters known as "adapters" into the pre-trained model's architecture. These adapters enable the model to adapt to new tasks without significantly increasing its size. PEFT strikes a balance between model size and adaptability, making it a crucial technique for real-world applications where computational and memory resources are limited, while still maintaining competitive performance. In this workshop, we will delve into the different PEFT methods, such as additive, selective, re-parameterization, adapter-based, and soft prompt-based approaches, exploring their characteristics, benefits, and practical applications. We will also demonstrate how to implement PEFT using the Hugging Face PEFT library, showcasing its effectiveness in adapting large pre-trained language models to specific tasks. Join us to discover how PEFT can make state-of-the-art language models more accessible and practical for a wide range of natural language processing tasks."

ODSC Instructor ODSC Instructor
Generative AI | Beginner

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a powerful natural language processing (NLP) architecture introduced in this workshop notebook. RAG combines retrieval and generation models, enhancing language understanding and generation tasks. It consists of a retrieval component, which efficiently searches vast text databases for relevant information, and a generation component, often based on Transformer models, capable of producing coherent responses based on retrieved context. RAG's versatility extends to various NLP applications, including question answering and text summarization. Additionally, this notebook covers practical aspects such as indexing content, configuring RAG chains, and incorporating prompt engineering, offering a comprehensive introduction to harnessing RAG's capabilities for NLP tasks.

ODSC Instructor ODSC Instructor
NLP | Beginner

Introduction to NLP

Welcome to the Introduction to NLP workshop! In this workshop, you will learn the fundamentals of Natural Language Processing. From tokenization and stop word removal to advanced topics like deep learning and large language models, you will explore techniques for text preprocessing, word embeddings, classic machine learning, and cutting-edge NLP methods. Get ready to dive into the exciting world of NLP and its applications!

ODSC Instructor ODSC Instructor
All Tracks | Beginner

Introduction to R

Dive into the world of R programming in this interactive workshop, designed to hone your data analysis and visualization skills. Begin with a walkthrough of the Colab interface, understanding cell manipulation and library utilization. Explore core R data structures like vectors, lists, and data frames, and learn data wrangling techniques to manipulate and analyze datasets. Grasp the basics of programming with iterations and function applications, transitioning into Exploratory Data Analysis (EDA) to derive insights from your data. Discover data visualization using ggplot2, unveiling the stories hidden within data. Lastly, get acquainted with RStudio, the robust Integrated Development Environment, enhancing your R programming journey. This workshop is your gateway to mastering R, catering to both novices and seasoned programmers.

ODSC Instructor ODSC Instructor
Data Visualization | Beginner-Intermediate

Create Compelling Data Visualizations and Business Dashboards with D3

This session will introduce attendees to the basic ideas behind the popular data visualization library D3, and walk them through a few exercises to create data visualizations and a simple interactive dashboard. While it is an introduction to D3, it does assume basic familiarity with JavaScript and programming, basic knowledge of HTML, as well as experience with working on the command line. We will cover the following topics - Overview of D3’s architecture and basic concepts (i.e., the visual join) - Introduction to a few advanced JavaScript concepts that it relies on (anonymous/arrow functions, accessor functions, etc.) - How to access data (loading files, querying APIs) and how data is commonly represented for use in D3 - Creating simple charts from data (bar charts, scatterplots), as well as a few more unusual ones (likely including lollipop charts, etc.) - Re-using existing D3 examples by replacing the data with your own - Basic interaction in D3: tooltips, filtering - Putting charts together into a dashboard (layout, using functions to generate charts) The goal of this session is to give attendees a good understanding of how D3 works on a fundamental level. They will be able to build on that foundation to learn more and create their own visualizations and dashboards.

Robert Kosara Robert Kosara Data Visualization Developer at Observable

More Sessions Coming Soon!

Participate at ODSC East 2024

More Info
Submit a Session

As part of the global data science community we value inclusivity, diversity,  and fairness in the pursuit of knowledge and learning. We seek to deliver a conference agenda, speaker program, and attendee participation that moves the global data science community forward with these shared goals. Learn more on our code of conductspeaker submissions, or speaker committee pages.

ODSC Newsletter

Stay current with the latest news and updates in open source data science. In addition, we’ll inform you about our many upcoming Virtual and in person events in Boston, NYC, Sao Paulo, San Francisco, and London. And keep a lookout for special discount codes, only available to our newsletter subscribers!

Open Data Science




Open Data Science
One Broadway
Cambridge, MA 02142

Privacy Settings
We use cookies to enhance your experience while using our website. If you are using our Services via a browser you can restrict, block or remove cookies through your web browser settings. We also use content and scripts from third parties that may use tracking technologies. You can selectively provide your consent below to allow such third party embeds. For complete information about the cookies we use, data we collect and how we process them, please check our Privacy Policy
Consent to display content from - Youtube
Consent to display content from - Vimeo
Google Maps
Consent to display content from - Google