
Abstract: In modern machine learning, the importance of high-quality, accurately labeled data cannot be overstated. This tutorial shows how to create high-quality annotated datasets for training machine learning (ML) models. We will use Label Studio, an open-source, multi-type data labeling tool, to explore common methods for annotating raw datasets, including both human and automated labeling techniques.
This hands-on session is aimed at data scientists, ML engineers, and AI enthusiasts who want to deepen their understanding of data labeling tools and practices. The insights shared will equip attendees to manage their annotation projects effectively, improve the accuracy of their ML models, and prepare for emerging trends in data annotation.
We will start with a high-level overview of organizing your labeling project, including tips on preprocessing data. We will then introduce building an annotation interface in Label Studio and demonstrate the annotation workflow, including how to incorporate multiple annotators to improve labeling accuracy. Moving on to more advanced topics, we will demonstrate how you can incorporate machine learning into the labeling process through several approaches, including automated pre-labeling, interactive labeling, and active learning. Finally, we will explore how to use labeling platforms to aid in generative AI workflows, including preparing data for targeted retraining and evaluating the effectiveness of LLM prompts.
Through this presentation, we aim to make the phrase “Garbage in, Garbage out” a thing of the past in machine learning by illuminating the path from raw data to refined ML outputs.
Prerequisites for this talk: a laptop with at least 16 GB of memory and Docker Desktop installed. Alternatively, participants may use Hugging Face Spaces to host their Label Studio and machine learning environments.
Bio: Chris Hoge is the Head of Community for HumanSignal, where he is helping to grow the Label Studio community. He has spent over a decade working in open source machine learning and infrastructure communities, including Apache TVM, Kubernetes, and OpenStack. He has an M.S. in Applied Mathematics from the University of Colorado, with an emphasis on using high-performance numerical methods for simulating physical systems. He makes his home in the Pacific Northwest, where he spends his free time trail running and playing piano.