Data Annotation at Scale: Active and Semi-Supervised Learning in Python

Abstract: 

Many companies generate and store vast amounts of unlabelled data every day. Outside of certain unsupervised applications, data must be accompanied by informative labels for its potential to be maximised. However, data annotation efforts are constrained by the human factor and come with a trade-off: internal annotators (i.e. employees) possess crucial context but do not scale, while external annotators (e.g. crowdsourced marketplaces such as MTurk) scale only at the expense of domain-specific context.

In this half-day training session, we will explore how Active (human-in-the-loop) and Semi-Supervised (ML/AI-assisted) Learning frameworks can be combined to develop in-house solutions for executing rapid data labelling projects. We will consider various sampling strategies, query methods, measures of informativeness, and types of learners. By the end of the session, you will be equipped with a multitude of tools that you can utilise to scale up your data annotation efforts without losing all-important context.

Session Outline
Recap: Supervised Learning
We will start with a brief recap of the supervised learning paradigm. More specifically, we will quickly touch upon the train-test split procedure, cross-validation, in-sample vs. out-of-sample forecasts, accuracy vs. precision, and the bias-variance trade-off.

First Component: Active Learning
Active Learning expedites an estimator's learning by identifying the instances it predicts least confidently and querying their labels from a human annotator. In this module, we will explore how the human-in-the-loop can help us scale up the data annotation process.
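
As a rough illustration of the idea, the sketch below ranks a pool of unlabelled instances by the least-confidence measure (one minus the highest predicted class probability) and returns the ones to send to an annotator. The function name, the toy probabilities, and the query budget are assumptions made for this example, not material from the session; in practice the probabilities would typically come from an estimator's `predict_proba` output.

```python
def least_confident_indices(probabilities, n_queries):
    """Return the indices of the n_queries instances with the lowest top-class probability.

    `probabilities` is a list of per-instance class probability lists
    (hypothetical input for this sketch).
    """
    # A low maximum probability means the estimator is unsure about the instance.
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:n_queries]


# Toy predicted probabilities for four unlabelled instances (made-up values).
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.51, 0.49],
]

# Ask the human annotator about the two least confident predictions.
query_indices = least_confident_indices(pool_probabilities, n_queries=2)
print(query_indices)
```

Other measures of informativeness, such as margin or entropy, would slot into the same place as the sort key.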

Second Component: Semi-Supervised Learning
Semi-Supervised Learning attacks the problem of data annotation from the opposite angle: rather than sending the least confident predictions to a human, it treats the estimator's most confident predictions as labels. In this module, we will explore the underpinnings of so-called ML/AI-assisted data annotation and how we can leverage these predictions to label data at scale.
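
A minimal sketch of this idea, often called pseudo-labelling, is shown below: predictions whose top class probability clears a threshold are accepted as labels, and everything else is left unlabelled. The function name, class names, probabilities, and the 0.9 threshold are all illustrative assumptions rather than values from the session.

```python
def pseudo_label(probabilities, class_names, threshold=0.9):
    """Return {index: label} for instances whose top-class probability is >= threshold.

    All names and the default threshold are hypothetical choices for this sketch.
    """
    accepted = {}
    for index, row in enumerate(probabilities):
        # Position of the most probable class for this instance.
        top = max(range(len(row)), key=lambda k: row[k])
        if row[top] >= threshold:
            accepted[index] = class_names[top]
    return accepted


# Toy predicted probabilities for four unlabelled instances (made-up values).
pool_probabilities = [
    [0.97, 0.03],
    [0.60, 0.40],
    [0.08, 0.92],
    [0.50, 0.50],
]
class_names = ["spam", "ham"]

machine_labels = pseudo_label(pool_probabilities, class_names, threshold=0.9)
print(machine_labels)
```

The threshold controls the trade-off: a lower value labels more data automatically but admits more wrong labels into the training set.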

Putting Everything Together: A Complete Data Annotation Pipeline
Finally, we will walk through an interactive Jupyter notebook demonstrating how the two aforementioned frameworks can be combined to create bespoke data labelling jobs. We will explore a multitude of scenarios in which we utilise the individual components in various configurations and assess their pros and cons.
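
One possible configuration, sketched here under stated assumptions and not taken from the session notebook, routes each unlabelled instance in a single round: confident predictions are labelled automatically, the least confident of the rest go to a human up to a fixed budget, and the remainder waits for the next round. The function name, budget, and threshold are hypothetical.

```python
def route_pool(probabilities, query_budget, auto_threshold):
    """Split pool indices into automatic labels, human queries, and deferred instances.

    A hypothetical single-round routing rule for illustration only.
    """
    confidence = [max(row) for row in probabilities]

    # Confident predictions are accepted as machine labels.
    auto = [i for i, c in enumerate(confidence) if c >= auto_threshold]

    # Everything else is ranked from least to most confident.
    remaining = [i for i, c in enumerate(confidence) if c < auto_threshold]
    remaining.sort(key=lambda i: confidence[i])

    # The annotator sees only as many instances as the budget allows.
    human = remaining[:query_budget]
    deferred = sorted(remaining[query_budget:])
    return {"auto": auto, "human": human, "deferred": deferred}


# Toy predicted probabilities for five unlabelled instances (made-up values).
pool_probabilities = [
    [0.98, 0.02],
    [0.52, 0.48],
    [0.75, 0.25],
    [0.05, 0.95],
    [0.60, 0.40],
]

routing = route_pool(pool_probabilities, query_budget=2, auto_threshold=0.9)
print(routing)
```

After each round the estimator would be refitted on the enlarged labelled set and the deferred instances scored again.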

Background Knowledge

Participants would benefit from the following:
- Familiarity with Python and Jupyter notebooks (R users should be able to follow the material; however, equivalent R code will not be provided)
- Specifically for Python, prior working experience with the numpy, pandas, scikit-learn, and modAL libraries
- General grasp of supervised learning: train-test split, cross-validation, out-of-sample prediction, bias-variance trade-off

Bio: 

Gokhan is a senior data scientist at Attest. He is also a member of the Quanteda Initiative, and a guest lecturer for the LSE summer school course Introduction to Data Science and Machine Learning. As a computational social scientist, his core expertise lies in latent variable analysis, predictive modelling, and causal inference. Prior to industry, he was a postdoctoral researcher in analytic software development at the London School of Economics, where he received his PhD. Previously, he held research positions at UCL and Uppsala University, primarily developing machine learning pipelines and working on large-scale NLP problems.