Guiding AI To Generate The Labels We Do Not Have With Active Learning

Abstract: We are in the age of data. In recent years, many companies have started collecting large amounts of data about their business, and many others are starting now.

However, before you can train any decent supervised model, you need ground truth data. Supervised ML models are usually trained on historical data records that already carry labels. And this is the ugly truth: before proceeding with any model training, any classification problem definition, or any further enthusiasm in gathering data, you need a sufficiently large set of correctly labeled data records to describe your problem. And data labeling - especially in a sufficiently large amount - is … expensive.

Expensive, that is, unless you did some research and came across a concept called “active learning”, a special case of machine learning that might help solve your label scarcity problem.

In this presentation we will explain the main parts of an active learning procedure and show a blueprint web application, based on active learning and uncertainty sampling, for interactively labeling any document set while investing only a fraction of the time that fully manual labeling would require. The idea of active learning is that we train a machine learning model well enough to delegate to it the boring and expensive task of data labeling.
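
To make the uncertainty sampling step concrete, here is a minimal sketch in Python. It is not the code of the web application shown in the talk: the function names, the use of entropy as the uncertainty measure, and the example probabilities are all assumptions chosen for illustration. Given the class probabilities a model predicts for each unlabeled document, it picks the documents the model is least sure about, which are the ones worth sending to a human labeler.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predicted_probs, batch_size):
    """Return the indices of the documents with the highest prediction entropy."""
    ranked = sorted(
        range(len(predicted_probs)),
        key=lambda i: prediction_entropy(predicted_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical class probabilities for four unlabeled documents (two classes).
predicted_probs = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.80, 0.20],
    [0.50, 0.50],  # model is maximally unsure
]

# Indices of the two documents to hand to the human labeler next.
to_label = select_most_uncertain(predicted_probs, batch_size=2)
print(to_label)
```

In a full active learning loop, the selected documents would be labeled by a person, added to the training set, and the model retrained before the next selection round.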

Bio: Paolo Tamagnini currently works as a data scientist at KNIME.
Paolo holds a master's degree in data science and has research experience in data visualization techniques for machine learning interpretability.