Understanding unstructured data with language models

Abstract: As data scientists, we've seen rapid improvement over the last few decades in the tools available for working with structured data (be it tabular data, graph data or sensor data). Yet the vast majority of our data (Merrill Lynch puts the figure at roughly 90%) is *unstructured*, and lives in the form of documents, emails, reviews, reports, chat logs and the like.

Many of us are far less familiar with how to analyse and understand this trove of unstructured data. This talk focuses on language models, one of the most fundamental tools for working with unstructured data. Language models are all around us (although we're probably unaware of them), underpinning everything from Word's spellchecker to home assistants like Alexa.

While plenty of "out of the box" language modelling libraries exist, the first part of the talk focuses on getting a thorough understanding of what a language model is, and how it works. We'll touch on key ideas from statistics and information theory, and see how Alan Turing, in developing techniques to break Nazi codes at Bletchley Park, created the smoothing techniques which remain widely used in language models today. We'll then proceed to the present day, looking at how techniques like word vectors and transfer learning have yielded an improved generation of tools.
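As a rough illustration of the smoothing idea mentioned above, the sketch below computes the simplest Good-Turing quantity: the share of probability to reserve for words never seen in the training text, estimated as N1 / N (the number of words seen exactly once, divided by the total number of tokens). The function name and the toy sentence are assumptions made for this example, not material from the talk.

```python
from collections import Counter


def good_turing_unseen_mass(tokens):
    """Estimate the total probability of words that never appear in `tokens`.

    Good-Turing estimates this as N1 / N, where N1 is the number of
    distinct words seen exactly once and N is the total token count.
    """
    counts = Counter(tokens)
    n_total = sum(counts.values())
    n_once = sum(1 for c in counts.values() if c == 1)
    return n_once / n_total if n_total else 0.0


# Toy corpus (hypothetical): 11 tokens, 6 of which occur exactly once.
tokens = "the cat sat on the mat and the dog sat too".split()
unseen = good_turing_unseen_mass(tokens)
print(f"Estimated probability of an unseen word: {unseen:.3f}")
```

A model without smoothing would assign zero probability to any unseen word; reserving some mass in this way is what lets a language model cope with text it was not trained on.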

In the second half of the talk, we'll look at how we can practically use language models to understand unstructured data. Specifically we'll explore:

- Classification - the canonical application of language models, which can help us identify spam, analyse sentiment or perform unsupervised clustering. We'll look at a famous case where language models were able to successfully identify a Shakespeare forgery.
- Predictive modelling - if I were to look at your Tweets (and nothing else), could I guess your gender? It turns out state-of-the-art techniques can predict it with an 80%+ success rate. We'll look at how language models can enrich your datasets with additional demographic or contextual data.
- Information retrieval - finally, we'll see how language models have been used extensively (for example, in the legal sector) to extract targeted insights from enormous datasets.
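
To make the classification use case above concrete, here is a minimal sketch under stated assumptions: one unigram language model per class, add-one (Laplace) smoothing rather than the Good-Turing approach, equal class priors, and a tiny invented training set. The names `UnigramModel` and `classify` are hypothetical and chosen for this example only.

```python
import math
from collections import Counter


class UnigramModel:
    """Word-frequency language model for one class of documents."""

    def __init__(self, documents):
        self.counts = Counter(
            word for doc in documents for word in doc.lower().split()
        )
        self.total = sum(self.counts.values())

    def log_prob(self, text, vocab_size):
        # Add-one smoothing keeps unseen words from having zero probability.
        return sum(
            math.log((self.counts[word] + 1) / (self.total + vocab_size))
            for word in text.lower().split()
        )


def classify(text, models):
    """Return the label whose model gives `text` the highest likelihood."""
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    vocab_size = len(vocab) + 1  # one extra slot shared by all unseen words
    return max(
        models, key=lambda label: models[label].log_prob(text, vocab_size)
    )


# Invented toy training data, for illustration only.
models = {
    "spam": UnigramModel(["win cash now", "free prize win now"]),
    "ham": UnigramModel(["see you at lunch", "meeting notes attached"]),
}
label = classify("win a free prize", models)
print(label)
```

The same pattern extends to any number of classes: train one model per class and pick the class whose model finds the new text least surprising.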

Bio: Alex Peattie is the co-founder and CTO of Peg, a technology platform helping multinational brands and agencies to find and work with top YouTubers. Peg is used by over 2000 organisations worldwide including Coca-Cola, L'Oreal and Google.

An experienced digital entrepreneur, Alex spent six years as a developer and consultant for the likes of Grubwithus, Huckberry, UNICEF and Nike, before joining coding bootcamp Makers Academy as senior coach, where he trained hundreds of junior developers. Alex was also a technical judge at the 2017 TechCrunch Disrupt conference.

Open Data Science Conference