
Abstract: Most data we encounter are “unstructured,” which means they need additional processing before they can be used in decision-making. Often these data are text, in the form of comment fields, notes and descriptions. The field of Natural Language Processing (NLP) offers a wide array of methods for turning these data into valuable insights. IBM’s 2021 Global AI Adoption Index showed NLP is at the forefront of AI adoption, with one in four businesses reporting plans to adopt this type of technology within a year. This is being enabled by a wide array of open-source NLP libraries such as spaCy and HuggingFace’s Transformers.
In this workshop we will explore some popular NLP techniques that have broad applicability. From the basics of bag-of-words and word vectors to the creation of contextualized representations of words and sentences, the workshop will equip participants with the tools they need to turn raw text data into useful insights.
The workshop will focus on a classification use-case (sentiment analysis) and progress through applications of increasing complexity and requirements. The intention is to show the utility of each method on a reasonably well-behaved sample dataset. We will start by using scikit-learn for feature extraction and eventually move on to using spaCy for complete NLP pipelines. We will compare the performance of the different approaches on the use case. We will conclude the workshop by discussing considerations when building complete NLP products.
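As a rough illustration of the starting point described above, the sketch below uses scikit-learn to turn text into bag-of-words counts and fit a simple sentiment classifier. It is not taken from the workshop materials: the toy reviews, the labels, and the choice of logistic regression are all assumptions made for the example.

```python
# Minimal sketch (illustrative assumptions throughout, not the workshop's code):
# bag-of-words features from scikit-learn feeding a linear sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy dataset: 1 = positive sentiment, 0 = negative sentiment.
texts = [
    "great movie loved it",
    "wonderful acting and a great story",
    "loved the wonderful soundtrack",
    "terrible movie hated it",
    "boring plot and awful acting",
    "hated the awful ending",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer builds a vocabulary and counts word occurrences per document;
# LogisticRegression then learns one weight per vocabulary word.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify unseen text made of words the model saw during training.
predictions = model.predict(["great wonderful loved", "awful terrible hated"])
print(list(predictions))
```

A bag-of-words model ignores word order and context, which is one reason to move on to word vectors and contextualized representations for harder cases.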
The intended audience will have intermediate-level knowledge of Python and an interest in NLP technology. Participants will gain an understanding of how to use these techniques and the benefits (and risks!) of each.
Bio: Ben is a Senior Data Scientist at the Institute for Experiential AI at Northeastern University. He obtained his Master of Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for government, academia and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has presented his work at conferences, published articles, taught courses in data science and NLP, and is a co-organizer of the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.

Benjamin Batorsky, PhD
Senior Data Scientist | Institute for Experiential AI at Northeastern University
