Abstract: Research suggests that 80-90% of the data within any particular organization is unstructured, and much of this data is text. To make use of this wealth of text data, organizations have been turning to Natural Language Processing (NLP) techniques. IBM’s 2021 Global AI Adoption Index showed NLP at the forefront of AI adoption, with one in four businesses reporting adoption of this type of technology within a year. This is being enabled by a wide array of open-source NLP libraries such as spaCy and Hugging Face’s Transformers.
In this workshop we will explore some of these popular NLP techniques that have broad applicability. From the basics of bag-of-words models and word vectors to the creation of contextualized representations of words and sentences, the workshop will equip participants with the tools they need to turn messy text data into useful insights.
The focus of the workshop will be building NLP approaches of increasing complexity. Each step in the progression will build on the previous ones, and the approaches will be evaluated against one another. There will be three main steps in this progression:
1) Creating informative feature sets based on document and dataset-level statistics (word frequencies and weighted word frequencies)
2) Concentrating this information along particularly informative dimensions using learned weights (topic models)
3) Leveraging models trained on general language data to bring contextual information into our representations (word embeddings, transformer models)
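To make step 1 concrete, the word-frequency and weighted word-frequency features described above can be sketched in a few lines of pure Python. The toy documents below are invented for illustration; the weighting shown is a basic TF-IDF scheme, one common way to down-weight words that appear in every document:

```python
import math
from collections import Counter

# Toy documents, invented for illustration
docs = [
    "the movie was great great fun",
    "the movie was dull",
    "great acting and the best fun",
]

# Word frequencies per document (bag of words)
bags = [Counter(doc.split()) for doc in docs]

# Weighted word frequencies (TF-IDF): scale each count by the inverse
# document frequency, so words appearing in every document count for less
n_docs = len(docs)
df = Counter(word for bag in bags for word in bag)  # document frequency
idf = {word: math.log(n_docs / count) for word, count in df.items()}

tfidf = [{word: tf * idf[word] for word, tf in bag.items()} for bag in bags]
```

Note how "the", which occurs in all three documents, gets an IDF of zero and so drops out of the weighted features, while the more distinctive "great" keeps a positive weight.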
Each step will be motivated by a sentiment analysis use case using movie reviews, the intention being to show the utility of each method on a reasonably well-behaved sample dataset. We will use a combination of scikit-learn’s performant text feature extraction and spaCy’s powerful NLP pipelines to create scalable solutions that participants can apply to their own use cases. We will conclude by comparing the results of the different approaches and discussing the pros and cons of each.
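As a rough sketch of how scikit-learn's text feature extraction can feed the sentiment analysis use case, the pipeline below chains TF-IDF features into a linear classifier. The handful of reviews and labels here are invented stand-ins, not the movie-review dataset used in the workshop:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-in for the workshop's movie-review data
reviews = [
    "a wonderful, moving film",
    "an absolute joy to watch",
    "tedious and predictable",
    "a dull, lifeless script",
]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative

# TF-IDF features (step 1 of the progression) feeding a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

predictions = model.predict(reviews)
```

The same two-stage shape (vectorizer, then estimator) carries through the later steps: swapping the TF-IDF features for topic-model or embedding representations changes only the first stage of the pipeline.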
Bio: Ben is a Senior Data Scientist at the Institute for Experiential AI. He obtained his Master of Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for government, academia, and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has spoken at data science conferences, taught courses in data science, and helped organize the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.