How to do NLP When You Don’t Have a Labeled Dataset?

Abstract: 

The lack of a readily available labeled dataset is common in industry projects involving NLP, and researchers venturing into new problems or new languages often face the same situation. However, traditional textbooks, tutorials, and workshops focus primarily on building and deploying models. In this workshop, I will introduce some strategies for creating labeled datasets for a new task and for building your first models with that data. By the end of the session, participants should have some ideas for solving the data bottleneck in their organization. The target audience is data scientists as well as those involved in requirements gathering for a given NLP problem.

Session Outline
Lesson 1: Overview of different means of collecting labeled data for NLP, and ethical and other challenges involved.

Lesson 2: For a given problem description, what tools can we use to create annotated data? (an overview of tools, with specific examples using Doccano).

Lesson 3: How can we create labeled data automatically? Data labeling and augmentation for NLP. Tools used: Snorkel

Lesson 4: How to build a model using automatically labeled data and evaluate it with the gold-standard manually labeled data. Tools used: sklearn/huggingface
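
As a rough illustration of the idea behind Lessons 3 and 4, the sketch below labels text automatically with a few heuristic labeling functions and then checks those labels against a small manually labeled gold set. It is plain Python, not Snorkel's API: Snorkel combines labeling-function outputs with a learned label model, whereas this sketch swaps in a simple majority vote. The spam/ham task, the three heuristics, and the four example texts are all hypothetical and exist only to make the sketch runnable.

```python
from collections import Counter

# Hypothetical label scheme for a toy spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one heuristic and abstains when it does not apply.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_starts_with_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_link, lf_starts_with_greeting]

def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

def evaluate(texts, gold):
    """Return (coverage, accuracy on covered items) of the weak labels against gold labels."""
    predictions = [weak_label(t) for t in texts]
    covered = [(p, g) for p, g in zip(predictions, gold) if p != ABSTAIN]
    coverage = len(covered) / len(texts)
    accuracy = sum(p == g for p, g in covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Made-up gold-standard examples, standing in for manually annotated data.
gold_texts = [
    "Win a FREE prize at http://example.com",
    "Hi, are we still meeting tomorrow?",
    "Hello, grab your free ticket",
    "The report is attached.",
]
gold_labels = [SPAM, HAM, HAM, HAM]

coverage, accuracy = evaluate(gold_texts, gold_labels)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Reporting coverage alongside accuracy matters here because heuristics typically abstain on part of the data; in a fuller pipeline, the automatically labeled examples would train a classifier (for example with sklearn), and that classifier would be scored on the same gold set.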

Background Knowledge
The preferred audience is people who have already used NLP in their work and are familiar with the typical NLP system development pipeline, e.g., how to represent text as vectors, how to apply machine learning methods to NLP tasks, and how to evaluate the resulting models.

Bio: 

Sowmya Vajjala currently works as a researcher in Digital Technologies at National Research Council, Canada’s largest federal research and development organization. She has worked in the area of Natural Language Processing (NLP) over the past decade in various roles: as a software developer, researcher, educator, and senior data scientist. She recently co-authored a book, “Practical Natural Language Processing: A Comprehensive Guide to Building Real World NLP Systems”, published by O’Reilly Media (June 2020), which was also translated into Chinese. Her research interests lie in multilingual computing and in the relevance of NLP beyond research, both in industry practice and in other disciplines through interdisciplinary research.

Open Data Science

 

 

 
