Abstract: For NLP tasks, the first step is to pre-process text for training. Suppose you start from an English language model: it comes with a vocabulary of over 1 million items, many entity recognition classes, and extensive compound noun recognition. But what happens when we need to add new terms and customize the vocabulary? In this tutorial, we show an approach for creating a custom vocabulary that can then be used for any NLP task.
1. Introduction to Language Models - terminology such as vocabulary, and common language models
2. Why do we need a custom vocabulary - examples of scenarios where custom terms are needed
3. How to add custom terms to a vocabulary - exBERT and the spaCy tokenizer
- step-by-step approach to creating a custom vocabulary in Python
By the end of the session, participants will understand how to create their own custom vocabularies for use in NLP tasks such as sentence completion and sentiment analysis. The module covers the exBERT approach of adding terms to an existing vocabulary and walks through the steps using an example from the Hugging Face library. Participants will also learn about any pitfalls of this approach and how the open-source spaCy tokenizer can be used to achieve this customization.
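To give a flavor of why adding terms to a vocabulary matters, here is a minimal, library-free sketch. It is not the exBERT method and does not use the Hugging Face or spaCy APIs; it is a toy greedy longest-match subword tokenizer in the WordPiece style. The function name `wordpiece_tokenize`, the tiny vocabulary and the healthcare term "hypertension" are all hypothetical, chosen only for illustration.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Toy greedy longest-match-first subword split (WordPiece-style).

    Continuation pieces are marked with a leading "##". If some part of
    the word cannot be matched, the whole word maps to the unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical base vocabulary that lacks the domain term as a single item.
base_vocab = {"hyper", "##tens", "##ion", "##s", "blood", "pressure"}

# Hypothetical custom terms we want treated as single vocabulary items.
custom_terms = ["hypertension"]
extended_vocab = base_vocab | set(custom_terms)

print(wordpiece_tokenize("hypertension", base_vocab))      # split into subword pieces
print(wordpiece_tokenize("hypertension", extended_vocab))  # kept as one token
```

With the base vocabulary the term is broken into three pieces; once it is added, it survives as a single token. In a real model, each newly added token also needs an embedding before it is useful, which is the part that approaches such as exBERT address.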
Bio: Swagata is a Data Professional with over 6 years of experience in the Healthcare, Retail and Platform Integration industries. She is an avid blogger and writes about state-of-the-art developments in the AI space. She is particularly interested in Natural Language Processing and focuses on researching how to make NLP models work in practical settings. In her spare time, she loves to play her guitar, sip masala chai and find new spots for doing yoga. Connect with her here: https://www.linkedin.com/in/swagata-ashwani/