Data Augmentation for NLP


In AI, it is widely accepted that data beats algorithms, i.e., a simple algorithm trained on a large amount of data often yields far better results than the best algorithm trained on little data. This is especially true for deep learning algorithms, which are known to be data guzzlers. Yet getting data labeled at scale is a luxury most practitioners cannot afford. What does one do in such a scenario?
This is where data augmentation comes into play. Data augmentation is a set of techniques for increasing the size of a dataset and introducing more variability into it, which helps train better and more robust models. Data augmentation is very popular in computer vision. From simple techniques like rotation, translation, and adding salt noise to more involved ones like GANs, we have a whole range of techniques to augment images. Augmentation is widely regarded as one of the key factors behind the success of computer vision models in industrial applications.
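
To make the image case concrete, here is a minimal sketch of the three simple techniques named above (rotation, translation, and adding salt noise) applied to a tiny grayscale image stored as a list of lists. The function names, the 3x3 image, and the parameter values are illustrative assumptions, not taken from the talk; a real pipeline would more likely use an image library.

```python
import random

def rotate90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate_right(image, shift, fill=0):
    """Shift every row right by `shift` pixels, padding the left with `fill`."""
    width = len(image[0])
    shift = max(0, min(shift, width))
    return [[fill] * shift + row[:width - shift] for row in image]

def add_salt(image, fraction, rng, salt=255):
    """Set each pixel to `salt` (white) with probability `fraction`."""
    return [[salt if rng.random() < fraction else px for px in row]
            for row in image]

# Hypothetical 3x3 grayscale image used only for illustration.
image = [
    [0, 10, 20],
    [30, 40, 50],
    [60, 70, 80],
]

rng = random.Random(0)  # seeded so the noisy variant is reproducible
augmented = [
    rotate90(image),
    translate_right(image, 1),
    add_salt(image, 0.2, rng),
]
```

Each variant still shows the same underlying content, which is why such transformations are usually label-preserving for images.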

Most natural language processing (NLP) projects in the industry still suffer from data scarcity. This is where recent advances in data augmentation for NLP can be very helpful. When it comes to NLP, data augmentation is not that straightforward: you want to augment the data while preserving the syntactic and semantic properties of the text. In this talk, we will take a deep dive into the various techniques available to practitioners for augmenting data for NLP. The talk is meant for data scientists, NLP engineers, ML engineers, and industry leaders working on NLP problems.
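
The abstract does not name specific NLP techniques, so the following is only an illustrative sketch of one commonly used idea, synonym replacement, which tries to change the surface form of a sentence while keeping its meaning and structure. The `SYNONYMS` table, the function name, and the example sentence are hypothetical; a real system would draw synonyms from a thesaurus or a language model rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
}

def synonym_replace(sentence, rng, prob=0.5):
    """Replace each word that has a listed synonym with probability `prob`."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(42)  # seeded so the variants are reproducible
original = "I was happy with the quick movie"
variants = [synonym_replace(original, rng) for _ in range(3)]
```

Because only words with a listed synonym are swapped and word order is untouched, the sentence structure stays intact. Whether the meaning is truly preserved depends entirely on the quality of the synonym source, which is the kind of difficulty that makes text augmentation harder than image augmentation.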


Anuj Gupta heads the Machine Learning and Data Science teams at Vahan. Prior to this, he led ML efforts at Intuit, Huawei Technologies, Freshworks (Chennai), and Airwoot (Delhi). He holds a master's degree in theoretical computer science from IIIT Hyderabad, and he dropped out of his Ph.D. at IIT Delhi to work with startups.
He is a regular speaker at ML conferences such as PyData, NVIDIA forums, Fifth Elephant, and Anthill. He has also conducted several workshops attended by machine learning practitioners. He is the co-organizer of one of the early deep learning meetups in Bangalore and the editor of "Anthill-2018", a deep learning-focused conference by HasGeek.