Intro to NLP: Text Categorization and Topic Modeling


Natural Language Processing (NLP) provides structure to unstructured data, a capability at the core of developing Artificial Intelligence centric technology. Text categorization, or classification, helps us tag data with categories such as the sentiments expressed in reviews or the concepts associated with texts. In this talk I will go into the details of NLP classification: (1) the importance of data collection, (2) a deep dive into models and (3) the metrics necessary to measure the performance of the model.
To build a proper understanding of modeling, I will explain traditional NLP techniques using TF-IDF approaches and go into the details of different deep learning architectures such as feed-forward neural networks and convolutional neural networks (CNNs). Along with these concepts I will also show code snippets in Keras to build the classifier. I will conclude with some of the metrics commonly used to measure the performance of the classifier.
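
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It is not code from the talk: the function name `tfidf`, the whitespace tokenizer, the toy reviews and the smoothed IDF variant are all assumptions chosen for illustration, and libraries such as scikit-learn apply additional normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    This is one common variant; other implementations may differ.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy reviews, for illustration only.
reviews = [
    "great book great price",
    "terrible book",
    "great service",
]
vectors = tfidf(reviews)
```

In this sketch a term that appears in fewer documents gets a larger IDF, so in the second review "terrible" is weighted above "book". Vectors like these can then be fed to any standard classifier.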

Text categorization works well when there is training data. In the absence of training data we use unsupervised techniques such as topic modeling to infer patterns in text data. Topic modeling is a form of document clustering in which coherent concepts or phrases represent each cluster. I will go into the details of implementing topic modeling in Python and some use cases where it can be applied.

Session Outline
Lesson 1: Data centric approaches are typically more successful than model centric approaches.
Lesson 2: Start with a simple model and iterate towards the optimal model for your dataset.
Lesson 3: Decide on performance metrics that you need to optimize before you start collecting data for your model.
Lesson 4: While building the model keep deployment requirements such as latency and model size in mind.
Lesson 5: If you do not have training data unsupervised techniques such as Topic Modeling can be handy.
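
Lesson 3 refers to performance metrics without naming them. As one possible illustration, the sketch below computes precision, recall and F1 for a binary classifier in plain Python. The helper name `precision_recall_f1` and the toy labels are hypothetical, and the talk may cover other metrics.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Choosing between metrics like these up front matters because they reward different behavior: precision penalizes false positives, recall penalizes false negatives, and F1 balances the two.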

Background Knowledge
A working knowledge of Python and a preliminary knowledge of scikit-learn and Keras are useful.


Sanghamitra Deb is a Staff Data Scientist at Chegg, where she works on problems related to school and college education to sustain and improve the learning process. Her work involves recommendation systems, computer vision, graph modeling, deep NLP analysis, data pipelines and machine learning. Previously, Sanghamitra was a data scientist at Accenture, where she worked on a wide variety of problems related to data modeling, architecture and visual storytelling. She is an avid fan of Python and has been programming for more than a decade.

Trained as an astrophysicist (she holds a PhD in physics), she applies her analytical mind not only to her work in a range of domains such as education, healthcare and recruitment but also to her leadership style. She mentors junior data scientists at her current organization and coaches students from various fields to transition into Data Science. Sanghamitra enjoys addressing technical and non-technical audiences at conferences and encourages women to join tech careers. She is passionate about diversity and has organized Women In Data Science meetups.

Open Data Science
One Broadway
Cambridge, MA 02142
