Finding Rare Events in Text


This session focuses on identifying rare events in text using positive-unlabeled (PU) data. PU learners are widely used for one-class classification, but the problem becomes far harder when the event of interest has a low probability of occurrence. We will discuss a novel algorithm, iCASSTLe, published at IEEE ICMLA 2018, which uses a two-stage semi-supervised approach built on core NLP components to extract the relevant recall set. By the end of this workshop, you will have a basic understanding of the following:

- Difference between rare events & anomalies
- Basics of Text Mining
- Motivation behind Semi-Supervised Learners
- Training PU Learners for Rare Events

Session Outline
Module I: Rare Events & how they differ from Anomalies
- Examples
- Major differences
- Degree of rarity

Module II: Rare Events in Text
- Examples
- Sentiment Inclination
- Token Sensitivity
- Data availability

Module III: Text Mining
- Text Cleaning & Pre-processing for Rare Events
- Numeric Representation of Text
- Live Exercise (R/Python script will be provided)

Module IV: Positive Unlabeled Learning
- Motivation & Examples
- Live Exercise
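
As a rough illustration of what a PU learner does, the sketch below shows one well-known calibration idea from the PU literature (Elkan and Noto, 2008), not the workshop's own exercise or the iCASSTLe method. It assumes a classifier has already been trained to separate labeled positives from unlabeled examples and that labeled positives were selected completely at random. The function names and the toy scores are hypothetical.

```python
def estimate_label_frequency(holdout_positive_scores):
    """Estimate c = P(labeled | positive) as the mean classifier score
    on held-out labeled positives (assumes random selection of labels)."""
    if not holdout_positive_scores:
        raise ValueError("need at least one held-out labeled positive")
    return sum(holdout_positive_scores) / len(holdout_positive_scores)


def calibrate(scores, c):
    """Convert P(labeled | x) into an estimate of P(positive | x) = score / c,
    clipped to 1.0 because the ratio can exceed 1 on noisy scores."""
    return [min(1.0, g / c) for g in scores]


# Hypothetical scores from a labeled-vs-unlabeled classifier.
holdout_positive_scores = [0.30, 0.20, 0.25, 0.25]
unlabeled_scores = [0.02, 0.20, 0.05]

c = estimate_label_frequency(holdout_positive_scores)
posteriors = calibrate(unlabeled_scores, c)
print(c, posteriors)
```

When the event is rare, very few labeled positives are available to estimate c, so the estimate can be unstable; this is one reason rare events make PU learning harder.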

Module V: Semi-Supervised Learning
- Motivation
- Entropy Regularization
- Logistic Regression (Binary Classification) with semi-supervision
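
To make the last two topics concrete, here is a minimal sketch of an entropy-regularized objective for binary logistic regression, in the spirit of Grandvalet and Bengio's entropy minimization. It is an assumption about the general technique, not the formulation used in the session. The labeled term is the usual negative log-likelihood; the unlabeled term penalizes uncertain predictions, weighted by a hypothetical parameter `lam`. All names and the toy data are illustrative.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """P(y = 1 | x) under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli(p) prediction."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_regularized_loss(w, b, labeled, unlabeled, lam):
    """Mean negative log-likelihood on labeled data plus
    lam times the mean prediction entropy on unlabeled data."""
    nll = 0.0
    for x, y in labeled:
        p = min(max(predict(w, b, x), EPS), 1.0 - EPS)
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    nll /= len(labeled)
    ent = sum(binary_entropy(predict(w, b, x)) for x in unlabeled) / len(unlabeled)
    return nll + lam * ent


# Toy data: two labeled points and two unlabeled points (hypothetical).
labeled = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
unlabeled = [[0.9, 0.1], [0.2, 0.8]]

loss_uninformed = entropy_regularized_loss([0.0, 0.0], 0.0, labeled, unlabeled, lam=0.5)
loss_confident = entropy_regularized_loss([2.0, -2.0], 0.0, labeled, unlabeled, lam=0.5)
print(loss_uninformed, loss_confident)
```

Minimizing this objective pushes the decision boundary away from dense regions of unlabeled data, since predictions near 0.5 on unlabeled points are penalized.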

Module VI: iCASSTLe
- Example use case
- Quantifying Degree of Severity
- Metric Formation & Stage I Classification
- Stage II Classification with ERLG
- Live Exercise

Background Knowledge
Basics of statistical learning, linear algebra (matrix factorization, vector spaces and distance), probability, logistic regression, entropy, Monte Carlo simulation, NLP basics, and fair exposure to coding in R/Python


Debanjana is a Senior Data Scientist at Walmart Labs with 4+ years of experience in tech. At Walmart, she has been instrumental in developing ML-driven solutions in the compliance space, working extensively with Natural Language Processing, Mixture Models and Rare Time Series. Currently, her focus is on building an AI to enable automated shelf curation for creative content. She has filed 5 US patents in the fields of Clustering & Anomaly Detection, Imbalance Text Classification and Stochastic Processes. In addition, she has three published papers to her credit. Debanjana has a master's degree in Statistics from the Indian Institute of Technology (Kanpur).

Open Data Science




