Natural Language Processing (NLP) from Scratch

Abstract: The rise of online social platforms has resulted in an explosion of written text in the form of blogs, posts, tweets, wiki pages, etc. This new wealth of data provides a unique opportunity to explore natural language in its many forms, both as a way of automatically extracting information from written text and as a way of artificially producing text that looks natural.

In this session, we will introduce viewers to natural language processing from scratch. Each concept is introduced and explained through coding examples that use nothing more than plain Python and numpy. In this way, viewers will learn the underlying concepts and techniques in depth instead of just learning how to use a specific NLP library.

In particular, we will cover:
- One-Hot Encoding
- Bag of Words
- TF/IDF
- Document clustering
- Sentiment Analysis
- Word embeddings
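
To give a flavor of the "plain Python and numpy" approach, here is a minimal sketch of the first three topics (one-hot encoding, bag of words, and TF/IDF). It is not taken from the session materials: the toy corpus, the helper names `one_hot` and `bag_of_words`, and the unsmoothed `log(N / df)` form of IDF are all assumptions chosen for illustration.

```python
import numpy as np

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Whitespace tokenization and a sorted vocabulary with a word -> column index map
tokens = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokens for word in doc})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec


def bag_of_words(doc_tokens):
    """Sum the one-hot vectors of a document's tokens to get its word counts."""
    return np.sum([one_hot(word) for word in doc_tokens], axis=0)


# Document-term count matrix: one row per document, one column per word
counts = np.array([bag_of_words(doc) for doc in tokens])

# Term frequency: counts normalized by document length
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency (assumed unsmoothed variant): log(N / df)
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df)

# TF/IDF weights: frequent-in-document but rare-in-corpus words score highest
tfidf = tf * idf
```

In this sketch, a word such as "cat" that appears in only one document ends up with a higher weight in that document than "the", even though "the" occurs there twice.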

Bio: Bruno Gonçalves is currently a Senior Data Scientist working at the intersection of Data Science and Finance. Previously, he was a Data Science fellow at NYU's Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his Ph.D. in the Physics of Complex Systems in 2008, he has been pursuing the use of Data Science and Machine Learning to study Human Behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he studied how we can observe both large scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of Computational Linguistics, Information Diffusion, Behavioral Change and Epidemic Spreading. In 2015 he was awarded the Complex Systems Society's Junior Scientific Award for "outstanding contributions in Complex Systems Science" and in 2018 he was named a Science Fellow of the Institute for Scientific Interchange in Turin, Italy.