Abstract: This workshop reviews the key steps in the NLP workflow and the most popular tools (mostly in Python), and introduces topic models and word2vec embeddings. We will cover text processing with sklearn and NLTK and introduce spaCy’s linguistic capabilities. The session compares two representations of natural language, the traditional document-term matrix and the more recent word2vec embeddings, and introduces topic modeling with sklearn and gensim, along with visualization of the resulting topics using pyLDAvis.
In the second part, we will build word2vec models from scratch using TensorFlow, visualize the embeddings in 3D with the TensorBoard projector, and examine how word embeddings capture important semantic relationships. Finally, we will see how the semantic information represented by the word vectors can be used to translate words and phrases from one language to another.
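One common way to use word vectors for translation is to learn a linear map between two embedding spaces from a small seed dictionary. The sketch below shows that idea with NumPy. It is an assumption about the general technique, not a description of the workshop's actual method. The word lists, the random toy embeddings, and the `translate` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seed dictionary; the language pair is an assumption.
src_words = ["cat", "dog", "house", "water", "book", "tree"]
tgt_words = ["gato", "perro", "casa", "agua", "libro", "arbol"]

# Toy embeddings: the target space is an exact linear transform of the
# source space, which real embeddings only approximate.
dim = 4
X = rng.normal(size=(len(src_words), dim))  # source-language vectors
true_map = rng.normal(size=(dim, dim))
Z = X @ true_map                            # target-language vectors

# Learn the translation matrix W by least squares on the first five pairs,
# holding out the last pair ("tree") to check generalization.
train = slice(0, 5)
W, *_ = np.linalg.lstsq(X[train], Z[train], rcond=None)

def translate(word):
    """Map a source word into the target space and return its nearest neighbor."""
    v = X[src_words.index(word)] @ W
    sims = (Z @ v) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(v))
    return tgt_words[int(np.argmax(sims))]

print(translate("tree"))
```

With real embeddings the two spaces are trained independently, so the learned map is only approximate and nearest-neighbor retrieval returns ranked candidates rather than a guaranteed match.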
Bio: Stefan is the founder, CEO, and lead data scientist at Applied AI, which provides data strategy consulting, machine learning solutions, and executive coaching and training for the consumer, healthcare, and financial industries. Prior to his current venture, he was co-founder and partner at an international investment firm, where he built the predictive analytics and investment research practice. Earlier, he was an executive at a global fintech company with operations in 15 markets.
A native German, he started his career as an advisor to central banks in emerging markets and has worked in six languages across Asia, Africa, and Latin America. In 2007, he raised $35m from the Gates Foundation to co-found the Alliance for Financial Inclusion, an international organization for regulators that facilitates the adoption of financial technology to lower barriers to access.
Stefan holds a Master’s in Economics from FU Berlin, with a thesis on early warning systems for financial crises using machine learning, as well as an MPA/ID from the Harvard Kennedy School and a CFA Charter, and has published through Harvard and Brookings. He teaches data science at General Assembly, has produced two DataCamp courses that currently have 13,000 students, and is the author of two courses on ‘Mastering Unsupervised Learning’ (forthcoming from Packt).