Beyond Word2vec: Recent Developments in Document Embedding

Abstract: It is easy to be amazed by the seemingly magical power of word2vec. But in real business use cases we rarely need to understand single words. So how do we apply the power of word2vec to phrases, sentences, paragraphs, or entire documents? In this workshop we will go through various techniques for generating useful representations of documents of arbitrary length, and look at ways of comparing these methods.

We will start with bag-of-words approaches and TF-IDF. From there we will look at dimensionality reduction techniques like LSA and NMF. After that, we will look at word2vec and sense2vec and various ways to aggregate those word vectors, including summing, weighting, clustering, Gensim Doc2vec, and developing parse-tree representations. Finally, we will look at RNN methods such as LSTMs using Keras. Along the way we will look at ways to evaluate each of these methods and discuss their strengths and weaknesses.
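As a taste of the starting point, here is a minimal sketch of the bag-of-words TF-IDF baseline with cosine similarity, written in plain Python. The toy corpus and function names are invented for illustration; the workshop itself uses library implementations rather than hand-rolled code.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "word vectors capture meaning",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: how many docs contain each term.
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Term frequency weighted by inverse document frequency."""
    tf = Counter(doc)
    return [tf[w] / len(doc) * math.log(len(docs) / df[w]) for w in vocab]

vectors = [tfidf(doc) for doc in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The two overlapping sentences score higher than the unrelated one:
# common words shared by all docs would be down-weighted by IDF.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Every later method in the workshop can be evaluated the same way: embed documents, then compare the resulting vectors with a similarity measure.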
WARNING: this workshop will run much more smoothly if you download several large files beforehand, including a 3.6 GB pre-trained word2vec model.

Bio: At Metis, Andrew has taught the fundamentals of Machine Learning and Data Science in a 3-month bootcamp to over 100 students and advised nearly 500 student projects.

Andrew came to Metis from LinkedIn, where he worked as a Data Scientist on the Education, Skills, and then NLP teams. He is passionate about helping people make rational decisions and building cool data products.

Prior to that, he worked on fraud modeling at IMVU (the lean startup) and studied applied physics at Cornell. He loves snowboarding, traveling, scotch, and reading about all kinds of nerdy topics.