EMI: Embed, Measure and Iterate

Abstract: Representation learning, colloquially known as embeddings, has emerged as an important unifying theme in machine learning and is widely used in fields ranging from social media to computer vision and natural language processing. The core idea is to leverage large quantities of context-rich data, whether labeled or unlabeled, to ‘embed’ data points into vectors. These vectors can then serve as feature sets for classic machine learning classifiers such as logistic regression. Embedding algorithms like word2vec and DeepWalk have yielded impressive results in natural language and graph processing pipelines. In the research community, there is now a concerted effort to build faster and better embedding algorithms for all kinds of heterogeneous datasets, including videos with tags and annotations, social media data and tables.

Much less attention has been paid to the issue of what makes an embedding ‘good’. In practice, two kinds of embeddings have emerged. One is specialized: in facial recognition, for example, we are less interested in learning an embedding that works well for all images than in one that helps our systems do better facial recognition. The other is general: when multiple tasks are involved (as is common in NLP), a general embedding is preferable, since a ‘jack of all trades’ approach leads to a better overall system and is more scalable and robust.

In this talk, I will introduce general and specialized embeddings, along with a methodology for measuring and iteratively improving such embeddings in the context of both known and unknown applications. My goal is not to explain or propose (yet another) embedding algorithm, but to provide insight into how an embedding algorithm or package should be thought of, and evaluated, in the real world.

Bio: Mayank Kejriwal is a research scientist and lecturer at the University of Southern California's Information Sciences Institute (ISI). He received his Ph.D. from the University of Texas at Austin. His dissertation involved Web-scale data linking and, in addition to being published as a book, was recently recognized with an international Best Dissertation award in his field. His research is highly applied and sits at the intersection of knowledge graphs, social networks, Web semantics, network science, data integration and AI for social good. He has contributed to systems that are being used by both DARPA and law enforcement, and he has active collaborations in both academia and industry. He is currently co-authoring a textbook on knowledge graphs (MIT Press, 2018), and has delivered tutorials and demonstrations at numerous conferences and venues, including KDD, AAAI, and ISWC.