Evaluating and Testing Natural Language Processing Models

Abstract: 

Current evaluation of the generalization of natural language processing (NLP) systems, and of much of machine learning, primarily consists of measuring accuracy on held-out instances of the dataset. Since the held-out instances are often gathered using an annotation process similar to that of the training data, they include the same biases that act as shortcuts for machine learning models, allowing them to achieve accurate results without requiring actual natural language understanding. Thus, held-out accuracy is often a poor proxy for generalization, and aggregate metrics have little to say about where the problem may lie.

In this talk, I will introduce a number of approaches we are investigating to perform a more thorough evaluation of NLP systems. I will first provide an overview of automated techniques for perturbing instances in the dataset that identify loopholes and shortcuts in NLP models, including semantic adversaries and universal triggers. I will then describe recent work in creating comprehensive and thorough tests and evaluation benchmarks for NLP that aim to directly evaluate comprehension and understanding capabilities. The talk will cover a number of NLP tasks, including sentiment analysis, textual entailment, paraphrase detection, and question answering.

Bio: 

Dr. Sameer Singh is an Assistant Professor of Computer Science at the University of California, Irvine (UCI). He works primarily on the robustness and interpretability of machine learning algorithms, along with models that reason with text and structure for natural language processing. Sameer was a postdoctoral researcher at the University of Washington (with Carlos Guestrin and the late Ben Taskar) and received his PhD from the University of Massachusetts, Amherst (with Andrew McCallum), during which he also worked at Microsoft Research, Google Research, and Yahoo! Labs. He was selected as a DARPA Riser, has been awarded the grand prize in the Yelp dataset challenge, the Yahoo! Key Scientific Challenges, and the UCI Mid-Career Excellence in Research Award, and received the Hellman Fellowship in 2020. His group has received funding from Amazon, Allen Institute for AI, NSF, DARPA, Adobe Research, Base 11, and FICO. Sameer has published extensively at machine learning and natural language processing conferences and workshops, including paper awards at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, and ACL 2020.