State-of-the-art Text Classification with ULMFiT

Abstract: Transfer learning has revolutionized image analysis tasks such as object detection and recognition by allowing users to start with high-quality deep learning models that can be fine-tuned for their particular use case. Until recently, however, transfer learning had not been widely applied to other tasks. The Universal Language Model Fine-Tuning for Text Classification (ULMFiT) approach has shown that this method can be applied successfully to NLP, improving on the previous state of the art in text classification by 20% or more.

ULMFiT takes the same paradigm used for image tasks and applies it to text. Analysts start with a language model trained on a large corpus, such as Wikipedia, which gives the model a basic understanding of how language works. They then fine-tune the language model on their own corpus before training the text classifier. The approach is not tied to English: a language model can be trained on, and applied to, text in any language.

In this session, Matt will provide an overview of ULMFiT before walking through a use case in which he used Amazon SageMaker to train ULMFiT on hand-labelled quotes sourced from thousands of news articles. After demonstrating the success of this method, Matt will discuss a novel approach that improves upon ULMFiT by up to a further 10% by incorporating article-level metadata such as publication name, author name, and speaker. The session will end with a short discussion of how this model has been implemented to increase the efficiency of the analysts who manually tag quotes.
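
The abstract does not say how the article-level metadata is fed to the model. One simple possibility, shown here purely as an illustrative assumption, is to prepend each metadata field to the quote as marked text so that a text-only model sees it as part of the input. The marker tokens (xxpub, xxauth, xxspk, xxquote), the function name add_metadata, and the sample values are all hypothetical and are not taken from the talk.

```python
from typing import Mapping

# Hypothetical field-marker tokens; any strings that do not occur in the
# quote text itself would serve the same purpose.
FIELD_MARKERS = {
    "publication": "xxpub",
    "author": "xxauth",
    "speaker": "xxspk",
}


def add_metadata(quote: str, metadata: Mapping[str, str]) -> str:
    """Prepend marked metadata fields to a quote, in a fixed field order."""
    parts = []
    for field, marker in FIELD_MARKERS.items():
        value = metadata.get(field, "").strip()
        if value:  # skip fields that are missing or empty
            parts.append(f"{marker} {value}")
    parts.append(f"xxquote {quote.strip()}")
    return " ".join(parts)


# Made-up example values, for illustration only.
example = add_metadata(
    "We expect growth to continue.",
    {"publication": "Example Times", "speaker": "Jane Doe"},
)
print(example)
```

With this kind of encoding, the same fine-tuning and classifier training steps can run unchanged on the augmented text, which is one reason to consider it before changing the model architecture.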

Bio: Matt Teschke is an applied machine learning researcher with demonstrated experience developing solutions to complex problems for government and commercial customers. As leader of Novetta’s Machine Learning Center of Excellence, Matt directs ML research in NLP, object detection, face recognition, entity resolution, and supervised learning. He is interested in applying open source and cloud technologies to develop practical and innovative solutions for customers.