Spark NLP: State of the Art Natural Language Processing at Scale

Abstract: Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, and entity extraction. This talk introduces the open-source Spark NLP library, which within two years has become the most widely used NLP library in the enterprise by implementing state-of-the-art deep learning NLP research as a production-grade, fast, and scalable library for Python, Java, and Scala.

Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark’s built-in optimizations. The talk will share benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. The talk will demonstrate how to use these algorithms to build commonly used pipelines, with notebooks that will be made publicly available after the talk.

Bio: David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. He also worked at Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in computer science and master’s degrees in both computer science and business administration.