Apache Spark for Fast Data Science (and Fast Python Integration!) at Scale

Abstract: We'll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You'll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you'll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, the open-source ecosystem, and third-party libraries, and how to solve common challenges.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you'll be comfortable doing your own feature engineering and modeling with Spark.

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with your favorite Python tooling. We'll discuss distributed scheduling for popular libraries like TensorFlow, as well as fast model inference, traditionally a challenge with Spark. We'll even see how you can integrate Spark with Python+GPU computation on arrays (PyTorch) or dataframes (RapidsAI).

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Bio: Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s 20 years of engineering experience include streaming analytics, machine learning systems, and cluster management schedulers for some of the world’s largest banks, along with web, mobile, and embedded device apps for startups. His first full-time job in tech was on a neural-net-based fraud detection system for debit transactions, back in the bad old days when some neural nets were patented (!), and he’s much happier living in the age of amazing open-source data and ML tools today.