General Training Session: Solving Real-World Data Problems with Spark

Abstract: In recent years, data teams across a wide range of industries have amassed large amounts of data generated by our interactions in an increasingly digital world. While these exponentially growing datasets contain valuable insights that can improve our businesses and quality of life, they are also becoming increasingly difficult to process efficiently. Fortunately, the open-source project Apache Spark has emerged as the leading solution: a general-purpose distributed framework for solving the latest data problems faced by the industry. In this hands-on training session, participants will learn how to solve real-world data problems using Spark.

This session will introduce the concepts required for programming efficiently with Spark. We will focus on the higher-level DataFrames and Spark SQL APIs, while discussing just enough of the low-level internals to effectively monitor, debug, and optimize Spark jobs. Participants will gain an intermediate proficiency with Spark by walking through real-world examples and exercises inspired by the problems faced by industry-leading data teams. Additionally, we'll compare and contrast similar data technologies to help participants evaluate exactly when (and when not) to use Spark.

In addition to using Spark's general-purpose features, we will cover several real-world examples that use Spark's machine learning and real-time libraries. We will cover the Spark Streaming and MLlib libraries using a "learning by doing" approach. Specifically, we will complete hands-on examples that mirror the real-world challenges routinely tackled by data teams at top companies in the industry. By completing this training, participants will be comfortable using Spark to solve their own data science and big data problems.

Bio: David Drummond is Director of Engineering at Insight Data Science, where he enjoys helping others learn distributed technologies to solve big data problems. He also likes thinking about database internals and understanding how distributed systems fail. Before working in the field of data, he received his PhD in Physics, researching fault tolerance for quantum computers.