Automating Trend Discovery on Streaming Datasets with Spark 2.3

Abstract: In this session we will start with a deep dive into effective data modeling, then explore methods for automatically surfacing unique and interesting patterns in your data, all using Spark SQL. More importantly, we will discuss how to do this in batch mode and then how to migrate easily from batch mode to streaming using Spark Structured Streaming.

We will walk through techniques for reducing the memory footprint of statistical aggregations via data sketching, so that you can scale your systems out more efficiently to handle many millions of records (all in memory) while maintaining a relatively small footprint. The idea is to leverage quantile sketches to automatically analyze changes in the shape and behavior of seemingly disparate datasets and find common dimensions (features) across many different metrics.
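
To make the memory argument concrete, here is a minimal sketch of a mergeable quantile summary. It is an illustration only, not the implementation used in the session: a fixed-bin histogram over a known value range stands in for a true quantile sketch, and the class name `HistogramSketch` and its parameters are hypothetical. Plain Python is used so the example has no dependencies. Memory stays constant no matter how many records are added, and partial summaries built on separate partitions combine by adding counts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HistogramSketch:
    """Fixed-memory quantile summary over a known value range [lo, hi).

    Hypothetical stand-in for a real quantile sketch; accuracy is limited
    to roughly one bin width.
    """
    lo: float
    hi: float
    bins: int = 100
    counts: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.counts:
            self.counts = [0] * self.bins

    def add(self, x: float) -> None:
        # Out-of-range values are clamped into the first or last bin.
        width = (self.hi - self.lo) / self.bins
        idx = int((x - self.lo) / width)
        self.counts[min(max(idx, 0), self.bins - 1)] += 1

    def merge(self, other: "HistogramSketch") -> "HistogramSketch":
        # Summaries with the same bin layout merge by adding counts,
        # so partial aggregates from different partitions can be combined.
        assert (self.lo, self.hi, self.bins) == (other.lo, other.hi, other.bins)
        merged = [a + b for a, b in zip(self.counts, other.counts)]
        return HistogramSketch(self.lo, self.hi, self.bins, merged)

    def quantile(self, q: float) -> float:
        # Walk the cumulative counts and interpolate inside the target bin.
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty sketch")
        target = q * total
        width = (self.hi - self.lo) / self.bins
        seen = 0
        for i, c in enumerate(self.counts):
            if c > 0 and seen + c >= target:
                frac = (target - seen) / c
                return self.lo + (i + frac) * width
            seen += c
        return self.hi


# Two partitions summarized independently, then merged.
left = HistogramSketch(0.0, 1000.0)
right = HistogramSketch(0.0, 1000.0)
for v in range(0, 500):
    left.add(float(v))
for v in range(500, 1000):
    right.add(float(v))
combined = left.merge(right)
print(combined.quantile(0.5), combined.quantile(0.95))
```

Comparing quantiles from summaries taken at different times, or for different dimensions, is one way to detect a change in the shape of a distribution without keeping the raw records.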

We will also go over how to handle common serialization problems with the storage and retrieval of partially aggregated data when updating your streaming applications. Lastly, we will finish by discussing how to use windowed statistical aggregations and rollups to automatically detect trends in your data while also handling the dreaded issue of data seasonality.
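
As a rough illustration of seasonality-aware trend detection over windowed aggregates, the sketch below groups events into tumbling windows and compares each window with the same slot in earlier seasons (for example, the same hour on previous days). This is an assumed, simplified approach in plain Python, not the method presented in the session; the function names, the z-score test, and the `threshold` and `min_history` parameters are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def tumbling_window_means(
    events: Iterable[Tuple[int, float]], window_seconds: int
) -> Dict[int, float]:
    """Average (epoch_seconds, value) events per tumbling window start."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in events:
        buckets[ts - ts % window_seconds].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def seasonal_anomalies(
    window_means: Dict[int, float],
    window_seconds: int,
    season_seconds: int,
    threshold: float = 3.0,
    min_history: int = 3,
) -> List[int]:
    """Return window starts that deviate from the same slot in past seasons."""
    history: Dict[int, List[float]] = defaultdict(list)
    flagged: List[int] = []
    for start in sorted(window_means):
        value = window_means[start]
        # Slot index within the season, e.g. hour-of-day for daily seasons.
        slot = (start % season_seconds) // window_seconds
        past = history[slot]
        if len(past) >= min_history:
            mu, sigma = mean(past), pstdev(past)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(start)
        # Simplification: flagged values still enter the history.
        past.append(value)
    return flagged


# Five days of hourly data with a daily pattern and one injected spike.
HOUR, DAY = 3600, 86400
events = [
    (d * DAY + h * HOUR, 1000.0 if (d, h) == (4, 3) else 10.0 * h + d % 2)
    for d in range(5)
    for h in range(24)
]
means = tumbling_window_means(events, HOUR)
print(seasonal_anomalies(means, HOUR, DAY))
```

Because each window is judged against its own seasonal slot, a value that is normal for a busy hour is not flagged merely for being higher than a quiet hour.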

This session will cover some best practices and patterns for writing streaming applications with Apache Spark 2.3, including how to write effective unit tests to ensure your applications can handle live updates in production. A working application will be made available at the start of the presentation. Knowledge of Spark and Scala is a must in order to take full advantage of this information.

Bio: Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250 ms), highly available, trustworthy analytics system. His team provides near-real-time analytics that process, aggregate, and analyze multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults with teams internally. Previously, at Yahoo!, he built a real-time recommendation engine and targeted ranking / ratings analytics that helped serve personalized page content for millions of Yahoo Games customers. He also worked to build a real-time click / install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports. Scott finished his tenure at Yahoo at Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system that integrated into the Flurry mobile app for iOS and Android.

Open Data Science Conference