Tutorial on Anomaly Detection at Scale: Data Engineering Challenges meet Data Science Difficulties

Abstract: This tutorial showcases a joint effort of data engineering and data science in building a real-time anomaly detection system at scale. In particular, it addresses data engineering challenges such as the setup and configuration of Kafka, Spark Streaming and a Spark cluster, as well as downstream data storage, visualization and alerting. On the data science side, the tutorial covers the difficulties of unsupervised data modeling in both batch and online fashion and the implementation challenges of scaling, and touches on ensemble techniques for accuracy improvement and detector selection.

Anomaly detection is predominantly done in an unsupervised fashion, since labeled data is rarely available or classes are highly imbalanced. To make the problem harder, important anomalies turn out to be contextual or collective rather than just point anomalies in a univariate time series. Session attendees will hear about the use of robust PCA, LSTMs, autoencoders and other methods implemented to serve as anomaly detectors. The methods are implemented in the Python library Keras, but in order to scale and process data coming from Kafka in real time, they are adapted to run on a Spark cluster using Spark Streaming. At the end of the processing pipeline is a Bayesian ensemble learning model. Attendees will see it in action and understand how it helps the system select the best detectors dynamically.

Bio: Dusan Randjelovic is a Senior Data Scientist at SmartCat.io, currently focused on Bayesian ensemble techniques and the analysis of time series data for anomaly detection applications. He has worked with various scientific (big) data, from astrophysics to bioinformatics, and is also interested in stream processing and distributed systems. Dusan has taught at the Faculty of Sciences, Novi Sad and the Faculty of Mathematics, Belgrade, has spoken at international conferences, and is actively involved in local and regional data science communities.

Open Data Science Conference