Eagleeye: Data Pipeline for Anomaly Detection in Cyber Security

Abstract: 

Cloud-native applications. Multiple cloud providers. Hybrid cloud. Thousands of VMs and containers. Complex network policies. Millions of connections and requests in any given time window. This is the typical situation a Security Operations Center (SOC) analyst faces every single day. In this talk, the speaker describes the highly available, highly scalable data pipelines he built for the following use cases:

* Denial of Service: A device in the network stops working.
* Data Loss: For example, a rogue agent in the network transmits IP data outside the network.
* Data Corruption: A device starts sending erroneous data.

These use cases can be addressed with anomaly detection models. The main challenge is the data engineering pipeline: with almost 7 billion events occurring every day, processing and storing them for further analysis is a significant undertaking. The machine learning models (for anomaly detection) have to be updated every few hours, which requires the pipeline to build the feature store within a very short time window.
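
To put the volume in perspective, a quick back-of-the-envelope calculation converts the daily event count into a sustained per-second rate. The figure of 7 billion events per day comes from the abstract; spreading the events evenly over the day is a simplifying assumption, and real traffic will likely peak well above this average.

```python
# Back-of-the-envelope throughput estimate (illustrative only).
EVENTS_PER_DAY = 7_000_000_000   # "almost 7 billion events" per the abstract
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def average_events_per_second(events_per_day: int) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


rate = average_events_per_second(EVENTS_PER_DAY)
print(f"~{rate:,.0f} events/second on average")
```

Under that assumption the pipeline has to sustain roughly 81,000 events per second before any headroom for bursts.
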
The core components of the data engineering pipeline are:

* Apache Flink
* Apache Kafka
* Apache Pinot
* Apache Spark
* MLflow
* Apache Superset

The event logs are ingested into Pinot through a Kafka topic. Pinot supports Apache Kafka-based real-time data ingestion and indexing. Pinot has basic capabilities for computing sliding time window statistics; more complex real-time statistics are computed using Flink, a stream-processing engine that provides high throughput and low latency. Spark jobs are used for batch processing, MLflow for machine learning model management, and Superset for visualization.
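
As a rough illustration of what a sliding time window statistic looks like, the sketch below keeps the last few minutes of a per-device metric in memory and flags a new value whose z-score against that window is large. It uses only the Python standard library and is not the Pinot or Flink API; the class name `SlidingWindowStats`, the 3.0 threshold, and the sample request counts are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from math import sqrt


class SlidingWindowStats:
    """Holds (timestamp, value) pairs from the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()

    def add(self, timestamp: float, value: float) -> None:
        """Append an event and evict everything that fell out of the window."""
        self.events.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def mean(self) -> float:
        values = [v for _, v in self.events]
        return sum(values) / len(values) if values else 0.0

    def std(self) -> float:
        """Population standard deviation of the values in the window."""
        values = [v for _, v in self.events]
        if not values:
            return 0.0
        m = self.mean()
        return sqrt(sum((v - m) ** 2 for v in values) / len(values))

    def zscore(self, value: float) -> float:
        """Distance of `value` from the window mean, in standard deviations."""
        if not self.events:
            return 0.0
        s = self.std()
        if s == 0.0:
            return 0.0 if value == self.mean() else float("inf")
        return (value - self.mean()) / s

    def is_anomaly(self, value: float, threshold: float = 3.0) -> bool:
        """Check a candidate value against the window before adding it."""
        return abs(self.zscore(value)) > threshold


# Hypothetical per-minute request counts for one device.
stats = SlidingWindowStats(window_seconds=300)
for t, count in [(0, 100), (60, 102), (120, 98), (180, 101), (240, 99)]:
    stats.add(t, count)

print(stats.is_anomaly(500))  # a sudden spike in requests
print(stats.is_anomaly(101))  # a value in the usual range
```

A fixed z-score threshold is only one of many possible anomaly rules. In a pipeline like the one described, this kind of windowed aggregation would run inside the streaming or OLAP layer rather than in a single Python process.
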

The speaker talks through the architectural decisions and shows how to build a modern real-time stream processing data engineering pipeline using the above tools.

Outline
* The problem: overview
* Different Architecture Choices
* The final architecture - a brief explanation
* Real-Time Processing
* Apache Kafka
* Message broker vs Message Queue
* RabbitMQ vs Kafka
* Why Kafka?
* Apache Flink
* Micro-batching vs Streaming
* Flink vs Spark Streaming
* Why Flink?
* Apache Pinot
* OLAP vs OLTP
* Why Pinot?
* Batch Processing
* Apache Spark
* Anomaly detection
* Models
* Data Engineering + Machine Learning
* ML and MLlib
* MLflow - Model management
* Visualization - Superset
* A short demo

Background Knowledge:
Very basic understanding

Bio: 

Tuhin Sharma is a Senior Principal Data Scientist at Red Hat in the Corporate Development and Strategy group. Prior to that, he worked at Hypersonix as an AI Architect. He also co-founded and served as CEO of Binaize, a website conversion intelligence product for e-commerce SMBs. He received a master’s degree in Computer Science, with a specialization in Data Mining, from the Indian Institute of Technology Roorkee, and a bachelor’s degree in Computer Science from the Indian Institute of Engineering Science and Technology Shibpur. He loves to code and collaborate on open source and research projects. He has 4 research papers and 5 patents in the field of AI and NLP. He is a reviewer for the IEEE MASS conference in the AI track. He writes deep learning articles for O’Reilly in collaboration with the AWS MXNet team. He loves to play table tennis and guitar in his leisure time. His favorite quote is “Life is Beautiful”.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
