Deep Learning for Real Time Streaming Data with Kafka and TensorFlow

Abstract: In mission-critical real-time applications, the use of machine learning to analyze streaming data is gaining momentum. In those applications, Apache Kafka is the most widely used framework for processing the data streams. It typically works alongside machine learning frameworks that handle model training and inference.

In this talk, we focus on the KafkaDataset module in TensorFlow. KafkaDataset feeds Kafka streaming data directly into TensorFlow's graph. As a part of TensorFlow (in `tf.contrib`), the implementation of KafkaDataset is mostly written in C++. The module exposes a machine-learning-friendly Python interface through TensorFlow's `tf.data` API. A KafkaDataset can be fed directly to `tf.keras` and other TensorFlow modules for training and inference.
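As a rough illustration of the interface described above, the following is a minimal sketch rather than the demo from the talk. It assumes a TensorFlow 1.x build that still ships `tf.contrib` with Kafka support, and a broker reachable at `localhost:9092`. The topic name `sensor-data`, the consumer group `demo`, the comma-separated message layout (four features followed by a label), the model, and the helper names `subscription`, `build_dataset`, and `train` are all hypothetical.

```python
def subscription(topic, partition=0, offset=0, length=-1):
    """Format one KafkaDataset subscription as topic:partition:offset:length.

    A length of -1 means "no limit on the number of messages".
    """
    return "{}:{}:{}:{}".format(topic, partition, offset, length)


def build_dataset(topic="sensor-data", servers="localhost:9092", batch_size=32):
    """Build a tf.data pipeline that reads messages from a Kafka topic."""
    # Imported lazily so the helper above works without TensorFlow installed.
    # Assumption: TensorFlow 1.x, where tf.contrib is still available.
    import tensorflow as tf
    from tensorflow.contrib.kafka import KafkaDataset

    # eof=True ends the dataset once the topic has been read to its end.
    dataset = KafkaDataset([subscription(topic)], servers=servers,
                           group="demo", eof=True)

    def parse(message):
        # Hypothetical message layout: "f1,f2,f3,f4,label".
        fields = tf.decode_csv(message, record_defaults=[[0.0]] * 5)
        return tf.stack(fields[:4]), fields[4]

    return dataset.map(parse).batch(batch_size)


def train(steps_per_epoch=10):
    """Fit a small Keras model directly on the Kafka-backed dataset."""
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # The dataset is passed to fit() like any other tf.data.Dataset.
    # The topic must hold at least steps_per_epoch batches of messages.
    model.fit(build_dataset(), epochs=1, steps_per_epoch=steps_per_epoch)
    return model
```

Calling `train()` requires a running broker with messages in the topic. The point of the sketch is only that the Kafka-backed dataset is mapped, batched, and handed to `fit()` like any other `tf.data` pipeline.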

Combined with Kafka streaming itself, the KafkaDataset module in TensorFlow removes the need for an intermediate data processing infrastructure between the stream and the model. This makes it easier for many mission-critical real-time applications to adopt machine learning. At the end of the talk, we will walk through a concrete example with a demo to showcase the usage described above.

Bio: Yong Tang is Director of Engineering at MobileIron. His most recent focus is on data processing in machine learning. He is a maintainer and the SIG I/O lead of the TensorFlow project. He received the Open Source Peer Bonus Award from Google for his contributions to TensorFlow, and is the author of the KafkaDataset module in TensorFlow. In addition to TensorFlow, Yong Tang contributes to many other open source projects. He is a maintainer of Docker, CoreDNS, and SwarmKit. Yong Tang received his PhD in Computer Science & Engineering from the University of Florida.