Abstract: Kafka has been widely adopted as the data streaming backbone of many leading companies. However, this rising adoption has introduced new challenges. In particular, growing cluster sizes, the increasing volume and diversity of user traffic, and aging network and server components have led to significant management overhead. Getting near-optimal performance from such an infrastructure service, maintaining its availability in the face of cascading failures, and achieving these objectives with minimal overhead are critical but non-trivial tasks. Hence, human intervention alone tends to be insufficient for providing both reactive and proactive mitigation measures.

In this talk, we will share our work and experience in alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn. The talk will consist of three parts. The first part will provide an overview of Cruise Control, including the operational challenges that it solves, its high-level architecture, and some evaluation results from real-world scenarios. The second part will be a hands-on tutorial demonstrating how to manage a real Kafka cluster using Cruise Control. Finally, the third part will be a Q&A session.

Bio: Zorn Hsu joined LinkedIn in 2016 as a member of the Site Reliability Engineering team, where he operates Apache Kafka and ZooKeeper at scale. He sometimes firefights production Kafka issues. Drawing on that firefighting experience, he proposes features and provides feedback to Cruise Control, helping make Kafka operations less painful for everyone.