How to Reason About Stateful Streaming Machine Learning Serving — Lessons from Production

Abstract: Effectively serving large-scale machine learning systems to create business value is still somewhat of a dark art. Complexities such as train-serve skew, concept drift, latency requirements, champion/challenger replacement, hidden technical debt, and many more crop up in production that are not apparent while crafting a single trained model. Mixing these machine learning concerns with a stateful stream processing engine adds further headaches, chiefly around handling late-arriving and out-of-order data.

At Uptake, we predict whether an industrial asset will fail in the near future. An asset’s history (state) and other data sources, such as weather, are crucial to ensuring an effective, timely prediction. To understand why, imagine a semi-truck whose engine coolant is running too low, causing the engine to overheat. The trend in the engine temperature is an important feature that characterizes that failure mode, and it requires not only the current engine temperature but previous observations as well. These requirements have led us to build a stateful streaming system that serves heterogeneous, low-latency, stateful machine learning models. We have learned lessons from that process and from deploying hundreds of models that make millions of predictions a day.
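To make the engine temperature example concrete, here is a minimal sketch of a trend feature that depends on per-asset state. It is an illustration only, not Uptake's implementation: the class name `TrendState`, the `max_points` parameter, and the asset identifier are all hypothetical, and a least-squares slope over a small bounded history is just one assumed way to define "trend".

```python
from collections import defaultdict, deque


class TrendState:
    """Hypothetical per-asset state: a bounded history of readings."""

    def __init__(self, max_points=5):
        # One bounded buffer of (event_time, temperature) pairs per asset.
        self.history = defaultdict(lambda: deque(maxlen=max_points))

    def update(self, asset_id, event_time, temperature):
        """Store a new reading and return the updated trend feature."""
        self.history[asset_id].append((event_time, temperature))
        return self.slope(asset_id)

    def slope(self, asset_id):
        """Least-squares slope of temperature against event time.

        Returns None when the history cannot define a slope, which is
        exactly the case a stateless model cannot get around.
        """
        points = self.history[asset_id]
        n = len(points)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in points)
        if var_t == 0:
            return None
        cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
        return cov / var_t


state = TrendState()
first = state.update("truck-1", 0, 90.0)     # no trend from one reading
state.update("truck-1", 60, 92.0)
trend = state.update("truck-1", 120, 94.0)   # degrees per second
```

A single reading yields no trend at all, which is why the feature cannot be computed from the current observation alone.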

In this talk, I will share lessons from a data scientist’s perspective on the challenges that we have encountered. In addition to general machine learning serving patterns and the alternatives that exist, attendees can expect to learn concepts such as event time vs. processing time. The way data moves through a streaming system can change how a data scientist estimates model prediction error. It is imperative in this context to know how to reason about data movement in the stateful streaming system and how to weigh the tradeoffs that surface. This talk is primarily aimed at data scientists and engineers who are concerned with moving machine learning workloads from training into production.
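As a rough illustration of event time vs. processing time, the sketch below walks through events in the order they arrive and flags those whose event time falls behind a watermark. The function name `split_late_events`, the `allowed_lateness` parameter, and the watermark rule (highest event time seen so far minus the allowed lateness) are assumptions made for this example, not the behavior of any particular streaming engine.

```python
def split_late_events(events, allowed_lateness):
    """Separate on-time events from late ones.

    `events` is an iterable of (event_time, value) pairs listed in
    arrival (processing time) order. An event is late when its event
    time is older than the watermark at the moment it arrives.
    """
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, value in events:
        if event_time < watermark:
            late.append((event_time, value))
        else:
            on_time.append((event_time, value))
        # Advance the watermark; it never moves backwards.
        watermark = max(watermark, event_time - allowed_lateness)
    return on_time, late


# Arrival order differs from event-time order.
arrivals = [(10, "a"), (20, "b"), (12, "c"), (30, "d"), (11, "e")]
on_time, late = split_late_events(arrivals, allowed_lateness=5)
```

With an allowed lateness of 5, readings "c" and "e" arrive after the watermark has passed their event times. A model evaluated offline on data sorted by event time would see those readings in place, while the served model may not have had them when it made its prediction, which is one way data movement can affect an estimate of prediction error.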