Editor’s note: Yinxi is a speaker for ODSC East 2022. Be sure to check out her talk alongside Sean Owen, “MLOps: Relieving Technical Debt in ML with MLflow, Delta and Databricks,” to learn more about MLOps!

The past decade has seen huge adoption of Machine Learning (ML) across all industries. The core of any ML solution is a set of data transformation and operation pipelines that are executed to produce a model mapping input data to a prediction. Data scientists can implement and train an ML model on an offline dataset with a Jupyter notebook or an IDE on a local machine.


However, getting a model into the real world and generating business value from it involves more than just building it. How do you deploy, serve, and run inference on models at scale? Throughput and latency requirements determine the choice of deployment approach: batch deployment is suitable when inferences run hourly or daily over large batches of inputs, while real-time deployments respond to users’ requests as they are received. For compute-intensive models, the serving framework also needs to load balance across parallel requests.
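The batch/real-time distinction above can be sketched with a toy model. This is an illustrative example, not from the article; the function names (`score_batch`, `handle_request`) and the toy linear model are hypothetical:

```python
def predict(features: dict) -> int:
    """Toy model: a simple linear score over two hypothetical features."""
    return 2 * features["clicks"] + features["visits"]

def score_batch(rows: list) -> list:
    """Batch deployment: score a large set of inputs on a schedule."""
    return [predict(r) for r in rows]

def handle_request(request: dict) -> dict:
    """Real-time deployment: score one request at the time it arrives."""
    return {"prediction": predict(request)}

rows = [{"clicks": 10, "visits": 2}, {"clicks": 4, "visits": 8}]
print(score_batch(rows))        # → [22, 16], scores for the whole batch
print(handle_request(rows[0]))  # → {'prediction': 22}, one low-latency response
```

The same trained model backs both modes; what changes is the surrounding serving infrastructure and its throughput and latency guarantees.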

Other considerations include how to validate data quality and monitor model performance continuously. Google’s paper “Machine Learning: The High Interest Credit Card of Technical Debt” highlights that building an integrated ML system and operating it in production requires vast and complex surrounding infrastructure (see the figure below).

Why MLOps?

ML models in production interact with the entire software system and codebase. Data ingestion and processing modules supply data to the trained ML models, and model predictions are in turn fed back into downstream pipelines and data collection blocks. The complete development pipeline includes three key components:

  • Data
  • ML Model
  • Code

A robust system should be able to continuously monitor and deliver when any of these components changes, as illustrated in the figure below. In production, the system keeps ingesting new data, so data verification and monitoring jobs are crucial. When new data drifts away from the distribution of the historical data on which the model was trained, or model performance deteriorates after the model has been deployed for some time, the model needs to be retrained.
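A minimal drift check can make the retraining trigger concrete. This sketch (not from the article) flags retraining when new data’s mean shifts more than a chosen number of training standard deviations from the training mean; the function name and the threshold of 2.0 are illustrative assumptions, and production systems typically use richer distribution tests:

```python
import statistics

def needs_retraining(train_values, new_values, threshold=2.0):
    """Flag retraining when new data drifts far from the training distribution.

    Drift is measured as the shift of the new mean from the training mean,
    in units of training standard deviations. The threshold is illustrative.
    """
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    drift = abs(statistics.mean(new_values) - mu) / sigma
    return drift > threshold

train = [10, 12, 11, 13, 12, 11, 10, 12]
print(needs_retraining(train, [11, 12, 10, 13]))  # similar data → False
print(needs_retraining(train, [25, 27, 26, 24]))  # clearly drifted → True
```

In a real pipeline this check would run as a scheduled monitoring job, and a positive result would kick off the retraining workflow rather than just print a flag.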

In summary, the design of an ML system starts from the requirements of a given business problem. Data scientists develop models that solve the problem and meet the success criteria. Once the model is fully trained and evaluated, it can be deployed. But maintaining and operating the system does not stop there; it is iterative and continuous. Data and the model have to be validated and monitored at each iteration. This is the end-to-end ML life cycle, and we need processes that help automate its management, or MLOps.
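Each iteration of this life cycle produces a new model version that must be evaluated before it replaces the one in production. Tools like MLflow’s Model Registry manage this; as a stand-in, here is a minimal pure-Python sketch of the idea (the class and method names are hypothetical, not MLflow’s API):

```python
class ModelRegistry:
    """Hypothetical stand-in for a model registry: each registered version
    records its evaluation metrics, and the best one is promoted."""

    def __init__(self):
        self.versions = []      # list of (version, model, metrics)
        self.production = None  # currently promoted version number

    def register(self, model, metrics):
        """Record a new model version along with its evaluation metrics."""
        version = len(self.versions) + 1
        self.versions.append((version, model, metrics))
        return version

    def promote_best(self, metric="accuracy"):
        """Promote the version with the highest value of `metric`."""
        best = max(self.versions, key=lambda v: v[2][metric])
        self.production = best[0]
        return self.production

registry = ModelRegistry()
registry.register("model-a", {"accuracy": 0.81})
registry.register("model-b", {"accuracy": 0.87})
print(registry.promote_best())  # → 2, model-b wins on accuracy
```

Keeping versioned models alongside their metrics is what makes the "validate and monitor at each iteration" step auditable and repeatable.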

What is MLOps?

MLOps essentially needs to wrap around data, ML models, and code. In other words, MLOps = DataOps + ModelOps + DevOps. Now that we know what MLOps is and why you need it, we will introduce tools that can help manage data (Delta Lake and Feature Store) and models (MLflow) and share best-practice recommendations in our ODSC talk. So stay tuned!

About the author/ODSC East 2022 Speaker:

Yinxi Zhang is a Senior Data Scientist at Databricks with 7+ years of industry experience in end-to-end ML development. As a Brickster, she teaches Scalable Machine Learning, Deep Learning, and MLOps courses and helps clients develop their ML solutions. A former marathon runner, she is now a yogi.