Simplifying Data Science with Delta Lake and MLflow

Abstract: 

Although machine learning algorithms and open source libraries have advanced greatly in the last decade, many challenges remain in building production data science and machine learning applications. Data science teams still spend the majority of their time acquiring and cleaning input data, and once a team launches an application, it has to spend substantial effort just to keep it running. At Databricks, we have seen these challenges across thousands of organizations and domains, so we recently launched two new open source projects to simplify data operations and machine learning. Delta Lake is a transactional layer on top of data lake storage such as S3 or HDFS that enables reliable data pipelines, rollback, time travel, and multi-stage bronze/silver/gold patterns for managing production datasets. This allows teams to set up high-quality ingest pipelines and rapidly roll back errors. MLflow is an open source platform for managing the machine learning lifecycle, including experiments, models, workflows and deployments. Inspired by internal ML platforms such as Uber Michelangelo and Google TFX, MLflow makes it easy to operate and monitor ML applications so that teams can spend more of their time building new applications. I’ll show how both projects are helping to simplify data science at thousands of organizations, ranging in scale from a single data scientist to teams of thousands of users.

Bio: 

Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award and several best paper awards.