Simplifying Data Science with Delta Lake and MLflow


Although machine learning algorithms and open source libraries have advanced greatly in the last decade, many challenges remain in building production data science and machine learning applications. Data science teams still spend the majority of their time acquiring and cleaning input data, and once a team launches an application, it has to spend substantial effort just to keep it running. At Databricks, we have seen these challenges across thousands of organizations and domains, so we recently launched two open source projects to simplify data operations and machine learning. Delta Lake is a transactional layer on top of data lake storage such as S3 or HDFS that enables reliable data pipelines, rollback, time travel, and multi-stage bronze/silver/gold patterns for managing production datasets. This allows teams to set up high-quality ingest pipelines and quickly roll back erroneous changes. MLflow, on the other hand, is an open source platform for managing the machine learning lifecycle, including experiments, models, workflows, and deployments. Inspired by internal ML platforms such as Uber Michelangelo and Google TFX, MLflow makes it easy to operate and monitor ML applications so that teams can spend more of their time building new applications. I’ll show how both projects are helping to simplify data science at thousands of organizations, ranging in scale from one data scientist to teams of thousands of users.
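To make the ideas of versioned tables, time travel, and rollback more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Delta Lake's actual API or storage format: the `VersionedTable` class, its `commit`, `read`, and `rollback` methods, and the sample records are all hypothetical names and data invented for illustration. It only shows the general pattern of publishing immutable snapshots so that an earlier version of a bronze or silver table can be read or restored.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy append-only snapshot log (hypothetical; not Delta Lake's API)."""
    # Version 0 is the empty table.
    _versions: list = field(default_factory=lambda: [[]])

    def commit(self, rows):
        """Publish a new snapshot and return its version number."""
        self._versions.append([dict(r) for r in rows])
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one ("time travel")."""
        snapshot = self._versions[-1 if version is None else version]
        return [dict(r) for r in snapshot]

    def rollback(self, version):
        """Restore an earlier snapshot by committing it again as the newest version."""
        return self.commit(self.read(version))


bronze = VersionedTable()
silver = VersionedTable()

# Bronze: raw ingested records, kept as they arrived (made-up sample data).
bronze.commit([
    {"user": "a", "amount": "10"},
    {"user": "b", "amount": "oops"},
    {"user": "c", "amount": "7"},
])

# Silver: cleaned records derived from bronze.
good = silver.commit([
    {"user": r["user"], "amount": int(r["amount"])}
    for r in bronze.read()
    if r["amount"].isdigit()
])

# A buggy job later overwrites silver with bad data.
bad = silver.commit([{"user": "a", "amount": -1}])

# Rolling back republishes the last good snapshot as a new version.
restored = silver.rollback(good)
```

Because old snapshots are never mutated in this sketch, the bad version remains readable for debugging via `silver.read(bad)` even after the rollback. A real transactional layer also has to handle concurrent writers and durable storage, which this toy ignores.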


Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award and several best paper awards.

Open Data Science




