Boston | April 14th – April 17th, 2020

Data Engineering & MLOps

Focusing on the practice, engineering, workflows and DataOps of data science

Understand the Practice of Data Engineering in the Real World

As data science extends its reach across an enterprise, the need for better management, workflow, production and deployment practices increases. The challenges of deploying and monitoring models in production, managing data science workflows and teams, and understanding ROI are a few of the issues organizations wrestle with.

Learn best practices for effective data science management

Sessions in this broad focus area will look at use cases, best practices, and stories from the field to show how to effectively incorporate data science practice into the wider business process. This focus area will look beyond data sourcing and modeling towards the many challenges teams need to overcome to effectively apply data science in their organization.


Sample Talk, Workshop, and Training Sessions

MLOps & Data Engineering Sessions
Friday, April 17th
Thursday, April 16th
Tuesday, April 14th
Wednesday, April 15th
09:30 - 12:30
It’s a Breeze to Contribute to Apache Airflow

Training | MLOps & Data Engineering | ML for Programmers | Beginner-Intermediate

By attending this workshop, you will learn how to become a contributor to the Apache Airflow project. You will learn how to set up a development environment, how to pick your first issue, how to communicate effectively within the community, and how to make your first PR. Experienced committers of the Apache Airflow project will give you step-by-step instructions and guide you through the process. When you finish the workshop, you will be equipped with everything you need to make further contributions to the Apache Airflow project.

You will have a chance to pick a simple issue and take it through the whole process: implementation, testing, review, and hopefully merging the changes. We will pick some simple changes for you so that you have a chance to implement them during the workshop...

Tomasz Urbaszek
Software Engineer & Apache Airflow Committer | Polidea & Apache Software Foundation
Jarek Potiuk
Principal Software Engineer & Apache Airflow PMC Member | Polidea & Apache Software Foundation
Session Title by Joy Payton Coming Soon!

Workshop

Joy Payton
Supervisor, Data Education | Children's Hospital of Philadelphia
10:40 - 12:10
Kedro + MLflow – Reproducible and Versioned Data Pipelines at Scale

Workshop | MLOps & Data Engineering | Intermediate

Kedro is a development workflow tool open-sourced by QuantumBlack, a McKinsey company. Many data science teams have started using the library for their pipelines but are unsure how to integrate it with model tracking tools such as MLflow. In this tutorial, we will demonstrate how Kedro and MLflow fit together in a scalable AI architecture and how to leverage the best of both. To start, we will give an overview of Kedro and MLflow:

  • What are they used for?

  • What functionality do they provide?

  • How do they compare as tools?

Next, we will walk through a demo of a Kedro project that has MLflow integrated into it. Finally, we will go over deployment options…

Tom Goldenberg
Junior Principal Data Engineer | QuantumBlack
12:45 - 13:30
Accelerate ML Lifecycle with Kubernetes and Containerized Data Science Tools

Talk | ML for Programmers | MLOps & Data Engineering | Beginner-Intermediate

Kubernetes and container platforms provide the agility, flexibility, scalability, and portability that data scientists need to train, test, and deploy ML models quickly, without IT dependency. The session will provide an overview of containers and Kubernetes, and how these technologies can help solve the challenges faced by data scientists, ML engineers, and application developers. Next, we will review the key capabilities required in a container and Kubernetes platform to help data scientists easily use technologies such as Jupyter Notebooks, ML frameworks, and programming languages to innovate faster. Finally, we will share the available platform options (e.g. Red Hat OpenShift, Kubeflow, etc.) and some examples of how data scientists are accelerating their ML initiatives with container and Kubernetes platforms…

Abhinav Joshi
Sr. Principal Marketing Manager | Red Hat
Tushar Katarki
Sr. Principal Product Manager | Red Hat
13:00 - 16:00
Ray: A System for High-performance, Distributed Python Applications

Training | MLOps & Data Engineering | Intermediate

Ray is an open-source distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications from a laptop to a cluster, with an emphasis on ML/AI systems, such as reinforcement learning. It is now used in many production deployments. In this tutorial, we’ll use several hands-on examples to explore the problems that Ray solves and the useful features it provides, such as rapid distribution, scheduling, and execution of “tasks” and management of distributed stateful “serverless” computing. We’ll see how it’s used in several ML libraries (and play with examples using those libraries). You’ll learn when to use Ray and how to use it in your projects…

Dean Wampler, PhD
Head of Developer Relations, Author | Anyscale
13:15 - 14:45
SQL Deep Dive for Data Science

Workshop | ML for Programmers | MLOps & Data Engineering | Intermediate-Advanced

Organizations have long used relational databases for a wide variety of data-intensive applications. Data scientists and analysts need to understand how to work with relational databases, particularly for common data science tasks such as finding, exploring, analyzing, and extracting data within a relational database. SQL is an expressive declarative language designed for working with tabular data. Since its inception, new features have been added that allow for complex expressions and computation, such as regular expressions, sliding window functions, and analytical operations that create non-relational results. Understanding the advanced features of SQL will help data scientists reduce the need for programming custom data manipulation functionality…

Dan Sullivan, PhD
AI and Cloud Architect & Author | New Relic, Inc | Wiley
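
As a small illustration of the sliding window functions mentioned in the SQL Deep Dive abstract above, here is a minimal sketch and not material from the session itself. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or later (the first release with window functions); the daily_sales table and its values are hypothetical, invented only for this example.

```python
import sqlite3

# Hypothetical sample data, invented for illustration: (day, amount).
rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day INTEGER, amount REAL)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

# Sliding window: average of the current row and the row before it,
# computed in SQL instead of a hand-written loop.
query = """
SELECT day,
       AVG(amount) OVER (
           ORDER BY day
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM daily_sales
ORDER BY day
"""
moving_averages = conn.execute(query).fetchall()
conn.close()

print(moving_averages)
```

The window frame (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) is what makes the average "slide"; widening it changes the smoothing without any change to application code.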
13:15 - 14:45
Streaming Decision Intelligence and Predictive Analytics with Spark 3

Workshop | MLOps & Data Engineering | Machine Learning | Beginner-Intermediate

The speed at which intelligent decision making can take place is changing the very fabric of modern industry. From autonomous routing decisions for real-time traffic avoidance to intelligent integrated systems for customer service and assisted analysis or forecasting, the time to a correct decision can be a huge differentiator. During this workshop, you will learn how to harness the power of Apache Spark to build a data ingestion pipeline that can process and learn from streaming data. These learnings can then be applied to make simple decisions using SparkSQL in a parallel streaming system…

Scott Haines
Principal Software Engineer | Twilio
13:35 - 14:20
Simplifying Data Science with Delta Lake and MLflow

Track Keynote | MLOps & Data Engineering | Intermediate

Although machine learning algorithms and open source libraries have greatly advanced in the last decade, many challenges remain in building production data science and machine learning applications. Data science teams still spend the majority of their time acquiring and cleaning input data, and once a team launches an application, it has to spend a substantial amount of effort just to keep it running. At Databricks, we have experienced these challenges across thousands of organizations and domains, so we recently launched two new open source projects to simplify data operations and machine learning. Delta Lake is a transactional layer on top of data lake storage such as S3 or HDFS that enables reliable data pipelines, rollback, time travel, and multi-stage bronze/silver/gold patterns for managing production datasets. This allows teams to set up high-quality ingest pipelines and rapidly roll back errors. MLflow, on the other hand, is an open source platform for managing the machine learning lifecycle, including experiments, models, workflows, and deployments…

Matei Zaharia, PhD
Professor, Co-Founder & Chief Technologist | Stanford, Databricks
13:35 - 14:20
Distributed Training Platform at Facebook

Talk | Deep Learning | MLOps & Data Engineering | Intermediate-Advanced

Large-scale distributed training has become an essential element in scaling the productivity of ML engineers. Today, ML models are getting larger and more complex in terms of compute and memory requirements. The amount of data we train on at Facebook is huge. In this talk, we will learn about the Distributed Training Platform that supports large-scale data and model parallelism. We will touch on Distributed Training support for PyTorch and how we are offering a flexible training platform for ML engineers to increase their productivity at Facebook scale…

Mohamed Fawzy
Senior Engineering Manager | Facebook
Kiuk Chung
Software Engineer | Facebook
14:15 - 15:45
Data, I/O, and TensorFlow: Building a Reliable Machine Learning Data Pipeline

Workshop | Deep Learning | MLOps & Data Engineering | Advanced

In this tutorial, we will guide you through hands-on examples of integrating a tf.keras model with different data input sources through tf.data in production environments, from simple CSV/JSON files to SQL databases and cloud data warehouse services such as Google Cloud BigQuery. We will also cover Apache Kafka as a data input to illustrate the streaming data pipeline architecture for continuous data processing with machine learning. As a bonus, attendees will learn the basics of distributed machine learning and its production usage…

Yong Tang, PhD
Director of Engineering | MobileIron

See all our talks and hands-on workshop and training sessions

What You’ll Learn

Data science has many focus areas. The goal of this track is to accelerate your knowledge of data science through a series of introductory-level training sessions, talks, tutorials, and workshops on the most important data science tools and topics.

  • Experimentation to Production

  • Data Science DevOps

  • Agile Data Science

  • Data Science Architecture

  • Runtime Pipelines

  • Model Monitoring & Auditing

  • Model Depreciation in Production

  • Manage Data Science in Your Organization

  • Collaborative Practices and Tools

  • Team Management

  • Data Science Workflows

  • Data Provenance & Governance

  • Best Practices & Use Cases

  • Cross Industry & Cross Enterprise Challenges

Why Attend?

Accelerate and broaden your knowledge of key areas in data science, including deep learning, machine learning, and predictive analytics

With numerous introductory-level workshops, you get hands-on experience to quickly build your skills

Post-conference, get access to recorded talks online and learn from more than 100 high-quality recorded sessions that let you review content at your own pace

Take time out of your busy schedule to accelerate your knowledge of the latest advances in data science practice and management

Learn directly from world-class instructors who are authors of and contributors to many of the tools and languages used in data science today

Meet hiring companies, ranging from hot startups to Fortune 500, looking to hire professionals with data science skills at all levels

Network at our numerous lunches and events to meet with data scientists, enthusiasts, and business professionals

Get access to other focus area content, including machine learning & deep learning, data visualization, and much more

Who Should Attend

Data science is cross-industry and cross-enterprise, impacting many different departments across job roles and functions. This track is not only for data scientists of all levels but also for anyone interested in the practice and management of data science, including:

  • Data scientists who are moving beyond model experimentation and want to understand production workflows

  • Data scientists seeking to improve the overall practice of management and development

  • Anyone interested in understanding better collaborative and agile management techniques as applied to data science

  • Business professionals and industry experts looking to understand data science in practice

  • Software engineers and technologists who need to work with data science workflows and understand the unique requirements of these systems

  • CTOs, CDSs, and others in managerial roles that require a bigger-picture view of data science

  • Technologists in fields such as DevOps, databases, and project management who are looking to break into data science

  • Students and academics looking for more practical applied training in data science tools and techniques

Sign Up for ODSC EAST 2020 | April 14th – April 17th
