Cloud Native Data Science with Dask

Abstract: Python has become a great language for data science. Libraries like NumPy, pandas, and Scikit-Learn provide high-performance, pleasant APIs for analyzing data. However, they’re focused on single-core, in-memory analytics, and so don't scale out to very large datasets or clusters of machines. That's where Dask comes in.

Dask is a library that natively scales Python. It works with libraries like NumPy, pandas, and Scikit-Learn to operate on datasets in parallel, potentially distributed on a cluster.

Moving to a cloud-native data science workflow will make you and your team more productive. You'll be able to iterate more quickly through the cycle of data collection, visualization, modeling, testing, and deployment.

Attendees will learn the high-level user interfaces Dask provides, such as dask.array and dask.dataframe. These let you write regular Python, NumPy, or pandas code that is then executed in parallel on datasets that may be larger than memory. We'll learn through hands-on exercises, and each attendee will be provided with their own Dask cluster to develop and run their solutions.

Dask is a flexible parallelization framework; we'll demonstrate that flexibility with some machine-learning workloads. We'll use Dask to easily distribute a large scikit-learn grid search so that it runs on a cluster of machines. We'll use Dask-ML to work with larger-than-memory datasets.
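One common way to hand a scikit-learn grid search to Dask is through joblib's Dask backend. The sketch below assumes a small synthetic dataset, an SVC estimator, and an in-process local cluster, none of which come from the tutorial; on a real deployment the Client would point at the cluster's scheduler address instead.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: a local, threads-only cluster standing in for a real one.
client = Client(processes=False)

# Small synthetic dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical parameter grid: 3 values of C x 2 kernels = 6 candidates.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)

# Inside this context, scikit-learn's internal joblib calls are
# dispatched to the Dask workers instead of local processes.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```

This pattern suits compute-bound searches where the data fits in memory on each worker; for data that does not fit, Dask-ML's estimators are the more appropriate tool.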

We'll see how Dask can be deployed on Kubernetes, taking advantage of features like auto-scaling, where new worker pods are automatically created or destroyed based on the current workload.

Bio: Tom is a Data Scientist and developer at Anaconda and works on open source projects including Dask and pandas. Tom’s current focus is on scaling out Python's machine learning ecosystem to larger datasets and larger models.