Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds

Abstract: Cloud storage brings great flexibility in management and cost-efficiency to data scientists, but it also introduces new challenges related to data accessibility and data locality for machine learning applications. For instance, when the input data is stored in a remote cloud storage system like AWS S3 or Azure Blob Storage, direct data access is often slow and expensive; yet manually moving data to the training clusters can be time-consuming and complicated, and often requires data engineering or ETL pipelines.

This session is designed for data scientists and data engineers who work with remote and possibly multiple data sources in hybrid or multi-cloud environments. We will guide the audience in using Alluxio to greatly simplify data preparation in these environments, covering the following topics:
- How to set up a POSIX endpoint for the Alluxio service to unify file system access to S3, HDFS, and Azure Blob Storage;
- How to run Apache Spark to read input from and write output to remote storage with Alluxio as the distributed data caching layer;
- How to run TensorFlow to train models against remote input data as if it were on the local file system.
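
As a preview, the workflow above might be sketched roughly as follows. This is a minimal illustration, not the session material itself: the bucket name, mount points, hostname, and credentials are placeholders, and exact commands and flags vary by Alluxio version.

```shell
# Mount an S3 bucket into the Alluxio namespace
# (bucket name and credential values are placeholders).
alluxio fs mount /s3 s3://my-training-data \
  --option s3a.accessKeyId=<ACCESS_KEY> \
  --option s3a.secretKey=<SECRET_KEY>

# Expose the Alluxio namespace as a local POSIX mount via FUSE,
# so TensorFlow (or any POSIX application) can read the remote
# data as local files, with Alluxio caching hot data near compute.
alluxio-fuse mount /mnt/alluxio /
ls /mnt/alluxio/s3/

# Spark reads input and writes output through the alluxio:// scheme,
# using Alluxio as the distributed caching layer in front of S3
# (the master hostname and job jar are placeholders).
spark-submit --class MyJob my-job.jar \
  alluxio://alluxio-master:19998/s3/input \
  alluxio://alluxio-master:19998/s3/output
```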

Bio: Bin Fan is a PMC member and maintainer of the Alluxio open source project. Prior to joining Alluxio as a founding engineer, he worked at Google on building next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, where his research focused on the design and implementation of distributed systems.