Abstract: This class is aimed at machine learning practitioners who are already familiar with the basics of Deep Learning or Apache Spark. We will train and deploy distributed TensorFlow models written in Python using only open source platforms, including TensorFlow, Spark, Jupyter Notebook, and Hops Hadoop. We will show you how to scale out training to reduce training time by an order of magnitude for certain classes of Deep Neural Networks. We will show you, step by step, how Facebook reduced training time for a ConvNet trained on the ImageNet dataset by a factor of 200+ on a 256-GPU cluster. We will also show you how to embed Deep Neural Networks (DNNs) in production pipelines for both training and inference using Apache Spark. In particular, we will look at Yahoo’s TensorFlowOnSpark as one such platform.
We will explain the trade-offs in scaling out the whole stack, from the network level (and the choice of gRPC, RDMA, or MPI), to using multiple GPUs per host, to model-level training decisions such as batch size and when to use synchronous or asynchronous distributed training.
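To make the batch-size trade-off concrete, here is a minimal sketch of the linear learning-rate scaling heuristic with gradual warmup, which is commonly associated with large-minibatch synchronous training on ImageNet. The function names and the numbers used (a base learning rate of 0.1 per 256 images, 32 images per GPU) are illustrative assumptions, not the exact configuration used in the class.

```python
def scaled_learning_rate(base_lr, base_batch, per_gpu_batch, num_gpus):
    """Linear scaling heuristic: grow the learning rate with the global batch.

    In synchronous data-parallel training every GPU processes per_gpu_batch
    examples per step, so the effective (global) batch is their product.
    """
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_batch


def warmup_learning_rate(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps steps."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Illustrative values (assumptions): 256 GPUs, 32 images per GPU.
target = scaled_learning_rate(base_lr=0.1, base_batch=256,
                              per_gpu_batch=32, num_gpus=256)
halfway = warmup_learning_rate(step=500, warmup_steps=1000,
                               start_lr=0.1, target_lr=target)
print(target, halfway)
```

With asynchronous training there is no single global batch per step, so a rule of this kind does not carry over directly; that difference is one of the trade-offs we will discuss.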
Practically, we will work with real-world datasets (including ImageNet) that can be downloaded with just a couple of clicks using Open Datasets from Hops Hadoop. We will base our tutorial on exercises given to students in Sweden’s first graduate course on Deep Learning – ID2223 at KTH. We will provide both a virtual machine instance and (if desired) access to the same cluster used in our course, at www.hops.site.
All code and datasets are 100% open source. Code is available on GitHub at https://github.com/hopshadoop, while datasets are available from Hops Hadoop at www.hops.io.
Bio: Fabio Buso is a research engineer at RISE SICS. He works on the Hops platform, where he has led the development of a strongly consistent metastore for Apache Hive on Hops. He is currently developing a Feature Store for machine learning on Hops and working on the integration of TensorFlow.
Fabio has an international background, holding a master's degree in Cloud Computing and Services, with a focus on data-intensive applications, from a joint program between KTH Stockholm and TU Berlin.