Abstract: This class is aimed at machine learning practitioners who are already familiar with the basics of Deep Learning or Apache Spark. We will train and deploy distributed TensorFlow models written in Python using only open-source platforms, including TensorFlow, Spark, Jupyter Notebook, and Hops Hadoop. We will show you how to scale out training to reduce training time by an order of magnitude for certain classes of Deep Neural Networks. We will show you, step by step, how Facebook reduced the training time of a ConvNet on the ImageNet dataset by a factor of more than 200 using a 256-GPU cluster. We will also show you how to embed Deep Neural Networks (DNNs) in production pipelines for both training and inference using Apache Spark. In particular, we will look at Yahoo's TensorFlowOnSpark as one such platform.
We will explain the trade-offs in scaling out the whole stack: from the network level (gRPC, RDMA, or MPI), through using multiple GPUs per host, up to model-level training decisions such as batch size and whether to use synchronous or asynchronous distributed training.
In practice, we will work with real-world datasets (including ImageNet) that can be downloaded with just a couple of clicks using Open Datasets from Hops Hadoop. We will base our tutorial on exercises given to students in Sweden's first graduate course on Deep Learning, ID2223 at KTH. We will provide both a virtual machine instance and, if desired, access to the same cluster used in our course at www.hops.site.
All code and datasets are 100% open source. Code is available on GitHub at https://github.com/hopshadoop, and datasets are available from Hops Hadoop at www.hops.io.
Bio: Robin Andersson is a researcher at RISE SICS, working on the Hops project. Robin received his master's degree in distributed systems from KTH in Stockholm. His recent work includes support for GPUs as a schedulable and isolable resource in Hops YARN, making Hops the first Hadoop distribution to natively support GPUs as a resource. He has previously integrated Yahoo's TensorFlowOnSpark into Hops YARN and is currently working on integrating Uber's Horovod.
Research Engineer at RISE SICS