Distributed Training on Multi-Node Multi-GPU of Deep Neural Networks

Abstract: In the last year there have been a number of attempts to train deep CNNs on the ImageNet dataset in the shortest time possible, with the most recent attempt managing to do it in 15 minutes. All of these attempts ran on custom clusters that are out of reach for most data scientists.

One of the key advantages of the cloud is being able to scale out compute resources as required. In this talk we will present two platforms for running distributed deep learning in the cloud, both of which are within the reach of every data scientist. The first is a service called Batch AI, which uses the Azure Batch infrastructure to run deep learning jobs at scale across GPUs. The second is an open source toolkit that allows data scientists to spin up clusters in turn-key fashion. It utilises Kubernetes and Grafana for job scheduling and monitoring, and it has been used in daily production by Microsoft internal groups. Both platforms utilise Docker containers, making it possible to run any deep learning framework on them. We will use these platforms to train a ResNet network on the ImageNet dataset using each of the following frameworks: CNTK, TensorFlow (Horovod), PyTorch, MXNet and Chainer. We will then compare and contrast the performance improvement as we scale the number of nodes, and provide tips and details of the pitfalls of each framework and platform. The examples presented can also be used as templates so that others can utilise them for their own deep learning problems.
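As a rough illustration of how performance could be compared as the number of nodes grows, the sketch below computes speedup and scaling efficiency from measured training throughput. The function name `scaling_efficiency` and the throughput figures are hypothetical placeholders, not results from the talk.

```python
def scaling_efficiency(throughputs):
    """Return {nodes: (speedup, efficiency)} relative to the smallest run.

    throughputs maps a node count to measured images/sec for that run.
    Speedup is throughput divided by the baseline throughput; efficiency
    is the speedup divided by the ideal (linear) speedup.
    """
    base_nodes = min(throughputs)
    base_rate = throughputs[base_nodes]
    results = {}
    for nodes in sorted(throughputs):
        speedup = throughputs[nodes] / base_rate
        ideal = nodes / base_nodes
        results[nodes] = (speedup, speedup / ideal)
    return results


# Placeholder measurements (images/sec), for illustration only.
example = {1: 1000.0, 2: 1900.0, 4: 3600.0}

for nodes, (speedup, eff) in scaling_efficiency(example).items():
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

An efficiency close to 100% would indicate near-linear scaling; lower values usually point to communication or data-loading overhead.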

Bio: Ilia Karmanov is a Data Scientist at Microsoft in the Algorithms and Data Science Group within the Artificial Intelligence and Research Division. Ilia works on implementing deep learning solutions to industry problems in fields such as computer vision and natural language processing. He also implements state-of-the-art architectures and ML methods from Microsoft Research, and more generally from arXiv, on Microsoft Azure, and disseminates them through blog posts and conference talks. Prior to Microsoft, Ilia worked in statistical consulting, and before that he was an Economist at the International Growth Centre, a joint University of Oxford/LSE (London School of Economics) research centre. He holds an MSc in Economics from the LSE. Ilia is particularly interested in (i) the statistical theory behind deep learning and how neural networks generalize, and (ii) benchmarking deep learning frameworks.

Open Data Science Conference