On-Demand Accelerating Deep Neural Network Inference via Edge Computing


Deep neural networks are both computationally and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources, and increasing the time required for inference and training. Many mobile-first companies such as Baidu and Facebook distribute their apps through various app stores and are very sensitive to the size of the binary files. For example, the App Store imposes the restriction that "apps above 100 MB will not download until you connect to Wi-Fi." As a result, a feature that increases the binary size by 100 MB receives far more scrutiny than one that increases it by 10 MB. Running computation-intensive DNN-based tasks on mobile devices therefore remains challenging due to their limited compute resources.

This talk introduces the algorithms and hardware that can be used to accelerate inference and reduce the latency of deep learning workloads. We will discuss how to compress deep neural networks and apply techniques such as graph fusion and kernel auto-tuning to accelerate inference, as well as data and model parallelism, automatic mixed precision, and other techniques for accelerating training. We will also cover specialized hardware for deep learning such as GPUs, FPGAs, and ASICs, including the Tensor Cores in NVIDIA's Volta GPUs and Google's Tensor Processing Units (TPUs). Finally, we will discuss deploying large deep learning models on edge devices such as the NVIDIA Jetson Nano and Google's Edge TPU (Coral). A sketch of one of the compression techniques is shown below.
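As a small taste of the compression techniques covered in the talk, the following minimal sketch applies post-training dynamic quantization in PyTorch to a toy feed-forward model; the architecture and layer sizes are hypothetical placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

# A small example network standing in for a real DNN (hypothetical architecture).
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# stored in int8 and de-quantized on the fly, shrinking the model and often
# speeding up CPU inference on resource-constrained devices.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference proceeds exactly as with the original model.
example_input = torch.randn(1, 256)
with torch.no_grad():
    output = quantized_model(example_input)
print(output.shape)
```

Dynamic quantization stores weights in 8-bit integers and de-quantizes them at runtime, which typically reduces model size by roughly 4x for the quantized layers; other approaches discussed in the talk, such as pruning or quantization-aware training, trade off accuracy, effort, and speedup differently.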


Deepesh Agrawal is an experienced Machine Learning Engineer with a demonstrated history of working in the information technology and services industry; before this, he was a Solution Architect with an NVIDIA partner. He has completed ML and DL projects such as video classification, object detection, and text analysis, and is skilled in Python, C++, data science, and deep learning.