TensorFlow? PyTorch? Keras? There are many popular frameworks to choose from when working with deep learning and machine learning models, each with its own pros and cons for practical usability in product development or research. Once you decide which to use to train your model, you need to figure out how to deploy it on both your platform and architecture of choice. Cloud? Windows? Linux? IoT? Performance sensitive? How about acceleration using GPUs?
With hundreds to thousands of different combinations for deploying a trained model using a chosen framework, it can be extremely challenging to optimize and manage deployment strategies and environments for performant inferencing in production. How can we streamline this for ease and consistency?
Standardization for ML model formats
To address the challenges created by the fragmented AI ecosystem, the Open Neural Network Exchange (ONNX) format was introduced in late 2017 as a community-driven, open-source standard for representing both deep learning and traditional machine learning models.
For data scientists and developers, ONNX provides the freedom to use their preferred framework while minimizing downstream performance challenges across a variety of platforms and hardware.
For hardware manufacturers challenged with supporting the breadth of machine learning frameworks, ONNX offers a standard specification to guide innovation and broaden the reach of hardware designed to accelerate deep learning workloads.
In short, framework interoperability helps maximize productivity and accelerates the path from ideation to production by removing framework limitations.
Honoring the mission of interoperable AI, the ONNX community has contributed many tools to convert models and run them performantly. Models trained in various frameworks can be converted to the ONNX format using tools such as TensorFlow-ONNX and ONNXMLTools (which covers Keras, Scikit-Learn, CoreML, and more). PyTorch supports native ONNX export as of version 1.2. Additionally, the ONNX Model Zoo provides popular, ready-to-use pre-trained models.
Models in the ONNX format can be inferenced using ONNX Runtime, an open-source engine for high-performance inferencing with hardware acceleration. ONNX Runtime offers cross-platform APIs for Linux, Windows, and macOS with support for x86, x64, and ARM architectures. Python, C#, C++, and C APIs give developers the flexibility to integrate the library into their software stacks. ONNX Runtime is also built directly into Windows 10 (version 1809 and later) as part of Windows Machine Learning.
ONNX Runtime is designed to prioritize extensibility and performance and is compatible with a wide range of hardware options.
The runtime provides complete, optimized CPU implementations of all operators in the ONNX spec from v1.2 (opset 7) onwards, along with backward and forward compatibility to eliminate the pain of versioning incompatibilities. This ensures that any valid ONNX model can be successfully loaded and executed.
In addition to this, the extensible architecture supports graph optimizations which improve performance across a variety of hardware. ONNX Runtime leverages custom accelerators, computation libraries, and runtimes whenever possible, removing the complexities of using multiple hardware-specific libraries to accelerate models. This means that ONNX Runtime can take advantage of the supported computation acceleration available on your machine, and if/when an operator is not supported, will fall back to the CPU implementation. Current supported acceleration options include Intel® MKL-DNN, Intel® nGraph, NVIDIA CUDA, NVIDIA TensorRT, and the Intel® Distribution of OpenVINO™ Toolkit.
Production Deployment with ONNX Runtime
After a model is converted to the ONNX format and a compute target is selected, it is ready to be deployed for inferencing. ONNX Runtime is capable of serving high-volume workloads and already powers dozens of high-traffic services at Microsoft across Office, Windows, Bing, Azure Cognitive Services, and many other product groups.
For cloud solutions, Azure Machine Learning can be used to deploy the ONNX model along with the inference runtime as a web service running on your preferred compute (CPU, GPU, etc.). Models can also be deployed on IoT devices using Azure IoT Edge. Example reference implementations are available for the NVIDIA Jetson Nano, accelerated by TensorRT, and for the breadth of devices supported by the Intel® OpenVINO™ toolkit. To load and run inferencing locally, simply install the published package from NuGet or PyPI for use in your application. If you’re using Windows 10, ONNX Runtime is already built into Windows ML. Regardless of your deployment strategy, the same model can be served using ONNX Runtime across a variety of platforms and technology stacks.
We hope to see you at our workshop at ODSC Europe in London, where we’ll walk through some of these conversion and deployment flows and demonstrate how to operationalize this for your models. In the meantime, check out the ONNX and ONNX Runtime GitHub repositories for more details, examples, and information. See you in November!