
Abstract: On-device AI model inference is essential for delivering real-time experiences that require execution locally on the phone. However, building a scalable solution that addresses the wide range of hardware specifications and platform requirements across Android and iOS is challenging: devices have limited storage and memory, and app stores impose restrictions on the size of the app package.
ONNX Runtime Mobile is a new feature that addresses the needs of developers building solutions for mobile devices. You can build a reduced-size runtime binary to integrate into your phone application and run inference on your ONNX models locally on the device.
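As a minimal sketch of that flow, the example below uses the onnxruntime Python package, whose session-and-run pattern the mobile (Java, Objective-C, and C/C++) bindings mirror. The model file name and input shape are placeholder assumptions, and the model is assumed to have been converted offline to the ORT format with the convert_onnx_models_to_ort tool shipped with the onnxruntime package.

```python
import numpy as np
import onnxruntime as ort

# Load the mobile-friendly ORT-format model (a placeholder file name,
# assumed to have been produced by the offline conversion tool).
session = ort.InferenceSession("model.ort")

# Inspect the model's expected input name.
input_meta = session.get_inputs()[0]

# Feed a dummy FP32 input and run inference locally on the device.
# The shape assumes a typical image-classification model.
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_meta.name: dummy_input})
print(outputs[0].shape)
```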
ONNX quantization techniques can further reduce model size by converting FP32 weights to INT8, and INT8 execution improves performance on the ARM processors found in mobile phones.
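As a sketch, the onnxruntime Python package includes quantization tooling that can perform this conversion offline before the model is deployed to the device. The file names below are placeholders; dynamic quantization is shown because it requires no calibration data, but static quantization is also available.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",         # original FP32 model (placeholder name)
    model_output="model.quant.onnx",  # output model with INT8 weights
    weight_type=QuantType.QInt8,      # store weights as signed 8-bit integers
)
```

Because each FP32 weight shrinks from 4 bytes to 1, the quantized model is roughly a quarter of the original size, which also helps stay within app-store package limits.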
Bio: Guoyu Wang is a senior software engineer at Microsoft, where he works on ONNX Runtime to bring accelerated inferencing to mobile platforms. Previously, he worked on Microsoft Office and shipped multiple versions of Microsoft PowerPoint. Guoyu earned his master's degree in Computer Science from the University of Toronto.