Real-Time Video Inference on Edge Devices via Adaptive Model Streaming
This work addresses the high computation cost of video inference on edge devices such as mobile phones and drones, offering a practical solution with incremental gains in accuracy and efficiency.
The paper tackles real-time video inference on edge devices by introducing Adaptive Model Streaming (AMS), in which a remote server adapts a small edge model via online knowledge distillation. On video semantic segmentation, AMS improves mIoU by 0.4–17.8% over a pre-trained model, and the prototype runs at 30 FPS with 40 ms camera-to-label latency on a mobile phone.
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high computation cost of Deep Neural Networks. We present Adaptive Model Streaming (AMS), a new approach to improving the performance of efficient lightweight models for video inference on edge devices. AMS uses a remote server to continually train and adapt a small model running on the edge device, boosting its performance on the live video using online knowledge distillation from a large, state-of-the-art model. We discuss the challenges of over-the-network model adaptation for video inference and present several techniques to reduce the communication cost of this approach: avoiding excessive overfitting, updating a small fraction of important model parameters, and adaptive sampling of training frames at edge devices. On the task of video semantic segmentation, our experimental results show 0.4–17.8 percent mean Intersection-over-Union improvement compared to a pre-trained model across several video datasets. Our prototype can perform video segmentation at 30 frames-per-second with 40 milliseconds camera-to-label latency on a Samsung Galaxy S10+ mobile phone, using less than 300 Kbps uplink and downlink bandwidth on the device.
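To make the idea of "updating a small fraction of important model parameters" concrete, the following is a minimal sketch of a sparse model update. It is not the paper's implementation: the abstract does not say how importance is measured, so this sketch assumes the magnitude of each parameter's change since the last update as a stand-in. The function names `select_sparse_update` and `apply_sparse_update`, the flat list representation of parameters, and the `fraction` argument are all hypothetical.

```python
from typing import Dict, List


def select_sparse_update(
    old: List[float], new: List[float], fraction: float
) -> Dict[int, float]:
    """Server side: pick the `fraction` of parameters that changed the most.

    Assumption: absolute change is used as the importance score. The source
    abstract does not specify the actual selection criterion.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(old) != len(new):
        raise ValueError("parameter vectors must have the same length")
    k = max(1, int(len(old) * fraction))
    # Rank indices by absolute change, largest first.
    ranked = sorted(
        range(len(old)), key=lambda i: abs(new[i] - old[i]), reverse=True
    )
    # Only these (index, value) pairs would be sent over the network.
    return {i: new[i] for i in ranked[:k]}


def apply_sparse_update(
    params: List[float], update: Dict[int, float]
) -> List[float]:
    """Edge side: overwrite only the streamed coordinates, keep the rest."""
    patched = list(params)
    for index, value in update.items():
        patched[index] = value
    return patched


# Toy example: 10 parameters, stream the top 20% (2 values) per update.
edge_params = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
server_params = [0.10, -0.25, 0.30, 0.90, -0.50, 0.60, 0.71, -0.80, 0.20, 1.00]

update = select_sparse_update(edge_params, server_params, fraction=0.2)
edge_params = apply_sparse_update(edge_params, update)
```

In this toy run only the two largest changes (indices 3 and 8) are transmitted, and the smaller changes at indices 1 and 6 are left for a later update. A scheme of this shape keeps downlink cost roughly proportional to `fraction` times the model size, at the price of the edge model lagging the server model on the unsent coordinates.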