Mobile Video Object Detection with Temporally-Aware Feature Maps
This work addresses efficient video object detection for low-powered mobile and embedded devices and represents an incremental improvement over existing methods.
The paper tackles real-time object detection in videos on mobile devices by introducing an online model that combines fast single-image detection with convolutional LSTM layers. The model reaches up to 15 FPS on a mobile CPU while maintaining accuracy comparable to more expensive models on the ImageNet VID 2015 dataset.
This paper introduces an online model for object detection in videos designed to run in real-time on low-powered mobile and embedded devices. Our approach combines fast single-image object detection with convolutional long short-term memory (LSTM) layers to create an interweaved recurrent-convolutional architecture. Additionally, we propose an efficient Bottleneck-LSTM layer that significantly reduces computational cost compared to regular LSTMs. Our network achieves temporal awareness by using Bottleneck-LSTMs to refine and propagate feature maps across frames. This approach is substantially faster than existing video detection methods, outperforming the fastest single-frame models in model size and computational cost while attaining accuracy comparable to much more expensive single-frame models on the ImageNet VID 2015 dataset. Our model reaches a real-time inference speed of up to 15 FPS on a mobile CPU.
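To give a rough sense of why a bottleneck can reduce the cost of a convolutional LSTM, the following is a minimal sketch and not the authors' implementation. It counts multiply-adds for a standard convolutional LSTM, where each of the four gates convolves the concatenated input and hidden state, and for a hypothetical bottleneck variant that first projects the concatenation down to the hidden channel count and then computes the gates from that smaller feature map. The abstract does not describe the layer's internals, so the bottleneck structure, the function names, and the example sizes below are all assumptions chosen for illustration.

```python
# Illustrative cost model only; the layer structure here is an assumption,
# not the Bottleneck-LSTM definition from the paper.

def conv_macs(height, width, kernel, in_ch, out_ch):
    """Multiply-adds of a dense kernel x kernel convolution (stride 1, same padding)."""
    return height * width * kernel * kernel * in_ch * out_ch


def standard_convlstm_macs(height, width, kernel, in_ch, hidden_ch):
    """Four gates, each convolving the concatenated [input, hidden state]."""
    per_gate = conv_macs(height, width, kernel, in_ch + hidden_ch, hidden_ch)
    return 4 * per_gate


def bottleneck_convlstm_macs(height, width, kernel, in_ch, hidden_ch):
    """Assumed variant: one projection to hidden_ch channels, then four gates on it."""
    projection = conv_macs(height, width, kernel, in_ch + hidden_ch, hidden_ch)
    gates = 4 * conv_macs(height, width, kernel, hidden_ch, hidden_ch)
    return projection + gates


if __name__ == "__main__":
    # Hypothetical sizes: a 10x10 feature map, 3x3 kernels,
    # 1024 input channels, 256 hidden channels.
    args = (10, 10, 3, 1024, 256)
    standard = standard_convlstm_macs(*args)
    bottleneck = bottleneck_convlstm_macs(*args)
    print("standard  :", standard)
    print("bottleneck:", bottleneck)
    print("ratio     :", standard / bottleneck)
```

Under these assumptions the saving grows with the ratio of input channels to hidden channels, because the wide input is processed by one convolution instead of four. If the input and hidden widths are similar, the gap narrows.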