Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
This work addresses efficiency issues in long-form video understanding for applications like retrieval and question answering, representing an incremental improvement over traditional methods.
The paper tackles long-form video understanding, which is computationally intensive and often bottlenecked by GPU memory. It introduces Long-VMNet, a method that uses a fixed-size memory representation and a neural sampler to achieve up to 75x faster inference while maintaining competitive predictive performance on the Rest-ADL dataset.
Long-form video understanding is essential for applications such as video retrieval, summarization, and question answering. Yet traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present the Long-Video Memory Network (Long-VMNet), a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet needs only one scan through the video, which greatly boosts efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and question answering, with competitive predictive performance.
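To make the idea of a fixed-size memory filled in a single scan concrete, the following is a minimal sketch and not the Long-VMNet implementation. It assumes tokens arrive clip by clip and that a scoring function ranks them. The names `build_fixed_memory` and `l2_norm_score` are hypothetical, and the L2-norm scorer and top-k selection are simple placeholders for the learned neural sampler, whose actual design is not described here.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

Token = Sequence[float]


def l2_norm_score(token: Token) -> float:
    """Placeholder scorer (assumption): stands in for a learned neural sampler."""
    return sum(x * x for x in token) ** 0.5


def build_fixed_memory(
    clips: Iterable[List[Token]],
    memory_size: int,
    score_fn: Callable[[Token], float] = l2_norm_score,
) -> List[Token]:
    """Scan the clips once and keep only the `memory_size` highest-scoring tokens."""
    if memory_size <= 0:
        raise ValueError("memory_size must be positive")

    # Min-heap of (score, arrival index, token); the lowest-scoring kept token
    # sits at heap[0]. The arrival index is unique, so tokens are never compared.
    heap: List[Tuple[float, int, Token]] = []
    seen = 0
    for clip in clips:  # each clip is visited exactly once
        for token in clip:
            entry = (score_fn(token), seen, token)
            seen += 1
            if len(heap) < memory_size:
                heapq.heappush(heap, entry)
            elif entry[0] > heap[0][0]:
                # Evict the weakest stored token; memory never exceeds memory_size.
                heapq.heapreplace(heap, entry)

    # Return the retained tokens in their original temporal order.
    return [token for _, _, token in sorted(heap, key=lambda e: e[1])]


# Toy "video": three clips of 2-D tokens (illustrative values only).
clips = [
    [[0.1, 0.0], [3.0, 4.0]],
    [[1.0, 1.0], [0.0, 0.2]],
    [[6.0, 8.0]],
]
memory = build_fixed_memory(clips, memory_size=2)
print(memory)  # [[3.0, 4.0], [6.0, 8.0]]
```

In this sketch the memory footprint depends only on `memory_size`, not on the number of clips, which is the property a fixed-size memory is meant to provide. Downstream queries would then run against `memory` rather than against the full video.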