CV, Feb 18, 2020

MILA: Multi-Task Learning from Videos via Efficient Inter-Frame Attention

Donghyun Kim, Tian Lan, Chuhang Zou, Ning Xu, Bryan A. Plummer, Stan Sclaroff, Jayan Eledath, Gerard Medioni

arXiv:2002.07362v33.33 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for low-latency, high-quality multi-task predictions in video processing, though it appears incremental as it builds on existing slow-fast and attention paradigms.

The paper tackles multi-task learning from videos by introducing an efficient inter-frame attention module within a slow-fast architecture. It reports competitive accuracy on two multi-task learning benchmarks while reducing FLOPs by up to 70%, and its attention-based feature propagation method (ILA) reduces FLOPs by up to 90% while outperforming prior work in task accuracy.

Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a "slow-fast" architecture, where the slower network runs on sparsely sampled keyframes and the light-weight shallow network runs on non-keyframes at a high frame rate. We also propose an effective adversarial learning strategy to encourage the slow and fast networks to learn similar features. Our approach ensures low-latency multi-task learning while maintaining high quality predictions. Experiments show competitive accuracy compared to state-of-the-art on two multi-task learning benchmarks while reducing the number of floating point operations (FLOPs) by up to 70%. In addition, our attention based feature propagation method (ILA) outperforms prior work in terms of task accuracy while also reducing up to 90% of FLOPs.
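To make the slow-fast cost argument concrete, the following is a minimal sketch of how the average per-frame cost might be estimated when a heavy network runs only on keyframes and a light network runs on every other frame. It is not the authors' implementation: the fixed keyframe interval, the function names, and the FLOP figures are all hypothetical and chosen only for illustration, and the light network's cost is assumed to already include any attention overhead.

```python
def is_keyframe(frame_index: int, keyframe_interval: int) -> bool:
    """Return True if this frame is handled by the slow network.

    Assumption: keyframes are sampled at a fixed interval (hypothetical).
    """
    return frame_index % keyframe_interval == 0


def average_flops_per_frame(slow_flops: float, fast_flops: float,
                            keyframe_interval: int) -> float:
    """Average cost per frame over one keyframe interval.

    One frame in each interval pays slow_flops; the remaining
    (keyframe_interval - 1) frames pay fast_flops.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be at least 1")
    total = slow_flops + (keyframe_interval - 1) * fast_flops
    return total / keyframe_interval


def flops_reduction(slow_flops: float, fast_flops: float,
                    keyframe_interval: int) -> float:
    """Fractional saving relative to running the slow network on every frame."""
    avg = average_flops_per_frame(slow_flops, fast_flops, keyframe_interval)
    return 1.0 - avg / slow_flops


# Hypothetical numbers, not taken from the paper:
# slow network = 100 units, fast network = 20 units, keyframe every 5 frames.
schedule = ["slow" if is_keyframe(i, 5) else "fast" for i in range(10)]
saving = flops_reduction(100.0, 20.0, 5)
```

With these illustrative values the average cost is (100 + 4 × 20) / 5 = 36 units per frame, a saving of 64% relative to running the slow network on every frame. Larger savings require a cheaper fast network or sparser keyframes, which is where feature quality on non-keyframes becomes the limiting factor.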
