LocoMotion: Learning Motion-Focused Video-Language Representations
This work addresses the challenge of motion understanding in video-language models, which matters for applications such as robotics and surveillance. The contribution is incremental, however, because it builds on existing representation learning methods.
The paper tackles the problem of learning motion-focused video-language representations by introducing LocoMotion, which adds synthetic motions to videos, generates captions describing the object movements from the motion parameters, and applies verb-variation paraphrasing to diversify those captions. It demonstrates effectiveness on downstream tasks, especially when fine-tuning data is limited.
This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions. Furthermore, we propose verb-variation paraphrasing to increase the caption variety and learn the link between primitive motions and high-level verbs. With this, we are able to learn a motion-focused video-language representation. Experiments demonstrate our approach is effective for a variety of downstream tasks, particularly when limited data is available for fine-tuning. Code is available: https://hazeldoughty.github.io/Papers/LocoMotion/
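The core idea in the abstract, adding a synthetic motion to a video and deriving the caption from the motion's parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MotionParams`, `add_synthetic_motion`, `caption_from_params`, and `paraphrase_captions` are hypothetical, the "object" is just a bright square pasted onto grayscale frames stored as nested lists, and the verb-variation step is approximated by a small hand-written lookup table, since the abstract does not say how paraphrases are produced.

```python
from dataclasses import dataclass

# Hypothetical direction vocabulary: (dx, dy, phrase). Image coordinates, y grows downward.
DIRECTIONS = {
    "left": (-1, 0, "to the left"),
    "right": (1, 0, "to the right"),
    "up": (0, -1, "upwards"),
    "down": (0, 1, "downwards"),
}

# Hand-written stand-in for verb-variation paraphrasing (an assumption, not the paper's method):
# each primitive direction maps to templates that use a higher-level verb.
VERB_TEMPLATES = {
    "left": ["a {obj} slides {pace} to the left", "a {obj} drifts {pace} to the left"],
    "right": ["a {obj} slides {pace} to the right", "a {obj} drifts {pace} to the right"],
    "up": ["a {obj} rises {pace}", "a {obj} ascends {pace}"],
    "down": ["a {obj} falls {pace}", "a {obj} descends {pace}"],
}


@dataclass
class MotionParams:
    direction: str  # key of DIRECTIONS
    speed: int      # pixels per frame
    start: tuple    # (x, y) of the patch's top-left corner in frame 0
    size: int       # side length of the square patch


def trajectory(params, num_frames):
    """Top-left corner of the patch in every frame (linear motion)."""
    dx, dy, _ = DIRECTIONS[params.direction]
    x0, y0 = params.start
    return [(x0 + dx * params.speed * t, y0 + dy * params.speed * t)
            for t in range(num_frames)]


def add_synthetic_motion(video, params, value=255):
    """Paste a moving square onto a copy of `video` (list of frames, each a list of rows)."""
    out = [[row[:] for row in frame] for frame in video]
    for frame, (x, y) in zip(out, trajectory(params, len(video))):
        height, width = len(frame), len(frame[0])
        # Clip the patch to the frame so off-screen positions are skipped.
        for yy in range(max(y, 0), min(y + params.size, height)):
            for xx in range(max(x, 0), min(x + params.size, width)):
                frame[yy][xx] = value
    return out


def _pace(params):
    # Arbitrary threshold chosen for illustration only.
    return "slowly" if params.speed <= 2 else "quickly"


def caption_from_params(params, obj="square"):
    """Primitive-motion caption generated directly from the motion parameters."""
    phrase = DIRECTIONS[params.direction][2]
    return f"a {obj} moves {_pace(params)} {phrase}"


def paraphrase_captions(params, obj="square"):
    """Primitive caption plus verb-varied alternatives describing the same motion."""
    variants = [t.format(obj=obj, pace=_pace(params))
                for t in VERB_TEMPLATES[params.direction]]
    return [caption_from_params(params, obj)] + variants


# Usage: a 3-frame, 8x8 blank video with a 2x2 square moving right at 2 px/frame.
video = [[[0] * 8 for _ in range(8)] for _ in range(3)]
params = MotionParams(direction="right", speed=2, start=(1, 3), size=2)
augmented = add_synthetic_motion(video, params)
captions = paraphrase_captions(params)
```

Because each caption is computed from the same parameters that drive the pasted motion, the video and its text agree by construction. Pairing one augmented clip with several verb-varied captions is one way to expose a model to the link between primitive motions and higher-level verbs that the abstract describes.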