CV · LG · Aug 1, 2024

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

DeepMind · Tsinghua
arXiv:2408.00754v2 · 16 citations · h-index: 38 · Has Code
AI Analysis

The work addresses the need for MLLMs to interpret 3D spaces and temporal dynamics in real-world applications. It offers a lightweight, training-free method; the contribution is incremental but effective.

The paper tackles spatial-temporal reasoning in multimodal language models (MLLMs) that take only 2D images as input. It reports substantial gains on benchmarks, such as +20.5% on ScanQA, without modifying the architecture or requiring task-specific fine-tuning.

Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural designs or task-specific fine-tuning to achieve this. We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints, and then conveys this information to MLLMs through visual prompting. We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks that require spatial-temporal reasoning, including +20.5% improvement on ScanQA, +9.7% on OpenEQA's episodic memory subset, +6.0% on the long-form video benchmark EgoSchema, and +11% on the R2R navigation benchmark. Additionally, we show that Coarse Correspondences can also enhance open-source MLLMs' spatial reasoning (by +6.9% on ScanQA) when applied in both training and inference and that the improvement can generalize to unseen datasets such as SQA3D (+3.1%). Taken together, we show that Coarse Correspondences effectively and efficiently boosts models' performance on downstream tasks requiring spatial-temporal reasoning.
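The pipeline the abstract describes (track objects, keep the primary correspondences, mark them on each frame) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the tracker output is already available as per-frame bounding boxes keyed by track ID, and it assumes "primary" means "appears in the most frames", a selection rule the abstract does not specify. The function name `select_coarse_correspondences` and the `top_k` parameter are hypothetical.

```python
from collections import Counter


def select_coarse_correspondences(tracks, top_k=3):
    """Pick the most persistent tracked objects and place one marker per frame.

    tracks: {frame_index: {track_id: (x0, y0, x1, y1)}}  (assumed tracker output)
    Returns {frame_index: [(label, (cx, cy)), ...]}, where label is a small
    integer shared by the same object across frames.
    """
    # Count in how many frames each tracked object appears.
    frame_counts = Counter(tid for boxes in tracks.values() for tid in boxes)

    # Assumption: the "primary" objects are the ones seen in the most frames.
    # Ties are broken by track ID so the result is deterministic.
    ranked = sorted(frame_counts, key=lambda tid: (-frame_counts[tid], tid))
    kept = ranked[:top_k]

    # Give each kept object a stable label 1..k to draw on every frame.
    labels = {tid: i + 1 for i, tid in enumerate(kept)}

    markers = {}
    for frame, boxes in tracks.items():
        frame_markers = []
        for tid in kept:
            if tid in boxes:
                x0, y0, x1, y1 = boxes[tid]
                # Place the marker at the center of the bounding box.
                frame_markers.append((labels[tid], ((x0 + x1) / 2, (y0 + y1) / 2)))
        markers[frame] = frame_markers
    return markers


# Toy example: object 7 is visible in all three frames, object 3 in two,
# object 9 in one. With top_k=2, only objects 7 and 3 receive markers.
example_tracks = {
    0: {7: (0, 0, 10, 10), 3: (20, 20, 40, 40)},
    1: {7: (2, 0, 12, 10), 9: (50, 50, 60, 60)},
    2: {7: (4, 0, 14, 10), 3: (22, 20, 42, 40)},
}
example_markers = select_coarse_correspondences(example_tracks, top_k=2)
```

In a full pipeline, the returned labels and positions would be drawn onto the frames with an image library before the frames are passed to the MLLM; that overlay step is omitted here to keep the sketch dependency-free.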

