CL · Feb 24, 2024

Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

arXiv:2402.15654v1 · 7 citations · h-index: 4 · AAAI Spring Symposia
Originality: Synthesis-oriented
AI Analysis

This paper addresses reliability issues in AI systems for robotics and embodied AI, though its contribution is incremental, building on existing multimodal reasoning research.

The paper investigates failure cases in multimodal LLMs' physical reasoning abilities, showing they fail to compose atomic world knowledge correctly in object manipulation tasks, and proposes a method to distill discovered object properties back into LLMs.

In this paper, we present an exploration of LLMs' abilities to solve problems using physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge into correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that the model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.

