CV, Jan 12

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

arXiv:2601.08010v1
Originality: Highly original
AI Analysis

This paper addresses inconsistent predictions in multimodal reasoning and offers AI researchers and practitioners a novel method for stabilizing them.

The paper tackles the instability of multi-step reasoning in vision-language models by introducing CASHEW, an inference-time framework that aggregates multiple reasoning trajectories with visual verification, and CASHEW-RL, a learned variant trained with a composite reward. Across 13 image and video benchmarks, the reported gains reach up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
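The inference-time loop described in the abstract (sample several candidate trajectories, filter steps that fail visual verification, aggregate, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Trajectory`, `majority_merge`, and `iterative_aggregate`, the `sample` and `verify_step` callables, and the default round and candidate counts are all hypothetical. The toy aggregator uses a majority vote as a stand-in; in the paper, aggregation and verification are presumably carried out by the vision-language model itself.

```python
from collections import Counter
from typing import Callable, List, NamedTuple, Optional


class Trajectory(NamedTuple):
    """A candidate reasoning trace: ordered steps plus a final answer."""
    steps: List[str]
    answer: str


def majority_merge(trajectories: List[Trajectory]) -> Trajectory:
    """Toy aggregator (assumption): majority-vote answer, ordered union of steps."""
    answer = Counter(t.answer for t in trajectories).most_common(1)[0][0]
    steps: List[str] = []
    for t in trajectories:
        for step in t.steps:
            if step not in steps:
                steps.append(step)
    return Trajectory(steps, answer)


def iterative_aggregate(
    sample: Callable[[], Trajectory],
    verify_step: Callable[[str], bool],
    merge: Callable[[List[Trajectory]], Trajectory] = majority_merge,
    num_candidates: int = 3,
    num_rounds: int = 2,
) -> Trajectory:
    """Repeatedly sample candidates, drop unverified steps, and merge.

    `sample` stands in for one stochastic model call; `verify_step` stands in
    for a check of a single step against the visual input. Both are
    hypothetical placeholders.
    """
    if num_rounds < 1 or num_candidates < 1:
        raise ValueError("num_rounds and num_candidates must be at least 1")
    aggregate: Optional[Trajectory] = None
    for _ in range(num_rounds):
        candidates = [sample() for _ in range(num_candidates)]
        if aggregate is not None:
            # Carry the previous round's aggregate forward as a candidate.
            candidates.insert(0, aggregate)
        # Keep only the steps that pass verification.
        verified = [
            Trajectory([s for s in t.steps if verify_step(s)], t.answer)
            for t in candidates
        ]
        aggregate = merge(verified)
    return aggregate
```

In this sketch a trajectory whose steps are all filtered out still contributes its answer to the vote; a stricter variant might discard such candidates entirely. Either choice is an assumption, since the abstract does not specify it.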

