CV · Dec 7, 2024

LATTE: Learning to Think with Vision Specialists

Salesforce · Stanford
arXiv:2412.05479v4 · 7 citations · h-index: 64 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving reasoning in vision-language models on tasks that require complex multi-modal understanding. The advance is incremental, since it builds on existing vision specialists.

Vision-language models struggle with complex questions that require both perception and reasoning. The paper proposes LATTE, which offloads perception to vision specialists so that the model can focus on reasoning, and reports 4-5% gains over baselines across 6 benchmarks.

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant 4-5% gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.
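The idea of reasoning over specialist outputs can be illustrated with a toy sketch. This is not the authors' implementation: the specialist functions, the `Step`/`Trace` structures, and the scripted policy are all hypothetical stand-ins, and a trained model would generate the thoughts and actions instead of a hand-written rule. It only shows the general shape of a reasoning trace in which perception is delegated and the reasoner works over the returned observations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for vision specialists. Real specialists would
# run perception models on the image; these return canned strings.
def fake_object_detector(image: str) -> str:
    return "objects: 2 cups, 1 plate"

def fake_ocr(image: str) -> str:
    return "text: 'OPEN 9-5'"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "detect": fake_object_detector,
    "ocr": fake_ocr,
}

@dataclass
class Step:
    thought: str       # the reasoner's stated rationale
    action: str        # name of the specialist that was called
    observation: str   # the specialist's perceptual output

@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""

def scripted_policy(question: str, steps: List[Step]) -> Tuple[str, str]:
    """Toy reasoner (assumption): picks one specialist, then answers."""
    if not steps:
        if "how many" in question.lower():
            return ("I need object counts.", "detect")
        return ("I need the text in the image.", "ocr")
    return ("I have enough information.", "answer")

def run_trace(question: str, image: str, max_steps: int = 4) -> Trace:
    """Alternate thought -> action -> observation until the policy answers."""
    trace = Trace(question)
    for _ in range(max_steps):
        thought, action = scripted_policy(question, trace.steps)
        if action == "answer":
            # Toy answer: echo the last observation.
            trace.answer = trace.steps[-1].observation
            return trace
        observation = SPECIALISTS[action](image)
        trace.steps.append(Step(thought, action, observation))
    return trace
```

Calling `run_trace("How many cups are there?", "img.png")` yields a trace with a single `detect` step followed by an answer. A collection of such traces is the kind of record that could then be filtered for quality, although the filtering criteria are not stated here.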

Code Implementations: 1 repo
