CV · CL · May 5

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

arXiv:2605.04304 · 71.3 · h-index: 4
Predicted impact: top 41% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For chart question answering, this work addresses the challenge of multi-step reasoning across multiple subplots, which existing MLLMs struggle with.

HierVA introduces a hierarchical visual agent framework for chart reasoning that iteratively manages a joint image-text context, achieving consistent improvements over strong multimodal baselines on the CharXiv reasoning subset.

Advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi-step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image-text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom-in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that the hierarchical architecture, the scoped visual context, and the distilled context contribute complementary gains.
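The manager/worker loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Manager`, `Step`, `zoom_in`, and `max_value_worker` are hypothetical, the chart is stood in for by a small grid of integers, and the plan is passed in by hand rather than generated by a model. It only shows the three ideas the abstract names: a hierarchy, a visual context scoped by a zoom-in crop, and a compact text context that stores distilled results.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy stand-in for a chart: a 2D grid of values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def zoom_in(image: Image, box: Box) -> Image:
    """Restrict the visual context to one region (e.g. a single subplot)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


@dataclass
class Step:
    instruction: str  # what the worker should find out
    box: Box          # region of the chart the worker is allowed to see


@dataclass
class Manager:
    """Keeps only distilled findings, not full worker traces (assumed design)."""
    text_context: List[str] = field(default_factory=list)

    def run(self, image: Image, plan: List[Step],
            worker: Callable[[Image, str], str]) -> List[str]:
        for step in plan:
            crop = zoom_in(image, step.box)          # scoped visual context
            result = worker(crop, step.instruction)  # worker sees the crop only
            # Store a one-line summary as the distilled textual context.
            self.text_context.append(f"{step.instruction}: {result}")
        return self.text_context


def max_value_worker(crop: Image, instruction: str) -> str:
    """Hypothetical worker: reports the largest value visible in its crop."""
    return str(max(max(row) for row in crop))


chart = [
    [1, 2, 9, 9],
    [3, 4, 9, 9],
    [5, 6, 7, 8],
    [0, 0, 0, 0],
]
plan = [
    Step("max in top-left subplot", (0, 0, 2, 2)),
    Step("max in third row", (2, 0, 3, 4)),
]
context = Manager().run(chart, plan, max_value_worker)
```

In a real system the worker would presumably be a multimodal model call and the manager would revise its plan between steps; here both are fixed so the control flow stays visible.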

