CL · Apr 22

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

arXiv:2604.21134 · 22.7 · h-index: 10
Predicted impact: top 77% in CL (last 90 days) · Originality: incremental advance
AI Analysis

For visualization agents, this framework overcomes the pixel-only bottleneck by leveraging chart specifications and interaction, enabling more accurate data interpretation.

VLMs struggle with charts due to a pixel-only bottleneck. IVG combines spec-grounded introspection and view-grounded interaction, achieving 0.81 QA accuracy (+6.7% on overlapping geometries) on the new iPlotBench benchmark.

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while combining it with interaction achieves the highest QA accuracy (0.81), with a +6.7% gain on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
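
The two grounding modes named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `introspect_value` and `isolate_trace`, the toy `spec`, and its values are all hypothetical. The sketch assumes only that a Plotly figure is available as a plain dict with `data` and `layout` keys (for example, as returned by `Figure.to_dict()`), and that the trace attribute `visible` and the layout attribute `xaxis.range` control what the view shows.

```python
import copy

# Hypothetical Plotly-style figure specification as a plain dict.
# The two traces nearly overlap, which is hard to separate from pixels alone.
spec = {
    "data": [
        {"type": "scatter", "name": "A", "x": [1, 2, 3], "y": [10, 20, 30]},
        {"type": "scatter", "name": "B", "x": [1, 2, 3], "y": [10, 21, 29]},
    ],
    "layout": {"xaxis": {"title": {"text": "x"}}},
}


def introspect_value(spec, trace_name, x_value):
    """Spec-grounded introspection (illustrative): read the exact y value
    for a named trace at a given x directly from the specification."""
    for trace in spec["data"]:
        if trace.get("name") == trace_name:
            idx = list(trace["x"]).index(x_value)  # ValueError if x is absent
            return trace["y"][idx]
    raise KeyError(f"no trace named {trace_name!r}")


def isolate_trace(spec, trace_name, x_range=None):
    """View-grounded interaction (illustrative): return a modified copy of
    the spec that shows only one trace and optionally zooms the x-axis."""
    view = copy.deepcopy(spec)  # leave the original specification untouched
    for trace in view["data"]:
        is_target = trace.get("name") == trace_name
        trace["visible"] = True if is_target else "legendonly"
    if x_range is not None:
        view["layout"].setdefault("xaxis", {})["range"] = list(x_range)
    return view


exact = introspect_value(spec, "B", 2)
zoomed = isolate_trace(spec, "B", x_range=(1.5, 2.5))
```

In this sketch, introspection answers a value question without rendering anything, while the interaction step produces a new view that an agent could re-render and inspect when the original chart is visually ambiguous.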
