CL · Apr 22

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren, Evgenia Smirni

arXiv:2604.21134 · 22.7 · h-index: 10

Predicted impact: top 77% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For visualization agents, this framework overcomes the pixel-only bottleneck by leveraging chart specifications and interaction, enabling more accurate data interpretation.

VLMs struggle with charts due to a pixel-only bottleneck. IVG combines spec-grounded introspection and view-grounded interaction, achieving 0.81 QA accuracy (+6.7% on overlapping geometries) on the new iPlotBench benchmark.

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while combining it with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
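The two grounding modes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the figure below is a hypothetical spec written as a plain dictionary in the shape of Plotly's figure JSON (`data` traces plus `layout`), and the helper names `introspect_value`, `isolate_trace`, and `zoom_x` are assumptions introduced only for illustration.

```python
import copy

# Hypothetical figure spec, shaped like Plotly's figure JSON.
# Two series nearly overlap, which is hard to read from pixels alone.
fig_spec = {
    "data": [
        {"type": "scatter", "name": "A", "x": [1, 2, 3], "y": [10, 20, 30]},
        {"type": "scatter", "name": "B", "x": [1, 2, 3], "y": [10, 21, 29]},
    ],
    "layout": {"title": {"text": "Two overlapping series"}},
}


def introspect_value(spec, trace_name, x_value):
    """Spec-grounded introspection (sketch): read the exact y value
    for a given trace and x directly from the specification."""
    for trace in spec["data"]:
        if trace.get("name") == trace_name:
            idx = trace["x"].index(x_value)
            return trace["y"][idx]
    raise KeyError(f"no trace named {trace_name!r}")


def isolate_trace(spec, trace_name):
    """View-grounded interaction (sketch): return a copy of the spec in
    which every other trace is hidden, as if toggled off in the legend."""
    view = copy.deepcopy(spec)
    for trace in view["data"]:
        keep = trace.get("name") == trace_name
        trace["visible"] = True if keep else "legendonly"
    return view


def zoom_x(spec, lo, hi):
    """View-grounded interaction (sketch): return a copy of the spec
    with the x-axis range restricted to [lo, hi]."""
    view = copy.deepcopy(spec)
    view.setdefault("layout", {}).setdefault("xaxis", {})["range"] = [lo, hi]
    return view


exact_b = introspect_value(fig_spec, "B", 2)   # exact value from the spec
only_a = isolate_trace(fig_spec, "A")          # view with series B hidden
zoomed = zoom_x(fig_spec, 1, 2)                # view zoomed to x in [1, 2]
```

Both interaction helpers return a modified copy rather than mutating the original spec, so the ground-truth specification stays available for later introspection. In a real agent the modified spec would presumably be re-rendered and passed back to the VLM; that step is omitted here.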
