AI CL · Oct 19, 2025

See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S Kevin Zhou, Xike Xie

arXiv:2510.16769v15.81 citations · h-index: 7

Originality: Highly original

AI Analysis

This work addresses scalability bottlenecks in graph understanding for AI applications. It represents a strong, specific gain rather than a foundational advance.

The paper tackles scalability and modality coordination in graph understanding with vision-language models. It proposes GraphVista, a framework that organizes graph information hierarchically and uses a planning agent to route each task to a modality. The reported results cover graphs up to 200× larger than those in existing benchmarks, with up to 4.4× quality improvement over state-of-the-art methods.

Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality, using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.
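The two ideas in the abstract, retrieving only a task-relevant subgraph and routing each task to a modality, can be illustrated with a toy sketch. This is not GraphVista's implementation: the task labels, the fixed lookup in `route`, and the k-hop retrieval in `k_hop_subgraph` are all assumptions made for illustration, standing in for the paper's planning agent and GraphRAG base.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical task labels; the paper's actual task taxonomy is not given here.
TEXT_TASKS = {"node_count", "edge_count", "degree"}
VISUAL_TASKS = {"shortest_path", "cycle_detection", "connectivity"}

# Adjacency-list graph: node name -> list of neighbor names.
Graph = Dict[str, List[str]]


def route(task: str) -> str:
    """Pick a modality with a fixed lookup (an assumed stand-in for an agent)."""
    if task in TEXT_TASKS:
        return "text"    # simple property reasoning
    if task in VISUAL_TASKS:
        return "visual"  # local, structurally complex reasoning
    raise ValueError(f"unknown task: {task}")


def k_hop_subgraph(graph: Graph, seeds: List[str], k: int) -> Set[str]:
    """Return the nodes within k hops of any seed, via breadth-first search."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

For a path graph a-b-c-d, `k_hop_subgraph(graph, ["a"], 2)` keeps only `a`, `b`, and `c`, so a downstream model would see a small neighborhood rather than the full graph.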

