CL, CV | Dec 18, 2024

Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

arXiv:2412.13540v3 | 19 citations | h-index: 10 | ACL
Originality: Incremental advance
AI Analysis

This paper addresses a domain-specific problem for researchers and practitioners who use LVLMs in visual graph applications, and it delivers incremental improvements through fine-tuning.

The paper addresses the limitations of large vision-language models (LVLMs) in understanding and reasoning over visual graphs. It proposes VGCure, a benchmark covering 22 tasks, which reveals that LVLMs are weak at basic graph tasks. It also proposes a structure-aware fine-tuning framework that improves their performance and robustness.

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs' performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.
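To make "fundamental graph understanding and reasoning" concrete, here is a minimal sketch of how ground-truth answers for a few basic graph questions could be computed before a graph is rendered as an image and shown to an LVLM. The abstract does not list the 22 VGCure tasks, so the questions chosen here (node count, edge count, degree, connectivity, shortest path length) and the function name `ground_truth` are illustrative assumptions, not the benchmark's actual task set or code.

```python
from collections import deque


def ground_truth(num_nodes, edges, source, target):
    """Compute reference answers for a few basic graph questions.

    Hypothetical helper for illustration only: nodes are the integers
    0..num_nodes-1 and `edges` is a list of undirected (u, v) pairs.
    """
    # Build an undirected adjacency map.
    adjacency = {node: set() for node in range(num_nodes)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Breadth-first search from `source` gives hop distances to every
    # reachable node, which answers both connectivity and path length.
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    return {
        "node_count": num_nodes,
        "edge_count": len(edges),
        "degree_of_source": len(adjacency[source]),
        "connected": target in distance,
        # None when `target` is unreachable from `source`.
        "shortest_path_length": distance.get(target),
    }


if __name__ == "__main__":
    # A path 0-1-2 plus an isolated node 3.
    print(ground_truth(4, [(0, 1), (1, 2)], source=0, target=2))
```

Answers like these could be compared against a model's free-text response to the rendered graph. The relational questions (connectivity, path length) are the kind the abstract reports as particularly hard for LVLMs.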

Code Implementations: 1 repo