CV · AI · Aug 29, 2024

LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

arXiv:2408.16224v2 · 8 citations · h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in the visual understanding of VLMs, offering a domain-specific improvement.

It tackles the fragmented perception that Vision Transformer patch division causes in vision-language models (VLMs) by introducing a Scene Graph Expression (SGE) module, which the authors report significantly improves VLM performance on vision-language tasks.

Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.
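As a rough illustration of the two ideas in the abstract, the sketch below shows how a ViT-style patch grid splits a single object across many patches, and how a scene graph can instead express the same content as explicit subject-predicate-object triples. This is not the authors' SGE module: the image size, patch size, bounding box, relation triples, and all function and class names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


def patch_grid(height, width, patch):
    """Return (row, col) indices of the non-overlapping patches tiling an image."""
    return [(r, c) for r in range(height // patch) for c in range(width // patch)]


def patches_covering(box, patch):
    """Return the patches overlapped by a box (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    rows = range(y0 // patch, (y1 - 1) // patch + 1)
    cols = range(x0 // patch, (x1 - 1) // patch + 1)
    return [(r, c) for r in rows for c in cols]


@dataclass(frozen=True)
class Relation:
    """One scene-graph edge: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


# Assumed setup: a 224x224 image cut into 16x16 patches (a common ViT configuration).
grid = patch_grid(224, 224, 16)
print(len(grid))  # 196 patches in total

# Hypothetical bounding box of one object; it ends up spread over many patches.
dog_box = (40, 60, 120, 150)
dog_patches = patches_covering(dog_box, 16)
print(len(dog_patches))  # 42 patches each see only a fragment of the object

# Hypothetical scene graph: objects and their relations stated explicitly.
scene_graph = [
    Relation("dog", "on", "sofa"),
    Relation("dog", "next to", "ball"),
]
for rel in scene_graph:
    print(f"{rel.subject} --{rel.predicate}--> {rel.obj}")
```

In the patch view no single token represents the whole object, whereas each triple in the scene graph names the object and its relations directly. How the paper extracts such graphs and feeds them to the VLM is not described in the text above.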

