Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models
This work addresses the challenge of processing complex graph structures with LLMs and is aimed at researchers and practitioners in graph analysis. It appears incremental, however, since it builds on existing LLM and graph encoding methods.
The paper tackles the problem of encoding graphs for Large Language Models (LLMs), whose context size limits how much of a graph fits in a prompt. It introduces a multi-modal approach that encodes a graph as text, image, or motif, paired with prompts that approximate the graph's global connectivity. Its main finding is that the image modality, used with vision-language models such as GPT-4V, balances token limits against preserved information better than text and outperforms prior graph neural network (GNN) encoders.
Our research integrates graph data with Large Language Models (LLMs), which, despite their advancements in various fields using large text corpora, face limitations in encoding entire graphs due to context size constraints. This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, coupled with prompts to approximate a graph's global connectivity, thereby enhancing LLMs' efficiency in processing complex graph structures. The study also presents GraphTMI, a novel benchmark for evaluating LLMs in graph structure analysis, focusing on homophily, motif presence, and graph difficulty. Key findings indicate that the image modality, especially with vision-language models like GPT-4V, is superior to text in balancing token limits and preserving essential information, and that it outperforms prior graph neural network (GNN) encoders. Furthermore, the research assesses how various factors affect the performance of each encoding modality and outlines the existing challenges and potential future developments for LLMs in graph understanding and reasoning tasks. All data will be publicly available upon acceptance.
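To make the modalities concrete, the following is a minimal sketch of how a text encoding and a motif encoding of the same graph might be turned into prompts, along with an edge-homophily score of the kind the benchmark is said to focus on. The toy graph, the function names, the prompt wording, and the choice of triangles as the motif are all assumptions made for illustration; the paper's actual prompts and motif definitions are not given in this section. The image modality is left out because rendering a graph would require a plotting library.

```python
from itertools import combinations

# Hypothetical toy graph: undirected edges plus a class label per node.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
LABELS = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def text_encoding(edges, labels, target):
    """Serialise the graph as an edge list (assumed prompt wording).

    The prompt grows with the number of edges, which is why a full
    text encoding can run into context size limits on large graphs.
    """
    edge_str = ", ".join(f"({u}, {v})" for u, v in edges)
    known = ", ".join(
        f"{n}: {c}" for n, c in sorted(labels.items()) if n != target
    )
    return (
        f"Edges: {edge_str}. Known labels: {known}. "
        f"What is the label of node {target}?"
    )


def triangles_containing(adj, node):
    """List the triangles (one simple motif) that include `node`."""
    return [
        tuple(sorted((node, a, b)))
        for a, b in combinations(sorted(adj[node]), 2)
        if b in adj[a]
    ]


def motif_encoding(edges, labels, target):
    """Summarise only the local structure around `target` (assumed wording)."""
    adj = build_adjacency(edges)
    tris = triangles_containing(adj, target)
    neighbours = ", ".join(f"{n}: {labels[n]}" for n in sorted(adj[target]))
    return (
        f"Node {target} has degree {len(adj[target])} and lies in "
        f"{len(tris)} triangle(s). Neighbour labels: {neighbours}. "
        f"What is the label of node {target}?"
    )


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


if __name__ == "__main__":
    print(text_encoding(EDGES, LABELS, target=2))
    print(motif_encoding(EDGES, LABELS, target=2))
    print(edge_homophily(EDGES, LABELS))
```

In this sketch the text prompt lists every edge, while the motif prompt keeps a fixed-size local summary, which illustrates the trade-off between token cost and preserved structure that the abstract describes.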