SmoGVLM: A Small, Graph-enhanced Vision-Language Model
For practitioners seeking efficient multimodal reasoning, this work shows that small models augmented with structured knowledge can outperform larger ones at lower computational cost.
SmoGVLM integrates structured knowledge via Graph Neural Networks into small vision-language models, achieving up to 16.24% performance gains and surpassing larger VLMs on knowledge-intensive multimodal reasoning tasks.
Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that uses Graph Neural Networks to integrate structured knowledge with the visual and textual modalities. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that a small model trained with our approach achieves performance gains of up to 16.24% and surpasses both larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
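The abstract does not describe the architecture in detail, so the following is only a minimal sketch of the general idea of combining a graph encoding with visual and textual inputs. It assumes one round of mean-aggregation message passing with no learned weights, mean pooling into a single "graph token", and simple concatenation with the other token embeddings. All names (`message_pass`, `build_input_sequence`), the toy graph, and the two-dimensional features are hypothetical and are not taken from SmoGVLM.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def message_pass(features: Dict[str, Vector],
                 edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing (illustrative, no learned weights)."""
    neighbours: Dict[str, List[str]] = {node: [] for node in features}
    for src, dst in edges:
        # Assumption: edges are treated as undirected.
        neighbours[src].append(dst)
        neighbours[dst].append(src)
    updated = {}
    for node, feat in features.items():
        # Each node averages its own feature with those of its neighbours.
        pool = [feat] + [features[n] for n in neighbours[node]]
        updated[node] = mean(pool)
    return updated


def build_input_sequence(graph_token: Vector,
                         visual_tokens: List[Vector],
                         text_tokens: List[Vector]) -> List[Vector]:
    """Prepend the pooled graph token to the visual and text token embeddings."""
    return [graph_token] + visual_tokens + text_tokens


# Hypothetical toy knowledge graph with 2-dimensional node features.
node_features = {"dog": [1.0, 0.0], "animal": [0.0, 1.0], "leash": [1.0, 1.0]}
graph_edges = [("dog", "animal"), ("dog", "leash")]

updated_features = message_pass(node_features, graph_edges)
graph_token = mean(list(updated_features.values()))  # mean pooling over nodes

visual_tokens = [[0.1, 0.2], [0.3, 0.4]]             # placeholder image patch embeddings
text_tokens = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]   # placeholder word embeddings
sequence = build_input_sequence(graph_token, visual_tokens, text_tokens)
```

In a real system the aggregation step would use learned weights, and the graph token would be projected into the language model's embedding space. The sketch only shows how structured knowledge can enter the input as an additional token alongside the other two modalities.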