GiVE: Guiding Visual Encoder to Perceive Overlooked Information
GiVE addresses a bottleneck in multimodal AI applications such as visual question answering by improving object perception, though the work appears incremental because it builds on existing encoder frameworks.
The paper tackles the problem of visual encoders in multimodal large language models overlooking non-salient objects. It proposes the GiVE approach, together with novel loss functions and a dataset, and reports state-of-the-art performance in object retrieval and comprehensiveness.
Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.
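The abstract names an image-text contrastive loss (OITC) but does not give its formula. As a rough illustration of the general family such a loss belongs to, the sketch below implements a generic symmetric InfoNCE-style image-text contrastive loss in plain Python. It is not the paper's OITC definition: the object-focused conditioning is omitted, and the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of matched pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    This is a hypothetical stand-in, not the OITC loss from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should place its mass on column i.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy two-sample batch: correctly paired versus swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped captions give a much larger one. An object-focused variant would presumably condition the image embedding on a target object before computing the similarities, but how GiVE does this is not specified in the abstract.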