Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
The work addresses a bottleneck in vision-language models for applications that require complex scene understanding, though the contribution appears incremental because it builds on existing object-centric approaches.
The paper tackles the challenge of learning relations among multiple objects in vision-language pre-training by proposing X-VLM, a method for multi-grained alignment between texts and visual concepts. The authors report that X-VLM consistently outperforms state-of-the-art methods on downstream tasks.
Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, to align the texts with the visual concepts, where the alignments are made at multiple granularities. Experimental results show that X-VLM effectively leverages the learned multi-grained alignments in many downstream vision language tasks and consistently outperforms state-of-the-art methods.
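The abstract names two coupled objectives, locating a visual concept given its text and aligning that text with the concept, but does not say which loss functions are used. The sketch below is one way such a combined objective could look, not the authors' implementation. It assumes a symmetric contrastive (InfoNCE-style) loss for alignment and a simple `1 - IoU` penalty for localization. The function names, the temperature value, and the equal weighting of the two terms are all hypothetical choices made for illustration.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_loss(text_vecs, concept_vecs, temperature=0.07):
    """Symmetric InfoNCE-style loss; the i-th text is paired with the i-th concept.

    Assumption: the temperature of 0.07 is illustrative, not taken from the paper.
    """
    n = len(text_vecs)
    sims = [[cosine(t, c) / temperature for c in concept_vecs] for t in text_vecs]

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_denom = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denom - row[i]
        return total / n

    cols = [list(col) for col in zip(*sims)]  # transpose: concept-to-text direction
    return 0.5 * (one_direction(sims) + one_direction(cols))


def multi_grained_loss(text_vecs, concept_vecs, pred_boxes, true_boxes):
    """Alignment term plus localization term, equally weighted (an assumption).

    Each index is one (text, visual concept) pair. A concept may be the whole
    image, a region, or a single object, which is where the multiple
    granularities enter.
    """
    align = contrastive_loss(text_vecs, concept_vecs)
    locate = sum(1.0 - iou(p, t) for p, t in zip(pred_boxes, true_boxes)) / len(pred_boxes)
    return align + locate
```

In this sketch, a caption for the full image and a phrase for a single object are handled the same way: each contributes one embedding pair to the alignment term and one box pair to the localization term, with the whole-image concept using a box that covers the entire image.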