CV · Mar 5, 2024

Enhancing Vision-Language Pre-training with Rich Supervisions

Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

arXiv:2403.03346v2 · 15.318 citations · h-index: 19 · CVPR

Originality: Highly original

AI Analysis

This work targets vision-language pre-training for tasks such as table detection and widget captioning, proposing a new method for a known bottleneck in pre-training.

The paper introduces S4, a novel vision-language pre-training paradigm that uses web screenshots with 10 carefully designed tasks. It improves performance on nine downstream tasks, with gains of up to 76.1% on Table Detection and at least 1% on Widget Captioning.

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of an image-to-text model in nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.
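To illustrate why annotations from rendered HTML can be cheap, here is a minimal sketch of reading supervision targets straight off an element tree. It is not the paper's pipeline: the `Node` class, the `detection_targets` and `text_targets` helpers, and the toy page are all hypothetical, and the two target types shown are illustrative rather than any of the paper's 10 tasks.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in screenshot pixels


@dataclass
class Node:
    """Hypothetical rendered HTML element: tag, on-screen box, text, children."""
    tag: str
    bbox: BBox
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield every node in the tree depth-first, parent before children."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def detection_targets(root: Node, tag: str) -> List[BBox]:
    """Boxes of all elements with the given tag (for example 'table')."""
    return [n.bbox for n in iter_nodes(root) if n.tag == tag]


def text_targets(root: Node) -> List[Tuple[BBox, str]]:
    """(box, text) pairs for leaf elements that carry visible text."""
    return [(n.bbox, n.text) for n in iter_nodes(root)
            if not n.children and n.text]


# Toy page standing in for one rendered screenshot (illustrative values only).
page = Node("body", (0, 0, 800, 600), children=[
    Node("h1", (20, 10, 780, 50), text="Quarterly report"),
    Node("table", (20, 70, 780, 300), children=[
        Node("td", (20, 70, 400, 100), text="Revenue"),
        Node("td", (400, 70, 780, 100), text="42"),
    ]),
])

print(detection_targets(page, "table"))
print(text_targets(page))
```

Because both the tag hierarchy and the element boxes come from the renderer, no human labeling is needed in this sketch; each traversal yields ready-made (region, label) pairs.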

