Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
This work addresses few-shot adaptation of visual language models; its contribution is incremental in that it builds on existing adaptation methods.
The paper tackles the problem of adapting pre-trained visual language models to new tasks with limited labeled data, showing that a self-labeling approach using unlabeled images yields significant performance gains across multiple visual language tasks.
Large-scale visual language models are widely used as pre-trained models and then adapted to various downstream tasks. While humans are known to learn new tasks efficiently from a few examples, deep learning models struggle to adapt from few examples. In this work, we look into task adaptation in the low-data regime and provide a thorough study of existing adaptation methods for generative visual language models. We also show important benefits of self-labelling, i.e. using the model's own predictions to self-improve when a larger number of unlabelled images from the same distribution is available. Our study demonstrates significant gains from our proposed task adaptation pipeline across a wide range of visual language tasks, such as visual classification (ImageNet), visual captioning (COCO), detailed visual captioning (Localised Narratives) and visual question answering (VQAv2).
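
To make the self-labelling idea concrete, the following is a minimal sketch of a generic pseudo-labelling loop. It is not the pipeline proposed in the paper: a toy one-dimensional nearest-centroid classifier stands in for the visual language model, and the names `fit_centroids`, `predict`, `self_label`, and the `min_margin` confidence filter are all hypothetical choices made for illustration.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Toy 'training' step: one centroid (mean) per label."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in groups.items()}


def predict(centroids, x):
    """Return (label, margin); margin is the distance gap between
    the nearest and second-nearest centroid, used as a confidence proxy."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_label(labeled_x, labeled_y, unlabeled_x, min_margin=0.0, rounds=1):
    """Generic self-labelling (assumed procedure, not the paper's):
    1. fit on the few labelled examples,
    2. label the unlabelled pool with the model's own predictions,
    3. refit on labelled + sufficiently confident pseudo-labelled data."""
    centroids = fit_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        pseudo_x, pseudo_y = [], []
        for x in unlabeled_x:
            y, margin = predict(centroids, x)
            if margin >= min_margin:  # keep only confident predictions
                pseudo_x.append(x)
                pseudo_y.append(y)
        centroids = fit_centroids(labeled_x + pseudo_x, labeled_y + pseudo_y)
    return centroids


if __name__ == "__main__":
    # Two labelled examples plus four unlabelled points from the same distribution.
    print(self_label([0.0, 10.0], ["a", "b"], [1.0, 2.0, 8.0, 9.0]))
```

Raising `min_margin` trades pseudo-label quantity for quality: fewer unlabelled points are used, but those that are kept are less likely to carry a wrong label.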