LAMP: Label Augmented Multimodal Pretraining
This work tackles the heavy data requirements that limit multimodal pretraining in practice, benefiting researchers and practitioners in computer vision and natural language processing.
This paper addresses the need for large volumes of high-quality vision-language pairs in multimodal pretraining by proposing LAMP, a label-augmented V-L pretraining model. LAMP uses auto-generated labels of visual objects to enrich vision-language pairs with fine-grained alignment and introduces a novel pretraining task, demonstrating its effectiveness across four downstream tasks.
Multimodal representation learning through pretraining has attracted increasing interest because it is easy to use and can benefit a variety of Visual-and-Language (V-L) tasks. However, its requirement for a large volume of high-quality vision-language pairs greatly limits its value in practice. In this paper, we propose a novel label-augmented V-L pretraining model, named LAMP, to address this problem. Specifically, we leverage auto-generated labels of visual objects to enrich vision-language pairs with fine-grained alignment, and we design a novel pretraining task accordingly. We also find that such label augmentation in second-stage pretraining further benefits a variety of downstream tasks. To evaluate LAMP, we compare it with several state-of-the-art models on four downstream tasks. The quantitative results and analysis demonstrate the value of labels in V-L pretraining and the effectiveness of LAMP.