Renaissance: Investigating the Pretraining of Vision-Language Encoders
This work addresses efficiency and design questions relevant to researchers and practitioners building vision-language models, though its contribution is incremental because it builds on existing methods.
The paper investigates best practices for pretraining vision-language encoders, showing that freezing large parts of models can save significant compute without harming downstream performance, and examines the impact of basing models on vision versus text models.
Over the past several years, there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions about best practices for designing and training such models. In this paper, we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that freezing large parts of vision-language models during pretraining saves significant compute at no cost to downstream performance. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce Renaissance, a VL modeling platform that we use to conduct all of the experiments. The platform offers a great deal of flexibility in creating, training, and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.
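To make the idea of freezing concrete, the following is a minimal, framework-free sketch of what it means to exclude parts of a model from pretraining updates. It is not taken from the Renaissance code base. The module names (`vision_encoder`, `text_encoder`, `cross_modal`), the toy weights, and the helper functions are all hypothetical and chosen only for illustration. In a framework such as PyTorch, the same effect is typically obtained by setting `requires_grad = False` on the parameters to be frozen.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Toy stand-in for one component of a VL encoder (hypothetical)."""
    name: str
    weights: list[float]
    trainable: bool = True


def freeze(modules, names):
    """Mark the named modules as frozen so that updates skip them."""
    for m in modules:
        if m.name in names:
            m.trainable = False


def sgd_step(modules, grads, lr=0.1):
    """Apply one plain SGD update, touching only trainable modules."""
    for m in modules:
        if not m.trainable:
            continue  # frozen: weights are left exactly as they were
        m.weights = [w - lr * g for w, g in zip(m.weights, grads[m.name])]


def trainable_fraction(modules):
    """Share of all weights that still receive updates."""
    total = sum(len(m.weights) for m in modules)
    live = sum(len(m.weights) for m in modules if m.trainable)
    return live / total


# Hypothetical layout: two encoders plus a small fusion module.
model = [
    Module("vision_encoder", [0.5] * 6),
    Module("text_encoder", [0.5] * 6),
    Module("cross_modal", [0.5] * 4),
]

# Assumption for illustration: freeze both encoders, train only the fusion module.
freeze(model, {"vision_encoder", "text_encoder"})

# Dummy gradients of 1.0 for every weight.
grads = {m.name: [1.0] * len(m.weights) for m in model}
sgd_step(model, grads)

print(trainable_fraction(model))
```

In this toy setup only the fusion module's weights change after the update step, while the frozen encoders keep their initial values. The compute saving in a real system comes from not having to compute or store gradients and optimizer state for the frozen parameters; which parts are actually frozen in the paper's experiments is not specified in the abstract.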