Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
This work addresses how well multimodal models retain the language capabilities of their base language models, a question relevant to researchers and developers. Its contribution is incremental, since it builds on existing extension methods.
The study evaluates how well vision-and-language models preserve the natural language understanding (NLU) capability of their base language models by testing them on the GLUE benchmark. Dual-stream models did not significantly outperform single-stream models, and pre-training caused NLU performance to drop in most cases.
One method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension is intended to let the V&L model inherit the natural language understanding (NLU) capability of the original language model. To measure how well this is achieved, we propose evaluating V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training procedure. Dual-stream models achieve higher modality independence by approximately doubling the number of parameters, and are therefore expected to preserve NLU capability better. Contrary to this expectation, our main finding is that the dual-stream scores differ little from the single-stream scores. Further analysis shows that, with few exceptions, pre-training causes the performance drop on NLU tasks. These results suggest that adopting a single-stream structure and carefully designing the pre-training could be an effective way to better preserve language knowledge in V&L extensions.
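The comparison described above can be sketched as a per-task score difference between a base language model and its V&L extension. This is a minimal illustration, not the authors' evaluation code: the function names `glue_deltas` and `summarize` are hypothetical, and the inputs are assumed to be dictionaries mapping GLUE task names to scores that were computed elsewhere.

```python
from statistics import mean


def glue_deltas(base_scores, vl_scores):
    """Per-task score change: V&L model score minus base language model score.

    Both arguments are assumed to map a GLUE task name to a score.
    Only tasks present in both dictionaries are compared.
    """
    shared = sorted(set(base_scores) & set(vl_scores))
    return {task: vl_scores[task] - base_scores[task] for task in shared}


def summarize(deltas):
    """Mean score change and the list of tasks whose score dropped."""
    dropped = [task for task, delta in deltas.items() if delta < 0]
    return {"mean_delta": mean(deltas.values()), "dropped_tasks": dropped}
```

Under these assumptions, a negative `mean_delta` would indicate that the extension lost NLU performance on average, and `dropped_tasks` would list where the loss occurred. Running the same summary for a single-stream and a dual-stream model gives one simple way to compare how well each structure preserves the base model's scores.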