CV CL Sep 18, 2023

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

arXiv:2309.09958v1 | 21.843 citations | h-index: 98 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of scaling studies for open-source LMMs and makes state-of-the-art research more accessible for future work, though its contribution is incremental.

The study investigates scaling visual instruction-tuned large multimodal models (LMMs) up to 65B/70B parameters, finding that scaling consistently enhances performance and improves language capabilities, with LoRA/QLoRA tuning achieving results comparable to full-model fine-tuning.

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These factors are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and of mixing multimodal-language data for improving LMM performance, and finds that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
