CV, CL, LG · Sep 30, 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

arXiv:2409.20566v1 · 73 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work is an incremental advance in multimodal large language models, showing researchers and developers how data curation affects performance.

The paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, with improved text-rich image understanding, visual referring and grounding, and multi-image reasoning. The gains come from systematically exploring diverse data mixtures, including high-quality OCR data and synthetic captions, and they yield strong performance even at small scales (1B and 3B).

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

