Otter: A Multi-Modal Model with In-Context Instruction Tuning
This work addresses the need for more effective multi-modal assistants by improving instruction following with in-context examples. Because it builds on existing architectures such as Flamingo, the contribution is incremental.
The paper aims to improve the instruction-following capability of Large Multimodal Models (LMMs). It introduces Otter, a model that uses both textual and visual in-context examples for instruction tuning, and reports that this approach substantially improves model convergence and generalization. Otter is trained on the MIMIC-IT dataset of over 3 million multi-modal instruction-response pairs and excels at complex video and multi-image understanding tasks.
Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion of employing both images and text as in-context examples to enhance instruction-following capability. To bridge this gap, we introduce the \textbf{Otter} model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction-tuned to serve as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the \textbf{MIMIC-IT} (\textbf{M}ult\textbf{I}-\textbf{M}odal \textbf{I}n-\textbf{C}ontext \textbf{I}nstruction \textbf{T}uning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.
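To make the idea of multi-modal in-context examples concrete, the sketch below assembles one training prompt from a few (visual, instruction, response) triplets followed by a query. It is a minimal illustration only: the `Example` class, the `build_in_context_prompt` function, the `<image>` and `<|endofchunk|>` placeholder tokens, and the `User:`/`Assistant:` role labels are all assumptions introduced here, not the format specified in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One instruction-response pair tied to a visual input (hypothetical schema)."""
    visual_id: str      # assumed identifier for an image or video clip
    instruction: str
    response: str


def build_in_context_prompt(
    context: List[Example],
    query_visual_id: str,
    query_instruction: str,
    image_token: str = "<image>",        # assumed placeholder token
    end_token: str = "<|endofchunk|>",   # assumed separator between examples
) -> Tuple[str, List[str]]:
    """Interleave in-context examples with the query.

    Returns the text prompt and the visual ids in the order in which
    their placeholder tokens appear in the prompt.
    """
    parts: List[str] = []
    visuals: List[str] = []
    # Each in-context example contributes one visual and a complete answer.
    for ex in context:
        parts.append(
            f"{image_token}User: {ex.instruction} Assistant: {ex.response}{end_token}"
        )
        visuals.append(ex.visual_id)
    # The query has a visual and an instruction but leaves the answer open.
    parts.append(f"{image_token}User: {query_instruction} Assistant:")
    visuals.append(query_visual_id)
    return "".join(parts), visuals


context = [
    Example("img_001", "What is the animal doing?", "It is swimming."),
    Example("img_002", "What is the animal doing?", "It is sleeping."),
]
prompt, visuals = build_in_context_prompt(
    context, "img_003", "What is the animal doing?"
)
```

Under these assumptions, the model would condition on two fully answered visual examples before producing a response for the third visual.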