CVAICLLGAug 9, 2024

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

arXiv:2408.04840v2288 citationsh-index: 28
AI Analysis

This addresses the problem of long image-sequence understanding for MLLM users, representing an incremental improvement with novel architectural components.

The paper tackles the challenge of modeling long image sequences in multi-modal large language models (MLLMs) by introducing mPLUG-Owl3, which achieves state-of-the-art performance on single-image, multi-image, and video benchmarks for models of similar size.

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes