CV | Dec 5, 2024

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

arXiv:2412.04317v1 | 12 citations | h-index: 25 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users of multimodal AI systems, though it appears incremental because it builds on existing tiny-MLLM efforts.

The paper tackles the slow response and high latency of multimodal large language models (MLLMs) by proposing FlashSloth, which compresses visual tokens to improve efficiency while maintaining performance. Compared with advanced tiny MLLMs such as InternVL2, MiniCPM-V2, and Qwen2-VL, it reduces the number of visual tokens, training memory, and computational complexity.

Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., they suffer from slow response and high latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a set of tiny but strong MLLMs is also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, FlashSloth can greatly reduce the number of visual tokens, training memory and computational complexity while retaining high performance on various VL tasks.
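To make the idea of compressing visual tokens under instruction guidance more concrete, here is a minimal sketch of one generic approach: pooling consecutive groups of visual tokens, with each token weighted by its similarity to an instruction embedding. This is not FlashSloth's actual design, which the abstract does not specify. The function name `compress_tokens`, the fixed `group_size`, and the dot-product scoring are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_tokens(visual_tokens, instruction_vec, group_size):
    """Hypothetical illustration, not the paper's method.

    Pools every `group_size` consecutive visual tokens into one token.
    Inside a group, tokens are weighted by the softmax of their dot
    product with the instruction embedding, so tokens that match the
    instruction contribute more to the pooled result.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    compressed = []
    for start in range(0, len(visual_tokens), group_size):
        group = visual_tokens[start:start + group_size]
        weights = softmax([dot(tok, instruction_vec) for tok in group])
        dim = len(group[0])
        pooled = [
            sum(w * tok[d] for w, tok in zip(weights, group))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


if __name__ == "__main__":
    # Eight toy 2-d "visual tokens" and a toy instruction embedding.
    tokens = [[float(i), 1.0] for i in range(8)]
    instruction = [0.5, 0.0]
    out = compress_tokens(tokens, instruction, group_size=4)
    print(len(tokens), "->", len(out), "tokens")
```

With `group_size=4`, eight input tokens become two, so the language model would process a quarter as many visual tokens. How much accuracy survives that reduction depends on how the weights are computed, which is the part the paper's compression designs address.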

Code Implementations: 1 repo
