LG | AI | Sep 16, 2025

Sparse Training Scheme for Multimodal LLM

arXiv:2509.18150v1 | h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses training bottlenecks faced by researchers and practitioners working with multimodal AI models, though it appears to be an incremental optimization of existing methods.

The paper tackles the inefficiency of training multimodal large language models (MLLMs) due to long input sequences and low computational utilization, proposing a Sparse Training Scheme (STS) that reduces training time by 40% while maintaining competitive performance on benchmarks.

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient due to the significantly longer input sequences introduced by multimodal data and the low utilization of inter-layer computations. To address this challenge, we shift the focus to the training process itself and propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token Compressor, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by dynamically skipping unnecessary layers in the language model during both forward and backward passes. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.
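The two components named in the abstract can be illustrated with a small toy sketch. The abstract does not say how the Visual Token Compressor compresses tokens or how the Layer Dynamic Skipper decides which layers to skip, so everything below is an assumption made for illustration: average pooling stands in for the compressor, a random per-layer skip decision stands in for the skipper, and the names `compress_visual_tokens`, `run_layers_with_skipping`, `ratio`, and `skip_prob` are hypothetical. It is not the authors' implementation.

```python
import random


def compress_visual_tokens(tokens, ratio):
    """Average-pool each group of `ratio` consecutive token vectors into one.

    ASSUMPTION: average pooling is a stand-in; the paper's actual
    compressor is not described in the abstract.
    """
    compressed = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        pooled = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
        compressed.append(pooled)
    return compressed


def run_layers_with_skipping(hidden, layers, skip_prob, rng):
    """Apply each layer unless it is skipped; a skipped layer is the identity.

    ASSUMPTION: a random skip decision is a stand-in for the paper's
    dynamic criterion. In an autograd framework, a layer that is never
    called in the forward pass also does no work in the backward pass.
    """
    executed = []
    for index, layer in enumerate(layers):
        if rng.random() < skip_prob:
            continue  # skip: the hidden state passes through unchanged
        hidden = layer(hidden)
        executed.append(index)
    return hidden, executed


# Toy data: 8 visual tokens of dimension 2, compressed 4x down to 2 tokens.
visual_tokens = [[float(i), float(i) + 1.0] for i in range(8)]
compressed = compress_visual_tokens(visual_tokens, ratio=4)

# Toy "language model": 4 layers that each add 1 to every hidden value.
toy_layers = [lambda h: [x + 1.0 for x in h] for _ in range(4)]
hidden, executed = run_layers_with_skipping(
    [0.0, 0.0], toy_layers, skip_prob=0.5, rng=random.Random(0)
)
print(len(visual_tokens), "->", len(compressed), "visual tokens")
print("executed layers:", executed)
```

In this toy setup, the sequence length seen by the language model shrinks by the pooling ratio, and each skipped layer removes one layer's worth of compute for that step. The actual savings reported for STS depend on design details that the abstract does not give.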
