GR CV IR IT · May 19, 2025

AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning

arXiv:2505.12782v1 · 4.33 citations · h-index: 3 · IROS

Originality: Incremental advance

AI Analysis

This work addresses efficiency bottlenecks for researchers and practitioners who use 3D LMMs for scene understanding. It is an incremental advance: it optimizes existing architectures rather than introducing a new paradigm.

The paper tackles computational inefficiency in 3D Large Multimodal Models (LMMs) caused by redundant spatial tokens. It proposes AdaToken-3D, which dynamically prunes these tokens, and reports 21% faster inference and a 63% reduction in FLOPs on LLaVA-3D while maintaining the original task accuracy.

Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B parameter 3D-LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally (<5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
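
The abstract describes pruning spatial tokens according to their measured contribution but does not give the scoring rule. The sketch below illustrates the general idea of attention-based token pruning only and is not the authors' algorithm. It assumes a hypothetical matrix `attn[q][s]` holding the attention weight from query token `q` to spatial token `s`, scores each spatial token by the mean attention it receives, and keeps a fixed fraction of the highest-scoring tokens. The function names and the `keep_ratio` parameter are illustrative assumptions.

```python
from typing import List


def spatial_token_scores(attn: List[List[float]]) -> List[float]:
    """Mean attention each spatial token receives across all query rows.

    attn[q][s] is an assumed attention weight from query token q to
    spatial token s; this scoring rule is illustrative, not the paper's.
    """
    if not attn:
        return []
    num_queries = len(attn)
    num_spatial = len(attn[0])
    return [sum(row[s] for row in attn) / num_queries for s in range(num_spatial)]


def select_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring tokens, returned in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token so the sequence never becomes empty.
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream layers see tokens in sequence.
    return sorted(ranked[:num_keep])
```

For example, with two query rows `[0.50, 0.02, 0.30, 0.03, 0.15]` and `[0.40, 0.04, 0.40, 0.01, 0.15]` and `keep_ratio=0.4`, the sketch keeps spatial tokens 0 and 2 and drops the three tokens that receive little attention. A fixed ratio is the simplest choice; the abstract indicates that AdaToken-3D instead adapts its pruning strategy to each 3D LMM architecture.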

