LG AI CL CV May 23, 2025

LatentLLM: Attention-Aware Joint Tensor Compression

Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand

arXiv:2505.18413v1, 4 citations, h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive nature of foundation models for AI practitioners, though it appears incremental because it builds on existing tensor decomposition techniques.

The paper tackles the computational and memory inefficiency of large language and multi-modal models by proposing a framework that converts them into a reduced-dimension latent structure, achieving significant accuracy improvements over existing compression methods with concrete reductions in latent dimensions.

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
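To make the phrase "local activation-aware tensor decomposition" concrete, here is a minimal sketch of one common form of that idea: a low-rank factorization of a single weight matrix that minimizes the error on the layer's outputs over calibration activations, rather than the error on the weights themselves. This is not the paper's algorithm. The function names, the regularizer `eps`, and the whitening via a symmetric square root of the activation covariance are all assumptions chosen for illustration, and the global attention-aware joint step described in the abstract is not covered.

```python
import numpy as np


def activation_aware_low_rank(W, X, rank, eps=1e-6):
    """Hypothetical helper: factor W (d_out x d_in) as B @ A with the given rank,
    minimizing ||W X - B A X||_F for calibration activations X (d_in x n)."""
    d_in = W.shape[1]
    # Regularized activation covariance (eps is an illustrative assumption).
    C = X @ X.T / X.shape[1] + eps * np.eye(d_in)
    # Symmetric square root of C and its inverse via eigendecomposition.
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, eps, None)
    S = (evecs * np.sqrt(evals)) @ evecs.T
    S_inv = (evecs / np.sqrt(evals)) @ evecs.T
    # Truncated SVD of the whitened weight W @ S, then undo the whitening.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    B = U[:, :rank] * s[:rank]   # d_out x rank
    A = Vt[:rank] @ S_inv        # rank x d_in
    return B, A


def plain_low_rank(W, rank):
    """Baseline: truncated SVD of W, ignoring the activations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]
```

At inference, `B @ (A @ x)` would replace `W @ x`, so the layer passes through a `rank`-dimensional latent vector. On the calibration activations, the output error of the activation-aware factors is no larger than that of the plain truncated SVD at the same rank.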
