LatentLLM: Attention-Aware Joint Tensor Compression
This work addresses the heavy compute and memory cost of foundation models, a practical concern for AI practitioners, though the contribution appears incremental because it builds on existing tensor decomposition techniques.
The paper tackles the computational and memory inefficiency of large language and multi-modal models by proposing a framework that converts them into a reduced-dimension latent structure. It reports significantly better accuracy than existing compression methods when the latent dimension is reduced.
Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve model accuracy over existing model compression methods when reducing the latent dimension to realize computationally and memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
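To make the "local activation-aware" starting point more concrete, the sketch below shows one common way to compress a single linear layer: choose a low-rank factorization that minimizes the error in the layer's output on calibration activations, rather than the error in the weights themselves. This is a generic illustration under stated assumptions, not the paper's algorithm. The attention-aware joint decomposition is not reproduced here, and the function names, the Cholesky-based whitening, and the `eps` regularizer are all choices made for this example.

```python
import numpy as np


def activation_aware_low_rank(W, X, rank, eps=1e-6):
    """Factor W (d_out x d_in) as B @ A so that ||W X - B A X||_F is small.

    X holds calibration activations, one column per sample (d_in x n).
    Illustrative helper; not taken from the paper.
    """
    d_in, n = X.shape
    # Regularized second-moment matrix of the activations.
    # eps (an assumption of this sketch) keeps it positive definite.
    C = X @ X.T / n + eps * np.eye(d_in)
    L = np.linalg.cholesky(C)  # C = L @ L.T with L lower triangular
    # Truncated SVD of the whitened weight W @ L.
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    B = U[:, :rank] * s[:rank]  # d_out x rank
    # A = Vt[:rank] @ inv(L), computed with a solve instead of an explicit inverse.
    A = np.linalg.solve(L.T, Vt[:rank].T).T  # rank x d_in
    return B, A


def plain_low_rank(W, rank):
    """Baseline: truncated SVD of W that ignores the activations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]


def output_error(W, B, A, X):
    """Frobenius norm of the layer-output error on the activations X."""
    return np.linalg.norm(W @ X - B @ (A @ X))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 16))
    # Synthetic activations whose feature scales span two orders of magnitude.
    X = np.logspace(-1, 1, 16)[:, None] * rng.standard_normal((16, 512))
    B, A = activation_aware_low_rank(W, X, rank=4)
    Bp, Ap = plain_low_rank(W, rank=4)
    print("activation-aware error:", output_error(W, B, A, X))
    print("plain SVD error:       ", output_error(W, Bp, Ap, X))
```

Storing `B` and `A` in place of `W` replaces a `d_out x d_in` matrix with `rank * (d_out + d_in)` parameters, and the intermediate product `A @ x` is the reduced-dimension latent. A joint scheme of the kind the abstract describes would presumably pick such factors for several attention matrices together rather than one layer at a time.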