CL · Mar 14, 2025

Text Compression for Efficient Language Generation

David Gu, Peter Belcak, Roger Wattenhofer

arXiv:2503.11426v117.011 citations · h-index: 24 · NAACL

Originality: Incremental advance

AI Analysis

The work addresses efficiency bottlenecks in LLM deployment, though it appears incremental because it modifies an existing architecture rather than introducing a new one.

The paper tackles inefficient text generation in LLMs by proposing GPTHF, a hierarchical transformer that compresses text into sentence embeddings. It reports up to a 10x improvement in FLOPs efficiency and a 3x increase in runtime speed over equally-sized GPT models in the low-size regime.

We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model that generates text by compressing it into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.
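The two mechanisms named in the abstract, a sparse attention mask that depends on the input and a cache of sentence embeddings, can be illustrated with a minimal sketch. It is not the authors' implementation. The masking rule (causal attention restricted to the current sentence), the sentence-boundary tokens, and the names `sentence_ids`, `sentence_causal_mask`, and `SentenceEmbeddingCache` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a sentence ends at one of these tokens. The paper's actual
# boundary rule is not given in the abstract.
SENTENCE_END = {".", "!", "?"}


def sentence_ids(tokens: List[str]) -> List[int]:
    """Assign each token the index of the sentence it belongs to."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok in SENTENCE_END:
            current += 1  # the next token starts a new sentence
    return ids


def sentence_causal_mask(tokens: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed rule: attention is causal (j <= i) and limited to tokens of the
    same sentence. The mask is "dynamic" in that it depends on where the
    sentence boundaries fall in the given input.
    """
    ids = sentence_ids(tokens)
    n = len(tokens)
    return [[j <= i and ids[i] == ids[j] for j in range(n)] for i in range(n)]


class SentenceEmbeddingCache:
    """Memoize embeddings of completed sentences so they are computed once."""

    def __init__(self, encode: Callable[[Tuple[str, ...]], object]) -> None:
        self._encode = encode  # hypothetical stand-in for the sentence encoder
        self._store: Dict[Tuple[str, ...], object] = {}
        self.encoder_calls = 0

    def get(self, sentence: Tuple[str, ...]) -> object:
        if sentence not in self._store:
            self.encoder_calls += 1
            self._store[sentence] = self._encode(sentence)
        return self._store[sentence]
```

Under these assumptions, a token in the second sentence cannot attend to any token of the first, and a sentence that has already been encoded is served from the cache instead of being passed through the encoder again, which is the kind of reuse the abstract credits for the speedup.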
