LG AI May 4

Gated Subspace Inference for Transformer Acceleration

arXiv:2605.03109

AI Analysis

This work addresses the memory bandwidth bottleneck in transformer inference for practitioners deploying large language models, offering a practical speedup without retraining or architectural changes.

The paper introduces a method to accelerate transformer inference by exploiting low-rank token activations, achieving 3.0x-10.5x speedups on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98% on GPT-2, GPT-J, and OPT models.

A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the attention mechanism. At the operating point (k = 256, ε = 0.05) on GPT-J 6B, the accelerated model produces character-for-character identical output to the baseline.
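
The decompose, project, and gate steps described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation; several details are assumptions made for illustration. The subspace basis `U` is taken from an SVD of synthetic calibration activations, the "cached low-rank weight image" is assumed to be the product `W @ U`, and the gate is assumed to compare the relative residual norm against the tolerance `eps`. The function names `fit_subspace` and `gated_linear` are hypothetical.

```python
import numpy as np

def fit_subspace(calib_acts, k):
    """Return a (d_in, k) orthonormal basis: the top-k right singular vectors
    of the calibration activations (assumed way of finding the subspace)."""
    _, _, vt = np.linalg.svd(calib_acts, full_matrices=False)
    return vt[:k].T

def gated_linear(x, W, U, WU, eps):
    """Approximate x @ W.T for a batch of tokens.

    x:  (n_tokens, d_in) activations
    W:  (d_out, d_in) full weight matrix
    U:  (d_in, k) orthonormal subspace basis
    WU: (d_out, k) cached image W @ U, computed once offline
    """
    coeffs = x @ U                    # subspace coordinates, (n_tokens, k)
    residual = x - coeffs @ U.T       # part of x outside span(U)
    y = coeffs @ WU.T                 # low-rank path: touches only WU
    # Assumed gate: relative residual norm compared against eps.
    rel = np.linalg.norm(residual, axis=1) / (np.linalg.norm(x, axis=1) + 1e-12)
    needs_fix = rel > eps
    if needs_fix.any():
        # Residual correction: the only step that reads the full W.
        y[needs_fix] += residual[needs_fix] @ W.T
    return y, needs_fix

rng = np.random.default_rng(0)
d_in, d_out, k, eps = 64, 32, 8, 0.05

# Synthetic calibration activations lying near a k-dimensional subspace.
calib = rng.standard_normal((200, k)) @ rng.standard_normal((k, d_in))
calib += 0.01 * rng.standard_normal(calib.shape)

W = rng.standard_normal((d_out, d_in))
U = fit_subspace(calib, k)
WU = W @ U                            # cached low-rank weight image

on_subspace = rng.standard_normal((4, k)) @ U.T   # tokens inside span(U)
off_subspace = rng.standard_normal((1, d_in))     # token in a generic direction
x = np.vstack([on_subspace, off_subspace])

y, corrected = gated_linear(x, W, U, WU, eps)
print("residual correction applied per token:", corrected)
print("max abs error vs. full matmul:", np.abs(y - x @ W.T).max())
```

Under these assumptions, tokens that the gate skips read only the `d_out × k` cached image instead of the `d_out × d_in` weight matrix, and their error is bounded by the size of the discarded residual. Tokens that fail the gate pay for the full weight read and recover the exact linear-layer output.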
