Stephen J. Thomas

2papers

2 Papers

6.4LGMay 4
Cascade Token Selection for Transformer Attention Acceleration

Stephen J. Thomas

A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects $r \ll T$ representative tokens at each layer via a Gram threshold and computes attention on the compressed $r \times r$ problem, but the selection requires a $T \times T$ Gram matrix at every layer. The cascade mechanism introduced here inherits the representative set from layer $l$ to layer $l+1$, validates it via a $(T - r) \times r$ cross-Gram computation, and updates it with a small number of additions and removals. The cost of the selection step drops from $O(T^2 d)$ to $O(T r d)$ per layer. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates Gram operation savings of $22\%$ to $63\%$ with mean Jaccard overlap of $0.83$ to $0.94$ between consecutive layers. The cascade reveals that the set of informative tokens is a structural property of the input that propagates coherently through the depth of the network: the same tokens carry the non-redundant information at layer $l$ and at layer $l+1$.

31.9LGMay 4
Gated Subspace Inference for Transformer Acceleration

Stephen J. Thomas

A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the attention mechanism. At the operating point (k = 256, ε = 0.05) on GPT-J 14 6B, the accelerated model produces character-for-character identical output to the baseline.