LG · May 9

VORT: Adaptive Power-Law Memory for NLP Transformers

arXiv:2605.08966 · 6.9

Predicted impact: top 88% in LG · last 90 days
Originality: Highly original

AI Analysis

For NLP practitioners, VORT addresses the mismatch between the near-exponential decay that standard Transformers impose on distant tokens and the power-law dependencies of natural language, offering a memory architecture that matches those dependencies more closely.

VORT introduces a learnable power-law memory kernel for Transformers, aimed at modeling long-range dependencies. The kernel is approximated by a sum of S = O(log(T/ε)) exponentials, which gives O(S d_v) per-step complexity, and the paper reports advantages on two synthetic benchmarks.
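As a short worked step on where the O(S d_v) per-step cost comes from: suppose the retention kernel on lags n ≥ 1 has already been approximated by a sum of S exponentials, κ(n) ≈ Σ_j w_j e^{-λ_j n}. The symbols κ, w_j, λ_j, m_t^{(j)} and M_t below are notation assumed here for illustration and are not taken from the paper.

```latex
% Assumed notation (not from the paper): values v_k \in \mathbb{R}^{d_v},
% SOE weights w_j and rates \lambda_j > 0, lag n = t - k + 1 \ge 1.
\begin{align*}
M_t &= \sum_{k=1}^{t} \kappa(t-k+1)\, v_k
     \;\approx\; \sum_{j=1}^{S} w_j\, m_t^{(j)},
\qquad
m_t^{(j)} := \sum_{k=1}^{t} e^{-\lambda_j (t-k+1)}\, v_k, \\
m_t^{(j)} &= e^{-\lambda_j}\bigl(m_{t-1}^{(j)} + v_t\bigr),
\qquad m_0^{(j)} = 0 .
\end{align*}
```

Each of the S states is updated from its previous value alone at O(d_v) cost, so one step costs O(S d_v) regardless of how many tokens have been ingested.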

Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the Variable-Order Retention Transformer (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order α_i ∈ [δ, 1] that governs a Grünwald-Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss-Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at O(S d_v) per step, where S = O(log(T/ε)) terms suffice for ε-uniform accuracy on the horizon [1, T]. Retrieval is keyed and associative via a linear-attention accumulator with an exact O(K S d_ϕ d_v)-per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on [δ, 1] with correct analysis near α = 0; (iii) a direct L^2 energy argument (Proposition) showing that for α > 1/2 any mixture with fixed minimum decay rate Λ > 0 incurs L^2([1, T]) error at least N_α(T) - C(Λ) → ∞, with the Λ-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak-Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel's advantage.
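The sum-of-exponentials idea in the abstract can be illustrated with a minimal sketch. It is not the paper's construction and makes several assumptions: the pure power law n^(-α) stands in for the Grünwald-Letnikov weights, a single fixed α replaces the per-token α_i, the values are scalars, and the quadrature is a plain trapezoidal rule applied after the log change of variables s = e^y in the Laplace-type identity n^(-α) = (1/Γ(α)) ∫ s^(α-1) e^(-sn) ds, swapped in here for Gauss-Laguerre to keep the example dependency-free. All function names and default parameters are hypothetical.

```python
import math

def soe_power_law(alpha, h=0.5, y_min=-26.0, y_max=4.0):
    """Return (rates, weights) such that n**-alpha ~= sum(w * exp(-lam * n)).

    Trapezoidal rule with step h on the integral over y of
    exp(alpha * y - n * exp(y)) / Gamma(alpha), truncated to [y_min, y_max].
    The defaults are illustrative choices, not values from the paper.
    """
    num = int(round((y_max - y_min) / h))
    rates, weights = [], []
    for k in range(num + 1):
        y = y_min + k * h
        rates.append(math.exp(y))                                   # decay rate lam_j
        weights.append(h * math.exp(alpha * y) / math.gamma(alpha))  # weight w_j
    return rates, weights

def soe_kernel(n, rates, weights):
    """Evaluate the sum-of-exponentials approximation at lag n."""
    return sum(w * math.exp(-lam * n) for lam, w in zip(rates, weights))

def recurrent_memory(values, rates, weights):
    """Power-law weighted memory via one Markovian state per exponential.

    State update: m_j <- exp(-lam_j) * (m_j + v_t); readout: sum_j w_j * m_j.
    """
    decay = [math.exp(-lam) for lam in rates]
    state = [0.0] * len(rates)
    out = []
    for v in values:
        state = [d * (m + v) for d, m in zip(decay, state)]
        out.append(sum(w * m for w, m in zip(weights, state)))
    return out

def direct_memory(values, kernel):
    """Reference O(T^2) evaluation: out[t] = sum_k kernel(t - k + 1) * values[k]."""
    return [
        sum(values[k] * kernel(t - k + 1) for k in range(t + 1))
        for t in range(len(values))
    ]

if __name__ == "__main__":
    alpha = 0.5
    rates, weights = soe_power_law(alpha)
    worst = max(abs(soe_kernel(n, rates, weights) - n ** -alpha)
                for n in range(1, 1001))
    print(f"S = {len(rates)} exponentials, max abs error on [1, 1000]: {worst:.2e}")
```

With these defaults the sketch uses 61 exponentials. Because the nodes are equally spaced in y = log s, extending the horizon T or tightening ε only requires widening [y_min, y_max] by an amount that grows logarithmically, which is the intuition behind the S = O(log(T/ε)) term count quoted above.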
