T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET
This work addresses efficiency bottlenecks in deploying transformer models on resource-constrained hardware. Its contribution is incremental, built from a set of specific optimizations.
The paper tackles high external memory access and low hardware utilization in transformer inference by combining training, compression, and hardware mechanisms, achieving a latency of 68-567 μs/token and an energy consumption of 0.41-3.95 μJ/token.
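To put the reported ranges in more familiar units, the sketch below converts per-token latency to tokens per second and per-token energy to total energy for a sequence. It assumes the latency figure is a steady-state time per token and that tokens are processed back to back; the function names and the 1000-token example length are illustrative and not taken from the paper. It does not derive power, because the summary does not say which latency endpoint pairs with which energy endpoint.

```python
def tokens_per_second(latency_us_per_token: float) -> float:
    """Convert a per-token latency in microseconds to tokens per second."""
    if latency_us_per_token <= 0:
        raise ValueError("latency must be positive")
    return 1e6 / latency_us_per_token


def sequence_energy_uj(num_tokens: int, energy_uj_per_token: float) -> float:
    """Total energy in microjoules for num_tokens at a fixed per-token cost."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens * energy_uj_per_token


if __name__ == "__main__":
    # Endpoints of the ranges quoted in the summary.
    for latency_us in (68.0, 567.0):
        rate = tokens_per_second(latency_us)
        print(f"{latency_us:.0f} us/token -> {rate:.0f} tokens/s")

    # Hypothetical 1000-token sequence, chosen only for illustration.
    for energy_uj in (0.41, 3.95):
        total = sequence_energy_uj(1000, energy_uj)
        print(f"{energy_uj} uJ/token -> {total:.0f} uJ per 1000 tokens")
```

Under these assumptions the latency range corresponds to roughly 1.8 thousand to 14.7 thousand tokens per second, and a 1000-token sequence would cost between about 0.41 mJ and 3.95 mJ.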
The work introduces training and post-training compression schemes that reduce external memory access during transformer inference. In addition, a control flow mechanism called dynamic batching and a buffer architecture called a two-direction accessible register file further reduce external memory access while improving hardware utilization.