AR AI Mar 1, 2025

T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET

Seunghyun Moon, Mao Li, Gregory Chen, Phil Knag, Ram Krishnamurthy, Mingoo Seok

arXiv:2503.00322v11.2 h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses efficiency bottlenecks in deploying transformer models on resource-constrained hardware. It is an incremental improvement built on specific optimizations.

The paper tackles high external memory access and low hardware utilization in transformer inference by introducing novel training, compression, and hardware mechanisms, achieving a latency of 68-567 μs/token and an energy consumption of 0.41-3.95 μJ/token.

This work introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference. Additionally, a new control flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
