LG · Sep 25, 2022

SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko, David Brooks, Gu-Yeon Wei, H. T. Kung

arXiv:2209.12127v3 · 1.8 · h-index: 63

Originality: Highly original

AI Analysis

This work addresses the need to deploy high-performance transformer models in latency-sensitive environments. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper optimizes transformer models under the inference latency constraints common in industry applications. It introduces SpeedLimit, a Neural Architecture Search technique that incorporates 8-bit integer quantization into the search process, and it reports outperforming the current state-of-the-art technique in balancing accuracy and latency.

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
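
The core idea in the abstract, maximizing accuracy subject to an upper bound on latency, can be illustrated with a minimal sketch. This is not the SpeedLimit search procedure itself, which the abstract does not detail. The `Candidate` type, the `best_under_latency` helper, and all numbers below are hypothetical and only show what selecting under a latency cap might look like once each candidate architecture has been quantized and measured.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical searched architecture with its measured metrics."""
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # assumed to be measured after 8-bit integer quantization


def best_under_latency(candidates: Iterable[Candidate],
                       latency_limit_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose latency is within the limit."""
    feasible = [c for c in candidates if c.latency_ms <= latency_limit_ms]
    if not feasible:
        return None  # no architecture satisfies the latency constraint
    return max(feasible, key=lambda c: c.accuracy)


# Illustrative values only; these are not results from the paper.
pool = [
    Candidate("small", accuracy=0.80, latency_ms=4.0),
    Candidate("medium", accuracy=0.85, latency_ms=9.0),
    Candidate("large", accuracy=0.90, latency_ms=21.0),
]
choice = best_under_latency(pool, latency_limit_ms=10.0)
```

In this sketch the latency bound acts as a hard filter, so a more accurate but slower candidate is never selected. A real search would generate and evaluate candidates iteratively rather than pick from a fixed pool.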

