LG · SP · Mar 11

TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

arXiv:2603.19296 · 90.1 · h-index: 6

AI Analysis

TTQ addresses the problem of domain shift in activation-aware compression for large language models, enabling faster inference across diverse downstream tasks.

The paper tackles the computational demand of large foundation models by proposing a test-time quantization framework that compresses models on the fly at inference time, achieving improved quantization performance over state-of-the-art baselines.

To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ can improve quantization performance over state-of-the-art baselines.
