LG AR Jun 16, 2024

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

arXiv:2406.12930v1, 44 citations
Originality: Highly original
AI Analysis

This work addresses efficient LLM deployment on computing systems, offering an incremental improvement through a novel quantization technique.

The paper tackles efficient deployment of large language model (LLM) inference, which is constrained by high compute and memory requirements. It proposes Tender, an algorithm-hardware co-design that uses decomposed quantization to achieve higher accuracy and inference performance than state-of-the-art methods while requiring only minimal changes to existing hardware.

Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance compared to the state-of-the-art methods while also being significantly less intrusive to the existing accelerators.
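The central idea in the abstract, scale factors that are powers of two apart so partial sums can be merged without explicit requantization, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names (`quantize`, `shift_accumulate`, `decomposed_matmul`), the grouping of activation channels by their absolute maximum, and the single per-tensor weight scale are all assumptions made for illustration only.

```python
def quantize(value, scale, qmax=127):
    """Symmetric integer quantization with clipping to [-qmax, qmax]."""
    q = round(value / scale)
    return max(-qmax, min(qmax, q))


def shift_accumulate(partials):
    """Combine integer partial sums whose scales are base * 2**g.

    partials[g] belongs to group g. Walking from the largest scale down,
    each step shifts the accumulator left by one bit and adds the next
    partial sum, which yields sum(partials[g] * 2**g) using only integer
    shifts and adds (no dequantize/quantize round trip).
    """
    acc = 0
    for partial in reversed(partials):
        acc = (acc << 1) + partial
    return acc


def decomposed_matmul(X, W, num_groups=4, qmax=127):
    """Approximate X @ W with channel-decomposed integer quantization.

    Assumption (hypothetical scheme): each activation channel (column of X)
    is assigned to the smallest group whose range covers its absolute max.
    """
    m, k, n = len(X), len(W), len(W[0])
    ch_max = [max(abs(X[i][c]) for i in range(m)) for c in range(k)]
    top = max(ch_max) or 1.0

    # Group scales are exactly a factor of two apart; the last covers `top`.
    base = top / qmax / 2 ** (num_groups - 1)
    scales = [base * 2 ** g for g in range(num_groups)]

    group_of = []
    for c in range(k):
        g = 0
        while g < num_groups - 1 and ch_max[c] > scales[g] * qmax:
            g += 1
        group_of.append(g)

    w_scale = (max(abs(w) for row in W for w in row) or 1.0) / qmax
    Wq = [[quantize(w, w_scale, qmax) for w in row] for row in W]
    Xq = [[quantize(X[i][c], scales[group_of[c]], qmax) for c in range(k)]
          for i in range(m)]

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # One integer partial sum per group of channels.
            partials = [
                sum(Xq[i][c] * Wq[c][j] for c in range(k) if group_of[c] == g)
                for g in range(num_groups)
            ]
            # A single floating-point rescale happens at the very end.
            out[i][j] = shift_accumulate(partials) * base * w_scale
    return out
```

In this sketch, channels with small values keep a fine scale while outlier channels use a coarser one, and the one-bit shift between groups stands in for the requantization step that would otherwise be needed when the groups' partial sums are added together.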

