AR · AI · Nov 26, 2024

SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors

arXiv:2411.17847v1 · 4 citations · h-index: 25 · DATE
Originality: Incremental advance
AI Analysis

This work addresses the problem of deploying Large Language Models on resource-constrained devices by reducing computational and memory overheads, though it is incremental as it builds on existing compression and hardware techniques.

The paper tackles the bottleneck that non-linear operators such as Softmax pose for Large Language Models on resource-constrained devices. It proposes SoftmAP, a software-hardware co-design that implements an integer-only, low-precision Softmax on In-Memory Compute hardware and achieves up to three orders of magnitude improvement in energy-delay product compared to GPUs.

Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
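To make "integer-only Softmax" concrete, the sketch below shows one generic way to approximate Softmax with integer arithmetic alone. It is not the SoftmAP algorithm, whose approximation and bit-widths are not described in this summary. It assumes fixed-point logits with 8 fractional bits, rewrites exp(x) as 2^(x·log2 e), and approximates the fractional power of two linearly as 2^f ≈ 1 + f. The names `int_softmax`, `FRAC_BITS`, `LOG2E_Q`, `HEADROOM`, and `OUT_BITS` are hypothetical.

```python
# Illustrative integer-only softmax. All constants are assumptions made for
# this sketch, not values taken from the paper.
FRAC_BITS = 8     # input format: real value = q / 2**FRAC_BITS
LOG2E_Q = 369     # round(log2(e) * 2**FRAC_BITS), precomputed offline
HEADROOM = 16     # extra precision bits kept before the right shift
OUT_BITS = 8      # output format: probability = p / 2**OUT_BITS


def int_softmax(q):
    """Approximate softmax over fixed-point integer logits using only
    integer add, multiply, shift, mask, and divide."""
    q_max = max(q)
    exps = []
    for v in q:
        d = v - q_max                        # subtract the max, so d <= 0
        t = (d * LOG2E_Q) >> FRAC_BITS       # base-2 exponent in fixed point
        int_part = t >> FRAC_BITS            # floor of the exponent, <= 0
        frac = t & ((1 << FRAC_BITS) - 1)    # fractional bits, in [0, 2**FRAC_BITS)
        mantissa = (1 << FRAC_BITS) + frac   # linear approximation: 2**f ~ 1 + f
        exps.append((mantissa << HEADROOM) >> (-int_part))
    total = sum(exps)                        # > 0: the max element contributes 2**(FRAC_BITS + HEADROOM)
    return [(e << OUT_BITS) // total for e in exps]
```

For real logits 1.0, 2.0, 3.0 (encoded as `[256, 512, 768]`), `int_softmax` returns `[23, 65, 167]`, i.e. roughly 0.090, 0.254, 0.652 against the floating-point values 0.090, 0.245, 0.665. The linear mantissa approximation accounts for most of the remaining error; a small lookup table or a second-order polynomial would tighten it at the cost of extra operations.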
