LG · Apr 22

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

Fei Zuo, Xiaoyan Xi, Quanyi Zeng, Feiyu Wang, Ho Fai Leung

arXiv:2604.20913 · 10.7 · h-index: 2

Predicted impact: top 91% in LG · last 90 days

Originality: Incremental advance

AI Analysis

FairyFuse offers a practical speedup for CPU-based LLM deployment by using ternary weights to eliminate floating-point multiplications. The advance is incremental: it builds on existing ternary quantization and AVX-512 techniques.

FairyFuse enables multiplication-free LLM inference on CPUs by fusing ternary weight operations into a single AVX-512 loop, achieving 32.4 tokens/s on Intel Xeon 8558P, 1.24x faster than llama.cpp Q4_K_M with near-lossless quality (perplexity 5.52 vs 5.47 FP16).

Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
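The fused, multiplication-free idea described in the abstract can be illustrated with a minimal scalar sketch. It is not the paper's AVX-512 kernel. It assumes the common widely-linear form y = U x + W conj(x) with complex U, W, and x, which expands into eight real-valued sub-GEMVs (four ternary matrices, each applied to the real and imaginary parts of x). The function name `fused_widely_linear`, the argument layout, and the list-of-lists weight storage are hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical storage: each matrix is a list of rows with entries in {-1, 0, +1}.
Ternary = Sequence[Sequence[int]]


def fused_widely_linear(
    u_re: Ternary, u_im: Ternary, w_re: Ternary, w_im: Ternary,
    x_re: Sequence[float], x_im: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Compute y = U x + W conj(x) in one pass using only additions and subtractions.

    Assumed expansion (standard complex arithmetic, not taken from the paper):
      y_re = U_re x_re - U_im x_im + W_re x_re + W_im x_im
      y_im = U_re x_im + U_im x_re - W_re x_im + W_im x_re
    """
    y_re: List[float] = []
    y_im: List[float] = []
    for i in range(len(u_re)):
        acc_re = 0.0
        acc_im = 0.0
        for j in range(len(x_re)):
            xr, xi = x_re[j], x_im[j]

            t = u_re[i][j]          # +U_re*x_re -> re, +U_re*x_im -> im
            if t == 1:
                acc_re += xr; acc_im += xi
            elif t == -1:
                acc_re -= xr; acc_im -= xi

            t = u_im[i][j]          # -U_im*x_im -> re, +U_im*x_re -> im
            if t == 1:
                acc_re -= xi; acc_im += xr
            elif t == -1:
                acc_re += xi; acc_im -= xr

            t = w_re[i][j]          # +W_re*x_re -> re, -W_re*x_im -> im
            if t == 1:
                acc_re += xr; acc_im -= xi
            elif t == -1:
                acc_re -= xr; acc_im += xi

            t = w_im[i][j]          # +W_im*x_im -> re, +W_im*x_re -> im
            if t == 1:
                acc_re += xi; acc_im += xr
            elif t == -1:
                acc_re -= xi; acc_im -= xr
            # A weight of 0 contributes nothing, so it is a no-op.
        y_re.append(acc_re)
        y_im.append(acc_im)
    return y_re, y_im
```

Each weight selects an add, a subtract, or a skip, so the inner loop contains no floating-point multiplications. A vectorized kernel would presumably replace the branches with mask-driven vector additions and subtractions; the branching version here only shows the arithmetic being fused.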
