CL AI, May 7

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal

arXiv:2605.06485

AI Analysis

This work makes LLM inference practical on consumer CPUs, removing the dependence on expensive GPUs that leaves over one billion personal computers underutilized for AI workloads.

Litespark-Inference enables efficient inference of ternary neural networks on consumer CPUs by replacing matrix multiplication with integer addition/subtraction via custom SIMD kernels, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction over standard PyTorch on Apple Silicon, with similar gains on Intel/AMD.

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
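To make the central idea concrete, the following is a minimal scalar sketch of why ternary weights remove the need for multiplication: each weight either adds the activation, subtracts it, or skips it. This is not the Litespark-Inference kernel, which the abstract describes as SIMD code built on integer dot product instructions; the function names `ternary_matvec` and `dense_matvec` and the example values are hypothetical and chosen only for illustration.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1}.

    Illustrative only: uses additions and subtractions, no multiplications.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so it is skipped
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation using ordinary multiply-accumulate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical example: a 3x4 ternary weight matrix and integer activations.
W = [[1, 0, -1, 1],
     [0, -1, -1, 0],
     [1, 1, 0, -1]]
x = [3, -2, 5, 7]

# Both routes give the same result; only the arithmetic differs.
assert ternary_matvec(W, x) == dense_matvec(W, x)
print(ternary_matvec(W, x))  # [5, -3, -6]
```

A real kernel would process many weights per instruction and store them in a packed form rather than branching per element; the sketch only shows that the result matches a dense product.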
