CL | Oct 21, 2024

1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

arXiv:2410.16144v2 | 15 citations | h-index: 27 | Has Code
Originality: Incremental advance
AI Analysis

This work enables faster and more energy-efficient local deployment of LLMs across various devices, representing an incremental improvement in inference optimization.

The authors tackled the challenge of efficient inference for 1-bit LLMs by developing bitnet.cpp, a software stack that achieved speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs for BitNet b1.58 models.

Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
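To make the idea of a ternary kernel concrete, here is a minimal sketch in plain Python. It is not the bitnet.cpp implementation: the 2-bit packing layout and the function names `pack_ternary` and `ternary_dot` are assumptions chosen for illustration only. It shows why weights restricted to -1, 0, and +1 need no multiplications, and why integer accumulation gives exactly the same result as a reference dot product.

```python
def pack_ternary(weights):
    """Pack ternary weights into bytes, four 2-bit codes per byte.

    Hypothetical layout for illustration: -1 -> 0, 0 -> 1, +1 -> 2.
    """
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0, or +1")
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(packed)


def ternary_dot(packed, activations):
    """Dot product of packed ternary weights with integer activations.

    Only additions and subtractions are used. Padding slots in the last
    byte are never read because the loop runs over the activations.
    """
    acc = 0
    for i, a in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 2:      # weight +1
            acc += a
        elif code == 0:    # weight -1
            acc -= a
        # code == 1 is a zero weight and contributes nothing
    return acc


weights = [1, -1, 0, 1, -1]
activations = [3, 5, -7, 2, 4]
packed = pack_ternary(weights)
reference = sum(w * a for w, a in zip(weights, activations))
result = ternary_dot(packed, activations)
```

Here `result` equals `reference` exactly, since both are computed in integer arithmetic. A production kernel would process many weights per instruction with SIMD rather than looping element by element.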

Code Implementations: 1 repo