LG | Mar 12, 2024

LookupFFN: Making Transformers Compute-lite for CPU inference

arXiv:2403.07221v1 | 12 citations | h-index: 38 | Has Code | ICML
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of CPU inference for transformers in settings where GPUs are impractical and offers a compute-lite solution, though the advance appears incremental because it builds on prior LSH-based approximations of FFNs.

The paper tackles the problem of making transformer inference more efficient on CPUs by proposing LookupFFN, an alternative to GEMM-based feed-forward networks that recasts operations as memory lookups, achieving similar performance to standard FFNs in RoBERTa pretraining while dramatically reducing FLOP requirements.

While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module which is a workhorse within modern DNN architectures, GEMM-based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM-based FFNs, inspired by recent studies that use Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as memory look-ups, leveraging the trade-off between the two resources on any platform: compute and memory (CPUs offer the latter in abundance). For RoBERTa language model pretraining, our formulation achieves similar performance compared to GEMM-based FFNs, while dramatically reducing the required FLOPs. Our development is complemented with a detailed hardware profiling of strategies that will maximize efficiency -- not just on contemporary hardware but on products that will be offered in the near/medium term future. Code is available at https://github.com/mlpen/LookupFFN.
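To make the "hash, then look up" idea in the abstract concrete, here is a minimal sketch. It is not the authors' implementation: it assumes plain random-hyperplane (sign) hashing, randomly initialised tables in place of learned ones, and hypothetical names (`make_lookup_ffn`, `bucket_index`, `lookup_ffn`) and sizes chosen only for illustration. The paper's actual hashing scheme and training procedure may differ; see the linked repository.

```python
import random


def make_lookup_ffn(d_in, d_out, num_tables, bits, seed=0):
    """Build random hyperplanes and lookup tables (illustrative, not learned)."""
    rng = random.Random(seed)
    # hyperplanes[t][b] is a length-d_in vector; its sign test yields one hash bit.
    hyperplanes = [
        [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(bits)]
        for _ in range(num_tables)
    ]
    # tables[t][bucket] is a length-d_out row; a trained model would learn these.
    tables = [
        [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(2 ** bits)]
        for _ in range(num_tables)
    ]
    return hyperplanes, tables


def bucket_index(x, planes):
    """Pack the signs of the projections of x into an integer bucket id."""
    idx = 0
    for plane in planes:
        dot = sum(p * xi for p, xi in zip(plane, x))
        idx = (idx << 1) | (1 if dot >= 0 else 0)
    return idx


def lookup_ffn(x, hyperplanes, tables):
    """Approximate an FFN output by summing one looked-up row per table."""
    d_out = len(tables[0][0])
    out = [0.0] * d_out
    for planes, table in zip(hyperplanes, tables):
        row = table[bucket_index(x, planes)]  # memory lookup, no matmul
        for j in range(d_out):
            out[j] += row[j]
    return out


def dense_ffn_macs(d_in, d_hidden, d_out):
    """Multiply-accumulates per token for a two-layer GEMM-based FFN."""
    return d_in * d_hidden + d_hidden * d_out


def lookup_ffn_macs(d_in, num_tables, bits):
    """Multiply-accumulates per token for the hashing step of this sketch.

    The gather-and-sum step adds num_tables * d_out additions but no multiplies.
    """
    return num_tables * bits * d_in


if __name__ == "__main__":
    planes, tables = make_lookup_ffn(d_in=16, d_out=8, num_tables=4, bits=3)
    x = [random.Random(1).gauss(0.0, 1.0) for _ in range(16)]
    print(lookup_ffn(x, planes, tables))
    # Illustrative sizes only (not taken from the paper).
    print(dense_ffn_macs(768, 3072, 768), lookup_ffn_macs(768, 64, 8))
```

In this sketch the only multiplications are in the hashing step; the output is assembled from table rows held in memory, which is the compute-for-memory trade the abstract describes. The table count and bit width control how much memory is spent.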

Code Implementations: 1 repo
