CL, LG
Feb 3, 2025

Scaling Embedding Layers in Language Models

arXiv:2502.01637v3
16 citations
h-index: 22
Originality: Highly original
AI Analysis

The paper addresses efficient scaling of language model inference, particularly in resource-constrained environments. It is incremental in that it builds on existing embedding methods, adding new scaling strategies on top of them.

The paper tackles the problem of scaling embedding layers in language models without increasing decoding costs. Its main result is that SCONE enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline while using only about half the FLOPS and accelerator memory during inference.

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
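
The inference-time lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes, for illustration only, that each token position takes the precomputed embedding of the longest frequent n-gram ending at that position and falls back to the ordinary token embedding when no n-gram matches. The function name `lookup_input_embeddings`, the `max_n` parameter, and the use of plain dictionaries as a stand-in for the off-accelerator table are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def lookup_input_embeddings(
    tokens: Sequence[int],
    token_emb: Dict[int, Vector],
    ngram_emb: Dict[Tuple[int, ...], Vector],
    max_n: int = 3,
) -> List[Vector]:
    """Return one input embedding per token position.

    Assumption (not stated in the source): the longest matching n-gram
    ending at position i wins; otherwise the token embedding is used.
    `ngram_emb` stands in for the precomputed table that would live in
    off-accelerator memory; each query is a plain key lookup.
    """
    out: List[Vector] = []
    for i in range(len(tokens)):
        chosen = token_emb[tokens[i]]  # fallback: ordinary token embedding
        # Try the longest candidate first, down to bigrams (n = 2).
        for n in range(min(max_n, i + 1), 1, -1):
            key = tuple(tokens[i - n + 1 : i + 1])
            if key in ngram_emb:
                chosen = ngram_emb[key]
                break
        out.append(chosen)
    return out


# Toy data: three tokens, with two "frequent" n-grams precomputed.
toy_token_emb = {1: [1.0], 2: [2.0], 3: [3.0]}
toy_ngram_emb = {(1, 2): [12.0], (1, 2, 3): [123.0]}
toy_result = lookup_input_embeddings([1, 2, 3], toy_token_emb, toy_ngram_emb)
# toy_result == [[1.0], [12.0], [123.0]]
```

In this sketch, growing `ngram_emb` adds entries to the lookup table but leaves the per-token work at a handful of key lookups, which is the property the abstract relies on when it says the number of n-gram embeddings can be scaled at fixed accelerator cost.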
