LG AI Jun 28, 2025

Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish

arXiv:2506.23025v1 4.11 citations h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the high memory requirements and slow inference of LLMs, offering researchers and practitioners in AI more efficient deployment options. It is incremental in that it builds on existing quantization methods.

The paper tackles the inference efficiency bottleneck in large language models by investigating ternary language models (TriLMs) trained with quantization-aware training. It reports up to 5 times faster end-to-end inference on GPUs compared to floating-point baselines, and sustained performance gains at scale with models trained on up to 1.2 trillion tokens.

Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. While the computational power of modern GPU architectures continues to improve, memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs through a scaling law analysis, which reveals that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which accelerate inference across various CPU architectures. Building on the 2-bit packing, we also develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
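The abstract does not describe the packing layouts themselves. As a rough illustration of where figures like 2 bits and 1.6 bits per weight can come from, the sketch below packs ternary weights in two generic ways: four weights per byte at 2 bits each, and five weights per byte as a base-3 number (possible because 3^5 = 243 fits in one byte, giving 8 / 5 = 1.6 bits per weight). The function names and bit layouts are hypothetical and chosen for clarity. They should not be read as the formats used by Spectra-1.1 or TriRun.

```python
# Illustrative sketch only: the layouts below are assumptions made for this
# example, not the packing formats defined in the paper.

def pack_2bit(weights):
    """Pack ternary weights (-1, 0, +1) four per byte, 2 bits each."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map {-1, 0, 1} to codes {0, 1, 2}
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n):
    """Recover the first n ternary weights from 2-bit packed bytes."""
    weights = []
    for byte in packed:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights[:n]  # drop padding slots from the last byte


def pack_base3(weights):
    """Pack five ternary weights per byte as a base-3 number (1.6 bits each)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)  # first weight ends up as lowest digit
        out.append(value)  # at most 242, so it always fits in one byte
    return bytes(out)


def unpack_base3(packed, n):
    """Recover the first n ternary weights from base-3 packed bytes."""
    weights = []
    for value in packed:
        for _ in range(5):
            weights.append(value % 3 - 1)
            value //= 3
    return weights[:n]  # drop padding digits from the last byte


if __name__ == "__main__":
    w = [-1, 0, 1, 1, 0, -1, 1, 0, 0, -1]
    print(len(pack_2bit(w)), "bytes with 2-bit packing")
    print(len(pack_base3(w)), "bytes with base-3 packing")
```

The 2-bit layout wastes one of its four codes but decodes with shifts and masks alone, while the base-3 layout is denser and needs division and modulo (or a lookup table) to decode. This trade-off is a general property of the two encodings, not a claim about the paper's kernels.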

