LG · Nov 19, 2024

Ultra-Sparse Memory Network

ByteDance
arXiv:2411.12364v2 · 15 citations · h-index: 11 · ICLR
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency challenges for large-scale AI models, offering a novel architecture that could enable billions of slots, though it appears incremental as an enhancement over MoE methods.

The paper tackles the high memory access costs and inference latency in Transformer models, particularly with Mixture of Experts (MoE) approaches, by introducing UltraMem with an ultra-sparse memory layer, achieving state-of-the-art inference speed and model performance while scaling to 20 million memory slots.

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms MoE. In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts.
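To make the idea of a sparse memory layer concrete, here is a minimal sketch of a generic key-value memory lookup that reads only the top-k matching slots. It is an illustration under stated assumptions, not UltraMem's actual architecture, which this summary does not describe in detail. The function name `sparse_memory_lookup`, the dot-product scoring, the softmax weighting, and all sizes are hypothetical choices made for the example.

```python
import heapq
import math
import random


def sparse_memory_lookup(query, keys, values, k):
    """Return the softmax-weighted sum of the k best-matching value rows.

    Generic top-k memory read (an assumption for illustration,
    not the method from the paper).
    """
    # Score every slot by the dot product between the query and its key.
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Keep only the k highest-scoring slots; no other value rows are read.
    top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
    # Softmax over the selected scores only (shifted by the max for stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [0.0] * dim
    for i, w in zip(top, weights):
        for d in range(dim):
            output[d] += (w / total) * values[i][d]
    return output, top


rng = random.Random(0)
num_slots, dim, k = 1000, 8, 4  # illustrative sizes, far below 20 million
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_slots)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_slots)]
query = [rng.gauss(0, 1) for _ in range(dim)]
output, selected = sparse_memory_lookup(query, keys, values, k)
print(f"read {len(selected)} of {num_slots} value rows")
```

Note that this toy version still scores every key, so its cost grows linearly with the number of slots. It only shows that the value reads, and therefore the parameters that contribute to the output, can be restricted to k slots regardless of how large the memory is.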

