LG | CL | OC | Jan 21, 2025

FOCUS: First Order Concentrated Updating Scheme

arXiv:2501.12243v1 | 5 citations | h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses optimization challenges in LLM pre-training and proposes a potentially better optimizer for that setting. The advance appears incremental because it builds on existing methods such as Signum.

The paper tackles the problem of optimizing large language model pre-training by developing FOCUS, an optimizer that enhances Signum to better handle gradient noise while maintaining larger step sizes. In experiments training GPT-2, FOCUS proved more stable than Signum and faster than Adam.

Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley's sharpness, Adam's performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers promising solutions.
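The abstract describes FOCUS only in words: a Signum-style sign-of-momentum step plus an attraction toward a moving average of the parameters. The following is a minimal sketch of what such an update could look like, not the authors' implementation. The function name `focus_step`, the hyperparameters `beta1`, `beta2`, and `gamma`, their default values, and the use of a sign-based attraction term are all assumptions made for illustration; the summary above does not specify them.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def focus_step(params, grads, state, lr=0.01, beta1=0.9, beta2=0.99, gamma=0.2):
    """One hypothetical FOCUS-style update on a list of floats.

    Assumed form (not confirmed by the summary above):
        m    <- EMA of gradients          (as in Signum)
        avg  <- EMA of parameters         (the "moving averaged parameters")
        p    <- p - lr * (sign(m) + gamma * sign(p - avg))
    """
    # Lazily create optimizer state on the first call.
    m = state.setdefault("m", [0.0] * len(params))
    avg = state.setdefault("avg", list(params))

    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Incremental EMA form, so avg stays exactly equal to p
        # when the two already coincide (no spurious attraction).
        m[i] += (1.0 - beta1) * (g - m[i])
        avg[i] += (1.0 - beta2) * (p - avg[i])

        # Signum direction plus a pull back toward the averaged parameters.
        direction = sign(m[i]) + gamma * sign(p - avg[i])
        new_params.append(p - lr * direction)
    return new_params


# Usage: two steps with a fixed toy gradient.
state = {}
params = [1.0, -2.0]
grads = [0.5, -0.3]
params = focus_step(params, grads, state)   # pure Signum step: avg == params
params = focus_step(params, grads, state)   # attraction now opposes the drift
```

Setting `gamma=0` in this sketch reduces it to plain Signum. With `gamma > 0`, a parameter that has drifted away from its running average takes a smaller net step in the drift direction, which is one plausible reading of how the attraction term could damp noise while keeping a large nominal step size.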

