LG · Dec 18, 2025

Bridging Training and Merging Through Momentum-Aware Optimization

arXiv:2512.17109v2 · h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses an inefficiency in current workflows for training and merging large neural networks. It offers a unified pipeline that reduces redundant computation and improves merging performance, though the contribution is an incremental optimization of existing processes.

The paper targets a specific inefficiency: curvature information computed during neural network training is discarded and then recomputed for model merging. It proposes a unified framework that maintains momentum and curvature statistics during training so they can be reused for geometry-aware model composition. The results show that this approach outperforms magnitude-only baselines across sparsity levels, with multi-task merging improving by 1.6% over strong baselines, at a cost of about 30% memory overhead over AdamW.
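To make the contrast between magnitude-only and curvature-aware selection concrete, here is a minimal sketch. It is not the paper's algorithm: the saliency formula (curvature times squared weight change), the helper `top_k_mask`, and all numbers are assumptions chosen for illustration.

```python
def top_k_mask(scores, sparsity):
    """Return a keep-mask over `scores`; `sparsity` is the fraction of entries dropped."""
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


# Hypothetical task vector (fine-tuned weights minus base weights).
delta = [0.9, -0.1, 0.5, 0.05]
# Hypothetical per-parameter curvature proxy, e.g. a running mean of squared gradients.
curvature = [0.01, 4.0, 0.02, 9.0]

# Magnitude-only baseline: rank parameters by |delta|.
magnitude_scores = [abs(d) for d in delta]
# Assumed curvature-aware saliency: curvature * delta^2.
saliency_scores = [c * d * d for c, d in zip(curvature, delta)]

magnitude_mask = top_k_mask(magnitude_scores, sparsity=0.5)
saliency_mask = top_k_mask(saliency_scores, sparsity=0.5)

print(magnitude_mask)  # [True, False, True, False]
print(saliency_mask)   # [False, True, False, True]
```

In this toy case the two criteria keep disjoint parameter sets: small weight changes in high-curvature directions outrank large changes in flat directions.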

Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging, wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method incurs modest memory overhead (approximately 30% over AdamW) to accumulate task saliency scores that enable curvature-aware merging. These scores, computed as a byproduct of optimization, provide importance estimates comparable to post hoc Fisher computation while producing merge-ready models directly from training. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving 1.6% over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach demonstrates that training-time curvature information suffices for effective model composition, enabling a unified training-merging pipeline.
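The two ideas in the abstract, storing curvature statistics in factorized form and reusing them to weight a merge, can be sketched as follows. This is an illustrative stand-in, not the paper's method: the rank-1 row/column factorization is the Adafactor-style scheme, the merge rule is a per-parameter curvature-weighted average in the spirit of Fisher-weighted merging, and the function names and values are hypothetical.

```python
def factored_second_moment(grad_sq):
    """Rank-1 reconstruction of a squared-gradient matrix from its row and column sums.

    Only the m row sums and n column sums need to be stored, not all m * n entries.
    """
    row = [sum(r) for r in grad_sq]
    col = [sum(c) for c in zip(*grad_sq)]
    total = sum(row)
    return [[r * c / total for c in col] for r in row]


def curvature_weighted_merge(base, deltas, curvatures, eps=1e-8):
    """Add to `base` the per-parameter average of task deltas, weighted by task curvature."""
    merged = []
    for j, b in enumerate(base):
        num = sum(c[j] * d[j] for c, d in zip(curvatures, deltas))
        den = sum(c[j] for c in curvatures) + eps
        merged.append(b + num / den)
    return merged


# A rank-1 squared-gradient matrix is recovered exactly from its row/column sums.
approx = factored_second_moment([[1.0, 2.0], [2.0, 4.0]])

# Two hypothetical tasks, each changing one parameter, with assumed curvature estimates.
base = [0.0, 0.0]
deltas = [[1.0, 0.0], [0.0, 1.0]]
curvatures = [[3.0, 1.0], [1.0, 3.0]]
merged = curvature_weighted_merge(base, deltas, curvatures)
```

Under these assumptions each merged parameter is pulled toward the task whose curvature estimate marks it as more important; a uniform average would instead give 0.5 for both entries, where this sketch gives roughly 0.75.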

