LG · Apr 10

A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay

arXiv:2605.04055

AI Analysis

This work addresses the limitation of uniform hyperparameters in adaptive optimizers, offering a practical improvement for diverse deep learning tasks.

MetaAdamW integrates a self-attention mechanism into the AdamW optimizer to dynamically adjust per-group learning rates and weight decay, consistently outperforming standard AdamW across five tasks (e.g., up to 17.11% training time reduction or 11.08% performance improvement).

Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW, a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms, enabling domain knowledge to guide automatic loss balancing. Extensive experiments on five diverse tasks, namely time series forecasting (ETT), language modeling (WikiText-2), machine translation (Multi30k), image classification (CIFAR-10), and sentiment analysis (IMDB), demonstrate that MetaAdamW consistently outperforms the standard AdamW baseline in terms of validation loss, accuracy, or perplexity. Depending on the task, MetaAdamW either reduces overall training time (by up to 17.11%) or improves performance (by up to 11.08%) while introducing only moderate overhead; in some cases, it can also mitigate insufficient convergence caused by premature early stopping. Ablation studies validate the effectiveness of each component, including feature versions, grouping strategies, and the proposed priority-injected uncertainty weighting.
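The per-group modulation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and several choices are assumptions made only for illustration: the three features (log gradient norm, log momentum norm, gradient-momentum cosine), a single-head self-attention layer with fixed weights standing in for the trained Transformer encoder, the bounded `exp(c * tanh(z))` mapping, and all names (`group_features`, `modulation_factors`, `PARAMS`, the group names).

```python
import math

def group_features(grad, mom):
    """Assumed feature set for one group: log-norms plus grad/momentum cosine."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = math.sqrt(sum(m * m for m in mom))
    dot = sum(g * m for g, m in zip(grad, mom))
    cos = dot / (g_norm * m_norm + 1e-12)
    return [math.log(g_norm + 1e-12), math.log(m_norm + 1e-12), cos]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention across parameter groups."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def modulation_factors(X, params, max_log_scale=0.5):
    """Map attended features to (lr_factor, wd_factor) per group.

    The exp(c * tanh(z)) mapping is an assumption; it keeps every factor
    inside [exp(-c), exp(c)] so the base hyperparameters stay in range.
    """
    H = self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    factors = []
    for h in H:
        z_lr, z_wd = matvec(params["W_out"], h)
        factors.append((math.exp(max_log_scale * math.tanh(z_lr)),
                        math.exp(max_log_scale * math.tanh(z_wd))))
    return factors

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical fixed weights; in the paper these are trained by meta-learning.
PARAMS = {"Wq": IDENTITY, "Wk": IDENTITY, "Wv": IDENTITY,
          "W_out": [[0.1, -0.2, 0.3], [-0.3, 0.1, 0.2]]}

# Toy (gradient, momentum) vectors for three hypothetical parameter groups.
groups = {
    "embedding": ([0.5, -0.2, 0.1], [0.4, -0.1, 0.2]),
    "attention": ([0.01, 0.02, -0.01], [0.02, 0.01, 0.0]),
    "head": ([1.0, -1.0, 0.5], [-0.5, 0.8, 0.1]),
}

base_lr, base_wd = 1e-3, 1e-2
X = [group_features(g, m) for g, m in groups.values()]
factors = modulation_factors(X, PARAMS)
schedule = {name: {"lr": base_lr * f_lr, "weight_decay": base_wd * f_wd}
            for name, (f_lr, f_wd) in zip(groups, factors)}

for name, hp in schedule.items():
    print(name, hp)
```

With an all-zero output projection every factor equals 1, so the sketch reduces to uniform AdamW hyperparameters. In a real optimizer, the values in `schedule` would be written into the corresponding parameter groups before each update step.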

