LGJan 18, 2022

AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization

arXiv:2201.06714v421 citations
Originality Incremental advance
AI Analysis

This work addresses noise robustness in optimization for deep learning practitioners, offering a novel unified framework that is incremental over prior t-distribution modifications.

The paper tackles the problem of noise-corrupted datasets in deep learning by proposing AdaTerm, a stochastic gradient optimization method that uses the Student's t-distribution to derive all associated statistics, resulting in improved robustness and adaptability, with demonstrated performance across various noise ratios.

With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes