AdamD: Improved bias-correction in Adam
This is an incremental improvement for users of the Adam optimizer, addressing sensitivity to hyperparameters in early training.
The paper addresses the Adam optimizer's tendency to make larger-than-requested gradient updates early in training. It proposes dropping the bias correction on the first-moment estimate while keeping it on the second-moment estimate, which yields more desirable update properties in the initial steps.
Here I present a small update to the bias-correction term in the Adam optimizer that has the advantage of making smaller gradient updates in the first several steps of training. With the default bias correction, Adam may actually make larger-than-requested gradient updates early in training. By including only the well-justified bias correction of the second-moment gradient estimate, $v_t$, and excluding the bias correction of the first-moment estimate, $m_t$, we attain these more desirable gradient update properties in the first several steps. Adam's sensitivity to the hyperparameters $\beta_1, \beta_2$ may be partly due to the originally proposed bias-correction procedure and its behavior in early steps.
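The difference between the two bias-correction schemes can be illustrated with a minimal scalar sketch. It is not the paper's reference implementation: the function name `update_sizes`, the flag `correct_first_moment`, the constant gradient of 1.0, and the hyperparameter values (the commonly used Adam defaults) are all assumptions chosen for illustration. The sketch tracks the size of each parameter update when $m_t$ is bias-corrected (standard Adam) and when it is not (the variant described above).

```python
import math


def update_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 correct_first_moment=True):
    """Return the per-step update size for a single scalar parameter.

    correct_first_moment=True  -> standard Adam (both moments bias-corrected)
    correct_first_moment=False -> only the second moment v_t is bias-corrected
    (hypothetical helper, for illustration only)
    """
    m, v = 0.0, 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving averages of the gradient and squared gradient.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # The second-moment bias correction is kept in both variants.
        v_hat = v / (1.0 - beta2 ** t)
        # The first-moment bias correction is applied only for standard Adam.
        m_hat = m / (1.0 - beta1 ** t) if correct_first_moment else m
        sizes.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return sizes


# Assumed toy input: a constant gradient of 1.0 for the first five steps.
grads = [1.0] * 5
adam_sizes = update_sizes(grads, correct_first_moment=True)
adamd_sizes = update_sizes(grads, correct_first_moment=False)

for t, (a, d) in enumerate(zip(adam_sizes, adamd_sizes), start=1):
    print(f"step {t}: Adam {a:.2e}  first-moment-uncorrected {d:.2e}")
```

Under this constant-gradient assumption, standard Adam takes a step of roughly the full learning rate from the first iteration, while the variant scales each step by $1 - \beta_1^t$, so the step size ramps up gradually. The toy input does not reproduce the larger-than-requested updates mentioned above; it only shows the smaller early steps.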