On approximating dropout noise injection
This work identifies a foundational flaw in prior theoretical analyses of dropout regularization, affecting researchers in machine learning theory.
The paper shows that the established equivalence between dropout noise injection and $L_2$ regularisation for logistic regression relies on a divergent Taylor expansion, which invalidates subsequent comparisons with standard regularisers, and it extends this finding to neural networks with cross-entropy prediction layers.
This paper examines the assumptions behind the derived equivalence between dropout noise injection and $L_2$ regularisation for logistic regression with negative log loss. We show that the approximation method is based on a divergent Taylor expansion, so that subsequent work using this approximation to compare the dropout-trained logistic regression model with standard regularisers is, to date, ill-founded. Moreover, the approximation approach is shown to be invalid under any robust constraints. We show how this finding extends to general neural network topologies that use a cross-entropy prediction layer.
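As a rough illustration of how a Taylor expansion can diverge in this setting, the sketch below looks at the log-partition function of logistic regression, $A(z) = \log(1 + e^z)$. Its Maclaurin series has radius of convergence $\pi$, because $1 + e^z$ vanishes at $z = \pm i\pi$. The choice of expansion point $0$, the evaluation points, and the helper names `sigmoid_taylor_coeffs` and `softplus_taylor` are assumptions made for this example; it is not a reproduction of the paper's argument, only a numerical picture of a truncated expansion that improves inside the radius and deteriorates outside it.

```python
import math
from fractions import Fraction


def sigmoid_taylor_coeffs(n_terms):
    """Exact Taylor coefficients of sigmoid(z) about z = 0.

    Uses the identity sigmoid' = sigmoid - sigmoid**2, which gives the
    recurrence (n + 1) * s[n + 1] = s[n] - sum_k s[k] * s[n - k].
    """
    s = [Fraction(1, 2)]  # sigmoid(0) = 1/2
    for n in range(n_terms - 1):
        conv = sum(s[k] * s[n - k] for k in range(n + 1))
        s.append((s[n] - conv) / (n + 1))
    return s


def softplus(z):
    """Reference value of A(z) = log(1 + exp(z))."""
    return math.log1p(math.exp(z))


def softplus_taylor(z, degree):
    """Partial sum of the Maclaurin series of A(z) up to z**degree.

    Since A'(z) = sigmoid(z), integrating the sigmoid series term by term
    gives A(z) = log(2) + sum_n s[n] * z**(n + 1) / (n + 1).
    """
    total = math.log(2.0)
    for n, s_n in enumerate(sigmoid_taylor_coeffs(degree)):
        total += float(s_n) * z ** (n + 1) / (n + 1)
    return total


if __name__ == "__main__":
    # z = 2 lies inside the radius of convergence (pi), z = 4 lies outside.
    for z in (2.0, 4.0):
        print(f"z = {z}, exact A(z) = {softplus(z):.6f}")
        for degree in (10, 20, 40, 60):
            err = abs(softplus_taylor(z, degree) - softplus(z))
            print(f"  degree {degree:2d}: |error| = {err:.3e}")
```

Running the script prints the truncation error for increasing degrees: at $z = 2$ the error shrinks as terms are added, while at $z = 4$ it grows. Dropout noise is not a small perturbation, so a perturbed argument can land outside the region where such a series converges, and keeping more terms does not help there.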