Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning
The method offers a practical way to improve neural network learning without increasing inference complexity, which is useful for applications with limited or noisy data.
The paper decomposes each neuron's weight vector into two parts to model contrastive information. Experiments on tasks such as MNIST and CIFAR-10 suggest improved generalization and resistance to overfitting, especially with sparse or noisy data.
We introduce a novel framework for learning in neural networks by decomposing each neuron's weight vector into two distinct parts, $W_1$ and $W_2$, thereby modeling contrastive information directly at the neuron level. Traditional gradient descent stores both positive (target) and negative (non-target) feature information in a single weight vector, often obscuring fine-grained distinctions. Our approach, by contrast, maintains separate updates for target and non-target features, ultimately forming a single effective weight $W = W_1 - W_2$ that is more robust to noise and class imbalance. Experimental results on both regression (California Housing, Wine Quality) and classification (MNIST, Fashion-MNIST, CIFAR-10) tasks suggest that this decomposition enhances generalization and resists overfitting, especially when training data are sparse or noisy. Crucially, the inference complexity remains the same as in the standard $WX + \text{bias}$ setup, offering a practical solution for improved learning without additional inference-time overhead.
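The claim that inference cost is unchanged follows from linearity: since $W = W_1 - W_2$, the two parts can be folded into one matrix once training is done. The NumPy sketch below illustrates only that folding step. The layer sizes, the random values, and the helper names `forward_dual`, `fold`, and `forward_single` are hypothetical choices made for illustration, and the sketch does not attempt to reproduce the separate target and non-target update rules, which the abstract does not spell out.

```python
import numpy as np

# Hypothetical sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_out, batch = 8, 3, 5

W1 = rng.normal(size=(n_out, n_in))   # part associated with target features
W2 = rng.normal(size=(n_out, n_in))   # part associated with non-target features
bias = rng.normal(size=(n_out, 1))
X = rng.normal(size=(n_in, batch))    # one column per input example


def forward_dual(W1, W2, bias, X):
    """Forward pass that keeps the two weight parts separate."""
    return W1 @ X - W2 @ X + bias


def fold(W1, W2):
    """Collapse the two parts into the single effective weight W = W1 - W2."""
    return W1 - W2


def forward_single(W, bias, X):
    """Standard WX + bias forward pass, as used at inference time."""
    return W @ X + bias


W = fold(W1, W2)
out_dual = forward_dual(W1, W2, bias, X)
out_single = forward_single(W, bias, X)

# The two forward passes agree up to floating-point error.
print(np.allclose(out_dual, out_single))
```

Under these assumptions, only the folded matrix `W` needs to be stored and applied at inference, so the parameter count and the per-example cost match those of an ordinary linear layer of the same shape.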