"Noisier" Noise Contrastive Eestimation is (Almost) Maximum Likelihood
For practitioners of representation learning and generative modeling, this work offers a simple, computationally cheap modification to NCE that improves density-ratio estimation in challenging regimes.
The paper addresses the difficulty of noise contrastive estimation (NCE) when the distributions being compared differ substantially. It shows that virtually scaling up the noise magnitude aligns the NCE gradient with that of maximum likelihood, which leads to faster convergence. The proposed "Noisier" NCE achieves strong results on image modeling, anomaly detection, and offline black-box optimization. On CIFAR-10 and ImageNet64x64, its 10-step and 1-step samplers match or surpass state-of-the-art methods while cutting training iterations by up to half.
Noise Contrastive Estimation (NCE) has fueled major breakthroughs in representation learning and generative modeling. Yet a long-standing challenge remains: accurately estimating ratios between distributions that differ substantially, which significantly limits the applicability of NCE on modern high-dimensional and multimodal datasets. We revisit this problem from a less-explored perspective: the magnitude of the noise distribution. Specifically, we show that with a virtually scaled (i.e., artificially increased) noise magnitude, the gradient of the NCE objective can closely align with that of Maximum Likelihood, enabling a trajectory-wise approximation from NCE to MLE, and faster convergence both theoretically and empirically. Building on this insight, we introduce "Noisier" NCE, a simple drop-in modification to vanilla NCE that incurs little to no extra computational cost, while effectively handling density-ratio estimation in challenging regimes where traditional MLE and NCE struggle. Beyond improving classical density-ratio learning, "Noisier" NCE proves broadly applicable: it achieves strong results across image modeling, anomaly detection, and offline black-box optimization. On CIFAR-10 and ImageNet64x64 datasets, it yields 10-step and even 1-step samplers that match or surpass state-of-the-art methods, while cutting training iterations by up to half.
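The gradient-alignment claim can be illustrated with a small numerical sketch. The code below is not the paper's method: it uses the classical NCE objective in which the noise density is multiplied by a factor `nu`, and checks on a 1D Gaussian toy problem that the NCE gradient with respect to the model mean approaches the maximum-likelihood gradient as `nu` grows. The toy distributions, the grid-based integration, and the names `gauss`, `nce_grad`, and `mle_grad` are all assumptions chosen for illustration; whether this `nu` corresponds exactly to the abstract's "virtually scaled" noise magnitude is likewise an assumption.

```python
import numpy as np

def gauss(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Integration grid (assumption: wide enough that all tails are negligible).
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]

theta = 0.0                       # current model mean (hypothetical toy value)
p_data = gauss(x, 1.0, 1.0)       # toy data distribution
p_model = gauss(x, theta, 1.0)    # normalized model with mean theta
p_noise = gauss(x, 0.0, 2.0)      # toy noise distribution
score = x - theta                 # d/dtheta log p_model(x) for unit variance

def nce_grad(nu):
    """Gradient of the classical NCE objective with the noise density scaled by nu."""
    post_noise = nu * p_noise / (p_model + nu * p_noise)  # P(noise | x)
    post_data = p_model / (p_model + nu * p_noise)        # P(data | x)
    data_term = np.sum(p_data * post_noise * score) * dx
    noise_term = nu * np.sum(p_noise * post_data * score) * dx
    return data_term - noise_term

# MLE gradient: E_data[score]. The model is normalized, so E_model[score] = 0.
mle_grad = np.sum(p_data * score) * dx

for nu in (1.0, 10.0, 100.0, 1e4):
    print(f"nu={nu:>8.0f}  nce_grad={nce_grad(nu):.4f}  mle_grad={mle_grad:.4f}")
```

In this sketch the scaling is applied only inside the objective, so increasing `nu` does not require drawing additional noise samples, which is one way to read "little to no extra computational cost". A practical implementation would estimate both expectations from minibatches rather than a grid.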