Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks
This addresses a fundamental optimization problem in neural network training for researchers and practitioners, but it is incremental as it builds on existing methods like Gumbel-ST.
The paper tackles the training-inference mismatch in neural networks using soft mixtures during training and hard selection at inference, analyzing the discretization gap across methods and proposing CAGE to maintain gradient flow while preserving forward alignment. On logic gate networks, Hard-ST with CAGE achieves over 98% accuracy on MNIST and over 58% on CIFAR-10 with zero selection gap, while Gumbel-ST without CAGE suffers a 47-point accuracy collapse.
In neural network models, soft mixtures of fixed candidate components (e.g., logic gates and sub-networks) are often used during training for stable optimization, while hard selection is typically used at inference. This raises questions about training-inference mismatch. We analyze this gap by separating forward-pass computation (hard selection vs. soft mixture) from stochasticity (with vs. without Gumbel noise). Using logic gate networks as a testbed, we observe distinct behaviors across four methods: Hard-ST achieves zero selection gap by construction; Gumbel-ST achieves near-zero gap when training succeeds but suffers accuracy collapse at low temperatures; Soft-Mix achieves small gap only at low temperature via weight concentration; and Soft-Gumbel exhibits large gaps despite Gumbel noise, confirming that noise alone does not reduce the gap. We propose CAGE (Confidence-Adaptive Gradient Estimation) to maintain gradient flow while preserving forward alignment. On logic gate networks, Hard-ST with CAGE achieves over 98% accuracy on MNIST and over 58% on CIFAR-10, both with zero selection gap across all temperatures, while Gumbel-ST without CAGE suffers a 47-point accuracy collapse.