Marginals Before Conditionals
This work clarifies how neural networks learn conditional dependencies during training. The contribution is incremental, but it isolates a specific bottleneck in understanding training behavior.
To study how neural networks learn conditional dependencies, the authors construct a minimal task that isolates conditional learning. Models first learn the marginal distribution and plateau at a predictable loss (log K), then transition sharply to the full conditional solution. The plateau duration depends on dataset size, and gradient noise stabilizes the marginal solution: higher learning rates slow the transition by up to 3.6 times.
We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.
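The entropy claims in the task definition can be checked on a toy dataset. The sketch below is an illustration only, not the authors' construction: the integer encoding `a = b * k + z`, the function names `make_dataset` and `conditional_entropy`, and the sizes `num_b=50`, `k=4` are all assumptions made for this example. It builds a K-to-1 map from answers A onto contexts B, then measures the empirical conditional entropy of A with and without the selector z (in nats).

```python
import math
from collections import Counter, defaultdict


def make_dataset(num_b=50, k=4):
    """Toy K-fold ambiguous task (hypothetical encoding, not from the paper).

    Each context b has exactly k candidate answers; the selector z in
    range(k) picks one. The map a -> b is surjective and k-to-1.
    """
    rows = []
    for b in range(num_b):
        for z in range(k):
            a = b * k + z  # assumed encoding: a unique answer per (b, z)
            rows.append((b, z, a))
    return rows


def conditional_entropy(rows, context):
    """Empirical H(A | context(b, z)) in nats."""
    groups = defaultdict(Counter)
    for b, z, a in rows:
        groups[context(b, z)][a] += 1
    n = len(rows)
    h = 0.0
    for counts in groups.values():
        total = sum(counts.values())
        for c in counts.values():
            # weight by the joint probability c/n, log of conditional c/total
            h -= (c / n) * math.log(c / total)
    return h


rows = make_dataset(num_b=50, k=4)
h_without_z = conditional_entropy(rows, lambda b, z: b)    # H(A | B)
h_with_z = conditional_entropy(rows, lambda b, z: (b, z))  # H(A | B, z)
print(h_without_z, math.log(4), h_with_z)
```

Under these assumptions, a predictor that ignores z and spreads its probability uniformly over the K candidates for each B attains cross-entropy log K, which matches the plateau height described above; a predictor that uses z can reach zero.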