LG · AI · May 12

Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

arXiv:2605.18820 · 69.7

Predicted impact: top 26% in LG · last 90 days · Originality: Highly original

AI Analysis

For researchers studying emergent reasoning in Transformers, this work provides a theoretical and empirical demonstration of how gradient descent can discover superposition reasoning, addressing a key open question about the feasibility of learning such representations.

The paper shows that gradient descent can learn superposition reasoning in Transformers for graph reachability by combining a Möbius attractor (which reduces the layerwise dynamics to a 1D map with a manifold of optima) with Cascade Supervision (a loss class that provides selectivity bootstrap, gradient persistence across depth, and per-step discrimination). At depth D=3, experiments measure a final-step cosine similarity of 0.37 vs. 0.69 (end-to-end vs. cascade), within 0.02 of the predicted 0.35 vs. 0.71.

Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{\mathrm{sup}}$ and $\mathcal{L}_{\mathrm{node}}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth $D=3$; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
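To make two quantities in the abstract concrete, here is a minimal Python sketch. It is not the authors' code. It evaluates the quoted decay scale $(np)^{-(D-c-2)/2}$, and it builds an equal-weight superposition state over the nodes reached from a source in an Erdős-Rényi graph. Several choices here are assumptions made for illustration: the function names, the use of one-hot node embeddings, treating the frontier at step t as the set of nodes reachable within t hops, and restricting the layer index to c ≤ D-2.

```python
import math
import random


def gradient_decay_factor(n, p, depth, layer):
    """Evaluate (np)^(-(D-c-2)/2), the decay scale quoted in the abstract.

    n * p is the expected fan-out of an Erdos-Renyi graph G(n, p).
    Assumption: the expression is meant for layers c <= D - 2.
    """
    return (n * p) ** (-(depth - layer - 2) / 2)


def erdos_renyi(n, p, seed=0):
    """Sample a directed Erdos-Renyi graph as an adjacency list (no self-loops)."""
    rng = random.Random(seed)
    return [[v for v in range(n) if v != u and rng.random() < p] for u in range(n)]


def reachable_sets(adj, source, depth):
    """Return [R_0, ..., R_depth], where R_t holds nodes reachable within t hops.

    Assumption: this is the notion of 'frontier' used for the illustration.
    """
    reached = {source}
    sets = [set(reached)]
    for _ in range(depth):
        reached = reached | {v for u in reached for v in adj[u]}
        sets.append(set(reached))
    return sets


def equal_weight_state(nodes, n):
    """Unit-norm vector with equal weight on every node in `nodes`.

    Assumption: nodes use one-hot embeddings in R^n.
    """
    w = 1.0 / math.sqrt(len(nodes))
    return [w if v in nodes else 0.0 for v in range(n)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    n, p, depth = 64, 4 / 64, 3  # expected fan-out np = 4 (hypothetical values)

    # Decay scale per layer; with np = 4 and D = 3, layer 0 gives 4 ** -0.5 = 0.5.
    for layer in range(depth - 1):
        print("layer", layer, "decay", gradient_decay_factor(n, p, depth, layer))

    # Equal-weight target states for each step of the search from node 0.
    adj = erdos_renyi(n, p, seed=0)
    targets = [equal_weight_state(s, n) for s in reachable_sets(adj, 0, depth)]

    # Compare consecutive targets, as a stand-in for target-vs-learned cosines.
    for t in range(depth):
        print("step", t, "cosine to next target", cosine(targets[t], targets[t + 1]))
```

The cosine printed here only compares consecutive target states. Reproducing the reported 0.37 vs. 0.69 would require the trained model's residual stream, which this sketch does not model.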
