LGMar 11

The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

arXiv:2603.10985v13.6

Predicted impact top 54% in LG · last 90 daysOriginality Incremental advance

AI Analysis

This provides insights into the internal mechanisms of transformer models, which is important for researchers in interpretable AI and neural network theory, though it is incremental as it builds on existing characterizations of deep networks.

The paper tackles the problem of understanding how MLP layers in transformer language models process signals, revealing that they perform binary routing of continuous signals with specific neurons implementing consensus architectures, and shows that removing the MLP at consensus breakdown causes a 43.3% perplexity increase compared to only 10.1% at full consensus.

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture -- seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive -- creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown costs 43.3% perplexity, while at full consensus removing it costs only 10.1% -- exceeding a 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.

View on arXiv PDF

Similar