LG · ML · Apr 13

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

arXiv:2604.11890 · 7.91 citations · h-index: 1

Predicted impact: top 93% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For deep learning practitioners, this work provides theoretical insight into why normalization-free transformers (e.g., DyT, Derf) require careful tuning, addressing a practical training instability issue.

The paper studies signal propagation in transformers at initialization using the averaged partial Jacobian norm (APJN), extending analysis to bidirectional attention. It finds that normalization-free transformers with elementwise tanh-like nonlinearities exhibit subcritical (stretched-exponential) APJN growth, explaining their sensitivity to initialization and optimization choices.
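For orientation, the APJN can be written as follows. This is a sketch using the convention common in the partial-Jacobian literature for fully connected networks; the symbols $h^{l}$ (layer-$l$ preactivations), $N_l$ (layer width), $\theta$ (parameters at initialization), and the exponents $\zeta$, $\lambda$, $\beta$ are notational assumptions here and may differ from the paper's own definitions, which additionally account for token indices.

```latex
% APJN between layers l_0 and l (assumed normalization: divide by the output width N_l)
\mathcal{J}^{l_0, l}
  \equiv \mathbb{E}_{\theta}\!\left[
    \frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{j=1}^{N_{l_0}}
    \left( \frac{\partial h^{l}_{i}}{\partial h^{l_0}_{j}} \right)^{2}
  \right]

% Generic large-depth forms of the two regimes named in the summary
\text{power law:}\quad \mathcal{J}^{0, l} \sim l^{\zeta},
\qquad
\text{stretched exponential:}\quad \mathcal{J}^{0, l} \sim \exp\!\left(\lambda\, l^{\beta}\right),
\quad 0 < \beta < 1 .
```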

We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
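The kind of measurement the abstract describes can be illustrated with a toy experiment. The sketch below is not the paper's setup: it uses an attention-free residual stack, $h_{l+1} = h_l + W_l\,N(h_l)$, with $N$ either a LayerNorm without affine parameters or an elementwise $\tanh$ (a stand-in for DyT with unit scale), and it estimates the APJN by multiplying exact per-block Jacobians. The function names, widths, depths, and the weight scale `sigma_w` are all hypothetical choices made for illustration.

```python
import numpy as np


def layernorm_and_jacobian(h, eps=1e-6):
    """LayerNorm without affine parameters, plus its exact Jacobian dy/dh."""
    n = h.size
    centered = h - h.mean()
    s = np.sqrt(np.mean(centered**2) + eps)
    y = centered / s
    # dy_i/dh_j = (delta_ij - 1/n - y_i * y_j / n) / s
    jac = (np.eye(n) - np.ones((n, n)) / n - np.outer(y, y) / n) / s
    return y, jac


def tanh_and_jacobian(h):
    """Elementwise tanh, plus its (diagonal) Jacobian."""
    y = np.tanh(h)
    return y, np.diag(1.0 - y**2)


def apjn_curve(kind, width=64, depth=40, n_seeds=8, sigma_w=1.0, seed=0):
    """Estimate the APJN from layer 0 to each layer of a toy residual stack.

    Returns an array of length depth + 1, averaged over n_seeds random
    initializations. Entry 0 is the APJN of the identity map, which is 1.
    """
    if kind == "layernorm":
        normalizer = layernorm_and_jacobian
    elif kind == "tanh":
        normalizer = tanh_and_jacobian
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    rng = np.random.default_rng(seed)
    curves = np.zeros((n_seeds, depth + 1))
    for s in range(n_seeds):
        h = rng.standard_normal(width)
        total_jac = np.eye(width)  # d h_l / d h_0, starting at l = 0
        curves[s, 0] = 1.0
        for layer in range(1, depth + 1):
            y, jac = normalizer(h)
            w = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
            # Block Jacobian of h -> h + w @ N(h) is I + w @ dN/dh.
            total_jac = (np.eye(width) + w @ jac) @ total_jac
            h = h + w @ y
            # Squared Frobenius norm divided by the output width.
            curves[s, layer] = np.sum(total_jac**2) / width
    return curves.mean(axis=0)


if __name__ == "__main__":
    for kind in ("layernorm", "tanh"):
        curve = apjn_curve(kind)
        print(kind, [float(round(curve[d], 2)) for d in (1, 10, 20, 40)])
```

Plotting the two curves against depth on log-log axes is one way to compare their growth: a power law appears as a roughly straight line, while faster-than-power-law growth bends upward. Because this toy model omits attention and MLP sublayers, it should not be expected to reproduce the paper's exponents.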
