SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
For Transformer practitioners, SiameseNorm is a two-stream alternative that stays compatible with Pre-Norm training recipes and addresses the Pre/Post-Norm trade-off with negligible overhead, though its gains over existing hybrid approaches are incremental.
SiameseNorm introduces a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers, consistently improving performance across 400M/1.3B dense LMs, 15B MoE models, ViT, and DiT while maintaining training stability.
The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
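To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two standard layouts, plus one possible way of coupling two streams through a shared block. The functions `pre_norm_step` and `post_norm_step` follow the usual textbook definitions. The function `two_stream_step`, the way it mixes the streams before the shared block, and the toy `block` are assumptions made for illustration only; they are not taken from the paper, and the linked repository should be treated as the reference for the actual formulation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and (approximately) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(x, block):
    # Pre-Norm: the residual path is an identity; only the block input is normalized.
    return x + block(layer_norm(x))

def post_norm_step(x, block):
    # Post-Norm: the main residual path itself is normalized after the addition.
    return layer_norm(x + block(x))

def two_stream_step(pre, post, block):
    # HYPOTHETICAL coupling (an assumption, not the paper's definition):
    # the shared block is evaluated once on a mix of both streams, and its
    # output updates the Pre-Norm-like stream and the Post-Norm-like stream.
    h = block(layer_norm(pre) + post)
    return pre + h, layer_norm(post + h)

# Toy usage with a stand-in residual block (a single tanh layer).
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
W = rng.normal(size=(8, 8)) * 0.1
block = lambda z: np.tanh(z @ W)

pre, post = x, layer_norm(x)
for _ in range(3):
    pre, post = two_stream_step(pre, post, block)
```

In this sketch the shared block is called once per layer, so the extra cost over a single-stream layer is one additional normalization and one addition. The Pre-Norm-like stream keeps an unnormalized identity path, while the Post-Norm-like stream is renormalized at every layer.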