LG CL, Jan 29

GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider

arXiv:2601.22095v12.71 citations, h-index: 10

Originality: Highly original

AI Analysis

This work addresses a fundamental design issue in Transformer architectures, the placement of normalization layers, and offers a novel solution that improves model performance.

The paper tackles the problem of normalization layer placement in Transformers by introducing GeoNorm, a method that uses geodesic optimization to unify Pre-Norm and Post-Norm. It reports consistent performance improvements over existing normalization methods with negligible computational overhead.

The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
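To make the idea of a geodesic update concrete, here is a minimal sketch rather than the paper's implementation. It assumes the manifold is the unit sphere, treats a sublayer output as an update direction, and moves along the great circle using the exponential map. The abstract does not specify the manifold or the decay schedule, so the function names (`geodesic_step`, `decayed_step`) and the geometric decay form are hypothetical choices for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project_to_tangent(x, u):
    # Remove the component of u along x (x is assumed to have unit norm),
    # leaving a direction tangent to the sphere at x.
    c = dot(x, u)
    return [ui - c * xi for ui, xi in zip(u, x)]


def geodesic_step(x, u, step):
    # Exponential map on the unit sphere (an assumed manifold):
    # walk along the great circle from x in the tangent direction of u.
    v = project_to_tangent(x, u)
    v_norm = norm(v)
    if v_norm < 1e-12:
        return list(x)  # no tangent component, so stay in place
    theta = step * v_norm
    return [
        math.cos(theta) * xi + math.sin(theta) * vi / v_norm
        for xi, vi in zip(x, v)
    ]


def decayed_step(layer_index, base_step=1.0, decay=0.9):
    # Hypothetical layer-wise decay, by analogy with a learning rate schedule.
    return base_step * decay ** layer_index


# Toy "hidden state" on the sphere and stand-in sublayer outputs.
x = [1.0, 0.0, 0.0]
updates = [[0.0, 1.0, 0.0], [0.5, 0.0, 1.0], [0.0, -1.0, 0.3]]
for layer, u in enumerate(updates):
    x = geodesic_step(x, u, decayed_step(layer))
```

Because each step moves along a great circle, the state stays at unit norm without a separate normalization call. By contrast, Post-Norm applies normalization after the residual sum, and Pre-Norm normalizes only the sublayer input.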
