LG CL · May 15, 2025

ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

arXiv:2505.10222v2 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in Transformer inference for NLP applications, offering a more expressive and adaptable attention mechanism with strong parameter efficiency. It appears incremental, however, because it builds on existing positional encoding methods.

The paper tackles the challenge of integrating positional information effectively in Transformer models while preserving the flexibility of multi-head attention. The authors report that ComplexFormer achieves superior performance, with significantly lower generation perplexity and improved long-context coherence on tasks such as language modeling and code generation.

Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences separately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA lets each head independently model semantic and positional differences in a unified way within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, which converts real-valued query/key projections into polar-form complex vectors so that each head operates in its own complex subspace; and (2) a per-head adaptive differential rotation mechanism, $\exp[i(\mathrm{Adapt}(AS_{mn,i}) + \Delta P_{mn,i})]$, which allows each head to learn a distinct strategy for integrating semantic angle differences ($AS_{mn,i}$) with relative positional encodings ($\Delta P_{mn,i}$). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show that ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines such as RoPE-Transformers. ComplexFormer also demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.
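The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how the Euler transformation pairs dimensions or what form Adapt takes, so the sketch assumes that the first and second halves of each head's projection act as real and imaginary parts, that Adapt is a per-head affine map (the hypothetical `gain` and `bias`), and that the positional term is a RoPE-style `(m - n) * freqs`. All function and parameter names are illustrative only.

```python
import numpy as np

def euler_transform(x):
    """Assumed Euler transformation: treat the two halves of the last
    axis as real and imaginary parts and return (magnitude, phase)."""
    half = x.shape[-1] // 2
    re, im = x[..., :half], x[..., half:]
    return np.hypot(re, im), np.arctan2(im, re)

def cmha_scores(q, k, gain, bias, freqs):
    """Unnormalised attention scores for a hypothetical CMHA layer.

    q, k  : (heads, seq, dim) real-valued projections, dim even
    gain  : (heads,) per-head slope of the assumed affine Adapt
    bias  : (heads,) per-head offset of the assumed affine Adapt
    freqs : (heads, dim // 2) per-head positional frequencies (assumed)
    """
    r_q, th_q = euler_transform(q)            # each (H, T, D/2)
    r_k, th_k = euler_transform(k)

    # Semantic angle difference AS_{mn} between query m and key n.
    sem = th_q[:, :, None, :] - th_k[:, None, :, :]          # (H, T, T, D/2)
    adapt = gain[:, None, None, None] * sem + bias[:, None, None, None]

    # Relative positional term Delta P_{mn} = (m - n) * frequency.
    pos = np.arange(q.shape[1])
    rel = pos[:, None] - pos[None, :]                         # (T, T)
    dpos = rel[None, :, :, None] * freqs[:, None, None, :]    # (H, T, T, D/2)

    # Scaling by the magnitudes, rotation by exp[i(Adapt(AS) + Delta P)].
    mag = r_q[:, :, None, :] * r_k[:, None, :, :]
    rot = np.exp(1j * (adapt + dpos))
    return (mag * rot).real.sum(axis=-1)                      # (H, T, T)

# Usage with random inputs: 2 heads, 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))
k = rng.standard_normal((2, 4, 8))
scores = cmha_scores(q, k, gain=np.ones(2), bias=np.zeros(2),
                     freqs=np.zeros((2, 4)))
```

Under these assumptions, setting `gain` to 1 and `bias` and `freqs` to 0 reduces the scores to the ordinary dot product of `q` and `k`, and nonzero `freqs` with the same `gain` and `bias` gives a RoPE-like score. Any per-head deviation from that setting is where the head-specific behaviour described in the abstract would enter. Softmax normalisation and the usual scaling by the head dimension are omitted for brevity.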

Code Implementations: 1 repo
