CV | May 7

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

arXiv:2605.05646 | 99.3 | h-index: 33 | Has Code
AI Analysis

For visual tokenization and representation learning, MUSE demonstrates that structural alignment can enhance both generation and perception, breaking a known trade-off.

MUSE resolves the trade-off between pixel reconstruction and semantic abstraction in visual tokenization via Topological Orthogonality, achieving SOTA generation (gFID 3.08) and surpassing InternViT-300M in linear probing (85.2% vs. 82.5%).

Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
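The "opposing gradients" claim in the abstract can be illustrated with a toy diagnostic. This is a sketch under stated assumptions, not code from the MUSE repository: two squared-error objectives (standing in for reconstruction and semantic losses) share one weight vector, and the cosine similarity between their gradients indicates whether they pull the shared parameters in conflicting directions. The function names `grad_sq_loss` and `cosine`, and all numeric values, are hypothetical.

```python
import math


def grad_sq_loss(w, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to w."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    residual = pred - target
    return [residual * xi for xi in x]


def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Shared parameters (hypothetical values).
w = [0.5, -0.2]

# Stand-in "reconstruction" objective: its own input and target.
g_rec = grad_sq_loss(w, x=[1.0, 2.0], target=1.0)

# Stand-in "semantic" objective: a different input and target.
g_sem = grad_sq_loss(w, x=[2.0, 1.0], target=-1.0)

# A negative cosine means a step that lowers one loss raises the other
# to first order, i.e. the two objectives interfere on the shared weights.
conflict = cosine(g_rec, g_sem)
print(f"gradient cosine similarity: {conflict:.3f}")
```

A negative value here corresponds to the zero-sum regime the abstract describes for naive joint optimization. According to the abstract, MUSE avoids this by routing structural gradients to the attention topology and semantic gradients to the feature values; that routing is not reproduced in this toy.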

