LG AIMar 14

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

arXiv:2602.1852364.56 citationsh-index: 3

AI Analysis

This work addresses the problem of understanding abrupt generalization transitions in multi-task learning for researchers in machine learning theory, though it appears incremental by extending single-task grokking analysis to multi-task settings.

The study extended geometric analysis to multi-task modular arithmetic, training Transformers on dual- and tri-task objectives, and identified five consistent phenomena including staggered grokking order, universal integrability, weight decay phase structure, holographic incompressibility, and transverse fragility, with specific findings like multiplication generalizing first and grokking eliminated by removing less than 10% of orthogonal gradient components.

Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.

View on arXiv PDF

Similar