LG · AI · CL · May 15

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

arXiv:2605.16234 · 8.9
Predicted impact: top 92% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For researchers pruning or merging transformer layers, the choice of equivalence test can dramatically change which layers are deemed redundant, highlighting a methodological pitfall.

The paper shows that two common tests for layer equivalence (replacement vs. interchange) give conflicting results on pretrained transformers, with the gap growing during training and varying across architectures. At 8B scale, interchange-guided pruning can be several-fold safer than replacement-guided, but this relationship is not universal.

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.
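The difference between the two protocols can be illustrated with a minimal sketch on a toy residual stack. This is not the paper's implementation: the toy layers (`h + tanh(h @ W)`), the helper names (`forward`, `replacement_kl`, `interchange_kl`), and the random inputs are all assumptions made for illustration. Replacement is read here as "run layer i's map in layer j's slot", interchange as "swap the positions of layers i and j", and each is scored by the mean KL divergence between the original and modified output distributions.

```python
import numpy as np

def make_layers(n_layers, d, seed=0, scale=0.3):
    # Hypothetical toy layers: one weight matrix per residual block.
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, scale / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]

def forward(x, weights, unembed, order):
    # Apply residual blocks in the given order, then project to logits.
    h = x
    for idx in order:
        h = h + np.tanh(h @ weights[idx])
    return h @ unembed

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(logits_p, logits_q):
    # Mean KL(p || q) over the batch, computed from logits.
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def replacement_kl(x, weights, unembed, i, j):
    # Replacement (assumed reading): layer i's map is used in slot j.
    order = list(range(len(weights)))
    base = forward(x, weights, unembed, order)
    order[j] = i
    return mean_kl(base, forward(x, weights, unembed, order))

def interchange_kl(x, weights, unembed, i, j):
    # Interchange (assumed reading): layers i and j trade positions.
    order = list(range(len(weights)))
    base = forward(x, weights, unembed, order)
    order[i], order[j] = order[j], order[i]
    return mean_kl(base, forward(x, weights, unembed, order))

# Toy setup; random vectors stand in for unlabeled forward-pass inputs.
rng = np.random.default_rng(1)
d, vocab, n_layers = 16, 32, 6
weights = make_layers(n_layers, d)
unembed = rng.normal(size=(d, vocab))
x = rng.normal(size=(64, d))

for i, j in [(1, 2), (2, 3), (0, 5)]:
    r = replacement_kl(x, weights, unembed, i, j)
    s = interchange_kl(x, weights, unembed, i, j)
    print(f"layers ({i},{j}): replacement KL={r:.4f}  interchange KL={s:.4f}")
```

Both scores need only forward passes and no labels, which matches the diagnostic described above. On a real checkpoint the same two orderings would be applied to the model's actual blocks and evaluated on held-out text rather than random vectors.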
