LG, AI, CV | Nov 24, 2025

Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

arXiv:2511.18670v1
Originality: Incremental advance
AI Analysis

The paper addresses a stability challenge faced by researchers and practitioners working on efficient transformer modifications, though the advance appears incremental because it builds on existing replacement methods.

The paper tackles the destabilization of pretrained transformers that occurs when modules such as self-attention are replaced with efficient alternatives. It introduces Deterministic Continuous Replacement (DCR), which blends teacher and student outputs with a deterministic weight, and reports faster convergence and stronger alignment in a single-seed study.

Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
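The blending idea in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine schedule, the direction of the blend (teacher weight annealed from 1 to 0), and the names `anneal_weight`, `dcr_blend`, and `stochastic_gate` are all hypothetical. The comparison at the end shows only the output-level analogue of the variance argument: a random gate produces outputs that vary from sample to sample, while the deterministic blend yields their expected value with no sampling noise. It does not reproduce the paper's gradient analysis.

```python
import math
import random


def anneal_weight(step, total_steps):
    """Teacher weight for a given step (hypothetical cosine schedule, 1.0 -> 0.0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def dcr_blend(teacher_out, student_out, alpha):
    """Deterministic blend: alpha * teacher + (1 - alpha) * student, elementwise."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_out, student_out)]


def stochastic_gate(teacher_out, student_out, alpha, rng):
    """Stochastic baseline (assumed form): use the teacher output with probability alpha."""
    return list(teacher_out) if rng.random() < alpha else list(student_out)


# Toy module outputs standing in for the frozen teacher and the new student.
teacher = [1.0, 2.0, 3.0]
student = [0.0, 0.0, 0.0]
alpha = anneal_weight(step=250, total_steps=1000)

# Deterministic blend: the same output every time for a given step.
blended = dcr_blend(teacher, student, alpha)

# Stochastic gate: the first output coordinate fluctuates across samples.
rng = random.Random(0)
samples = [stochastic_gate(teacher, student, alpha, rng)[0] for _ in range(10_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

print(f"alpha={alpha:.4f}")
print(f"deterministic blend, first coordinate: {blended[0]:.4f}")
print(f"stochastic gate, sample mean: {mean:.4f}, sample variance: {var:.4f}")
```

In this toy setting the sample mean of the gated output approaches the deterministic blend, while its variance stays strictly positive for any `alpha` strictly between 0 and 1. That is the intuition behind preferring a deterministic weight when the gate's randomness would otherwise feed into the training signal.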
