LG · Apr 19

Decomposing the Depth Profile of Fine-Tuning

arXiv:2604.17177 · 7.1 · h-index: 1
AI Analysis

This work provides a systematic decomposition of fine-tuning depth profiles, offering insights for architecture design and training strategies in transfer learning.

The paper investigates whether the depth profile of representational change during fine-tuning is intrinsic to the model or driven by gradient magnitude. Across 240 runs with 15 models (125M–6.9B parameters), the authors find that representational change concentrates in output-proximal layers in all but one standard-training run. When the relative update size is equalized across layers, this profile persists for some architectures and objectives but collapses for others, which indicates that the profile is a composite of scale-dependent components.

Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes $\|\Delta W\|/\|W\|$ across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M–350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B–1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
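The two quantities the abstract relies on, the per-layer equal-step control and the depthwise slope, can be illustrated with a minimal sketch. This is not the authors' implementation and rests on several assumptions: weights are flat lists of floats, $\|W\|$ is taken from the pre-step weights, the shared `target_ratio` is chosen by the caller, and the per-layer change values passed to `depth_slope` come from whatever representational-change metric is in use (the abstract does not name one). The function names are hypothetical.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def equalize_relative_updates(old_weights, new_weights, target_ratio):
    """Rescale each layer's update so that ||dW|| / ||W|| equals target_ratio.

    old_weights, new_weights: one flat list of floats per layer, holding the
    weights before and after a single optimizer step.
    Assumption: ||W|| is measured on the pre-step weights.
    Returns the corrected post-step weights, one list per layer.
    """
    corrected = []
    for w_old, w_new in zip(old_weights, new_weights):
        delta = [b - a for a, b in zip(w_old, w_new)]
        delta_norm = l2_norm(delta)
        weight_norm = l2_norm(w_old)
        if delta_norm == 0.0 or weight_norm == 0.0:
            # Nothing to rescale: keep the optimizer's result unchanged.
            corrected.append(list(w_new))
            continue
        scale = target_ratio * weight_norm / delta_norm
        corrected.append([a + scale * d for a, d in zip(w_old, delta)])
    return corrected


def depth_slope(change_per_layer):
    """Least-squares slope of per-layer change against normalized depth.

    Depth runs from 0 (input side) to 1 (output side), so a positive slope
    means change concentrates in output-proximal layers.
    Requires at least two layers.
    """
    n = len(change_per_layer)
    depths = [i / (n - 1) for i in range(n)]
    mean_x = sum(depths) / n
    mean_y = sum(change_per_layer) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(depths, change_per_layer))
    var = sum((x - mean_x) ** 2 for x in depths)
    return cov / var


# Toy usage: two layers whose raw updates differ in relative size.
old = [[1.0, 0.0], [0.0, 2.0]]
new = [[1.0, 0.5], [0.2, 2.0]]
equalized = equalize_relative_updates(old, new, target_ratio=0.1)
slope = depth_slope([0.0, 1.0, 2.0])
```

After the rescaling, every layer has moved by the same fraction of its own weight norm, so any remaining depth slope in representational change cannot be attributed to unequal step sizes. That is the logic behind the control described in the abstract.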

