What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers
For mechanistic interpretability researchers, this work provides a concrete architectural template for early commitment in transformers. The findings are incremental, however, because they replicate and extend prior work on a single model family.
The paper identifies 'prolepsis' (early irrevocable commitment in transformers) and shows it is an architectural motif shared across models, with task-specific attention heads sustaining the commitment and no layer correcting it. Key findings: search requires ≤16 layers but commitment needs more, and factual recall uses the same motif at a different network depth, with zero overlap between the recurring planning heads and the factual top-10 heads.
When do transformers commit to a decision, and what prevents them from correcting it? We introduce \textbf{prolepsis}: a transformer commits early, task-specific attention heads sustain the commitment, and no layer corrects it. Replicating \citeauthor{lindsey2025biology}'s (\citeyear{lindsey2025biology}) planning-site finding on open models (Gemma~2 2B, Llama~3.2 1B), we ask five questions and report the following answers. (Q1)~Planning is invisible to six residual-stream methods; cross-layer transcoders (CLTs) are necessary. (Q2)~The planning-site spike replicates with identical geometry. (Q3)~Specific attention heads route the decision to the output, filling a gap flagged as invisible to attribution graphs. (Q4)~Search requires ${\leq}16$ layers; commitment requires more. (Q5)~Factual recall shows the same motif at a different network depth, with zero overlap between recurring planning heads and the factual top-10. Prolepsis is architectural: the template is shared, the routing substrates differ. All experiments run on a single consumer GPU (16\,GB VRAM).
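The head-overlap claim in (Q5) can be read as a set intersection over (layer, head) pairs. The sketch below is only an illustration of that comparison, not the paper's procedure: the helper `head_overlap` and every head index in it are hypothetical placeholders, and how the planning and factual head lists are actually ranked is not specified here.

```python
from typing import Iterable, Set, Tuple

# A head is identified by (layer index, head index). This representation
# is an assumption made for illustration.
Head = Tuple[int, int]


def head_overlap(planning_heads: Iterable[Head],
                 factual_heads: Iterable[Head]) -> Set[Head]:
    """Return the heads that appear in both collections."""
    return set(planning_heads) & set(factual_heads)


# Hypothetical placeholder indices, NOT values reported in the paper.
planning = [(12, 3), (14, 7), (15, 1)]
factual_top10 = [(4, 0), (5, 6), (6, 2)]

shared = head_overlap(planning, factual_top10)
print(f"shared heads: {len(shared)}")
```

An empty intersection is what "zero overlap" would look like under this reading; a non-empty one would list the heads reused by both tasks.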