AI · LG · Apr 19, 2025

The Geometry of Self-Verification in a Task-Specific Reasoning Model

Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg

arXiv:2504.14379v2 · 11 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work offers insight into how reasoning models verify their answers internally, which could help improve their reliability. The contribution is incremental: it builds on existing methods and focuses on a single task.

The researchers trained a model on the CountDown task and analyzed its internal mechanisms to investigate how reasoning models verify their own answers. They found that specific GLU weights and attention heads are crucial for self-verification, and they identified similar components in base and general reasoning models.

How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect". Bottom-up, we find that "previous-token heads" are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.
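To make the idea of GLU weights "encoding" tokens concrete, here is a minimal NumPy sketch. It is an illustration only and not necessarily the authors' procedure. It assumes a SiLU-gated GLU block (a common choice in recent transformer models) and reads out which tokens a single hidden unit promotes by projecting its down-projection row onto an unembedding matrix. All names (`glu_mlp`, `top_tokens_for_neuron`, the toy weights, and the four-word vocabulary) are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation, x * sigmoid(x). Assumed here as the gate nonlinearity."""
    return x / (1.0 + np.exp(-x))

def glu_mlp(x, W_gate, W_up, W_down):
    """GLU-style MLP block: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    hidden = silu(x @ W_gate) * (x @ W_up)   # shape: (d_hidden,)
    return hidden @ W_down                   # shape: (d_model,)

def top_tokens_for_neuron(W_down, W_unembed, vocab, neuron, k=2):
    """Project one row of W_down (the unit's output direction) onto the vocabulary."""
    logits = W_down[neuron] @ W_unembed      # shape: (vocab_size,)
    top = np.argsort(logits)[::-1][:k]       # indices of the k largest logits
    return [vocab[i] for i in top]

# Toy setup (hypothetical): d_model = 4, d_hidden = 3, four-token vocabulary.
vocab = ["success", "incorrect", "the", "7"]
W_unembed = np.eye(4)                        # token i aligned with residual dim i
W_down = np.array([
    [2.0, 0.1, 0.0, 0.0],                    # unit 0 mostly writes toward "success"
    [0.0, 3.0, 0.5, 0.0],                    # unit 1 mostly writes toward "incorrect"
    [0.0, 0.0, 0.0, 1.0],                    # unit 2 writes toward "7"
])
rng = np.random.default_rng(0)
W_gate = rng.normal(size=(4, 3))
W_up = rng.normal(size=(4, 3))

mlp_out = glu_mlp(np.ones(4), W_gate, W_up, W_down)
unit0_tokens = top_tokens_for_neuron(W_down, W_unembed, vocab, neuron=0)
unit1_tokens = top_tokens_for_neuron(W_down, W_unembed, vocab, neuron=1)
```

In this toy example, unit 0's output direction ranks "success" highest and unit 1's ranks "incorrect" highest. In a real model, the same readout would use the trained down-projection and unembedding matrices, and the top tokens would have to be inspected empirically.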
