LG · LO · May 21

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

arXiv:2605.24033 · 5.9

Predicted impact: top 95% in LG · last 90 days · Originality: Highly original

AI Analysis

For mechanistic interpretability researchers, this provides a method to formally prove or refute circuit explanations, moving beyond validation by examples and ablation.

The paper introduces Verifiable Transformers, a framework for converting Transformer circuits into solver-checkable claims, enabling formal verification of properties such as functional equivalence and edge necessity. On small symbolic tasks, the authors exhaustively verify these properties using SMT solvers, and they demonstrate surrogate-mediated verification for circuits that are harder to encode.
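
To make the idea of a bounded, checkable claim concrete, here is a minimal sketch of what "functional equivalence" and "edge necessity" could look like on a toy quote-closing task. Everything in it is an assumption for illustration: the vocabulary, the length bound, the two edge names (`quote_detector`, `parity_head`), and the hand-written `circuit` function are hypothetical and are not taken from the paper. Plain enumeration of the finite domain stands in for the SMT solver; a solver would answer the same question symbolically rather than by listing inputs.

```python
from itertools import product

ALPHABET = ("a", "b", '"')  # hypothetical toy vocabulary
MAX_LEN = 6                 # hypothetical bound on sequence length


def spec(tokens):
    """Symbolic explanation: we are inside a quote iff the quote count is odd."""
    return tokens.count('"') % 2 == 1


def circuit(tokens, ablated=frozenset()):
    """Hand-written stand-in for an extracted circuit with two named edges.

    'quote_detector' flags quote tokens; 'parity_head' folds the flags into
    a running parity. Ablating an edge disables that step.
    """
    state = 0
    for tok in tokens:
        is_quote = tok == '"' and "quote_detector" not in ablated
        if "parity_head" not in ablated:
            state ^= int(is_quote)
    return state == 1


def domain():
    """Enumerate every sequence in the bounded task domain."""
    for n in range(MAX_LEN + 1):
        yield from product(ALPHABET, repeat=n)


def find_counterexample(ablated=frozenset()):
    """Return the first input where circuit and spec disagree, else None."""
    for tokens in domain():
        if circuit(tokens, ablated) != spec(tokens):
            return tokens
    return None


# Functional equivalence holds if no counterexample exists in the domain.
equivalent = find_counterexample() is None

# An edge is necessary if ablating it yields a counterexample.
necessity = {
    edge: find_counterexample(frozenset({edge}))
    for edge in ("quote_detector", "parity_head")
}

print("equivalent on bounded domain:", equivalent)
for edge, cex in necessity.items():
    print(edge, "necessary:", cex is not None, "counterexample:", cex)
```

In this toy setting both edges turn out to be necessary, and the witness in each case is the single-token sequence containing one quote. Brute force only works here because the domain has about a thousand sequences; the appeal of a solver-based encoding is that it can cover the same bounded domain without enumerating it.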

Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.
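
One reason an operator such as sparsemax is plausibly friendlier to a solver than softmax is that it needs no exponentials: its output is a piecewise-linear function of the scores. The sketch below implements the standard sparsemax definition (Euclidean projection onto the probability simplex) with exact rational arithmetic. It is not the paper's implementation, and the function name and the example scores are chosen here purely for illustration.

```python
from fractions import Fraction


def sparsemax(scores):
    """Project a score vector onto the probability simplex (sparsemax).

    Uses exact rationals, so every step is a comparison, an addition, or a
    division by a small integer: no exponentials are involved.
    """
    z = [Fraction(s) for s in scores]
    z_sorted = sorted(z, reverse=True)

    # Find the support size k: the largest j with 1 + j * z_(j) > sum_{i<=j} z_(i).
    cumsum = Fraction(0)
    k, support_sum = 0, Fraction(0)
    for j, zj in enumerate(z_sorted, start=1):
        cumsum += zj
        if 1 + j * zj > cumsum:
            k, support_sum = j, cumsum

    # Threshold tau shifts the supported scores so they sum to one.
    tau = (support_sum - 1) / k
    return [max(zi - tau, Fraction(0)) for zi in z]


peaked = sparsemax([2, 1, 0])
mixed = sparsemax([Fraction(1, 2), Fraction(1, 4), Fraction(-1)])

print(peaked)  # all mass on the largest score: [1, 0, 0]
print(mixed)   # [5/8, 3/8, 0]
```

Unlike softmax, the result can contain exact zeros, which is also what makes attention patterns sparse. The condition `1 + j * z_(j) > cumsum` always holds for `j = 1`, so the support is never empty and the division by `k` is safe.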
