5.9LGMay 21
Towards Verifiable Transformers: Solver-Checkable Circuit ExplanationsNeel Somani
Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.
PLSep 5, 2019
Duet: An Expressive Higher-order Language and Linear Type System for Statically Enforcing Differential PrivacyJoseph P. Near, David Darais, Chike Abuah et al.
During the past decade, differential privacy has become the gold standard for protecting the privacy of individuals. However, verifying that a particular program provides differential privacy often remains a manual task to be completed by an expert in the field. Language-based techniques have been proposed for fully automating proofs of differential privacy via type system design, however these results have lagged behind advances in differentially-private algorithms, leaving a noticeable gap in programs which can be automatically verified while also providing state-of-the-art bounds on privacy. We propose Duet, an expressive higher-order language, linear type system and tool for automatically verifying differential privacy of general-purpose higher-order programs. In addition to general purpose programming, Duet supports encoding machine learning algorithms such as stochastic gradient descent, as well as common auxiliary data analysis tasks such as clipping, normalization and hyperparameter tuning - each of which are particularly challenging to encode in a statically verified differential privacy framework. We present a core design of the Duet language and linear type system, and complete key proofs about privacy for well-typed programs. We then show how to extend Duet to support realistic machine learning applications and recent variants of differential privacy which result in improved accuracy for many practical differentially private algorithms. Finally, we implement several differentially private machine learning algorithms in Duet which have never before been automatically verified by a language-based tool, and we present experimental results which demonstrate the benefits of Duet's language design in terms of accuracy of trained machine learning models.