LG · May 15, 2025

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

arXiv:2505.10039v2 · 5 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck for mechanistic interpretability researchers: circuits found by existing discovery methods are often incomplete. The advance is incremental, since it builds on those existing methods.

The paper tackles incomplete circuit discovery in language models by decomposing circuits into AND, OR, and ADDER gates. It proposes a framework that combines noising and denoising interventions to fully identify these gates, and validates it with experiments that restore faithfulness, completeness, and sparsity while uncovering the gates' properties and behaviors.

Circuit discovery has become one of the prominent methods for mechanistic interpretability, and circuit completeness has garnered increasing attention. Circuit discovery methods that do not guarantee completeness yield circuits that vary across runs and omit key mechanisms. Incompleteness arises from OR gates within the circuit, which standard circuit discovery methods often detect only partially. To address this, we systematically introduce three types of logic gates (AND, OR, and ADDER) and decompose the circuit into combinations of these gates. Using these gates, we derive the minimum requirements for faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions and can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework can fully identify the logic gates in a circuit and distinguish among them. Extensive experiments validate the framework's ability to restore the faithfulness, completeness, and sparsity of circuits. Using the framework, we also uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave across the functionalities of language models.
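To make the role of the two intervention types concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes simple two-input gate semantics that the abstract does not spell out (AND needs every input, OR needs any one input, ADDER sums its inputs), and every name in it (`GATES`, `noising_effect`, `denoising_effect`, `classify`) is hypothetical. Under those assumptions, noising a single input leaves an OR gate's output unchanged, denoising a single input leaves an AND gate's output unchanged, and only the combination of both tests separates all three gate types.

```python
# Toy two-input gates. 1 stands for a clean activation, 0 for a corrupted one.
# These gate semantics are an assumption made for illustration only.
GATES = {
    "AND": lambda a, b: float(a and b),
    "OR": lambda a, b: float(a or b),
    "ADDER": lambda a, b: 0.5 * a + 0.5 * b,
}


def noising_effect(gate, idx):
    """Start from the clean run (1, 1), corrupt one input, return the output drop."""
    inputs = [1, 1]
    clean_out = gate(*inputs)
    inputs[idx] = 0
    return clean_out - gate(*inputs)


def denoising_effect(gate, idx):
    """Start from the corrupted run (0, 0), restore one input, return the output gain."""
    inputs = [0, 0]
    corrupted_out = gate(*inputs)
    inputs[idx] = 1
    return gate(*inputs) - corrupted_out


def classify(gate, idx=0, tol=1e-9):
    """Label a gate from the pair of single-input intervention effects."""
    hit_by_noising = noising_effect(gate, idx) > tol
    hit_by_denoising = denoising_effect(gate, idx) > tol
    if hit_by_noising and hit_by_denoising:
        return "ADDER"
    if hit_by_noising:
        return "AND"
    if hit_by_denoising:
        return "OR"
    return "none"


for name, gate in GATES.items():
    print(name, noising_effect(gate, 0), denoising_effect(gate, 0), classify(gate))
```

In this toy setting, a method that relies on noising alone sees no effect from either input of the OR gate, which is one way to read the abstract's claim that OR gates are only partially detected by standard methods.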

