Circuit Distillation
This work addresses the efficient distillation of targeted capabilities in AI models and offers a more interpretable and controllable approach, although the contribution is incremental because it builds on existing distillation methods.
The authors address the transfer of algorithmic capabilities from teacher to student models. They propose circuit distillation, which aligns the internal representations of functionally correspondent circuit components, and show that it outperforms standard distillation on entity tracking and theory of mind tasks using Llama3 models.
Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher's output while treating its internal computations as a black box. In this work, we propose an alternative approach: distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match "functionally correspondent" circuit components and introduce a loss that reflects the similarity between the representations these components induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.
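To make the alignment objective more concrete, the following is a minimal sketch of what such a loss could look like. The abstract does not name the similarity measure, so linear centered kernel alignment (CKA) is used here purely as an assumption; it is convenient because it compares representations of different widths, as teacher and student components would have. The function names, the `weight` parameter, and the idea of adding the alignment term to a scalar `task_loss` are all hypothetical and are not taken from the paper.

```python
import math


def center_columns(m):
    """Subtract the per-column mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of activations over the same n inputs.

    x and y may have different feature widths; the result lies in [0, 1].
    """
    x, y = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(x, y)
    den = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return num / den


def circuit_distillation_loss(task_loss, matched_pairs, weight=1.0):
    """Hypothetical combined objective: task loss plus mean (1 - CKA).

    matched_pairs is a list of (student_acts, teacher_acts) tuples, one per
    pair of components assumed to be functionally correspondent.
    """
    align = sum(1.0 - linear_cka(s, t) for s, t in matched_pairs)
    return task_loss + weight * align / len(matched_pairs)


# Toy activations: 3 inputs, student width 2, teacher width 3.
student = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
teacher = [[0.5, 1.0, -1.0], [2.0, 0.0, 3.0], [1.0, 4.0, 2.0]]
loss = circuit_distillation_loss(0.7, [(student, teacher)])
```

In a real training setup the activations would be tensors collected from the matched components, and only the student parameters inside those components would receive gradients, in line with the abstract's description of updating a small, targeted subset of parameters.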