HEP-TH DIS-NN AI LG

Feb 5

Towards Worst-Case Guarantees with Scale-Aware Interpretability

Lauren Greenspan, David Berman, Aryeh Brill, Ro Jefferson, Artemy Kolchinsky, Jennifer Lin, Andrew Mack, Anindita Maiti, Fernando E. Rosas, Alexander Stapleton, Lucas Teixeira, Dmitry Vaintrob

arXiv:2602.05184v12.31 citations

h-index: 20

Originality Synthesis-oriented

AI Analysis

This work addresses the need for more reliable interpretability methods in AI safety, though its contribution is incremental: it synthesizes existing research threads into a new agenda.

The paper proposes a scale-aware interpretability framework for neural networks. It uses renormalization from physics to track how features compose across resolutions and to bound the influence of fine-grained structure, with the aim of developing tools that have robustness and faithfulness properties.

Neural networks organize information according to the hierarchical, multi-scale structure of natural data. Methods to interpret model internals should be similarly scale-aware, explicitly tracking how features compose across resolutions and guaranteeing bounds on the influence of fine-grained structure that is discarded as irrelevant noise. We posit that the renormalisation framework from physics can meet this need by offering technical tools that can overcome limitations of current methods. Moreover, relevant work from adjacent fields has now matured to a point where scattered research threads can be synthesized into practical, theory-informed tools. To combine these threads in an AI safety context, we propose a unifying research agenda -- \emph{scale-aware interpretability} -- to develop formal machinery and interpretability tools that have robustness and faithfulness properties supported by statistical physics.
