LG AIOct 27, 2024

Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang

arXiv:2410.21331v112.59 citationsh-index: 14Has CodeICLR

Originality Highly original

AI Analysis

This addresses the challenge of balancing interpretability and robustness for deep learning practitioners, offering a novel perspective that could enhance model reliability in noisy or limited-data settings.

The paper tackles the problem of the accuracy-interpretability tradeoff in deep learning by showing that monosemantic features, which enhance interpretability, also improve model performance across robust learning scenarios like input noise and out-of-domain generalization, with models significantly outperforming those using polysemantic features.

Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios-including input and label noise, few-shot learning, and out-of-domain generalization-our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis of linking interpretability and robustness. Code is available at \url{https://github.com/PKU-ML/Beyond_Interpretability}.

View on arXiv PDF Code

Similar