LGCVMay 29

Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

arXiv:2605.3130464.5
Predicted impact top 31% in LG · last 90 daysOriginality Incremental advance
AI Analysis

This work addresses the problem of poor interpretability in deep neural networks for AI researchers and practitioners, offering a method to gain clearer insights into model decisions without performance degradation, which is a significant incremental improvement over existing methods.

This paper introduces ELUDe, a method to improve the interpretability of deep neural networks by disentangling polysemantic neurons into monosemantic features. It achieves this without altering the model's predictive performance, as demonstrated across several vision models including DINOv2 and supervised ViT-B/16, while also being efficient and requiring no explicit training or labels.

Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes