Preserving Task-Relevant Information Under Linear Concept Removal
This work addresses fairness and interpretability concerns in machine learning by enabling more effective concept removal while preserving task performance.
The paper tackles the problem of removing unwanted concepts from neural network representations without degrading task-relevant information. It introduces SPLINCE, which outperforms baselines on benchmarks such as Bias in Bios and Winobias.
Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLINCE achieves this via an oblique projection that 'splices out' the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
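To make the two constraints in the abstract concrete, the sketch below builds a simple oblique projection on synthetic data. It is an illustration under stated assumptions, not the SPLINCE estimator itself: it assumes a scalar concept z and a scalar label y, it enforces only the two covariance conditions (zero covariance with z, unchanged covariance with y), and it makes no attempt to find the minimal-distortion solution the abstract describes. The function names (cross_cov, oblique_projection, apply_projection) and the synthetic data are hypothetical and chosen for this example only.

```python
import numpy as np


def cross_cov(X, v):
    """Sample cross-covariance between the columns of X (n, d) and a vector v (n,)."""
    Xc = X - X.mean(axis=0)
    vc = v - v.mean()
    return Xc.T @ vc / (len(v) - 1)


def oblique_projection(X, z, y):
    """Return P with P @ Cov(X, z) = 0 and P @ Cov(X, y) = Cov(X, y).

    Illustrative construction only (an assumption of this sketch):
    P = I - a b^T / (b^T a), where a = Cov(X, z) and b is the part of a
    that is orthogonal to c = Cov(X, y).
    """
    a = cross_cov(X, z)  # direction to splice out
    c = cross_cov(X, y)  # direction to protect
    b = a - (a @ c) / (c @ c) * c  # b is orthogonal to c by construction
    denom = b @ a
    if np.isclose(denom, 0.0):
        # a is (numerically) parallel to c: both constraints cannot hold at once.
        raise ValueError("concept and label covariance directions coincide")
    return np.eye(X.shape[1]) - np.outer(a, b) / denom


def apply_projection(X, P):
    """Project the centered embeddings and add the mean back."""
    mu = X.mean(axis=0)
    return (X - mu) @ P.T + mu


# Synthetic embeddings in which both the concept z and the label y leave a trace.
rng = np.random.default_rng(0)
n, d = 2000, 5
z = rng.integers(0, 2, size=n).astype(float)
y = ((z + rng.normal(size=n)) > 0.5).astype(float)
X = (
    rng.normal(size=(n, d))
    + np.outer(z, rng.normal(size=d))
    + np.outer(y, rng.normal(size=d))
)

P = oblique_projection(X, z, y)
X_new = apply_projection(X, P)

print("max |Cov(X_new, z)|:", np.abs(cross_cov(X_new, z)).max())
print("max |Cov(X_new, y) - Cov(X, y)|:",
      np.abs(cross_cov(X_new, y) - cross_cov(X, y)).max())
```

Because the projection is oblique rather than orthogonal, its null space (the concept direction) and the direction it leaves fixed (the label covariance) need not be perpendicular, which is what allows both conditions to hold simultaneously. Many projections satisfy these two conditions; the abstract's uniqueness claim refers to the one that additionally minimizes embedding distortion, which this sketch does not compute.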