Probing Classifiers are Unreliable for Concept Removal and Detection
This work highlights a critical flaw in existing concept removal techniques, which is important for fairness and sensitive applications in machine learning, though it is incremental as it critiques and builds upon prior methods.
The paper demonstrates that post-hoc and adversarial methods for removing undesirable concepts from neural network representations are unreliable, as they fail to completely remove concepts and can destroy task-relevant features, with theoretical proofs and experiments on synthetic, Multi-NLI, and Twitter datasets confirming these issues.
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation. Removing such concepts is non-trivial because of a complex relationship between the concept, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier when a concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.