What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
This work examines unintended consequences of concept removal methods, a concern for researchers in interpretable AI, and reveals potential vulnerabilities in dataset privacy and fairness.
The paper investigates how linear projection methods for concept removal affect datasets of language representations. It finds that these methods introduce strong statistical dependencies and structure the representation space so that an instance tends to lie near instances of the opposite label, which in some cases allows the original labels to be reconstructed with an anti-clustering method.
We investigate the behavior of methods that use linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.
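
The kind of measurement described above can be illustrated with a minimal sketch. It is not the authors' experimental setup. It assumes a simple mean-difference projection as a stand-in for the family of projection-based removal methods, uses synthetic Gaussian data, and the helper names `mean_difference_projection` and `opposite_label_neighbor_rate` are hypothetical. The sketch removes the direction between the two class means and then reports how often an instance's nearest neighbor carries the opposite label, before and after the projection.

```python
import numpy as np


def mean_difference_projection(X, y):
    """Orthogonal projection that removes the direction between the class means.

    Assumption: this stands in for projection-based concept removal in general;
    it is not claimed to be the specific method analyzed in the paper.
    """
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    return np.eye(X.shape[1]) - np.outer(d, d)


def opposite_label_neighbor_rate(X, y):
    """Fraction of instances whose nearest neighbor has the opposite label."""
    sq = (X ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances.
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # An instance must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return float((y[nearest] != y).mean())


# Synthetic data (an assumption for illustration): isotropic Gaussian noise
# with the binary concept encoded as a shift along the first coordinate.
rng = np.random.default_rng(0)
n, dim = 200, 50
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim))
X[:, 0] += 2.0 * y

P = mean_difference_projection(X, y)
X_proj = X @ P  # P is symmetric, so this projects every row

rate_before = opposite_label_neighbor_rate(X, y)
rate_after = opposite_label_neighbor_rate(X_proj, y)
print(f"opposite-label nearest-neighbor rate before: {rate_before:.3f}")
print(f"opposite-label nearest-neighbor rate after:  {rate_after:.3f}")
```

Because the projection is estimated from the same instances it is then applied to, each transformed point depends on its own label through the class means. That is one way to see how dependencies of the kind described above can arise. How strongly the neighbor rate shifts will depend on the sample size, the dimensionality, and the removal method actually used.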