LEACE: Perfect linear concept erasure in closed form
This work addresses fairness and interpretability in machine learning, particularly for large language models, by providing a method to erase concepts such as gender or race from embeddings. The contribution is incremental in that it builds on existing concept erasure techniques.
The paper tackles the problem of removing specified features from embeddings to improve fairness and interpretability. It introduces LEACE, a closed-form method that provably prevents linear classifiers from detecting a concept while changing the embeddings as little as possible. It demonstrates the method on two tasks: reducing gender bias in BERT embeddings and measuring how much language models rely on part-of-speech information.
Concept erasure aims to remove specified features from an embedding. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the embedding as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
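To make the idea of closed-form linear erasure concrete, the following is a minimal NumPy sketch. It is not the API of the linked repository. The function names `fit_linear_eraser` and `erase` are hypothetical, and the sketch assumes the erasure takes a whitening-then-projection form, x - W⁺ P W (x - mean), where W whitens the embeddings and P projects onto the directions that covary with the concept labels. That specific form is an assumption here, since the abstract does not state the formula. What the sketch does guarantee is that the empirical cross-covariance between the erased embeddings and the labels is zero on the fitting data.

```python
import numpy as np


def fit_linear_eraser(X, Z, tol=1e-10):
    """Fit an affine map that removes linear information about Z from X.

    X: (n, d) embeddings. Z: (n,) or (n, k) concept labels (e.g. one-hot).
    Returns (mu, M) such that erased = x - M @ (x - mu).
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = X.shape[0]

    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # (d, d) covariance of the embeddings
    sigma_xz = Xc.T @ Zc / n  # (d, k) cross-covariance with the concept

    # Symmetric (pseudo-)inverse square root of sigma_xx and its pseudo-inverse.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    root = np.sqrt(evals[keep])
    V = evecs[:, keep]
    W = (V / root) @ V.T       # whitening transform
    W_pinv = (V * root) @ V.T  # un-whitening transform

    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    M = W_pinv @ P @ W
    return mu, M


def erase(X, mu, M):
    """Apply the fitted eraser to rows of X."""
    X = np.asarray(X, dtype=float)
    return X - (X - mu) @ M.T
```

In use, one would fit `mu, M` on training embeddings with their concept labels and then call `erase` on new embeddings. Applying a separately fitted eraser at each layer of a network is one way to picture the layer-by-layer "concept scrubbing" procedure described above, though the exact procedure used by the authors is not specified in this abstract.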