TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes
TaCo addresses fairness issues in NLP by preventing the biased outcomes that arise from encoded sensitive attributes. It represents an incremental improvement over existing concept erasure methods.
The paper tackles fairness in NLP models by introducing Targeted Concept Erasure (TaCo), which removes sensitive information from latent representations so that non-linear classifiers cannot detect it. TaCo outperforms state-of-the-art methods, achieving greater reductions in the prediction accuracy of sensitive attributes.
Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leaves models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the prediction accuracy of sensitive attributes by non-linear classifiers while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.
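The motivation above, that erasure built around linear classifiers can leave information recoverable by a non-linear adversary, can be illustrated with a toy sketch. This is not the TaCo algorithm: it applies a simple mean-difference projection (a stand-in for a linear erasure step) to four hand-picked 2-D vectors, and every name in it (`erase_mean_difference`, `class_mean_gap`, `radial_probe_accuracy`) as well as the data is a hypothetical choice made for illustration.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def class_means(reps, labels):
    """Means of the representations with sensitive label 0 and label 1."""
    mu0 = mean([z for z, s in zip(reps, labels) if s == 0])
    mu1 = mean([z for z, s in zip(reps, labels) if s == 1])
    return mu0, mu1


def erase_mean_difference(reps, labels):
    """Toy linear erasure: project out the direction between the class means."""
    mu0, mu1 = class_means(reps, labels)
    diff = [a - b for a, b in zip(mu0, mu1)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm == 0:
        return [list(z) for z in reps]  # nothing to remove
    d = [c / norm for c in diff]
    erased = []
    for z in reps:
        coeff = sum(a * b for a, b in zip(z, d))
        erased.append([a - coeff * b for a, b in zip(z, d)])
    return erased


def class_mean_gap(reps, labels):
    """Distance between class means; zero means the means no longer differ."""
    mu0, mu1 = class_means(reps, labels)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu0, mu1)))


def radial_probe_accuracy(reps, labels, threshold):
    """Non-linear probe: predict label 1 when the squared norm exceeds threshold."""
    preds = [1 if sum(c * c for c in z) > threshold else 0 for z in reps]
    return sum(p == s for p, s in zip(preds, labels)) / len(labels)


# Hypothetical latent representations and binary sensitive labels.
reps = [[1.0, 0.5], [1.0, -0.5], [-1.0, 2.0], [-1.0, -2.0]]
labels = [0, 0, 1, 1]

erased = erase_mean_difference(reps, labels)
print(class_mean_gap(erased, labels))                        # gap between class means after erasure
print(radial_probe_accuracy(erased, labels, threshold=1.0))  # accuracy of the non-linear probe
```

In this toy data the class means coincide after the projection, yet the squared norm of each erased vector still separates the two groups. A probe that only compares means would see nothing left to exploit, while a simple non-linear feature recovers the label, which is the kind of gap a method targeting non-linear classifiers aims to close.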