CL · ML · Dec 11, 2023

TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes

arXiv:2312.06499v4 · 3 citations · h-index: 6 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses fairness issues in NLP by preventing the biased outcomes that arise from encoded sensitive attributes. It is an incremental improvement over existing concept erasure methods.

The paper tackles fairness in NLP models by introducing Targeted Concept Erasure (TaCo), which removes sensitive information from final latent representations so that non-linear classifiers cannot detect it. In the authors' experiments, TaCo outperforms state-of-the-art methods, reducing the accuracy with which non-linear classifiers predict sensitive attributes by a larger margin.

Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leaves models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the prediction accuracy of sensitive attributes by non-linear classifiers while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.
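
The evaluation idea in the abstract (erase, then check whether a non-linear probe can still recover the attribute) can be sketched on synthetic data. This is a minimal illustration and not the TaCo algorithm: the eraser below is a stand-in that only projects out the class-mean-difference direction, the non-linear adversary is a simple k-nearest-neighbour probe, and every name (`make_data`, `fit_mean_difference_eraser`, the probe functions) and all data parameters are assumptions made for this example.

```python
import numpy as np

def make_data(n, rng):
    """Synthetic 'latent representations' X with a binary protected attribute z."""
    z = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 4))
    X[:, 0] += 2.0 * z                                  # attribute encoded linearly (mean shift)
    X[:, 1:3] *= np.where(z == 1, 3.0, 1.0)[:, None]    # attribute encoded non-linearly (scale)
    return X, z

def fit_mean_difference_eraser(X, z):
    """Stand-in linear eraser (NOT TaCo): project out the class-mean-difference direction."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    d /= np.linalg.norm(d)
    return np.eye(X.shape[1]) - np.outer(d, d)

def linear_probe_accuracy(X_tr, z_tr, X_te, z_te):
    """Least-squares linear probe with an intercept, thresholded at 0.5."""
    A = np.c_[X_tr, np.ones(len(X_tr))]
    w, *_ = np.linalg.lstsq(A, z_tr.astype(float), rcond=None)
    pred = (np.c_[X_te, np.ones(len(X_te))] @ w > 0.5).astype(int)
    return (pred == z_te).mean()

def knn_probe_accuracy(X_tr, z_tr, X_te, z_te, k=15):
    """Non-linear probe: majority vote among the k nearest training points."""
    dists = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    pred = (z_tr[nearest].mean(axis=1) > 0.5).astype(int)
    return (pred == z_te).mean()

rng = np.random.default_rng(0)
X_train, z_train = make_data(1000, rng)
X_test, z_test = make_data(500, rng)

# Fit the eraser on the training split only, then apply it to both splits.
P = fit_mean_difference_eraser(X_train, z_train)
X_train_erased, X_test_erased = X_train @ P, X_test @ P

for name, probe in [("linear", linear_probe_accuracy), ("k-NN", knn_probe_accuracy)]:
    before = probe(X_train, z_train, X_test, z_test)
    after = probe(X_train_erased, z_train, X_test_erased, z_test)
    print(f"{name} probe accuracy: before={before:.3f}, after={after:.3f}")
```

After this projection the two class means coincide on the training split, yet the attribute still shapes the spread of two other coordinates, which is the kind of residual signal a non-linear adversary can exploit. Comparing probe accuracy before and after erasure, alongside accuracy on the main task, is one way to quantify what the abstract calls reductions in the prediction accuracy of sensitive attributes.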

Code Implementations: 1 repo
