CL · Dec 23, 2025

Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

arXiv:2512.20796v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the challenge of debiasing language models for fairer applications. It is an incremental advance, building on existing methods for feature ablation and evaluation.

The study asks whether demographic bias in language models can be removed without erasing the ability to recognize demographics. It finds that targeted feature ablations in Gemma-2-9B reduce bias while preserving recognition accuracy: attribution-based methods mitigate race and gender stereotypes, while correlation-based methods are more effective for education bias.

We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
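To make "sparse autoencoder feature ablation" concrete, here is a minimal sketch of the general idea, not the authors' implementation. It assumes a standard ReLU sparse autoencoder over a single activation vector, and it assumes the reconstruction error is added back after ablation so that only the selected features' contribution changes. All names (`encode`, `decode`, `ablate_features`), the toy dimensions, and the random weights are hypothetical. In practice the weights would come from a trained autoencoder, and the feature indices from an attribution-based or correlation-based selection step.

```python
import random

def encode(x, W_enc, b_enc):
    """ReLU encoder: activation vector -> non-negative feature activations."""
    return [
        max(0.0, sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j])
        for j in range(len(b_enc))
    ]

def decode(f, W_dec, b_dec):
    """Linear decoder: feature activations -> reconstructed activation vector."""
    return [
        sum(f[j] * W_dec[j][k] for j in range(len(f))) + b_dec[k]
        for k in range(len(b_dec))
    ]

def ablate_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Zero the chosen features and return the edited activation vector."""
    f = encode(x, W_enc, b_enc)
    recon = decode(f, W_dec, b_dec)
    # Assumption: keep the part of x the autoencoder fails to reconstruct,
    # so that only the ablated features' contribution is removed.
    error = [xi - ri for xi, ri in zip(x, recon)]
    drop = set(feature_ids)
    f_ablated = [0.0 if j in drop else v for j, v in enumerate(f)]
    edited = decode(f_ablated, W_dec, b_dec)
    return [ei + err for ei, err in zip(edited, error)]

# Toy setup with random weights (hypothetical sizes, not Gemma-2-9B's).
random.seed(0)
d_model, n_features = 4, 6
W_enc = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

unchanged = ablate_features(x, W_enc, b_enc, W_dec, b_dec, [])      # no features removed
edited = ablate_features(x, W_enc, b_enc, W_dec, b_dec, [1, 4])     # two features removed
```

With an empty feature list the activation passes through unchanged (up to floating-point rounding). With features 1 and 4 ablated, the edit subtracts exactly those features' decoder contributions and leaves everything else untouched. That is the sense in which such an intervention is targeted.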

