LG, AI, CL · Feb 3, 2025

GRADIEND: Feature Learning within Neural Networks Exemplified through Biases

arXiv:2502.01406v3 · 2 citations · h-index: 2
Originality: Highly original
AI Analysis

The paper addresses harmful societal biases in AI systems deployed in sensitive domains, and it proposes a novel method for this known bottleneck.

The paper tackles the problem of social biases in AI systems by introducing a gradient-based encoder-decoder method to learn feature neurons encoding bias information, enabling model debiasing while preserving other capabilities across various architectures.

AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method not only identifies which weights of a model need to be changed to modify a feature, but can also be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
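
To make the abstract's idea concrete, here is a minimal toy sketch of a gradient encoder-decoder with a single feature neuron. Everything specific in it is an assumption for illustration, not taken from the abstract: the single linear layers, the tanh activation, the scaling factor `alpha`, and all function and variable names are hypothetical, and the training loop that would fit the encoder and decoder is omitted.

```python
import math
import random

def encode(grad, w_enc, b_enc):
    """Compress a flattened gradient vector into one scalar 'feature neuron'.
    Assumption: a single linear layer followed by tanh."""
    return math.tanh(sum(w * g for w, g in zip(w_enc, grad)) + b_enc)

def decode(h, w_dec, b_dec):
    """Expand the scalar back into a weight-sized update direction.
    Assumption: a single linear layer with one input."""
    return [h * w + b for w, b in zip(w_dec, b_dec)]

def most_affected(w_dec, k):
    """Indices of the k weights with the largest decoder magnitude,
    a stand-in for 'which weights need to be changed'."""
    order = sorted(range(len(w_dec)), key=lambda i: abs(w_dec[i]), reverse=True)
    return order[:k]

def rewrite_weights(weights, h, w_dec, b_dec, alpha):
    """Shift model weights along the decoded direction, scaled by alpha."""
    update = decode(h, w_dec, b_dec)
    return [w + alpha * u for w, u in zip(weights, update)]

# Toy data: a hypothetical 4-weight model and a made-up gradient.
random.seed(0)
n = 4
grad = [0.5, -1.0, 0.25, 0.0]
w_enc = [random.uniform(-1, 1) for _ in range(n)]  # untrained, random
w_dec = [random.uniform(-1, 1) for _ in range(n)]  # untrained, random
b_dec = [0.0] * n
weights = [1.0, 2.0, 3.0, 4.0]

h = encode(grad, w_enc, 0.0)          # scalar feature value in [-1, 1]
top = most_affected(w_dec, k=2)       # weights the decoder touches most
new_weights = rewrite_weights(weights, h, w_dec, b_dec, alpha=0.1)
```

In this sketch, setting `alpha` to zero leaves the model untouched, while a nonzero value moves every weight in proportion to its decoder entry; how the paper actually chooses the feature value and scaling is not stated in the abstract.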

Code Implementations: 1 repo
