Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
This work addresses bias in LLMs, a critical issue for fairness in AI applications, though the contribution appears incremental because it builds on existing debiasing techniques.
The paper tackles the debiasing of large language models (LLMs) by proposing two zero-shot logits-layer methods, Static and Dynamic, which reduce bias by up to 70% with minimal fluency loss and outperform hidden-layer approaches.
We propose Static and Dynamic, two zero-shot logits-layer debiasing methods. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits intervention outperforms hidden-layer approaches. We show that semantically-aware logits intervention is stable and effective for debiasing aligned LLMs.
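The abstract does not say how Static or Dynamic compute their adjustments, so the sketch below is not the paper's method. It only illustrates, on a toy example, what intervening at the logits layer can mean in general: a fixed penalty is subtracted from the logits of selected tokens before the softmax. The function name `fixed_penalty_intervention`, the four-token vocabulary, the flagged index, and the penalty value are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_penalty_intervention(logits, flagged_ids, penalty):
    """Hypothetical helper: subtract a fixed penalty from flagged token logits.

    This is a generic logits-layer adjustment, not the paper's Static or
    Dynamic method, whose details are not given in the abstract.
    """
    adjusted = list(logits)  # copy so the caller's logits are not mutated
    for i in flagged_ids:
        adjusted[i] -= penalty
    return adjusted


# Toy next-token logits over a four-token vocabulary (made-up values).
# Index 2 stands in for a token that some external criterion has flagged.
logits = [1.0, 0.5, 2.0, 0.0]

before = softmax(logits)
after = softmax(fixed_penalty_intervention(logits, flagged_ids=[2], penalty=3.0))

# The flagged token loses probability mass; the other tokens are renormalized.
top_before = before.index(max(before))
top_after = after.index(max(after))
```

Because the adjustment acts only on the output distribution, the model's hidden states are left untouched, which is the general distinction the abstract draws between logits-layer and hidden-layer approaches.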