CL · May 28, 2025

BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

arXiv:2505.23829v1 · 4 citations · h-index: 10 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the high cost and limited scalability of existing debiasing methods for LLMs. It offers a model-agnostic approach that improves fairness without retraining, and the advance is incremental rather than fundamental.

The paper tackles social bias in large language models by proposing BiasFilter, an inference-time debiasing framework that filters generation outputs in real time. It reports reduced bias with preserved generation quality across various LLMs.

Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
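The filtering loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `filtered_generate`, its parameters (`keep`, `branch`, `max_rounds`), and the toy helpers are all hypothetical names chosen for illustration. The keyword-based `toy_reward` is only a stand-in for the trained fairness reward model, and `extend` stands in for a call to any LLM that returns the next few tokens.

```python
import itertools
from typing import Callable, List


def filtered_generate(
    prompt: str,
    extend: Callable[[str, str], str],    # hypothetical: returns the next short segment
    reward: Callable[[str, str], float],  # hypothetical: higher means fairer
    is_done: Callable[[str], bool],
    keep: int = 2,        # size of the active candidate set
    branch: int = 2,      # continuations sampled per active candidate
    max_rounds: int = 8,  # upper bound on segment-level steps
) -> str:
    """Grow a response segment by segment, keeping only high-reward candidates."""
    active: List[str] = [""]
    for _ in range(max_rounds):
        candidates: List[str] = []
        for partial in active:
            if is_done(partial):
                # Finished candidates are carried forward unchanged.
                candidates.append(partial)
                continue
            for _ in range(branch):
                candidates.append(partial + extend(prompt, partial))
        # Score every candidate and discard the low-reward ones.
        candidates.sort(key=lambda c: reward(prompt, c), reverse=True)
        active = candidates[:keep]
        if all(is_done(p) for p in active):
            break
    return active[0]


# Toy stand-ins so the sketch runs without a model (assumptions, not the paper's components).
BIASED_WORDS = {"always", "never"}


def toy_reward(prompt: str, text: str) -> float:
    # Penalize each over-generalizing word; a real system would use a learned reward model.
    return -sum(word.strip(".") in BIASED_WORDS for word in text.split())


def make_toy_extend(segments: List[str]) -> Callable[[str, str], str]:
    # Deterministic "sampler" that cycles through fixed segments.
    it = itertools.cycle(segments)
    return lambda prompt, partial: next(it)


def toy_is_done(text: str) -> bool:
    return text.endswith(".")
```

Because the loop only needs text in and text out, the same structure would apply to an API-based model as well as a local one, which is consistent with the model-agnostic claim above. The choice of `keep` and `branch` trades extra generation cost against how aggressively low-reward segments are pruned.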

