CL · LG · May 4

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

arXiv:2605.02348 · 9.0

Predicted impact: top 99% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners needing to mitigate social biases in LLMs without expensive retraining or access to model weights, this work offers a practical, scalable decoding-time approach.

The paper introduces decoding-time debiasing for LLMs using a Process Reward Model (PRM) that scores candidate tokens for fairness and fluency, with no retraining or fine-tuning. Sequential debiasing raises mean bias scores by up to 0.40 over baseline while preserving fluency, and a lightweight Bias Guard gate keeps overhead near 2x for well-calibrated models.

Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights. A separate Process Reward Model (PRM) acts as a judge, scoring each candidate for both fairness and fluency. We design three schemes of increasing sophistication (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit) and evaluate them on four models (GPT-4o-mini, Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 3B) across a 200-prompt bilingual benchmark in English and Urdu covering eight bias categories. Sequential debiasing proves the most effective, raising mean bias scores by up to +0.40 over baseline while preserving (and sometimes improving) fluency. We then extend all three schemes to open-ended generation, where each token is debiased on the fly, and introduce a lightweight Bias Guard gate that fires only on potentially biased words, keeping overhead near 2x for well-calibrated models. A formal overhead metric that separates generator cost from judge cost reveals that Best-of-N is effectively free on the generator side in a native implementation. GPT-4o-mini, included as a strong proprietary anchor, confirms that the framework scales with model capability; the three open-weight models show where current small-scale LLMs still struggle.
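As a rough illustration of the simplest scheme described above, the following Python sketch shows Best-of-N selection with a judge that returns fairness and fluency scores, plus a gate in the spirit of the Bias Guard. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `combined_score`, `best_of_n`, `guarded_pick`, `toy_judge`, and `GUARD_WORDS`, the equal weighting of the two scores, and the word-list trigger are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable, Sequence, Tuple

# A judge maps a candidate token to (fairness, fluency), each assumed to lie in [0, 1].
Judge = Callable[[str], Tuple[float, float]]

# Hypothetical trigger list; the abstract does not say how the Bias Guard decides to fire.
GUARD_WORDS = frozenset({"he", "she", "his", "her"})


def combined_score(fairness: float, fluency: float, weight: float = 0.5) -> float:
    """Weighted mix of the two judge scores (the weighting is an assumption)."""
    return weight * fairness + (1.0 - weight) * fluency


def best_of_n(candidates: Sequence[str], judge: Judge, weight: float = 0.5) -> str:
    """Return the candidate with the highest combined judge score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: combined_score(*judge(c), weight))


def guarded_pick(candidates: Sequence[str], judge: Judge) -> Tuple[str, bool]:
    """Call the judge only when the generator's top candidate is a guard word.

    Candidates are assumed to be ordered by generator preference.
    Returns the chosen token and whether the judge was invoked.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    top = candidates[0]
    if top.lower() not in GUARD_WORDS:
        return top, False  # gate stays closed: no judge cost for this token
    return best_of_n(candidates, judge), True


def toy_judge(token: str) -> Tuple[float, float]:
    """Stand-in for a PRM: hard-coded scores for demonstration only."""
    fairness = {"he": 0.2, "she": 0.2, "they": 0.9}.get(token, 0.5)
    fluency = {"he": 0.9, "she": 0.9, "they": 0.8}.get(token, 0.5)
    return fairness, fluency


if __name__ == "__main__":
    print(guarded_pick(["he", "she", "they"], toy_judge))
    print(guarded_pick(["doctor", "nurse"], toy_judge))
```

In this toy setup the judge is consulted only for the first call, which is the intuition behind gating: tokens that do not trip the guard pass through at plain generation cost.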
