Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He

arXiv:2603.2101680.31 citations · h-index: 2 · Has Code

Predicted impact: top 58% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This work targets selection bias in LLMs, a problem relevant to both researchers and practitioners, and offers an incremental improvement over existing debiasing methods.

The paper tackles selection bias in large language models for evaluation tasks by proposing PA-GRPO, which enforces permutation-consistent reasoning and outperforms baselines across seven benchmarks, reducing bias while maintaining high performance.

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
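The two mechanisms described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `bonus` weight, the binary correctness reward, the majority-agreement form of the consistency term, and the standard-deviation normalization (borrowed from common GRPO practice) are all assumptions made for illustration. The abstract only states that advantages are computed relative to the mean reward over all permutations and that consistent decisions are rewarded.

```python
from collections import Counter
from statistics import mean, pstdev


def canonical_choice(perm, picked_position):
    """Map the position picked in a permuted prompt back to the original option index.

    perm[i] is the original index of the option shown at position i.
    """
    return perm[picked_position]


def consistency_aware_rewards(perms, picked_positions, gold_index, bonus=0.5):
    """Hypothetical reward: correctness plus a bonus for agreeing with other permutations.

    The agreement term is the fraction of permutations in the group whose
    answer points at the same underlying option (an assumed form).
    """
    choices = [canonical_choice(p, pos) for p, pos in zip(perms, picked_positions)]
    counts = Counter(choices)
    n = len(choices)
    rewards = []
    for choice in choices:
        correct = 1.0 if choice == gold_index else 0.0
        agreement = counts[choice] / n
        rewards.append(correct + bonus * agreement)
    return rewards


def cross_permutation_advantages(rewards, eps=1e-6):
    """Advantage of each response relative to the mean reward over the whole
    permutation group. Dividing by the standard deviation is an assumption
    taken from standard GRPO, not something the abstract specifies.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question with three options; the correct option has original index 0.
perms = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]   # three orderings of the same options
picked_positions = [0, 1, 1]                # position the model chose in each prompt
rewards = consistency_aware_rewards(perms, picked_positions, gold_index=0)
advantages = cross_permutation_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the first two permutations resolve to the same (correct) option and receive positive advantages, while the third resolves to a different option and receives a negative one, so the update pushes against answers that change with option order.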
