Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?
This work addresses the problem of modeling human annotation variation for NLP applications sensitive to subjectivity and ambiguity, highlighting risks in replacing human annotators with reasoning LLMs.
The study investigates whether reasoning methods such as RLVR help large language models capture human annotator disagreement. Across 60 experimental setups and 3 tasks, RLVR-style reasoning degrades performance, while naive Chain-of-Thought reasoning improves it for RLHF LLMs.
Variation in human annotation (i.e., annotator disagreement) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically test each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of LLMs trained with RLHF (Reinforcement Learning from Human Feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
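To make "disagreement modeling" concrete, the sketch below compares the label distribution from a handful of human annotators against a distribution predicted by a model. It is an illustration only: the abstract does not state which metric the study uses, so total variation distance is an assumption here, and the label names, annotations, and model prediction are hypothetical.

```python
from collections import Counter


def label_distribution(annotations, labels):
    """Convert per-annotator labels into a probability distribution over `labels`."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[label] / total for label in labels]


def total_variation(p, q):
    """Total variation distance between two distributions over the same labels.

    Assumed metric for illustration; 0 means identical, 1 means disjoint.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical example: four annotators label one sample.
labels = ["offensive", "not_offensive"]
human = label_distribution(
    ["offensive", "offensive", "not_offensive", "offensive"], labels
)

# Hypothetical model output expressed as a distribution over the same labels.
model = [0.5, 0.5]

distance = total_variation(human, model)
print(f"human={human}, model={model}, distance={distance:.2f}")
```

A model that always predicts the majority label with full confidence would score poorly on such a measure whenever annotators genuinely disagree, which is the kind of information loss the abstract warns about.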