BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
This work addresses the challenge of adapting test-time scaling to domains without verifiably correct answers, such as tasks with annotation disagreements. Its contribution is incremental: the main finding is that existing methods transfer only to a limited extent.
The paper applies test-time scaling techniques to tasks with annotation disagreements, specifically the LeWiDi-2025 tasks, and finds that the two benchmark methods (Model Averaging and Majority Voting) improve LLM performance, whereas the Best-of-N method does not.
Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.
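To make the three methods concrete, the following is a minimal sketch of how each one aggregates N sampled outputs. It is not the authors' implementation. It assumes that a sample is either a predicted probability distribution over labels (for Model Averaging) or a single discrete answer (for Majority Voting), and that Best-of-N picks the sample that a scoring function rates highest. The function names, the input shapes, and the `score` callable are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T")


def model_averaging(samples: Sequence[Sequence[float]]) -> list[float]:
    """Element-wise mean of N sampled label distributions (assumed equal length)."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def majority_voting(samples: Sequence[Hashable]) -> Hashable:
    """Most frequent discrete answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def best_of_n(samples: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the single sample that the (hypothetical) scorer rates highest."""
    return max(samples, key=score)


if __name__ == "__main__":
    # Three sampled soft labels over two classes.
    print(model_averaging([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]))
    # Four sampled discrete answers.
    print(majority_voting(["yes", "no", "yes", "yes"]))
    # Toy scorer: prefer the longest candidate string.
    print(best_of_n(["a", "bbb", "cc"], score=len))
```

The first two functions combine all N samples into one prediction, while `best_of_n` discards every sample except one, so its result depends entirely on the quality of the scorer.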