CL, AI | Oct 14, 2024

Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

AI2, UW
arXiv:2410.11055v1 | 7 citations | h-index: 30 | Has Code | ICLR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of expanding LLM capabilities in low-annotation settings, but it is an incremental advance because it builds on existing preference optimization methods.

The paper tackles the problem of aligning large language models (LLMs) using only wrong answers. It shows that LLMs can generate reliable preferences among wrong options, performing up to 20.9% better than random guessing, and that alignment with these preferences helps models produce less-wrong and sometimes outright correct answers while improving calibration.

In the absence of abundant reliable annotations for challenging tasks and contexts, how can we expand the frontier of LLM capabilities with potentially wrong answers? We focus on two research questions: (1) Can LLMs generate reliable preferences among wrong options? And if so, (2) Would alignment with such wrong-over-wrong preferences be helpful? We employ methods based on self-consistency, token probabilities, and LLM-as-a-judge to elicit wrong-over-wrong preferences, and fine-tune language models with preference optimization approaches using these synthesized preferences. Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while overall improving model calibration.
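The elicitation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes some proxy score per wrong answer (for example, one derived from self-consistency, token probabilities, or an LLM judge), and the names `build_pairs`, `pairwise_accuracy`, `judge_scores`, and `margin` are hypothetical. It turns scored wrong answers into (chosen, rejected) pairs and measures how often those pairs agree with a ground-truth notion of "less wrong", which is the quantity compared against the 50% random-guess baseline.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def build_pairs(
    answers: List[str],
    score: Callable[[str], float],
    margin: float = 0.0,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs of wrong answers.

    `score` is a hypothetical proxy for "less wrong" (higher is better).
    Pairs whose score gap is at most `margin` are dropped as too uncertain.
    """
    pairs = []
    for a, b in combinations(answers, 2):
        sa, sb = score(a), score(b)
        if abs(sa - sb) <= margin:
            continue  # skip near-ties
        pairs.append((a, b) if sa > sb else (b, a))
    return pairs


def pairwise_accuracy(
    pairs: List[Tuple[str, str]],
    true_quality: Callable[[str], float],
) -> float:
    """Fraction of pairs where the chosen answer is truly less wrong."""
    if not pairs:
        return 0.0
    correct = sum(true_quality(c) > true_quality(r) for c, r in pairs)
    return correct / len(pairs)


# Toy example: the gold answer is 100, and every candidate is wrong.
wrong_answers = ["90", "40", "120"]

# Hypothetical judge scores; in practice these would come from a model.
judge_scores = {"90": 0.8, "40": 0.1, "120": 0.5}

# Ground-truth proxy for wrongness: negative distance to the gold answer.
def closeness(ans: str) -> float:
    return -abs(int(ans) - 100)

pairs = build_pairs(wrong_answers, judge_scores.get)
accuracy = pairwise_accuracy(pairs, closeness)
filtered = build_pairs(wrong_answers, judge_scores.get, margin=0.35)
```

The resulting (chosen, rejected) pairs are the kind of input a preference optimization method consumes. The margin filter is one assumed way to discard low-confidence pairs; other filtering choices are equally plausible.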

Code Implementations: 1 repo
