AI, CL · Oct 17, 2024

Anchored Alignment for Self-Explanations Enhancement

arXiv:2410.13216v1 · 1 citation · h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of improving self-explanations in LLMs for better interpretability. The advance appears incremental, since it builds on existing alignment methods such as DPO.

The paper tackles the problem of improving large language models' ability to articulate their reasoning without annotated rationales. It introduces an alignment methodology and a novel technique, Alignment with Anchor Preference Pairs, which categorizes model outputs so that preference pairs for Direct Preference Optimization (DPO) are selected more effectively. The reported result is a significant improvement in explanation quality while maintaining accuracy.

In this work, we introduce a methodology for alignment designed to enhance the ability of large language models (LLMs) to articulate their reasoning (self-explanation) even in the absence of annotated rationale explanations. Our alignment methodology comprises three key components: explanation quality assessment, self-instruction dataset generation, and model alignment. Additionally, we present a novel technique called Alignment with Anchor Preference Pairs, which improves the selection of preference pairs by categorizing model outputs into three groups: consistently correct, consistently incorrect, and variable. By applying tailored strategies to each category, we enhance the effectiveness of Direct Preference Optimization (DPO). Our experimental results demonstrate that this approach significantly improves explanation quality while maintaining accuracy compared to other fine-tuning strategies.
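The three-way categorization described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `AnchorCategory` and `categorize`, the exact-match correctness check, and the idea of sampling several answers per prompt are assumptions made for illustration. The abstract does not spell out the tailored strategies applied to each category, so the sketch stops at the grouping step.

```python
from enum import Enum
from typing import Sequence


class AnchorCategory(Enum):
    """Hypothetical labels for the three groups named in the abstract."""
    CONSISTENTLY_CORRECT = "consistently_correct"
    CONSISTENTLY_INCORRECT = "consistently_incorrect"
    VARIABLE = "variable"


def categorize(sampled_answers: Sequence[str], gold_answer: str) -> AnchorCategory:
    """Group a prompt by how often its sampled answers match the gold answer.

    Assumption: correctness is a case-insensitive exact match; the paper
    may judge correctness differently.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    gold = gold_answer.strip().lower()
    correct = [answer.strip().lower() == gold for answer in sampled_answers]
    if all(correct):
        return AnchorCategory.CONSISTENTLY_CORRECT
    if not any(correct):
        return AnchorCategory.CONSISTENTLY_INCORRECT
    return AnchorCategory.VARIABLE


if __name__ == "__main__":
    # Toy samples for three prompts, each with gold answer "B".
    print(categorize(["B", "b", "B"], "B").value)
    print(categorize(["A", "C", "A"], "B").value)
    print(categorize(["B", "A", "B"], "B").value)
```

A downstream step would then build preference pairs differently for each group before running DPO; how that is done is left open here because the abstract does not say.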

