CL / LG, Aug 31, 2024

Does Alignment Tuning Really Break LLMs' Internal Confidence?

arXiv:2409.00352v2
1 citation
h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses the problem of unreliable LLM confidence in real-world applications, but its contribution is incremental: it confirms and refines prior observations of calibration degradation.

The study investigates whether alignment tuning harms LLM calibration and finds that, under stricter analysis conditions, alignment consistently degrades calibration, which points to a trade-off between alignment and reliability.

Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
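
To make "calibration error" concrete, here is a minimal sketch of one widely used metric, expected calibration error (ECE). The abstract does not say which metrics the paper uses, so ECE is an assumption chosen for illustration. The function name, the equal-width binning, and the default of 10 bins are also illustrative choices, not the authors' setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: floats in [0, 1], the model's confidence in each answer.
    correct:     booleans (or 0/1), whether each answer was right.
    n_bins:      number of equal-width confidence bins (assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width bins; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(hit)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its share.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0; a model that reports 0.9 confidence but is right only 75% of the time scores about 0.15. Comparing this value for a base model and its alignment-tuned counterpart on the same task is one way to quantify the degradation the abstract describes.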

Code Implementations: 1 repo