CL · AI · Apr 7

What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say "I Don't Know"

arXiv:2604.05779 · 23.3
AI Analysis

The paper addresses knowledge misalignment in LLMs, which matters to users who need reliable responses. The contribution is incremental, since it builds on existing fine-tuning techniques.

The paper targets hallucinations in large language models. It introduces a fine-tuning method that uses instance-level knowledge scores to scale the learning signal and to encourage explicit "I don't know" responses for out-of-scope queries. The reported result is improved uncertainty expression with accuracy maintained on answerable questions.

While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model's existing knowledge, while encouraging explicit "I don't know" responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.
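The abstract does not spell out how the knowledge score is computed or how it scales the learning signal, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the score is the fraction of sampled answers that exactly match the reference answer, that an instance counts as unknown when the score is at or below a threshold, that unknown instances are retargeted to "I don't know" with weight 1.0, and that known instances keep their reference answer with the score as the loss weight. The names `knowledge_score`, `build_example`, `weighted_mean_loss`, and the `threshold` parameter are all hypothetical.

```python
from typing import List, Tuple

IDK = "I don't know"  # refusal target for out-of-scope queries


def knowledge_score(samples: List[str], gold: str) -> float:
    """Fraction of sampled answers matching the gold answer.

    Assumption: case-insensitive exact match stands in for whatever
    correctness check the paper actually uses.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    target = gold.strip().lower()
    hits = sum(s.strip().lower() == target for s in samples)
    return hits / len(samples)


def build_example(
    question: str, gold: str, samples: List[str], threshold: float = 0.0
) -> Tuple[str, str, float]:
    """Return (question, training target, loss weight).

    Assumption: score <= threshold means "unknown", so the target becomes
    IDK with weight 1.0; otherwise the gold answer is kept and the score
    itself is used as the weight.
    """
    score = knowledge_score(samples, gold)
    if score <= threshold:
        return (question, IDK, 1.0)
    return (question, gold, score)


def weighted_mean_loss(losses: List[float], weights: List[float]) -> float:
    """Weighted average of per-example losses (hypothetical aggregation)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(l * w for l, w in zip(losses, weights)) / total
```

In a real training loop, `samples` would come from drawing several completions from the model at a nonzero temperature, and the per-example weight would multiply that example's token-level loss before averaging.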
