Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models
This work addresses how to assess the knowledge of language models accurately, a concern for both researchers and practitioners, though its contribution is incremental because it focuses on evaluation methodology.
The paper argues that overly strict evaluation underestimates the factual knowledge of language models, and shows that Retrieval-Constrained Decoding (RCD) improves F1 scores, e.g., Llama-3.1-70B from 32.3% to 46.0%.
Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
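To make the idea of restricting outputs to a fixed set of surface forms concrete, the following is a minimal sketch of constrained greedy decoding over a prefix tree. It is an illustration under stated assumptions, not the authors' implementation: the trie representation, the greedy selection rule, the `END` marker, and the `toy_score` function (a stand-in for LM token scores) are all hypothetical, and tokens are whole words for readability.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker (assumption)


def build_trie(surface_forms: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix tree; END marks a complete surface form."""
    root: Dict = {}
    for tokens in surface_forms:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_greedy_decode(
    trie: Dict, score: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding where each step may only choose a token the trie allows."""
    if not trie:
        raise ValueError("the set of allowed surface forms is empty")
    output: List[str] = []
    node = trie
    while True:
        allowed = list(node.keys())  # tokens that keep the output inside the allowed set
        best = max(allowed, key=lambda tok: score(output, tok))
        if best == END:
            return output
        output.append(best)
        node = node[best]


def toy_score(prefix: List[str], token: str) -> float:
    """Stand-in for LM scores (assumption): unconstrained, this 'model' prefers 'NYC'."""
    prefs = {"NYC": 3.0, "New": 2.0, "York": 1.0, "City": 0.5, END: 0.0}
    return prefs.get(token, -1.0)


# Allowed surface forms: one form per entity, as word-level token sequences.
forms = [["New", "York", "City"], ["Paris"], ["London"]]
trie = build_trie(forms)
answer = constrained_greedy_decode(trie, toy_score)
print(" ".join(answer))
```

In this toy setting the scorer would emit the alias "NYC" if left unconstrained, and a strict exact-match evaluation would mark that answer wrong. Because "NYC" is not in the trie, the constrained decoder is steered to an allowed form that exact matching can credit. A real system would operate on the model's subword vocabulary and next-token logits rather than word-level tokens and fixed scores.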