Self-Consistency Boosts Calibration for Math Reasoning
This work addresses calibration issues for developers using LLMs in math reasoning, though it appears incremental as it builds on existing self-consistency techniques.
The paper improves calibration for large language models on math reasoning tasks by designing three off-the-shelf methods based on self-consistency. On benchmarks such as GSM8K and MathQA, with models such as Mistral and LLaMA2, these methods bridge model confidence and accuracy better than existing methods.
Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluated on two popular benchmarks (GSM8K and MathQA) using strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model confidence and accuracy than existing methods based on p(True) (Kadavath et al., 2022) or logit (Kadavath et al., 2022).
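To make the idea concrete, the following is a minimal sketch of how a confidence score might be derived from self-consistency samples and then checked against accuracy. It is not the three methods designed in this work, whose details are not given in this abstract. It assumes the simplest possible signal, the fraction of sampled answers that agree with the majority answer, and it assumes expected calibration error (ECE) as the calibration measure. The function names `self_consistency_confidence` and `expected_calibration_error` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def self_consistency_confidence(answers):
    """Return (majority_answer, agreement_ratio) for sampled final answers.

    Assumption: confidence is the share of samples that match the majority
    answer. This is one simple self-consistency signal, not necessarily any
    of the three methods proposed in the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Compute ECE with equal-width confidence bins over [0, 1].

    Assumption: ECE is used here as a generic calibration measure; the
    paper may report different metrics.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Five hypothetical sampled answers for one math problem.
    sampled = ["18", "18", "20", "18", "17"]
    answer, confidence = self_consistency_confidence(sampled)
    print(answer, confidence)
```

In this sketch, a well-calibrated method would yield a low ECE: problems assigned an agreement ratio near 0.6 would be answered correctly about 60% of the time.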