Estimating the Self-Consistency of LLMs
This is an incremental analysis for researchers optimizing LLM reliability through repeated sampling.
This paper tackles the problem of estimating the self-consistency of large language models (LLMs) to improve reliability. It finds that, under a fixed compute budget, the number of sampled prompts and the number of repeated calls per prompt should each grow roughly in proportion to the square root of the budget.
Systems often send the same prompt to a large language model (LLM) several times and aggregate the responses to improve reliability. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget $B=mn$, where $m$ is the number of prompts sampled from the task distribution and $n$ is the number of repeated LLM calls per prompt. The resulting analysis favors a rough split $m,n\propto\sqrt{B}$.
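To make the split concrete, the following is a minimal sketch of how a budget $B$ might be divided so that $m$ and $n$ both scale as $\sqrt{B}$. The function name `split_budget` and the `ratio` parameter (the target value of $m/n$) are hypothetical and introduced here for illustration only. The note specifies proportionality, not the constant, so `ratio=1.0` is an assumption rather than a recommended value.

```python
import math


def split_budget(budget: int, ratio: float = 1.0) -> tuple[int, int]:
    """Split a call budget B into (m, n) with m * n <= B.

    m: number of prompts, n: repeated calls per prompt.
    `ratio` is an assumed target for m / n; the source only states
    that m and n are each proportional to sqrt(B).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    # Solving m * n = B with m = ratio * n gives n = sqrt(B / ratio).
    n = math.isqrt(int(budget / ratio))
    # Keep n within [1, budget] so that at least one prompt fits.
    n = min(budget, max(1, n))
    # Spend the remaining budget on prompts, never exceeding B calls.
    m = max(1, budget // n)
    return m, n


if __name__ == "__main__":
    for b in (100, 1000, 10_000):
        m, n = split_budget(b)
        print(f"B={b}: m={m}, n={n}, calls used={m * n}")
```

With `ratio=1.0`, a budget of 100 calls yields 10 prompts with 10 calls each. Budgets that are not perfect squares leave a small remainder unused, because both quantities must be integers.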