Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations
The paper addresses the misalignment between statistical testing and user needs when evaluating differences in LLM responses, though the contribution is incremental, since it builds on existing hypothesis testing frameworks.
The paper tackles the problem of testing whether two input queries induce the same response distribution in a large language model, a setting where traditional tests can be misled by semantically irrelevant perturbations to the query. It proposes a new test that incorporates a collection of semantically similar queries and proves that the test is asymptotically valid and consistent for binary responses.
Given an input query, generative models such as large language models produce a random response drawn from a response distribution. Given two input queries, it is natural to ask if their response distributions are the same. While traditional statistical hypothesis testing is designed to address this question, the response distribution induced by an input query is often sensitive to semantically irrelevant perturbations to the query, so much so that a traditional test of equality might indicate that two semantically equivalent queries induce statistically different response distributions. As a result, the outcome of the statistical test may not align with the user's requirements. In this paper, we address this misalignment by incorporating into the testing procedure consideration of a collection of semantically similar queries. In our setting, the mapping from the collection of user-defined semantically similar queries to the corresponding collection of response distributions is not known a priori and must be estimated, with a fixed budget. Although the problem we address is quite general, we focus our analysis on the setting where the responses are binary, show that the proposed test is asymptotically valid and consistent, and discuss important practical considerations with respect to power and computation.
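To make the contrast in the abstract concrete, the following is a minimal sketch for binary responses. It is not the paper's procedure. It assumes the composite null can be approximated by a finite, user-supplied set of paraphrases, and it combines ordinary pooled two-proportion z-tests by taking the largest p-value, so the null is rejected only if the target query differs from every paraphrase. The function names and all counts are hypothetical and chosen purely for illustration.

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        # Both samples are all-0 or all-1: no evidence of a difference.
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def composite_null_p_value(similar_successes, n_per_query, target_successes, n_target):
    """Largest p-value over the collection of semantically similar queries.

    Assumption (not from the paper): the composite null is the finite set of
    response distributions of the listed paraphrases, so it is rejected only
    when the target differs significantly from every one of them.
    """
    return max(
        two_proportion_p_value(s, n_per_query, target_successes, n_target)
        for s in similar_successes
    )


# Hypothetical counts of "yes" responses out of 200 samples per query.
n = 200
original_yes = 60                # the user's original query
paraphrase_yes = [60, 75, 82]    # original query plus two paraphrases
target_yes = 80                  # the second query being compared

# Traditional test of equality: original query versus target query only.
p_simple = two_proportion_p_value(original_yes, n, target_yes, n)

# Composite-null variant: target versus the whole collection of paraphrases.
p_composite = composite_null_p_value(paraphrase_yes, n, target_yes, n)

alpha = 0.05
print(f"simple test rejects:    {p_simple < alpha}")
print(f"composite test rejects: {p_composite < alpha}")
```

With these illustrative counts, the simple test flags a difference between the original and target queries, while the composite variant does not, because one paraphrase of the original query yields nearly the same response rate as the target. Under a fixed budget, the number of paraphrases and the number of samples per paraphrase trade off against each other, which is one way to read the abstract's remarks on power and computation.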