STAIME Sep 13, 2025

Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations

arXiv:2509.10963v1
1 citation
h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses the misalignment between statistical testing and user needs when evaluating differences in LLM responses. The advance is incremental, since it builds on existing hypothesis testing frameworks.

The paper tackles the problem of testing whether two input queries produce the same response distribution from large language models, where traditional tests can be misled by semantically irrelevant perturbations. It proposes a new test that incorporates collections of semantically similar queries, proving it is asymptotically valid and consistent for binary responses.

Given an input query, generative models such as large language models produce a random response drawn from a response distribution. Given two input queries, it is natural to ask if their response distributions are the same. While traditional statistical hypothesis testing is designed to address this question, the response distribution induced by an input query is often sensitive to semantically irrelevant perturbations to the query, so much so that a traditional test of equality might indicate that two semantically equivalent queries induce statistically different response distributions. As a result, the outcome of the statistical test may not align with the user's requirements. In this paper, we address this misalignment by incorporating into the testing procedure consideration of a collection of semantically similar queries. In our setting, the mapping from the collection of user-defined semantically similar queries to the corresponding collection of response distributions is not known a priori and must be estimated, with a fixed budget. Although the problem we address is quite general, we focus our analysis on the setting where the responses are binary, show that the proposed test is asymptotically valid and consistent, and discuss important practical considerations with respect to power and computation.
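
To illustrate the misalignment described above, here is a minimal sketch of a traditional test of equality for binary responses: a pooled two-proportion z-test. This is a standard textbook test, not the procedure proposed in the paper. The function name `two_proportion_z_test` and all response counts are hypothetical, chosen only to show that a small gap between two paraphrases of the same query goes undetected at small sample sizes but is flagged as significant once enough responses are sampled.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled two-proportion z-test. Returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled estimate of the common success probability under the null.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        # All responses identical in both samples: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts of "yes" responses for a query and a paraphrase of it.
# The observed rates (0.52 vs 0.48) are the same at both sample sizes.
_, p_small = two_proportion_z_test(52, 100, 48, 100)
_, p_large = two_proportion_z_test(5200, 10000, 4800, 10000)

alpha = 0.05
print(f"n=100 per query:   p={p_small:.3g}, reject={p_small < alpha}")
print(f"n=10000 per query: p={p_large:.3g}, reject={p_large < alpha}")
```

With the larger sample, the test rejects equality even though a user would regard the two queries as equivalent. This is the behavior that motivates replacing the simple null of exact equality with a composite null built from a collection of semantically similar queries.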

