Model Equality Testing: Which Model Is This API Serving?
This work is relevant to users of black-box LLM APIs who need to verify that the model an endpoint serves matches the one advertised, since providers may modify it without disclosure.
The paper tackles the problem of detecting undisclosed changes to models served behind black-box language model APIs, such as quantization, watermarking, or fine-tuning, by formalizing Model Equality Testing as a two-sample statistical test. It finds that a test based on the Maximum Mean Discrepancy with a simple string kernel achieves a median power of 77.4% using an average of 10 samples per prompt, and that 11 out of 31 commercial Llama API endpoints serve different distributions than the reference weights released by Meta.
Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
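The two-sample test described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract says only that the kernel is "a simple string kernel", so the position-wise character-match kernel below (`hamming_kernel`), the fixed comparison length of 20 characters, the biased MMD estimator, and the permutation procedure for the p-value are all assumptions chosen for illustration. The sketch also handles a single prompt, whereas the paper works with samples collected across prompts.

```python
import numpy as np


def hamming_kernel(a: str, b: str, length: int = 20, pad: str = "\0") -> float:
    """Fraction of positions at which two strings agree (hypothetical kernel).

    Both strings are truncated or padded to `length` characters first,
    so matching padding positions also count as agreement.
    """
    a = a[:length].ljust(length, pad)
    b = b[:length].ljust(length, pad)
    return sum(ca == cb for ca, cb in zip(a, b)) / length


def gram_matrix(samples):
    """Pairwise kernel values for a list of strings."""
    n = len(samples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = hamming_kernel(samples[i], samples[j])
    return K


def mmd2(K, idx_x, idx_y):
    """Biased (V-statistic) estimate of squared MMD from a pooled Gram matrix."""
    k_xx = K[np.ix_(idx_x, idx_x)].mean()
    k_yy = K[np.ix_(idx_y, idx_y)].mean()
    k_xy = K[np.ix_(idx_x, idx_y)].mean()
    return k_xx + k_yy - 2.0 * k_xy


def permutation_test(api_samples, ref_samples, n_perm=500, seed=0):
    """Return (observed MMD^2, permutation p-value) for two sets of strings."""
    rng = np.random.default_rng(seed)
    pooled = list(api_samples) + list(ref_samples)
    K = gram_matrix(pooled)
    n = len(api_samples)
    idx = np.arange(len(pooled))
    observed = mmd2(K, idx[:n], idx[n:])
    # Count relabelings whose statistic is at least as large as the observed one.
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(idx)
        if mmd2(K, perm[:n], perm[n:]) >= observed - 1e-12:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


# Toy usage: completions from a hypothetical API versus a reference model.
api = ["The cat sat.", "The cat ran.", "A cat sat."] * 4
ref = ["The dog sat.", "The dog ran.", "A dog sat."] * 4
stat, p_value = permutation_test(api, ref)
```

A user would reject the hypothesis that the API and the reference serve the same distribution when `p_value` falls below a chosen significance level. Computing the Gram matrix once on the pooled samples and reindexing it for each permutation avoids recomputing kernel values.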