Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
This paper addresses a critical issue for researchers and developers who rely on shadow APIs for LLM access. It exposes deceptive practices that harm reproducibility and user interests, although the auditing methods themselves are only incrementally novel.
The paper examines shadow APIs that claim to provide access to frontier LLMs such as GPT-5 and Gemini-2.5. A systematic audit reveals performance divergence of up to 47.21%, unpredictable safety behavior, and identity verification failures in 45.83% of fingerprint tests, all of which undermine research reliability.
Access to frontier large language models (LLMs), such as GPT-5 and Gemini-2.5, is often hindered by high pricing, payment barriers, and regional restrictions. These limitations drive the proliferation of $\textit{shadow APIs}$: third-party services that claim to offer indirect access to official model services without regional limitations. Despite their widespread use, it remains unclear whether shadow APIs deliver outputs consistent with those of the official APIs, which raises concerns about the reliability of downstream applications and the validity of research findings that depend on them. In this paper, we present the first systematic audit comparing official LLM APIs with their corresponding shadow APIs. We first identify 17 shadow APIs that have been used in 187 academic papers, with the most popular one reaching 5,966 citations and 58,639 GitHub stars as of December 6, 2025. Through multidimensional auditing of three representative shadow APIs across utility, safety, and model verification, we uncover both indirect and direct evidence of deceptive practices in shadow APIs. Specifically, we reveal performance divergence of up to $47.21\%$, significant unpredictability in safety behaviors, and identity verification failures in $45.83\%$ of fingerprint tests. These deceptive practices critically undermine the reproducibility and validity of scientific research, harm the interests of shadow API users, and damage the reputation of official model providers.
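
To make the two headline metrics concrete, the following is a minimal sketch of how a performance divergence and a fingerprint failure rate could be computed. The abstract does not define either metric precisely, so this sketch assumes that divergence is the absolute gap between two benchmark scores in percentage points, and that each fingerprint test yields a simple pass/fail outcome. The function names and the sample inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def divergence_points(official_score: float, shadow_score: float) -> float:
    """Absolute gap between two benchmark scores, in percentage points.

    Assumption: both scores are already expressed as percentages (0-100).
    The paper may use a different definition (for example, a relative gap).
    """
    return abs(official_score - shadow_score)


def failure_rate_percent(passed: Sequence[bool]) -> float:
    """Share of fingerprint tests that failed, as a percentage.

    Assumption: each entry is True when the shadow API's response matched
    the fingerprint expected from the claimed model, False otherwise.
    """
    if not passed:
        raise ValueError("at least one test result is required")
    failures = sum(1 for ok in passed if not ok)
    return 100.0 * failures / len(passed)


if __name__ == "__main__":
    # Illustrative, made-up inputs; not results from the audit.
    print(divergence_points(official_score=80.0, shadow_score=60.0))
    print(failure_rate_percent([True, False, False, True]))
```

Reporting the gap in percentage points keeps the number comparable across benchmarks with different baselines, but a relative gap would be an equally plausible reading of "divergence".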