CL · Jun 11, 2025

When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

arXiv:2506.10095v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses an overlooked dimension of model evaluation: stability under rephrasing, which matters to users who rely on LLMs for consistent quality of service. The advance is incremental, since it builds on existing concerns about model behavior.

The paper tackles the problem of large language models giving inconsistent responses to semantically equivalent prompts and proposes a diagnostic framework, Prompt-Based Semantic Shift (PBSS), to measure this behavioral drift. The results show consistent, model-specific response shifts across ten tasks and suggest a link to tokenization and decoding dynamics.

We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.
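To make the idea of measuring drift across rewordings concrete, here is a minimal Python sketch. It is not the PBSS implementation: the abstract does not say how responses are compared, so a Jaccard distance over lowercase word sets is assumed here as a simple stand-in, and the names `jaccard_distance`, `drift_matrix`, and `mean_drift` are hypothetical. The input is a list of responses that one model gave to semantically equivalent prompts.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Assumed stand-in metric: 1 - |A ∩ B| / |A ∪ B| over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


def drift_matrix(responses: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise distances between responses."""
    n = len(responses)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = jaccard_distance(responses[i], responses[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def mean_drift(responses: list[str]) -> float:
    """Average pairwise distance; 0.0 means every response used the same words."""
    pairs = list(combinations(range(len(responses)), 2))
    if not pairs:
        return 0.0
    matrix = drift_matrix(responses)
    return sum(matrix[i][j] for i, j in pairs) / len(pairs)
```

A higher `mean_drift` for one model than for another on the same set of rewordings would indicate that its outputs vary more under rephrasing. An embedding-based distance could be swapped in for `jaccard_distance` to compare meaning rather than surface wording.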

Code Implementations: 1 repo
