LG · CL · May 28

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

arXiv:2605.304488.1

Predicted impact: top 82% in LG · last 90 days
Originality: Highly original

AI Analysis

This work addresses the problem of inadequate evaluation metrics for black-box LLM distillation, providing a more rigorous framework for researchers and practitioners to assess the fidelity of student models to their teachers.

This paper introduces a new metric, bounded behavioral indistinguishability, to evaluate black-box LLM distillation beyond mere output similarity. LoRA distillation improved semantic similarity from 0.788 to 0.862 for Qwen and from 0.814 to 0.874 for Llama, yet adversarial evaluation revealed persistent behavioral differences: under a pairwise teacher-identification adversary, Qwen's distinguishing advantage fell from 0.158 for the base student to 0.081 after LoRA distillation, but did not vanish.

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(\varepsilon, q, t, \mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $\varepsilon$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5{,}000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
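
To make the pairwise teacher-identification measurement concrete, here is a minimal Python sketch of how a distinguishing advantage with A/B-swap consistency filtering might be computed. The abstract does not give the exact formulas, so several things here are assumptions: the advantage is taken as the standard $|2 \cdot \text{accuracy} - 1|$, a judgment is kept only if the judge picks the same response in both presentation orders, and the names `Judge` and `swap_consistent_advantage` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# given (prompt, response_a, response_b), return "A" or "B" for the
# response the judge believes was produced by the teacher.
Judge = Callable[[str, str, str], str]


def swap_consistent_advantage(
    triples: Sequence[Tuple[str, str, str]],
    judge: Judge,
) -> Tuple[float, int]:
    """Estimate distinguishing advantage from pairwise judgments.

    Each triple is (prompt, teacher_response, student_response).
    Returns (advantage, kept), where `kept` is the number of prompts
    that survived the swap-consistency filter.
    """
    correct = 0
    kept = 0
    for prompt, teacher_out, student_out in triples:
        # Query the judge twice, with the teacher in slot A, then slot B.
        picked_teacher_first = judge(prompt, teacher_out, student_out) == "A"
        picked_teacher_second = judge(prompt, student_out, teacher_out) == "B"
        # Assumed filter: drop judgments that flip with presentation order,
        # since they reflect position bias rather than a real signal.
        if picked_teacher_first != picked_teacher_second:
            continue
        kept += 1
        correct += int(picked_teacher_first)
    if kept == 0:
        return 0.0, 0
    accuracy = correct / kept
    # Assumed definition: 0 means chance-level guessing, 1 means the
    # judge always separates teacher from student.
    return abs(2.0 * accuracy - 1.0), kept
```

Under these assumptions, a judge with no usable signal yields an advantage near 0, while a purely position-biased judge (one that always answers "A") has every judgment filtered out. In the notation of the abstract, each prompt here costs two judge calls, which is one way a query budget $q$ could be counted.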
