RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
This work addresses the challenge of efficient and stable attribution for closed-source LLMs. The contribution is incremental: it builds on existing perturbation-based methods.
The paper tackles the problem of explaining closed-source Large Language Model outputs by introducing RoPE-LIME, which uses a smaller open-source surrogate to compute token-level attributions, combined with a locality kernel and an efficient sampling strategy. The authors report more informative attributions than leave-one-out sampling, improvements over gSMILE, and fewer closed-model API calls.
Explaining closed-source Large Language Model (LLM) outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce \textbf{Rotary Positional Embedding Linear Local Interpretable Model-agnostic Explanations (RoPE-LIME)}, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover's Distance computed in \textbf{RoPE embedding space} for stable similarity under masking, and (ii) \textbf{Sparse-$K$} sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.
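To make the locality kernel in (i) concrete, the following is a minimal sketch of a Relaxed Word Mover's Distance (RWMD) between an original input and a perturbed one, turned into a sample weight. It is an illustration under stated assumptions, not the paper's implementation: the two-dimensional vectors stand in for RoPE-space token embeddings, masking is modeled as dropping tokens, token mass is uniform, and the exponential kernel with its `width` parameter is a common LIME-style choice assumed here. All function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_rwmd(src, dst):
    """Relaxed transport cost from src to dst.

    Each source token sends all of its mass (uniform, 1/len(src)) to its
    nearest destination token. Assumes both sets are non-empty.
    """
    return sum(min(euclidean(s, d) for d in dst) for s in src) / len(src)

def rwmd(a, b):
    """RWMD: the tighter (larger) of the two one-sided relaxations."""
    return max(one_sided_rwmd(a, b), one_sided_rwmd(b, a))

def locality_weight(original, perturbed, width=1.0):
    """Exponential kernel on RWMD (assumed form): closer samples weigh more."""
    d = rwmd(original, perturbed)
    return math.exp(-(d ** 2) / (width ** 2))

# Toy stand-ins for token embeddings (assumption: real ones would come
# from the surrogate model's RoPE embedding space).
original = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
light_mask = [[0.0, 0.0], [1.0, 0.0]]  # one token dropped
heavy_mask = [[0.0, 0.0]]              # two tokens dropped

w_same = locality_weight(original, original)
w_light = locality_weight(original, light_mask)
w_heavy = locality_weight(original, heavy_mask)
print(w_same, w_light, w_heavy)
```

In this toy setup the weight decreases as more tokens are dropped, so heavily perturbed samples contribute less when a local surrogate is fitted. How the paper handles masked positions and kernel width may differ from this sketch.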