LG | May 17, 2025

LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models

Ryan Chen, Youngmin Ko, Zeyu Zhang, Catherine Cho, Sunny Chung, Mauro Giuffré, Dennis L. Shung, Bradly C. Stadie

arXiv:2505.11772v24.11 citations | h-index: 13

Originality: Highly original

AI Analysis

LAMP provides a practical framework for auditing proprietary language models: it assesses whether a model's behavior is consistent with its self-reported explanations, which addresses a transparency gap in AI.

The authors address the problem of understanding black-box language models' decision-making by introducing LAMP, a method that extracts locally linear decision surfaces from LLM world models. They find that many LLMs exhibit locally linear decision landscapes and that the extracted surfaces correlate with human judgments of explanation quality.

We introduce LAMP (Linear Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably the model maps its stated reasons to its predictions. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links the weights of those explanations to the model's output. By doing so, it reveals which stated factors steer the model's decisions, and by how much. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many LLMs exhibit locally linear decision landscapes. In addition, these surfaces correlate with human judgments of explanation quality and, on a clinical case-file data set, align with expert assessments. Because LAMP requires no access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models and for assessing whether a model behaves consistently with the explanations it provides.
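To make the idea of a locally linear surrogate concrete, here is a minimal sketch, not the authors' implementation. It assumes we already have, for a set of perturbed inputs near one query, the factor weights the model reported and the scalar prediction it produced. The names `fit_local_surrogate` and `toy_model`, the three factors, and the perturbation range are all hypothetical, and `toy_model` is a synthetic stand-in for a black-box LLM. The sketch fits ordinary least squares and reports R² as a rough indicator of how linear the local surface is.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; collect more varied samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_local_surrogate(weights, outputs):
    """Least-squares fit of outputs ~ intercept + sum(coef * weight).

    Returns (intercept, coefficients, r_squared).
    """
    rows = [[1.0] + list(w) for w in weights]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(outputs) / len(outputs)
    ss_res = sum((y - p) ** 2 for y, p in zip(outputs, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in outputs)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return beta[0], beta[1:], r2


def toy_model(w):
    """Hypothetical stand-in for a black-box model's scalar prediction."""
    return 0.2 + 0.7 * w[0] - 0.3 * w[1] + 0.0 * w[2]


# Assumed setup: self-reported weights for three factors, jittered locally.
rng = random.Random(0)
base = [0.5, 0.4, 0.1]
samples = [[b + rng.uniform(-0.05, 0.05) for b in base] for _ in range(40)]
scores = [toy_model(w) for w in samples]

intercept, coef, r2 = fit_local_surrogate(samples, scores)
print("coefficients:", [round(c, 3) for c in coef], "R^2:", round(r2, 3))
```

In this toy setting the fitted coefficients indicate which stated factors move the prediction and by how much, while an R² near 1 suggests the local surface is well approximated by a plane. With a real model the weights and predictions would come from repeated queries, and a low R² would signal that the stated factors do not linearly account for the output. The solver uses only the standard library to stay self-contained; in practice `numpy.linalg.lstsq` would be the more usual choice.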
