CL · Apr 7

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

arXiv:2605.28825

Predicted impact: top 83% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For AI safety researchers, MechELK provides a method to detect deceptive alignment and extract truthful knowledge from LLMs even when surface outputs are incorrect.

MechELK is a framework that elicits latent knowledge from LLMs by locating, verifying, and surfacing hidden knowledge, achieving 84.7% average elicitation accuracy, outperforming CCS by 6.2% and linear probing by 9.1%.

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrast-Consistent Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate: using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify: employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit: applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.
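
To make the Locate / Verify / Elicit stages more concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the abstract does not specify one, and everything here is an assumption made for illustration. The "hidden states" are synthetic, a difference-of-means probe stands in for SAE feature analysis and activation patching, a shuffled-label control stands in for the much stronger causal probing the abstract describes, and adding a scaled direction to the activations stands in for representation engineering. All function names (make_toy_activations, fit_probe, probe_accuracy, steer) are hypothetical.

```python
import numpy as np


def make_toy_activations(n=400, dim=16, seed=0):
    """Synthetic 'hidden states' in which axis 0 encodes a binary truth label."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    truth_axis = np.zeros(dim)
    truth_axis[0] = 1.0
    signs = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + 3.0 * signs[:, None] * truth_axis
    return acts, labels, truth_axis


def fit_probe(acts, labels):
    """Locate (toy stand-in): unit difference-of-means direction plus a threshold."""
    mean_pos = acts[labels == 1].mean(axis=0)
    mean_neg = acts[labels == 0].mean(axis=0)
    direction = mean_pos - mean_neg
    direction /= np.linalg.norm(direction)
    threshold = 0.5 * (mean_pos + mean_neg) @ direction
    return direction, threshold


def probe_accuracy(acts, labels, direction, threshold):
    """Fraction of examples whose projection falls on the correct side."""
    preds = (acts @ direction > threshold).astype(int)
    return float((preds == labels).mean())


def steer(acts, direction, alpha):
    """Elicit (toy stand-in): shift activations along a direction; no weights change."""
    return acts + alpha * direction


acts, labels, truth_axis = make_toy_activations()
train, test = slice(0, 200), slice(200, 400)

# Locate: fit the direction on the training half, score it on the held-out half.
direction, threshold = fit_probe(acts[train], labels[train])
held_out_acc = probe_accuracy(acts[test], labels[test], direction, threshold)

# Verify (weak control only): a probe fit to shuffled labels should not generalize.
shuffled = np.random.default_rng(1).permutation(labels)
ctrl_dir, ctrl_thr = fit_probe(acts[train], shuffled[train])
control_acc = probe_accuracy(acts[test], shuffled[test], ctrl_dir, ctrl_thr)

# Elicit: steer held-out label-0 examples and read them out along the toy truth axis.
negatives = acts[test][labels[test] == 0]
before = float((negatives @ truth_axis > 0).mean())
after = float((steer(negatives, direction, 6.0) @ truth_axis > 0).mean())

print(f"held-out probe accuracy: {held_out_acc:.3f}")
print(f"shuffled-label control:  {control_acc:.3f}")
print(f"positive readout before/after steering: {before:.3f} / {after:.3f}")
```

In this toy setting the real probe should generalize to held-out data while the shuffled-label control stays near chance, and steering should flip the readout for most label-0 examples. None of these numbers relate to the results reported in the abstract; the steering strength of 6.0 is an arbitrary choice matched to the synthetic class separation.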
