Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
For developers and users of LLM-based systems that hide reasoning traces, this work demonstrates that such hiding is ineffective against simple prompting attacks, raising security and privacy concerns.
The paper investigates whether hiding internal reasoning traces from users of LLM systems prevents those users from extracting useful reasoning supervision via prompting. The authors propose Reasoning Exposure Prompting (REP), which uses shadow-model demonstrations in code-like formats to elicit hidden traces, and show that it substantially increases similarity between exposed and internal traces while preserving reasoning signals.
Reasoning traces have become a valuable learning signal for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. Because of this value for capability transfer, many deployed reasoning-model systems hide raw internal traces and expose at most summaries and answers to users. We therefore ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to elicit user-visible reasoning traces from a victim model. Across a common reasoning dataset, different victim models, and different student-model distillation settings, REP substantially increases the similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.
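To make the idea of code-like in-context demonstrations concrete, the following is a minimal sketch of how such a prompt might be assembled. It is an illustration under stated assumptions, not the paper's actual template: the `Demo` class, the `wrap_code_like` and `build_rep_prompt` helpers, and the `def solve():` wrapper with the trace written as comments are all hypothetical choices made here. The sketch only builds the prompt string; sending it to a victim model is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One shadow-model demonstration (hypothetical structure)."""
    question: str
    trace: str   # reasoning trace produced by the shadow model
    answer: str


def wrap_code_like(question: str, trace: str = "", answer: str = "") -> str:
    """Render one example in an assumed code-like wrapper.

    The trace is written as comment lines inside a function body. When no
    trace is given, a placeholder is left for the victim model to fill in.
    """
    trace_lines = [f"    # {line}" for line in trace.splitlines() if line.strip()]
    body = "\n".join(trace_lines) if trace_lines else "    # <reasoning goes here>"
    ret = f"    return {answer!r}" if answer else "    return ..."
    return f'def solve():\n    """{question}"""\n{body}\n{ret}'


def build_rep_prompt(demos: List[Demo], target_question: str) -> str:
    """Concatenate wrapped demonstrations, then the unanswered target question."""
    parts = [wrap_code_like(d.question, d.trace, d.answer) for d in demos]
    parts.append(wrap_code_like(target_question))
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [Demo("What is 2 + 2?", "Add the two numbers.\n2 + 2 = 4", "4")]
    print(build_rep_prompt(demos, "What is 3 + 5?"))
```

In this sketch the demonstrations come first and the target question is rendered last with an empty body, so that a model continuing the pattern would be nudged to write its reasoning into the visible output. Whether this particular wrapper elicits traces from any given model is not something the sketch establishes.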